mbox series

[0/4] hw: make TCO watchdog actually work by default for Q35

Message ID 20221031131934.425448-1-berrange@redhat.com
Headers show
Series hw: make TCO watchdog actually work by default for Q35 | expand

Message

Daniel P. Berrangé Oct. 31, 2022, 1:19 p.m. UTC
The TCO watchdog is unconditionally integrated into the Q35 machine
type by default, but at the same time is unconditionally disabled
from firing by a host config option that overrides guest OS attempts
to enable it. People have to know to set a magic -global to make
it non-broken

IOW we're exposing a broken watchdog by default to all Q35 machines,
but which to the guest OS & its apps looks fully functional :-(

This behaviour was set in response to feedback from Michael:

  https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg07128.html

    "I think sample high is a safer default."

but as explained in the commit message in the last patch, I think the
watchdog defaults were already safe without that pin strap setting.
The guest OS needs to take explicit action to clear the guest visible
'no reboot' flag, and so we don't need a second guest hidden 'no reboot'
flag to override that choice IMHO. Am I missing something ?

NB, I'm toggling this for 7.2 machine type since that's the current
git latest machine. Since this has already been "broken" for 7 years
though, I am ambivalent about whether we try todo this for 7.2, vs
just wait until the 8.0 machine types arrive.

Daniel P. Berrangé (4):
  hw/acpi: add trace events for TCO watchdog register access
  hw/isa: add trace events for ICH9 LPC chip config access
  hw/watchdog: add trace events for watchdog action handling
  hw/isa: enable TCO watchdog reboot pin strap by default

 hw/acpi/tco.c            | 41 +++++++++++++++++++++++++++-------------
 hw/acpi/trace-events     |  2 ++
 hw/i386/pc.c             |  4 +++-
 hw/isa/lpc_ich9.c        |  5 ++++-
 hw/isa/trace-events      |  4 ++++
 hw/watchdog/trace-events |  4 ++++
 hw/watchdog/watchdog.c   |  4 ++++
 tests/qtest/tco-test.c   |  2 +-
 8 files changed, 50 insertions(+), 16 deletions(-)

Comments

Daniel P. Berrangé Oct. 31, 2022, 1:50 p.m. UTC | #1
On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:
> The TCO watchdog is unconditionally integrated into the Q35 machine
> type by default, but at the same time is unconditionally disabled
> from firing by a host config option that overrides guest OS attempts
> to enable it. People have to know to set a magic -global to make
> it non-broken

Incidentally I found that originally the TCO watchdog was not
unconditionally enabled. Its exposure to the guest could be
turned on/off using

  -global ICH9-LPC.enable_tco=bool

This was implemented for machine type compat, but it also gave
apps a way to disable the watchdog functionality. Unfortunately
that ability was discarded in this series:

  https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/

but the 'enable_tco' property still exists in QOM, but silently
ignored.

Seems we should either fix the impl of 'enable_tco', or remove the
QOM property entirely, so we don't pretend it can be toggled anymore.

With regards,
Daniel
Michael S. Tsirkin Oct. 31, 2022, 3:48 p.m. UTC | #2
On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote:
> On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:
> > The TCO watchdog is unconditionally integrated into the Q35 machine
> > type by default, but at the same time is unconditionally disabled
> > from firing by a host config option that overrides guest OS attempts
> > to enable it. People have to know to set a magic -global to make
> > it non-broken
> 
> Incidentally I found that originally the TCO watchdog was not
> unconditionally enabled. Its exposure to the guest could be
> turned on/off using
> 
>   -global ICH9-LPC.enable_tco=bool
> 
> This was implemented for machine type compat, but it also gave
> apps a way to disable the watchdog functionality. Unfortunately
> that ability was discarded in this series:
> 
>   https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/
> 
> but the 'enable_tco' property still exists in QOM, but silently
> ignored.
> 
> Seems we should either fix the impl of 'enable_tco', or remove the
> QOM property entirely, so we don't pretend it can be toggled anymore.
> 
> With regards,
> Daniel

i am inclined to say you are right and the fix is to fix the impl.

> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Igor Mammedov Nov. 1, 2022, 12:57 p.m. UTC | #3
On Mon, 31 Oct 2022 11:48:58 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:  
> > > The TCO watchdog is unconditionally integrated into the Q35 machine
> > > type by default, but at the same time is unconditionally disabled
> > > from firing by a host config option that overrides guest OS attempts
> > > to enable it. People have to know to set a magic -global to make
> > > it non-broken  
> > 
> > Incidentally I found that originally the TCO watchdog was not
> > unconditionally enabled. Its exposure to the guest could be
> > turned on/off using
> > 
> >   -global ICH9-LPC.enable_tco=bool
> > 
> > This was implemented for machine type compat, but it also gave
> > apps a way to disable the watchdog functionality. Unfortunately
> > that ability was discarded in this series:
> > 
> >   https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/
> > 
> > but the 'enable_tco' property still exists in QOM, but silently
> > ignored.
> > 
> > Seems we should either fix the impl of 'enable_tco', or remove the
> > QOM property entirely, so we don't pretend it can be toggled anymore.
> > 
> > With regards,
> > Daniel  
> 
> i am inclined to say you are right and the fix is to fix the impl.

Is there need for users to disable whatchdog at all?
It was always present since then and no one complained, 
so perhaps we should ditch property instead fixing it
to keep it simple.

> 
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|  
>
Daniel P. Berrangé Nov. 1, 2022, 1:03 p.m. UTC | #4
On Tue, Nov 01, 2022 at 01:57:24PM +0100, Igor Mammedov wrote:
> On Mon, 31 Oct 2022 11:48:58 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:  
> > > > The TCO watchdog is unconditionally integrated into the Q35 machine
> > > > type by default, but at the same time is unconditionally disabled
> > > > from firing by a host config option that overrides guest OS attempts
> > > > to enable it. People have to know to set a magic -global to make
> > > > it non-broken  
> > > 
> > > Incidentally I found that originally the TCO watchdog was not
> > > unconditionally enabled. Its exposure to the guest could be
> > > turned on/off using
> > > 
> > >   -global ICH9-LPC.enable_tco=bool
> > > 
> > > This was implemented for machine type compat, but it also gave
> > > apps a way to disable the watchdog functionality. Unfortunately
> > > that ability was discarded in this series:
> > > 
> > >   https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/
> > > 
> > > but the 'enable_tco' property still exists in QOM, but silently
> > > ignored.
> > > 
> > > Seems we should either fix the impl of 'enable_tco', or remove the
> > > QOM property entirely, so we don't pretend it can be toggled anymore.
> > > 
> > > With regards,
> > > Daniel  
> > 
> > i am inclined to say you are right and the fix is to fix the impl.
> 
> Is there need for users to disable whatchdog at all?
> It was always present since then and no one complained, 
> so perhaps we should ditch property instead fixing it
> to keep it simple.

Thinking about it more, I think we should NOT fix the 'enable_tco' property,
because there will be no way for a mgmt appp to tell if they're using a
fixed or broken QEMU. So if they use 'enable_tco' on a broken QEMU and then
live migrate, they'll get an guest ABI change. If we did want to support
disabling it, then we should have a brand new property that apps can probe
for.

In the absence of a request to disable watchdog, I'd say we just delete
'enable_tco' right now. If someone wants it in future, we can add it with
a new name.

With regards,
Daniel
Michael S. Tsirkin Nov. 10, 2022, 4:29 p.m. UTC | #5
On Tue, Nov 01, 2022 at 01:03:17PM +0000, Daniel P. Berrangé wrote:
> On Tue, Nov 01, 2022 at 01:57:24PM +0100, Igor Mammedov wrote:
> > On Mon, 31 Oct 2022 11:48:58 -0400
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:  
> > > > > The TCO watchdog is unconditionally integrated into the Q35 machine
> > > > > type by default, but at the same time is unconditionally disabled
> > > > > from firing by a host config option that overrides guest OS attempts
> > > > > to enable it. People have to know to set a magic -global to make
> > > > > it non-broken  
> > > > 
> > > > Incidentally I found that originally the TCO watchdog was not
> > > > unconditionally enabled. Its exposure to the guest could be
> > > > turned on/off using
> > > > 
> > > >   -global ICH9-LPC.enable_tco=bool
> > > > 
> > > > This was implemented for machine type compat, but it also gave
> > > > apps a way to disable the watchdog functionality. Unfortunately
> > > > that ability was discarded in this series:
> > > > 
> > > >   https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/
> > > > 
> > > > but the 'enable_tco' property still exists in QOM, but silently
> > > > ignored.
> > > > 
> > > > Seems we should either fix the impl of 'enable_tco', or remove the
> > > > QOM property entirely, so we don't pretend it can be toggled anymore.
> > > > 
> > > > With regards,
> > > > Daniel  
> > > 
> > > i am inclined to say you are right and the fix is to fix the impl.
> > 
> > Is there need for users to disable whatchdog at all?
> > It was always present since then and no one complained, 
> > so perhaps we should ditch property instead fixing it
> > to keep it simple.
> 
> Thinking about it more, I think we should NOT fix the 'enable_tco' property,
> because there will be no way for a mgmt appp to tell if they're using a
> fixed or broken QEMU.

This is always the case for any bug. We don't as a rule add properties
for this, it is distro's responsibility to pick bugfixes for
features users care about.

What makes this bug different?


> So if they use 'enable_tco' on a broken QEMU and then
> live migrate, they'll get an guest ABI change. If we did want to support
> disabling it, then we should have a brand new property that apps can probe
> for.
> 
> In the absence of a request to disable watchdog, I'd say we just delete
> 'enable_tco' right now. If someone wants it in future, we can add it with
> a new name.
> 
> With regards,
> Daniel

I am really stressed about watchdog firing and resetting VMs that previously
were working fine. Adding this without a safety mechanism to quickly
disable in case of problems in the field is not wise imho.

> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Michael S. Tsirkin Nov. 10, 2022, 4:30 p.m. UTC | #6
On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:
> The TCO watchdog is unconditionally integrated into the Q35 machine
> type by default, but at the same time is unconditionally disabled
> from firing by a host config option that overrides guest OS attempts
> to enable it. People have to know to set a magic -global to make
> it non-broken
> 
> IOW we're exposing a broken watchdog by default to all Q35 machines,
> but which to the guest OS & its apps looks fully functional :-(
> 
> This behaviour was set in response to feedback from Michael:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg07128.html
> 
>     "I think sample high is a safer default."
> 
> but as explained in the commit message in the last patch, I think the
> watchdog defaults were already safe without that pin strap setting.
> The guest OS needs to take explicit action to clear the guest visible
> 'no reboot' flag, and so we don't need a second guest hidden 'no reboot'
> flag to override that choice IMHO. Am I missing something ?
> 
> NB, I'm toggling this for 7.2 machine type since that's the current
> git latest machine. Since this has already been "broken" for 7 years
> though, I am ambivalent about whether we try todo this for 7.2, vs
> just wait until the 8.0 machine types arrive.


So I expect v2 with minor issues fixed and hopefully a
fixed property (we can debate removing it down the road
if we want).


> Daniel P. Berrangé (4):
>   hw/acpi: add trace events for TCO watchdog register access
>   hw/isa: add trace events for ICH9 LPC chip config access
>   hw/watchdog: add trace events for watchdog action handling
>   hw/isa: enable TCO watchdog reboot pin strap by default
> 
>  hw/acpi/tco.c            | 41 +++++++++++++++++++++++++++-------------
>  hw/acpi/trace-events     |  2 ++
>  hw/i386/pc.c             |  4 +++-
>  hw/isa/lpc_ich9.c        |  5 ++++-
>  hw/isa/trace-events      |  4 ++++
>  hw/watchdog/trace-events |  4 ++++
>  hw/watchdog/watchdog.c   |  4 ++++
>  tests/qtest/tco-test.c   |  2 +-
>  8 files changed, 50 insertions(+), 16 deletions(-)
> 
> -- 
> 2.37.3
Daniel P. Berrangé Nov. 10, 2022, 6:21 p.m. UTC | #7
On Thu, Nov 10, 2022 at 11:29:21AM -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 01, 2022 at 01:03:17PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Nov 01, 2022 at 01:57:24PM +0100, Igor Mammedov wrote:
> > > On Mon, 31 Oct 2022 11:48:58 -0400
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Mon, Oct 31, 2022 at 01:50:24PM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Oct 31, 2022 at 01:19:30PM +0000, Daniel P. Berrangé wrote:  
> > > > > > The TCO watchdog is unconditionally integrated into the Q35 machine
> > > > > > type by default, but at the same time is unconditionally disabled
> > > > > > from firing by a host config option that overrides guest OS attempts
> > > > > > to enable it. People have to know to set a magic -global to make
> > > > > > it non-broken  
> > > > > 
> > > > > Incidentally I found that originally the TCO watchdog was not
> > > > > unconditionally enabled. Its exposure to the guest could be
> > > > > turned on/off using
> > > > > 
> > > > >   -global ICH9-LPC.enable_tco=bool
> > > > > 
> > > > > This was implemented for machine type compat, but it also gave
> > > > > apps a way to disable the watchdog functionality. Unfortunately
> > > > > that ability was discarded in this series:
> > > > > 
> > > > >   https://lore.kernel.org/all/1453564933-29638-1-git-send-email-ehabkost@redhat.com/
> > > > > 
> > > > > but the 'enable_tco' property still exists in QOM, but silently
> > > > > ignored.
> > > > > 
> > > > > Seems we should either fix the impl of 'enable_tco', or remove the
> > > > > QOM property entirely, so we don't pretend it can be toggled anymore.
> > > > > 
> > > > > With regards,
> > > > > Daniel  
> > > > 
> > > > i am inclined to say you are right and the fix is to fix the impl.
> > > 
> > > Is there need for users to disable whatchdog at all?
> > > It was always present since then and no one complained, 
> > > so perhaps we should ditch property instead fixing it
> > > to keep it simple.
> > 
> > Thinking about it more, I think we should NOT fix the 'enable_tco' property,
> > because there will be no way for a mgmt appp to tell if they're using a
> > fixed or broken QEMU.
> 
> This is always the case for any bug. We don't as a rule add properties
> for this, it is distro's responsibility to pick bugfixes for
> features users care about.
> 
> What makes this bug different?

It is impossible to safely use the enable_tco property because it
will break ABI across live migrations between old and new QEMU.
If it was a recent bug I'd be ok with just fixing it. Since this
property has been broken for 6 years though, IMHO it just needs
to be deleted entirely.

> > So if they use 'enable_tco' on a broken QEMU and then
> > live migrate, they'll get an guest ABI change. If we did want to support
> > disabling it, then we should have a brand new property that apps can probe
> > for.
> > 
> > In the absence of a request to disable watchdog, I'd say we just delete
> > 'enable_tco' right now. If someone wants it in future, we can add it with
> > a new name.
> 
> I am really stressed about watchdog firing and resetting VMs that previously
> were working fine. Adding this without a safety mechanism to quickly
> disable in case of problems in the field is not wise imho.

We would only toggle the ICH9-LPC.noreboot flag for new machine types, so
wouldn't affect existing VMs, and if anyone has a problem they can explicitly
set ICH9-LPC.noreboot back to true, or pick the older machine type again.

With regards,
Daniel