[v2] opal/xstop: Use nvram option to enable/disable sw checkstop.

Message ID 151362060219.27708.7450373707409056345.stgit@jupiter.in.ibm.com
State Accepted

Commit Message

Mahesh J Salgaonkar Dec. 18, 2017, 6:11 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Add a mechanism to enable/disable the sw checkstop by looking at the nvram
option opal-sw-xstop=<enable/disable>.

For now, on p9 this patch disables the sw checkstop trigger unless it is
explicitly enabled through the nvram option 'opal-sw-xstop=enable'. This gives
the host kernel an opportunity to reach the panic path or xmon on unrecoverable
HMIs or MCEs, so that the issue can be debugged effectively.

To enable sw checkstop in opal, issue the following command:

# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable

NOTE: This is a workaround patch to disable sw checkstop by default to gain
control in host kernel for better checkstop debugging. Once we have most of
the checkstop issues stabilized/resolved, revisit this patch to enable sw
checkstop by default.

On the p8 platform it remains enabled by default unless explicitly disabled.

To disable sw checkstop on p8, issue the following command:

# nvram -p ibm,skiboot --update-config opal-sw-xstop=disable

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
---
Change in v2:
   - Add pr_log to indicate that sw checkstop was disabled.
---
 hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
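
The per-platform defaults described in the commit message can be summarised as a small decision function. This is only a sketch for illustration, not the skiboot code: the enum `proc_gen_sketch` and the function `sw_xstop_disabled()` are hypothetical names, and the nvram lookup is replaced by a plain string argument (NULL meaning the option is unset).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for skiboot's processor generation values. */
enum proc_gen_sketch { GEN_P8, GEN_P9 };

/*
 * Return true when the sw checkstop should NOT be triggered.
 * 'value' models the opal-sw-xstop nvram setting, or NULL when unset.
 */
static bool sw_xstop_disabled(enum proc_gen_sketch gen, const char *value)
{
	if (gen == GEN_P8) {
		/* p8: enabled by default, disabled only on request. */
		return value != NULL && strcmp(value, "disable") == 0;
	}

	/* p9 (and anything else): disabled unless explicitly enabled. */
	return !(value != NULL && strcmp(value, "enable") == 0);
}
```

With no nvram setting at all, the function reports the checkstop as disabled on p9 and enabled on p8, which matches the defaults stated above.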

Comments

Stewart Smith Jan. 15, 2018, 6:42 a.m. | #1
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> Add a mechanism to enable/disable sw checkstop by looking at nvram option
> opal-sw-xstop=<enable/disable>.
>
> For now this patch disables the sw checkstop trigger unless explicitly
> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
> an opportunity to get host kernel in panic path or xmon for unrecoverable
> HMIs or MCE, to be able to debug the issue effectively.
>
> To enable sw checkstop in opal issue following command:
>
> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>
> NOTE: This is a workaround patch to disable sw checkstop by default to gain
> control in host kernel for better checkstop debugging. Once we have most of
> the checkstop issues stabilized/resolved, revisit this patch to enable sw
> checkstop by default.
>
> For p8 platform it will remain enabled by default unless explicitly disabled.
>
> To disable sw checkstop on p8 issue following command:
>
> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
> ---
> Change in v2:
>    - Add pr_log to indicate that sw checkstop was disabled.
> ---
>  hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)

All a bit umming-and-ahhing about the behaviour change... but this seems
to be the "easiest" for now.... and I reserve the right to change my
mind at any point :)

I think the correct solution here is to have the kernel make the
appropriate decision rather than having this workaround in OPAL.

But.. well... reality and today was checkstop heavy, so my mind kind of
changed :)

Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.

I think having the option to *disable* it is always going to be good,
but... well... I don't like that we end up in a situation where the
kernel says "everything is terrible because you told me it was terrible,
please reboot now" and then we ignore it.

The real solution is a kernel one....
Oliver Jan. 15, 2018, 6:56 a.m. | #2
On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
<stewart@linux.vnet.ibm.com> wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>> opal-sw-xstop=<enable/disable>.
>>
>> For now this patch disables the sw checkstop trigger unless explicitly
>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>> HMIs or MCE, to be able to debug the issue effectively.
>>
>> To enable sw checkstop in opal issue following command:
>>
>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>
>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>> control in host kernel for better checkstop debugging. Once we have most of
>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>> checkstop by default.
>>
>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>
>> To disable sw checkstop on p8 issue following command:
>>
>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>> ---
>> Change in v2:
>>    - Add pr_log to indicate that sw checkstop was disabled.
>> ---
>>  hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
>>  1 file changed, 32 insertions(+)
>
> All a bit umming-and-ahhing about the behaviour change... but this seems
> to be the "easiest" for now.... and I reserve the right to change my
> mind at any point :)
>
> I think the correct solution here is to have the kernel make the
> appropriate decision rather than having this workaround in OPAL.
>
> But.. well... reality and today was checkstop heavy, so my mind kind of
> changed :)
>
> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>
> I think having the option to *disable* it is always going to be good,
> but... well... I don't like that we end up in a situation where the
> kernel says "everything is terrible because you told me it was terrible,
> please reboot now" and then we ignore it.
>
> The real solution is a kernel one....

It really isn't. If we are reporting unrecoverable HMIs to the kernel
then the kernel has every right to assume the world is on fire and
request a shutdown. If we want the kernel to do something else then we
need to change what OPAL reports back to the kernel. Just disabling
the software xstop is a gross hack at best. It's not even clear that
just disabling the xstop is sufficient to keep the host up and running
since the kernel thread that initiated the shutdown isn't expecting to
return...

That said, it's a stupid debug hack so who cares.
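
The concern about the caller can be made concrete with a small model. In the patch, xscom_trigger_xstop() returns its initial `rc` (OPAL_UNSUPPORTED) instead of stopping the machine when the sw checkstop is disabled, so whoever requested the checkstop gets control back. The sketch below is hypothetical and is neither skiboot nor kernel code: `trigger_xstop_sketch()`, `handle_unrecoverable_error()`, the `next_step` enum, and the numeric return codes are all invented for illustration.

```c
#include <stdbool.h>

/* Placeholder return codes; the real values live in opal-api.h. */
enum { SKETCH_SUCCESS = 0, SKETCH_UNSUPPORTED = -7 };

/* What the (hypothetical) caller does after asking for a checkstop. */
enum next_step { STEP_HALTED, STEP_PANIC_OR_DEBUGGER };

/* Models the patched trigger: refuse when the sw checkstop is disabled. */
static int trigger_xstop_sketch(bool xstop_disabled)
{
	if (xstop_disabled)
		return SKETCH_UNSUPPORTED;
	/* On real hardware the machine would stop here. */
	return SKETCH_SUCCESS;
}

/* A caller that copes with the trigger returning instead of halting. */
static enum next_step handle_unrecoverable_error(bool xstop_disabled)
{
	if (trigger_xstop_sketch(xstop_disabled) == SKETCH_SUCCESS)
		return STEP_HALTED;

	/* The checkstop did not happen, so the caller must carry on. */
	return STEP_PANIC_OR_DEBUGGER;
}
```

A caller written on the assumption that the trigger never returns has no equivalent of the second branch, which is the gap being pointed out here.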

Stewart Smith Jan. 16, 2018, 12:22 a.m. | #3
Oliver <oohall@gmail.com> writes:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart@linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>>> opal-sw-xstop=<enable/disable>.
>>>
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>>
>>> To enable sw checkstop in opal issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>>
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>>
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>>
>>> To disable sw checkstop on p8 issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>>> ---
>>> Change in v2:
>>>    - Add pr_log to indicate that sw checkstop was disabled.
>>> ---
>>>  hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
>>>  1 file changed, 32 insertions(+)
>>
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>>
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>>
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>>
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>>
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>>
>> The real solution is a kernel one....
>
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling
> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
> return...

Yeah, that would be the better place to fix things - telling the kernel
that it did a naughty rather than the machine is borked.

I'm sure we'll figure it out sometime after I stop seeing "PRD: Hardware
problem, very low chance of software cause" that's actually a 100%
software problem.
Balbir Singh Jan. 17, 2018, 4:47 a.m. | #4
On Mon, Jan 15, 2018 at 12:26 PM, Oliver <oohall@gmail.com> wrote:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart@linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>>> opal-sw-xstop=<enable/disable>.
>>>
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>>
>>> To enable sw checkstop in opal issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>>
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>>
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>>
>>> To disable sw checkstop on p8 issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>>> ---
>>> Change in v2:
>>>    - Add pr_log to indicate that sw checkstop was disabled.
>>> ---
>>>  hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
>>>  1 file changed, 32 insertions(+)
>>
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>>
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>>
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>>
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>>
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>>
>> The real solution is a kernel one....
>
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling

The real issue is keeping backwards compat for p8 and allowing a
checkstop for machines that care to do so.


> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
> return...
>

The real goal of the patch is to log (get the context of what was
happening when we triggered the platform error). We could still dump
that on the console, but going back to the kernel lets us crash/xmon
and get more info before rebooting the box. Hostboot printing that we
got a software initiated checkstop (TI) is not useful, to be honest, and
we're seeing NPU2 devices cause HMIs and machine checks, so it's
useful to see the context at the time of the error.

> That said, it's a stupid debug hack so who cares.
>

Balbir Singh.

Patch

diff --git a/hw/xscom.c b/hw/xscom.c
index de5a27e..0501278 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -24,6 +24,7 @@ 
 #include <errorlog.h>
 #include <opal-api.h>
 #include <timebase.h>
+#include <nvram.h>
 
 /* Mask of bits to clear in HMER before an access */
 #define HMER_CLR_MASK	(~(SPR_HMER_XSCOM_FAIL | \
@@ -826,6 +827,37 @@  static void xscom_init_chip_info(struct proc_chip *chip)
 int64_t xscom_trigger_xstop(void)
 {
 	int rc = OPAL_UNSUPPORTED;
+	bool xstop_disabled = false;
+
+	/*
+	 * Workaround until we iron out all checkstop issues at present.
+	 *
+	 * For p9:
+	 * By default do not trigger sw checkstop unless explicitly enabled
+	 * through nvram option 'opal-sw-xstop=enable'.
+	 *
+	 * For p8:
+	 * Keep it enabled by default unless explicitly disabled.
+	 *
+	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
+	 * the logic to enable sw checkstop by default on p9.
+	 */
+	switch (proc_gen) {
+	case proc_gen_p8:
+		if (nvram_query_eq("opal-sw-xstop", "disable"))
+			xstop_disabled = true;
+		break;
+	case proc_gen_p9:
+	default:
+		if (!nvram_query_eq("opal-sw-xstop", "enable"))
+			xstop_disabled = true;
+		break;
+	}
+
+	if (xstop_disabled) {
+		prlog(PR_NOTICE, "Software initiated checkstop disabled.\n");
+		return rc;
+	}
 
 	if (xstop_xscom.addr)
 		rc = xscom_writeme(xstop_xscom.addr,