diff mbox series

[RFC] opal/xstop: Use nvram param to enable/disable sw checkstop.

Message ID 151326575824.4407.12557075834665962235.stgit@jupiter.in.ibm.com
State RFC
Headers show
Series [RFC] opal/xstop: Use nvram param to enable/disable sw checkstop. | expand

Commit Message

Mahesh J Salgaonkar Dec. 14, 2017, 3:36 p.m. UTC
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Add a mechanism to enable/disable sw checkstop by looking at nvram option
opal-sw-xstop=<enable/disable>.

For now this patch disables the sw checkstop trigger unless explicitly
enabled through nvram option 'opal-sw-xstop=enable'. This will allow an
opportunity to get host kernel in panic path or xmon for unrecoverable
HMIs or MCE, to be able to debug the issue effectively.

To enable sw checkstop in opal issue following command:

# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable

NOTE: This is a workaround patch to disable sw checkstop by default to gain
control in host kernel for better checkstop debugging. Once we have most of
the checkstop issues stabilized/resolved, revisit this patch to enable sw
checkstop by default.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 hw/xscom.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

ppaidipe Dec. 14, 2017, 5:15 p.m. UTC | #1
Hi Mahesh

On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> Add a mechanism to enable/disable sw checkstop by looking at nvram 
> option
> opal-sw-xstop=<enable/disable>.
> 
> For now this patch disables the sw checkstop trigger unless explicitly
> enabled through nvram option 'opal-sw-xstop=enable'. This will allow an
> opportunity to get host kernel in panic path or xmon for unrecoverable
> HMIs or MCE, to be able to debug the issue effectively.

It will be good to enable by default, and let's give control to user to 
disable
the sw-xstop using nvram option. Otherwise all test infrastructures will 
break.

Thanks
Pridhiviraj

> 
> To enable sw checkstop in opal issue following command:
> 
> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
> 
> NOTE: This is a workaround patch to disable sw checkstop by default to 
> gain
> control in host kernel for better checkstop debugging. Once we have 
> most of
> the checkstop issues stabilized/resolved, revisit this patch to enable 
> sw
> checkstop by default.
> 
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  hw/xscom.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/hw/xscom.c b/hw/xscom.c
> index de5a27e..4d3d88f 100644
> --- a/hw/xscom.c
> +++ b/hw/xscom.c
> @@ -24,6 +24,7 @@
>  #include <errorlog.h>
>  #include <opal-api.h>
>  #include <timebase.h>
> +#include <nvram.h>
> 
>  /* Mask of bits to clear in HMER before an access */
>  #define HMER_CLR_MASK	(~(SPR_HMER_XSCOM_FAIL | \
> @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void)
>  {
>  	int rc = OPAL_UNSUPPORTED;
> 
> +	/*
> +	 * Workaround until we iron out all checkstop issues at present.
> +	 *
> +	 * By deafult do not trigger sw checkstop unless explicitly enabled
> +	 * through nvram option 'opal-sw-xstop=enable'.
> +	 *
> +	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
> +	 * the logic to enable sw checkstop by default.
> +	 */
> +	if (!nvram_query_eq("opal-sw-xstop", "enable"))
> +		return rc;
> +
>  	if (xstop_xscom.addr)
>  		rc = xscom_writeme(xstop_xscom.addr,
>  				PPC_BIT(xstop_xscom.fir_bit));
> 
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
Oliver O'Halloran Dec. 14, 2017, 11:38 p.m. UTC | #2
On Fri, Dec 15, 2017 at 4:15 AM, ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote:
> Hi Mahesh
>
> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
>>
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>> opal-sw-xstop=<enable/disable>.
>>
>> For now this patch disables the sw checkstop trigger unless explicitly
>> enabled through nvram option 'opal-sw-xstop=enable'. This will allow an
>> opportunity to get host kernel in panic path or xmon for unrecoverable
>> HMIs or MCE, to be able to debug the issue effectively.
>
>
> It will be good to enable by default, and let's give control to user to
> disable
> the sw-xstop using nvram option. Otherwise all test infrastructures will
> break.

Why does the test infrastructure require SW initiated checkstops?


>
> Thanks
> Pridhiviraj
>
>
>>
>> To enable sw checkstop in opal issue following command:
>>
>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>
>> NOTE: This is a workaround patch to disable sw checkstop by default to
>> gain
>> control in host kernel for better checkstop debugging. Once we have most
>> of
>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>> checkstop by default.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  hw/xscom.c |   13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/xscom.c b/hw/xscom.c
>> index de5a27e..4d3d88f 100644
>> --- a/hw/xscom.c
>> +++ b/hw/xscom.c
>> @@ -24,6 +24,7 @@
>>  #include <errorlog.h>
>>  #include <opal-api.h>
>>  #include <timebase.h>
>> +#include <nvram.h>
>>
>>  /* Mask of bits to clear in HMER before an access */
>>  #define HMER_CLR_MASK  (~(SPR_HMER_XSCOM_FAIL | \
>> @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void)
>>  {
>>         int rc = OPAL_UNSUPPORTED;
>>
>> +       /*
>> +        * Workaround until we iron out all checkstop issues at present.
>> +        *
>> +        * By deafult do not trigger sw checkstop unless explicitly
>> enabled
>> +        * through nvram option 'opal-sw-xstop=enable'.
>> +        *
>> +        * NOTE: Once all checkstop issues are resolved/stabilized reverse
>> +        * the logic to enable sw checkstop by default.
>> +        */
>> +       if (!nvram_query_eq("opal-sw-xstop", "enable"))
>> +               return rc;
>> +
>>         if (xstop_xscom.addr)
>>                 rc = xscom_writeme(xstop_xscom.addr,
>>                                 PPC_BIT(xstop_xscom.fir_bit));
>>
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot
>
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
Balbir Singh Dec. 15, 2017, 1:02 a.m. UTC | #3
On Thu, 14 Dec 2017 21:06:09 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> Add a mechanism to enable/disable sw checkstop by looking at nvram option
> opal-sw-xstop=<enable/disable>.
> 

Thanks for doing this, we've had a bit of a mid-air collision. The approach
is the same, but the code paths and names are different.


From b209b538c9e85244eb598ba3be93c342dcf9f46c Mon Sep 17 00:00:00 2001
From: Balbir Singh <bsingharora@gmail.com>
Date: Fri, 15 Dec 2017 11:48:17 +1100
Subject: [PATCH] core/platform: Disable software initiated checkstop

Software initiated checkstops are hard to debug and becoming
more prevelant with coherent memory devices being attached
via the NPU. We do the necessary logging as before and introduce
a new nvram parameter "reboot-on-plat-error", which can be
set to true if the old behavious is desired

nvram -p ibm,skiboot --update-config reboot-on-plat-error=true

The advantage of doing this in skiboot is that this works
for existing distros and allows for better handling where
the OS and driver (for coherent memory if present) can do
a better job of logging errors and reporting broken state
more accurately.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 core/platform.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/core/platform.c b/core/platform.c
index 732f67e5..d3f75b63 100644
--- a/core/platform.c
+++ b/core/platform.c
@@ -91,7 +91,21 @@ static int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag)
 			prerror("OPAL: failed to log an error\n");
 		}
 		disable_fast_reboot("Reboot due to Platform Error");
-		return xscom_trigger_xstop();
+
+		/*
+		 * Trigger a checkstop only if reboot-on-plat-error
+		 * is set to true. We've already logged the error
+		 * above, the platform has a record of the error.
+		 * Initiating a software checkstop makes it hard
+		 * to debug issues, specifically with coherent memory
+		 * devices attached via the NPU, where any change
+		 * in link status can trigger an HMI/MCE due to several
+		 * reasons. We want to drop back to the OS and debug
+		 */
+		if (!nvram_query_eq("reboot-on-plat-error", "true"))
+			return xscom_trigger_xstop();
+		else
+			return OPAL_UNSUPPORTED;
 	case OPAL_REBOOT_FULL_IPL:
 		disable_fast_reboot("full IPL reboot requested");
 		return opal_cec_reboot();
Balbir Singh Dec. 15, 2017, 1:04 a.m. UTC | #4
On Thu, 14 Dec 2017 22:45:48 +0530
ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote:

> Hi Mahesh
> 
> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > 
> > Add a mechanism to enable/disable sw checkstop by looking at nvram 
> > option
> > opal-sw-xstop=<enable/disable>.
> > 
> > For now this patch disables the sw checkstop trigger unless explicitly
> > enabled through nvram option 'opal-sw-xstop=enable'. This will allow an
> > opportunity to get host kernel in panic path or xmon for unrecoverable
> > HMIs or MCE, to be able to debug the issue effectively.  
> 
> It will be good to enable by default, and let's give control to user to 
> disable
> the sw-xstop using nvram option. Otherwise all test infrastructures will 
> break.
>

It's the otherway around right? Lets look at what happens today

1. We get a software initiated checkstop on a WSP box
2. The BMC decides to reboot the box after a timeout
3. The BMC shows that a software initiated checkstop took place

How is this useful?

On the FSP side, we should still log our diagnostic information
before the OS reboots.

Am I missing something?

Balbir Singh.
ppaidipe Dec. 15, 2017, 7:23 a.m. UTC | #5
On 2017-12-15 05:08, Oliver wrote:
> On Fri, Dec 15, 2017 at 4:15 AM, ppaidipe <ppaidipe@linux.vnet.ibm.com> 
> wrote:
>> Hi Mahesh
>> 
>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
>>> 
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> 
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram 
>>> option
>>> opal-sw-xstop=<enable/disable>.
>>> 
>>> For now this patch disables the sw checkstop trigger unless 
>>> explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable'. This will allow 
>>> an
>>> opportunity to get host kernel in panic path or xmon for 
>>> unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>> 
>> 
>> It will be good to enable by default, and let's give control to user 
>> to
>> disable
>> the sw-xstop using nvram option. Otherwise all test infrastructures 
>> will
>> break.
> 
> Why does the test infrastructure require SW initiated checkstops?

I mean existing system test/RAS/Other FW FVT teams got issues related to 
SW checkstop
during running of tests. So it will be good, if we enable by default, we 
can at-least
won't miss any issues related SW checkstop until as long as we have 
given a way of
disabling it using nvram option.

> 
> 
>> 
>> Thanks
>> Pridhiviraj
>> 
>> 
>>> 
>>> To enable sw checkstop in opal issue following command:
>>> 
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>> 
>>> NOTE: This is a workaround patch to disable sw checkstop by default 
>>> to
>>> gain
>>> control in host kernel for better checkstop debugging. Once we have 
>>> most
>>> of
>>> the checkstop issues stabilized/resolved, revisit this patch to 
>>> enable sw
>>> checkstop by default.
>>> 
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> ---
>>>  hw/xscom.c |   13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>> 
>>> diff --git a/hw/xscom.c b/hw/xscom.c
>>> index de5a27e..4d3d88f 100644
>>> --- a/hw/xscom.c
>>> +++ b/hw/xscom.c
>>> @@ -24,6 +24,7 @@
>>>  #include <errorlog.h>
>>>  #include <opal-api.h>
>>>  #include <timebase.h>
>>> +#include <nvram.h>
>>> 
>>>  /* Mask of bits to clear in HMER before an access */
>>>  #define HMER_CLR_MASK  (~(SPR_HMER_XSCOM_FAIL | \
>>> @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void)
>>>  {
>>>         int rc = OPAL_UNSUPPORTED;
>>> 
>>> +       /*
>>> +        * Workaround until we iron out all checkstop issues at 
>>> present.
>>> +        *
>>> +        * By deafult do not trigger sw checkstop unless explicitly
>>> enabled
>>> +        * through nvram option 'opal-sw-xstop=enable'.
>>> +        *
>>> +        * NOTE: Once all checkstop issues are resolved/stabilized 
>>> reverse
>>> +        * the logic to enable sw checkstop by default.
>>> +        */
>>> +       if (!nvram_query_eq("opal-sw-xstop", "enable"))
>>> +               return rc;
>>> +
>>>         if (xstop_xscom.addr)
>>>                 rc = xscom_writeme(xstop_xscom.addr,
>>>                                 PPC_BIT(xstop_xscom.fir_bit));
>>> 
>>> _______________________________________________
>>> Skiboot mailing list
>>> Skiboot@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/skiboot
>> 
>> 
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot
ppaidipe Dec. 15, 2017, 7:35 a.m. UTC | #6
On 2017-12-15 06:34, Balbir Singh wrote:
> On Thu, 14 Dec 2017 22:45:48 +0530
> ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote:
> 
>> Hi Mahesh
>> 
>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
>> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> >
>> > Add a mechanism to enable/disable sw checkstop by looking at nvram
>> > option
>> > opal-sw-xstop=<enable/disable>.
>> >
>> > For now this patch disables the sw checkstop trigger unless explicitly
>> > enabled through nvram option 'opal-sw-xstop=enable'. This will allow an
>> > opportunity to get host kernel in panic path or xmon for unrecoverable
>> > HMIs or MCE, to be able to debug the issue effectively.
>> 
>> It will be good to enable by default, and let's give control to user 
>> to
>> disable
>> the sw-xstop using nvram option. Otherwise all test infrastructures 
>> will
>> break.
>> 
> 
> It's the otherway around right? Lets look at what happens today
> 
> 1. We get a software initiated checkstop on a WSP box
> 2. The BMC decides to reboot the box after a timeout
> 3. The BMC shows that a software initiated checkstop took place
> 
> How is this useful?

you are right it is of less useful in case of BMC systems, but still we 
can stop
the auto reboot after checkstop and use existing pdbg tools to collect 
require debug
data(Not sure how much useful it can capture).


> 
> On the FSP side, we should still log our diagnostic information
> before the OS reboots.
> 

Correct.

> Am I missing something?
> 

And it disables for all P8 platforms as well. So atleast we can enable 
it there.

> Balbir Singh.
Mahesh J Salgaonkar Dec. 15, 2017, 4:36 p.m. UTC | #7
On 12/15/2017 01:05 PM, ppaidipe wrote:
> On 2017-12-15 06:34, Balbir Singh wrote:
>> On Thu, 14 Dec 2017 22:45:48 +0530
>> ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote:
>>
>>> Hi Mahesh
>>>
>>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
>>> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> >
>>> > Add a mechanism to enable/disable sw checkstop by looking at nvram
>>> > option
>>> > opal-sw-xstop=<enable/disable>.
>>> >
>>> > For now this patch disables the sw checkstop trigger unless explicitly
>>> > enabled through nvram option 'opal-sw-xstop=enable'. This will
>>> allow an
>>> > opportunity to get host kernel in panic path or xmon for unrecoverable
>>> > HMIs or MCE, to be able to debug the issue effectively.
>>>
>>> It will be good to enable by default, and let's give control to user to
>>> disable
>>> the sw-xstop using nvram option. Otherwise all test infrastructures will
>>> break.
>>>
>>
>> It's the otherway around right? Lets look at what happens today
>>
>> 1. We get a software initiated checkstop on a WSP box
>> 2. The BMC decides to reboot the box after a timeout
>> 3. The BMC shows that a software initiated checkstop took place
>>
>> How is this useful?
> 
> you are right it is of less useful in case of BMC systems, but still we
> can stop
> the auto reboot after checkstop and use existing pdbg tools to collect
> require debug
> data(Not sure how much useful it can capture).
> 
> 
>>
>> On the FSP side, we should still log our diagnostic information
>> before the OS reboots.
>>
> 
> Correct.
> 
>> Am I missing something?
>>
> 
> And it disables for all P8 platforms as well. So atleast we can enable
> it there.

Hmm.. We can add p9 check and keep the p8 behaviour as is. Would that help ?

Thanks,
-Mahesh.

> 
>> Balbir Singh.
ppaidipe Dec. 16, 2017, 2:35 p.m. UTC | #8
On 2017-12-15 22:06, Mahesh Jagannath Salgaonkar wrote:
> On 12/15/2017 01:05 PM, ppaidipe wrote:
>> On 2017-12-15 06:34, Balbir Singh wrote:
>>> On Thu, 14 Dec 2017 22:45:48 +0530
>>> ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote:
>>> 
>>>> Hi Mahesh
>>>> 
>>>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote:
>>>> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>> >
>>>> > Add a mechanism to enable/disable sw checkstop by looking at nvram
>>>> > option
>>>> > opal-sw-xstop=<enable/disable>.
>>>> >
>>>> > For now this patch disables the sw checkstop trigger unless explicitly
>>>> > enabled through nvram option 'opal-sw-xstop=enable'. This will
>>>> allow an
>>>> > opportunity to get host kernel in panic path or xmon for unrecoverable
>>>> > HMIs or MCE, to be able to debug the issue effectively.
>>>> 
>>>> It will be good to enable by default, and let's give control to user 
>>>> to
>>>> disable
>>>> the sw-xstop using nvram option. Otherwise all test infrastructures 
>>>> will
>>>> break.
>>>> 
>>> 
>>> It's the otherway around right? Lets look at what happens today
>>> 
>>> 1. We get a software initiated checkstop on a WSP box
>>> 2. The BMC decides to reboot the box after a timeout
>>> 3. The BMC shows that a software initiated checkstop took place
>>> 
>>> How is this useful?
>> 
>> you are right it is of less useful in case of BMC systems, but still 
>> we
>> can stop
>> the auto reboot after checkstop and use existing pdbg tools to collect
>> require debug
>> data(Not sure how much useful it can capture).
>> 
>> 
>>> 
>>> On the FSP side, we should still log our diagnostic information
>>> before the OS reboots.
>>> 
>> 
>> Correct.
>> 
>>> Am I missing something?
>>> 
>> 
>> And it disables for all P8 platforms as well. So atleast we can enable
>> it there.
> 
> Hmm.. We can add p9 check and keep the p8 behaviour as is. Would that 
> help ?
> 

yes, that will be good.

> Thanks,
> -Mahesh.
> 
>> 
>>> Balbir Singh.
diff mbox series

Patch

diff --git a/hw/xscom.c b/hw/xscom.c
index de5a27e..4d3d88f 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -24,6 +24,7 @@ 
 #include <errorlog.h>
 #include <opal-api.h>
 #include <timebase.h>
+#include <nvram.h>
 
 /* Mask of bits to clear in HMER before an access */
 #define HMER_CLR_MASK	(~(SPR_HMER_XSCOM_FAIL | \
@@ -827,6 +828,18 @@  int64_t xscom_trigger_xstop(void)
 {
 	int rc = OPAL_UNSUPPORTED;
 
+	/*
+	 * Workaround until we iron out all checkstop issues at present.
+	 *
+	 * By deafult do not trigger sw checkstop unless explicitly enabled
+	 * through nvram option 'opal-sw-xstop=enable'.
+	 *
+	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
+	 * the logic to enable sw checkstop by default.
+	 */
+	if (!nvram_query_eq("opal-sw-xstop", "enable"))
+		return rc;
+
 	if (xstop_xscom.addr)
 		rc = xscom_writeme(xstop_xscom.addr,
 				PPC_BIT(xstop_xscom.fir_bit));