Message ID | 151326575824.4407.12557075834665962235.stgit@jupiter.in.ibm.com |
---|---|
State | RFC |
Headers | show |
Series | [RFC] opal/xstop: Use nvram param to enable/disable sw checkstop. | expand |
Hi Mahesh On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > > Add a mechanism to enable/disable sw checkstop by looking at nvram > option > opal-sw-xstop=<enable/disable>. > > For now this patch disables the sw checkstop trigger unless explicitly > enabled through nvram option 'opal-sw-xstop=enable'. This will allow an > opportunity to get host kernel in panic path or xmon for unrecoverable > HMIs or MCE, to be able to debug the issue effectively. It will be good to enable by default, and let's give control to user to disable the sw-xstop using nvram option. Otherwise all test infrastructures will break. Thanks Pridhiviraj > > To enable sw checkstop in opal issue following command: > > # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable > > NOTE: This is a workaround patch to disable sw checkstop by default to > gain > control in host kernel for better checkstop debugging. Once we have > most of > the checkstop issues stabilized/resolved, revisit this patch to enable > sw > checkstop by default. > > Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > --- > hw/xscom.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/hw/xscom.c b/hw/xscom.c > index de5a27e..4d3d88f 100644 > --- a/hw/xscom.c > +++ b/hw/xscom.c > @@ -24,6 +24,7 @@ > #include <errorlog.h> > #include <opal-api.h> > #include <timebase.h> > +#include <nvram.h> > > /* Mask of bits to clear in HMER before an access */ > #define HMER_CLR_MASK (~(SPR_HMER_XSCOM_FAIL | \ > @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void) > { > int rc = OPAL_UNSUPPORTED; > > + /* > + * Workaround until we iron out all checkstop issues at present. > + * > + * By deafult do not trigger sw checkstop unless explicitly enabled > + * through nvram option 'opal-sw-xstop=enable'. > + * > + * NOTE: Once all checkstop issues are resolved/stabilized reverse > + * the logic to enable sw checkstop by default. > + */ > + if (!nvram_query_eq("opal-sw-xstop", "enable")) > + return rc; > + > if (xstop_xscom.addr) > rc = xscom_writeme(xstop_xscom.addr, > PPC_BIT(xstop_xscom.fir_bit)); > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
On Fri, Dec 15, 2017 at 4:15 AM, ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote: > Hi Mahesh > > On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: >> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >> >> Add a mechanism to enable/disable sw checkstop by looking at nvram option >> opal-sw-xstop=<enable/disable>. >> >> For now this patch disables the sw checkstop trigger unless explicitly >> enabled through nvram option 'opal-sw-xstop=enable'. This will allow an >> opportunity to get host kernel in panic path or xmon for unrecoverable >> HMIs or MCE, to be able to debug the issue effectively. > > > It will be good to enable by default, and let's give control to user to > disable > the sw-xstop using nvram option. Otherwise all test infrastructures will > break. Why does the test infrastructure require SW initiated checkstops? > > Thanks > Pridhiviraj > > >> >> To enable sw checkstop in opal issue following command: >> >> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable >> >> NOTE: This is a workaround patch to disable sw checkstop by default to >> gain >> control in host kernel for better checkstop debugging. Once we have most >> of >> the checkstop issues stabilized/resolved, revisit this patch to enable sw >> checkstop by default. >> >> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >> --- >> hw/xscom.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/hw/xscom.c b/hw/xscom.c >> index de5a27e..4d3d88f 100644 >> --- a/hw/xscom.c >> +++ b/hw/xscom.c >> @@ -24,6 +24,7 @@ >> #include <errorlog.h> >> #include <opal-api.h> >> #include <timebase.h> >> +#include <nvram.h> >> >> /* Mask of bits to clear in HMER before an access */ >> #define HMER_CLR_MASK (~(SPR_HMER_XSCOM_FAIL | \ >> @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void) >> { >> int rc = OPAL_UNSUPPORTED; >> >> + /* >> + * Workaround until we iron out all checkstop issues at present. >> + * >> + * By deafult do not trigger sw checkstop unless explicitly >> enabled >> + * through nvram option 'opal-sw-xstop=enable'. >> + * >> + * NOTE: Once all checkstop issues are resolved/stabilized reverse >> + * the logic to enable sw checkstop by default. >> + */ >> + if (!nvram_query_eq("opal-sw-xstop", "enable")) >> + return rc; >> + >> if (xstop_xscom.addr) >> rc = xscom_writeme(xstop_xscom.addr, >> PPC_BIT(xstop_xscom.fir_bit)); >> >> _______________________________________________ >> Skiboot mailing list >> Skiboot@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot > > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
On Thu, 14 Dec 2017 21:06:09 +0530 Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > > Add a mechanism to enable/disable sw checkstop by looking at nvram option > opal-sw-xstop=<enable/disable>. > Thanks for doing this, we've had a bit of a mid-air collision. The approach is the same, but the code paths and names are different. From b209b538c9e85244eb598ba3be93c342dcf9f46c Mon Sep 17 00:00:00 2001 From: Balbir Singh <bsingharora@gmail.com> Date: Fri, 15 Dec 2017 11:48:17 +1100 Subject: [PATCH] core/platform: Disable software initiated checkstop Software initiated checkstops are hard to debug and becoming more prevelant with coherent memory devices being attached via the NPU. We do the necessary logging as before and introduce a new nvram parameter "reboot-on-plat-error", which can be set to true if the old behavious is desired nvram -p ibm,skiboot --update-config reboot-on-plat-error=true The advantage of doing this in skiboot is that this works for existing distros and allows for better handling where the OS and driver (for coherent memory if present) can do a better job of logging errors and reporting broken state more accurately. Signed-off-by: Balbir Singh <bsingharora@gmail.com> --- core/platform.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/core/platform.c b/core/platform.c index 732f67e5..d3f75b63 100644 --- a/core/platform.c +++ b/core/platform.c @@ -91,7 +91,21 @@ static int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag) prerror("OPAL: failed to log an error\n"); } disable_fast_reboot("Reboot due to Platform Error"); - return xscom_trigger_xstop(); + + /* + * Trigger a checkstop only if reboot-on-plat-error + * is set to true. We've already logged the error + * above, the platform has a record of the error. + * Initiating a software checkstop makes it hard + * to debug issues, specifically with coherent memory + * devices attached via the NPU, where any change + * in link status can trigger an HMI/MCE due to several + * reasons. We want to drop back to the OS and debug + */ + if (!nvram_query_eq("reboot-on-plat-error", "true")) + return xscom_trigger_xstop(); + else + return OPAL_UNSUPPORTED; case OPAL_REBOOT_FULL_IPL: disable_fast_reboot("full IPL reboot requested"); return opal_cec_reboot();
On Thu, 14 Dec 2017 22:45:48 +0530 ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote: > Hi Mahesh > > On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: > > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > > > > Add a mechanism to enable/disable sw checkstop by looking at nvram > > option > > opal-sw-xstop=<enable/disable>. > > > > For now this patch disables the sw checkstop trigger unless explicitly > > enabled through nvram option 'opal-sw-xstop=enable'. This will allow an > > opportunity to get host kernel in panic path or xmon for unrecoverable > > HMIs or MCE, to be able to debug the issue effectively. > > It will be good to enable by default, and let's give control to user to > disable > the sw-xstop using nvram option. Otherwise all test infrastructures will > break. > It's the otherway around right? Lets look at what happens today 1. We get a software initiated checkstop on a WSP box 2. The BMC decides to reboot the box after a timeout 3. The BMC shows that a software initiated checkstop took place How is this useful? On the FSP side, we should still log our diagnostic information before the OS reboots. Am I missing something? Balbir Singh.
On 2017-12-15 05:08, Oliver wrote: > On Fri, Dec 15, 2017 at 4:15 AM, ppaidipe <ppaidipe@linux.vnet.ibm.com> > wrote: >> Hi Mahesh >> >> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: >>> >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >>> >>> Add a mechanism to enable/disable sw checkstop by looking at nvram >>> option >>> opal-sw-xstop=<enable/disable>. >>> >>> For now this patch disables the sw checkstop trigger unless >>> explicitly >>> enabled through nvram option 'opal-sw-xstop=enable'. This will allow >>> an >>> opportunity to get host kernel in panic path or xmon for >>> unrecoverable >>> HMIs or MCE, to be able to debug the issue effectively. >> >> >> It will be good to enable by default, and let's give control to user >> to >> disable >> the sw-xstop using nvram option. Otherwise all test infrastructures >> will >> break. > > Why does the test infrastructure require SW initiated checkstops? I mean existing system test/RAS/Other FW FVT teams got issues related to SW checkstop during running of tests. So it will be good, if we enable by default, we can at-least won't miss any issues related SW checkstop until as long as we have given a way of disabling it using nvram option. > > >> >> Thanks >> Pridhiviraj >> >> >>> >>> To enable sw checkstop in opal issue following command: >>> >>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable >>> >>> NOTE: This is a workaround patch to disable sw checkstop by default >>> to >>> gain >>> control in host kernel for better checkstop debugging. Once we have >>> most >>> of >>> the checkstop issues stabilized/resolved, revisit this patch to >>> enable sw >>> checkstop by default. >>> >>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >>> --- >>> hw/xscom.c | 13 +++++++++++++ >>> 1 file changed, 13 insertions(+) >>> >>> diff --git a/hw/xscom.c b/hw/xscom.c >>> index de5a27e..4d3d88f 100644 >>> --- a/hw/xscom.c >>> +++ b/hw/xscom.c >>> @@ -24,6 +24,7 @@ >>> #include <errorlog.h> >>> #include <opal-api.h> >>> #include <timebase.h> >>> +#include <nvram.h> >>> >>> /* Mask of bits to clear in HMER before an access */ >>> #define HMER_CLR_MASK (~(SPR_HMER_XSCOM_FAIL | \ >>> @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void) >>> { >>> int rc = OPAL_UNSUPPORTED; >>> >>> + /* >>> + * Workaround until we iron out all checkstop issues at >>> present. >>> + * >>> + * By deafult do not trigger sw checkstop unless explicitly >>> enabled >>> + * through nvram option 'opal-sw-xstop=enable'. >>> + * >>> + * NOTE: Once all checkstop issues are resolved/stabilized >>> reverse >>> + * the logic to enable sw checkstop by default. >>> + */ >>> + if (!nvram_query_eq("opal-sw-xstop", "enable")) >>> + return rc; >>> + >>> if (xstop_xscom.addr) >>> rc = xscom_writeme(xstop_xscom.addr, >>> PPC_BIT(xstop_xscom.fir_bit)); >>> >>> _______________________________________________ >>> Skiboot mailing list >>> Skiboot@lists.ozlabs.org >>> https://lists.ozlabs.org/listinfo/skiboot >> >> >> _______________________________________________ >> Skiboot mailing list >> Skiboot@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot
On 2017-12-15 06:34, Balbir Singh wrote: > On Thu, 14 Dec 2017 22:45:48 +0530 > ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote: > >> Hi Mahesh >> >> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: >> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >> > >> > Add a mechanism to enable/disable sw checkstop by looking at nvram >> > option >> > opal-sw-xstop=<enable/disable>. >> > >> > For now this patch disables the sw checkstop trigger unless explicitly >> > enabled through nvram option 'opal-sw-xstop=enable'. This will allow an >> > opportunity to get host kernel in panic path or xmon for unrecoverable >> > HMIs or MCE, to be able to debug the issue effectively. >> >> It will be good to enable by default, and let's give control to user >> to >> disable >> the sw-xstop using nvram option. Otherwise all test infrastructures >> will >> break. >> > > It's the otherway around right? Lets look at what happens today > > 1. We get a software initiated checkstop on a WSP box > 2. The BMC decides to reboot the box after a timeout > 3. The BMC shows that a software initiated checkstop took place > > How is this useful? you are right it is of less useful in case of BMC systems, but still we can stop the auto reboot after checkstop and use existing pdbg tools to collect require debug data(Not sure how much useful it can capture). > > On the FSP side, we should still log our diagnostic information > before the OS reboots. > Correct. > Am I missing something? > And it disables for all P8 platforms as well. So atleast we can enable it there. > Balbir Singh.
On 12/15/2017 01:05 PM, ppaidipe wrote: > On 2017-12-15 06:34, Balbir Singh wrote: >> On Thu, 14 Dec 2017 22:45:48 +0530 >> ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote: >> >>> Hi Mahesh >>> >>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: >>> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >>> > >>> > Add a mechanism to enable/disable sw checkstop by looking at nvram >>> > option >>> > opal-sw-xstop=<enable/disable>. >>> > >>> > For now this patch disables the sw checkstop trigger unless explicitly >>> > enabled through nvram option 'opal-sw-xstop=enable'. This will >>> allow an >>> > opportunity to get host kernel in panic path or xmon for unrecoverable >>> > HMIs or MCE, to be able to debug the issue effectively. >>> >>> It will be good to enable by default, and let's give control to user to >>> disable >>> the sw-xstop using nvram option. Otherwise all test infrastructures will >>> break. >>> >> >> It's the otherway around right? Lets look at what happens today >> >> 1. We get a software initiated checkstop on a WSP box >> 2. The BMC decides to reboot the box after a timeout >> 3. The BMC shows that a software initiated checkstop took place >> >> How is this useful? > > you are right it is of less useful in case of BMC systems, but still we > can stop > the auto reboot after checkstop and use existing pdbg tools to collect > require debug > data(Not sure how much useful it can capture). > > >> >> On the FSP side, we should still log our diagnostic information >> before the OS reboots. >> > > Correct. > >> Am I missing something? >> > > And it disables for all P8 platforms as well. So atleast we can enable > it there. Hmm.. We can add p9 check and keep the p8 behaviour as is. Would that help ? Thanks, -Mahesh. > >> Balbir Singh.
On 2017-12-15 22:06, Mahesh Jagannath Salgaonkar wrote: > On 12/15/2017 01:05 PM, ppaidipe wrote: >> On 2017-12-15 06:34, Balbir Singh wrote: >>> On Thu, 14 Dec 2017 22:45:48 +0530 >>> ppaidipe <ppaidipe@linux.vnet.ibm.com> wrote: >>> >>>> Hi Mahesh >>>> >>>> On 2017-12-14 21:06, Mahesh J Salgaonkar wrote: >>>> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >>>> > >>>> > Add a mechanism to enable/disable sw checkstop by looking at nvram >>>> > option >>>> > opal-sw-xstop=<enable/disable>. >>>> > >>>> > For now this patch disables the sw checkstop trigger unless explicitly >>>> > enabled through nvram option 'opal-sw-xstop=enable'. This will >>>> allow an >>>> > opportunity to get host kernel in panic path or xmon for unrecoverable >>>> > HMIs or MCE, to be able to debug the issue effectively. >>>> >>>> It will be good to enable by default, and let's give control to user >>>> to >>>> disable >>>> the sw-xstop using nvram option. Otherwise all test infrastructures >>>> will >>>> break. >>>> >>> >>> It's the otherway around right? Lets look at what happens today >>> >>> 1. We get a software initiated checkstop on a WSP box >>> 2. The BMC decides to reboot the box after a timeout >>> 3. The BMC shows that a software initiated checkstop took place >>> >>> How is this useful? >> >> you are right it is of less useful in case of BMC systems, but still >> we >> can stop >> the auto reboot after checkstop and use existing pdbg tools to collect >> require debug >> data(Not sure how much useful it can capture). >> >> >>> >>> On the FSP side, we should still log our diagnostic information >>> before the OS reboots. >>> >> >> Correct. >> >>> Am I missing something? >>> >> >> And it disables for all P8 platforms as well. So atleast we can enable >> it there. > > Hmm.. We can add p9 check and keep the p8 behaviour as is. Would that > help ? > yes, that will be good. > Thanks, > -Mahesh. > >> >>> Balbir Singh.
diff --git a/hw/xscom.c b/hw/xscom.c index de5a27e..4d3d88f 100644 --- a/hw/xscom.c +++ b/hw/xscom.c @@ -24,6 +24,7 @@ #include <errorlog.h> #include <opal-api.h> #include <timebase.h> +#include <nvram.h> /* Mask of bits to clear in HMER before an access */ #define HMER_CLR_MASK (~(SPR_HMER_XSCOM_FAIL | \ @@ -827,6 +828,18 @@ int64_t xscom_trigger_xstop(void) { int rc = OPAL_UNSUPPORTED; + /* + * Workaround until we iron out all checkstop issues at present. + * + * By deafult do not trigger sw checkstop unless explicitly enabled + * through nvram option 'opal-sw-xstop=enable'. + * + * NOTE: Once all checkstop issues are resolved/stabilized reverse + * the logic to enable sw checkstop by default. + */ + if (!nvram_query_eq("opal-sw-xstop", "enable")) + return rc; + if (xstop_xscom.addr) rc = xscom_writeme(xstop_xscom.addr, PPC_BIT(xstop_xscom.fir_bit));