Message ID | 151362060219.27708.7450373707409056345.stgit@jupiter.in.ibm.com |
---|---|
State | Accepted |
Series | [v2] opal/xstop: Use nvram option to enable/disable sw checkstop. |
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> Add a mechanism to enable/disable sw checkstop by looking at nvram option
> opal-sw-xstop=<enable/disable>.
>
> For now this patch disables the sw checkstop trigger unless explicitly
> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
> an opportunity to get host kernel in panic path or xmon for unrecoverable
> HMIs or MCE, to be able to debug the issue effectively.
>
> To enable sw checkstop in opal issue following command:
>
> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>
> NOTE: This is a workaround patch to disable sw checkstop by default to gain
> control in host kernel for better checkstop debugging. Once we have most of
> the checkstop issues stabilized/resolved, revisit this patch to enable sw
> checkstop by default.
>
> For p8 platform it will remain enabled by default unless explicitly disabled.
>
> To disable sw checkstop on p8 issue following command:
>
> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
> ---
> Change in v2:
> - Add pr_log to indicate that sw checkstop was disabled.
> ---
> hw/xscom.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)

All a bit umming-and-ahhing about the behaviour change... but this seems
to be the "easiest" for now.... and I reserve the right to change my
mind at any point :)

I think the correct solution here is to have the kernel make the
appropriate decision rather than having this workaround in OPAL.

But.. well... reality and today was checkstop heavy, so my mind kind of
changed :)

Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.

I think having the option to *disable* it is always going to be good,
but... well... I don't like that we end up in a situation where the
kernel says "everything is terrible because you told me it was terrible,
please reboot now" and then we ignore it.

The real solution is a kernel one....
On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith <stewart@linux.vnet.ibm.com> wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>> opal-sw-xstop=<enable/disable>.
>>
>> For now this patch disables the sw checkstop trigger unless explicitly
>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>> HMIs or MCE, to be able to debug the issue effectively.
>>
>> To enable sw checkstop in opal issue following command:
>>
>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>
>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>> control in host kernel for better checkstop debugging. Once we have most of
>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>> checkstop by default.
>>
>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>
>> To disable sw checkstop on p8 issue following command:
>>
>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>> ---
>> Change in v2:
>> - Add pr_log to indicate that sw checkstop was disabled.
>> ---
>> hw/xscom.c | 32 ++++++++++++++++++++++++++++++++
>> 1 file changed, 32 insertions(+)
>
> All a bit umming-and-ahhing about the behaviour change... but this seems
> to be the "easiest" for now.... and I reserve the right to change my
> mind at any point :)
>
> I think the correct solution here is to have the kernel make the
> appropriate decision rather than having this workaround in OPAL.
>
> But.. well... reality and today was checkstop heavy, so my mind kind of
> changed :)
>
> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>
> I think having the option to *disable* it is always going to be good,
> but... well... I don't like that we end up in a situation where the
> kernel says "everything is terrible because you told me it was terrible,
> please reboot now" and then we ignore it.
>
> The real solution is a kernel one....

It really isn't. If we are reporting unrecoverable HMIs to the kernel
then the kernel has every right to assume the world is on fire and
request a shutdown. If we want the kernel to do something else then we
need to change what OPAL reports back to the kernel. Just disabling
the software xstop is a gross hack at best. It's not even clear that
just disabling the xstop is sufficient to keep the host up and running
since the kernel thread that initiated the shutdown isn't expecting to
return...

That said, it's a stupid debug hack so who cares.

> --
> Stewart Smith
> OPAL Architect, IBM.
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
Oliver <oohall@gmail.com> writes:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart@linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>>> opal-sw-xstop=<enable/disable>.
>>>
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>>
>>> To enable sw checkstop in opal issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>>
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>>
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>>
>>> To disable sw checkstop on p8 issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>>> ---
>>> Change in v2:
>>> - Add pr_log to indicate that sw checkstop was disabled.
>>> ---
>>> hw/xscom.c | 32 ++++++++++++++++++++++++++++++++
>>> 1 file changed, 32 insertions(+)
>>
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>>
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>>
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>>
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>>
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>>
>> The real solution is a kernel one....
>
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling
> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
> return...

Yeah, that would be the better place to fix things - telling the kernel
that it did a naughty rather than the machine is borked.

I'm sure we'll figure it out sometime after I stop seeing "PRD: Hardware
problem, very low chance of software cause" that's actually a 100% software
problem.
On Mon, Jan 15, 2018 at 12:26 PM, Oliver <oohall@gmail.com> wrote:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart@linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>>> opal-sw-xstop=<enable/disable>.
>>>
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>>
>>> To enable sw checkstop in opal issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>>
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>>
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>>
>>> To disable sw checkstop on p8 issue following command:
>>>
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
>>> ---
>>> Change in v2:
>>> - Add pr_log to indicate that sw checkstop was disabled.
>>> ---
>>> hw/xscom.c | 32 ++++++++++++++++++++++++++++++++
>>> 1 file changed, 32 insertions(+)
>>
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>>
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>>
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>>
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>>
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>>
>> The real solution is a kernel one....
>
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling

The real issue is keeping backwards compat for p8 and allowing a checkstop
for machines that care to do so.

> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
> return...
>

The real goal of the patch is to log (get the context of what was happening
when we triggered the platform error). We could still dump that on the
console, but going back to the kernel lets us crash/xmon and get more info
before rebooting the box. Hostboot printing that we got a software
initiated checkstop (TI) is not useful to be honest and we're seeing NPU2
devices cause HMIs and machine checks, so it's useful to see the context at
the time of the error.

> That said, it's a stupid debug hack so who cares.
>

Balbir Singh.
diff --git a/hw/xscom.c b/hw/xscom.c
index de5a27e..0501278 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -24,6 +24,7 @@
 #include <errorlog.h>
 #include <opal-api.h>
 #include <timebase.h>
+#include <nvram.h>
 
 /* Mask of bits to clear in HMER before an access */
 #define HMER_CLR_MASK	(~(SPR_HMER_XSCOM_FAIL | \
@@ -826,6 +827,37 @@ static void xscom_init_chip_info(struct proc_chip *chip)
 int64_t xscom_trigger_xstop(void)
 {
 	int rc = OPAL_UNSUPPORTED;
+	bool xstop_disabled = false;
+
+	/*
+	 * Workaround until we iron out all checkstop issues at present.
+	 *
+	 * For p9:
+	 * By default do not trigger sw checkstop unless explicitly enabled
+	 * through nvram option 'opal-sw-xstop=enable'.
+	 *
+	 * For p8:
+	 * Keep it enabled by default unless explicitly disabled.
+	 *
+	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
+	 * the logic to enable sw checkstop by default on p9.
+	 */
+	switch (proc_gen) {
+	case proc_gen_p8:
+		if (nvram_query_eq("opal-sw-xstop", "disable"))
+			xstop_disabled = true;
+		break;
+	case proc_gen_p9:
+	default:
+		if (!nvram_query_eq("opal-sw-xstop", "enable"))
+			xstop_disabled = true;
+		break;
+	}
+
+	if (xstop_disabled) {
+		prlog(PR_NOTICE, "Software initiated checkstop disabled.\n");
+		return rc;
+	}
 
 	if (xstop_xscom.addr)
 		rc = xscom_writeme(xstop_xscom.addr,
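
For anyone reading the thread later, the per-platform defaults in the hunk above can be modelled outside skiboot. The following is only a minimal sketch under stated assumptions, not the merged code: `enum proc_generation`, `query_eq()` and `sw_xstop_disabled()` are hypothetical names, and the nvram lookup is replaced by a plain string argument (NULL meaning the `opal-sw-xstop` key is unset).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for skiboot's proc_gen values. */
enum proc_generation { GEN_P8, GEN_P9 };

/*
 * Hypothetical model of the nvram comparison: `stored` is the value held
 * for opal-sw-xstop, or NULL when the key is not set at all.
 */
static bool query_eq(const char *stored, const char *expected)
{
	return stored != NULL && strcmp(stored, expected) == 0;
}

/* Returns true when the software checkstop would be skipped. */
static bool sw_xstop_disabled(enum proc_generation gen, const char *stored)
{
	switch (gen) {
	case GEN_P8:
		/* p8: stays enabled unless explicitly disabled. */
		return query_eq(stored, "disable");
	case GEN_P9:
	default:
		/* p9 and any other generation: off unless explicitly enabled. */
		return !query_eq(stored, "enable");
	}
}
```

Under this model an unset key leaves the checkstop enabled on p8 and disabled on p9, and an unrecognised value (a typo, say) behaves exactly like an unset key on both.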