Message ID | 20180308003606.10721-1-bsingharora@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 5ee573e8ef034e687c420cb10911371488d14b10 |
Series | powerpc/powernv/mce: Don't silently restart the machine |

Balbir Singh <bsingharora@gmail.com> writes:
> On MCE the current code will restart the machine with
> ppc_md.restart(). This case was extremely unlikely since
> prior to that a skiboot call is made and that resulted in
> a checkstop for analysis.
>
> With newer skiboots, on P9 we don't checkstop the box by
> default, instead we return back to the kernel to extract
> useful information at the time of the MCE. While we still
> get this information, this patch converts the restart to
> a panic(), so that if configured a dump can be taken and
> we can track and probably debug the potential issue causing
> the MCE.

This will likely change again, but I can send a patch that changes the
comment (along with the logic of decoding it all and having enough
information to make sensible decisions). But... I kind of don't want to
bikeshed a comment to death :)

I reckon the panic() here is the right thing to do no matter
what.

Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>

On Thu, Mar 8, 2018 at 12:20 PM, Stewart Smith
<stewart@linux.vnet.ibm.com> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>> On MCE the current code will restart the machine with
>> ppc_md.restart(). This case was extremely unlikely since
>> prior to that a skiboot call is made and that resulted in
>> a checkstop for analysis.
>>
>> With newer skiboots, on P9 we don't checkstop the box by
>> default, instead we return back to the kernel to extract
>> useful information at the time of the MCE. While we still
>> get this information, this patch converts the restart to
>> a panic(), so that if configured a dump can be taken and
>> we can track and probably debug the potential issue causing
>> the MCE.
>
> This will likely change again, but I can send a patch that changes the
> comment (along with the logic of decoding it all and having enough
> information to make sensible decisions). But... I kind of don't want to
> bikeshed a comment to death :)
>
> I reckon the panic() here is the right thing to do no matter
> what.
>
> Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>

Thanks!
Balbir Singh.

On Thu, 2018-03-08 at 00:36:06 UTC, Balbir Singh wrote:
> On MCE the current code will restart the machine with
> ppc_md.restart(). This case was extremely unlikely since
> prior to that a skiboot call is made and that resulted in
> a checkstop for analysis.
>
> With newer skiboots, on P9 we don't checkstop the box by
> default, instead we return back to the kernel to extract
> useful information at the time of the MCE. While we still
> get this information, this patch converts the restart to
> a panic(), so that if configured a dump can be taken and
> we can track and probably debug the potential issue causing
> the MCE.
>
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5ee573e8ef034e687c420cb1091137

cheers

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index c15182765ff5..516e23de5a3d 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -490,9 +490,12 @@ void pnv_platform_error_reboot(struct pt_regs *regs, const char *msg)
 	 *    opal to trigger checkstop explicitly for error analysis.
 	 *    The FSP PRD component would have already got notified
 	 *    about this error through other channels.
+	 * 4. We are running on a newer skiboot that by default does
+	 *    not cause a checkstop, drops us back to the kernel to
+	 *    extract context and state at the time of the error.
 	 */
 
-	ppc_md.restart(NULL);
+	panic(msg);
 }
 
 int opal_machine_check(struct pt_regs *regs)
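
For readers who want the behaviour change in one place, here is a rough userspace model of the tail of the error path described in the commit message. It is only a sketch and not kernel code: the enum, `model_error_reboot()`, `kernel_dump_possible()` and both boolean parameters are hypothetical names made up for illustration, and the model assumes that only the panic() path gives a configured dump a chance to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the platform error path; not kernel identifiers. */
enum mce_outcome {
	MCE_CHECKSTOP,      /* firmware checkstopped the box for analysis */
	MCE_SILENT_RESTART, /* before the patch: ppc_md.restart(NULL) */
	MCE_PANIC           /* after the patch: panic(msg) */
};

/*
 * Model of what happens after the firmware call on a machine check.
 * firmware_checkstops: true if the firmware call checkstops the machine
 *                      (the older skiboot behaviour per the commit message).
 * patched:             true to model the code with this patch applied.
 */
enum mce_outcome model_error_reboot(bool firmware_checkstops, bool patched)
{
	if (firmware_checkstops)
		return MCE_CHECKSTOP; /* the kernel never reaches restart or panic */
	return patched ? MCE_PANIC : MCE_SILENT_RESTART;
}

/* Assumption of this model: only a panic lets a configured dump be taken. */
bool kernel_dump_possible(enum mce_outcome outcome)
{
	return outcome == MCE_PANIC;
}

/* Small self-check covering the case the patch is about. */
void model_self_check(void)
{
	/* Newer skiboot returns to the kernel: restart before, panic after. */
	assert(model_error_reboot(false, false) == MCE_SILENT_RESTART);
	assert(model_error_reboot(false, true) == MCE_PANIC);
}
```

In this model the patch only matters when the firmware returns to the kernel; if the firmware checkstops the machine, the outcome is the same before and after.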