diff mbox series

[v1] powerpc/64s: Fix unrecoverable MCE crash

Message ID 20210922020247.209409-1-npiggin@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v1] powerpc/64s: Fix unrecoverable MCE crash | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 7 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 24 jobs.

Commit Message

Nicholas Piggin Sept. 22, 2021, 2:02 a.m. UTC
The machine check handler is not considered NMI on 64s. The early
handler is the true NMI handler, and then it schedules the
machine_check_exception handler to run when interrupts are enabled.

This works fine except the case of an unrecoverable MCE, where the true
NMI is taken when MSR[RI] is clear, it can not recover to schedule the
next handler, so it calls machine_check_exception directly so something
might be done about it.

Calling an async handler from NMI context can result in irq state and
other things getting corrupted. This can also trigger the BUG at
arch/powerpc/include/asm/interrupt.h:168.

Fix this by just making the 64s machine_check_exception handler an NMI
like it is on other subarchs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/interrupt.h |  4 ----
 arch/powerpc/kernel/traps.c          | 23 +++++++----------------
 2 files changed, 7 insertions(+), 20 deletions(-)

Comments

Ganesh G R Sept. 23, 2021, 6:22 p.m. UTC | #1
On 9/22/21 7:32 AM, Nicholas Piggin wrote:

> The machine check handler is not considered NMI on 64s. The early
> handler is the true NMI handler, and then it schedules the
> machine_check_exception handler to run when interrupts are enabled.
>
> This works fine except the case of an unrecoverable MCE, where the true
> NMI is taken when MSR[RI] is clear, it can not recover to schedule the
> next handler, so it calls machine_check_exception directly so something
> might be done about it.
>
> Calling an async handler from NMI context can result in irq state and
> other things getting corrupted. This can also trigger the BUG at
> arch/powerpc/include/asm/interrupt.h:168.
>
> Fix this by just making the 64s machine_check_exception handler an NMI
> like it is on other subarchs.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---

Hi Nick,

If I inject control memory access error in LPAR on top of this patch
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210906084303.183921-1-ganeshgr@linux.ibm.com/

I see the following warning trace

WARNING: CPU: 130 PID: 7122 at arch/powerpc/include/asm/interrupt.h:319 machine_check_exception+0x310/0x340
  Modules linked in:
  CPU: 130 PID: 7122 Comm: inj_access_err Kdump: loaded Tainted: G   M              5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty #22
  NIP:  c00000000002f980 LR: c00000000002f7e8 CTR: c000000000a31860
  REGS: c0000039fe51bb20 TRAP: 0700   Tainted: G   M               (5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
  MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 88000222  XER: 20040000
  CFAR: c00000000002f844 IRQMASK: 0
  GPR00: c00000000002f798 c0000039fe51bdc0 c0000000020d0000 0000000000000001
  GPR04: 0000000000000000 4000000000000002 4000000000000000 00000000000019af
  GPR08: 00000077e5ad0000 0000000000000000 c0000077ee16c700 0000000000000080
  GPR12: 0000000088000222 c0000077ee16c700 0000000000000000 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 0000000000000000 c0000000020fecd8 0000000000000000
  GPR28: 0000000000000000 0000000000000001 0000000000000001 c0000039fe51be80
  NIP [c00000000002f980] machine_check_exception+0x310/0x340
  LR [c00000000002f7e8] machine_check_exception+0x178/0x340
  Call Trace:
  [c0000039fe51bdc0] [c00000000002f798] machine_check_exception+0x128/0x340 (unreliable)
  [c0000039fe51be10] [c0000000000086ec] machine_check_common+0x1ac/0x1b0
  --- interrupt: 200 at 0x10000968
  NIP:  0000000010000968 LR: 0000000010000958 CTR: 0000000000000000
  REGS: c0000039fe51be80 TRAP: 0200   Tainted: G   M               (5.15.0-rc2-cma-00054-g4a0d59fbaf71-dirty)
  MSR:  8000000002a0f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 22000824  XER: 00000000
  CFAR: 000000000000021c DAR: 00007fffb00c0000 DSISR: 02000008 IRQMASK: 0
  GPR00: 0000000022000824 00007fffc9647770 0000000010027f00 00007fffb00c0000
  GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR08: 0000000000000000 00007fffb00c0000 0000000000000001 0000000000000000
  GPR12: 0000000000000000 00007fffb015a330 0000000000000000 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 0000000000000000 0000000000000000 000000001000085c
  GPR28: 00007fffc9647d18 0000000000000001 00000000100009b0 00007fffc9647770
  NIP [0000000010000968] 0x10000968
  LR [0000000010000958] 0x10000958
  --- interrupt: 200
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index 6b800d3e2681..b32ed910a8cf 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -524,11 +524,7 @@  static __always_inline long ____##func(struct pt_regs *regs)
 /* Interrupt handlers */
 /* kernel/traps.c */
 DECLARE_INTERRUPT_HANDLER_NMI(system_reset_exception);
-#ifdef CONFIG_PPC_BOOK3S_64
-DECLARE_INTERRUPT_HANDLER_ASYNC(machine_check_exception);
-#else
 DECLARE_INTERRUPT_HANDLER_NMI(machine_check_exception);
-#endif
 DECLARE_INTERRUPT_HANDLER(SMIException);
 DECLARE_INTERRUPT_HANDLER(handle_hmi_exception);
 DECLARE_INTERRUPT_HANDLER(unknown_exception);
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index aac8c0412ff9..b21450c655d2 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -790,24 +790,19 @@  void die_mce(const char *str, struct pt_regs *regs, long err)
 	 * do_exit() checks for in_interrupt() and panics in that case, so
 	 * exit the irq/nmi before calling die.
 	 */
-	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
-		irq_exit();
-	else
-		nmi_exit();
+	nmi_exit();
 	die(str, regs, err);
 }
 
 /*
- * BOOK3S_64 does not call this handler as a non-maskable interrupt
- * (it uses its own early real-mode handler to handle the MCE proper
- * and then raises irq_work to call this handler when interrupts are
- * enabled).
+ * BOOK3S_64 does not call this handler as a non-maskable interrupt (it uses
+ * its own early real-mode handler to handle the MCE proper and then raises
+ * irq_work to call this handler when interrupts are enabled), except in the
+ * case of unrecoverable_mce. If unrecoverable_mce was a separate NMI handler,
+ * then this could be ASYNC on 64s. However it should all work okay as an NMI
+ * handler (and it is NMI on other platforms) so just make it an NMI.
  */
-#ifdef CONFIG_PPC_BOOK3S_64
-DEFINE_INTERRUPT_HANDLER_ASYNC(machine_check_exception)
-#else
 DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
-#endif
 {
 	int recover = 0;
 
@@ -842,11 +837,7 @@  DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
 	if (regs_is_unrecoverable(regs))
 		die_mce("Unrecoverable Machine check", regs, SIGBUS);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-	return;
-#else
 	return 0;
-#endif
 }
 
 DEFINE_INTERRUPT_HANDLER(SMIException) /* async? */