[v3,0/4] Fixes for 3 separate NMI reentrancy bugs

Message ID: 20190226060901.18715-1-npiggin@gmail.com

Message

Nicholas Piggin Feb. 26, 2019, 6:08 a.m. UTC
This series fixes several similar but unrelated bugs in which an NMI
clobbers live registers and the handler does not notice, because MSR[RI]
is set. The bugs are rare, but the consequence is serious: silent
corruption.
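
To illustrate the class of bug, here is a toy model in C. It is only a sketch and not the kernel code: the struct, the function names, the single shared scratch slot and the address range are all hypothetical. It shows why a recoverability test that looks only at MSR[RI] can miss a clobber, and how an additional check of the interrupted NIA against the interrupt entry range (one plausible shape for such a test) would catch it.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_RI (1ULL << 1)	/* MSR[RI]: Recoverable Interrupt */

/* Hypothetical address range occupied by interrupt entry code. */
#define TOY_ENTRY_START	0x4000ULL
#define TOY_ENTRY_END	0x5000ULL

/* Toy CPU state: only the pieces the illustration needs. */
struct toy_cpu {
	uint64_t msr;
	uint64_t nia;		/* next instruction address */
	uint64_t scratch;	/* one scratch slot shared by all entry code */
	uint64_t r13;
};

/*
 * First instructions of an HV interrupt in this model: stash r13 in
 * the scratch slot. MSR[RI] is left exactly as it was.
 */
static void toy_hv_interrupt_entry(struct toy_cpu *cpu)
{
	cpu->nia = TOY_ENTRY_START;
	cpu->scratch = cpu->r13;	/* scratch is now live */
	cpu->r13 = 0xAAAA;		/* handler's own value */
}

/* NMI entry reuses the same scratch slot and overwrites it. */
static void toy_nmi_entry(struct toy_cpu *cpu)
{
	cpu->scratch = cpu->r13;
}

/* Recoverability test that trusts MSR[RI] alone. */
static bool toy_recoverable_ri_only(const struct toy_cpu *cpu)
{
	return (cpu->msr & MSR_RI) != 0;
}

/* Same test, but also refuses if the NMI landed in entry code. */
static bool toy_recoverable_ri_and_nia(const struct toy_cpu *cpu)
{
	bool in_entry = cpu->nia >= TOY_ENTRY_START &&
			cpu->nia < TOY_ENTRY_END;

	return (cpu->msr & MSR_RI) != 0 && !in_entry;
}
```

In this model, calling toy_hv_interrupt_entry() and then toy_nmi_entry() loses the saved r13, yet toy_recoverable_ri_only() still reports the context as recoverable; toy_recoverable_ri_and_nia() does not.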

For the most part these can be observed and tested quite easily
with the mambo simulator, except that it does not seem to follow
the architecture wrt leaving MSR[RI] unchanged for HV interrupts.
Mambo clears MSR[RI], so you have to account for that manually.

Since v1:
- Fixed several build bugs.

Since v2:
- Improved changelog and comments.
- Fixed the NIA test for virt mode interrupts.

Nicholas Piggin (4):
  powerpc/64s: Fix HV NMI vs HV interrupt recoverability test
  powerpc/64s: system reset interrupt preserve HSRRs
  powerpc/64s: Prepare to handle data interrupts vs d-side MCE
    reentrancy
  powerpc/64s: Fix data interrupts vs d-side MCE reentrancy

 arch/powerpc/include/asm/asm-prototypes.h |  8 ++
 arch/powerpc/include/asm/nmi.h            |  2 +
 arch/powerpc/kernel/exceptions-64s.S      | 92 +++++++++++++++++++----
 arch/powerpc/kernel/mce.c                 |  3 +
 arch/powerpc/kernel/traps.c               | 91 +++++++++++++++++++++-
 5 files changed, 179 insertions(+), 17 deletions(-)

Comments

Satheesh Rajendran Feb. 26, 2019, 6:51 a.m. UTC | #1
On Tue, Feb 26, 2019 at 04:08:57PM +1000, Nicholas Piggin wrote:
> This series fixes several similar but unrelated bugs with NMIs
> clobbering live registers without noticing it, because MSR[RI] is set.
> Pretty rare bugs, but serious silent corruption consequences.
> 
> For the most part these can be observed and tested quite easily
> with the mambo simulator, except that it does not seem to follow
> the architecture wrt leaving MSR[RI] unchanged for HV interrupts.
> Mambo clears MSR[RI], so you have to account for that manually.
> 
> Since v1:
> - Fixed several build bugs.
> 
> Since v2:
> - Improved changelog and comments.
> - Fixed the NIA test for virt mode interrupts.

I hit the crash below on a Power8 box with this series applied on top of the linuxppc merge branch, built with `ppc64le_defconfig`:

UnknownStateTransition: Something happened system state="8" and we transitioned to UNKNOWN state.  Review the following for more details
Message="OpTestSystem in run_IPLing and Exception="Kernel OOPS (machine in state '5'): Oops: Kernel access of bad area, sig: 11 [#1]
[    0.000000] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc7-gf46b87021 #1
[    0.000000] NIP:  c000000000c1306c LR: c000000000c12f64 CTR: c00000000033d860
[    0.000000] REGS: c0000000014878b0 TRAP: 0380   Not tainted  (5.0.0-rc7-gf46b87021)
[    0.000000] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 28002224  XER: 00000000
[    0.000000] CFAR: c000000000c12f7c IRQMASK: 1 
[    0.000000] GPR00: c000000000c12f64 c000000001487b40 c000000001488400 f000000000000000 
[    0.000000] GPR04: c000000001487b18 c000000001487b20 0000000000000000 c000000001388400 
[    0.000000] GPR08: f000000000000000 f000000000000008 0000000000000000 0000000800000000 
[    0.000000] GPR12: c0000000015e1ed0 c000000001670000 0000000000000000 0000000000000000 
[    0.000000] GPR16: 0000000000000000 0000000000000000 c0000000015e0d40 0000000000000001 
[    0.000000] GPR20: ffffffffffffffff ffffffffffffffff 0000000008000000 c000000001413b90 
[    0.000000] GPR24: c000000001413b98 007ffff000000000 0000000000080000 0000000000000000 
[    0.000000] GPR28: 0000000000000000 0000000000000000 007ffff000001000 0000000000000000 
[    0.000000] NIP [c000000000c1306c] memmap_init_zone+0x258/0x308
[    0.000000] LR [c000000000c12f64] memmap_init_zone+0x150/0x308
[    0.000000] Call Trace:
[    0.000000] [c000000001487b40] [c000000000c12f64] memmap_init_zone+0x150/0x308 (unreliable)
[    0.000000] [c000000001487be0] [c000000000f87acc] free_area_init_node+0x480/0x518
[    0.000000] [c000000001487cf0] [c000000000f88630] free_area_init_nodes+0x838/0x940
[    0.000000] [c000000001487e10] [c000000000f6340c] paging_init+0x8c/0xa8
[    0.000000] [c000000001487e80] [c000000000f5bc00] setup_arch+0x3b4/0x3f0
[    0.000000] [c000000001487ef0] [c000000000f53b68] start_kernel+0x94/0x630
[    0.000000] [c000000001487f90] [c00000000000b37c] start_here_common+0x1c/0x520
[    0.000000] Instruction dump:
[    0.000000] 71290002 41820014 ebea0008 7cc6fa14 78df8402 48000070 3d22000c 7bea3664 
[    0.000000] 39299d20 e9090000 7c685214 39230008 <fa290010> fa290018 fa290020 fa290030 
[    0.000000] random: get_random_bytes called from print_oops_end_marker+0x40/0x80 with crng_init=0
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] 
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] Rebooting in 10 seconds" caused the system to go to UNKNOWN_BAD and the system will be stopping."
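
For reading the MSR line in the oops above, a small C sketch that decodes the raw value into flag names may help. The helper name msr_decode and the table are hypothetical, and the table is deliberately partial: it lists only the bits that appear in this oops, using the 64-bit Book3S bit positions (SF=63, HV=60, ME=12, IR=5, DR=4, RI=1, LE=0).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct msr_bit {
	unsigned int shift;
	const char *name;
};

/* Partial table: only the MSR bits seen in the oops above. */
static const struct msr_bit msr_bits[] = {
	{ 63, "SF" }, { 60, "HV" }, { 12, "ME" }, { 5, "IR" },
	{ 4, "DR" }, { 1, "RI" }, { 0, "LE" },
};

/* Write the comma-separated names of the set bits into buf. */
static char *msr_decode(uint64_t msr, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(msr_bits) / sizeof(msr_bits[0]); i++) {
		int n;

		if (!(msr & (1ULL << msr_bits[i].shift)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", msr_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* stop once the buffer is full */
		used += (size_t)n;
	}
	return buf;
}
```

Passing 0x9000000000001033 with a 64-byte buffer gives "SF,HV,ME,IR,DR,RI,LE", which matches the flags printed by the kernel and shows MSR[RI] set at the time of the oops.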

Regards,
-Satheesh.