pseries/eeh: fix the kdump kernel crash during eeh_pseries_init

Message ID 163215558252.413351.8600189949820258982.stgit@jupiter (mailing list archive)
State Accepted

Checks

Context                                          Check    Description
snowpatch_ozlabs/github-powerpc_clang            success  Successfully ran 7 jobs.
snowpatch_ozlabs/github-powerpc_ppctests         success  Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests        success  Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu      success  Successfully ran 24 jobs.
snowpatch_ozlabs/github-powerpc_sparse           success  Successfully ran 4 jobs.

Commit Message

Mahesh J Salgaonkar Sept. 20, 2021, 4:33 p.m. UTC
On a pseries LPAR, when an empty slot is assigned to the partition or the
system runs in single LPAR mode, the kdump kernel crashes while issuing a
PHB reset. In the kdump scenario, we traverse all PHBs and issue a reset
using the pe_config_addr of the first child device present under each PHB.
However, the code assumes that no PHB slot can be empty and uses
list_first_entry() to get the first child device under the PHB. Since
list_first_entry() expects the list to be non-empty, it returns an invalid
pci_dn entry, and the code ends up accessing a NULL phb pointer through
pci_dn->phb, causing the kdump kernel crash.
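
To illustrate why list_first_entry() is unsafe on an empty list, here is a
minimal userspace sketch. The list helpers are simplified stand-ins for the
ones in <linux/list.h>, and struct fake_pci_dn is a hypothetical, cut-down
node whose layout is an assumption and does not match the real struct pci_dn.
The point is only that the macro never returns NULL: on an empty list it
applies container_of() to the list head itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's <linux/list.h> helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical, cut-down node; NOT the real struct pci_dn layout. */
struct fake_pci_dn {
	struct list_head list;       /* link in the parent's child_list */
	struct list_head child_list; /* head of this node's children */
	void *phb;
};

/*
 * Returns true if list_first_entry() on an empty child_list hands back a
 * non-NULL pointer that is really the list head reinterpreted as a node.
 */
static bool first_entry_of_empty_list_aliases_head(void)
{
	struct fake_pci_dn parent = { 0 };
	struct fake_pci_dn *first;

	init_list_head(&parent.child_list);
	assert(list_empty(&parent.child_list));

	/* An empty head points at itself, so "first" is derived from the head. */
	first = list_first_entry(&parent.child_list, struct fake_pci_dn, list);

	return first != NULL && (void *)first == (void *)&parent.child_list;
}
```

In this sketch `list` sits at offset 0, so the bogus entry coincides with the
head. In general the result is the head address minus offsetof(type, member),
and any field read through it (such as ->phb) is whatever memory happens to
be there.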

This patch fixes the below kdump kernel crash by skipping the empty slot:

[    0.003655] audit: initializing netlink subsys (disabled)
[    0.003765] thermal_sys: Registered thermal governor 'fair_share'
[    0.003767] thermal_sys: Registered thermal governor 'step_wise'
[    0.003783] cpuidle: using governor menu
[    0.003977] pstore: Registered nvram as persistent store backend
[    0.004590] Issue PHB reset ...
[    0.004794] audit: type=2000 audit(1631267818.000:1): state=initialized audit_enabled=0 res=1
[    2.233957] BUG: Kernel NULL pointer dereference on read at 0x00000268
[    2.233966] Faulting instruction address: 0xc000000008101fb0
[    2.233972] Oops: Kernel access of bad area, sig: 7 [#1]
[    2.233977] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[    2.233984] Modules linked in:
[    2.233989] CPU: 7 PID: 1 Comm: swapper/7 Not tainted 5.14.0 #1
[    2.233996] NIP:  c000000008101fb0 LR: c000000009284ccc CTR: c000000008029d70
[    2.234003] REGS: c00000001161b840 TRAP: 0300   Not tainted  (5.14.0)
[    2.234008] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 28000224  XER: 20040002
[    2.234022] CFAR: c000000008101f0c DAR: 0000000000000268 DSISR: 00080000 IRQMASK: 0
[    2.234022] GPR00: c000000009284ccc c00000001161bae0 c000000009c6d800 000000000000004d
[    2.234022] GPR04: 0000000000000004 0000000000000002 c00000001161bb4c 0000000000000000
[    2.234022] GPR08: 0000000000000000 0000000000000000 0000000000000001 c000000008e59a80
[    2.234022] GPR12: c000000008029d70 c000000009ff0400 c00000000801285c 0000000000000000
[    2.234022] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    2.234022] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    2.234022] GPR24: c00000000926338c c000000009248860 c0000000092f1048 c000000011079c00
[    2.234022] GPR28: c000000009785af8 c000000009d4b920 0000000000000000 0000000000000000
[    2.234091] NIP [c000000008101fb0] pseries_eeh_get_pe_config_addr+0x100/0x1b0
[    2.234100] LR [c000000009284ccc] __machine_initcall_pseries_eeh_pseries_init+0x2cc/0x350
[    2.234108] Call Trace:
[    2.234111] [c00000001161bae0] [c00000001161bb80] 0xc00000001161bb80 (unreliable)
[    2.234120] [c00000001161bb80] [c000000009284ccc] __machine_initcall_pseries_eeh_pseries_init+0x2cc/0x350
[    2.234128] [c00000001161bc00] [c000000008012210] do_one_initcall+0x60/0x2d0
[    2.234136] [c00000001161bcd0] [c000000009264990] kernel_init_freeable+0x350/0x3f8
[    2.234145] [c00000001161bda0] [c000000008012890] kernel_init+0x3c/0x17c
[    2.234151] [c00000001161be10] [c00000000800cdd4] ret_from_kernel_thread+0x5c/0x64
[    2.234159] Instruction dump:
[    2.234163] eba1ffe8 ebc1fff0 ebe1fff8 4e800020 7c0802a6 7ce33b78 39400001 7fe7fb78
[    2.234174] 38a00002 38800004 38c1006c f80100b0 <e91e0268> 79090020 79080022 4bf48edd
[    2.234187] ---[ end trace bee3ba4dca6761d3 ]---
[    2.235907]
[    3.235914] Kernel panic - not syncing: Fatal exception

Fixes: 5a090f7c363fd ("powerpc/pseries: PCIE PHB reset")
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/eeh_pseries.c |    4 ++++
 1 file changed, 4 insertions(+)

Comments

Michael Ellerman Oct. 8, 2021, 1:23 p.m. UTC | #1
On Mon, 20 Sep 2021 22:03:26 +0530, Mahesh Salgaonkar wrote:
> On pseries lpar when an empty slot is assigned to partition OR on single
> lpar mode, kdump kernel crashes during issuing PHB reset. In the kdump
> scenario, we traverse all PHBs and issue reset using the pe_config_addr of
> first child device present under each PHB. However the code assumes that
> none of the PHB slot can be empty and uses list_first_entry() to get first
> child device under PHB. Since list_first_entry() expect list to be not
> empty, it returns invalid pci_dn entry and ends up accessing NULL phb
> pointer under pci_dn->phb causing kdump kernel crash.
> 
> [...]

Applied to powerpc/fixes.

[1/1] pseries/eeh: fix the kdump kernel crash during eeh_pseries_init
      https://git.kernel.org/powerpc/c/eb8257a12192f43ffd41bd90932c39dade958042

cheers
Patch

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index bc15200852b7c..8780e7d33a0f5 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -867,6 +867,10 @@  static int __init eeh_pseries_init(void)
 	if (is_kdump_kernel() || reset_devices) {
 		pr_info("Issue PHB reset ...\n");
 		list_for_each_entry(phb, &hose_list, list_node) {
+			/* Skip the empty slot */
+			if (list_empty(&PCI_DN(phb->dn)->child_list))
+				continue;
+
 			pdn = list_first_entry(&PCI_DN(phb->dn)->child_list, struct pci_dn, list);
 			config_addr = pseries_eeh_get_pe_config_addr(pdn);
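
The guard added by the hunk above can be modelled outside the kernel as
follows. This is a sketch under stated assumptions, not the kernel code:
struct fake_dn, demo_reset_walk() and the config_addr values are hypothetical,
and the list helpers are simplified userspace copies of those in
<linux/list.h>. Three "PHBs" are walked, the middle one has no children, and
only the two populated ones are counted as reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace copies of the kernel list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical device node; stands in for struct pci_dn. */
struct fake_dn {
	struct list_head list;       /* link in the parent's child_list */
	struct list_head child_list; /* children of this node */
	int config_addr;             /* made-up value for illustration */
};

/* Walk three fake PHBs and count how many would be reset. */
static int demo_reset_walk(void)
{
	struct fake_dn phb[3] = { 0 }, child[2] = { 0 };
	int i, resets = 0;

	for (i = 0; i < 3; i++)
		init_list_head(&phb[i].child_list);

	child[0].config_addr = 0x10;
	child[1].config_addr = 0x20;
	list_add_tail(&child[0].list, &phb[0].child_list);
	list_add_tail(&child[1].list, &phb[2].child_list); /* phb[1] stays empty */

	for (i = 0; i < 3; i++) {
		struct fake_dn *pdn;

		/* Skip the empty slot, mirroring the check in the patch. */
		if (list_empty(&phb[i].child_list))
			continue;

		pdn = list_first_entry(&phb[i].child_list, struct fake_dn, list);
		assert(pdn->config_addr != 0); /* a real child, not the head */
		resets++;
	}
	return resets;
}
```

With this setup demo_reset_walk() returns 2. Without the list_empty() check,
the iteration for phb[1] would treat its own list head as a child node.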