opal/cpu: Mark the core as bad while disabling threads of the core.

Message ID 150783587481.992.16332072755377718653.stgit@jupiter.in.ibm.com
State Accepted

Commit Message

Mahesh Jagannath Salgaonkar Oct. 12, 2017, 7:18 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

If any core fails to sync its timebase (TB) during chipTOD initialization,
all threads of that core are disabled. However, this does not make the
Linux kernel ignore the core/CPUs. The kernel crashes while bringing them
up, with the backtrace below:

[   38.883898] kexec_core: Starting new kernel
cpu 0x0: Vector: 300 (Data Access) at [c0000003f277b730]
    pc: c0000000001b9890: internal_create_group+0x30/0x304
    lr: c0000000001b9880: internal_create_group+0x20/0x304
    sp: c0000003f277b9b0
   msr: 900000000280b033
   dar: 40
 dsisr: 40000000
  current = 0xc0000003f9f41000
  paca    = 0xc00000000fe00000	 softe: 0	 irq_happened: 0x01
    pid   = 2572, comm = kexec
Linux version 4.13.2-openpower1 (jenkins@p89) (gcc version 6.4.0 (Buildroot 2017.08-00006-g319c6e1)) #1 SMP Wed Sep 20 05:42:11 UTC 2017
enter ? for help
[c0000003f277b9b0] c0000000008a8780 (unreliable)
[c0000003f277ba50] c00000000041c3ac topology_add_dev+0x2c/0x40
[c0000003f277ba70] c00000000006b078 cpuhp_invoke_callback+0x88/0x170
[c0000003f277bac0] c00000000006b22c cpuhp_up_callbacks+0x54/0xb8
[c0000003f277bb10] c00000000006bc68 cpu_up+0x11c/0x168
[c0000003f277bbc0] c00000000002f0e0 default_machine_kexec+0x1fc/0x274
[c0000003f277bc50] c00000000002e2d8 machine_kexec+0x50/0x58
[c0000003f277bc70] c0000000000de4e8 kernel_kexec+0x98/0xb4
[c0000003f277bce0] c00000000008b0f0 SyS_reboot+0x1c8/0x1f4
[c0000003f277be30] c00000000000b118 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00007fff7f775074
SP (7fffe6c7bf10) is in userspace
0:mon>

This patch fixes the issue by setting the "status" property of the core's
device tree node to "bad".

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 core/cpu.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Stewart Smith Oct. 16, 2017, 8:10 a.m. | #1

So, this is certainly an improvement over the current situation,
although we should perhaps think about a centralized way to do things
like this when we discover during boot that a CPU/core shouldn't be
used.

Merged to master as of 5b1c330fd0b08d1244d61e7ef85be9475eef9796

Patch

diff --git a/core/cpu.c b/core/cpu.c
index 78565b5..be0e451 100644
--- a/core/cpu.c
+++ b/core/cpu.c
@@ -766,14 +766,24 @@  void cpu_remove_node(const struct cpu_thread *t)
 void cpu_disable_all_threads(struct cpu_thread *cpu)
 {
 	unsigned int i;
+	struct dt_property *p;
 
 	for (i = 0; i <= cpu_max_pir; i++) {
 		struct cpu_thread *t = &cpu_stacks[i].cpu;
 
 		if (t->primary == cpu->primary)
 			t->state = cpu_state_disabled;
+
 	}
 
+	/* Mark this core as bad so that the Linux kernel doesn't use this CPU. */
+	prlog(PR_DEBUG, "CPU: Mark CPU bad (PIR 0x%04x)...\n", cpu->pir);
+	p = __dt_find_property(cpu->node, "status");
+	if (p)
+		dt_del_property(cpu->node, p);
+
+	dt_add_property_string(cpu->node, "status", "bad");
+
 	/* XXX Do something to actually stop the core */
 }
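
The delete-then-add pattern in the hunk above can be modelled outside skiboot. The following is a minimal, self-contained sketch, not skiboot code: `struct node`, `struct prop`, `find_prop`, `del_prop`, `add_prop_string`, `mark_node_bad` and `node_is_available` are all hypothetical stand-ins invented for illustration. The availability check assumes the common device tree convention that a node is usable when "status" is absent or set to "okay"/"ok"; whether a given kernel path applies exactly this rule to CPU nodes is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROPS 8

/* Hypothetical toy model of a device tree node: a flat list of
 * string properties. This is NOT skiboot's struct dt_node. */
struct prop {
	const char *name;
	const char *value;
};

struct node {
	struct prop props[MAX_PROPS];
	int nprops;
};

/* Return the property called `name`, or NULL if the node has none. */
static struct prop *find_prop(struct node *n, const char *name)
{
	for (int i = 0; i < n->nprops; i++)
		if (strcmp(n->props[i].name, name) == 0)
			return &n->props[i];
	return NULL;
}

/* Remove property `p` by moving the last property into its slot. */
static void del_prop(struct node *n, struct prop *p)
{
	*p = n->props[--n->nprops];
}

/* Append a string property; returns -1 if the node is full. */
static int add_prop_string(struct node *n, const char *name,
			   const char *value)
{
	if (n->nprops >= MAX_PROPS)
		return -1;
	n->props[n->nprops].name = name;
	n->props[n->nprops].value = value;
	n->nprops++;
	return 0;
}

/* Same shape as the patch: drop any existing "status" property,
 * then add a fresh one with the value "bad". */
static void mark_node_bad(struct node *n)
{
	struct prop *p;

	assert(n != NULL);
	p = find_prop(n, "status");
	if (p)
		del_prop(n, p);
	add_prop_string(n, "status", "bad");
}

/* Assumed consumer-side rule: a node is available when it has no
 * "status" property, or when the value is "okay" or "ok". */
static int node_is_available(struct node *n)
{
	struct prop *p = find_prop(n, "status");

	if (!p)
		return 1;
	return strcmp(p->value, "okay") == 0 || strcmp(p->value, "ok") == 0;
}
```

Deleting the old property before adding the new one keeps a single "status" entry on the node, so a consumer never sees both a stale "okay" and the new "bad" value.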