pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Message ID 1544095908-2414-1-git-send-email-ego@linux.vnet.ibm.com
State New
Series
  • pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

Checks

Context Check Description
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 26 lines checked
snowpatch_ozlabs/build-pmac32 success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64le success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied

Commit Message

Gautham R Shenoy Dec. 6, 2018, 11:31 a.m.
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently running DLPAR offline/online operations in a loop on a
POWER9 system with SMT=off results in the following crash:

[  223.321032] cpu 112 (hwid 112) Ready to die...
[  223.355963] Querying DEAD? cpu 113 (113) shows 2
[  223.356233] cpu 114 (hwid 114) Ready to die...
[  223.356236] cpu 113 (hwid 113) Ready to die...
[  223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
[  223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
[  223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
[  223.356247] Modules linked in:
[  223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
[  223.356258] NIP:  000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
[  223.356259] REGS: c00000001e5cbd30 TRAP: 0700   Not tainted  (4.20.0-rc3)
[  223.356260] MSR:  8000000000081000 <SF,ME>  CR: 28000004  XER: 00000020
[  223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
               GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
               GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
               GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
               GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
               GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
               GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
               GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
               GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
[  223.356281] NIP [000000001faf6d50] 0x1faf6d50
[  223.356281] LR [000000001faf6d50] 0x1faf6d50
[  223.356282] Call Trace:
[  223.356283] Instruction dump:
[  223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[  223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[  223.356290] ---[ end trace f46a4e046b564d1f ]---

This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
concurrently (within 3us) trying to make the rtas-call with the
"stop-self" token, something that is prohibited by the PAPR.

The concurrent calls can happen as follows.

  o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
    SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
    the target CPU X has been stopped in RTAS. After 25 tries, we
    print the message "Querying DEAD? cpu X (X) shows 2" and return
    to dlpar_offline_cpu(). Note that at this point CPU X has not yet
    called rtas with the "stop-self" token, but can do so anytime now.

  o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
    promptly wakes up and calls RTAS with the "stop-self" token.

  o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
    "stop-self" token.

The problem occurs because the small number of tries (25) in
pseries_cpu_die() covers only 90% of the cases. For the remaining 10%
of the cases, it was observed that we would need to loop for a few
hundred iterations before the target CPU calls rtas with the
"stop-self" token. Experiments show that each try takes roughly ~25us.

In this patch we fix the problem by increasing the number of tries
from 25 to 4000 (which roughly corresponds to 100ms) before bailing
out and declaring that we have failed to observe the target CPU call
rtas-stop-self. This provides sufficient serialization to ensure that
there are no concurrent rtas calls with the "stop-self" token.

Reported-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Thiago Jung Bauermann Dec. 6, 2018, 5:28 p.m. | #1
Hello Gautham,

Gautham R. Shenoy <ego@linux.vnet.ibm.com> writes:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> Currently running DLPAR offline/online operations in a loop on a
> POWER9 system with SMT=off results in the following crash:
>
> [  223.321032] cpu 112 (hwid 112) Ready to die...
> [  223.355963] Querying DEAD? cpu 113 (113) shows 2
> [  223.356233] cpu 114 (hwid 114) Ready to die...
> [  223.356236] cpu 113 (hwid 113) Ready to die...
> [  223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
> [  223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
> [  223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
> [  223.356247] Modules linked in:
> [  223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
> [  223.356258] NIP:  000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
> [  223.356259] REGS: c00000001e5cbd30 TRAP: 0700   Not tainted  (4.20.0-rc3)
> [  223.356260] MSR:  8000000000081000 <SF,ME>  CR: 28000004  XER: 00000020
> [  223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
>                GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
>                GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
>                GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
>                GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
>                GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
>                GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
>                GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
>                GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
> [  223.356281] NIP [000000001faf6d50] 0x1faf6d50
> [  223.356281] LR [000000001faf6d50] 0x1faf6d50
> [  223.356282] Call Trace:
> [  223.356283] Instruction dump:
> [  223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [  223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [  223.356290] ---[ end trace f46a4e046b564d1f ]---
>
> This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
> concurrently (within 3us) trying to make the rtas-call with the
> "stop-self" token, something that is prohibited by the PAPR.
>
> The concurrent calls can happen as follows.
>
>   o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
>     SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
>     the target CPU X has been stopped in RTAS. After 25 tries, we
>     print the message "Querying DEAD? cpu X (X) shows 2" and return
>     to dlpar_offline_cpu(). Note that at this point CPU X has not yet
>     called rtas with the "stop-self" token, but can do so anytime now.
>
>   o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
>     promptly wakes up and calls RTAS with the "stop-self" token.
>
>   o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
>     "stop-self" token.
>
> The problem occurs because the small number of tries (25) in
> pseries_cpu_die() covers only 90% of the cases. For the remaining 10%
> of the cases, it was observed that we would need to loop for a few
> hundred iterations before the target CPU calls rtas with the
> "stop-self" token. Experiments show that each try takes roughly ~25us.
>
> In this patch we fix the problem by increasing the number of tries
> from 25 to 4000 (which roughly corresponds to 100ms) before bailing
> out and declaring that we have failed to observe the target CPU call
> rtas-stop-self. This provides sufficient serialization to ensure that
> there are no concurrent rtas calls with the "stop-self" token.
>
> Reported-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 2f8e621..c913c44 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu)
>  			msleep(1);
>  		}
>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> +		int max_tries = 4000; /* Roughly corresponds to 100ms */
> +		u64 tb_before = mftb();
>
> -		for (tries = 0; tries < 25; tries++) {
> +		for (tries = 0; tries < max_tries; tries++) {
>  			cpu_status = smp_query_cpu_stopped(pcpu);
>  			if (cpu_status == QCSS_STOPPED ||
>  			    cpu_status == QCSS_HARDWARE_ERROR)
>  				break;
>  			cpu_relax();
>  		}
> +
> +		if (tries == max_tries) {
> +			u64 time_elapsed_us = div_u64(mftb() - tb_before,
> +						      tb_ticks_per_usec);
> +
> +			pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
> +				cpu, time_elapsed_us);
> +			WARN_ON(1);
> +		}
>  	}
>
>  	if (cpu_status != 0) {

I posted a similar patch last year, but I wasn't able to arrive at a
root cause analysis like you did:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

One thing I realized after I posted the patch was that in my case, the
CPU was crashing inside RTAS. From the NIP and LR in the trace above it
looks like it's crashing in RTAS in your case as well.

Michael Ellerman had two comments on my patch:

1. Regardless of the underlying bug, the kernel shouldn't crash so we
need a patch making it more resilient to this failure.

2. The wait loop should use udelay() so that the loop will actually take
a set amount of wall time, rather than just cycles.

Regarding 1. if the problem is that the kernel is causing RTAS to crash
because it calls it in a way that's unsupported, then I don't see how we
can make the kernel more resilient. We have to make the kernel respect
RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
gracefully in these conditions).

Regarding 2. I implemented a new version of my patch (posted below) but
I was never able to test it because I couldn't access a system where the
problem was reproducible anymore.

There's also a race between the CPU driving the unplug and the CPU being
unplugged which I think is not easy for the CPU being unplugged to win,
which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
the race in the patch description.

My solution to make the race less tight is to have the CPU driving the
unplug start the busy loop only after the CPU being unplugged is
in the CPU_STATE_OFFLINE state. At that point, we know that it either is
about to call RTAS or it already has.

Do you think this makes sense? If you do, would you mind testing my
patch? You can change the timeouts and delays if you want. To be honest
they're just guesses on my part...



From 4174cd0939fadeda93e49ed2e94ba84131c922d0 Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Date: Thu, 6 Dec 2018 15:16:34 -0200
Subject: [PATCH] powerpc/pseries: Only wait for dying CPU after call to
 rtas_stop_self()

When testing DLPAR CPU add/remove on a system under stress,
pseries_cpu_die() doesn't wait long enough for a CPU to die and the kernel
ends up crashing:

[  446.143648] cpu 152 (hwid 152) Ready to die...
[  446.464057] cpu 153 (hwid 153) Ready to die...
[  446.473525] cpu 154 (hwid 154) Ready to die...
[  446.474077] cpu 155 (hwid 155) Ready to die...
[  446.483529] cpu 156 (hwid 156) Ready to die...
[  446.493532] cpu 157 (hwid 157) Ready to die...
[  446.494078] cpu 158 (hwid 158) Ready to die...
[  446.503527] cpu 159 (hwid 159) Ready to die...
[  446.664534] cpu 144 (hwid 144) Ready to die...
[  446.964113] cpu 145 (hwid 145) Ready to die...
[  446.973525] cpu 146 (hwid 146) Ready to die...
[  446.974094] cpu 147 (hwid 147) Ready to die...
[  446.983944] cpu 148 (hwid 148) Ready to die...
[  446.984062] cpu 149 (hwid 149) Ready to die...
[  446.993518] cpu 150 (hwid 150) Ready to die...
[  446.993543] Querying DEAD? cpu 150 (150) shows 2
[  446.994098] cpu 151 (hwid 151) Ready to die...
[  447.133726] cpu 136 (hwid 136) Ready to die...
[  447.403532] cpu 137 (hwid 137) Ready to die...
[  447.403772] cpu 138 (hwid 138) Ready to die...
[  447.403839] cpu 139 (hwid 139) Ready to die...
[  447.403887] cpu 140 (hwid 140) Ready to die...
[  447.403937] cpu 141 (hwid 141) Ready to die...
[  447.403979] cpu 142 (hwid 142) Ready to die...
[  447.404038] cpu 143 (hwid 143) Ready to die...
[  447.513546] cpu 128 (hwid 128) Ready to die...
[  447.693533] cpu 129 (hwid 129) Ready to die...
[  447.693999] cpu 130 (hwid 130) Ready to die...
[  447.703530] cpu 131 (hwid 131) Ready to die...
[  447.704087] Querying DEAD? cpu 132 (132) shows 2
[  447.704102] cpu 132 (hwid 132) Ready to die...
[  447.713534] cpu 133 (hwid 133) Ready to die...
[  447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue

This was reproduced in v4.12-rc3 as well, but I don't have a crash log
handy for that version right now. Sorry.

This is a race between one CPU stopping and another one calling
pseries_cpu_die() to wait for it to stop. That function does a short busy
loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
that it is stopped.

As can be seen in the dmesg right before or after the "Querying DEAD?"
messages, if pseries_cpu_die() waited a little longer it would have seen
the CPU in the stopped state.

I see two cases that can be causing this race:

1. It's possible that CPU 134 was inactive at the time it was unplugged. In
   that case, dlpar_offline_cpu() calls H_PROD on the CPU and immediately
   calls pseries_cpu_die(). Meanwhile, the prodded CPU activates and starts
   the process of stopping itself. It's possible that the busy loop is not
   long enough to allow for the CPU to wake up and complete the stopping
   process.

2. If CPU 134 was online at the time it was unplugged, it would have gone
   through the new CPU hotplug state machine in kernel/cpu.c that was
   introduced in v4.6 to get itself stopped. It's possible that the busy
   loop in pseries_cpu_die() was long enough for the older hotplug code but
   not for the new hotplug state machine.

Either way, the solution is to start querying whether the CPU is stopped
only after it has had a chance to call rtas_stop_self(). Since
pseries_mach_cpu_die() sets the CPU current state to offline almost
immediately before the RTAS call, we use that as a signal that it is either
already stopped or very close to that point, and we can start the busy
loop.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 2f8e62163602..fad3ceb74623 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,13 +214,21 @@ static void pseries_cpu_die(unsigned int cpu)
 			msleep(1);
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+		/*
+		 * If the current state is not offline yet, it means that the
+		 * dying CPU (which is in pseries_mach_cpu_die) didn't have a
+		 * chance to call rtas_stop_self yet and therefore it's too
+		 * early to query if the CPU is stopped.
+		 */
+		spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
+				   100000, 100);
 
 		for (tries = 0; tries < 25; tries++) {
 			cpu_status = smp_query_cpu_stopped(pcpu);
 			if (cpu_status == QCSS_STOPPED ||
 			    cpu_status == QCSS_HARDWARE_ERROR)
 				break;
-			cpu_relax();
+			udelay(100);
 		}
 	}
Gautham R Shenoy Dec. 7, 2018, 10:43 a.m. | #2
Hi Thiago,

On Thu, Dec 06, 2018 at 03:28:17PM -0200, Thiago Jung Bauermann wrote:

[..snip..]

> 
> 
> I posted a similar patch last year, but I wasn't able to arrive at a
> root cause analysis like you did:
> 
>
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

Ah! Nice. So this is a known problem.

> 
> One thing I realized after I posted the patch was that in my case, the
> CPU was crashing inside RTAS. From the NIP and LR in the trace above it
> looks like it's crashing in RTAS in your case as well.
> 
> Michael Ellerman had two comments on my patch:
> 
> 1. Regardless of the underlying bug, the kernel shouldn't crash so we
> need a patch making it more resilient to this failure.
> 
> 2. The wait loop should use udelay() so that the loop will actually take
> a set amount of wall time, rather than just cycles.
> 
> Regarding 1. if the problem is that the kernel is causing RTAS to crash
> because it calls it in a way that's unsupported, then I don't see how we
> can make the kernel more resilient. We have to make the kernel respect
> RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
> gracefully in these conditions).


I agree that the Kernel has to respect RTAS's restriction. The PAPR
v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:

    "The stop-self service needs to be serialized with calls to the
     stop-self, start-cpu, and set-power-level services. The OS must
     be able to call RTAS services on other processors while the
     processor is stopped or being stopped"

Thus the onus is on the OS to ensure that there are no concurrent rtas
calls with "stop-self" token.

> 
> Regarding 2. I implemented a new version of my patch (posted below) but
> I was never able to test it because I couldn't access a system where the
> problem was reproducible anymore.
> 
> There's also a race between the CPU driving the unplug and the CPU being
> unplugged which I think is not easy for the CPU being unplugged to win,
> which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
> the race in the patch description.
> 
> My solution to make the race less tight is to have the CPU driving the
> unplug start the busy loop only after the CPU being unplugged is
> in the CPU_STATE_OFFLINE state. At that point, we know that it either is
> about to call RTAS or it already has.

Ah, yes this is a good optimization. Though I think we ought to
unconditionally wait until the target CPU has woken up from CEDE and
changed its state to CPU_STATE_OFFLINE. After all, if PROD failed, then we
would have caught it in dlpar_offline_cpu() itself.

> 
> Do you think this makes sense? If you do, would you mind testing my
> patch? You can change the timeouts and delays if you want. To be honest
> they're just guesses on my part...

Sure. I will test the patch and report back.

--
Thanks and Regards
gautham.
Gautham R Shenoy Dec. 7, 2018, 12:03 p.m. | #3
On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
> Hi Thiago,
> 
> 
> Sure. I will test the patch and report back.

I added the following debug patch on top of your patch, and after an
hour's run, the system crashed. Appending the log at the end.

I suppose we still need to increase the number of tries, since we
loop for only 2.5ms before giving up.

----------------------------x8-----------------------------------------
From 9e9e1e459cdc63540a73575af359a7bc0d43e455 Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Date: Fri, 7 Dec 2018 16:47:25 +0530
Subject: [PATCH] Add debug on top of Thiago's patch

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index fad3ceb..2545aab 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,6 +214,9 @@ static void pseries_cpu_die(unsigned int cpu)
 			msleep(1);
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+		u64 tb_before = mftb();
+		u64 tb_after;
+
 		/*
 		 * If the current state is not offline yet, it means that the
 		 * dying CPU (which is in pseries_mach_cpu_die) didn't have a
@@ -230,6 +233,11 @@ static void pseries_cpu_die(unsigned int cpu)
 				break;
 			udelay(100);
 		}
+
+		tb_after = mftb();
+
+		printk("[DEBUG] Waited for CPU %d to enter rtas: tries=%d, time=%llu\n", cpu, tries,
+			div_u64(tb_after - tb_before, tb_ticks_per_usec));
 	}
 
 	if (cpu_status != 0) {
Thiago Jung Bauermann Dec. 8, 2018, 2:40 a.m. | #4
Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>> Sure. I will test the patch and report back.
>
> I added the following debug patch on top of your patch, and after an
> hour's run, the system crashed. Appending the log at the end.

Thank you very much for testing! Your debug patch was very helpful as
well.

> I suppose we still need to increase the number of tries, since we
> loop for only 2.5ms before giving up.

Do you think it would have helped? In the debug output you posted I
would have expected the following message to show up if the loop
finished too early, and it didn't:

"Querying DEAD? cpu %i (%i) shows %i\n"

So I don't think increasing the loop length would have made a
difference. In fact, the call to smp_query_cpu_stopped() always
succeeded on the first iteration.

I think there is something else going on which we don't fully understand
yet. From your other email:

> I agree that the Kernel has to respect RTAS's restriction. The PAPR
> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>
>     "The stop-self service needs to be serialized with calls to the
>      stop-self, start-cpu, and set-power-level services. The OS must
>      be able to call RTAS services on other processors while the
>      processor is stopped or being stopped"
>
> Thus the onus is on the OS to ensure that there are no concurrent rtas
> calls with "stop-self" token.

As you say perhaps there's another call to stop-self, start-cpu or
set-power-level being made concurrently. I don't currently see how more
than one stop-self or start-cpu call could be in flight at the same time
given that there are a number of locks being grabbed during CPU hotplug
and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
grab any locks itself so it's a possibility.

As for set-power-level, it's only used in the case of PCI hotplug from
what I can see, and that isn't part of the picture in this case, right?

We could address this problem directly by adding another lock separate
from rtas.lock to serialize just these calls. The challenge is with
stop-self, because the CPU calling it will never return to release the
lock. Is it possible to grab a lock (or down a semaphore) in the CPU
calling stop-self and then release the lock (or up the semaphore) in the
CPU running pseries_cpu_die()?

> > There's also a race between the CPU driving the unplug and the CPU
> > being unplugged which I think is not easy for the CPU being
> > unplugged to win, which makes the busy loop in pseries_cpu_die() a
> > bit fragile. I describe the race in the patch description.
> >
> > My solution to make the race less tight is to have the CPU driving
> > the unplug start the busy loop only after the CPU being
> > unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
> > that it either is about to call RTAS or it already has.
>
> Ah, yes this is a good optimization. Though I think we ought to
> unconditionally wait until the target CPU has woken up from CEDE and
> changed its state to CPU_STATE_OFFLINE. After all, if PROD failed, then we
> would have caught it in dlpar_offline_cpu() itself.

I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
success when it had been given an invalid memory address to load from,
so my confidence in the error reporting of hcalls is a bit shaken. :-)

In that case the CPU would wait forever for the CPU state to change. If
you believe 100 ms is too short a timeout, we could make it 500 ms or
even 1s. What do you think?

> cpu 112 (hwid 112) Ready to die...
> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
> cpu 113 (hwid 113) Ready to die...
> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
> cpu 114 (hwid 114) Ready to die...
> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
> cpu 115 (hwid 115) Ready to die...
> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
> cpu 116 (hwid 116) Ready to die...
> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
> cpu 117 (hwid 117) Ready to die...
> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
> cpu 118 (hwid 118) Ready to die...
> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
> cpu 119 (hwid 119) Ready to die...
> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
> cpu 104 (hwid 104) Ready to die...
> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40

Interesting, so 1.2 ms can pass before the dying CPU actually gets close
to making the stop-self call. And even in those cases the retry loop is
succeeding on the first try! So this shows that changing the code to
wait for the CPU_STATE_OFFLINE state is worth it.

> ******* RTAS CALL BUFFER CORRUPTION *******
> 393: rtas32_call_buff_ptr=
>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>     0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
>     2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
>     0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
>     4800 0008 .... .... .... .... .... .... [H...........0nur]
> 394: rtas64_map_buff_ptr=
>     0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
>     0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
>     0000 0000 0000 0000 0000 0000 0000 0000 [................]
>     0000 0000 0000 0005 0000 0000 0000 0001 [................]
>     0000 0000 1A00 0000 0000 0000 0000 0000 [................]
>     0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
>     0000 0000 .... .... .... .... .... .... [......c.........]

Ah! I never saw this error message. So indeed the kernel is causing RTAS
to blow up. Perhaps it would be useful to instrument more RTAS calls
(especially start-cpu and set-power-level) to see if it's one of them
that is being called at the time this corruption happens.

> cpu 105 (hwid 105) Ready to die...
> Bad kernel stack pointer 1fafb6c0 at 0
> Oops: Bad kernel stack pointer, sig: 6 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
> NIP:  0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
> REGS: c00000001e63bd30 TRAP: 0700   Not tainted  (4.20.0-rc5-thiago+)
> MSR:  8000000000081000 <SF,ME>  CR: 28000004  XER: 00000010
> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
> NIP [0000000000000000]           (null)
> LR [0000000000000000]           (null)
> Call Trace:
> Instruction dump:
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
> ---[ end trace 1aa3b4936949457e ]---

Ok, so at about the time CPU 105 makes the stop-self call there is this
RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
We need to understand better what is causing this.

> Bad kernel stack pointer 1fafb4b0 at 1ec15004
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> rcu:    88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
> rcu:    (detected by 72, t=10866 jiffies, g=180529, q=2526)
> Sending NMI from CPU 72 to CPUs 88:
> CPU 88 didn't respond to backtrace IPI, inspecting paca.
> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
> Call Trace:
> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70

So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
die. But it is stuck inside RTAS. Perhaps because of the call buffer
corruption?

> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
> rcu: RCU grace-period kthread stack dump:
> rcu_sched       I    0    11      2 0x00000808
> Call Trace:
> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64

I don't know what to make of CPU 72. :-)  Perhaps it's the one making
the other "rogue" RTAS call interfering with stop-self in CPU 105?

It must be some RTAS call made with rtas_call_unlocked, because CPU 88
is holding the RTAS lock.

-- Thiago Jung Bauermann
IBM Linux Technology Center
Michael Bringmann Dec. 10, 2018, 8:16 p.m. | #5
I have asked Scott Mayes to take a look at one of these crashes from
the phyp side.  I will let you know if he finds anything notable.

Michael

On 12/07/2018 08:40 PM, Thiago Jung Bauermann wrote:
> 
> Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
>> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>>> Sure. I will test the patch and report back.
>>
>> I added the following debug patch on top of your patch, and after an
>> hour's run, the system crashed. Appending the log at the end.
> 
> Thank you very much for testing! Your debug patch was very helpful as
> well.
> 
>> I suppose we still need to increase the number of tries, since we
>> loop for only 2.5 ms before giving up.
> 
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
> 
> "Querying DEAD? cpu %i (%i) shows %i\n"
> 
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.
> 
> I think there is something else going on which we don't fully understand
> yet. From your other email:
> 
>> I agree that the Kernel has to respect RTAS's restriction. The PAPR
>> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>>
>>     "The stop-self service needs to be serialized with calls to the
>>      stop-self, start-cpu, and set-power-level services. The OS must
>>      be able to call RTAS services on other processors while the
>>      processor is stopped or being stopped"
>>
>> Thus the onus is on the OS to ensure that there are no concurrent rtas
>> calls with "stop-self" token.
> 
> As you say perhaps there's another call to stop-self, start-cpu or
> set-power-level being made concurrently. I don't currently see how more
> than one stop-self or start-cpu call could be in flight at the same time
> given that there are a number of locks being grabbed during CPU hotplug
> and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
> grab any locks itself so it's a possibility.
> 
> As for set-power-level, it's only used in the case of PCI hotplug from
> what I can see, and that isn't part of the picture in this case, right?
> 
> We could address this problem directly by adding another lock separate
> from rtas.lock to serialize just these calls. The challenge is with
> stop-self, because the CPU calling it will never return to release the
> lock. Is it possible to grab a lock (or down a semaphore) in the CPU
> calling stop-self and then release the lock (or up the semaphore) in the
> CPU running pseries_cpu_die()?
> 
>>> There's also a race between the CPU driving the unplug and the CPU
>>> being unplugged which I think is not easy for the CPU being
>>> unplugged to win, which makes the busy loop in pseries_cpu_die() a
>>> bit fragile. I describe the race in the patch description.
>>>
>>> My solution to make the race less tight is to make the CPU driving
>>> the unplug to only start the busy loop only after the CPU being
>>> unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
>>> that it either is about to call RTAS or it already has.
>>
>> Ah, yes, this is a good optimization. Though I think we ought to
>> unconditionally wait until the target CPU has woken up from CEDE and
>> changed its state to CPU_STATE_OFFLINE. After all, if the PROD had
>> failed, we would have caught it in dlpar_offline_cpu() itself.
> 
> I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
> success when it had been given an invalid memory address to load from,
> so my confidence in the error reporting of hcalls is a bit shaken. :-)
> 
> In that case the CPU would wait forever for the CPU state to change. If
> you believe 100 ms is too short a timeout, we could make it 500 ms or
> even 1s. What do you think?
> 
>> cpu 112 (hwid 112) Ready to die...
>> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
>> cpu 113 (hwid 113) Ready to die...
>> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
>> cpu 114 (hwid 114) Ready to die...
>> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
>> cpu 115 (hwid 115) Ready to die...
>> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
>> cpu 116 (hwid 116) Ready to die...
>> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
>> cpu 117 (hwid 117) Ready to die...
>> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
>> cpu 118 (hwid 118) Ready to die...
>> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
>> cpu 119 (hwid 119) Ready to die...
>> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
>> cpu 104 (hwid 104) Ready to die...
>> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40
> 
> Interesting, so 1.2 ms can pass before the dying CPU actually gets close
> to making the stop-self call. And even in those cases the retry loop is
> succeeding on the first try! So this shows that changing the code to
> wait for the CPU_STATE_OFFLINE state is worth it.
> 
>> ******* RTAS CALL BUFFER CORRUPTION *******
>> 393: rtas32_call_buff_ptr=
>>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>>     0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>>     0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
>>     2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
>>     0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
>>     4800 0008 .... .... .... .... .... .... [H...........0nur]
>> 394: rtas64_map_buff_ptr=
>>     0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
>>     0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
>>     0000 0000 0000 0000 0000 0000 0000 0000 [................]
>>     0000 0000 0000 0005 0000 0000 0000 0001 [................]
>>     0000 0000 1A00 0000 0000 0000 0000 0000 [................]
>>     0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
>>     0000 0000 .... .... .... .... .... .... [......c.........]
> 
> Ah! I never saw this error message. So indeed the kernel is causing RTAS
> to blow up. Perhaps it would be useful to instrument more RTAS calls
> (especially start-cpu and set-power-level) to see if it's one of them
> that is being called at the time this corruption happens.
> 
>> cpu 105 (hwid 105) Ready to die...
>> Bad kernel stack pointer 1fafb6c0 at 0
>> Oops: Bad kernel stack pointer, sig: 6 [#1]
>> LE SMP NR_CPUS=2048 NUMA pSeries
>> Modules linked in:
>> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
>> NIP:  0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
>> REGS: c00000001e63bd30 TRAP: 0700   Not tainted  (4.20.0-rc5-thiago+)
>> MSR:  8000000000081000 <SF,ME>  CR: 28000004  XER: 00000010
>> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
>> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
>> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
>> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
>> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
>> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
>> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
>> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
>> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
>> NIP [0000000000000000]           (null)
>> LR [0000000000000000]           (null)
>> Call Trace:
>> Instruction dump:
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
>> ---[ end trace 1aa3b4936949457e ]---
> 
> Ok, so at about the time CPU 105 makes the stop-self call there is this
> RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
> We need to understand better what is causing this.
> 
>> Bad kernel stack pointer 1fafb4b0 at 1ec15004
>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> rcu:    88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
>> rcu:    (detected by 72, t=10866 jiffies, g=180529, q=2526)
>> Sending NMI from CPU 72 to CPUs 88:
>> CPU 88 didn't respond to backtrace IPI, inspecting paca.
>> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
>> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
>> Call Trace:
>> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
>> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
>> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
>> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
>> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
>> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
>> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
>> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
>> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
>> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
>> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
>> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
>> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
>> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
>> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
>> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
>> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70
> 
> So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
> die. But it is stuck inside RTAS. Perhaps because of the call buffer
> corruption?
> 
>> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
>> rcu: RCU grace-period kthread stack dump:
>> rcu_sched       I    0    11      2 0x00000808
>> Call Trace:
>> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
>> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
>> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
>> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
>> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
>> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
>> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
>> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64
> 
> I don't know what to make of CPU 72. :-)  Perhaps it's the one making
> the other "rogue" RTAS call interfering with stop-self in CPU 105?
> 
> It must be some RTAS call made with rtas_call_unlocked, because CPU 88
> is holding the RTAS lock.
> 
> -- Thiago Jung Bauermann
> IBM Linux Technology Center
>
Thiago Jung Bauermann Dec. 10, 2018, 8:31 p.m. | #6
Hello Michael,

Michael Bringmann <mwb@linux.vnet.ibm.com> writes:

> I have asked Scott Mayes to take a look at one of these crashes from
> the phyp side.  I will let you know if he finds anything notable.

Thanks! It might make sense to test whether booting with
cede_offline=off makes the bug go away.

One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
From what I understand, it is a powerpc-specific CPU state and from the
perspective of the generic CPU hotplug state machine, inactive CPUs are
already fully offline. Which means that the locking performed by the
generic code state machine doesn't apply to transitioning CPUs from
INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
one CPU making that transition at the same time? That would cause two
CPUs to call RTAS stop-self.

I haven't checked whether this is really possible or not, though. It's
just a conjecture.

--
Thiago Jung Bauermann
IBM Linux Technology Center
Michael Bringmann Dec. 11, 2018, 10:11 p.m. | #7
Note from Scott Mayes on latest crash:

Michael,

Since the partition crashed, I was able to get the last 0.2 seconds of RTAS call trace leading up to the crash.

Best I could tell from that bit of trace was that the removal of a processor involved the following steps:
-- Call to stop-self for a given thread
-- Repeated calls to query-cpu-stopped-state (which eventually indicated the thread was stopped)
-- Call to get-sensor-state for the thread to check its entity-state (9003) sensor which returned 'dr-entity-present'
-- Call to set-indicator to set the isolation-state (9001) indicator to ISOLATE state
-- Call to set-indicator to set the allocation-state (9003) indicator to UNUSABLE state

I noticed one example of thread x28 getting through all of these steps just fine, but for thread x20, although the
query-cpu-stopped-state returned 0 status (STOPPED), a subsequent call to set-indicator to ISOLATE
failed.  This failure was near the end of the trace, but was not the very last RTAS call made in the trace.
The set-indicator failure reported to Linux was a -9001 (Valid outstanding translation) which was mapped
from a 0x502 (Invalid thread state) return code from PHYP's H_SET_DR_STATE h-call.

On 12/10/2018 02:31 PM, Thiago Jung Bauermann wrote:
> 
> Hello Michael,
> 
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> 
>> I have asked Scott Mayes to take a look at one of these crashes from
>> the phyp side.  I will let you know if he finds anything notable.
> 
> Thanks! It might make sense to test whether booting with
> cede_offline=off makes the bug go away.

Scott is looking at the system.  I will try once he is finished.

> 
> One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
> From what I understand, it is a powerpc-specific CPU state and from the
> perspective of the generic CPU hotplug state machine, inactive CPUs are
> already fully offline. Which means that the locking performed by the
> generic code state machine doesn't apply to transitioning CPUs from
> INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
> one CPU making that transition at the same time? That would cause two
> CPUs to call RTAS stop-self.
> 
> I haven't checked whether this is really possible or not, though. It's
> just a conjecture.

Michael

> 
> --
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
>

Patch

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 2f8e621..c913c44 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,14 +214,25 @@  static void pseries_cpu_die(unsigned int cpu)
 			msleep(1);
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+		int max_tries = 4000; /* Roughly corresponds to 100ms */
+		u64 tb_before = mftb();
 
-		for (tries = 0; tries < 25; tries++) {
+		for (tries = 0; tries < max_tries; tries++) {
 			cpu_status = smp_query_cpu_stopped(pcpu);
 			if (cpu_status == QCSS_STOPPED ||
 			    cpu_status == QCSS_HARDWARE_ERROR)
 				break;
 			cpu_relax();
 		}
+
+		if (tries == max_tries) {
+			u64 time_elapsed_us = div_u64(mftb() - tb_before,
+						      tb_ticks_per_usec);
+
+			pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
+				cpu, time_elapsed_us);
+			WARN_ON(1);
+		}
 	}
 
 	if (cpu_status != 0) {