[v3] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()
diff mbox series

Message ID 20190311193517.23756-1-bauerman@linux.ibm.com
State New
Headers show
Series
  • [v3] powerpc/pseries: Only wait for dying CPU after call to rtas_stop_self()
Related show

Checks

Context Check Description
snowpatch_ozlabs/checkpatch warning total: 0 errors, 1 warnings, 1 checks, 24 lines checked
snowpatch_ozlabs/build-pmac32 success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64le success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (9580b71b5a7863c24a9bd18bcd2ad759b86b1eff)

Commit Message

Thiago Jung Bauermann March 11, 2019, 7:35 p.m. UTC
When testing DLPAR CPU add/remove on a system under stress,
pseries_cpu_die() doesn't wait long enough for a CPU to die:

[  446.983944] cpu 148 (hwid 148) Ready to die...
[  446.984062] cpu 149 (hwid 149) Ready to die...
[  446.993518] cpu 150 (hwid 150) Ready to die...
[  446.993543] Querying DEAD? cpu 150 (150) shows 2
[  446.994098] cpu 151 (hwid 151) Ready to die...
[  447.133726] cpu 136 (hwid 136) Ready to die...
[  447.403532] cpu 137 (hwid 137) Ready to die...
[  447.403772] cpu 138 (hwid 138) Ready to die...
[  447.403839] cpu 139 (hwid 139) Ready to die...
[  447.403887] cpu 140 (hwid 140) Ready to die...
[  447.403937] cpu 141 (hwid 141) Ready to die...
[  447.403979] cpu 142 (hwid 142) Ready to die...
[  447.404038] cpu 143 (hwid 143) Ready to die...
[  447.513546] cpu 128 (hwid 128) Ready to die...
[  447.693533] cpu 129 (hwid 129) Ready to die...
[  447.693999] cpu 130 (hwid 130) Ready to die...
[  447.703530] cpu 131 (hwid 131) Ready to die...
[  447.704087] Querying DEAD? cpu 132 (132) shows 2
[  447.704102] cpu 132 (hwid 132) Ready to die...
[  447.713534] cpu 133 (hwid 133) Ready to die...
[  447.714064] Querying DEAD? cpu 134 (134) shows 2

This is a race between one CPU stopping and another one calling
pseries_cpu_die() to wait for it to stop. That function does a short busy
loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
that it is stopped, but I think there's a lot for the stopping CPU to do
which may take longer than this loop allows.

As can be seen in the dmesg right before or after the "Querying DEAD?"
messages, if pseries_cpu_die() waited a little longer it would have seen
the CPU in the stopped state.

What I think is going on is that CPU 134 was inactive at the time it was
unplugged. In that case, dlpar_offline_cpu() calls H_PROD on that CPU and
immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates
and start the process of stopping itself. The busy loop is not long enough
to allow for the CPU to wake up and complete the stopping process.

This can be a problem because if the busy loop finishes too early, then the
kernel may offline another CPU before the previous one finished dying,
which would lead to two concurrent calls to rtas-stop-self, which is
prohibited by the PAPR.

We can make the race a lot more even if we only start querying if the CPU
is stopped when the stopping CPU is close to call rtas_stop_self(). Since
pseries_mach_cpu_die() sets the CPU current state to offline almost
immediately before calling rtas_stop_self(), we use that as a signal that
it is either already stopped or very close to that point, and we can start
the busy loop.

As suggested by Michael Ellerman, this patch also changes the busy loop to
wait for a fixed amount of wall time. Based on the measurements that
Gautham did on a POWER9 system, in successful cases of
smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent
inside the loop was was 10 ms. This patch loops for 20 ms just be sure.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Analyzed-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

I have seen this problem since v4.8. Should this patch go to stable as
well?

Changes since v2:
- Increaded busy loop to 200 iterations so that it can last up to 20 ms
  (suggested by Gautham).
- Changed commit message to include Gautham's remarks.

Comments

Gautham R Shenoy March 12, 2019, 9:32 a.m. UTC | #1
Hello Thiago,

On Mon, Mar 11, 2019 at 04:35:17PM -0300, Thiago Jung Bauermann wrote:
> When testing DLPAR CPU add/remove on a system under stress,
> pseries_cpu_die() doesn't wait long enough for a CPU to die:
> 
> [  446.983944] cpu 148 (hwid 148) Ready to die...
> [  446.984062] cpu 149 (hwid 149) Ready to die...
> [  446.993518] cpu 150 (hwid 150) Ready to die...
> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
> [  446.994098] cpu 151 (hwid 151) Ready to die...
> [  447.133726] cpu 136 (hwid 136) Ready to die...
> [  447.403532] cpu 137 (hwid 137) Ready to die...
> [  447.403772] cpu 138 (hwid 138) Ready to die...
> [  447.403839] cpu 139 (hwid 139) Ready to die...
> [  447.403887] cpu 140 (hwid 140) Ready to die...
> [  447.403937] cpu 141 (hwid 141) Ready to die...
> [  447.403979] cpu 142 (hwid 142) Ready to die...
> [  447.404038] cpu 143 (hwid 143) Ready to die...
> [  447.513546] cpu 128 (hwid 128) Ready to die...
> [  447.693533] cpu 129 (hwid 129) Ready to die...
> [  447.693999] cpu 130 (hwid 130) Ready to die...
> [  447.703530] cpu 131 (hwid 131) Ready to die...
> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
> [  447.704102] cpu 132 (hwid 132) Ready to die...
> [  447.713534] cpu 133 (hwid 133) Ready to die...
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> 
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die() to wait for it to stop. That function does a short busy
> loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
> that it is stopped, but I think there's a lot for the stopping CPU to do
> which may take longer than this loop allows.
> 
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die() waited a little longer it would have seen
> the CPU in the stopped state.
> 
> What I think is going on is that CPU 134 was inactive at the time it was
> unplugged. In that case, dlpar_offline_cpu() calls H_PROD on that CPU and
> immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates
> and start the process of stopping itself. The busy loop is not long enough
> to allow for the CPU to wake up and complete the stopping process.
> 
> This can be a problem because if the busy loop finishes too early, then the
> kernel may offline another CPU before the previous one finished dying,
> which would lead to two concurrent calls to rtas-stop-self, which is
> prohibited by the PAPR.
> 
> We can make the race a lot more even if we only start querying if the CPU
> is stopped when the stopping CPU is close to call rtas_stop_self(). Since
> pseries_mach_cpu_die() sets the CPU current state to offline almost
> immediately before calling rtas_stop_self(), we use that as a signal that
> it is either already stopped or very close to that point, and we can start
> the busy loop.
> 
> As suggested by Michael Ellerman, this patch also changes the busy loop to
> wait for a fixed amount of wall time. Based on the measurements that
> Gautham did on a POWER9 system, in successful cases of
> smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent
> inside the loop was was 10 ms. This patch loops for 20 ms just be sure.
> 
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>

Thanks for this version. I have tested the patch and we no longer see
the "Querying DEAD? cpu X (Y) shows 2" message.


Tested-and-Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


--
Thanks and Regards
gautham.
Thiago Jung Bauermann March 12, 2019, 4:30 p.m. UTC | #2
Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:

>> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
>
> Thanks for this version. I have tested the patch and we no longer see
> the "Querying DEAD? cpu X (Y) shows 2" message.
>
>
> Tested-and-Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Thanks for reviewing and testing the patch!
Thiago Jung Bauermann April 10, 2019, 11:08 p.m. UTC | #3
Hello,

Ping?
Nicholas Piggin April 16, 2019, 4:36 a.m. UTC | #4
Thiago Jung Bauermann's on April 11, 2019 9:08 am:
> 
> Hello,
> 
> Ping?
> 
> -- 
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
> 
> Thiago Jung Bauermann <bauerman@linux.ibm.com> writes:
> 
>> When testing DLPAR CPU add/remove on a system under stress,
>> pseries_cpu_die() doesn't wait long enough for a CPU to die:
>>
>> [  446.983944] cpu 148 (hwid 148) Ready to die...
>> [  446.984062] cpu 149 (hwid 149) Ready to die...
>> [  446.993518] cpu 150 (hwid 150) Ready to die...
>> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
>> [  446.994098] cpu 151 (hwid 151) Ready to die...
>> [  447.133726] cpu 136 (hwid 136) Ready to die...
>> [  447.403532] cpu 137 (hwid 137) Ready to die...
>> [  447.403772] cpu 138 (hwid 138) Ready to die...
>> [  447.403839] cpu 139 (hwid 139) Ready to die...
>> [  447.403887] cpu 140 (hwid 140) Ready to die...
>> [  447.403937] cpu 141 (hwid 141) Ready to die...
>> [  447.403979] cpu 142 (hwid 142) Ready to die...
>> [  447.404038] cpu 143 (hwid 143) Ready to die...
>> [  447.513546] cpu 128 (hwid 128) Ready to die...
>> [  447.693533] cpu 129 (hwid 129) Ready to die...
>> [  447.693999] cpu 130 (hwid 130) Ready to die...
>> [  447.703530] cpu 131 (hwid 131) Ready to die...
>> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
>> [  447.704102] cpu 132 (hwid 132) Ready to die...
>> [  447.713534] cpu 133 (hwid 133) Ready to die...
>> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
>>
>> This is a race between one CPU stopping and another one calling
>> pseries_cpu_die() to wait for it to stop. That function does a short busy
>> loop calling RTAS query-cpu-stopped-state on the stopping CPU to verify
>> that it is stopped, but I think there's a lot for the stopping CPU to do
>> which may take longer than this loop allows.
>>
>> As can be seen in the dmesg right before or after the "Querying DEAD?"
>> messages, if pseries_cpu_die() waited a little longer it would have seen
>> the CPU in the stopped state.
>>
>> What I think is going on is that CPU 134 was inactive at the time it was
>> unplugged. In that case, dlpar_offline_cpu() calls H_PROD on that CPU and
>> immediately calls pseries_cpu_die(). Meanwhile, the prodded CPU activates
>> and start the process of stopping itself. The busy loop is not long enough
>> to allow for the CPU to wake up and complete the stopping process.
>>
>> This can be a problem because if the busy loop finishes too early, then the
>> kernel may offline another CPU before the previous one finished dying,
>> which would lead to two concurrent calls to rtas-stop-self, which is
>> prohibited by the PAPR.
>>
>> We can make the race a lot more even if we only start querying if the CPU
>> is stopped when the stopping CPU is close to call rtas_stop_self(). Since
>> pseries_mach_cpu_die() sets the CPU current state to offline almost
>> immediately before calling rtas_stop_self(), we use that as a signal that
>> it is either already stopped or very close to that point, and we can start
>> the busy loop.
>>
>> As suggested by Michael Ellerman, this patch also changes the busy loop to
>> wait for a fixed amount of wall time. Based on the measurements that
>> Gautham did on a POWER9 system, in successful cases of
>> smp_query_cpu_stopped(cpu) returning affirmative, the maximum time spent
>> inside the loop was was 10 ms. This patch loops for 20 ms just be sure.
>>
>> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
>> Analyzed-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> I have seen this problem since v4.8. Should this patch go to stable as
>> well?
>>
>> Changes since v2:
>> - Increaded busy loop to 200 iterations so that it can last up to 20 ms
>>   (suggested by Gautham).
>> - Changed commit message to include Gautham's remarks.
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> index 97feb6e79f1a..ac6dc35ab829 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -214,13 +214,22 @@ static void pseries_cpu_die(unsigned int cpu)
>>  			msleep(1);
>>  		}
>>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>> +		/*
>> +		 * If the current state is not offline yet, it means that the
>> +		 * dying CPU (which is either in pseries_mach_cpu_die() or in
>> +		 * the process of getting there) didn't have a chance yet to
>> +		 * call rtas_stop_self() and therefore it's too early to query
>> +		 * if the CPU is stopped.
>> +		 */
>> +		spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
>> +				   100000, 100);

If the CPU state does not go to offline here, you should give up and
return online, right? Otherwise I think query-cpu-stopped-state can
get confused by CPUs in idle and you get a false positive.

That race can still happen, we would really need a sequence count check
over current CPU state to ensure we got a race-free qcss value, but at
least a check here should make the race implausible to hit.

Thanks,
Nick

>>  
>> -		for (tries = 0; tries < 25; tries++) {
>> +		for (tries = 0; tries < 200; tries++) {
>>  			cpu_status = smp_query_cpu_stopped(pcpu);
>>  			if (cpu_status == QCSS_STOPPED ||
>>  			    cpu_status == QCSS_HARDWARE_ERROR)
>>  				break;
>> -			cpu_relax();
>> +			udelay(100);
>>  		}
>>  	}
>>  
> 
>
Thiago Jung Bauermann April 18, 2019, 1 a.m. UTC | #5
Hello Nick,

Thank you very much for reviewing this patch!

Nicholas Piggin <npiggin@gmail.com> writes:

> Thiago Jung Bauermann's on April 11, 2019 9:08 am:
>>
>> Thiago Jung Bauermann <bauerman@linux.ibm.com> writes:
>>
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> index 97feb6e79f1a..ac6dc35ab829 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> @@ -214,13 +214,22 @@ static void pseries_cpu_die(unsigned int cpu)
>>>  			msleep(1);
>>>  		}
>>>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>>> +		/*
>>> +		 * If the current state is not offline yet, it means that the
>>> +		 * dying CPU (which is either in pseries_mach_cpu_die() or in
>>> +		 * the process of getting there) didn't have a chance yet to
>>> +		 * call rtas_stop_self() and therefore it's too early to query
>>> +		 * if the CPU is stopped.
>>> +		 */
>>> +		spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
>>> +				   100000, 100);
>
> If the CPU state does not go to offline here, you should give up and
> return online, right? Otherwise I think query-cpu-stopped-state can
> get confused by CPUs in idle and you get a false positive.

Can it get confused? My impression from reading the definition for
query-cpu-stopped-state in the PAPR is that it will simply return a
CPU_status value of 2 in that case, meaning that "the processor thread
is not in the RTAS stopped state", but I don't know much about this.

> That race can still happen, we would really need a sequence count check
> over current CPU state to ensure we got a race-free qcss value, but at
> least a check here should make the race implausible to hit.

Actually, since rtas_stop_self() panics if the processor fails to stop
and also since callers of pseries_cpu_die()¹ already assume that it is
going to succeed in stopping the CPU (given that the function returns
void and can't signal an error), a more straightforward way of
eliminating the race is to simply do this:

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 97feb6e79f1a..2331a609f48f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -215,7 +215,7 @@ static void pseries_cpu_die(unsigned int cpu)
                }
        } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {

-               for (tries = 0; tries < 25; tries++) {
+               while (true) {
                        cpu_status = smp_query_cpu_stopped(pcpu);
                        if (cpu_status == QCSS_STOPPED ||
                            cpu_status == QCSS_HARDWARE_ERROR)


What do you think?

--
Thiago Jung Bauermann
IBM Linux Technology Center

¹ dlpar_offline_cpu() and takedown_cpu() in generic code
Nicholas Piggin April 18, 2019, 1:55 a.m. UTC | #6
Thiago Jung Bauermann's on April 18, 2019 11:00 am:
> 
> Hello Nick,
> 
> Thank you very much for reviewing this patch!
> 
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> Thiago Jung Bauermann's on April 11, 2019 9:08 am:
>>>
>>> Thiago Jung Bauermann <bauerman@linux.ibm.com> writes:
>>>
>>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>>> index 97feb6e79f1a..ac6dc35ab829 100644
>>>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>>> @@ -214,13 +214,22 @@ static void pseries_cpu_die(unsigned int cpu)
>>>>  			msleep(1);
>>>>  		}
>>>>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>>>> +		/*
>>>> +		 * If the current state is not offline yet, it means that the
>>>> +		 * dying CPU (which is either in pseries_mach_cpu_die() or in
>>>> +		 * the process of getting there) didn't have a chance yet to
>>>> +		 * call rtas_stop_self() and therefore it's too early to query
>>>> +		 * if the CPU is stopped.
>>>> +		 */
>>>> +		spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
>>>> +				   100000, 100);
>>
>> If the CPU state does not go to offline here, you should give up and
>> return online, right? Otherwise I think query-cpu-stopped-state can
>> get confused by CPUs in idle and you get a false positive.
> 
> Can it get confused? My impression from reading the definition for
> query-cpu-stopped-state in the PAPR is that it will simply return a
> CPU_status value of 2 in that case, meaning that "the processor thread
> is not in the RTAS stopped state", but I don't know much about this.

In QEMU (non-KVM) mode, qcss I think may get confused between H_CEDE
and rtas-stop-self. KVM mode may be okay because H_CEDE is handled in
the kernel.

>> That race can still happen, we would really need a sequence count check
>> over current CPU state to ensure we got a race-free qcss value, but at
>> least a check here should make the race implausible to hit.
> 
> Actually, since rtas_stop_self() panics if the processor fails to stop
> and also since callers of pseries_cpu_die()¹ already assume that it is
> going to succeed in stopping the CPU (given that the function returns
> void and can't signal an error), a more straightforward way of
> eliminating the race is to simply do this:
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 97feb6e79f1a..2331a609f48f 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -215,7 +215,7 @@ static void pseries_cpu_die(unsigned int cpu)
>                 }
>         } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> 
> -               for (tries = 0; tries < 25; tries++) {
> +               while (true) {
>                         cpu_status = smp_query_cpu_stopped(pcpu);
>                         if (cpu_status == QCSS_STOPPED ||
>                             cpu_status == QCSS_HARDWARE_ERROR)
> 
> 
> What do you think?

Yeah I think that may be a good idea, just makes things much simpler.

Thanks,
Nick

Patch
diff mbox series

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 97feb6e79f1a..ac6dc35ab829 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -214,13 +214,22 @@  static void pseries_cpu_die(unsigned int cpu)
 			msleep(1);
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+		/*
+		 * If the current state is not offline yet, it means that the
+		 * dying CPU (which is either in pseries_mach_cpu_die() or in
+		 * the process of getting there) didn't have a chance yet to
+		 * call rtas_stop_self() and therefore it's too early to query
+		 * if the CPU is stopped.
+		 */
+		spin_event_timeout(get_cpu_current_state(cpu) == CPU_STATE_OFFLINE,
+				   100000, 100);
 
-		for (tries = 0; tries < 25; tries++) {
+		for (tries = 0; tries < 200; tries++) {
 			cpu_status = smp_query_cpu_stopped(pcpu);
 			if (cpu_status == QCSS_STOPPED ||
 			    cpu_status == QCSS_HARDWARE_ERROR)
 				break;
-			cpu_relax();
+			udelay(100);
 		}
 	}