diff mbox series

[1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove()

Message ID 20210305173845.451158-1-danielhb413@gmail.com (mailing list archive)
State Superseded
Headers show
Series [1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove() | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (91966823812efbd175f904599e5cf2a854b39809)
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 18 lines checked
snowpatch_ozlabs/needsstable success Patch has no Fixes tags

Commit Message

Daniel Henrique Barboza March 5, 2021, 5:38 p.m. UTC
Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
CPU' is one that can be caused directly by the user offlining CPUs
in a partition/virtual machine that has hotplugged CPUs. Trying to
reclaim a hotplugged CPU can fail if the CPU is now the last online in
the system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just
"Failed to offline CPU", makes it clearer that the error is in fact a
known error situation instead of other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Daniel Henrique Barboza March 18, 2021, 9:59 a.m. UTC | #1
Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
> CPU' is one that can be caused directly by the user offlining CPUs
> in a partition/virtual machine that has hotplugged CPUs. Trying to
> reclaim a hotplugged CPU can fail if the CPU is now the last online in
> the system. This is easily reproduced using QEMU [1].
> 
> Throwing a more specific error message for this case, instead of just
> "Failed to offline CPU", makes it clearer that the error is in fact a
> known error situation instead of other generic/unknown cause.
> 
> [1] https://bugzilla.redhat.com/1911414
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>   arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..134f393f09e1 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
>   
>   	rc = dlpar_offline_cpu(dn);
>   	if (rc) {
> -		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
> +		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
> +		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
> +		 * there is only one CPU left. Warn the user about the second
> +		 * since this can happen with user offlining CPUs and then
> +		 * attempting hotunplugs.
> +		 */
> +		if (rc == -EBUSY && num_online_cpus() == 1)
> +			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
> +		else
> +			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
> +
>   		return -EINVAL;
>   	}
>   
>
Michael Ellerman March 19, 2021, 11:26 a.m. UTC | #2
Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> Ping
>
> On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
>> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
>> CPU' is one that can be caused directly by the user offlining CPUs
>> in a partition/virtual machine that has hotplugged CPUs. Trying to
>> reclaim a hotplugged CPU can fail if the CPU is now the last online in
>> the system. This is easily reproduced using QEMU [1].

Sorry, I saw this earlier and never got around to replying.

I'm wondering if we neet to catch it earlier, ie. in
dlpar_offline_cpu().

By the time we return to dlpar_cpu_remove() we've dropped the
cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could
change out from under us, meaning the num_online_cpus() == 1 check might
trigger incorrectly (or vice versa).

Something like the patch below (completely untested :D)

And writing that patch makes me wonder, is == 1 the right check?

In most cases we'll remove all but one thread of the core, but we'll
fail on the last thread. Leaving that core effectively stuck in SMT1. Is
that useful behaviour? Should we instead check at the start that we can
remove all threads of the core without going to zero online CPUs?

cheers


diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..498c22331ac8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
 			if (!cpu_online(cpu))
 				break;
 
+			if (num_online_cpus() == 1) {
+				pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+				rc = EBUSY;
+				goto out_unlock;
+			}
+
 			cpu_maps_update_done();
 			rc = device_offline(get_cpu_device(cpu));
 			if (rc)
@@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
 				thread);
 		}
 	}
+out_unlock:
 	cpu_maps_update_done();
 
 out:
Daniel Henrique Barboza March 22, 2021, 11:30 p.m. UTC | #3
On 3/19/21 8:26 AM, Michael Ellerman wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> Ping
>>
>> On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
>>> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
>>> CPU' is one that can be caused directly by the user offlining CPUs
>>> in a partition/virtual machine that has hotplugged CPUs. Trying to
>>> reclaim a hotplugged CPU can fail if the CPU is now the last online in
>>> the system. This is easily reproduced using QEMU [1].
> 
> Sorry, I saw this earlier and never got around to replying.

No problem. Thanks for the review!

> 
> I'm wondering if we neet to catch it earlier, ie. in
> dlpar_offline_cpu().
> 
> By the time we return to dlpar_cpu_remove() we've dropped the
> cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could
> change out from under us, meaning the num_online_cpus() == 1 check might
> trigger incorrectly (or vice versa).
> 
> Something like the patch below (completely untested :D)

Makes sense. I'll try it out to see if it works.

> 
> And writing that patch makes me wonder, is == 1 the right check?
> 
> In most cases we'll remove all but one thread of the core, but we'll
> fail on the last thread. Leaving that core effectively stuck in SMT1. Is
> that useful behaviour? Should we instead check at the start that we can
> remove all threads of the core without going to zero online CPUs?

I think it's ok to allow SMT1 cores, speaking from QEMU perspective.
QEMU does not have a "core hotunplug" operation where the whole core is
hotunplugged at once. The CPU hotplug/unplug operations are done as single
CPU thread add/removal. If the user wants to run 4 cores, all of them with
SMT1, QEMU will allow it.

Libvirt does not operate with the core granularity either - you can specify
the amount of vcpus the guest should run with, and Libvirt will send
hotplug/unplug requests to QEMU to match the desired value. It doesn't
bother with how many threads of a core were offlined or not.


Thanks,


DHB



> 
> cheers
> 
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..498c22331ac8 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
>   			if (!cpu_online(cpu))
>   				break;
>   
> +			if (num_online_cpus() == 1) {
> +				pr_warn("Unable to remove last online CPU %pOFn\n", dn);
> +				rc = EBUSY;
> +				goto out_unlock;
> +			}
> +
>   			cpu_maps_update_done();
>   			rc = device_offline(get_cpu_device(cpu));
>   			if (rc)
> @@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
>   				thread);
>   		}
>   	}
> +out_unlock:
>   	cpu_maps_update_done();
>   
>   out:
>
diff mbox series

Patch

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@  static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
 
 	rc = dlpar_offline_cpu(dn);
 	if (rc) {
-		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
+		 * there is only one CPU left. Warn the user about the second
+		 * since this can happen with user offlining CPUs and then
+		 * attempting hotunplugs.
+		 */
+		if (rc == -EBUSY && num_online_cpus() == 1)
+			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+		else
+			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
 		return -EINVAL;
 	}