
[v2,0/1] pseries/hotplug: Change the default behaviour of cede_offline

Message ID 1571740391-3251-1-git-send-email-ego@linux.vnet.ibm.com (mailing list archive)


Gautham R Shenoy Oct. 22, 2019, 10:33 a.m. UTC
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

This is the v2 of the fix to change the default behaviour of cede_offline.
The previous version can be found here: https://lkml.org/lkml/2019/9/12/222

The main change from v1 is that the patch2 to create a sysfs file to
report and control the value of cede_offline_enabled has been dropped.

Problem Description:
====================
Currently on pseries Linux guests, an offlined CPU can be put into one
of the following two states:
   - Long-term processor cede (also called extended cede)
   - Returned to the hypervisor via the RTAS "stop-self" call

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default, offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active", since those CPUs are
still under the control of the Linux guest. Hence, when we change the
SMT mode by offlining the secondary CPUs, the PURR and RWMR SPRs
continue to accumulate counts for the offlined CPUs in extended cede
as if they were online.

One of the expectations with PURR is that, for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.
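The invariant described above can be sketched as a small check. This is only a minimal, hypothetical restatement of what the author's purr_tb.py script (linked below) measures; the function name and the idea of feeding it pre-sampled PURR deltas are illustrative, not part of the actual tool.

```python
# Invariant: over a sampling interval, the PURR increments of a core's
# online threads should sum to the number of timebase ticks elapsed in
# that interval (timebase frequency is 512 MHz on these machines).

def core_purr_deficit(purr_deltas, tb_freq_hz=512000000, interval_s=1.0):
    """Return (expected_ticks, observed_sum) for one core.

    purr_deltas: PURR increments of the core's *online* threads over
                 the interval (sampled before/after, e.g. from the
                 per-CPU PURR register).
    """
    expected = int(tb_freq_hz * interval_s)
    observed = sum(purr_deltas)
    return expected, observed

# With SMT=off and the offlined siblings left in extended cede, the lone
# online thread accumulates only part of the core's PURR ticks, which is
# the deficit shown in the tables below (core00, SMT=off case):
expected, observed = core_purr_deficit([69883784])
print(expected, observed, expected - observed)
```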

In the following data (generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py),
SD-PURR is the sum of the PURR increments across the online CPUs of a
core over a 1-second interval.
      
SMT=off
===========================================
Core            SD-PURR         SD-PURR
                (expected)      (observed)
===========================================
core00 [  0]	512000000	69883784	
core01 [  8]	512000000	88782536	
core02 [ 16]	512000000	94296824	
core03 [ 24]	512000000	80951968	

SMT=2
===========================================
Core            SD-PURR         SD-PURR
                (expected)      (observed)
===========================================
core00 [  0,1]	512000000	136147792	
core01 [  8,9]	512000000	128636784	
core02 [ 16,17]	512000000	135426488	
core03 [ 24,25]	512000000	153027520	

SMT=4
===================================================
Core			SD-PURR         SD-PURR
                	(expected)      (observed)
===================================================
core00 [  0,1,2,3]	512000000	258331616	
core01 [  8,9,10,11]	512000000	274220072	
core02 [ 16,17,18,19]	512000000	260013736	
core03 [ 24,25,26,27]	512000000	260079672	

SMT=on
===================================================================
Core					SD-PURR         SD-PURR
                			(expected)      (observed)
===================================================================
core00 [  0,1,2,3,4,5,6,7]		512000000	512941248	
core01 [  8,9,10,11,12,13,14,15]	512000000	512936544	
core02 [ 16,17,18,19,20,21,22,23]	512000000	512931544	
core03 [ 24,25,26,27,28,29,30,31]	512000000	512923800

This patchset addresses the issue by changing the default value of
"cede_offline_enabled" to false, so that by default the offlined CPUs
are returned to the hypervisor via the RTAS "stop-self" call.

With the patch applied, the observed sum of the PURR increments across
the online threads of a core over 1 second matches the number of
timebase ticks in that second.

SMT=off
===========================================
Core            SD-PURR         SD-PURR
                (expected)      (observed)
===========================================
core00 [  0]	512000000	 512527568	
core01 [  8]	512000000	 512556128	
core02 [ 16]	512000000	 512590016	
core03 [ 24]	512000000	 512589440	

SMT=2
===========================================
Core            SD-PURR         SD-PURR
                (expected)      (observed)
===========================================
core00 [  0,1]	512000000	512635328
core01 [  8,9]	512000000	512610416	
core02 [ 16,17]	512000000	512639360	
core03 [ 24,25]	512000000	512638720	

SMT=4
===================================================
Core		        SD-PURR         SD-PURR
                	(expected)      (observed)
===================================================
core00 [  0,1,2,3]	512000000	512757328	
core01 [  8,9,10,11]	512000000	512727920	
core02 [ 16,17,18,19]	512000000	512754712	
core03 [ 24,25,26,27]	512000000	512739040	

SMT=on
==============================================================
Core				   SD-PURR         SD-PURR
                		   (expected)      (observed)
==============================================================
core00 [  0,1,2,3,4,5,6,7]	   512000000	   512920936	
core01 [  8,9,10,11,12,13,14,15]   512000000	   512878728	
core02 [ 16,17,18,19,20,21,22,23]  512000000	   512921192	
core03 [ 24,25,26,27,28,29,30,31]  512000000	   512924816	

Further, the patch
   gives an improvement of 5% in offlining of a core on POWER8,
   gives an improvement of 18% in offlining of a core on POWER9,
   causes a regression of 2.5% in onlining of a core on POWER8,
   causes a regression of 4.5% in onlining of a core on POWER9.
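The percentages above follow from the average latencies in the tables below; a quick sketch of that arithmetic (the helper name is illustrative):

```python
# Percentage change in mean online/offline latency, relative to the
# unpatched average (negative = faster with the patch).

def pct_change(before_ms, after_ms):
    return (after_ms - before_ms) / before_ms * 100.0

cases = {
    "POWER8 offline": (986.8, 938.6),    # ~5% improvement
    "POWER9 offline": (1392.9, 1130.8),  # ~18% improvement
    "POWER8 online":  (1591.2, 1629.6),  # ~2.5% regression
    "POWER9 online":  (1891.6, 1976.1),  # ~4.5% regression
}
for name, (before, after) in cases.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```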
                 
POWER8
======================================================================
| Operation | Patch status |#Samples|Min |Max  |Median|Avg    |Stddev|
|           |              |        |(ms)|(ms) | (ms) |(ms)   |      |
======================================================================
| Offline   | Without Patch| 20     | 822| 1232| 972  | 986.8 |112.58|
| Offline   | With Patch   | 20     | 831| 1152| 941  | 938.6 | 80.33|
| --------- | -------------|--------|----| ----|------|-------|------|
| Online    | Without Patch| 20     |1460| 1760| 1620 |1591.2 | 82.72|
| Online    | With Patch   | 20     |1489| 1839| 1629 |1629.6 | 94.90|
======================================================================

POWER9
======================================================================
| Operation | Patch status |#Samples|Min |Max  |Median|Avg    |Stddev|
|           |              |        |(ms)|(ms) | (ms) |(ms)   |      |
======================================================================
| Offline   | Without Patch| 20     |1120|1653 | 1394 |1392.9 |133.63|
| Offline   | With Patch   | 20     | 930|1316 | 1161 |1130.8 |117.76|
| --------- | -------------|--------|----| ----|------|-------|------|
| Online    | Without Patch| 20     |1652|2108 | 1903 |1891.6 |130.74|
| Online    | With Patch   | 20     |1824|2222 | 1960 |1976.1 | 93.98|
======================================================================

Gautham R. Shenoy (1):
  pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"

 Documentation/core-api/cpu_hotplug.rst       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

Comments

Nathan Lynch Oct. 25, 2019, 11:03 p.m. UTC | #1
"Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> This is the v2 of the fix to change the default behaviour of
> cede_offline.

OK, but why keep the cede offline behavior at all? Can we remove it? I
think doing so would allow us to remove all the code that temporarily
onlines threads for partition migration.
Gautham R Shenoy Oct. 29, 2019, 11:13 a.m. UTC | #2
Hello Nathan,


On Fri, Oct 25, 2019 at 06:03:26PM -0500, Nathan Lynch wrote:
> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
> > This is the v2 of the fix to change the default behaviour of
> > cede_offline.
> 
> OK, but why keep the cede offline behavior at all? Can we remove it? I
> think doing so would allow us to remove all the code that temporarily
> onlines threads for partition migration.

Maybe I am missing something. But don't we want all the CPUs to come
online and execute the H_JOIN hcall before performing partition
migration? How will this change whether the offlined CPUs are in
H_CEDE or rtas-stop-self?

--
Thanks and Regards
gautham.
Nathan Lynch Oct. 29, 2019, 3:29 p.m. UTC | #3
Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> On Fri, Oct 25, 2019 at 06:03:26PM -0500, Nathan Lynch wrote:
>> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>> > This is the v2 of the fix to change the default behaviour of
>> > cede_offline.
>> 
>> OK, but why keep the cede offline behavior at all? Can we remove it? I
>> think doing so would allow us to remove all the code that temporarily
>> onlines threads for partition migration.
>
> Maybe I am missing something. But don't we want all the CPUs to come
> online and execute the H_JOIN hcall before performing partition
> migration? How will this change whether the offlined CPUs are in
> H_CEDE or rtas-stop-self?

The platform considers threads in H_CEDE to be active. It considers
threads that have performed stop-self to be inactive until they have
been restarted. The Thread Join Option section of the PAPR says active
threads must perform the H_JOIN. I have confirmed with hypervisor
development that this implies that the OS needn't involve inactive
threads in the join/suspend sequence.

It isn't quite explicit in the log for 120496ac2d2d ("powerpc: Bring all
threads online prior to migration/hibernation"), but it stands to reason
that using cede for offline is the reason that the code to online all
threads for join/suspend was introduced in the first place.