diff mbox series

OCC: Increase max pstate check on P9 to 255

Message ID 20171116223403.19363-1-stewart@linux.vnet.ibm.com
State Accepted
Headers show
Series OCC: Increase max pstate check on P9 to 255 | expand

Commit Message

Stewart Smith Nov. 16, 2017, 10:34 p.m. UTC
This has changed from P8, we can now have > 127 pstates.

This was observed on Boston during WoF bringup.

Reported-by: Minda Wei <minda.wei@ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
---
 hw/occ.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Vaidyanathan Srinivasan Nov. 17, 2017, 11:22 a.m. UTC | #1
* Stewart Smith <stewart@linux.vnet.ibm.com> [2017-11-17 09:34:03]:

> This has changed from P8, we can now have > 127 pstates.
> 
> This was observed on Boston during WoF bringup.
> 
> Reported-by: Minda Wei <minda.wei@ibm.com>

Reported-by: Francesco A Campisano <campisan@us.ibm.com>
>
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>


> ---
>  hw/occ.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/occ.c b/hw/occ.c
> index 78c6a6a356e3..8ad0dfe421d7 100644
> --- a/hw/occ.c
> +++ b/hw/occ.c
> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
>  	nr_pstates = labs(pmax - pmin) + 1;
>  	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
>  	      occ_data->version, pmin, pnom, pmax, nr_pstates);
> -	if (nr_pstates <= 1 || nr_pstates > 128) {
> +	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
> +	    (occ_data->version <= 0x02 &&
> +	     (nr_pstates <= 1 || nr_pstates > 128))) {
>  		/**
>  		 * @fwts-label OCCInvalidPStateRange
>  		 * @fwts-advice The number of pstates is outside the valid
> -		 * range (currently <=1 or > 128), so OPAL has not added
> -		 * pstates to the device tree. This means that OCC (On Chip
> -		 * Controller) will be non-functional. This means
> +		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
> +		 * has not added pstates to the device tree. This means that
> +		 * OCC (On Chip Controller) will be non-functional. This means
>  		 * that CPU idle states and CPU frequency scaling
>  		 * will not be functional.
>  		 */

Thanks for the fix to increase the limit.

Checking for occ_data->version is the right check to decide permitted
range between P8 vs P9.

In case occ_data->version is bumped up, then we do not know the range
and we will fail in an earlier check and not export any pstates.

We need to update the right range for future versions here.

However Linux could have reporting problems with PState IDs more than
128 and also with large number of PStates.  This change does not as
such crash Linux, but more work is needed in Linux driver to support
large number of PStates.

--Vaidy
Stewart Smith Nov. 21, 2017, 2:34 a.m. UTC | #2
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
> * Stewart Smith <stewart@linux.vnet.ibm.com> [2017-11-17 09:34:03]:
>
>> This has changed from P8, we can now have > 127 pstates.
>> 
>> This was observed on Boston during WoF bringup.
>> 
>> Reported-by: Minda Wei <minda.wei@ibm.com>
>
> Reported-by: Francesco A Campisano <campisan@us.ibm.com>
>>
>>
>> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>
> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
>
>
>> ---
>>  hw/occ.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>> 
>> diff --git a/hw/occ.c b/hw/occ.c
>> index 78c6a6a356e3..8ad0dfe421d7 100644
>> --- a/hw/occ.c
>> +++ b/hw/occ.c
>> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
>>  	nr_pstates = labs(pmax - pmin) + 1;
>>  	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
>>  	      occ_data->version, pmin, pnom, pmax, nr_pstates);
>> -	if (nr_pstates <= 1 || nr_pstates > 128) {
>> +	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
>> +	    (occ_data->version <= 0x02 &&
>> +	     (nr_pstates <= 1 || nr_pstates > 128))) {
>>  		/**
>>  		 * @fwts-label OCCInvalidPStateRange
>>  		 * @fwts-advice The number of pstates is outside the valid
>> -		 * range (currently <=1 or > 128), so OPAL has not added
>> -		 * pstates to the device tree. This means that OCC (On Chip
>> -		 * Controller) will be non-functional. This means
>> +		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
>> +		 * has not added pstates to the device tree. This means that
>> +		 * OCC (On Chip Controller) will be non-functional. This means
>>  		 * that CPU idle states and CPU frequency scaling
>>  		 * will not be functional.
>>  		 */
>
> Thanks for the fix to increase the limit.
>
> Checking for occ_data->version is the right check to decide permitted
> range between P8 vs P9.
>
> In case occ_data->version is bumped up, then we do not know the range
> and we will fail in an earlier check and not export any pstates.
>
> We need to update the right range for future versions here.
>
> However Linux could have reporting problems with PState IDs more than
> 128 and also with large number of PStates.  This change does not as
> such crash Linux, but more work is needed in Linux driver to support
> large number of PStates.

I'll pull the patch in and let's see how it goes for a bit... see if
people see this as a real problem or not. We can always pull it out and
do a dance with filtering pstates to limit to 128 and put the full list
somewhere else that the kernel doesn't poke at by default.

mpe, what are your thoughts here?
Michael Ellerman Nov. 21, 2017, 11:40 a.m. UTC | #3
Stewart Smith <stewart@linux.vnet.ibm.com> writes:

> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
>> * Stewart Smith <stewart@linux.vnet.ibm.com> [2017-11-17 09:34:03]:
>>
>>> This has changed from P8, we can now have > 127 pstates.
>>> 
>>> This was observed on Boston during WoF bringup.
>>> 
>>> Reported-by: Minda Wei <minda.wei@ibm.com>
>>
>> Reported-by: Francesco A Campisano <campisan@us.ibm.com>
>>>
>>>
>>> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>>
>> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
>>
>>
>>> ---
>>>  hw/occ.c | 10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/hw/occ.c b/hw/occ.c
>>> index 78c6a6a356e3..8ad0dfe421d7 100644
>>> --- a/hw/occ.c
>>> +++ b/hw/occ.c
>>> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
>>>  	nr_pstates = labs(pmax - pmin) + 1;
>>>  	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
>>>  	      occ_data->version, pmin, pnom, pmax, nr_pstates);
>>> -	if (nr_pstates <= 1 || nr_pstates > 128) {
>>> +	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
>>> +	    (occ_data->version <= 0x02 &&
>>> +	     (nr_pstates <= 1 || nr_pstates > 128))) {
>>>  		/**
>>>  		 * @fwts-label OCCInvalidPStateRange
>>>  		 * @fwts-advice The number of pstates is outside the valid
>>> -		 * range (currently <=1 or > 128), so OPAL has not added
>>> -		 * pstates to the device tree. This means that OCC (On Chip
>>> -		 * Controller) will be non-functional. This means
>>> +		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
>>> +		 * has not added pstates to the device tree. This means that
>>> +		 * OCC (On Chip Controller) will be non-functional. This means
>>>  		 * that CPU idle states and CPU frequency scaling
>>>  		 * will not be functional.
>>>  		 */
>>
>> Thanks for the fix to increase the limit.
>>
>> Checking for occ_data->version is the right check to decide permitted
>> range between P8 vs P9.
>>
>> In case occ_data->version is bumped up, then we do not know the range
>> and we will fail in an earlier check and not export any pstates.
>>
>> We need to update the right range for future versions here.
>>
>> However Linux could have reporting problems with PState IDs more than
>> 128 and also with large number of PStates.  This change does not as
>> such crash Linux, but more work is needed in Linux driver to support
>> large number of PStates.
>
> I'll pull the patch in and let's see how it goes for a bit... see if
> people see this as a real problem or not. We can always pull it out and
> do a dance with filtering pstates to limit to 128 and put the full list
> somewhere else that the kernel doesn't poke at by default.
>
> mpe, what are your thoughts here?

Hard to say.

The trade off seems to be unspecified "problems" in Linux if the large
number of states is exposed, vs having no CPU idle or CPU frequency
scaling.

The latter is clearly very bad, but it would be good to have more detail
from Vaidy on the issues we might see in Linux.

cheers
Vaidyanathan Srinivasan Nov. 21, 2017, 2 p.m. UTC | #4
* Michael Ellerman <mpe@ellerman.id.au> [2017-11-21 22:40:22]:

> Stewart Smith <stewart@linux.vnet.ibm.com> writes:
> 
> > Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
> >> * Stewart Smith <stewart@linux.vnet.ibm.com> [2017-11-17 09:34:03]:
> >>
> >>> This has changed from P8, we can now have > 127 pstates.
> >>> 
> >>> This was observed on Boston during WoF bringup.
> >>> 
> >>> Reported-by: Minda Wei <minda.wei@ibm.com>
> >>
> >> Reported-by: Francesco A Campisano <campisan@us.ibm.com>
> >>>
> >>>
> >>> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> >>
> >> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> >>
> >>
> >>> ---
> >>>  hw/occ.c | 10 ++++++----
> >>>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>> 
> >>> diff --git a/hw/occ.c b/hw/occ.c
> >>> index 78c6a6a356e3..8ad0dfe421d7 100644
> >>> --- a/hw/occ.c
> >>> +++ b/hw/occ.c
> >>> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
> >>>  	nr_pstates = labs(pmax - pmin) + 1;
> >>>  	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
> >>>  	      occ_data->version, pmin, pnom, pmax, nr_pstates);
> >>> -	if (nr_pstates <= 1 || nr_pstates > 128) {
> >>> +	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
> >>> +	    (occ_data->version <= 0x02 &&
> >>> +	     (nr_pstates <= 1 || nr_pstates > 128))) {
> >>>  		/**
> >>>  		 * @fwts-label OCCInvalidPStateRange
> >>>  		 * @fwts-advice The number of pstates is outside the valid
> >>> -		 * range (currently <=1 or > 128), so OPAL has not added
> >>> -		 * pstates to the device tree. This means that OCC (On Chip
> >>> -		 * Controller) will be non-functional. This means
> >>> +		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
> >>> +		 * has not added pstates to the device tree. This means that
> >>> +		 * OCC (On Chip Controller) will be non-functional. This means
> >>>  		 * that CPU idle states and CPU frequency scaling
> >>>  		 * will not be functional.
> >>>  		 */
> >>
> >> Thanks for the fix to increase the limit.
> >>
> >> Checking for occ_data->version is the right check to decide permitted
> >> range between P8 vs P9.
> >>
> >> In case occ_data->version is bumped up, then we do not know the range
> >> and we will fail in an earlier check and not export any pstates.
> >>
> >> We need to update the right range for future versions here.
> >>
> >> However Linux could have reporting problems with PState IDs more than
> >> 128 and also with large number of PStates.  This change does not as
> >> such crash Linux, but more work is needed in Linux driver to support
> >> large number of PStates.
> >
> > I'll pull the patch in and let's see how it goes for a bit... see if
> > people see this as a real problem or not. We can always pull it out and
> > do a dance with filtering pstates to limit to 128 and put the full list
> > somewhere else that the kernel doesn't poke at by default.
> >
> > mpe, what are your thoughts here?
> 
> Hard to say.
> 
> The trade off seems to be unspecified "problems" in Linux if the large
> number of states is exposed, vs having no CPU idle or CPU frequency
> scaling.
> 
> The latter is clearly very bad, but it would be good to have more detail
> from Vaidy on the issues we might see in Linux.

The issue in Linux is only in reporting current frequency and not
really a major issue with the frequency request or the driver.  When
we set the PState request, we pick the PState ID corresponding to the
frequency and write to PMCR SPR register.  There is no impact here.
When user reads current frequency from sysfs cpuinfo_cur_freq, we read
PMSR SPR and map the PState ID back to a frequency.  At this stage the
PState ID read is treated as signed 8-bit to maintain compatibility
with P8 systems.

So when frequency is very low and the PState IDs go more than 128,
then the PState ID interpreted will become negative and *not* found in
Linux. In this situation we fall back to reporting nominal frequency.

The end user will observe that the reported frequency will suddenly
jump to a higher value of nominal frequency when we go down to the
last few PStates at the lowest frequency.

This problem is limited to few chips in few systems where the
calibrated frequency range is wide and really the number of PStates
exceeds 128.  This scenario is architecturally correct and permitted,
but just that I did not anticipate :)

I would suggest we could take this patch in skiboot and the work out
additional device tree nodes and later change Linux to use new format.
My idea would be to use different device tree format and make new
Linux driver use different elements for requests and reporting.  So in
future we can expose smaller number of PStates to Linux, but still
Linux can lookup and convert any PState back to a frequency value for
reporting.

--Vaidy
Stewart Smith Nov. 21, 2017, 8:39 p.m. UTC | #5
Stewart Smith <stewart@linux.vnet.ibm.com> writes:
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
>> * Stewart Smith <stewart@linux.vnet.ibm.com> [2017-11-17 09:34:03]:
>>
>>> This has changed from P8, we can now have > 127 pstates.
>>> 
>>> This was observed on Boston during WoF bringup.
>>> 
>>> Reported-by: Minda Wei <minda.wei@ibm.com>
>>
>> Reported-by: Francesco A Campisano <campisan@us.ibm.com>
>>>
>>>
>>> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
>>
>> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
>>
>>
>>> ---
>>>  hw/occ.c | 10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/hw/occ.c b/hw/occ.c
>>> index 78c6a6a356e3..8ad0dfe421d7 100644
>>> --- a/hw/occ.c
>>> +++ b/hw/occ.c
>>> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
>>>  	nr_pstates = labs(pmax - pmin) + 1;
>>>  	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
>>>  	      occ_data->version, pmin, pnom, pmax, nr_pstates);
>>> -	if (nr_pstates <= 1 || nr_pstates > 128) {
>>> +	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
>>> +	    (occ_data->version <= 0x02 &&
>>> +	     (nr_pstates <= 1 || nr_pstates > 128))) {
>>>  		/**
>>>  		 * @fwts-label OCCInvalidPStateRange
>>>  		 * @fwts-advice The number of pstates is outside the valid
>>> -		 * range (currently <=1 or > 128), so OPAL has not added
>>> -		 * pstates to the device tree. This means that OCC (On Chip
>>> -		 * Controller) will be non-functional. This means
>>> +		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
>>> +		 * has not added pstates to the device tree. This means that
>>> +		 * OCC (On Chip Controller) will be non-functional. This means
>>>  		 * that CPU idle states and CPU frequency scaling
>>>  		 * will not be functional.
>>>  		 */
>>
>> Thanks for the fix to increase the limit.
>>
>> Checking for occ_data->version is the right check to decide permitted
>> range between P8 vs P9.
>>
>> In case occ_data->version is bumped up, then we do not know the range
>> and we will fail in an earlier check and not export any pstates.
>>
>> We need to update the right range for future versions here.
>>
>> However Linux could have reporting problems with PState IDs more than
>> 128 and also with large number of PStates.  This change does not as
>> such crash Linux, but more work is needed in Linux driver to support
>> large number of PStates.
>
> I'll pull the patch in and let's see how it goes for a bit... see if
> people see this as a real problem or not. We can always pull it out and
> do a dance with filtering pstates to limit to 128 and put the full list
> somewhere else that the kernel doesn't poke at by default.
>
> mpe, what are your thoughts here?

I've merged to master as of
701556d2ea9b890a389183907ae484b78c05c665.... let's see if this is a
terrible idea :)
Michael Ellerman Nov. 22, 2017, 4:14 a.m. UTC | #6
Stewart Smith <stewart@linux.vnet.ibm.com> writes:
> Stewart Smith <stewart@linux.vnet.ibm.com> writes:
...
>>
>> I'll pull the patch in and let's see how it goes for a bit... see if
>> people see this as a real problem or not. We can always pull it out and
>> do a dance with filtering pstates to limit to 128 and put the full list
>> somewhere else that the kernel doesn't poke at by default.
>>
>> mpe, what are your thoughts here?
>
> I've merged to master as of
> 701556d2ea9b890a389183907ae484b78c05c665.... let's see if this is a
> terrible idea :)

Sounds good.

cheers
diff mbox series

Patch

diff --git a/hw/occ.c b/hw/occ.c
index 78c6a6a356e3..8ad0dfe421d7 100644
--- a/hw/occ.c
+++ b/hw/occ.c
@@ -612,13 +612,15 @@  static bool add_cpu_pstate_properties(int *pstate_nom)
 	nr_pstates = labs(pmax - pmin) + 1;
 	prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
 	      occ_data->version, pmin, pnom, pmax, nr_pstates);
-	if (nr_pstates <= 1 || nr_pstates > 128) {
+	if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
+	    (occ_data->version <= 0x02 &&
+	     (nr_pstates <= 1 || nr_pstates > 128))) {
 		/**
 		 * @fwts-label OCCInvalidPStateRange
 		 * @fwts-advice The number of pstates is outside the valid
-		 * range (currently <=1 or > 128), so OPAL has not added
-		 * pstates to the device tree. This means that OCC (On Chip
-		 * Controller) will be non-functional. This means
+		 * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
+		 * has not added pstates to the device tree. This means that
+		 * OCC (On Chip Controller) will be non-functional. This means
 		 * that CPU idle states and CPU frequency scaling
 		 * will not be functional.
 		 */