diff mbox

cpuidle/powernv: Fix snooze timeout

Message ID 1466624203-1847-1-git-send-email-shreyas@linux.vnet.ibm.com (mailing list archive)
State Rejected
Headers show

Commit Message

Shreyas B. Prabhu June 22, 2016, 7:36 p.m. UTC
Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a cpu stays in snooze for more than target
residency of the next available idle state, then it would exit thereby
giving chance to the cpuidle governor to re-evaluate and
promote the cpu to a deeper idle state. Therefore whenever snooze exits
due to this timeout, its last_residency will be target_residency of next
deeper state.

commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
changed the math around last_residency calculation. Specifically, while
converting last_residency value from nanoseconds to microseconds it does
right shift by 10. Due to this, in snooze timeout exit scenarios
last_residency calculated is roughly 2.3% less than target_residency of
next available state. This pattern is picked up get_typical_interval()
in the menu governor and therefore expected_interval in menu_select() is
frequently less than the target_residency of any state but snooze.

Due to this we are entering snooze at a higher rate, thereby affecting
the single thread performance.
Since the math around last_residency is not meant to be precise, fix this
issue setting snooze timeout to 105% of target_residency of next
available idle state.

This also adds comment around why snooze timeout is necessary.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
 drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
 2 files changed, 27 insertions(+)

Comments

Rafael J. Wysocki June 22, 2016, 10:49 p.m. UTC | #1
On Wed, Jun 22, 2016 at 9:36 PM, Shreyas B. Prabhu
<shreyas@linux.vnet.ibm.com> wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than target
> residency of the next available idle state, then it would exit thereby
> giving chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
>
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
>
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
> Since the math around last_residency is not meant to be precise, fix this
> issue setting snooze timeout to 105% of target_residency of next
> available idle state.
>
> This also adds comment around why snooze timeout is necessary.

Daniel, any comments?

> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index e12dc30..5835491 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>                 cpuidle_state_table = powernv_states;
>                 /* Device tree can indicate more idle states */
>                 max_idle_state = powernv_add_idle_states();
> +
> +               /*
> +                * Staying in snooze for a long period can degrade the
> +                * perfomance of the sibling cpus. Set timeout for snooze such
> +                * that if the cpu stays in snooze longer than target residency
> +                * of the next available idle state then exit from snooze. This
> +                * gives a chance to the cpuidle governor to re-evaluate and
> +                * promote it to deeper idle states.
> +                */
>                 if (max_idle_state > 1) {
>                         snooze_timeout_en = true;
>                         snooze_timeout = powernv_states[1].target_residency *
>                                          tb_ticks_per_usec;
> +                       /*
> +                        * Give a 5% margin since target residency related math
> +                        * is not precise in cpuidle core.
> +                        */
> +                       snooze_timeout += snooze_timeout / 20;
>                 }
>         } else
>                 return -ENODEV;
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 07135e0..22de841 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
>         } else
>                 return -ENODEV;
>
> +       /*
> +        * Staying in snooze for a long period can degrade the
> +        * perfomance of the sibling cpus. Set timeout for snooze such
> +        * that if the cpu stays in snooze longer than target residency
> +        * of the next available idle state then exit from snooze. This
> +        * gives a chance to the cpuidle governor to re-evaluate and
> +        * promote it to deeper idle states.
> +        */
>         if (max_idle_state > 1) {
>                 snooze_timeout_en = true;
>                 snooze_timeout = cpuidle_state_table[1].target_residency *
>                                  tb_ticks_per_usec;
> +               /*
> +                * Give a 5% margin since target residency related math
> +                * is not precise in cpuidle core.
> +                */
> +               snooze_timeout += snooze_timeout / 20;
>         }
>         return 0;
>  }
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Balbir Singh June 22, 2016, 11:48 p.m. UTC | #2
On 23/06/16 05:36, Shreyas B. Prabhu wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than target
> residency of the next available idle state, then it would exit thereby
> giving chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
> 
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
> 
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
> Since the math around last_residency is not meant to be precise, fix this
> issue setting snooze timeout to 105% of target_residency of next
> available idle state.
> 
> This also adds comment around why snooze timeout is necessary.
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index e12dc30..5835491 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>  		cpuidle_state_table = powernv_states;
>  		/* Device tree can indicate more idle states */
>  		max_idle_state = powernv_add_idle_states();
> +
> +		/*
> +		 * Staying in snooze for a long period can degrade the
> +		 * perfomance of the sibling cpus. Set timeout for snooze such
> +		 * that if the cpu stays in snooze longer than target residency
> +		 * of the next available idle state then exit from snooze. This
> +		 * gives a chance to the cpuidle governor to re-evaluate and
> +		 * promote it to deeper idle states.
> +		 */
>  		if (max_idle_state > 1) {
>  			snooze_timeout_en = true;
>  			snooze_timeout = powernv_states[1].target_residency *
>  					 tb_ticks_per_usec;
> +			/*
> +			 * Give a 5% margin since target residency related math
> +			 * is not precise in cpuidle core.
> +			 */

Is this due to the microsecond conversion mentioned above? It would be nice to
have it in the comment. Does

(powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
your rounding issues, assuming the issue is really rounding or maybe it is due
to the shift by 10, could you please elaborate on what related math is not
precise? That would explain to me why I missed understanding your changes.

> +			snooze_timeout += snooze_timeout / 20;

For now 5% is sufficient, but do you want to check to assert to check if

snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?

>  		}
>   	} else
>   		return -ENODEV;
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 07135e0..22de841 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
>  	} else
>  		return -ENODEV;
>  
> +	/*
> +	 * Staying in snooze for a long period can degrade the
> +	 * perfomance of the sibling cpus. Set timeout for snooze such
> +	 * that if the cpu stays in snooze longer than target residency
> +	 * of the next available idle state then exit from snooze. This
> +	 * gives a chance to the cpuidle governor to re-evaluate and
> +	 * promote it to deeper idle states.
> +	 */
>  	if (max_idle_state > 1) {
>  		snooze_timeout_en = true;
>  		snooze_timeout = cpuidle_state_table[1].target_residency *
>  				 tb_ticks_per_usec;
> +		/*
> +		 * Give a 5% margin since target residency related math
> +		 * is not precise in cpuidle core.
> +		 */
> +		snooze_timeout += snooze_timeout / 20;
>  	}
>  	return 0;
>  }
>
Shreyas B. Prabhu June 23, 2016, 4:58 a.m. UTC | #3
On 06/23/2016 05:18 AM, Balbir Singh wrote:
> 
> 
> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>> has a timeout so that if a cpu stays in snooze for more than target
>> residency of the next available idle state, then it would exit thereby
>> giving chance to the cpuidle governor to re-evaluate and
>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>> due to this timeout, its last_residency will be target_residency of next
>> deeper state.
>>
>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>> changed the math around last_residency calculation. Specifically, while
>> converting last_residency value from nanoseconds to microseconds it does
>> right shift by 10. Due to this, in snooze timeout exit scenarios
>> last_residency calculated is roughly 2.3% less than target_residency of
>> next available state. This pattern is picked up get_typical_interval()
>> in the menu governor and therefore expected_interval in menu_select() is
>> frequently less than the target_residency of any state but snooze.
>>
>> Due to this we are entering snooze at a higher rate, thereby affecting
>> the single thread performance.
>> Since the math around last_residency is not meant to be precise, fix this
>> issue setting snooze timeout to 105% of target_residency of next
>> available idle state.
>>
>> This also adds comment around why snooze timeout is necessary.
>>
>> Reported-by: Anton Blanchard <anton@samba.org>
>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>> ---
>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>  2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index e12dc30..5835491 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>  		cpuidle_state_table = powernv_states;
>>  		/* Device tree can indicate more idle states */
>>  		max_idle_state = powernv_add_idle_states();
>> +
>> +		/*
>> +		 * Staying in snooze for a long period can degrade the
>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>> +		 * that if the cpu stays in snooze longer than target residency
>> +		 * of the next available idle state then exit from snooze. This
>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>> +		 * promote it to deeper idle states.
>> +		 */
>>  		if (max_idle_state > 1) {
>>  			snooze_timeout_en = true;
>>  			snooze_timeout = powernv_states[1].target_residency *
>>  					 tb_ticks_per_usec;
>> +			/*
>> +			 * Give a 5% margin since target residency related math
>> +			 * is not precise in cpuidle core.
>> +			 */
> 
> Is this due to the microsecond conversion mentioned above? It would be nice to
> have it in the comment. Does
> 
> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
> your rounding issues, assuming the issue is really rounding or maybe it is due
> to the shift by 10, could you please elaborate on what related math is not
> precise? That would explain to me why I missed understanding your changes.
> 
>> +			snooze_timeout += snooze_timeout / 20;
> 
> For now 5% is sufficient, but do you want to check to assert to check if
> 
> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
> 

This is not a rounding issue. As I mentioned in the commit message, this
is because of the last_residency calculation in cpuidle.c.
To elaborate, last residency calculation is done in the following way
after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
local_clock()") -

cpuidle_enter_state()
{
	[...]
	time_start = local_clock();
	[enter idle state]
	time_end = local_clock();
	/*
         * local_clock() returns the time in nanosecond, let's shift
         * by 10 (divide by 1024) to have microsecond based time.
         */
        diff = (time_end - time_start) >> 10;
	[...]
	dev->last_residency = (int) diff;
}

Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%

In snooze timeout exit scenarios because of this, last_residency
calculated is 2.3% less than target_residency of next available state.
This affects get_typical_interval() in the menu governor and therefore
expected_interval in menu_select() is frequently less than the
target_residency of any state but snooze.


I'll expand the comments as you suggested.

Thanks,
Shreyas
Balbir Singh June 23, 2016, 9:28 a.m. UTC | #4
On 23/06/16 14:58, Shreyas B Prabhu wrote:
> 
> 
> On 06/23/2016 05:18 AM, Balbir Singh wrote:
>>
>>
>> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>>> has a timeout so that if a cpu stays in snooze for more than target
>>> residency of the next available idle state, then it would exit thereby
>>> giving chance to the cpuidle governor to re-evaluate and
>>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>>> due to this timeout, its last_residency will be target_residency of next
>>> deeper state.
>>>
>>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>>> changed the math around last_residency calculation. Specifically, while
>>> converting last_residency value from nanoseconds to microseconds it does
>>> right shift by 10. Due to this, in snooze timeout exit scenarios
>>> last_residency calculated is roughly 2.3% less than target_residency of
>>> next available state. This pattern is picked up get_typical_interval()
>>> in the menu governor and therefore expected_interval in menu_select() is
>>> frequently less than the target_residency of any state but snooze.
>>>
>>> Due to this we are entering snooze at a higher rate, thereby affecting
>>> the single thread performance.
>>> Since the math around last_residency is not meant to be precise, fix this
>>> issue setting snooze timeout to 105% of target_residency of next
>>> available idle state.
>>>
>>> This also adds comment around why snooze timeout is necessary.
>>>
>>> Reported-by: Anton Blanchard <anton@samba.org>
>>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>>> ---
>>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>>  2 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>> index e12dc30..5835491 100644
>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>>  		cpuidle_state_table = powernv_states;
>>>  		/* Device tree can indicate more idle states */
>>>  		max_idle_state = powernv_add_idle_states();
>>> +
>>> +		/*
>>> +		 * Staying in snooze for a long period can degrade the
>>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>>> +		 * that if the cpu stays in snooze longer than target residency
>>> +		 * of the next available idle state then exit from snooze. This
>>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>>> +		 * promote it to deeper idle states.
>>> +		 */
>>>  		if (max_idle_state > 1) {
>>>  			snooze_timeout_en = true;
>>>  			snooze_timeout = powernv_states[1].target_residency *
>>>  					 tb_ticks_per_usec;
>>> +			/*
>>> +			 * Give a 5% margin since target residency related math
>>> +			 * is not precise in cpuidle core.
>>> +			 */
>>
>> Is this due to the microsecond conversion mentioned above? It would be nice to
>> have it in the comment. Does
>>
>> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
>> your rounding issues, assuming the issue is really rounding or maybe it is due
>> to the shift by 10, could you please elaborate on what related math is not
>> precise? That would explain to me why I missed understanding your changes.
>>
>>> +			snooze_timeout += snooze_timeout / 20;
>>
>> For now 5% is sufficient, but do you want to check to assert to check if
>>
>> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
>>
> 
> This is not a rounding issue. As I mentioned in the commit message, this
> is because of the last_residency calculation in cpuidle.c.
> To elaborate, last residency calculation is done in the following way
> after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
> local_clock()") -
> 
> cpuidle_enter_state()
> {
> 	[...]
> 	time_start = local_clock();
> 	[enter idle state]
> 	time_end = local_clock();
> 	/*
>          * local_clock() returns the time in nanosecond, let's shift
>          * by 10 (divide by 1024) to have microsecond based time.
>          */
>         diff = (time_end - time_start) >> 10;
> 	[...]
> 	dev->last_residency = (int) diff;
> }
> 
> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%


This is still a rounding error but at a different site. I see we saved
a division by doing a >> 10, but we added it right back by doing a /20
later in the platform code. Shouldn't the rounding affect other
platforms as well? Can't we fix it in cpuidle_enter_state(). Division
by 1000 can be optimized if required (but rather not add that complexity).
Thanks for patiently explaining this

Balbir
Shreyas B. Prabhu June 23, 2016, 9:41 a.m. UTC | #5
On 06/23/2016 02:58 PM, Balbir Singh wrote:
> 
> 
> On 23/06/16 14:58, Shreyas B Prabhu wrote:
>>
>>
>> On 06/23/2016 05:18 AM, Balbir Singh wrote:
>>>
>>>
>>> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>>>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>>>> has a timeout so that if a cpu stays in snooze for more than target
>>>> residency of the next available idle state, then it would exit thereby
>>>> giving chance to the cpuidle governor to re-evaluate and
>>>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>>>> due to this timeout, its last_residency will be target_residency of next
>>>> deeper state.
>>>>
>>>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>>>> changed the math around last_residency calculation. Specifically, while
>>>> converting last_residency value from nanoseconds to microseconds it does
>>>> right shift by 10. Due to this, in snooze timeout exit scenarios
>>>> last_residency calculated is roughly 2.3% less than target_residency of
>>>> next available state. This pattern is picked up get_typical_interval()
>>>> in the menu governor and therefore expected_interval in menu_select() is
>>>> frequently less than the target_residency of any state but snooze.
>>>>
>>>> Due to this we are entering snooze at a higher rate, thereby affecting
>>>> the single thread performance.
>>>> Since the math around last_residency is not meant to be precise, fix this
>>>> issue setting snooze timeout to 105% of target_residency of next
>>>> available idle state.
>>>>
>>>> This also adds comment around why snooze timeout is necessary.
>>>>
>>>> Reported-by: Anton Blanchard <anton@samba.org>
>>>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>>>> ---
>>>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>>>  2 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>>> index e12dc30..5835491 100644
>>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>>>  		cpuidle_state_table = powernv_states;
>>>>  		/* Device tree can indicate more idle states */
>>>>  		max_idle_state = powernv_add_idle_states();
>>>> +
>>>> +		/*
>>>> +		 * Staying in snooze for a long period can degrade the
>>>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>>>> +		 * that if the cpu stays in snooze longer than target residency
>>>> +		 * of the next available idle state then exit from snooze. This
>>>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>>>> +		 * promote it to deeper idle states.
>>>> +		 */
>>>>  		if (max_idle_state > 1) {
>>>>  			snooze_timeout_en = true;
>>>>  			snooze_timeout = powernv_states[1].target_residency *
>>>>  					 tb_ticks_per_usec;
>>>> +			/*
>>>> +			 * Give a 5% margin since target residency related math
>>>> +			 * is not precise in cpuidle core.
>>>> +			 */
>>>
>>> Is this due to the microsecond conversion mentioned above? It would be nice to
>>> have it in the comment. Does
>>>
>>> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
>>> your rounding issues, assuming the issue is really rounding or maybe it is due
>>> to the shift by 10, could you please elaborate on what related math is not
>>> precise? That would explain to me why I missed understanding your changes.
>>>
>>>> +			snooze_timeout += snooze_timeout / 20;
>>>
>>> For now 5% is sufficient, but do you want to check to assert to check if
>>>
>>> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
>>>
>>
>> This is not a rounding issue. As I mentioned in the commit message, this
>> is because of the last_residency calculation in cpuidle.c.
>> To elaborate, last residency calculation is done in the following way
>> after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
>> local_clock()") -
>>
>> cpuidle_enter_state()
>> {
>> 	[...]
>> 	time_start = local_clock();
>> 	[enter idle state]
>> 	time_end = local_clock();
>> 	/*
>>          * local_clock() returns the time in nanosecond, let's shift
>>          * by 10 (divide by 1024) to have microsecond based time.
>>          */
>>         diff = (time_end - time_start) >> 10;
>> 	[...]
>> 	dev->last_residency = (int) diff;
>> }
>>
>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
> 
> 
> This is still a rounding error but at a different site. I see we saved
> a division by doing a >> 10, but we added it right back by doing a /20
> later in the platform code. 

While a >> 10 is done at every idle exit, div by 20 is done once during
boot, so this doesn't negate the previous optimization.

> Shouldn't the rounding affect other
> platforms as well? Can't we fix it in cpuidle_enter_state(). 

This does affect all platforms, but I'm guessing no other place relied
on the precision of last_residency calculations.
Daniel can perhaps comment on this.


> Division
> by 1000 can be optimized if required (but rather not add that complexity).
> Thanks for patiently explaining this
> 
> Balbir
>
Balbir Singh June 23, 2016, 9:55 a.m. UTC | #6
>> This is still a rounding error but at a different site. I see we saved
>> a division by doing a >> 10, but we added it right back by doing a /20
>> later in the platform code.
>
> While a >> 10 is done at every idle exit, div by 20 is done once during
> boot, so this doesn't negate the previous optimization.
>

Yes, fair point

>> Shouldn't the rounding affect other
>> platforms as well? Can't we fix it in cpuidle_enter_state().
>
> This does affect all platforms, but I'm guessing no other place relied
> on the precision of last_residency calculations.
> Daniel can perhaps comment on this.
>
Daniel Lezcano June 23, 2016, 10:01 a.m. UTC | #7
On 06/23/2016 11:28 AM, Balbir Singh wrote:

[ ... ]

>> cpuidle_enter_state()
>> {
>> 	[...]
>> 	time_start = local_clock();
>> 	[enter idle state]
>> 	time_end = local_clock();
>> 	/*
>>           * local_clock() returns the time in nanosecond, let's shift
>>           * by 10 (divide by 1024) to have microsecond based time.
>>           */
>>          diff = (time_end - time_start) >> 10;
>> 	[...]
>> 	dev->last_residency = (int) diff;
>> }
>>
>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%

I am surprised the last_residency is 2.3% exactly less. The difference 
between >>10 and /1000 is 2.34%.

What is the next target residency value ?

Does it solve the issue if you replace >>10 by /1000 ?
Shreyas B. Prabhu June 23, 2016, 1:35 p.m. UTC | #8
On 06/23/2016 03:31 PM, Daniel Lezcano wrote:
> On 06/23/2016 11:28 AM, Balbir Singh wrote:
> 
> [ ... ]
> 
>>> cpuidle_enter_state()
>>> {
>>>     [...]
>>>     time_start = local_clock();
>>>     [enter idle state]
>>>     time_end = local_clock();
>>>     /*
>>>           * local_clock() returns the time in nanosecond, let's shift
>>>           * by 10 (divide by 1024) to have microsecond based time.
>>>           */
>>>          diff = (time_end - time_start) >> 10;
>>>     [...]
>>>     dev->last_residency = (int) diff;
>>> }
>>>
>>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
> 
> I am surprised the last_residency is 2.3% exactly less. The difference
> between >>10 and /1000 is 2.34%.
> 
> What is the next target residency value ?
> 
Target residency of the next idle state is 100 microseconds.
When snooze times out after 100 microseconds, last_residency value
calculated is typically 97 or 98 microseconds.

> Does it solve the issue if you replace >>10 by /1000 ?
> 

Yes it does.

--Shreyas
Daniel Lezcano June 23, 2016, 2:36 p.m. UTC | #9
On 06/23/2016 03:35 PM, Shreyas B Prabhu wrote:
>
>
> On 06/23/2016 03:31 PM, Daniel Lezcano wrote:
>> On 06/23/2016 11:28 AM, Balbir Singh wrote:
>>
>> [ ... ]
>>
>>>> cpuidle_enter_state()
>>>> {
>>>>      [...]
>>>>      time_start = local_clock();
>>>>      [enter idle state]
>>>>      time_end = local_clock();
>>>>      /*
>>>>            * local_clock() returns the time in nanosecond, let's shift
>>>>            * by 10 (divide by 1024) to have microsecond based time.
>>>>            */
>>>>           diff = (time_end - time_start) >> 10;
>>>>      [...]
>>>>      dev->last_residency = (int) diff;
>>>> }
>>>>
>>>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
>>
>> I am surprised the last_residency is 2.3% exactly less. The difference
>> between >>10 and /1000 is 2.34%.
>>
>> What is the next target residency value ?
>>
> Target residency of the next idle state is 100 microseconds.
> When snooze times out after 100 microseconds, last_residency value
> calculated is typically 97 or 98 microseconds.

I see, the snooze exit is very fast.

>> Does it solve the issue if you replace >>10 by /1000 ?
>>
>
> Yes it does.

Ok. IMO, it would be cleaner to fix this in the core code.

   -- Daniel
diff mbox

Patch

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index e12dc30..5835491 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -268,10 +268,24 @@  static int powernv_idle_probe(void)
 		cpuidle_state_table = powernv_states;
 		/* Device tree can indicate more idle states */
 		max_idle_state = powernv_add_idle_states();
+
+		/*
+		 * Staying in snooze for a long period can degrade the
+		 * perfomance of the sibling cpus. Set timeout for snooze such
+		 * that if the cpu stays in snooze longer than target residency
+		 * of the next available idle state then exit from snooze. This
+		 * gives a chance to the cpuidle governor to re-evaluate and
+		 * promote it to deeper idle states.
+		 */
 		if (max_idle_state > 1) {
 			snooze_timeout_en = true;
 			snooze_timeout = powernv_states[1].target_residency *
 					 tb_ticks_per_usec;
+			/*
+			 * Give a 5% margin since target residency related math
+			 * is not precise in cpuidle core.
+			 */
+			snooze_timeout += snooze_timeout / 20;
 		}
  	} else
  		return -ENODEV;
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 07135e0..22de841 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -250,10 +250,23 @@  static int pseries_idle_probe(void)
 	} else
 		return -ENODEV;
 
+	/*
+	 * Staying in snooze for a long period can degrade the
+	 * perfomance of the sibling cpus. Set timeout for snooze such
+	 * that if the cpu stays in snooze longer than target residency
+	 * of the next available idle state then exit from snooze. This
+	 * gives a chance to the cpuidle governor to re-evaluate and
+	 * promote it to deeper idle states.
+	 */
 	if (max_idle_state > 1) {
 		snooze_timeout_en = true;
 		snooze_timeout = cpuidle_state_table[1].target_residency *
 				 tb_ticks_per_usec;
+		/*
+		 * Give a 5% margin since target residency related math
+		 * is not precise in cpuidle core.
+		 */
+		snooze_timeout += snooze_timeout / 20;
 	}
 	return 0;
 }