[v2] cpuidle: Fix last_residency division

Message ID	1466756638-2362-1-git-send-email-shreyas@linux.vnet.ibm.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <linuxppc-dev@lists.ozlabs.org> from <shreyas@linux.vnet.ibm.com>; Fri, 24 Jun 2016 02:24:09 -0600 Gateway: Authorized Use Only! Violators will be prosecuted; Fri, 24 Jun 2016 02:24:05 -0600 From: "Shreyas B. Prabhu" <shreyas@linux.vnet.ibm.com> To: rjw@rjwysocki.net Subject: [PATCH v2] cpuidle: Fix last_residency division Date: Fri, 24 Jun 2016 03:23:58 -0500 Message-Id: <1466756638-2362-1-git-send-email-shreyas@linux.vnet.ibm.com> Precedence: list Cc: linux-pm@vger.kernel.org, daniel.lezcano@linaro.org, anton@samba.org, "Shreyas B. Prabhu" <shreyas@linux.vnet.ibm.com>, linuxppc-dev@lists.ozlabs.org MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Sender: "Linuxppc-dev" <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

Shreyas B. Prabhu June 24, 2016, 8:23 a.m. UTC

Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a cpu stays in snooze for more than target
residency of the next available idle state, then it would exit thereby
giving chance to the cpuidle governor to re-evaluate and
promote the cpu to a deeper idle state. Therefore whenever snooze exits
due to this timeout, its last_residency will be target_residency of next
deeper state.

commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
changed the math around last_residency calculation. Specifically, while
converting last_residency value from nanoseconds to microseconds it does
right shift by 10. Due to this, in snooze timeout exit scenarios
last_residency calculated is roughly 2.3% less than target_residency of
next available state. This pattern is picked up get_typical_interval()
in the menu governor and therefore expected_interval in menu_select() is
frequently less than the target_residency of any state but snooze.

Due to this we are entering snooze at a higher rate, thereby affecting
the single thread performance.

Fix this by replacing right shift by 10 with /1000 while calculating
last_residency.

Reported-by: Anton Blanchard <anton@samba.org>
Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
---
Changes in v2

David Laight June 24, 2016, 9 a.m. UTC | #1

From: Shreyas B. Prabhu

> Sent: 24 June 2016 09:24

>

> Snooze is a poll idle state in powernv and pseries platforms. Snooze

> has a timeout so that if a cpu stays in snooze for more than target

> residency of the next available idle state, then it would exit thereby

> giving chance to the cpuidle governor to re-evaluate and

> promote the cpu to a deeper idle state. Therefore whenever snooze exits

> due to this timeout, its last_residency will be target_residency of next

> deeper state.

> 

> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")

> changed the math around last_residency calculation. Specifically, while

> converting last_residency value from nanoseconds to microseconds it does

> right shift by 10. Due to this, in snooze timeout exit scenarios

> last_residency calculated is roughly 2.3% less than target_residency of

> next available state. This pattern is picked up get_typical_interval()

> in the menu governor and therefore expected_interval in menu_select() is

> frequently less than the target_residency of any state but snooze.

> 

> Due to this we are entering snooze at a higher rate, thereby affecting

> the single thread performance.

> 

> Fix this by replacing right shift by 10 with /1000 while calculating

> last_residency.

> 

> Reported-by: Anton Blanchard <anton@samba.org>

> Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>

> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>

> ---

> Changes in v2

> =============

>  - Fixing it in the cpuidle core code instead of driver code.

> 

>  drivers/cpuidle/cpuidle.c | 6 +++---

>  1 file changed, 3 insertions(+), 3 deletions(-)

> 

> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c

> index a4d0059..30d67a8 100644

> --- a/drivers/cpuidle/cpuidle.c

> +++ b/drivers/cpuidle/cpuidle.c

> @@ -218,10 +218,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,

>  		local_irq_enable();

> 

>  	/*

> -	 * local_clock() returns the time in nanosecond, let's shift

> -	 * by 10 (divide by 1024) to have microsecond based time.

> +	 * local_clock() returns the time in nanosecond, let's

> +	 * divide by 1000 to have microsecond based time.

>  	 */

> -	diff = (time_end - time_start) >> 10;

> +	diff = (time_end - time_start) / 1000;

>  	if (diff > INT_MAX)

>  		diff = INT_MAX;


The intent of the >> 10 was probably to avoid an expensive 64bit divide.
So maybe something like:
	diff = time_end - time_start;
	if (diff >= INT_MAX/2)
		diff_32 = INT_MAX/2/1000;
	else
		diff_32 = diff;
		diff_32 += diff_32 >> 6;
		diff_32 >>= 10;
	}

Adding an extra 1/32 makes the division by be something slightly below 1000.

	David

kernel test robot June 24, 2016, 9:30 a.m. UTC | #2

Hi,

[auto build test ERROR on pm/linux-next]
[also build test ERROR on v4.7-rc4 next-20160624]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shreyas-B-Prabhu/cpuidle-Fix-last_residency-division/20160624-162633
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: i386-defconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `cpuidle_enter_state':
>> (.text+0x31e0d2): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Daniel Lezcano June 24, 2016, 10:05 a.m. UTC | #3

On 06/24/2016 11:00 AM, David Laight wrote:
> From: Shreyas B. Prabhu
>> Sent: 24 June 2016 09:24
>>
>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>> has a timeout so that if a cpu stays in snooze for more than target
>> residency of the next available idle state, then it would exit thereby
>> giving chance to the cpuidle governor to re-evaluate and
>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>> due to this timeout, its last_residency will be target_residency of next
>> deeper state.
>>
>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>> changed the math around last_residency calculation. Specifically, while
>> converting last_residency value from nanoseconds to microseconds it does
>> right shift by 10. Due to this, in snooze timeout exit scenarios
>> last_residency calculated is roughly 2.3% less than target_residency of
>> next available state. This pattern is picked up get_typical_interval()
>> in the menu governor and therefore expected_interval in menu_select() is
>> frequently less than the target_residency of any state but snooze.
>>
>> Due to this we are entering snooze at a higher rate, thereby affecting
>> the single thread performance.
>>
>> Fix this by replacing right shift by 10 with /1000 while calculating
>> last_residency.
>>
>> Reported-by: Anton Blanchard <anton@samba.org>
>> Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>> ---
>> Changes in v2
>> =============
>>   - Fixing it in the cpuidle core code instead of driver code.
>>
>>   drivers/cpuidle/cpuidle.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index a4d0059..30d67a8 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -218,10 +218,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>>   		local_irq_enable();
>>
>>   	/*
>> -	 * local_clock() returns the time in nanosecond, let's shift
>> -	 * by 10 (divide by 1024) to have microsecond based time.
>> +	 * local_clock() returns the time in nanosecond, let's
>> +	 * divide by 1000 to have microsecond based time.
>>   	 */
>> -	diff = (time_end - time_start) >> 10;
>> +	diff = (time_end - time_start) / 1000;

do_div ?

>>   	if (diff > INT_MAX)
>>   		diff = INT_MAX;
>
> The intent of the >> 10 was probably to avoid an expensive 64bit divide.
> So maybe something like:
> 	diff = time_end - time_start;
> 	if (diff >= INT_MAX/2)
> 		diff_32 = INT_MAX/2/1000;
> 	else
> 		diff_32 = diff;
> 		diff_32 += diff_32 >> 6;
> 		diff_32 >>= 10;
> 	}
>
> Adding an extra 1/32 makes the division by be something slightly below 1000.

Arnd Bergmann June 24, 2016, 10:11 a.m. UTC | #4

On Friday, June 24, 2016 9:00:48 AM CEST David Laight wrote:
> The intent of the >> 10 was probably to avoid an expensive 64bit divide.
> So maybe something like:
>         diff = time_end - time_start;
>         if (diff >= INT_MAX/2)
>                 diff_32 = INT_MAX/2/1000;
>         else
>                 diff_32 = diff;
>                 diff_32 += diff_32 >> 6;
>                 diff_32 >>= 10;
>         }
> 
> Adding an extra 1/32 makes the division by be something slightly below 1000.

Why not change the definition of the time to nanoseconds and update
the users accordingly?

I see that cpuidle_enter_state() writes to the last_residency value,
and it is read in three places: ladder_select_state(), menu_update()
and show_state_time() (for sysfs).

If those functions are called less often than cpuidle_enter_state(),
we could just move the division there. Since the divisor is constant,
do_div() can convert it into a multiply and shift, or we could use
your the code you suggest above, or use a 32-bit division most of
the time:

	if (diff <= UINT_MAX)
		diff_32 = (u32)diff / NSECS_PER_USEC;
	else
		diff_32 = div_u64(diff, NSECS_PER_USEC;

which gcc itself will turn into a multiplication or series of
shifts on CPUs on which that is faster.

	Arnd

kernel test robot June 24, 2016, 10:27 a.m. UTC | #5

Hi,

[auto build test ERROR on pm/linux-next]
[also build test ERROR on v4.7-rc4 next-20160624]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shreyas-B-Prabhu/cpuidle-Fix-last_residency-division/20160624-162633
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: sh-sh7785lcr_32bit_defconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `cpuidle_enter_state':
>> drivers/cpuidle/cpuidle.c:242: undefined reference to `__udivdi3'

vim +242 drivers/cpuidle/cpuidle.c

56cfbf74a Colin Cross    2012-05-07  236  		dev->states_usage[entered_state].usage++;
56cfbf74a Colin Cross    2012-05-07  237  	} else {
56cfbf74a Colin Cross    2012-05-07  238  		dev->last_residency = 0;
56cfbf74a Colin Cross    2012-05-07  239  	}
56cfbf74a Colin Cross    2012-05-07  240  
56cfbf74a Colin Cross    2012-05-07  241  	return entered_state;
56cfbf74a Colin Cross    2012-05-07 @242  }
56cfbf74a Colin Cross    2012-05-07  243  
56cfbf74a Colin Cross    2012-05-07  244  /**
907e30f1b Daniel Lezcano 2014-03-03  245   * cpuidle_select - ask the cpuidle framework to choose an idle state

:::::: The code at line 242 was first introduced by commit
:::::: 56cfbf74a17c40f3a741398103c9f5d5a6806715 cpuidle: refactor out cpuidle_enter_state

:::::: TO: Colin Cross <ccross@android.com>
:::::: CC: Len Brown <len.brown@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

kernel test robot June 24, 2016, 12:10 p.m. UTC | #6

Hi,

[auto build test ERROR on pm/linux-next]
[also build test ERROR on v4.7-rc4 next-20160623]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shreyas-B-Prabhu/cpuidle-Fix-last_residency-division/20160624-162633
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: arm-multi_v5_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `cpuidle_enter_state':
>> drivers/cpuidle/cpuidle.c:224: undefined reference to `__aeabi_uldivmod'

vim +224 drivers/cpuidle/cpuidle.c

   218			local_irq_enable();
   219	
   220		/*
   221		 * local_clock() returns the time in nanosecond, let's
   222		 * divide by 1000 to have microsecond based time.
   223		 */
 > 224		diff = (time_end - time_start) / 1000;
   225		if (diff > INT_MAX)
   226			diff = INT_MAX;
   227	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Shreyas B. Prabhu June 24, 2016, 4:01 p.m. UTC | #7

On 06/24/2016 03:41 PM, Arnd Bergmann wrote:
> On Friday, June 24, 2016 9:00:48 AM CEST David Laight wrote:
>> The intent of the >> 10 was probably to avoid an expensive 64bit divide.
>> So maybe something like:
>>         diff = time_end - time_start;
>>         if (diff >= INT_MAX/2)
>>                 diff_32 = INT_MAX/2/1000;
>>         else
>>                 diff_32 = diff;
>>                 diff_32 += diff_32 >> 6;
>>                 diff_32 >>= 10;
>>         }
>>
>> Adding an extra 1/32 makes the division by be something slightly below 1000.
> 
> Why not change the definition of the time to nanoseconds and update
> the users accordingly?
> 
> I see that cpuidle_enter_state() writes to the last_residency value,
> and it is read in three places: ladder_select_state(), menu_update()
> and show_state_time() (for sysfs).
> 

Updating show_state_time() to convert ns to us gets bit messy since
show_state_##_name which is used for disable, usage and time simply
returns state_usage->_name. And state_usage->time updation happens in
cpuidle_enter_state() itself.

> If those functions are called less often than cpuidle_enter_state(),
> we could just move the division there. Since the divisor is constant,
> do_div() can convert it into a multiply and shift, or we could use
> your the code you suggest above, or use a 32-bit division most of
> the time:
> 
> 	if (diff <= UINT_MAX)
> 		diff_32 = (u32)diff / NSECS_PER_USEC;
> 	else
> 		diff_32 = div_u64(diff, NSECS_PER_USEC;
> 
> which gcc itself will turn into a multiplication or series of
> shifts on CPUs on which that is faster.
>
I'm not sure which division method of the three suggested here to use.
Does anyone have a strong preference?

--Shreyas

Arnd Bergmann June 24, 2016, 7:43 p.m. UTC | #8

On Friday, June 24, 2016 9:31:35 PM CEST Shreyas B Prabhu wrote:
> > If those functions are called less often than cpuidle_enter_state(),
> > we could just move the division there. Since the divisor is constant,
> > do_div() can convert it into a multiply and shift, or we could use
> > your the code you suggest above, or use a 32-bit division most of
> > the time:
> > 
> >       if (diff <= UINT_MAX)
> >               diff_32 = (u32)diff / NSECS_PER_USEC;
> >       else
> >               diff_32 = div_u64(diff, NSECS_PER_USEC;
> > 
> > which gcc itself will turn into a multiplication or series of
> > shifts on CPUs on which that is faster.
> >
> I'm not sure which division method of the three suggested here to use.
> Does anyone have a strong preference?
> 

It depends on how accurate we want it and how long we expect
the times to be. The optimization for the 4.2 second cutoff
for doing a 32-bit division only makes sense if the majority
of the sleep times are below that.

	Arnd

David Laight June 27, 2016, 8:59 a.m. UTC | #9

From: Arnd Bergmann
> Sent: 24 June 2016 20:43
> On Friday, June 24, 2016 9:31:35 PM CEST Shreyas B Prabhu wrote:
> > > If those functions are called less often than cpuidle_enter_state(),
> > > we could just move the division there. Since the divisor is constant,
> > > do_div() can convert it into a multiply and shift, or we could use
> > > your the code you suggest above, or use a 32-bit division most of
> > > the time:
> > >
> > >       if (diff <= UINT_MAX)
> > >               diff_32 = (u32)diff / NSECS_PER_USEC;
> > >       else
> > >               diff_32 = div_u64(diff, NSECS_PER_USEC;
> > >
> > > which gcc itself will turn into a multiplication or series of
> > > shifts on CPUs on which that is faster.
> > >
> > I'm not sure which division method of the three suggested here to use.
> > Does anyone have a strong preference?
> >
> 
> It depends on how accurate we want it and how long we expect
> the times to be. The optimization for the 4.2 second cutoff
> for doing a 32-bit division only makes sense if the majority
> of the sleep times are below that.

It also depends if the code actually cares about the length of 'long' sleeps.
I'd guess that for cpu idle 4.2 seconds is 'a long time', so the div_u64()
result could be treated as 4.2 seconds without causing grief.

Actually the cost of a 64bit divide after a 4 second sleep will be noise.
OTOH a 64bit divide after a sleep that lasted a few ns will be significant.

	David

Shreyas B. Prabhu June 29, 2016, 7 a.m. UTC | #10

On 06/27/2016 02:29 PM, David Laight wrote:
> From: Arnd Bergmann
>> Sent: 24 June 2016 20:43
>> On Friday, June 24, 2016 9:31:35 PM CEST Shreyas B Prabhu wrote:
>>>> If those functions are called less often than cpuidle_enter_state(),
>>>> we could just move the division there. Since the divisor is constant,
>>>> do_div() can convert it into a multiply and shift, or we could use
>>>> your the code you suggest above, or use a 32-bit division most of
>>>> the time:
>>>>
>>>>       if (diff <= UINT_MAX)
>>>>               diff_32 = (u32)diff / NSECS_PER_USEC;
>>>>       else
>>>>               diff_32 = div_u64(diff, NSECS_PER_USEC;
>>>>
>>>> which gcc itself will turn into a multiplication or series of
>>>> shifts on CPUs on which that is faster.
>>>>
>>> I'm not sure which division method of the three suggested here to use.
>>> Does anyone have a strong preference?
>>>
>>
>> It depends on how accurate we want it and how long we expect
>> the times to be. The optimization for the 4.2 second cutoff
>> for doing a 32-bit division only makes sense if the majority
>> of the sleep times are below that.
> 
> It also depends if the code actually cares about the length of 'long' sleeps.
> I'd guess that for cpu idle 4.2 seconds is 'a long time', so the div_u64()
> result could be treated as 4.2 seconds without causing grief.
> 
> Actually the cost of a 64bit divide after a 4 second sleep will be noise.
> OTOH a 64bit divide after a sleep that lasted a few ns will be significant.
> 
Agreed. I'll use the code you suggested, with a small change-
Using diff_32 += diff_32 >> 5 instead of diff_32 += diff_32 >> 6
since I want to err on the side of last_residency being more than actual.

And for long sleep cases, I'll use div_u64().

Thanks,
Shreyas

[v2] cpuidle: Fix last_residency division

Commit Message

Comments

Patch