
[2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

Message ID 1460484386-28389-3-git-send-email-akshay.adiga@linux.vnet.ibm.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Akshay Adiga April 12, 2016, 6:06 p.m. UTC
This patch ramps the global pstate down at a slower rate than the local
pstate. The frequency transition latency from pmin to pmax is observed
to be on the order of a few milliseconds, so a sudden frequency ramp-up
incurs a performance penalty. Holding the global pstate higher than the
local pstate makes subsequent ramp-ups faster.

A per-policy structure is maintained to keep track of the global and
local pstate changes. The global pstate is brought down using a
parabolic equation. The ramp-down time to pmin is about 5 seconds
(MAX_RAMP_DOWN_TIME = 5120 ms). To make sure that the global pstate is
dropped at regular intervals, a timer is queued every 2 seconds, which
eventually brings the pstate down to the local pstate.

Iozone results show fairly consistent performance boost.
YCSB on redis shows improved Max latencies in most cases.

Iozone write/rewrite tests were run with file sizes of 200704 KB and
401408 KB and varying record sizes. The following table shows I/O
operations/sec with and without the patch.

Iozone Results ( in op/sec) ( mean over 3 iterations )
------------------------------------
file size-	   		with     	without
recordsize-IOtype      		patch    	patch 		 % change
----------------------------------------------------------------------
200704-1-SeqWrite		1616532 	1615425 	0.06
200704-1-Rewrite		2423195  	2303130 	5.21
200704-2-SeqWrite		1628577 	1602620 	1.61
200704-2-Rewrite		2428264  	2312154 	5.02
200704-4-SeqWrite		1617605 	1617182 	0.02
200704-4-Rewrite		2430524  	2351238 	3.37
200704-8-SeqWrite		1629478 	1600436 	1.81
200704-8-Rewrite		2415308  	2298136 	5.09
200704-16-SeqWrite		1619632 	1618250 	0.08
200704-16-Rewrite		2396650 	2352591 	1.87
200704-32-SeqWrite		1632544 	1598083 	2.15
200704-32-Rewrite		2425119 	2329743 	4.09
200704-64-SeqWrite		1617812 	1617235 	0.03
200704-64-Rewrite		2402021 	2321080 	3.48
200704-128-SeqWrite		1631998 	1600256 	1.98
200704-128-Rewrite		2422389		2304954 	5.09
200704-256-SeqWrite		1617065		1616962 	0.00
200704-256-Rewrite		2432539 	2301980 	5.67
200704-512-SeqWrite		1632599 	1598656 	2.12
200704-512-Rewrite		2429270		2323676 	4.54
200704-1024-SeqWrite		1618758		1616156 	0.16
200704-1024-Rewrite		2431631		2315889 	4.99
401408-1-SeqWrite		1631479  	1608132 	1.45
401408-1-Rewrite		2501550  	2459409 	1.71
401408-2-SeqWrite		1617095 	1626069 	-0.55
401408-2-Rewrite		2507557  	2443621 	2.61
401408-4-SeqWrite		1629601 	1611869		1.10
401408-4-Rewrite		2505909  	2462098 	1.77
401408-8-SeqWrite  		1617110 	1626968 	-0.60
401408-8-Rewrite		2512244  	2456827 	2.25
401408-16-SeqWrite		1632609 	1609603 	1.42
401408-16-Rewrite		2500792 	2451405 	2.01
401408-32-SeqWrite		1619294 	1628167 	-0.54
401408-32-Rewrite		2510115		2451292 	2.39
401408-64-SeqWrite		1632709		1603746 	1.80
401408-64-Rewrite		2506692 	2433186 	3.02
401408-128-SeqWrite		1619284		1627461 	-0.50
401408-128-Rewrite		2518698 	2453361 	2.66
401408-256-SeqWrite		1634022 	1610681 	1.44
401408-256-Rewrite		2509987 	2446328 	2.60
401408-512-SeqWrite 		1617524 	1628016 	-0.64
401408-512-Rewrite		2504409 	2442899 	2.51
401408-1024-SeqWrite		1629812 	1611566 	1.13
401408-1024-Rewrite 		2507620		 2442968 	2.64

Tested with the YCSB workloada benchmark over redis, with 1 million
records and 1 million operations. Each test was carried out with a
target operations-per-second rate and with persistence disabled.

Max-latency (in us)( mean over 5 iterations )
-----------------------------------------------------------
op/s	Operation	with patch	without patch	%change
------------------------------------------------------------
15000	Read		61480.6		50261.4		22.32
15000	cleanup		215.2		293.6		-26.70
15000	update		25666.2		25163.8		2.00

25000	Read		32626.2		89525.4		-63.56
25000	cleanup		292.2		263.0		11.10
25000	update		32293.4		90255.0		-64.22

35000	Read		34783.0		33119.0		5.02
35000	cleanup		321.2		395.8		-18.8
35000	update		36047.0		38747.8		-6.97

40000	Read		38562.2		42357.4		-8.96
40000	cleanup		371.8		384.6		-3.33
40000	update		27861.4		41547.8		-32.94

45000	Read		42271.0		88120.6		-52.03
45000	cleanup		263.6		383.0		-31.17
45000	update		29755.8		81359.0		-63.43

(test without target op/s)
47659	Read		83061.4		136440.6	-39.12
47659	cleanup		195.8		193.8		1.03
47659	update		73429.4		124971.8	-41.24

Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c | 239 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 237 insertions(+), 2 deletions(-)

Comments

Viresh Kumar April 13, 2016, 5:03 a.m. UTC | #1
Comments mostly on the coding standards which you have *not* followed.

Also, please run checkpatch --strict next time you send patches
upstream.

On 12-04-16, 23:36, Akshay Adiga wrote:
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> +#define MAX_RAMP_DOWN_TIME				5120
> +#define ramp_down_percent(time)		((time * time)>>18)

You need spaces around >>

> +
> +/*Interval after which the timer is queued to bring down global pstate*/

Bad comment style, should be /* ... */

> +#define GPSTATE_TIMER_INTERVAL				2000
> +/*
> + * global_pstate_info :

This is bad as well.. See how doc style comments are written at
separate places.

> + *	per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down

No space required before :

> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last set
> + * @last_lpstate , last_gpstate : last set values for local and global pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *	global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *	called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> +	int highest_lpstate;
> +	unsigned int elapsed_time;
> +	unsigned int last_sampled_time;
> +	int last_lpstate;
> +	int last_gpstate;
> +	spinlock_t gpstate_lock;
> +	struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))

That's super *ugly*. Why don't you create a simple routine which will
set the 5 integer variables to 0 in a straightforward way?

> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>  	unsigned long val;
>  	unsigned long pstate_ul =
>  		((struct powernv_smp_call_data *) freq_data)->pstate_id;
> +	unsigned long gpstate_ul =
> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;

Remove these unnecessary casts and do:

struct powernv_smp_call_data *freq_data = data; //Name func arg as data

And then use freq_data->*.

>  
>  	val = get_pmspr(SPRN_PMCR);
>  	val = val & 0x0000FFFFFFFFFFFFULL;
>  
>  	pstate_ul = pstate_ul & 0xFF;
> +	gpstate_ul = gpstate_ul & 0xFF;
>  
>  	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
> -	val = val | (pstate_ul << 56) | (pstate_ul << 48);
> +	val = val | (gpstate_ul << 56) | (pstate_ul << 48);

>  /*

You need /** here. See comments everywhere please, they are mostly
done in a wrong way.

> + * calcuate_global_pstate:
> + *
> + * @elapsed_time : elapsed time in milliseconds
> + * @local_pstate : new local pstate
> + * @highest_lpstate : pstate from which its ramping down
> + *
> + * Finds the appropriate global pstate based on the pstate from which its
> + * ramping down and the time elapsed in ramping down. It follows a quadratic
> + * equation which ensures that it reaches ramping down to pmin in 5sec.
> + */
> +inline int calculate_global_pstate(unsigned int elapsed_time,
> +		int highest_lpstate, int local_pstate)

Not aligned properly, checkpatch --strict will tell you what to do.

> +{
> +	int pstate_diff;
> +
> +	/*
> +	 * Using ramp_down_percent we get the percentage of rampdown
> +	 * that we are expecting to be dropping. Difference between
> +	 * highest_lpstate and powernv_pstate_info.min will give a absolute
> +	 * number of how many pstates we will drop eventually by the end of
> +	 * 5 seconds, then just scale it get the number pstates to be dropped.
> +	 */
> +	pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
> +			(highest_lpstate - powernv_pstate_info.min))/100;
> +
> +	/* Ensure that global pstate is >= to local pstate */
> +	if (highest_lpstate - pstate_diff < local_pstate)
> +		return local_pstate;
> +	else
> +		return (highest_lpstate - pstate_diff);

Unnecessary ().

> +}
> +
> +inline int queue_gpstate_timer(struct global_pstate_info *gpstates)

Looks like many of the function you wrote now should be marked
'static' as well.

> +{
> +	unsigned int timer_interval;
> +
> +	/* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But

Bad style again:

/*
 * ...
 * ...
 */

> +	 * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> +	 * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
> +	 * seconds of ramp down time.
> +	 */
> +	if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
> +							 > MAX_RAMP_DOWN_TIME)

Align '>' right below the second (

> +		timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
> +	else
> +		timer_interval = GPSTATE_TIMER_INTERVAL;
> +
> +	return  mod_timer_pinned(&(gpstates->timer), jiffies +

() not required.

> +				msecs_to_jiffies(timer_interval));
> +}

blank line required.

> +/*
> + * gpstate_timer_handler
> + *
> + * @data: pointer to cpufreq_policy on which timer was queued
> + *
> + * This handler brings down the global pstate closer to the local pstate
> + * according quadratic equation. Queues a new timer if it is still not equal
> + * to local pstate
> + */
> +void gpstate_timer_handler(unsigned long data)
> +{
> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;

no need to cast.

> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +							policy->driver_data;

same here.

> +	unsigned int time_diff = jiffies_to_msecs(jiffies)
> +					- gpstates->last_sampled_time;
> +	struct powernv_smp_call_data freq_data;
> +	int ret;
> +
> +	ret = spin_trylock(&gpstates->gpstate_lock);

no need of 'ret' for just this, simply do: if (!spin_trylock())...

> +	if (!ret)
> +		return;
> +
> +	gpstates->last_sampled_time += time_diff;
> +	gpstates->elapsed_time += time_diff;
> +	freq_data.pstate_id = gpstates->last_lpstate;
> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
> +			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +	} else {
> +		freq_data.gpstate_id = calculate_global_pstate(

You can't break a line after ( of a function call :)

Let it go beyond 80 columns if it has to.

> +			gpstates->elapsed_time, gpstates->highest_lpstate,
> +			freq_data.pstate_id);
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over

Bad style again.

> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);

ret not used.

> +
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
> +	/* Timer may get migrated to a different cpu on cpu hot unplug */
> +	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock(&gpstates->gpstate_lock);
> +}
> +
> +

Remove extra blank line.

> +/*
>   * powernv_cpufreq_target_index: Sets the frequency corresponding to
>   * the cpufreq table entry indexed by new_index on the cpus in the
>   * mask policy->cpus
> @@ -432,23 +585,88 @@ next:
>  static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>  					unsigned int new_index)
>  {
> +	int ret;
>  	struct powernv_smp_call_data freq_data;
> -
> +	unsigned int cur_msec;
> +	unsigned long flags;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +						policy->driver_data;
>  	if (unlikely(rebooting) && new_index != get_nominal_index())
>  		return 0;
>  
>  	if (!throttled)
>  		powernv_cpufreq_throttle_check(NULL);
>  
> +	cur_msec = jiffies_to_msecs(get_jiffies_64());
> +
> +	/*spinlock taken*/

Drop these useless comments, we know what you are doing.

> +	spin_lock_irqsave(&gpstates->gpstate_lock, flags);
>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>  
> +	/*First time call */
> +	if (!gpstates->last_sampled_time) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		goto gpstates_done;
> +	}
> +
> +	/*Ramp down*/

Rather explain what you are doing and how you are doing it here.

> +	if (gpstates->last_gpstate > freq_data.pstate_id) {
> +		gpstates->elapsed_time += cur_msec -
> +					gpstates->last_sampled_time;
> +		/* If its has been ramping down for more than 5seconds
> +		 * we should be reseting all global pstate related data.
> +		 * Set it equal to local pstate to start fresh.
> +		 */
> +		if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +			reset_gpstates(policy);
> +			gpstates->highest_lpstate = freq_data.pstate_id;
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +		} else {
> +		/* elaspsed_time is less than 5 seconds, continue to rampdown*/
> +			freq_data.gpstate_id = calculate_global_pstate(
> +			gpstates->elapsed_time,
> +			gpstates->highest_lpstate, freq_data.pstate_id);
> +
> +		}
> +

remove blank line.

> +	} else {
> +		/*Ramp up*/
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over
> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);

add a blank line here

> +gpstates_done:
> +	gpstates->last_sampled_time = cur_msec;
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
>  	/*
>  	 * Use smp_call_function to send IPI and execute the
>  	 * mtspr on target CPU.  We could do that without IPI
>  	 * if current CPU is within policy->cpus (core)
>  	 */
>  	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
> +	return 0;
> +}
>  
> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)

Add this after the init() routine.

> +{
> +	int base;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)

Unnecessary cast.

> +						policy->driver_data;
> +	base = cpu_first_thread_sibling(policy->cpu);
> +	del_timer_sync(&gpstates->timer);
> +	kfree(policy->driver_data);
> +	pr_info("freed driver_data cpu %d\n", base);

may be a blank line here.

>  	return 0;
>  }
>  
> @@ -456,6 +674,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  {
>  	int base, i;
>  	struct kernfs_node *kn;
> +	struct global_pstate_info *gpstates;
>  
>  	base = cpu_first_thread_sibling(policy->cpu);
>  
> @@ -475,6 +694,21 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	} else {
>  		kernfs_put(kn);
>  	}

blank line here.

> +	gpstates =  kzalloc(sizeof(struct global_pstate_info), GFP_KERNEL);

sizeof(*gpstates).

> +	if (!gpstates) {
> +		pr_err("Could not allocate global_pstate_info\n");

print not required

> +		return -ENOMEM;
> +	}

blank line here

> +	policy->driver_data = gpstates;
> +
> +	/* initialize timer */
> +	init_timer_deferrable(&gpstates->timer);
> +	gpstates->timer.data = (unsigned long) policy;
> +	gpstates->timer.function = gpstate_timer_handler;
> +	gpstates->timer.expires = jiffies +
> +				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
> +
> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>  	return cpufreq_table_validate_and_show(policy, powernv_freqs);

Who will free gpstates if this fails ?

>  }
>  
> @@ -612,6 +846,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  	.name		= "powernv-cpufreq",
>  	.flags		= CPUFREQ_CONST_LOOPS,
>  	.init		= powernv_cpufreq_cpu_init,
> +	.exit		= powernv_cpufreq_cpu_exit,
>  	.verify		= cpufreq_generic_frequency_table_verify,
>  	.target_index	= powernv_cpufreq_target_index,
>  	.get		= powernv_cpufreq_get,
> -- 
> 2.5.5
Akshay Adiga April 13, 2016, 5:57 p.m. UTC | #2
Hi Viresh,

Thanks for reviewing in detail.
I will address all the coding-standard comments in the next version of the patch.

On 04/13/2016 10:33 AM, Viresh Kumar wrote:

> Comments mostly on the coding standards which you have *not* followed.
>
> Also, please run checkpatch --strict next time you send patches
> upstream.

Thanks for pointing out the --strict option; I was not aware of it. I will
run checkpatch --strict on the next versions.

> On 12-04-16, 23:36, Akshay Adiga wrote:
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))
> That's super *ugly*. Why don't you create a simple routine which will
> set the 5 integer variables to 0 in a straight forward way ?

Yes, I will create a routine.

>> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>>   	unsigned long val;
>>   	unsigned long pstate_ul =
>>   		((struct powernv_smp_call_data *) freq_data)->pstate_id;
>> +	unsigned long gpstate_ul =
>> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
> Remove these unnecessary casts and do:
>
> struct powernv_smp_call_data *freq_data = data; //Name func arg as data
>
> And then use freq_data->*.

Ok. Will do that.

>> +/*
>> + * gpstate_timer_handler
>> + *
>> + * @data: pointer to cpufreq_policy on which timer was queued
>> + *
>> + * This handler brings down the global pstate closer to the local pstate
>> + * according quadratic equation. Queues a new timer if it is still not equal
>> + * to local pstate
>> + */
>> +void gpstate_timer_handler(unsigned long data)
>> +{
>> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> no need to cast.

I think a cast is needed here, because data is unsigned long (unlike other
places, where it is void *). Building without the cast produces a warning.

>> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
>> +	struct powernv_smp_call_data freq_data;
>> +	int ret;
>> +
>> +	ret = spin_trylock(&gpstates->gpstate_lock);
> no need of 'ret' for just this, simply do: if (!spin_trylock())...

Sure will do that.

>> +	if (!ret)
>> +		return;
>> +
>> +	gpstates->last_sampled_time += time_diff;
>> +	gpstates->elapsed_time += time_diff;
>> +	freq_data.pstate_id = gpstates->last_lpstate;
>> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
>> +			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
>> +		freq_data.gpstate_id = freq_data.pstate_id;
>> +		reset_gpstates(policy);
>> +		gpstates->highest_lpstate = freq_data.pstate_id;
>> +	} else {
>> +		freq_data.gpstate_id = calculate_global_pstate(
> You can't break a line after ( of a function call :)
>
> Let it go beyond 80 columns if it has to.

Maybe I can keep it within 80 columns by using a temporary variable instead
of freq_data.gpstate_id.

>> +			gpstates->elapsed_time, gpstates->highest_lpstate,
>> +			freq_data.pstate_id);
>> +	}
>> +
>> +	/* If local pstate is equal to global pstate, rampdown is over
> Bad style again.
>
>> +	 * So timer is not required to be queued.
>> +	 */
>> +	if (freq_data.gpstate_id != freq_data.pstate_id)
>> +		ret = queue_gpstate_timer(gpstates);
> ret not used.

Should I make it void instead of returning int? I cannot do much even if it
fails, except notify.

>> +gpstates_done:
>> +	gpstates->last_sampled_time = cur_msec;
>> +	gpstates->last_gpstate = freq_data.gpstate_id;
>> +	gpstates->last_lpstate = freq_data.pstate_id;
>> +
>>   	/*
>>   	 * Use smp_call_function to send IPI and execute the
>>   	 * mtspr on target CPU.  We could do that without IPI
>>   	 * if current CPU is within policy->cpus (core)
>>   	 */
>>   	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
>> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
>> +	return 0;
>> +}
>>   
>> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> Add this after the init() routine.

Ok will do it.

>> +	policy->driver_data = gpstates;
>> +
>> +	/* initialize timer */
>> +	init_timer_deferrable(&gpstates->timer);
>> +	gpstates->timer.data = (unsigned long) policy;
>> +	gpstates->timer.function = gpstate_timer_handler;
>> +	gpstates->timer.expires = jiffies +
>> +				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
>> +
>> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>>   	return cpufreq_table_validate_and_show(policy, powernv_freqs);
> Who will free gpstates if this fails ?

Thanks for pointing out. Will fix in v2.

Regards
Akshay Adiga
Viresh Kumar April 14, 2016, 2:19 a.m. UTC | #3
On 13-04-16, 23:27, Akshay Adiga wrote:
> On 04/13/2016 10:33 AM, Viresh Kumar wrote:
> >>+void gpstate_timer_handler(unsigned long data)
> >>+{
> >>+	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> >no need to cast.
> 
> May be i need a cast here,  because data is unsigned long ( unlike other places where its void *).
> On building without cast, it throws me a warning.

My bad, yeah :(

> >>+	if (freq_data.gpstate_id != freq_data.pstate_id)
> >>+		ret = queue_gpstate_timer(gpstates);
> >ret not used.
> 
> Should i make it void instead of returning int?, as i cannot do much even if it fails, except for notifying.

Sure.
Balbir Singh April 14, 2016, 5:40 a.m. UTC | #4
On 13/04/16 04:06, Akshay Adiga wrote:
> This patch brings down global pstate at a slower rate than the local
> pstate. As the frequency transition latency from pmin to pmax is
> observed to be in few millisecond granurality. It takes a performance
> penalty during sudden frequency rampup. Hence by holding global pstates
> higher than local pstate makes the subsequent rampups faster.

What domains do "local" and "global" refer to?

> 
> A global per policy structure is maintained to keep track of the global
> and local pstate changes. The global pstate is brought down using a
> parabolic equation. The ramp down time to pmin is set to 6 seconds. To
> make sure that the global pstates are dropped at regular interval , a
> timer is queued for every 2 seconds, which eventually brings the pstate
> down to local pstate.
> 
> Iozone results show fairly consistent performance boost.
> YCSB on redis shows improved Max latencies in most cases.
> 
> Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb with
> different record sizes . The following table shows IOoperations/sec with and
> without patch.
> 
> Iozone Results ( in op/sec) ( mean over 3 iterations )
> ------------------------------------
> file size-	   		with     	without
> recordsize-IOtype      		patch    	patch 		 % change
> ----------------------------------------------------------------------
> 200704-1-SeqWrite		1616532 	1615425 	0.06
> 200704-1-Rewrite		2423195  	2303130 	5.21
> 200704-2-SeqWrite		1628577 	1602620 	1.61
> 200704-2-Rewrite		2428264  	2312154 	5.02
> 200704-4-SeqWrite		1617605 	1617182 	0.02
> 200704-4-Rewrite		2430524  	2351238 	3.37
> 200704-8-SeqWrite		1629478 	1600436 	1.81
> 200704-8-Rewrite		2415308  	2298136 	5.09
> 200704-16-SeqWrite		1619632 	1618250 	0.08
> 200704-16-Rewrite		2396650 	2352591 	1.87
> 200704-32-SeqWrite		1632544 	1598083 	2.15
> 200704-32-Rewrite		2425119 	2329743 	4.09
> 200704-64-SeqWrite		1617812 	1617235 	0.03
> 200704-64-Rewrite		2402021 	2321080 	3.48
> 200704-128-SeqWrite		1631998 	1600256 	1.98
> 200704-128-Rewrite		2422389		2304954 	5.09
> 200704-256 SeqWrite		1617065		1616962 	0.00
> 200704-256-Rewrite		2432539 	2301980 	5.67
> 200704-512-SeqWrite		1632599 	1598656 	2.12
> 200704-512-Rewrite		2429270		2323676 	4.54
> 200704-1024-SeqWrite		1618758		1616156 	0.16
> 200704-1024-Rewrite		2431631		2315889 	4.99
> 401408-1-SeqWrite		1631479  	1608132 	1.45
> 401408-1-Rewrite		2501550  	2459409 	1.71
> 401408-2-SeqWrite		1617095 	1626069 	-0.55
> 401408-2-Rewrite		2507557  	2443621 	2.61
> 401408-4-SeqWrite		1629601 	1611869		1.10
> 401408-4-Rewrite		2505909  	2462098 	1.77
> 401408-8-SeqWrite  		1617110 	1626968 	-0.60
> 401408-8-Rewrite		2512244  	2456827 	2.25
> 401408-16-SeqWrite		1632609 	1609603 	1.42
> 401408-16-Rewrite		2500792 	2451405 	2.01
> 401408-32-SeqWrite		1619294 	1628167 	-0.54
> 401408-32-Rewrite		2510115		2451292 	2.39
> 401408-64-SeqWrite		1632709		1603746 	1.80
> 401408-64-Rewrite		2506692 	2433186 	3.02
> 401408-128-SeqWrite		1619284		1627461 	-0.50
> 401408-128-Rewrite		2518698 	2453361 	2.66
> 401408-256-SeqWrite		1634022 	1610681 	1.44
> 401408-256-Rewrite		2509987 	2446328 	2.60
> 401408-512-SeqWrite 		1617524 	1628016 	-0.64
> 401408-512-Rewrite		2504409 	2442899 	2.51
> 401408-1024-SeqWrite		1629812 	1611566 	1.13
> 401408-1024-Rewrite 		2507620		 2442968 	2.64
> 
> Tested with YCSB workloada over redis for 1 million records and 1 million
> operation. Each test was carried out with target operations per second and
> persistence disabled. 
> 
> Max-latency (in us)( mean over 5 iterations )
> -----------------------------------------------------------
> op/s	Operation	with patch	without patch	%change
> ------------------------------------------------------------
> 15000	Read		61480.6		50261.4		22.32
> 15000	cleanup		215.2		293.6		-26.70
> 15000	update		25666.2		25163.8		2.00
> 
> 25000	Read		32626.2		89525.4		-63.56
> 25000	cleanup		292.2		263.0		11.10
> 25000	update		32293.4		90255.0		-64.22
> 
> 35000	Read		34783.0		33119.0		5.02
> 35000	cleanup		321.2		395.8		-18.8
> 35000	update		36047.0		38747.8		-6.97
> 
> 40000	Read		38562.2		42357.4		-8.96
> 40000	cleanup		371.8		384.6		-3.33
> 40000	update		27861.4		41547.8		-32.94
> 
> 45000	Read		42271.0		88120.6		-52.03
> 45000	cleanup		263.6		383.0		-31.17
> 45000	update		29755.8		81359.0		-63.43
> 
> (test without target op/s)
> 47659	Read		83061.4		136440.6	-39.12
> 47659	cleanup		195.8		193.8		1.03
> 47659	update		73429.4		124971.8	-41.24
> 
> Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 239 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 237 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index e2e2219..288fa10 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -36,12 +36,58 @@
>  #include <asm/reg.h>
>  #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
>  #include <asm/opal.h>
> +#include <linux/timer.h>
>  
>  #define POWERNV_MAX_PSTATES	256
>  #define PMSR_PSAFE_ENABLE	(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE	(1UL << 31)
>  #define PMSR_MAX(x)		((x >> 32) & 0xFF)
>  
> +/*
> + * Quadratic equation which gives the percentage rampdown for time elapsed in
> + * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
> + * This equation approximates to y = -4e-6 x^2

Thanks for documenting this, but I think it would also be good to explain why
we use y = -4e-6 * x^2 as opposed to any other magic numbers.

> + *
> + * At 0 seconds x=0000 ramp_down_percent=0
> + * At MAX_RAMP_DOWN_TIME x=5120 ramp_down_percent=100
> + */
> +#define MAX_RAMP_DOWN_TIME				5120
> +#define ramp_down_percent(time)		((time * time)>>18)
> +
> +/*Interval after which the timer is queued to bring down global pstate*/
> +#define GPSTATE_TIMER_INTERVAL				2000
> +/*
> + * global_pstate_info :
> + *	per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down
> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last set
> + * @last_lpstate , last_gpstate : last set values for local and global pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *	global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *	called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> +	int highest_lpstate;
> +	unsigned int elapsed_time;
> +	unsigned int last_sampled_time;
> +	int last_lpstate;
> +	int last_gpstate;
> +	spinlock_t gpstate_lock;
> +	struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))
> +
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static bool rebooting, throttled, occ_reset;
>  
> @@ -285,6 +331,7 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val)
>  struct powernv_smp_call_data {
>  	unsigned int freq;
>  	int pstate_id;
> +	int gpstate_id;


Balbir Singh
Akshay Adiga April 14, 2016, 3:54 p.m. UTC | #5
Hi Balbir,

On 04/14/2016 11:10 AM, Balbir Singh wrote:

> On 13/04/16 04:06, Akshay Adiga wrote:
>> This patch brings down global pstate at a slower rate than the local
>> pstate. As the frequency transition latency from pmin to pmax is
>> observed to be in few millisecond granurality. It takes a performance
>> penalty during sudden frequency rampup. Hence by holding global pstates
>> higher than local pstate makes the subsequent rampups faster.
> What domains do local and global refer to?

The *global pstate* is a chip-level entity, so the global quantity
(voltage) is managed across several cores. On a DCM (Dual Chip Module),
it is more of a socket-level entity, and the voltage is managed across both chips.

The *local pstate* is a core-level entity, so the local quantity (frequency)
is managed across the threads of a core.

I will include this in the commit message. Thanks.

>
>> +/*
>> + * Quadratic equation which gives the percentage rampdown for time elapsed in
>> + * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
>> + * This equation approximates to y = -4e-6 x^2
> Thanks for documenting this, but I think it will also be good to explain why we
> use y = -4 e-6*x^2 as opposed to any other magic numbers.

Well, it empirically worked out best this way.

On an idle system we want the global pstate to ramp down from its max value
to the min over a span of ~5 seconds. We also want to ramp down slowly at first
and rapidly later on, hence the quadratic equation.

I will try to make the comment more informative.

Regards

Akshay Adiga
diff mbox

Patch

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index e2e2219..288fa10 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -36,12 +36,58 @@ 
 #include <asm/reg.h>
 #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
 #include <asm/opal.h>
+#include <linux/timer.h>
 
 #define POWERNV_MAX_PSTATES	256
 #define PMSR_PSAFE_ENABLE	(1UL << 30)
 #define PMSR_SPR_EM_DISABLE	(1UL << 31)
 #define PMSR_MAX(x)		((x >> 32) & 0xFF)
 
+/*
+ * Quadratic equation which gives the percentage of ramp-down for the time
+ * elapsed in milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME (ms).
+ * This equation approximates y = -4e-6 * x^2: slow at first, steep later on.
+ *
+ * At 0 ms,                x = 0    and ramp_down_percent = 0
+ * At MAX_RAMP_DOWN_TIME,  x = 5120 and ramp_down_percent = 100
+ */
+#define MAX_RAMP_DOWN_TIME				5120
+#define ramp_down_percent(time)		((time * time)>>18)
+
+/* Interval after which the timer is queued to bring down the global pstate */
+#define GPSTATE_TIMER_INTERVAL				2000
+/*
+ * global_pstate_info :
+ *	per policy data structure to maintain history of global pstates
+ *
+ * @highest_lpstate : the local pstate from which we are ramping down
+ * @elapsed_time : time in ms spent in ramping down from highest_lpstate
+ * @last_sampled_time : time from boot in ms when global pstates were last set
+ * @last_lpstate, @last_gpstate : last set values of local and global pstates
+ * @timer : used to ramp down if the cpu goes idle for a long time with the
+ *	global pstate held high
+ * @gpstate_lock : a spinlock to synchronize the routines called by the
+ *	timer handler and the governor's target_index calls
+ */
+struct global_pstate_info {
+	int highest_lpstate;
+	unsigned int elapsed_time;
+	unsigned int last_sampled_time;
+	int last_lpstate;
+	int last_gpstate;
+	spinlock_t gpstate_lock;
+	struct timer_list timer;
+};
+
+/*
+ * While resetting we don't want "timer" fields to be set to zero as we
+ * may lose track of timer and will not be able to cleanly remove it
+ */
+#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
+					sizeof(struct global_pstate_info)-\
+					sizeof(struct timer_list)-\
+					sizeof(spinlock_t))
+
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
 static bool rebooting, throttled, occ_reset;
 
@@ -285,6 +331,7 @@  static inline void set_pmspr(unsigned long sprn, unsigned long val)
 struct powernv_smp_call_data {
 	unsigned int freq;
 	int pstate_id;
+	int gpstate_id;
 };
 
 /*
@@ -348,14 +395,17 @@  static void set_pstate(void *freq_data)
 	unsigned long val;
 	unsigned long pstate_ul =
 		((struct powernv_smp_call_data *) freq_data)->pstate_id;
+	unsigned long gpstate_ul =
+		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
 
 	val = get_pmspr(SPRN_PMCR);
 	val = val & 0x0000FFFFFFFFFFFFULL;
 
 	pstate_ul = pstate_ul & 0xFF;
+	gpstate_ul = gpstate_ul & 0xFF;
 
 	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
-	val = val | (pstate_ul << 56) | (pstate_ul << 48);
+	val = val | (gpstate_ul << 56) | (pstate_ul << 48);
 
 	pr_debug("Setting cpu %d pmcr to %016lX\n",
 			raw_smp_processor_id(), val);
@@ -425,6 +475,109 @@  next:
 }
 
 /*
+ * calculate_global_pstate:
+ *
+ * @elapsed_time : elapsed time in milliseconds
+ * @local_pstate : new local pstate
+ * @highest_lpstate : pstate from which it is ramping down
+ *
+ * Finds the appropriate global pstate based on the pstate from which it is
+ * ramping down and the time elapsed so far. It follows a quadratic
+ * equation which ensures that the ramp-down to pmin completes in 5 seconds.
+ */
+inline int calculate_global_pstate(unsigned int elapsed_time,
+		int highest_lpstate, int local_pstate)
+{
+	int pstate_diff;
+
+	/*
+	 * ramp_down_percent gives the percentage of the total ramp-down that
+	 * should have happened by now. The difference between highest_lpstate
+	 * and powernv_pstate_info.min gives the absolute number of pstates
+	 * that will have been dropped by the end of 5 seconds; scale it to
+	 * get the number of pstates to drop now.
+	 */
+	pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
+			(highest_lpstate - powernv_pstate_info.min))/100;
+
+	/* Ensure that the global pstate is >= the local pstate */
+	if (highest_lpstate - pstate_diff < local_pstate)
+		return local_pstate;
+	else
+		return (highest_lpstate - pstate_diff);
+}
+
+inline int queue_gpstate_timer(struct global_pstate_info *gpstates)
+{
+	unsigned int timer_interval;
+
+	/* Set up the timer to fire after GPSTATE_TIMER_INTERVAL ms, but
+	 * cap it so that the total ramp-down time never exceeds
+	 * MAX_RAMP_DOWN_TIME ms: the last timer fires exactly at
+	 * MAX_RAMP_DOWN_TIME ms of ramp-down time.
+	 */
+	if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
+							 > MAX_RAMP_DOWN_TIME)
+		timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
+	else
+		timer_interval = GPSTATE_TIMER_INTERVAL;
+
+	return  mod_timer_pinned(&(gpstates->timer), jiffies +
+				msecs_to_jiffies(timer_interval));
+}
+/*
+ * gpstate_timer_handler
+ *
+ * @data: pointer to cpufreq_policy on which timer was queued
+ *
+ * This handler brings the global pstate down closer to the local pstate
+ * according to the quadratic equation. It queues a new timer if the two
+ * are still not equal.
+ */
+void gpstate_timer_handler(unsigned long data)
+{
+	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+							policy->driver_data;
+	unsigned int time_diff = jiffies_to_msecs(jiffies)
+					- gpstates->last_sampled_time;
+	struct powernv_smp_call_data freq_data;
+	int ret;
+
+	ret = spin_trylock(&gpstates->gpstate_lock);
+	if (!ret)
+		return;
+
+	gpstates->last_sampled_time += time_diff;
+	gpstates->elapsed_time += time_diff;
+	freq_data.pstate_id = gpstates->last_lpstate;
+	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
+			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
+		freq_data.gpstate_id = freq_data.pstate_id;
+		reset_gpstates(policy);
+		gpstates->highest_lpstate = freq_data.pstate_id;
+	} else {
+		freq_data.gpstate_id = calculate_global_pstate(
+			gpstates->elapsed_time, gpstates->highest_lpstate,
+			freq_data.pstate_id);
+	}
+
+	/* If the global pstate has reached the local pstate, the ramp-down
+	 * is over, so no new timer needs to be queued.
+	 */
+	if (freq_data.gpstate_id != freq_data.pstate_id)
+		ret = queue_gpstate_timer(gpstates);
+
+	gpstates->last_gpstate = freq_data.gpstate_id;
+	gpstates->last_lpstate = freq_data.pstate_id;
+
+	/* Timer may get migrated to a different cpu on cpu hot unplug */
+	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+	spin_unlock(&gpstates->gpstate_lock);
+}
+
+
+/*
  * powernv_cpufreq_target_index: Sets the frequency corresponding to
  * the cpufreq table entry indexed by new_index on the cpus in the
  * mask policy->cpus
@@ -432,23 +585,88 @@  next:
 static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
 					unsigned int new_index)
 {
+	int ret;
 	struct powernv_smp_call_data freq_data;
-
+	unsigned int cur_msec;
+	unsigned long flags;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+						policy->driver_data;
 	if (unlikely(rebooting) && new_index != get_nominal_index())
 		return 0;
 
 	if (!throttled)
 		powernv_cpufreq_throttle_check(NULL);
 
+	cur_msec = jiffies_to_msecs(get_jiffies_64());
+
+	/* take the gpstate lock */
+	spin_lock_irqsave(&gpstates->gpstate_lock, flags);
 	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
 
+	/* first call: start with global pstate equal to local pstate */
+	if (!gpstates->last_sampled_time) {
+		freq_data.gpstate_id = freq_data.pstate_id;
+		gpstates->highest_lpstate = freq_data.pstate_id;
+		goto gpstates_done;
+	}
+
+	/* ramp down */
+	if (gpstates->last_gpstate > freq_data.pstate_id) {
+		gpstates->elapsed_time += cur_msec -
+					gpstates->last_sampled_time;
+		/* If it has been ramping down for more than 5 seconds,
+		 * reset all global pstate related data and set the global
+		 * pstate equal to the local pstate to start fresh.
+		 */
+		if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
+			freq_data.gpstate_id = freq_data.pstate_id;
+			reset_gpstates(policy);
+			gpstates->highest_lpstate = freq_data.pstate_id;
+			freq_data.gpstate_id = freq_data.pstate_id;
+		} else {
+			/* elapsed_time < 5 seconds; continue the ramp-down */
+			freq_data.gpstate_id = calculate_global_pstate(
+			gpstates->elapsed_time,
+			gpstates->highest_lpstate, freq_data.pstate_id);
+
+		}
+
+	} else {
+		/* ramp up */
+		reset_gpstates(policy);
+		gpstates->highest_lpstate = freq_data.pstate_id;
+		freq_data.gpstate_id = freq_data.pstate_id;
+	}
+
+	/* If the global pstate has reached the local pstate, the ramp-down
+	 * is over, so no new timer needs to be queued.
+	 */
+	if (freq_data.gpstate_id != freq_data.pstate_id)
+		ret = queue_gpstate_timer(gpstates);
+gpstates_done:
+	gpstates->last_sampled_time = cur_msec;
+	gpstates->last_gpstate = freq_data.gpstate_id;
+	gpstates->last_lpstate = freq_data.pstate_id;
+
 	/*
 	 * Use smp_call_function to send IPI and execute the
 	 * mtspr on target CPU.  We could do that without IPI
 	 * if current CPU is within policy->cpus (core)
 	 */
 	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
+	return 0;
+}
 
+static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	int base;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+						policy->driver_data;
+	base = cpu_first_thread_sibling(policy->cpu);
+	del_timer_sync(&gpstates->timer);
+	kfree(policy->driver_data);
+	pr_info("freed driver_data for cpu %d\n", base);
 	return 0;
 }
 
@@ -456,6 +674,7 @@  static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
 	int base, i;
 	struct kernfs_node *kn;
+	struct global_pstate_info *gpstates;
 
 	base = cpu_first_thread_sibling(policy->cpu);
 
@@ -475,6 +694,21 @@  static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	} else {
 		kernfs_put(kn);
 	}
+	gpstates = kzalloc(sizeof(struct global_pstate_info), GFP_KERNEL);
+	if (!gpstates) {
+		pr_err("Could not allocate global_pstate_info\n");
+		return -ENOMEM;
+	}
+	policy->driver_data = gpstates;
+
+	/* initialize timer */
+	init_timer_deferrable(&gpstates->timer);
+	gpstates->timer.data = (unsigned long) policy;
+	gpstates->timer.function = gpstate_timer_handler;
+	gpstates->timer.expires = jiffies +
+				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
+
+	pr_info("Added global_pstate_info & timer for cpu %d\n", base);
 	return cpufreq_table_validate_and_show(policy, powernv_freqs);
 }
 
@@ -612,6 +846,7 @@  static struct cpufreq_driver powernv_cpufreq_driver = {
 	.name		= "powernv-cpufreq",
 	.flags		= CPUFREQ_CONST_LOOPS,
 	.init		= powernv_cpufreq_cpu_init,
+	.exit		= powernv_cpufreq_cpu_exit,
 	.verify		= cpufreq_generic_frequency_table_verify,
 	.target_index	= powernv_cpufreq_target_index,
 	.get		= powernv_cpufreq_get,