
[TEGRA194_CPUFREQ,2/3] cpufreq: Add Tegra194 cpufreq driver

Message ID 1575394348-17649-2-git-send-email-sumitg@nvidia.com
State Changes Requested
Series [TEGRA194_CPUFREQ,1/3] firmware: tegra: adding function to get BPMP data

Commit Message

Sumit Gupta Dec. 3, 2019, 5:32 p.m. UTC
Add support for CPU frequency scaling on Tegra194. The frequency
of each core can be adjusted by writing a clock divisor value to
an MSR on the core. The range of valid divisors is queried from
the BPMP.

Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
---
 drivers/cpufreq/Kconfig.arm        |   6 +
 drivers/cpufreq/Makefile           |   1 +
 drivers/cpufreq/tegra194-cpufreq.c | 423 +++++++++++++++++++++++++++++++++++++
 3 files changed, 430 insertions(+)
 create mode 100644 drivers/cpufreq/tegra194-cpufreq.c

Comments

Viresh Kumar Dec. 4, 2019, 5:40 a.m. UTC | #1
Hi Sumit,

On 03-12-19, 23:02, Sumit Gupta wrote:
> Add support for CPU frequency scaling on Tegra194. The frequency
> of each core can be adjusted by writing a clock divisor value to
> an MSR on the core. The range of valid divisors is queried from
> the BPMP.
> 
> Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
> ---
>  drivers/cpufreq/Kconfig.arm        |   6 +
>  drivers/cpufreq/Makefile           |   1 +
>  drivers/cpufreq/tegra194-cpufreq.c | 423 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 430 insertions(+)
>  create mode 100644 drivers/cpufreq/tegra194-cpufreq.c

Overall these are the things that you are doing here in the driver:

- open coded clk_{get|set}_rate(), why can't you implement a clock
  driver for the CPU and use the clk framework? You may not need the
  (hacky) work-queue usage then probably.

- populating cpufreq table, you can probably add OPPs instead using
  the same mechanism

- And then you can reuse the cpufreq-dt driver for your platform as
  well, as is the case for a few other tegra platforms.
Sumit Gupta Dec. 4, 2019, 10:55 a.m. UTC | #2
Hi Viresh,

On 04/12/19 11:10 AM, Viresh Kumar wrote:
> Hi Sumit,
>
> On 03-12-19, 23:02, Sumit Gupta wrote:
>> Add support for CPU frequency scaling on Tegra194. The frequency
>> of each core can be adjusted by writing a clock divisor value to
>> an MSR on the core. The range of valid divisors is queried from
>> the BPMP.
>>
>> Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>> ---
>>   drivers/cpufreq/Kconfig.arm        |   6 +
>>   drivers/cpufreq/Makefile           |   1 +
>>   drivers/cpufreq/tegra194-cpufreq.c | 423 +++++++++++++++++++++++++++++++++++++
>>   3 files changed, 430 insertions(+)
>>   create mode 100644 drivers/cpufreq/tegra194-cpufreq.c
> Overall these are the things that you are doing here in the driver:
>
> - open coded clk_{get|set}_rate(), why can't you implement a clock
>    driver for the CPU and use the clk framework? You may not need the
>    (hacky) work-queue usage then probably.

In T194, CCPLEX doesn't have access to set clocks and the
clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
read the counters and calculate the "actual" cpu freq at CCPLEX.

So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
the "scaling_cur_freq" node.

>
> - populating cpufreq table, you can probably add OPPs instead using
>    the same mechanism

We are reading the available frequencies from BPMP to populate the
cpufreq table and are not using a static OPP table.

>
> - And then you can reuse the cpufreq-dt driver for your platform as
>    well, as is the case for a few other tegra platforms.
>
Viresh Kumar Dec. 4, 2019, 11:27 a.m. UTC | #3
On 04-12-19, 16:25, sumitg wrote:
> In T194, CCPLEX doesn't have access to set clocks and the
> clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
> CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
> As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
> read the counters and calculate the "actual" cpu freq at CCPLEX.
> 
> So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
> the "scaling_cur_freq" node.

Right, but why can't this be hidden in the CPU's clk driver instead,
so cpufreq driver can just do clk_get_rate() and clk_set_rate() ?

> > 
> > - populating cpufreq table, you can probably add OPPs instead using
> >    the same mechanism
> 
> We are reading the available frequencies from BPMP to populate the
> cpufreq table and are not using a static OPP table.

Right, and a lot of other platforms read it from firmware (I believe BPMP
is the firmware here), and create OPPs at runtime. Look at this for
example:

drivers/cpufreq/qcom-cpufreq-hw.c

and search for dev_pm_opp_add().
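
For reference, a minimal sketch of that pattern, written against this
patch's init_freq_table() variables ('resp', 'freq_table_step_size',
map_ndiv_to_freq() and KHZ); 'cpu' and 'cpu_dev' are assumptions here,
and it needs <linux/pm_opp.h> and <linux/cpu.h>:

	/* cpu_dev: device of a CPU in this cluster (assumption) */
	struct device *cpu_dev = get_cpu_device(cpu);
	unsigned long freq_hz;
	u32 ndiv;
	int err;

	for (ndiv = resp.ndiv_min; ndiv <= resp.ndiv_max;
	     ndiv += freq_table_step_size) {
		freq_hz = (unsigned long)map_ndiv_to_freq(&resp, ndiv) * KHZ;

		/* voltage is owned by BPMP, so pass 0 for u_volt */
		err = dev_pm_opp_add(cpu_dev, freq_hz, 0);
		if (err)
			break;
	}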
Dmitry Osipenko Dec. 4, 2019, 1:57 p.m. UTC | #4
04.12.2019 14:27, Viresh Kumar wrote:
> On 04-12-19, 16:25, sumitg wrote:
>> In T194, CCPLEX doesn't have access to set clocks and the
>> clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
>> CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
>> As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
>> read the counters and calculate the "actual" cpu freq at CCPLEX.
>>
>> So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
>> the "scaling_cur_freq" node.
> 
> Right, but why can't this be hidden in the CPU's clk driver instead,
> so cpufreq driver can just do clk_get_rate() and clk_set_rate() ?

What about making use of dev_pm_opp_register_set_opp_helper()?
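
A rough sketch of what that could look like (hypothetical and
untested; freq_to_ndiv() is a made-up helper, and the MSR write would
really have to run on a CPU of the target cluster, e.g. via
on_each_cpu_mask(), not in the caller's context as shown):

	static int tegra194_set_opp(struct dev_pm_set_opp_data *data)
	{
		/* data->new_opp.rate is the requested frequency in Hz */
		u64 ndiv_val = freq_to_ndiv(data->new_opp.rate);

		asm volatile("msr s3_0_c15_c0_4, %0" : : "r" (ndiv_val));
		return 0;
	}

	/* registered once, e.g. at probe; cpu_dev is the CPU device */
	opp_table = dev_pm_opp_register_set_opp_helper(cpu_dev,
						       tegra194_set_opp);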


>>> - populating cpufreq table, you can probably add OPPs instead using
>>>    the same mechanism
>>
>> We are reading the available frequencies from BPMP to populate the
>> cpufreq table and are not using a static OPP table.
> 
> Right, and a lot of other platforms read it from firmware (I believe BPMP
> is the firmware here), and create OPPs at runtime. Look at this for
> example:
> 
> drivers/cpufreq/qcom-cpufreq-hw.c
> 
> and search for dev_pm_opp_add().
>
Dmitry Osipenko Dec. 4, 2019, 1:59 p.m. UTC | #5
03.12.2019 20:32, Sumit Gupta wrote:
> Add support for CPU frequency scaling on Tegra194. The frequency
> of each core can be adjusted by writing a clock divisor value to
> an MSR on the core. The range of valid divisors is queried from
> the BPMP.
> 
> Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
> ---
>  drivers/cpufreq/Kconfig.arm        |   6 +
>  drivers/cpufreq/Makefile           |   1 +
>  drivers/cpufreq/tegra194-cpufreq.c | 423 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 430 insertions(+)
>  create mode 100644 drivers/cpufreq/tegra194-cpufreq.c
> 
> diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
> index a905796..4bcd47c 100644
> --- a/drivers/cpufreq/Kconfig.arm
> +++ b/drivers/cpufreq/Kconfig.arm
> @@ -320,6 +320,12 @@ config ARM_TEGRA186_CPUFREQ
>  	help
>  	  This adds the CPUFreq driver support for Tegra186 SOCs.
>  
> +config ARM_TEGRA194_CPUFREQ
> +	tristate "Tegra194 CPUFreq support"
> +	depends on ARCH_TEGRA && TEGRA_BPMP
> +	help
> +	  This adds CPU frequency driver support for Tegra194 SOCs.
> +
>  config ARM_TI_CPUFREQ
>  	bool "Texas Instruments CPUFreq support"
>  	depends on ARCH_OMAP2PLUS
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 9a9f5cc..433d492 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -85,6 +85,7 @@ obj-$(CONFIG_ARM_TANGO_CPUFREQ)		+= tango-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA20_CPUFREQ)	+= tegra20-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA124_CPUFREQ)	+= tegra124-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA186_CPUFREQ)	+= tegra186-cpufreq.o
> +obj-$(CONFIG_ARM_TEGRA194_CPUFREQ)	+= tegra194-cpufreq.o
>  obj-$(CONFIG_ARM_TI_CPUFREQ)		+= ti-cpufreq.o
>  obj-$(CONFIG_ARM_VEXPRESS_SPC_CPUFREQ)	+= vexpress-spc-cpufreq.o
>  
> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> new file mode 100644
> index 0000000..9df12f4
> --- /dev/null
> +++ b/drivers/cpufreq/tegra194-cpufreq.c
> @@ -0,0 +1,423 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/cpufreq.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +
> +#include <asm/smp_plat.h>
> +
> +#include <soc/tegra/bpmp.h>
> +#include <soc/tegra/bpmp-abi.h>
> +
> +#define KHZ                     1000
> +#define REF_CLK_MHZ             408 /* 408 MHz */
> +#define US_DELAY                2000
> +#define US_DELAY_MIN            2
> +#define CPUFREQ_TBL_STEP_HZ     (50 * KHZ * KHZ)
> +#define MAX_CNT                 ~0U
> +
> +/* cpufreq transition latency */
> +#define TEGRA_CPUFREQ_TRANSITION_LATENCY (300 * 1000) /* unit in nanoseconds */
> +
> +enum cluster {
> +	CLUSTER0,
> +	CLUSTER1,
> +	CLUSTER2,
> +	CLUSTER3,
> +	MAX_CLUSTERS,
> +};
> +
> +struct tegra194_cpufreq_data {
> +	void __iomem *regs;
> +	size_t num_clusters;
> +	struct cpufreq_frequency_table **tables;
> +};
> +
> +static DEFINE_MUTEX(cpufreq_lock);
> +
> +struct tegra_cpu_ctr {
> +	u32 cpu;
> +	u32 delay;
> +	u32 coreclk_cnt, last_coreclk_cnt;
> +	u32 refclk_cnt, last_refclk_cnt;
> +};
> +
> +static struct workqueue_struct *read_counters_wq;
> +struct read_counters_work {
> +	struct work_struct work;
> +	struct tegra_cpu_ctr c;
> +};
> +
> +static enum cluster get_cpu_cluster(u8 cpu)
> +{
> +	return MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);
> +}
> +
> +/*
> + * Read per-core Read-only system register NVFREQ_FEEDBACK_EL1.
> + * The register provides frequency feedback information to
> + * determine the average actual frequency a core has run at over
> + * a period of time.
> + *	[31:0] PLLP counter: Counts at fixed frequency (408 MHz)
> + *	[63:32] Core clock counter: counts on every core clock cycle
> + *			where the core is architecturally clocking
> + */
> +static u64 read_freq_feedback(void)
> +{
> +	u64 val = 0;
> +
> +	asm volatile("mrs %0, s3_0_c15_c0_5" : "=r" (val) : );
> +
> +	return val;
> +}
> +
> +u16 map_freq_to_ndiv(struct mrq_cpu_ndiv_limits_response *nltbl, u32 freq)
> +{
> +	return DIV_ROUND_UP(freq * nltbl->pdiv * nltbl->mdiv,
> +			    nltbl->ref_clk_hz / KHZ);
> +}

This function isn't used anywhere.
Viresh Kumar Dec. 5, 2019, 2:51 a.m. UTC | #6
On 04-12-19, 16:57, Dmitry Osipenko wrote:
> 04.12.2019 14:27, Viresh Kumar wrote:
> > On 04-12-19, 16:25, sumitg wrote:
> >> In T194, CCPLEX doesn't have access to set clocks and the
> >> clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
> >> CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
> >> As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
> >> read the counters and calculate the "actual" cpu freq at CCPLEX.
> >>
> >> So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
> >> the "scaling_cur_freq" node.
> > 
> > Right, but why can't this be hidden in the CPU's clk driver instead,
> > so cpufreq driver can just do clk_get_rate() and clk_set_rate() ?
> 
> What about making use of dev_pm_opp_register_set_opp_helper()?

It has a different purpose where we have to play with different
regulators. And that won't help with clk_get_rate() anyway.
Dmitry Osipenko Dec. 5, 2019, 12:55 p.m. UTC | #7
05.12.2019 05:51, Viresh Kumar wrote:
> On 04-12-19, 16:57, Dmitry Osipenko wrote:
>> 04.12.2019 14:27, Viresh Kumar wrote:
>>> On 04-12-19, 16:25, sumitg wrote:
>>>> In T194, CCPLEX doesn't have access to set clocks and the
>>>> clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
>>>> CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
>>>> As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
>>>> read the counters and calculate the "actual" cpu freq at CCPLEX.
>>>>
>>>> So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
>>>> the "scaling_cur_freq" node.
>>>
>>> Right, but why can't this be hidden in the CPU's clk driver instead,
>>> so cpufreq driver can just do clk_get_rate() and clk_set_rate() ?
>>
>> What about making use of dev_pm_opp_register_set_opp_helper()?
> 
> It has a different purpose where we have to play with different
> regulators. And that won't help with clk_get_rate() anyway.

Indeed that won't help with the clk.
Dmitry Osipenko Dec. 5, 2019, 2:15 p.m. UTC | #8
03.12.2019 20:32, Sumit Gupta wrote:
> Add support for CPU frequency scaling on Tegra194. The frequency
> of each core can be adjusted by writing a clock divisor value to
> an MSR on the core. The range of valid divisors is queried from
> the BPMP.
> 
> Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
> ---
>  drivers/cpufreq/Kconfig.arm        |   6 +
>  drivers/cpufreq/Makefile           |   1 +
>  drivers/cpufreq/tegra194-cpufreq.c | 423 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 430 insertions(+)
>  create mode 100644 drivers/cpufreq/tegra194-cpufreq.c
> 
> diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
> index a905796..4bcd47c 100644
> --- a/drivers/cpufreq/Kconfig.arm
> +++ b/drivers/cpufreq/Kconfig.arm
> @@ -320,6 +320,12 @@ config ARM_TEGRA186_CPUFREQ
>  	help
>  	  This adds the CPUFreq driver support for Tegra186 SOCs.
>  
> +config ARM_TEGRA194_CPUFREQ
> +	tristate "Tegra194 CPUFreq support"
> +	depends on ARCH_TEGRA && TEGRA_BPMP
> +	help
> +	  This adds CPU frequency driver support for Tegra194 SOCs.
> +
>  config ARM_TI_CPUFREQ
>  	bool "Texas Instruments CPUFreq support"
>  	depends on ARCH_OMAP2PLUS
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 9a9f5cc..433d492 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -85,6 +85,7 @@ obj-$(CONFIG_ARM_TANGO_CPUFREQ)		+= tango-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA20_CPUFREQ)	+= tegra20-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA124_CPUFREQ)	+= tegra124-cpufreq.o
>  obj-$(CONFIG_ARM_TEGRA186_CPUFREQ)	+= tegra186-cpufreq.o
> +obj-$(CONFIG_ARM_TEGRA194_CPUFREQ)	+= tegra194-cpufreq.o
>  obj-$(CONFIG_ARM_TI_CPUFREQ)		+= ti-cpufreq.o
>  obj-$(CONFIG_ARM_VEXPRESS_SPC_CPUFREQ)	+= vexpress-spc-cpufreq.o
>  
> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> new file mode 100644
> index 0000000..9df12f4
> --- /dev/null
> +++ b/drivers/cpufreq/tegra194-cpufreq.c
> @@ -0,0 +1,423 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/cpufreq.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +
> +#include <asm/smp_plat.h>
> +
> +#include <soc/tegra/bpmp.h>
> +#include <soc/tegra/bpmp-abi.h>
> +
> +#define KHZ                     1000
> +#define REF_CLK_MHZ             408 /* 408 MHz */
> +#define US_DELAY                2000
> +#define US_DELAY_MIN            2
> +#define CPUFREQ_TBL_STEP_HZ     (50 * KHZ * KHZ)
> +#define MAX_CNT                 ~0U
> +
> +/* cpufreq transition latency */
> +#define TEGRA_CPUFREQ_TRANSITION_LATENCY (300 * 1000) /* unit in nanoseconds */
> +
> +enum cluster {
> +	CLUSTER0,
> +	CLUSTER1,
> +	CLUSTER2,
> +	CLUSTER3,
> +	MAX_CLUSTERS,
> +};
> +
> +struct tegra194_cpufreq_data {
> +	void __iomem *regs;
> +	size_t num_clusters;
> +	struct cpufreq_frequency_table **tables;
> +};
> +
> +static DEFINE_MUTEX(cpufreq_lock);
> +
> +struct tegra_cpu_ctr {
> +	u32 cpu;
> +	u32 delay;
> +	u32 coreclk_cnt, last_coreclk_cnt;
> +	u32 refclk_cnt, last_refclk_cnt;
> +};
> +
> +static struct workqueue_struct *read_counters_wq;
> +struct read_counters_work {
> +	struct work_struct work;
> +	struct tegra_cpu_ctr c;
> +};
> +
> +static enum cluster get_cpu_cluster(u8 cpu)
> +{
> +	return MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);
> +}
> +
> +/*
> + * Read per-core Read-only system register NVFREQ_FEEDBACK_EL1.
> + * The register provides frequency feedback information to
> + * determine the average actual frequency a core has run at over
> + * a period of time.
> + *	[31:0] PLLP counter: Counts at fixed frequency (408 MHz)
> + *	[63:32] Core clock counter: counts on every core clock cycle
> + *			where the core is architecturally clocking
> + */
> +static u64 read_freq_feedback(void)
> +{
> +	u64 val = 0;
> +
> +	asm volatile("mrs %0, s3_0_c15_c0_5" : "=r" (val) : );
> +
> +	return val;
> +}
> +
> +u16 map_freq_to_ndiv(struct mrq_cpu_ndiv_limits_response *nltbl, u32 freq)
> +{
> +	return DIV_ROUND_UP(freq * nltbl->pdiv * nltbl->mdiv,
> +			    nltbl->ref_clk_hz / KHZ);
> +}
> +
> +static inline u32 map_ndiv_to_freq(struct mrq_cpu_ndiv_limits_response
> +				   *nltbl, u16 ndiv)
> +{
> +	return nltbl->ref_clk_hz / KHZ * ndiv / (nltbl->pdiv * nltbl->mdiv);
> +}
> +
> +static void tegra_read_counters(struct work_struct *work)
> +{
> +	struct read_counters_work *read_counters_work;
> +	struct tegra_cpu_ctr *c;
> +	u64 val;
> +
> +	/*
> +	 * ref_clk_counter(32 bit counter) runs on constant clk,
> +	 * pll_p(408MHz).
> +	 * It will take = 2 ^ 32 / 408 MHz to overflow ref clk counter
> +	 *              = 10526880 usec = 10.527 sec to overflow
> +	 *
> +	 * Like wise core_clk_counter(32 bit counter) runs on core clock.
> +	 * It's synchronized to crab_clk (cpu_crab_clk) which runs at
> +	 * freq of cluster. Assuming max cluster clock ~2000MHz,
> +	 * It will take = 2 ^ 32 / 2000 MHz to overflow core clk counter
> +	 *              = ~2.147 sec to overflow
> +	 */
> +	read_counters_work = container_of(work, struct read_counters_work,
> +					  work);
> +	c = &read_counters_work->c;
> +
> +	val = read_freq_feedback();
> +	c->last_refclk_cnt = lower_32_bits(val);
> +	c->last_coreclk_cnt = upper_32_bits(val);
> +	udelay(c->delay);
> +	val = read_freq_feedback();
> +	c->refclk_cnt = lower_32_bits(val);
> +	c->coreclk_cnt = upper_32_bits(val);
> +}
> +
> +/*
> + * Return instantaneous cpu speed
> + * Instantaneous freq is calculated as -
> + * -Takes sample on every query of getting the freq.
> + *	- Read core and ref clock counters;
> + *	- Delay for X us
> + *	- Read above cycle counters again
> + *	- Calculates freq by subtracting current and previous counters
> + *	  divided by the delay time or eqv. of ref_clk_counter in delta time
> + *	- Return Kcycles/second, freq in KHz
> + *
> + *	delta time period = x sec
> + *			  = delta ref_clk_counter / (408 * 10^6) sec
> + *	freq in Hz = cycles/sec
> + *		   = delta cycles / x sec
> + *		   = (delta cycles * 408 * 10^6) / delta ref_clk_counter
> + *	in KHz	   = (delta cycles * 408 * 10^3) / delta ref_clk_counter
> + *
> + * @cpu - logical cpu whose freq to be updated
> + * Returns freq in KHz on success, 0 if cpu is offline
> + */
> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
> +{
> +	struct read_counters_work read_counters_work;
> +	struct tegra_cpu_ctr c;
> +	u32 delta_refcnt;
> +	u32 delta_ccnt;
> +	u32 rate_mhz;
> +
> +	read_counters_work.c.cpu = cpu;
> +	read_counters_work.c.delay = delay;
> +	INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
> +	queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> +	flush_work(&read_counters_work.work);

What will happen if CPU is offline?
Sumit Gupta March 25, 2020, 11:59 p.m. UTC | #9
Hi Viresh,

Sorry for the late reply.

On 04/12/19 4:57 PM, Viresh Kumar wrote:
> On 04-12-19, 16:25, sumitg wrote:
>> In T194, CCPLEX doesn't have access to set clocks and the
>> clk_{get|set}_rate() functions set clocks by hooking into the BPMP R5.
>> CPU freq can be directly set by CCPLEX using an MSR (NVFREQ_REQ_EL1).
>> As DVFS runs on BPMP, another MSR (NVFREQ_FEEDBACK_EL1) is used to
>> read the counters and calculate the "actual" cpu freq at CCPLEX.
>>
>> So, the "cpuinfo_cur_freq" node gives the actual cpu frequency, unlike
>> the "scaling_cur_freq" node.
> Right, but why can't this be hidden in the CPU's clk driver instead,
> so cpufreq driver can just do clk_get_rate() and clk_set_rate() ?
>
>>> - populating cpufreq table, you can probably add OPPs instead using
>>>     the same mechanism
>> We are reading the available frequencies from BPMP to populate the
>> cpufreq table and are not using a static OPP table.
> Right, and a lot of other platforms read it from firmware (I believe BPMP
> is the firmware here), and create OPPs at runtime. Look at this for
> example:
>
> drivers/cpufreq/qcom-cpufreq-hw.c
>
> and search for dev_pm_opp_add().
>
- I think we don't need a separate CPU clock driver & to reuse the
  cpufreq-dt driver, as we would still have to replicate the same
  logic from the cpufreq driver in that dummy clock driver to
  calculate the actual cpufreq from the MSR value. So, it won't add
  much value.

    - "qcom-cpufreq-hw.c" is using clk_get_rate() during init, but
      the frequency ops "get/target_index" write to the register
      directly and don't use the clk APIs. Also, the clock driver
      from gcc-msm*.c seems to handle all clocks in CCPLEX.

      Tegra SOCs which didn't have BPMP had the clock handling done
      by CCPLEX. They were using the clk_{get|set}_rate() APIs as
      you mentioned. But in Tegra194, all clock handling is done
      within the BPMP R5 core except the CPU clock (which is through
      an MSR).

- Adding OPPs with dev_pm_opp_add() is also not required as:

    1) We don't have any consumer like an energy model or EAS in
       Tegra194, which it seems was valid with "qcom-cpufreq-hw.c".
       So, I think it won't be useful for T194.

    2) Also, there is no way to map ndiv to voltage in the kernel.
       The kernel driver passes the ndiv value to BPMP (R5), which
       converts it to a vhint.

Please share your inputs.
Viresh Kumar March 26, 2020, 11:50 a.m. UTC | #10
On 03-12-19, 23:02, Sumit Gupta wrote:
> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> new file mode 100644
> index 0000000..9df12f4
> --- /dev/null
> +++ b/drivers/cpufreq/tegra194-cpufreq.c
> @@ -0,0 +1,423 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/cpufreq.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +
> +#include <asm/smp_plat.h>
> +
> +#include <soc/tegra/bpmp.h>
> +#include <soc/tegra/bpmp-abi.h>
> +
> +#define KHZ                     1000
> +#define REF_CLK_MHZ             408 /* 408 MHz */
> +#define US_DELAY                2000
> +#define US_DELAY_MIN            2
> +#define CPUFREQ_TBL_STEP_HZ     (50 * KHZ * KHZ)
> +#define MAX_CNT                 ~0U
> +
> +/* cpufreq transition latency */
> +#define TEGRA_CPUFREQ_TRANSITION_LATENCY (300 * 1000) /* unit in nanoseconds */
> +
> +enum cluster {
> +	CLUSTER0,
> +	CLUSTER1,
> +	CLUSTER2,
> +	CLUSTER3,

All these have same CPUs ? Or big little kind of stuff ? How come they
have different frequency tables ?

> +	MAX_CLUSTERS,
> +};
> +
> +struct tegra194_cpufreq_data {
> +	void __iomem *regs;
> +	size_t num_clusters;
> +	struct cpufreq_frequency_table **tables;
> +};
> +
> +static DEFINE_MUTEX(cpufreq_lock);
> +
> +struct tegra_cpu_ctr {
> +	u32 cpu;
> +	u32 delay;
> +	u32 coreclk_cnt, last_coreclk_cnt;
> +	u32 refclk_cnt, last_refclk_cnt;
> +};
> +
> +static struct workqueue_struct *read_counters_wq;
> +struct read_counters_work {
> +	struct work_struct work;
> +	struct tegra_cpu_ctr c;
> +};
> +
> +static enum cluster get_cpu_cluster(u8 cpu)
> +{
> +	return MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);
> +}
> +
> +/*
> + * Read per-core Read-only system register NVFREQ_FEEDBACK_EL1.
> + * The register provides frequency feedback information to
> + * determine the average actual frequency a core has run at over
> + * a period of time.
> + *	[31:0] PLLP counter: Counts at fixed frequency (408 MHz)
> + *	[63:32] Core clock counter: counts on every core clock cycle
> + *			where the core is architecturally clocking
> + */
> +static u64 read_freq_feedback(void)
> +{
> +	u64 val = 0;
> +
> +	asm volatile("mrs %0, s3_0_c15_c0_5" : "=r" (val) : );
> +
> +	return val;
> +}
> +
> +u16 map_freq_to_ndiv(struct mrq_cpu_ndiv_limits_response *nltbl, u32 freq)
> +{
> +	return DIV_ROUND_UP(freq * nltbl->pdiv * nltbl->mdiv,
> +			    nltbl->ref_clk_hz / KHZ);
> +}
> +
> +static inline u32 map_ndiv_to_freq(struct mrq_cpu_ndiv_limits_response
> +				   *nltbl, u16 ndiv)
> +{
> +	return nltbl->ref_clk_hz / KHZ * ndiv / (nltbl->pdiv * nltbl->mdiv);
> +}
> +
> +static void tegra_read_counters(struct work_struct *work)
> +{
> +	struct read_counters_work *read_counters_work;
> +	struct tegra_cpu_ctr *c;
> +	u64 val;
> +
> +	/*
> +	 * ref_clk_counter(32 bit counter) runs on constant clk,
> +	 * pll_p(408MHz).
> +	 * It will take = 2 ^ 32 / 408 MHz to overflow ref clk counter
> +	 *              = 10526880 usec = 10.527 sec to overflow
> +	 *
> +	 * Like wise core_clk_counter(32 bit counter) runs on core clock.
> +	 * It's synchronized to crab_clk (cpu_crab_clk) which runs at
> +	 * freq of cluster. Assuming max cluster clock ~2000MHz,
> +	 * It will take = 2 ^ 32 / 2000 MHz to overflow core clk counter
> +	 *              = ~2.147 sec to overflow
> +	 */
> +	read_counters_work = container_of(work, struct read_counters_work,
> +					  work);
> +	c = &read_counters_work->c;
> +
> +	val = read_freq_feedback();
> +	c->last_refclk_cnt = lower_32_bits(val);
> +	c->last_coreclk_cnt = upper_32_bits(val);
> +	udelay(c->delay);
> +	val = read_freq_feedback();
> +	c->refclk_cnt = lower_32_bits(val);
> +	c->coreclk_cnt = upper_32_bits(val);
> +}
> +
> +/*
> + * Return instantaneous cpu speed
> + * Instantaneous freq is calculated as -
> + * -Takes sample on every query of getting the freq.
> + *	- Read core and ref clock counters;
> + *	- Delay for X us
> + *	- Read above cycle counters again
> + *	- Calculates freq by subtracting current and previous counters
> + *	  divided by the delay time or eqv. of ref_clk_counter in delta time
> + *	- Return Kcycles/second, freq in KHz
> + *
> + *	delta time period = x sec
> + *			  = delta ref_clk_counter / (408 * 10^6) sec
> + *	freq in Hz = cycles/sec
> + *		   = (delta cycles / x sec
> + *		   = (delta cycles * 408 * 10^6) / delta ref_clk_counter
> + *	in KHz	   = (delta cycles * 408 * 10^3) / delta ref_clk_counter
> + *
> + * @cpu - logical cpu whose freq to be updated
> + * Returns freq in KHz on success, 0 if cpu is offline
> + */
> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
> +{
> +	struct read_counters_work read_counters_work;
> +	struct tegra_cpu_ctr c;
> +	u32 delta_refcnt;
> +	u32 delta_ccnt;
> +	u32 rate_mhz;
> +
> +	read_counters_work.c.cpu = cpu;
> +	read_counters_work.c.delay = delay;
> +	INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
> +	queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> +	flush_work(&read_counters_work.work);

Why can't this be done in current context ?

> +	c = read_counters_work.c;
> +
> +	if (c.coreclk_cnt < c.last_coreclk_cnt)
> +		delta_ccnt = c.coreclk_cnt + (MAX_CNT - c.last_coreclk_cnt);
> +	else
> +		delta_ccnt = c.coreclk_cnt - c.last_coreclk_cnt;
> +	if (!delta_ccnt)
> +		return 0;
> +
> +	/* ref clock is 32 bits */
> +	if (c.refclk_cnt < c.last_refclk_cnt)
> +		delta_refcnt = c.refclk_cnt + (MAX_CNT - c.last_refclk_cnt);
> +	else
> +		delta_refcnt = c.refclk_cnt - c.last_refclk_cnt;
> +	if (!delta_refcnt) {
> +		pr_debug("cpufreq: %d is idle, delta_refcnt: 0\n", cpu);
> +		return 0;
> +	}
> +	rate_mhz = ((unsigned long)(delta_ccnt * REF_CLK_MHZ)) / delta_refcnt;
> +
> +	return (rate_mhz * KHZ); /* in KHz */
> +}
> +
> +static unsigned int tegra194_get_speed(u32 cpu)
> +{
> +	return tegra194_get_speed_common(cpu, US_DELAY);
> +}
> +
> +static unsigned int tegra194_fast_get_speed(u32 cpu)
> +{
> +	return tegra194_get_speed_common(cpu, US_DELAY_MIN);

Why is this required specially here ? Why can't you work with normal
delay ?

> +}
> +
> +static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
> +{
> +	struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
> +	int cluster = get_cpu_cluster(policy->cpu);
> +
> +	if (cluster >= data->num_clusters)
> +		return -EINVAL;
> +
> +	policy->cur = tegra194_fast_get_speed(policy->cpu); /* boot freq */
> +
> +	/* set same policy for all cpus */
> +	cpumask_copy(policy->cpus, cpu_possible_mask);

You are copying cpu_possible_mask here, and so this routine will
get called only once.

I still don't understand the logic behind clusters and frequency
tables.

> +
> +	policy->freq_table = data->tables[cluster];
> +	policy->cpuinfo.transition_latency = TEGRA_CPUFREQ_TRANSITION_LATENCY;
> +
> +	return 0;
> +}
> +
> +static void set_cpu_ndiv(void *data)
> +{
> +	struct cpufreq_frequency_table *tbl = data;
> +	u64 ndiv_val = (u64)tbl->driver_data;
> +
> +	asm volatile("msr s3_0_c15_c0_4, %0" : : "r" (ndiv_val));
> +}
> +
> +static int tegra194_cpufreq_set_target(struct cpufreq_policy *policy,
> +				       unsigned int index)
> +{
> +	struct cpufreq_frequency_table *tbl = policy->freq_table + index;
> +	static struct cpufreq_freqs freqs;
> +
> +	mutex_lock(&cpufreq_lock);

No need of lock here.

> +	freqs.old = policy->cur;
> +	freqs.new = tbl->frequency;
> +
> +	cpufreq_freq_transition_begin(policy, &freqs);
> +	on_each_cpu_mask(policy->cpus, set_cpu_ndiv, tbl, true);

When CPUs share clock line, why is this required for every CPU ?

> +	cpufreq_freq_transition_end(policy, &freqs, 0);
> +
> +	mutex_unlock(&cpufreq_lock);
> +
> +	return 0;
> +}
> +
> +static struct cpufreq_driver tegra194_cpufreq_driver = {
> +	.name = "tegra194",
> +	.flags = CPUFREQ_STICKY | CPUFREQ_CONST_LOOPS |
> +		CPUFREQ_NEED_INITIAL_FREQ_CHECK | CPUFREQ_ASYNC_NOTIFICATION,

Why Async here ? I am really confused if I am not able to understand
the driver or you :)

> +	.verify = cpufreq_generic_frequency_table_verify,
> +	.target_index = tegra194_cpufreq_set_target,
> +	.get = tegra194_get_speed,
> +	.init = tegra194_cpufreq_init,
> +	.attr = cpufreq_generic_attr,
> +};
> +
> +static void tegra194_cpufreq_free_resources(void)
> +{
> +	flush_workqueue(read_counters_wq);
> +	destroy_workqueue(read_counters_wq);
> +}
> +
> +static struct cpufreq_frequency_table *init_freq_table

Don't break line here, rather break after above *.

> +		(struct platform_device *pdev, struct tegra_bpmp *bpmp,
> +		 unsigned int cluster_id)
> +{
> +	struct cpufreq_frequency_table *opp_table;

Please name it freq_table :)

> +	struct mrq_cpu_ndiv_limits_response resp;
> +	unsigned int num_freqs, ndiv, delta_ndiv;
> +	struct mrq_cpu_ndiv_limits_request req;
> +	struct tegra_bpmp_message msg;
> +	u16 freq_table_step_size;
> +	int err, index;
> +
> +	memset(&req, 0, sizeof(req));
> +	req.cluster_id = cluster_id;
> +
> +	memset(&msg, 0, sizeof(msg));
> +	msg.mrq = MRQ_CPU_NDIV_LIMITS;
> +	msg.tx.data = &req;
> +	msg.tx.size = sizeof(req);
> +	msg.rx.data = &resp;
> +	msg.rx.size = sizeof(resp);
> +
> +	err = tegra_bpmp_transfer(bpmp, &msg);
> +	if (err)
> +		return ERR_PTR(err);
> +
> +	/*
> +	 * Make sure frequency table step is a multiple of mdiv to match
> +	 * vhint table granularity.
> +	 */
> +	freq_table_step_size = resp.mdiv *
> +			DIV_ROUND_UP(CPUFREQ_TBL_STEP_HZ, resp.ref_clk_hz);
> +
> +	dev_dbg(&pdev->dev, "cluster %d: frequency table step size: %d\n",
> +		cluster_id, freq_table_step_size);
> +
> +	delta_ndiv = resp.ndiv_max - resp.ndiv_min;
> +
> +	if (unlikely(delta_ndiv == 0))
> +		num_freqs = 1;
> +	else
> +		/* We store both ndiv_min and ndiv_max hence the +1 */
> +		num_freqs = delta_ndiv / freq_table_step_size + 1;
> +
> +	num_freqs += (delta_ndiv % freq_table_step_size) ? 1 : 0;
> +
> +	opp_table = devm_kcalloc(&pdev->dev, num_freqs + 1, sizeof(*opp_table),
> +				 GFP_KERNEL);
> +	if (!opp_table)
> +		return ERR_PTR(-ENOMEM);
> +
> +	for (index = 0, ndiv = resp.ndiv_min;
> +			ndiv < resp.ndiv_max;
> +			index++, ndiv += freq_table_step_size) {
> +		opp_table[index].driver_data = ndiv;
> +		opp_table[index].frequency = map_ndiv_to_freq(&resp, ndiv);
> +	}
> +
> +	opp_table[index].driver_data = resp.ndiv_max;
> +	opp_table[index++].frequency = map_ndiv_to_freq(&resp, resp.ndiv_max);
> +	opp_table[index].frequency = CPUFREQ_TABLE_END;
> +
> +	return opp_table;
> +}
> +
> +static int tegra194_cpufreq_probe(struct platform_device *pdev)
> +{
> +	struct tegra194_cpufreq_data *data;
> +	struct tegra_bpmp *bpmp;
> +	int err, i;
> +
> +	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->num_clusters = MAX_CLUSTERS;
> +	data->tables = devm_kcalloc(&pdev->dev, data->num_clusters,
> +				    sizeof(*data->tables), GFP_KERNEL);
> +	if (!data->tables)
> +		return -ENOMEM;
> +
> +	platform_set_drvdata(pdev, data);
> +
> +	read_counters_wq = alloc_workqueue("read_counters_wq", __WQ_LEGACY, 1);
> +	if (!read_counters_wq) {
> +		dev_err(&pdev->dev, "fail to create_workqueue\n");
> +		return -EINVAL;
> +	}
> +
> +	bpmp = of_tegra_bpmp_get();
> +	if (IS_ERR(bpmp)) {
> +		err = PTR_ERR(bpmp);
> +		goto err_free_res;
> +	}
> +
> +	for (i = 0; i < data->num_clusters; i++) {
> +		data->tables[i] = init_freq_table(pdev, bpmp, i);
> +		if (IS_ERR(data->tables[i])) {
> +			err = PTR_ERR(data->tables[i]);
> +			goto put_bpmp;
> +		}
> +	}
> +
> +	tegra_bpmp_put(bpmp);
> +
> +	tegra194_cpufreq_driver.driver_data = data;
> +
> +	err = cpufreq_register_driver(&tegra194_cpufreq_driver);
> +	if (err)
> +		goto err_free_res;
> +
> +	return err;
> +
> +put_bpmp:
> +	tegra_bpmp_put(bpmp);
> +err_free_res:
> +	tegra194_cpufreq_free_resources();
> +	return err;
> +}
> +
> +static int tegra194_cpufreq_remove(struct platform_device *pdev)
> +{
> +	cpufreq_unregister_driver(&tegra194_cpufreq_driver);
> +	tegra194_cpufreq_free_resources();
> +
> +	return 0;
> +}
> +
> +static struct platform_driver tegra194_cpufreq_platform_driver = {
> +	.driver = {
> +		.name = "tegra194-cpufreq",
> +	},
> +	.probe = tegra194_cpufreq_probe,
> +	.remove = tegra194_cpufreq_remove,
> +};
> +
> +static int __init tegra_cpufreq_init(void)

I seem to be forgetting this, but should we use __init with modules or
not ?
Sumit Gupta April 4, 2020, 6:38 p.m. UTC | #11
On 26/03/20 5:20 PM, Viresh Kumar wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 03-12-19, 23:02, Sumit Gupta wrote:
>> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
>> new file mode 100644
>> index 0000000..9df12f4
>> --- /dev/null
>> +++ b/drivers/cpufreq/tegra194-cpufreq.c
>> @@ -0,0 +1,423 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved
>> + */
>> +
>> +#include <linux/cpu.h>
>> +#include <linux/cpufreq.h>
>> +#include <linux/delay.h>
>> +#include <linux/dma-mapping.h>
>> +#include <linux/module.h>
>> +#include <linux/of.h>
>> +#include <linux/of_platform.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/slab.h>
>> +
>> +#include <asm/smp_plat.h>
>> +
>> +#include <soc/tegra/bpmp.h>
>> +#include <soc/tegra/bpmp-abi.h>
>> +
>> +#define KHZ                     1000
>> +#define REF_CLK_MHZ             408 /* 408 MHz */
>> +#define US_DELAY                2000
>> +#define US_DELAY_MIN            2
>> +#define CPUFREQ_TBL_STEP_HZ     (50 * KHZ * KHZ)
>> +#define MAX_CNT                 ~0U
>> +
>> +/* cpufreq transition latency */
>> +#define TEGRA_CPUFREQ_TRANSITION_LATENCY (300 * 1000) /* unit in nanoseconds */
>> +
>> +enum cluster {
>> +     CLUSTER0,
>> +     CLUSTER1,
>> +     CLUSTER2,
>> +     CLUSTER3,
> 
> All these have same CPUs ? Or big little kind of stuff ? How come they
> have different frequency tables ?
> 
T194 SOC has a homogeneous architecture where each cluster has two
symmetric Carmel cores, not big-little. LUTs are per cluster and all
LUTs have the same values currently. Future SOCs may have different
LUT values per cluster.

>> +     MAX_CLUSTERS,
>> +};
>> +
>> +struct tegra194_cpufreq_data {
>> +     void __iomem *regs;
>> +     size_t num_clusters;
>> +     struct cpufreq_frequency_table **tables;
>> +};
>> +
>> +static DEFINE_MUTEX(cpufreq_lock);
>> +
>> +struct tegra_cpu_ctr {
>> +     u32 cpu;
>> +     u32 delay;
>> +     u32 coreclk_cnt, last_coreclk_cnt;
>> +     u32 refclk_cnt, last_refclk_cnt;
>> +};
>> +
>> +static struct workqueue_struct *read_counters_wq;
>> +struct read_counters_work {
>> +     struct work_struct work;
>> +     struct tegra_cpu_ctr c;
>> +};
>> +
>> +static enum cluster get_cpu_cluster(u8 cpu)
>> +{
>> +     return MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);
>> +}
>> +
>> +/*
>> + * Read per-core Read-only system register NVFREQ_FEEDBACK_EL1.
>> + * The register provides frequency feedback information to
>> + * determine the average actual frequency a core has run at over
>> + * a period of time.
>> + *   [31:0] PLLP counter: Counts at fixed frequency (408 MHz)
>> + *   [63:32] Core clock counter: counts on every core clock cycle
>> + *                   where the core is architecturally clocking
>> + */
>> +static u64 read_freq_feedback(void)
>> +{
>> +     u64 val = 0;
>> +
>> +     asm volatile("mrs %0, s3_0_c15_c0_5" : "=r" (val) : );
>> +
>> +     return val;
>> +}
>> +
>> +u16 map_freq_to_ndiv(struct mrq_cpu_ndiv_limits_response *nltbl, u32 freq)
>> +{
>> +     return DIV_ROUND_UP(freq * nltbl->pdiv * nltbl->mdiv,
>> +                         nltbl->ref_clk_hz / KHZ);
>> +}
>> +
>> +static inline u32 map_ndiv_to_freq(struct mrq_cpu_ndiv_limits_response
>> +                                *nltbl, u16 ndiv)
>> +{
>> +     return nltbl->ref_clk_hz / KHZ * ndiv / (nltbl->pdiv * nltbl->mdiv);
>> +}
>> +
>> +static void tegra_read_counters(struct work_struct *work)
>> +{
>> +     struct read_counters_work *read_counters_work;
>> +     struct tegra_cpu_ctr *c;
>> +     u64 val;
>> +
>> +     /*
>> +      * ref_clk_counter(32 bit counter) runs on constant clk,
>> +      * pll_p(408MHz).
>> +      * It will take = 2 ^ 32 / 408 MHz to overflow ref clk counter
>> +      *              = 10526880 usec = 10.527 sec to overflow
>> +      *
>> +      * Like wise core_clk_counter(32 bit counter) runs on core clock.
>> +      * It's synchronized to crab_clk (cpu_crab_clk) which runs at
>> +      * freq of cluster. Assuming max cluster clock ~2000MHz,
>> +      * It will take = 2 ^ 32 / 2000 MHz to overflow core clk counter
>> +      *              = ~2.147 sec to overflow
>> +      */
>> +     read_counters_work = container_of(work, struct read_counters_work,
>> +                                       work);
>> +     c = &read_counters_work->c;
>> +
>> +     val = read_freq_feedback();
>> +     c->last_refclk_cnt = lower_32_bits(val);
>> +     c->last_coreclk_cnt = upper_32_bits(val);
>> +     udelay(c->delay);
>> +     val = read_freq_feedback();
>> +     c->refclk_cnt = lower_32_bits(val);
>> +     c->coreclk_cnt = upper_32_bits(val);
>> +}
>> +
>> +/*
>> + * Return instantaneous cpu speed
>> + * Instantaneous freq is calculated as -
>> + * -Takes sample on every query of getting the freq.
>> + *   - Read core and ref clock counters;
>> + *   - Delay for X us
>> + *   - Read above cycle counters again
>> + *   - Calculates freq by subtracting current and previous counters
>> + *     divided by the delay time or eqv. of ref_clk_counter in delta time
>> + *   - Return Kcycles/second, freq in KHz
>> + *
>> + *   delta time period = x sec
>> + *                     = delta ref_clk_counter / (408 * 10^6) sec
>> + *   freq in Hz = cycles/sec
>> + *              = delta cycles / x sec
>> + *              = (delta cycles * 408 * 10^6) / delta ref_clk_counter
>> + *   in KHz     = (delta cycles * 408 * 10^3) / delta ref_clk_counter
>> + *
>> + * @cpu - logical cpu whose freq to be updated
>> + * Returns freq in KHz on success, 0 if cpu is offline
>> + */
>> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
>> +{
>> +     struct read_counters_work read_counters_work;
>> +     struct tegra_cpu_ctr c;
>> +     u32 delta_refcnt;
>> +     u32 delta_ccnt;
>> +     u32 rate_mhz;
>> +
>> +     read_counters_work.c.cpu = cpu;
>> +     read_counters_work.c.delay = delay;
>> +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
>> +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
>> +     flush_work(&read_counters_work.work);
> 
> Why can't this be done in current context ?
> 
We used work queue instead of smp_call_function_single() to have long delay.
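
For comparison, a sketch of the smp_call_function_single() variant
(reusing read_freq_feedback() and struct tegra_cpu_ctr from this
patch): the callback runs with interrupts disabled on the target CPU,
so a long udelay() there would stall that CPU with irqs off for the
whole observation window.

	static void tegra_read_counters_ipi(void *data)
	{
		struct tegra_cpu_ctr *c = data;
		u64 val;

		val = read_freq_feedback();
		c->last_refclk_cnt = lower_32_bits(val);
		c->last_coreclk_cnt = upper_32_bits(val);
		udelay(c->delay);	/* busy-wait with irqs off here */
		val = read_freq_feedback();
		c->refclk_cnt = lower_32_bits(val);
		c->coreclk_cnt = upper_32_bits(val);
	}

	/* caller side, instead of queue_work_on() + flush_work() */
	smp_call_function_single(cpu, tegra_read_counters_ipi, &c, true);

The workqueue keeps the busy-wait in process context pinned to the
target CPU, where interrupts stay enabled.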

>> +     c = read_counters_work.c;
>> +
>> +     if (c.coreclk_cnt < c.last_coreclk_cnt)
>> +             delta_ccnt = c.coreclk_cnt + (MAX_CNT - c.last_coreclk_cnt);
>> +     else
>> +             delta_ccnt = c.coreclk_cnt - c.last_coreclk_cnt;
>> +     if (!delta_ccnt)
>> +             return 0;
>> +
>> +     /* ref clock is 32 bits */
>> +     if (c.refclk_cnt < c.last_refclk_cnt)
>> +             delta_refcnt = c.refclk_cnt + (MAX_CNT - c.last_refclk_cnt);
>> +     else
>> +             delta_refcnt = c.refclk_cnt - c.last_refclk_cnt;
>> +     if (!delta_refcnt) {
>> +             pr_debug("cpufreq: %d is idle, delta_refcnt: 0\n", cpu);
>> +             return 0;
>> +     }
>> +     rate_mhz = ((unsigned long)(delta_ccnt * REF_CLK_MHZ)) / delta_refcnt;
>> +
>> +     return (rate_mhz * KHZ); /* in KHz */
>> +}
>> +
>> +static unsigned int tegra194_get_speed(u32 cpu)
>> +{
>> +     return tegra194_get_speed_common(cpu, US_DELAY);
>> +}
>> +
>> +static unsigned int tegra194_fast_get_speed(u32 cpu)
>> +{
>> +     return tegra194_get_speed_common(cpu, US_DELAY_MIN);
> 
> Why is this required specially here ? Why can't you work with normal
> delay ?
> 
A smaller delay value is used during init to reduce CPU boot time.

>> +}
>> +
>> +static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
>> +{
>> +     struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
>> +     int cluster = get_cpu_cluster(policy->cpu);
>> +
>> +     if (cluster >= data->num_clusters)
>> +             return -EINVAL;
>> +
>> +     policy->cur = tegra194_fast_get_speed(policy->cpu); /* boot freq */
>> +
>> +     /* set same policy for all cpus */
>> +     cpumask_copy(policy->cpus, cpu_possible_mask);
> 
> > You are copying cpu_possible_mask here, and so this routine will
> get called only once.
> 
> I still don't understand the logic behind clusters and frequency
> tables.
> 
Currently, we use same policy for all CPU's to maximize throughput. Will 
add separate patch later to set policy as per cluster. But we are not 
using that in T194 due to perf reasons.

>> +
>> +     policy->freq_table = data->tables[cluster];
>> +     policy->cpuinfo.transition_latency = TEGRA_CPUFREQ_TRANSITION_LATENCY;
>> +
>> +     return 0;
>> +}
>> +
>> +static void set_cpu_ndiv(void *data)
>> +{
>> +     struct cpufreq_frequency_table *tbl = data;
>> +     u64 ndiv_val = (u64)tbl->driver_data;
>> +
>> +     asm volatile("msr s3_0_c15_c0_4, %0" : : "r" (ndiv_val));
>> +}
>> +
>> +static int tegra194_cpufreq_set_target(struct cpufreq_policy *policy,
>> +                                    unsigned int index)
>> +{
>> +     struct cpufreq_frequency_table *tbl = policy->freq_table + index;
>> +     static struct cpufreq_freqs freqs;
>> +
>> +     mutex_lock(&cpufreq_lock);
> 
> No need of lock here.
>
Done

>> +     freqs.old = policy->cur;
>> +     freqs.new = tbl->frequency;
>> +
>> +     cpufreq_freq_transition_begin(policy, &freqs);
>> +     on_each_cpu_mask(policy->cpus, set_cpu_ndiv, tbl, true);
> 
> When CPUs share clock line, why is this required for every CPU ?

Per core NVFREQ_REQ system register is written to make frequency
requests for the core. Cluster h/w then forwards the max(core0, core1)
request to cluster NAFLL.

>> +     cpufreq_freq_transition_end(policy, &freqs, 0);
>> +
>> +     mutex_unlock(&cpufreq_lock);
>> +
>> +     return 0;
>> +}
>> +
>> +static struct cpufreq_driver tegra194_cpufreq_driver = {
>> +     .name = "tegra194",
>> +     .flags = CPUFREQ_STICKY | CPUFREQ_CONST_LOOPS |
>> +             CPUFREQ_NEED_INITIAL_FREQ_CHECK | CPUFREQ_ASYNC_NOTIFICATION,
> 
> Why Async here ? I am really confused if I am not able to understand
> the driver or you :)
> 
Have removed the flag as it was not required. Thank you for pointing.

>> +     .verify = cpufreq_generic_frequency_table_verify,
>> +     .target_index = tegra194_cpufreq_set_target,
>> +     .get = tegra194_get_speed,
>> +     .init = tegra194_cpufreq_init,
>> +     .attr = cpufreq_generic_attr,
>> +};
>> +
>> +static void tegra194_cpufreq_free_resources(void)
>> +{
>> +     flush_workqueue(read_counters_wq);
>> +     destroy_workqueue(read_counters_wq);
>> +}
>> +
>> +static struct cpufreq_frequency_table *init_freq_table
> 
> Don't break line here, rather break after above *.
> 
Done

>> +             (struct platform_device *pdev, struct tegra_bpmp *bpmp,
>> +              unsigned int cluster_id)
>> +{
>> +     struct cpufreq_frequency_table *opp_table;
> 
> Please name it freq_table :)
> 
Done

>> +     struct mrq_cpu_ndiv_limits_response resp;
>> +     unsigned int num_freqs, ndiv, delta_ndiv;
>> +     struct mrq_cpu_ndiv_limits_request req;
>> +     struct tegra_bpmp_message msg;
>> +     u16 freq_table_step_size;
>> +     int err, index;
>> +
>> +     memset(&req, 0, sizeof(req));
>> +     req.cluster_id = cluster_id;
>> +
>> +     memset(&msg, 0, sizeof(msg));
>> +     msg.mrq = MRQ_CPU_NDIV_LIMITS;
>> +     msg.tx.data = &req;
>> +     msg.tx.size = sizeof(req);
>> +     msg.rx.data = &resp;
>> +     msg.rx.size = sizeof(resp);
>> +
>> +     err = tegra_bpmp_transfer(bpmp, &msg);
>> +     if (err)
>> +             return ERR_PTR(err);
>> +
>> +     /*
>> +      * Make sure frequency table step is a multiple of mdiv to match
>> +      * vhint table granularity.
>> +      */
>> +     freq_table_step_size = resp.mdiv *
>> +                     DIV_ROUND_UP(CPUFREQ_TBL_STEP_HZ, resp.ref_clk_hz);
>> +
>> +     dev_dbg(&pdev->dev, "cluster %d: frequency table step size: %d\n",
>> +             cluster_id, freq_table_step_size);
>> +
>> +     delta_ndiv = resp.ndiv_max - resp.ndiv_min;
>> +
>> +     if (unlikely(delta_ndiv == 0))
>> +             num_freqs = 1;
>> +     else
>> +             /* We store both ndiv_min and ndiv_max hence the +1 */
>> +             num_freqs = delta_ndiv / freq_table_step_size + 1;
>> +
>> +     num_freqs += (delta_ndiv % freq_table_step_size) ? 1 : 0;
>> +
>> +     opp_table = devm_kcalloc(&pdev->dev, num_freqs + 1, sizeof(*opp_table),
>> +                              GFP_KERNEL);
>> +     if (!opp_table)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     for (index = 0, ndiv = resp.ndiv_min;
>> +                     ndiv < resp.ndiv_max;
>> +                     index++, ndiv += freq_table_step_size) {
>> +             opp_table[index].driver_data = ndiv;
>> +             opp_table[index].frequency = map_ndiv_to_freq(&resp, ndiv);
>> +     }
>> +
>> +     opp_table[index].driver_data = resp.ndiv_max;
>> +     opp_table[index++].frequency = map_ndiv_to_freq(&resp, resp.ndiv_max);
>> +     opp_table[index].frequency = CPUFREQ_TABLE_END;
>> +
>> +     return opp_table;
>> +}
>> +
>> +static int tegra194_cpufreq_probe(struct platform_device *pdev)
>> +{
>> +     struct tegra194_cpufreq_data *data;
>> +     struct tegra_bpmp *bpmp;
>> +     int err, i;
>> +
>> +     data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
>> +     if (!data)
>> +             return -ENOMEM;
>> +
>> +     data->num_clusters = MAX_CLUSTERS;
>> +     data->tables = devm_kcalloc(&pdev->dev, data->num_clusters,
>> +                                 sizeof(*data->tables), GFP_KERNEL);
>> +     if (!data->tables)
>> +             return -ENOMEM;
>> +
>> +     platform_set_drvdata(pdev, data);
>> +
>> +     read_counters_wq = alloc_workqueue("read_counters_wq", __WQ_LEGACY, 1);
>> +     if (!read_counters_wq) {
>> +             dev_err(&pdev->dev, "fail to create_workqueue\n");
>> +             return -EINVAL;
>> +     }
>> +
>> +     bpmp = of_tegra_bpmp_get();
>> +     if (IS_ERR(bpmp)) {
>> +             err = PTR_ERR(bpmp);
>> +             goto err_free_res;
>> +     }
>> +
>> +     for (i = 0; i < data->num_clusters; i++) {
>> +             data->tables[i] = init_freq_table(pdev, bpmp, i);
>> +             if (IS_ERR(data->tables[i])) {
>> +                     err = PTR_ERR(data->tables[i]);
>> +                     goto put_bpmp;
>> +             }
>> +     }
>> +
>> +     tegra_bpmp_put(bpmp);
>> +
>> +     tegra194_cpufreq_driver.driver_data = data;
>> +
>> +     err = cpufreq_register_driver(&tegra194_cpufreq_driver);
>> +     if (err)
>> +             goto err_free_res;
>> +
>> +     return err;
>> +
>> +put_bpmp:
>> +     tegra_bpmp_put(bpmp);
>> +err_free_res:
>> +     tegra194_cpufreq_free_resources();
>> +     return err;
>> +}
>> +
>> +static int tegra194_cpufreq_remove(struct platform_device *pdev)
>> +{
>> +     cpufreq_unregister_driver(&tegra194_cpufreq_driver);
>> +     tegra194_cpufreq_free_resources();
>> +
>> +     return 0;
>> +}
>> +
>> +static struct platform_driver tegra194_cpufreq_platform_driver = {
>> +     .driver = {
>> +             .name = "tegra194-cpufreq",
>> +     },
>> +     .probe = tegra194_cpufreq_probe,
>> +     .remove = tegra194_cpufreq_remove,
>> +};
>> +
>> +static int __init tegra_cpufreq_init(void)
> 
> I seem to be forgetting this, but should we use __init with modules or
> not ?
> 
Yes, __init can be used with modules too; the module's init sections
are freed once loading completes.

Thankyou for the review. I will send v2 with suggested changes.

> --
> viresh
>
Viresh Kumar April 6, 2020, 2:55 a.m. UTC | #12
On 05-04-20, 00:08, sumitg wrote:
> 
> 
> On 26/03/20 5:20 PM, Viresh Kumar wrote:
> > On 03-12-19, 23:02, Sumit Gupta wrote:
> > > diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> > > +enum cluster {
> > > +     CLUSTER0,
> > > +     CLUSTER1,
> > > +     CLUSTER2,
> > > +     CLUSTER3,
> > 
> > All these have same CPUs ? Or big little kind of stuff ? How come they
> > have different frequency tables ?
> > 
> T194 SOC has a homogeneous architecture where each cluster has two
> symmetric Carmel cores, not big-little. LUTs are per cluster and all
> LUTs have the same values currently. Future SOCs may have different
> LUT values per cluster.

LUT ?

> > > +     MAX_CLUSTERS,
> > > +};

> > > +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
> > > +{
> > > +     struct read_counters_work read_counters_work;
> > > +     struct tegra_cpu_ctr c;
> > > +     u32 delta_refcnt;
> > > +     u32 delta_ccnt;
> > > +     u32 rate_mhz;
> > > +
> > > +     read_counters_work.c.cpu = cpu;
> > > +     read_counters_work.c.delay = delay;
> > > +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
> > > +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> > > +     flush_work(&read_counters_work.work);
> > 
> > Why can't this be done in current context ?
> > 
> We used work queue instead of smp_call_function_single() to have long delay.

Please explain completely, you have raised more questions than you
answered :)

Why do you want to have long delays ?

> > > +static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
> > > +{
> > > +     struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
> > > +     int cluster = get_cpu_cluster(policy->cpu);
> > > +
> > > +     if (cluster >= data->num_clusters)
> > > +             return -EINVAL;
> > > +
> > > +     policy->cur = tegra194_fast_get_speed(policy->cpu); /* boot freq */
> > > +
> > > +     /* set same policy for all cpus */
> > > +     cpumask_copy(policy->cpus, cpu_possible_mask);
> > 
> > You are copying cpu_possible_mask here, and so this routine will
> > get called only once.
> > 
> > I still don't understand the logic behind clusters and frequency
> > tables.
> > 
> Currently, we use same policy for all CPU's to maximize throughput. Will add
> separate patch later to set policy as per cluster. But we are not using that
> in T194 due to perf reasons.

You can't misrepresent the hardware this way, sorry.

Again few questions, I understand that you have multiple clusters
here:

- All clusters will always have the frequency table ?
- All clusters are capable of having a different frequency at any
  point of time ? Or they will always run at same freq ?

> > > +     freqs.old = policy->cur;
> > > +     freqs.new = tbl->frequency;
> > > +
> > > +     cpufreq_freq_transition_begin(policy, &freqs);
> > > +     on_each_cpu_mask(policy->cpus, set_cpu_ndiv, tbl, true);
> > 
> > When CPUs share clock line, why is this required for every CPU ?
> Per core NVFREQ_REQ system register is written to make frequency
> requests for the core. Cluster h/w then forwards the max(core0, core1)
> request to cluster NAFLL.

You mean that each cluster can set frequency independently ?

If all the clusters can run at independent frequencies but you still
want them to run at same frequency, then that can be done with a
single set of governor tunables, which is the default anyway. So this
should just work for you.

I will not be reviewing the v2 version you sent for now as that is
most likely wrong anyway.
Sumit Gupta April 7, 2020, 6:18 p.m. UTC | #13
On 06/04/20 8:25 AM, Viresh Kumar wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 05-04-20, 00:08, sumitg wrote:
>>
>>
>> On 26/03/20 5:20 PM, Viresh Kumar wrote:
>>> On 03-12-19, 23:02, Sumit Gupta wrote:
>>>> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
>>>> +enum cluster {
>>>> +     CLUSTER0,
>>>> +     CLUSTER1,
>>>> +     CLUSTER2,
>>>> +     CLUSTER3,
>>>
>>> All these have same CPUs ? Or big little kind of stuff ? How come they
>>> have different frequency tables ?
>>>
>> T194 SOC has a homogeneous architecture where each cluster has two
>> symmetric Carmel cores, not big-little. LUTs are per cluster and all
>> LUTs have the same values currently. Future SOCs may have different
>> LUT values per cluster.
> 
> LUT ?
>
LUT is "Lookup Table" which provides "hardware frequency request" as a 
function of voltage. For each frequency, the table can have multiple 
voltages based on temperature. The table is maintained in "BPMP R5".

>>>> +     MAX_CLUSTERS,
>>>> +};
> 
>>>> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
>>>> +{
>>>> +     struct read_counters_work read_counters_work;
>>>> +     struct tegra_cpu_ctr c;
>>>> +     u32 delta_refcnt;
>>>> +     u32 delta_ccnt;
>>>> +     u32 rate_mhz;
>>>> +
>>>> +     read_counters_work.c.cpu = cpu;
>>>> +     read_counters_work.c.delay = delay;
>>>> +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
>>>> +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
>>>> +     flush_work(&read_counters_work.work);
>>>
>>> Why can't this be done in current context ?
>>>
>> We used work queue instead of smp_call_function_single() to have long delay.
> 
> Please explain completely, you have raised more questions than you
> answered :)
> 
> Why do you want to have long delays ?
> 
Long delay value is used to have the observation window long enough for 
correctly reconstructing the CPU frequency considering noise.
In next patch version, changed delay value to 500us which in our tests 
is considered reliable.
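
For example, with the 500us window and illustrative counter deltas:

	delta_refcnt = 500e-6 sec * 408 MHz = 204000 counts
	delta_ccnt   = 1020000 counts (measured)
	rate_mhz     = delta_ccnt * 408 / delta_refcnt
	             = 1020000 * 408 / 204000 = 2040 MHz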

>>>> +static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
>>>> +{
>>>> +     struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
>>>> +     int cluster = get_cpu_cluster(policy->cpu);
>>>> +
>>>> +     if (cluster >= data->num_clusters)
>>>> +             return -EINVAL;
>>>> +
>>>> +     policy->cur = tegra194_fast_get_speed(policy->cpu); /* boot freq */
>>>> +
>>>> +     /* set same policy for all cpus */
>>>> +     cpumask_copy(policy->cpus, cpu_possible_mask);
>>>
>>> You are copying cpu_possible_mask here, and so this routine will
>>> get called only once.
>>>
>>> I still don't understand the logic behind clusters and frequency
>>> tables.
>>>
>> Currently, we use same policy for all CPU's to maximize throughput. Will add
>> separate patch later to set policy as per cluster. But we are not using that
>> in T194 due to perf reasons.
> 
> You can't misrepresent the hardware this way, sorry.
> 
OK. Updated the policy to be per-cluster now.

> Again few questions, I understand that you have multiple clusters
> here:
> 
> - All clusters will always have the frequency table ?
Yes, frequency table is per cluster.

> - All clusters are capable of having a different frequency at any
>    point of time ? Or they will always run at same freq ?
>
Yes, all clusters are capable of running at different frequencies.

>>>> +     freqs.old = policy->cur;
>>>> +     freqs.new = tbl->frequency;
>>>> +
>>>> +     cpufreq_freq_transition_begin(policy, &freqs);
>>>> +     on_each_cpu_mask(policy->cpus, set_cpu_ndiv, tbl, true);
>>>
>>> When CPUs share clock line, why is this required for every CPU ?
>> Per core NVFREQ_REQ system register is written to make frequency
>> requests for the core. Cluster h/w then forwards the max(core0, core1)
>> request to cluster NAFLL.
> 
> You mean that each cluster can set frequency independently ?
> 
Yes.

> If all the clusters can run at independent frequencies but you still
> want them to run at same frequency, then that can be done with a
> single set of governor tunables, which is the default anyway. So this
> should just work for you.
> 
> I will not be reviewing the v2 version you sent for now as that is
> most likely wrong anyway.
> 
> --
> viresh
>
Viresh Kumar April 8, 2020, 5:53 a.m. UTC | #14
On 07-04-20, 23:48, sumitg wrote:
> On 06/04/20 8:25 AM, Viresh Kumar wrote:
> > On 05-04-20, 00:08, sumitg wrote:
> > > On 26/03/20 5:20 PM, Viresh Kumar wrote:
> > > > On 03-12-19, 23:02, Sumit Gupta wrote:
> > > > > diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> > > > > +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
> > > > > +{
> > > > > +     struct read_counters_work read_counters_work;
> > > > > +     struct tegra_cpu_ctr c;
> > > > > +     u32 delta_refcnt;
> > > > > +     u32 delta_ccnt;
> > > > > +     u32 rate_mhz;
> > > > > +
> > > > > +     read_counters_work.c.cpu = cpu;
> > > > > +     read_counters_work.c.delay = delay;
> > > > > +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
> > > > > +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> > > > > +     flush_work(&read_counters_work.work);
> > > > 
> > > > Why can't this be done in current context ?
> > > > 
> > > We used work queue instead of smp_call_function_single() to have long delay.
> > 
> > Please explain completely, you have raised more questions than you
> > answered :)
> > 
> > Why do you want to have long delays ?
> > 
> Long delay value is used to have the observation window long enough for
> correctly reconstructing the CPU frequency considering noise.
> In next patch version, changed delay value to 500us which in our tests is
> considered reliable.

I understand that you need to put a udelay() while reading the freq from
hardware, that is fine, but why do you need a workqueue for that? Why can't you
just read the values directly from the same context ?
Sumit Gupta April 8, 2020, 11:24 a.m. UTC | #15
On 08/04/20 11:23 AM, Viresh Kumar wrote:
> On 07-04-20, 23:48, sumitg wrote:
>> On 06/04/20 8:25 AM, Viresh Kumar wrote:
>>> On 05-04-20, 00:08, sumitg wrote:
>>>> On 26/03/20 5:20 PM, Viresh Kumar wrote:
>>>>> On 03-12-19, 23:02, Sumit Gupta wrote:
>>>>>> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
>>>>>> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
>>>>>> +{
>>>>>> +     struct read_counters_work read_counters_work;
>>>>>> +     struct tegra_cpu_ctr c;
>>>>>> +     u32 delta_refcnt;
>>>>>> +     u32 delta_ccnt;
>>>>>> +     u32 rate_mhz;
>>>>>> +
>>>>>> +     read_counters_work.c.cpu = cpu;
>>>>>> +     read_counters_work.c.delay = delay;
>>>>>> +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
>>>>>> +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
>>>>>> +     flush_work(&read_counters_work.work);
>>>>>
>>>>> Why can't this be done in current context ?
>>>>>
>>>> We used work queue instead of smp_call_function_single() to have long delay.
>>>
>>> Please explain completely, you have raised more questions than you
>>> answered :)
>>>
>>> Why do you want to have long delays ?
>>>
>> Long delay value is used to have the observation window long enough for
>> correctly reconstructing the CPU frequency considering noise.
>> In next patch version, changed delay value to 500us which in our tests is
>> considered reliable.
> 
> I understand that you need to put a udelay() while reading the freq from
> hardware, that is fine, but why do you need a workqueue for that? Why can't you
> just read the values directly from the same context ?
> 
The register to read the frequency is per-core and not accessible to
other cores. So, we have to execute the function remotely, as the target
core whose frequency is to be read might be different from the current
one. The functions available for that are smp_call_function_single() and
queue_work_on(). We used queue_work_on() to avoid a long delay inside
IPI interrupt context with interrupts disabled.
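
A minimal sketch of the two options being discussed; here
tegra_read_counters_ipi() is a hypothetical smp_call_func_t variant of
the work handler:

/* Option 1: runs on the target CPU in IPI context, so the whole
 * observation window (the udelay) executes with interrupts disabled. */
smp_call_function_single(cpu, tegra_read_counters_ipi, &c, 1);

/* Option 2: runs in a kworker bound to the target CPU, in process
 * context, so interrupts stay enabled during the delay. */
queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
flush_work(&read_counters_work.work);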

> --
> viresh
>
Viresh Kumar April 9, 2020, 7:44 a.m. UTC | #16
On 08-04-20, 16:54, sumitg wrote:
> 
> 
> On 08/04/20 11:23 AM, Viresh Kumar wrote:
> > On 07-04-20, 23:48, sumitg wrote:
> > > On 06/04/20 8:25 AM, Viresh Kumar wrote:
> > > > On 05-04-20, 00:08, sumitg wrote:
> > > > > On 26/03/20 5:20 PM, Viresh Kumar wrote:
> > > > > > On 03-12-19, 23:02, Sumit Gupta wrote:
> > > > > > > diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
> > > > > > > +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
> > > > > > > +{
> > > > > > > +     struct read_counters_work read_counters_work;
> > > > > > > +     struct tegra_cpu_ctr c;
> > > > > > > +     u32 delta_refcnt;
> > > > > > > +     u32 delta_ccnt;
> > > > > > > +     u32 rate_mhz;
> > > > > > > +
> > > > > > > +     read_counters_work.c.cpu = cpu;
> > > > > > > +     read_counters_work.c.delay = delay;
> > > > > > > +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);

Initialize the work only once from init routine.

> > > > > > > +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> > > > > > > +     flush_work(&read_counters_work.work);
> > > > > > 
> > > > > > Why can't this be done in current context ?
> > > > > > 
> > > > > We used work queue instead of smp_call_function_single() to have long delay.
> > > > 
> > > > Please explain completely, you have raised more questions than you
> > > > answered :)
> > > > 
> > > > Why do you want to have long delays ?
> > > > 
> > > Long delay value is used to have the observation window long enough for
> > > correctly reconstructing the CPU frequency considering noise.
> > > In next patch version, changed delay value to 500us which in our tests is
> > > considered reliable.
> > 
> > I understand that you need to put a udelay() while reading the freq from
> > hardware, that is fine, but why do you need a workqueue for that? Why can't you
> > just read the values directly from the same context ?
> > 
> The register to read frequency is per core and not accessible to other
> cores. So, we have to execute the function remotely as the target core to
> read frequency might be different from current.
> The functions for that are smp_call_function_single or queue_work_on.
> We used queue_work_on() to avoid long delay inside ipi interrupt context
> with interrupts disabled.

Okay, I understand this now, finally :)

But if the interrupts are disabled during some call, won't workqueues face the
same problem ?
Sumit Gupta April 9, 2020, 11:21 a.m. UTC | #17
On 09/04/20 1:14 PM, Viresh Kumar wrote:
> On 08-04-20, 16:54, sumitg wrote:
>>
>>
>> On 08/04/20 11:23 AM, Viresh Kumar wrote:
>>> On 07-04-20, 23:48, sumitg wrote:
>>>> On 06/04/20 8:25 AM, Viresh Kumar wrote:
>>>>> On 05-04-20, 00:08, sumitg wrote:
>>>>>> On 26/03/20 5:20 PM, Viresh Kumar wrote:
>>>>>>> On 03-12-19, 23:02, Sumit Gupta wrote:
>>>>>>>> diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
>>>>>>>> +static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
>>>>>>>> +{
>>>>>>>> +     struct read_counters_work read_counters_work;
>>>>>>>> +     struct tegra_cpu_ctr c;
>>>>>>>> +     u32 delta_refcnt;
>>>>>>>> +     u32 delta_ccnt;
>>>>>>>> +     u32 rate_mhz;
>>>>>>>> +
>>>>>>>> +     read_counters_work.c.cpu = cpu;
>>>>>>>> +     read_counters_work.c.delay = delay;
>>>>>>>> +     INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
> 
> Initialize the work only once from init routine.
> 
We are using "read_counters_work" as a local variable, so every
invocation of the function has its own copy of the counters for the
corresponding CPU. That's why we are doing INIT_WORK_ONSTACK here.
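
A sketch of the on-stack work pattern being described, assuming
concurrent callers must not share counter state;
destroy_work_on_stack() is the usual pairing for INIT_WORK_ONSTACK():

struct read_counters_work w;	/* one private instance per caller */

w.c.cpu = cpu;
w.c.delay = delay;
INIT_WORK_ONSTACK(&w.work, tegra_read_counters);
queue_work_on(cpu, read_counters_wq, &w.work);
flush_work(&w.work);		/* w.c is stable only after this */
destroy_work_on_stack(&w.work);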

>>>>>>>> +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
>>>>>>>> +     flush_work(&read_counters_work.work);
>>>>>>>
>>>>>>> Why can't this be done in current context ?
>>>>>>>
>>>>>> We used work queue instead of smp_call_function_single() to have long delay.
>>>>>
>>>>> Please explain completely, you have raised more questions than you
>>>>> answered :)
>>>>>
>>>>> Why do you want to have long delays ?
>>>>>
>>>> Long delay value is used to have the observation window long enough for
>>>> correctly reconstructing the CPU frequency considering noise.
>>>> In next patch version, changed delay value to 500us which in our tests is
>>>> considered reliable.
>>>
>>> I understand that you need to put a udelay() while reading the freq from
>>> hardware, that is fine, but why do you need a workqueue for that? Why can't you
>>> just read the values directly from the same context ?
>>>
>> The register to read frequency is per core and not accessible to other
>> cores. So, we have to execute the function remotely as the target core to
>> read frequency might be different from current.
>> The functions for that are smp_call_function_single or queue_work_on.
>> We used queue_work_on() to avoid long delay inside ipi interrupt context
>> with interrupts disabled.
> 
> Okay, I understand this now, finally :)
> 
> But if the interrupts are disabled during some call, won't workqueues face the
> same problem ?
> 
Yes, we are trying to minimize that case.

> --
> viresh
>
Viresh Kumar April 13, 2020, 6:21 a.m. UTC | #18
On 09-04-20, 16:51, Sumit Gupta wrote:
> We are using "read_counters_work" as local variable. So every invocation the
> function will have its own copy of counters for corresponding cpu. That's
> why are doing INIT_WORK_ONSTACK here.

Why? To support parallel calls to reading the freq ?

> > > > > > > > > +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
> > > > > > > > > +     flush_work(&read_counters_work.work);
> > > > > > > > 
> > > > > > > > Why can't this be done in current context ?
> > > > > > > > 
> > > > > > > We used work queue instead of smp_call_function_single() to have long delay.
> > > > > > 
> > > > > > Please explain completely, you have raised more questions than you
> > > > > > answered :)
> > > > > > 
> > > > > > Why do you want to have long delays ?
> > > > > > 
> > > > > Long delay value is used to have the observation window long enough for
> > > > > correctly reconstructing the CPU frequency considering noise.
> > > > > In next patch version, changed delay value to 500us which in our tests is
> > > > > considered reliable.
> > > > 
> > > > I understand that you need to put a udelay() while reading the freq from
> > > > hardware, that is fine, but why do you need a workqueue for that? Why can't you
> > > > just read the values directly from the same context ?
> > > > 
> > > The register to read frequency is per core and not accessible to other
> > > cores. So, we have to execute the function remotely as the target core to
> > > read frequency might be different from current.
> > > The functions for that are smp_call_function_single or queue_work_on.
> > > We used queue_work_on() to avoid long delay inside ipi interrupt context
> > > with interrupts disabled.
> > 
> > Okay, I understand this now, finally :)
> > 
> > But if the interrupts are disabled during some call, won't workqueues face the
> > same problem ?
> > 
> Yes, we are trying to minimize the case.

But how do you know workqueues will perform better than
smp_call_function_single() ? Just asking for clarity on this as normally
everyone tries to do smp_call_function_single().
Sumit Gupta April 13, 2020, 12:20 p.m. UTC | #19
On 13/04/20 11:51 AM, Viresh Kumar wrote:
> On 09-04-20, 16:51, Sumit Gupta wrote:
>> We are using "read_counters_work" as local variable. So every invocation the
>> function will have its own copy of counters for corresponding cpu. That's
>> why are doing INIT_WORK_ONSTACK here.
> 
> Why? To support parallel calls to reading the freq ?
> 
Yes.

>>>>>>>>>> +     queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
>>>>>>>>>> +     flush_work(&read_counters_work.work);
>>>>>>>>>
>>>>>>>>> Why can't this be done in current context ?
>>>>>>>>>
>>>>>>>> We used work queue instead of smp_call_function_single() to have long delay.
>>>>>>>
>>>>>>> Please explain completely, you have raised more questions than you
>>>>>>> answered :)
>>>>>>>
>>>>>>> Why do you want to have long delays ?
>>>>>>>
>>>>>> Long delay value is used to have the observation window long enough for
>>>>>> correctly reconstructing the CPU frequency considering noise.
>>>>>> In next patch version, changed delay value to 500us which in our tests is
>>>>>> considered reliable.
>>>>>
>>>>> I understand that you need to put a udelay() while reading the freq from
>>>>> hardware, that is fine, but why do you need a workqueue for that? Why can't you
>>>>> just read the values directly from the same context ?
>>>>>
>>>> The register to read frequency is per core and not accessible to other
>>>> cores. So, we have to execute the function remotely as the target core to
>>>> read frequency might be different from current.
>>>> The functions for that are smp_call_function_single or queue_work_on.
>>>> We used queue_work_on() to avoid long delay inside ipi interrupt context
>>>> with interrupts disabled.
>>>
>>> Okay, I understand this now, finally :)
>>>
>>> But if the interrupts are disabled during some call, won't workqueues face the
>>> same problem ?
>>>
>> Yes, we are trying to minimize the case.
> 
> But how do you know workqueues will perform better than
> smp_call_function_single() ? Just asking for clarity on this as normally
> everyone tries to do smp_call_function_single().
> 
This was done considering the long delay value, as explained previously.
Do you think that smp_call_function_single() would be better than a
workqueue here?

> --
> viresh
>
Viresh Kumar April 14, 2020, 5:45 a.m. UTC | #20
On 13-04-20, 17:50, Sumit Gupta wrote:
> This was done considering long delay value as explained previously.
> Do you think that smp_call_function_single() would be better than work queue
> here?

Don't work with assumptions; you should test both and see which one
works better. A workqueue should never be faster than
smp_call_function_single(), as per my understanding.
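
One way to compare the two, as a sketch (assuming both paths are wired
up), is to time each call with ktime:

u64 t0, t1;
unsigned int rate;

t0 = ktime_get_ns();
rate = tegra194_get_speed_common(cpu, US_DELAY);	/* workqueue path */
t1 = ktime_get_ns();
pr_info("get_speed(%u) = %u kHz, took %llu ns\n", cpu, rate, t1 - t0);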
Sumit Gupta April 15, 2020, 11:25 a.m. UTC | #21
On 14/04/20 11:15 AM, Viresh Kumar wrote:
> On 13-04-20, 17:50, Sumit Gupta wrote:
>> This was done considering long delay value as explained previously.
>> Do you think that smp_call_function_single() would be better than work queue
>> here?
> 
> Don't work with assumptions, you should test both and see which one
> works better. Workqueue should never be faster than
> smp_call_function_single() with my understanding.
Checked the time taken and it's almost the same in both cases.
Earlier we used smp_call_function_single(), but the delay period was
small in that SOC. In T194, the period is longer. So, this is an
optimization: using a workqueue has the advantage that interrupts will
not be disabled for that period.
If you think the workqueue is not required, then we can remove it. The
functionality works fine in both cases.

> 
> --
> viresh
>
Viresh Kumar April 16, 2020, 3:37 a.m. UTC | #22
On 15-04-20, 16:55, Sumit Gupta wrote:
> 
> 
> On 14/04/20 11:15 AM, Viresh Kumar wrote:
> > On 13-04-20, 17:50, Sumit Gupta wrote:
> > > This was done considering long delay value as explained previously.
> > > Do you think that smp_call_function_single() would be better than work queue
> > > here?
> > 
> > Don't work with assumptions, you should test both and see which one
> > works better. Workqueue should never be faster than
> > smp_call_function_single() with my understanding.
> Checked the time taken and its almost same in both cases.
> Earlier we used smp_call_function_single(), but delay time period was small
> in that SOC. In T194, the time period was more. So, this is an optimization
> done because using work queue has advantage as interrupts will not be
> disabled for that period.

Hmm, okay, keep the workqueue and mention the required details in a
comment for everyone to understand why the implementation is done that
way.
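
One possible shape for such a comment, summarizing the rationale from
this thread (sketch only):

/*
 * Reading the per-core NVFREQ_FEEDBACK_EL1 register requires running
 * on the target CPU and sampling two counters across a ~500 us window.
 * Doing that via smp_call_function_single() would keep interrupts
 * disabled on the target CPU for the whole window, so the read is
 * instead queued to a CPU-bound workqueue, where the udelay() runs in
 * process context with interrupts enabled.
 */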
Sumit Gupta April 16, 2020, 7:06 a.m. UTC | #23
On 16/04/20 9:07 AM, Viresh Kumar wrote:
> On 15-04-20, 16:55, Sumit Gupta wrote:
>>
>>
>> On 14/04/20 11:15 AM, Viresh Kumar wrote:
>>> On 13-04-20, 17:50, Sumit Gupta wrote:
>>>> This was done considering long delay value as explained previously.
>>>> Do you think that smp_call_function_single() would be better than work queue
>>>> here?
>>>
>>> Don't work with assumptions, you should test both and see which one
>>> works better. Workqueue should never be faster than
>>> smp_call_function_single() with my understanding.
>> Checked the time taken and its almost same in both cases.
>> Earlier we used smp_call_function_single(), but delay time period was small
>> in that SOC. In T194, the time period was more. So, this is an optimization
>> done because using work queue has advantage as interrupts will not be
>> disabled for that period.
> 
> Hmm, okay, keep the workqueue and mention the required details in a
> comment for everyone to understand why the implementation is done that
> way.
> 

sure, thank you!

> --
> viresh
>
diff mbox series

Patch

diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index a905796..4bcd47c 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -320,6 +320,12 @@  config ARM_TEGRA186_CPUFREQ
 	help
 	  This adds the CPUFreq driver support for Tegra186 SOCs.
 
+config ARM_TEGRA194_CPUFREQ
+	tristate "Tegra194 CPUFreq support"
+	depends on ARCH_TEGRA && TEGRA_BPMP
+	help
+	  This adds CPU frequency driver support for Tegra194 SOCs.
+
 config ARM_TI_CPUFREQ
 	bool "Texas Instruments CPUFreq support"
 	depends on ARCH_OMAP2PLUS
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 9a9f5cc..433d492 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -85,6 +85,7 @@  obj-$(CONFIG_ARM_TANGO_CPUFREQ)		+= tango-cpufreq.o
 obj-$(CONFIG_ARM_TEGRA20_CPUFREQ)	+= tegra20-cpufreq.o
 obj-$(CONFIG_ARM_TEGRA124_CPUFREQ)	+= tegra124-cpufreq.o
 obj-$(CONFIG_ARM_TEGRA186_CPUFREQ)	+= tegra186-cpufreq.o
+obj-$(CONFIG_ARM_TEGRA194_CPUFREQ)	+= tegra194-cpufreq.o
 obj-$(CONFIG_ARM_TI_CPUFREQ)		+= ti-cpufreq.o
 obj-$(CONFIG_ARM_VEXPRESS_SPC_CPUFREQ)	+= vexpress-spc-cpufreq.o
 
diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
new file mode 100644
index 0000000..9df12f4
--- /dev/null
+++ b/drivers/cpufreq/tegra194-cpufreq.c
@@ -0,0 +1,423 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved
+ */
+
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+
+#include <asm/smp_plat.h>
+
+#include <soc/tegra/bpmp.h>
+#include <soc/tegra/bpmp-abi.h>
+
+#define KHZ                     1000
+#define REF_CLK_MHZ             408 /* 408 MHz */
+#define US_DELAY                2000
+#define US_DELAY_MIN            2
+#define CPUFREQ_TBL_STEP_HZ     (50 * KHZ * KHZ)
+#define MAX_CNT                 ~0U
+
+/* cpufreq transition latency */
+#define TEGRA_CPUFREQ_TRANSITION_LATENCY (300 * 1000) /* unit in nanoseconds */
+
+enum cluster {
+	CLUSTER0,
+	CLUSTER1,
+	CLUSTER2,
+	CLUSTER3,
+	MAX_CLUSTERS,
+};
+
+struct tegra194_cpufreq_data {
+	void __iomem *regs;
+	size_t num_clusters;
+	struct cpufreq_frequency_table **tables;
+};
+
+static DEFINE_MUTEX(cpufreq_lock);
+
+struct tegra_cpu_ctr {
+	u32 cpu;
+	u32 delay;
+	u32 coreclk_cnt, last_coreclk_cnt;
+	u32 refclk_cnt, last_refclk_cnt;
+};
+
+static struct workqueue_struct *read_counters_wq;
+struct read_counters_work {
+	struct work_struct work;
+	struct tegra_cpu_ctr c;
+};
+
+static enum cluster get_cpu_cluster(u8 cpu)
+{
+	return MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);
+}
+
+/*
+ * Read per-core Read-only system register NVFREQ_FEEDBACK_EL1.
+ * The register provides frequency feedback information to
+ * determine the average actual frequency a core has run at over
+ * a period of time.
+ *	[31:0] PLLP counter: Counts at fixed frequency (408 MHz)
+ *	[63:32] Core clock counter: counts on every core clock cycle
+ *			where the core is architecturally clocking
+ */
+static u64 read_freq_feedback(void)
+{
+	u64 val = 0;
+
+	asm volatile("mrs %0, s3_0_c15_c0_5" : "=r" (val) : );
+
+	return val;
+}
+
+u16 map_freq_to_ndiv(struct mrq_cpu_ndiv_limits_response *nltbl, u32 freq)
+{
+	return DIV_ROUND_UP(freq * nltbl->pdiv * nltbl->mdiv,
+			    nltbl->ref_clk_hz / KHZ);
+}
+
+static inline u32 map_ndiv_to_freq(struct mrq_cpu_ndiv_limits_response
+				   *nltbl, u16 ndiv)
+{
+	return nltbl->ref_clk_hz / KHZ * ndiv / (nltbl->pdiv * nltbl->mdiv);
+}
+
+static void tegra_read_counters(struct work_struct *work)
+{
+	struct read_counters_work *read_counters_work;
+	struct tegra_cpu_ctr *c;
+	u64 val;
+
+	/*
+	 * ref_clk_counter(32 bit counter) runs on constant clk,
+	 * pll_p(408MHz).
+	 * It will take = 2 ^ 32 / 408 MHz to overflow ref clk counter
+	 *              = 10526880 usec = 10.527 sec to overflow
+	 *
+	 * Likewise, core_clk_counter(32 bit counter) runs on core clock.
+	 * It's synchronized to crab_clk (cpu_crab_clk) which runs at
+	 * freq of cluster. Assuming max cluster clock ~2000MHz,
+	 * It will take = 2 ^ 32 / 2000 MHz to overflow core clk counter
+	 *              = ~2.147 sec to overflow
+	 */
+	read_counters_work = container_of(work, struct read_counters_work,
+					  work);
+	c = &read_counters_work->c;
+
+	val = read_freq_feedback();
+	c->last_refclk_cnt = lower_32_bits(val);
+	c->last_coreclk_cnt = upper_32_bits(val);
+	udelay(c->delay);
+	val = read_freq_feedback();
+	c->refclk_cnt = lower_32_bits(val);
+	c->coreclk_cnt = upper_32_bits(val);
+}
+
+/*
+ * Return instantaneous cpu speed
+ * Instantaneous freq is calculated as -
+ * - Takes a sample on every query of the freq.
+ *	- Read core and ref clock counters;
+ *	- Delay for X us
+ *	- Read above cycle counters again
+ *	- Calculates freq by subtracting current and previous counters
+ *	  divided by the delay time or eqv. of ref_clk_counter in delta time
+ *	- Return Kcycles/second, freq in KHz
+ *
+ *	delta time period = x sec
+ *			  = delta ref_clk_counter / (408 * 10^6) sec
+ *	freq in Hz = cycles/sec
+ *		   = (delta cycles / x sec
+ *		   = (delta cycles * 408 * 10^6) / delta ref_clk_counter
+ *	in KHz	   = (delta cycles * 408 * 10^3) / delta ref_clk_counter
+ *
+ * @cpu - logical cpu whose freq to be updated
+ * Returns freq in KHz on success, 0 if cpu is offline
+ */
+static unsigned int tegra194_get_speed_common(u32 cpu, u32 delay)
+{
+	struct read_counters_work read_counters_work;
+	struct tegra_cpu_ctr c;
+	u32 delta_refcnt;
+	u32 delta_ccnt;
+	u32 rate_mhz;
+
+	read_counters_work.c.cpu = cpu;
+	read_counters_work.c.delay = delay;
+	INIT_WORK_ONSTACK(&read_counters_work.work, tegra_read_counters);
+	queue_work_on(cpu, read_counters_wq, &read_counters_work.work);
+	flush_work(&read_counters_work.work);
+	c = read_counters_work.c;
+
+	if (c.coreclk_cnt < c.last_coreclk_cnt)
+		delta_ccnt = c.coreclk_cnt + (MAX_CNT - c.last_coreclk_cnt);
+	else
+		delta_ccnt = c.coreclk_cnt - c.last_coreclk_cnt;
+	if (!delta_ccnt)
+		return 0;
+
+	/* ref clock is 32 bits */
+	if (c.refclk_cnt < c.last_refclk_cnt)
+		delta_refcnt = c.refclk_cnt + (MAX_CNT - c.last_refclk_cnt);
+	else
+		delta_refcnt = c.refclk_cnt - c.last_refclk_cnt;
+	if (!delta_refcnt) {
+		pr_debug("cpufreq: %d is idle, delta_refcnt: 0\n", cpu);
+		return 0;
+	}
+	rate_mhz = ((unsigned long)(delta_ccnt * REF_CLK_MHZ)) / delta_refcnt;
+
+	return (rate_mhz * KHZ); /* in KHz */
+}
+
+static unsigned int tegra194_get_speed(u32 cpu)
+{
+	return tegra194_get_speed_common(cpu, US_DELAY);
+}
+
+static unsigned int tegra194_fast_get_speed(u32 cpu)
+{
+	return tegra194_get_speed_common(cpu, US_DELAY_MIN);
+}
+
+static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
+{
+	struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
+	int cluster = get_cpu_cluster(policy->cpu);
+
+	if (cluster >= data->num_clusters)
+		return -EINVAL;
+
+	policy->cur = tegra194_fast_get_speed(policy->cpu); /* boot freq */
+
+	/* set same policy for all cpus */
+	cpumask_copy(policy->cpus, cpu_possible_mask);
+
+	policy->freq_table = data->tables[cluster];
+	policy->cpuinfo.transition_latency = TEGRA_CPUFREQ_TRANSITION_LATENCY;
+
+	return 0;
+}
+
+static void set_cpu_ndiv(void *data)
+{
+	struct cpufreq_frequency_table *tbl = data;
+	u64 ndiv_val = (u64)tbl->driver_data;
+
+	asm volatile("msr s3_0_c15_c0_4, %0" : : "r" (ndiv_val));
+}
+
+static int tegra194_cpufreq_set_target(struct cpufreq_policy *policy,
+				       unsigned int index)
+{
+	struct cpufreq_frequency_table *tbl = policy->freq_table + index;
+	static struct cpufreq_freqs freqs;
+
+	mutex_lock(&cpufreq_lock);
+	freqs.old = policy->cur;
+	freqs.new = tbl->frequency;
+
+	cpufreq_freq_transition_begin(policy, &freqs);
+	on_each_cpu_mask(policy->cpus, set_cpu_ndiv, tbl, true);
+	cpufreq_freq_transition_end(policy, &freqs, 0);
+
+	mutex_unlock(&cpufreq_lock);
+
+	return 0;
+}
+
+static struct cpufreq_driver tegra194_cpufreq_driver = {
+	.name = "tegra194",
+	.flags = CPUFREQ_STICKY | CPUFREQ_CONST_LOOPS |
+		CPUFREQ_NEED_INITIAL_FREQ_CHECK | CPUFREQ_ASYNC_NOTIFICATION,
+	.verify = cpufreq_generic_frequency_table_verify,
+	.target_index = tegra194_cpufreq_set_target,
+	.get = tegra194_get_speed,
+	.init = tegra194_cpufreq_init,
+	.attr = cpufreq_generic_attr,
+};
+
+static void tegra194_cpufreq_free_resources(void)
+{
+	flush_workqueue(read_counters_wq);
+	destroy_workqueue(read_counters_wq);
+}
+
+static struct cpufreq_frequency_table *init_freq_table
+		(struct platform_device *pdev, struct tegra_bpmp *bpmp,
+		 unsigned int cluster_id)
+{
+	struct cpufreq_frequency_table *opp_table;
+	struct mrq_cpu_ndiv_limits_response resp;
+	unsigned int num_freqs, ndiv, delta_ndiv;
+	struct mrq_cpu_ndiv_limits_request req;
+	struct tegra_bpmp_message msg;
+	u16 freq_table_step_size;
+	int err, index;
+
+	memset(&req, 0, sizeof(req));
+	req.cluster_id = cluster_id;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.mrq = MRQ_CPU_NDIV_LIMITS;
+	msg.tx.data = &req;
+	msg.tx.size = sizeof(req);
+	msg.rx.data = &resp;
+	msg.rx.size = sizeof(resp);
+
+	err = tegra_bpmp_transfer(bpmp, &msg);
+	if (err)
+		return ERR_PTR(err);
+
+	/*
+	 * Make sure frequency table step is a multiple of mdiv to match
+	 * vhint table granularity.
+	 */
+	freq_table_step_size = resp.mdiv *
+			DIV_ROUND_UP(CPUFREQ_TBL_STEP_HZ, resp.ref_clk_hz);
+
+	dev_dbg(&pdev->dev, "cluster %d: frequency table step size: %d\n",
+		cluster_id, freq_table_step_size);
+
+	delta_ndiv = resp.ndiv_max - resp.ndiv_min;
+
+	if (unlikely(delta_ndiv == 0))
+		num_freqs = 1;
+	else
+		/* We store both ndiv_min and ndiv_max hence the +1 */
+		num_freqs = delta_ndiv / freq_table_step_size + 1;
+
+	num_freqs += (delta_ndiv % freq_table_step_size) ? 1 : 0;
+
+	opp_table = devm_kcalloc(&pdev->dev, num_freqs + 1, sizeof(*opp_table),
+				 GFP_KERNEL);
+	if (!opp_table)
+		return ERR_PTR(-ENOMEM);
+
+	for (index = 0, ndiv = resp.ndiv_min;
+			ndiv < resp.ndiv_max;
+			index++, ndiv += freq_table_step_size) {
+		opp_table[index].driver_data = ndiv;
+		opp_table[index].frequency = map_ndiv_to_freq(&resp, ndiv);
+	}
+
+	opp_table[index].driver_data = resp.ndiv_max;
+	opp_table[index++].frequency = map_ndiv_to_freq(&resp, resp.ndiv_max);
+	opp_table[index].frequency = CPUFREQ_TABLE_END;
+
+	return opp_table;
+}
+
+static int tegra194_cpufreq_probe(struct platform_device *pdev)
+{
+	struct tegra194_cpufreq_data *data;
+	struct tegra_bpmp *bpmp;
+	int err, i;
+
+	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->num_clusters = MAX_CLUSTERS;
+	data->tables = devm_kcalloc(&pdev->dev, data->num_clusters,
+				    sizeof(*data->tables), GFP_KERNEL);
+	if (!data->tables)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, data);
+
+	read_counters_wq = alloc_workqueue("read_counters_wq", __WQ_LEGACY, 1);
+	if (!read_counters_wq) {
+		dev_err(&pdev->dev, "fail to create_workqueue\n");
+		return -EINVAL;
+	}
+
+	bpmp = of_tegra_bpmp_get();
+	if (IS_ERR(bpmp)) {
+		err = PTR_ERR(bpmp);
+		goto err_free_res;
+	}
+
+	for (i = 0; i < data->num_clusters; i++) {
+		data->tables[i] = init_freq_table(pdev, bpmp, i);
+		if (IS_ERR(data->tables[i])) {
+			err = PTR_ERR(data->tables[i]);
+			goto put_bpmp;
+		}
+	}
+
+	tegra_bpmp_put(bpmp);
+
+	tegra194_cpufreq_driver.driver_data = data;
+
+	err = cpufreq_register_driver(&tegra194_cpufreq_driver);
+	if (err)
+		goto err_free_res;
+
+	return err;
+
+put_bpmp:
+	tegra_bpmp_put(bpmp);
+err_free_res:
+	tegra194_cpufreq_free_resources();
+	return err;
+}
+
+static int tegra194_cpufreq_remove(struct platform_device *pdev)
+{
+	cpufreq_unregister_driver(&tegra194_cpufreq_driver);
+	tegra194_cpufreq_free_resources();
+
+	return 0;
+}
+
+static struct platform_driver tegra194_cpufreq_platform_driver = {
+	.driver = {
+		.name = "tegra194-cpufreq",
+	},
+	.probe = tegra194_cpufreq_probe,
+	.remove = tegra194_cpufreq_remove,
+};
+
+static int __init tegra_cpufreq_init(void)
+{
+	struct platform_device *pdev;
+	int ret;
+
+	if (!of_machine_is_compatible("nvidia,tegra194"))
+		return -ENODEV;
+
+	ret = platform_driver_register(&tegra194_cpufreq_platform_driver);
+	if (ret)
+		return ret;
+
+	pdev = platform_device_register_simple("tegra194-cpufreq", -1, NULL, 0);
+	if (IS_ERR(pdev)) {
+		platform_driver_unregister(&tegra194_cpufreq_platform_driver);
+		return PTR_ERR(pdev);
+	}
+
+	return 0;
+}
+module_init(tegra_cpufreq_init);
+
+static void __exit tegra_cpufreq_exit(void)
+{
+	platform_driver_unregister(&tegra194_cpufreq_platform_driver);
+}
+module_exit(tegra_cpufreq_exit);
+
+MODULE_AUTHOR("Mikko Perttunen <mperttunen@nvidia.com>");
+MODULE_AUTHOR("Sumit Gupta <sumitg@nvidia.com>");
+MODULE_DESCRIPTION("NVIDIA Tegra194 cpufreq driver");
+MODULE_LICENSE("GPL v2");