diff mbox series

[v1,03/10] ARM: tegra: acer-a500: Bump thermal trips by 10C

Message ID 20210510202600.12156-4-digetx@gmail.com
State Accepted
Series NVIDIA Tegra ARM32 device-tree improvements for 5.14

Commit Message

Dmitry Osipenko May 10, 2021, 8:25 p.m. UTC
It's possible to hit the thermal zone's temperature limit in a very warm
environment under a constant load, such as watching a video with software
decoding. It's even easier to hit the limit with a slightly overclocked
CPU. Bump the temperature limits by 10C in order to improve the user
experience. The Acer A500 has a large board and a 10" display panel, both
of which help dissipate heat, and the SoC is placed far away from the
battery, hence we can safely bump the temperature limits.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 arch/arm/boot/dts/tegra20-acer-a500-picasso.dts | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Michał Mirosław May 14, 2021, 9:16 p.m. UTC | #1
On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
> It's possible to hit the temperature of the thermal zone in a very warm
> environment under a constant load, like watching a video using software
> decoding. It's even easier to hit the limit with a slightly overclocked
> CPU. Bump the temperature limit by 10C in order to improve user
> experience. Acer A500 has a large board and 10" display panel which are
> used for the heat dissipation, the SoC is placed far away from battery,
> hence we can safely bump the temperature limit.

60°C looks like a touch-safety limit (to avoid burns for users). Did you
verify the touchable parts' temperature somehow after the change?

Best Regards
Michał Mirosław
Dmitry Osipenko May 14, 2021, 10:17 p.m. UTC | #2
15.05.2021 00:16, Michał Mirosław wrote:
> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>> It's possible to hit the temperature of the thermal zone in a very warm
>> environment under a constant load, like watching a video using software
>> decoding. It's even easier to hit the limit with a slightly overclocked
>> CPU. Bump the temperature limit by 10C in order to improve user
>> experience. Acer A500 has a large board and 10" display panel which are
>> used for the heat dissipation, the SoC is placed far away from battery,
>> hence we can safely bump the temperature limit.
> 
> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
> verify the touchable parts' temperature somehow after the change?

The SoC is placed under a can. Both the front and back of the device are
large metal planes which dissipate heat efficiently. I don't recall the A500
ever getting hot, and I hold it in my hands every day. From a user's
perspective, a part of the device may feel slightly warm in the
worst case.
Daniel Lezcano June 11, 2021, 9:52 a.m. UTC | #3
On 14/05/2021 23:16, Michał Mirosław wrote:
> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>> It's possible to hit the temperature of the thermal zone in a very warm
>> environment under a constant load, like watching a video using software
>> decoding. It's even easier to hit the limit with a slightly overclocked
>> CPU. Bump the temperature limit by 10C in order to improve user
>> experience. Acer A500 has a large board and 10" display panel which are
>> used for the heat dissipation, the SoC is placed far away from battery,
>> hence we can safely bump the temperature limit.
> 
> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
> verify the touchable parts' temperature somehow after the change?

The skin temperature and the CPU/GPU etc ... temperatures are different
things.

For the embedded system there is the dissipation system and a
temperature sensor on it which is the skin temp. This temperature is the
result of the heat of all the thermal zones on the board and must be
below 45°C. The temperature slowly changes.

On the CPU, the temperature changes can be very fast and you have to
take care of keeping it below the max temperature specified in the TRM
by using different techniques (freq changes, idle injection, ...) but
the temperature can be 75°C, 85°C or whatever the manual says.

50°C and 60°C are low temperatures for a CPU and that will inevitably
impact performance, so setting the trip temperature close to the max
temperature is what will allow maximum performance.

What matters is the skin temperature.

The skin temperature must be monitored by other techniques, e.g. using
the TDP of the system and throttling the different devices to keep them
within this power budget. That is the role of a thermal daemon.
Dmitry Osipenko June 12, 2021, 10:40 a.m. UTC | #4
11.06.2021 12:52, Daniel Lezcano wrote:
> On 14/05/2021 23:16, Michał Mirosław wrote:
>> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>>> It's possible to hit the temperature of the thermal zone in a very warm
>>> environment under a constant load, like watching a video using software
>>> decoding. It's even easier to hit the limit with a slightly overclocked
>>> CPU. Bump the temperature limit by 10C in order to improve user
>>> experience. Acer A500 has a large board and 10" display panel which are
>>> used for the heat dissipation, the SoC is placed far away from battery,
>>> hence we can safely bump the temperature limit.
>>
>> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
>> verify the touchable parts' temperature somehow after the change?
> 
> The skin temperature and the CPU/GPU etc ... temperatures are different
> things.
> 
> For the embedded system there is the dissipation system and a
> temperature sensor on it which is the skin temp. This temperature is the
> result of the heat of all the thermal zones on the board and must be
> below 45°C. The temperature slowly changes.
> 
> On the CPU, the temperature changes can be very fast and you have to
> take care of keeping it below the max temperature specified in the TRM
> by using different techniques (freq changes, idle injection, ...) but
> the temperature can be 75°C, 85°C or whatever the manual says.
> 
> 50°C and 60°C are low temperature for a CPU and that will inevitably
> impact the performances, so setting the temperature close the max
> temperature is what will allow max performances.
> 
> What matters is the skin temperature.
> 
> The skin temperature must be monitored by other techniques, eg. using
> the TDP of the system and throttle the different devices to keep them in
> this power budget. That is the role of an thermal daemon.

Thank you for the clarification. Indeed, I wasn't sure how to make use
of the skin temperature properly.

The skin temperature varies a lot depending on the thermal capabilities
of a particular device. It's about 15C below CPU core at a full load on
A500, while it's 2C below CPU core on Nexus 7. But this is expected
since Nexus 7 can't dissipate heat efficiently.

I will revisit the DT thermal zones again for the next kernel release.
Daniel Lezcano June 12, 2021, 2:24 p.m. UTC | #5
On 12/06/2021 12:40, Dmitry Osipenko wrote:
> 11.06.2021 12:52, Daniel Lezcano wrote:
>> On 14/05/2021 23:16, Michał Mirosław wrote:
>>> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>>>> It's possible to hit the temperature of the thermal zone in a very warm
>>>> environment under a constant load, like watching a video using software
>>>> decoding. It's even easier to hit the limit with a slightly overclocked
>>>> CPU. Bump the temperature limit by 10C in order to improve user
>>>> experience. Acer A500 has a large board and 10" display panel which are
>>>> used for the heat dissipation, the SoC is placed far away from battery,
>>>> hence we can safely bump the temperature limit.
>>>
>>> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
>>> verify the touchable parts' temperature somehow after the change?
>>
>> The skin temperature and the CPU/GPU etc ... temperatures are different
>> things.
>>
>> For the embedded system there is the dissipation system and a
>> temperature sensor on it which is the skin temp. This temperature is the
>> result of the heat of all the thermal zones on the board and must be
>> below 45°C. The temperature slowly changes.
>>
>> On the CPU, the temperature changes can be very fast and you have to
>> take care of keeping it below the max temperature specified in the TRM
>> by using different techniques (freq changes, idle injection, ...) but
>> the temperature can be 75°C, 85°C or whatever the manual says.
>>
>> 50°C and 60°C are low temperature for a CPU and that will inevitably
>> impact the performances, so setting the temperature close the max
>> temperature is what will allow max performances.
>>
>> What matters is the skin temperature.
>>
>> The skin temperature must be monitored by other techniques, eg. using
>> the TDP of the system and throttle the different devices to keep them in
>> this power budget. That is the role of an thermal daemon.
> 
> Thank you for the clarification. Indeed, I wasn't sure how to make use
> of the skin temperature properly.
> 
> The skin temperature varies a lot depending on the thermal capabilities
> of a particular device. It's about 15C below CPU core at a full load on
> A500, while it's 2C below CPU core on Nexus 7. But this is expected
> since Nexus 7 can't dissipate heat efficiently.
Yeah, but it can not be directly related to the CPU because if the GPU
is intensively used and the battery is charging at the same time, the
skin temp will increase anyway.

You should set the trip points close to the functioning boundary
temperature given in the hardware specification whatever the resulting
heating effect is on the device.

The thermal zone is there to protect the silicon and the system from a
wild reboot.

If the Nexus 7 is too hot after the changes, then you may act on the
sources of the heat. For instance, mark the highest OPP as turbo or
remove it, or, if there is one, change the thermal daemon to reduce the
overall power consumption.

In case you are interested: https://lwn.net/Articles/839318/

Hope that helps

  -- Daniel
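
To illustrate the OPP suggestion above, here is a minimal device-tree
sketch, assuming an operating-points-v2 table. The node label, the
frequencies and the voltages are made up for illustration and are not taken
from any Tegra board file; only the `turbo-mode` property itself comes from
the generic OPP binding.

```dts
/* Hypothetical OPP table: label, frequencies and voltages are illustrative. */
cpu0_opp_table: opp-table {
	compatible = "operating-points-v2";
	opp-shared;

	opp-760000000 {
		opp-hz = /bits/ 64 <760000000>;
		opp-microvolt = <900000>;
	};

	opp-1000000000 {
		opp-hz = /bits/ 64 <1000000000>;
		opp-microvolt = <1000000>;
	};

	opp-1200000000 {
		opp-hz = /bits/ 64 <1200000000>;
		opp-microvolt = <1100000>;
		/* Mark the highest OPP as turbo instead of deleting it. */
		turbo-mode;
	};
};
```

With cpufreq, an OPP marked this way is expected to be treated as a boost
frequency, so it should stay out of normal frequency selection unless boost
is enabled. Deleting the node removes the frequency entirely.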
Dmitry Osipenko June 13, 2021, 12:25 a.m. UTC | #6
12.06.2021 17:24, Daniel Lezcano wrote:
> On 12/06/2021 12:40, Dmitry Osipenko wrote:
>> 11.06.2021 12:52, Daniel Lezcano wrote:
>>> On 14/05/2021 23:16, Michał Mirosław wrote:
>>>> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>>>>> It's possible to hit the temperature of the thermal zone in a very warm
>>>>> environment under a constant load, like watching a video using software
>>>>> decoding. It's even easier to hit the limit with a slightly overclocked
>>>>> CPU. Bump the temperature limit by 10C in order to improve user
>>>>> experience. Acer A500 has a large board and 10" display panel which are
>>>>> used for the heat dissipation, the SoC is placed far away from battery,
>>>>> hence we can safely bump the temperature limit.
>>>>
>>>> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
>>>> verify the touchable parts' temperature somehow after the change?
>>>
>>> The skin temperature and the CPU/GPU etc ... temperatures are different
>>> things.
>>>
>>> For the embedded system there is the dissipation system and a
>>> temperature sensor on it which is the skin temp. This temperature is the
>>> result of the heat of all the thermal zones on the board and must be
>>> below 45°C. The temperature slowly changes.
>>>
>>> On the CPU, the temperature changes can be very fast and you have to
>>> take care of keeping it below the max temperature specified in the TRM
>>> by using different techniques (freq changes, idle injection, ...) but
>>> the temperature can be 75°C, 85°C or whatever the manual says.
>>>
>>> 50°C and 60°C are low temperature for a CPU and that will inevitably
>>> impact the performances, so setting the temperature close the max
>>> temperature is what will allow max performances.
>>>
>>> What matters is the skin temperature.
>>>
>>> The skin temperature must be monitored by other techniques, eg. using
>>> the TDP of the system and throttle the different devices to keep them in
>>> this power budget. That is the role of an thermal daemon.
>>
>> Thank you for the clarification. Indeed, I wasn't sure how to make use
>> of the skin temperature properly.
>>
>> The skin temperature varies a lot depending on the thermal capabilities
>> of a particular device. It's about 15C below CPU core at a full load on
>> A500, while it's 2C below CPU core on Nexus 7. But this is expected
>> since Nexus 7 can't dissipate heat efficiently.
> Yeah, but it can not be directly related to the CPU because if the GPU
> is intensively used and the battery is charging at the same time, the
> skin temp will increase anyway.

Sure, we just added the memory devfreq throttling as a cooling device to
Nexus 7 and Ouya DTs in addition to the CPU throttling.

The GPU and other h/w units are on the pending list. For starters, we
need to add GENPD and runtime PM support to all drivers, which solves
the overheating problem of idling systems. The Tegra30 Ouya game
console gets hot during idle without runtime PM support.
Afterwards we can add devfreq support to improve the active cooling.
I'm already working on it.

> You should set the trip points close to the functioning boundary
> temperature given in the hardware specification whatever the resulting
> heating effect is on the device.
> 
> The thermal zone is there to protect the silicon and the system from a
> wild reboot.
> 
> If the Nexus 7 is too hot after the changes, then you may act on the
> sources of the heat. For instance, set the the highest OPP to turbo or
> remove it, or, if there is one, change the thermal daemon to reduce the
> overall power consumption.
> In case you are interested in: https://lwn.net/Articles/839318/

The DTPM is a very interesting approach. For now Tegra still misses some
basics in the mainline kernel which have a higher priority, so I think it
should be good enough to perform in-kernel thermal management for
starters. We may consider a more complex solution later on if it
becomes necessary.

What I'm currently thinking to do is:

1. Set up the trips of SoC/CPU core thermal zones in accordance with the
silicon limits.

2. Set up the skin trips in accordance with the device limits.

The breached skin trips will cause a mild throttling, while the SoC/CPU
trips will be allowed to cause the severe throttling. Does this sound
good to you?

> Hope that helps

Helps a lot, thank you very much.
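
The two-step plan above could look roughly like the following device-tree
sketch. Everything in it is an assumption made for illustration: the sensor
labels (&cpu_sensor, &skin_sensor), the &cpu0 cooling device, the 85C/95C
CPU trips (placeholders for whatever the SoC manual specifies), the 45C skin
trip borrowed from the figure Daniel gave earlier in the thread, and the cap
at cooling state 2 standing in for "mild" throttling.

```dts
#include <dt-bindings/thermal/thermal.h>

/* Sketch only: labels, temperatures and cooling states are placeholders. */
thermal-zones {
	cpu-thermal {
		polling-delay-passive = <500>;	/* ms */
		polling-delay = <1500>;		/* ms */
		thermal-sensors = <&cpu_sensor>;	/* hypothetical label */

		trips {
			cpu_alert: cpu-alert {
				/* placeholder, take the real limit from the SoC manual */
				temperature = <85000>;	/* millidegrees Celsius */
				hysteresis = <2000>;
				type = "passive";
			};

			cpu_crit: cpu-crit {
				/* placeholder, take the real limit from the SoC manual */
				temperature = <95000>;
				hysteresis = <2000>;
				type = "critical";
			};
		};

		cooling-maps {
			map0 {
				trip = <&cpu_alert>;
				/* severe throttling: the whole cooling range is allowed */
				cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
			};
		};
	};

	skin-thermal {
		polling-delay-passive = <1000>;	/* ms */
		polling-delay = <5000>;		/* ms */
		thermal-sensors = <&skin_sensor>;	/* hypothetical label */

		trips {
			skin_alert: skin-alert {
				temperature = <45000>;	/* device limit, not a silicon limit */
				hysteresis = <2000>;
				type = "passive";
			};
		};

		cooling-maps {
			map0 {
				trip = <&skin_alert>;
				/* mild throttling: cap at cooling state 2 (arbitrary choice) */
				cooling-device = <&cpu0 THERMAL_NO_LIMIT 2>;
			};
		};
	};
};
```

The difference between "mild" and "severe" here is only the upper bound of
the cooling-device range in each map. How many cooling states the CPU
actually exposes depends on its OPP table.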
Daniel Lezcano June 13, 2021, 6:19 p.m. UTC | #7
On 13/06/2021 02:25, Dmitry Osipenko wrote:

[ ... ]

>> You should set the trip points close to the functioning boundary
>> temperature given in the hardware specification whatever the resulting
>> heating effect is on the device.
>>
>> The thermal zone is there to protect the silicon and the system from a
>> wild reboot.
>>
>> If the Nexus 7 is too hot after the changes, then you may act on the
>> sources of the heat. For instance, set the the highest OPP to turbo or
>> remove it, or, if there is one, change the thermal daemon to reduce the
>> overall power consumption.
>> In case you are interested in: https://lwn.net/Articles/839318/
> 
> The DTPM is a very interesting approach. For now Tegra still misses some
> basics in mainline kernel which have a higher priority, so I think it
> should be good enough to perform the in-kernel thermal management for
> the starter. We may consider a more complex solutions later on if will
> be necessary.
> 
> What I'm currently thinking to do is:
> 
> 1. Set up the trips of SoC/CPU core thermal zones in accordance to the
> silicon limits.
> 
> 2. Set up the skin trips in accordance to the device limits.
> 
> The breached skin trips will cause a mild throttling, while the SoC/CPU
> trips will be allowed to cause the severe throttling. Does this sound
> good to you?

The skin temperature must be managed from userspace. The kernel is
unable to do smart thermal management across different thermal zones,
but if the goal is to move forward and temporarily prevent the tablet
from getting hot until the other hardware support is there, I think it is
acceptable.
Dmitry Osipenko June 15, 2021, 12:53 p.m. UTC | #8
13.06.2021 21:19, Daniel Lezcano wrote:
> On 13/06/2021 02:25, Dmitry Osipenko wrote:
> 
> [ ... ]
> 
>>> You should set the trip points close to the functioning boundary
>>> temperature given in the hardware specification whatever the resulting
>>> heating effect is on the device.
>>>
>>> The thermal zone is there to protect the silicon and the system from a
>>> wild reboot.
>>>
>>> If the Nexus 7 is too hot after the changes, then you may act on the
>>> sources of the heat. For instance, set the the highest OPP to turbo or
>>> remove it, or, if there is one, change the thermal daemon to reduce the
>>> overall power consumption.
>>> In case you are interested in: https://lwn.net/Articles/839318/
>>
>> The DTPM is a very interesting approach. For now Tegra still misses some
>> basics in mainline kernel which have a higher priority, so I think it
>> should be good enough to perform the in-kernel thermal management for
>> the starter. We may consider a more complex solutions later on if will
>> be necessary.
>>
>> What I'm currently thinking to do is:
>>
>> 1. Set up the trips of SoC/CPU core thermal zones in accordance to the
>> silicon limits.
>>
>> 2. Set up the skin trips in accordance to the device limits.
>>
>> The breached skin trips will cause a mild throttling, while the SoC/CPU
>> trips will be allowed to cause the severe throttling. Does this sound
>> good to you?
> 
> The skin temperature must be managed from userspace. The kernel is
> unable to do a smart thermal management given different thermal zones
> but if the goal is to go forward and prevent the tablet to be hot
> temporarily until the other hardware support is there, I think it is
> acceptable.

The current goal is to get the maximum out of what we already have, thank you.
Daniel Lezcano June 15, 2021, 1:05 p.m. UTC | #9
On 15/06/2021 14:53, Dmitry Osipenko wrote:
> 13.06.2021 21:19, Daniel Lezcano wrote:
>> On 13/06/2021 02:25, Dmitry Osipenko wrote:
>>
>> [ ... ]
>>
>>>> You should set the trip points close to the functioning boundary
>>>> temperature given in the hardware specification whatever the resulting
>>>> heating effect is on the device.
>>>>
>>>> The thermal zone is there to protect the silicon and the system from a
>>>> wild reboot.
>>>>
>>>> If the Nexus 7 is too hot after the changes, then you may act on the
>>>> sources of the heat. For instance, set the the highest OPP to turbo or
>>>> remove it, or, if there is one, change the thermal daemon to reduce the
>>>> overall power consumption.
>>>> In case you are interested in: https://lwn.net/Articles/839318/
>>>
>>> The DTPM is a very interesting approach. For now Tegra still misses some
>>> basics in mainline kernel which have a higher priority, so I think it
>>> should be good enough to perform the in-kernel thermal management for
>>> the starter. We may consider a more complex solutions later on if will
>>> be necessary.
>>>
>>> What I'm currently thinking to do is:
>>>
>>> 1. Set up the trips of SoC/CPU core thermal zones in accordance to the
>>> silicon limits.
>>>
>>> 2. Set up the skin trips in accordance to the device limits.
>>>
>>> The breached skin trips will cause a mild throttling, while the SoC/CPU
>>> trips will be allowed to cause the severe throttling. Does this sound
>>> good to you?
>>
>> The skin temperature must be managed from userspace. The kernel is
>> unable to do a smart thermal management given different thermal zones
>> but if the goal is to go forward and prevent the tablet to be hot
>> temporarily until the other hardware support is there, I think it is
>> acceptable.
> 
> The current goal is to get maximum from what we already have, thank you.

Maximum performance or maximum mitigation?
Dmitry Osipenko June 15, 2021, 1:26 p.m. UTC | #10
15.06.2021 16:05, Daniel Lezcano wrote:
> On 15/06/2021 14:53, Dmitry Osipenko wrote:
>> 13.06.2021 21:19, Daniel Lezcano wrote:
>>> On 13/06/2021 02:25, Dmitry Osipenko wrote:
>>>
>>> [ ... ]
>>>
>>>>> You should set the trip points close to the functioning boundary
>>>>> temperature given in the hardware specification whatever the resulting
>>>>> heating effect is on the device.
>>>>>
>>>>> The thermal zone is there to protect the silicon and the system from a
>>>>> wild reboot.
>>>>>
>>>>> If the Nexus 7 is too hot after the changes, then you may act on the
>>>>> sources of the heat. For instance, set the the highest OPP to turbo or
>>>>> remove it, or, if there is one, change the thermal daemon to reduce the
>>>>> overall power consumption.
>>>>> In case you are interested in: https://lwn.net/Articles/839318/
>>>>
>>>> The DTPM is a very interesting approach. For now Tegra still misses some
>>>> basics in mainline kernel which have a higher priority, so I think it
>>>> should be good enough to perform the in-kernel thermal management for
>>>> the starter. We may consider a more complex solutions later on if will
>>>> be necessary.
>>>>
>>>> What I'm currently thinking to do is:
>>>>
>>>> 1. Set up the trips of SoC/CPU core thermal zones in accordance to the
>>>> silicon limits.
>>>>
>>>> 2. Set up the skin trips in accordance to the device limits.
>>>>
>>>> The breached skin trips will cause a mild throttling, while the SoC/CPU
>>>> trips will be allowed to cause the severe throttling. Does this sound
>>>> good to you?
>>>
>>> The skin temperature must be managed from userspace. The kernel is
>>> unable to do a smart thermal management given different thermal zones
>>> but if the goal is to go forward and prevent the tablet to be hot
>>> temporarily until the other hardware support is there, I think it is
>>> acceptable.
>>
>> The current goal is to get maximum from what we already have, thank you.
> 
> maximum of performance or maximum of mitigation ?

The best balance of both. Maximum performance + no risk of damaging
hardware + pleasant body temperature from a user perspective.

Patch

diff --git a/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts b/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts
index eff9bfb2d442..15b7965599ee 100644
--- a/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts
+++ b/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts
@@ -1059,15 +1059,15 @@  cpu-thermal {
 
 			trips {
 				trip0: cpu-alert0 {
-					/* start throttling at 50C */
-					temperature = <50000>;
+					/* start throttling at 60C */
+					temperature = <60000>;
 					hysteresis = <200>;
 					type = "passive";
 				};
 
 				trip1: cpu-crit {
-					/* shut down at 60C */
-					temperature = <60000>;
+					/* shut down at 70C */
+					temperature = <70000>;
 					hysteresis = <2000>;
 					type = "critical";
 				};