diff mbox

[RFC,2/8] Documentation: arm: define DT cpu capacity bindings

Message ID 1448288921-30307-3-git-send-email-juri.lelli@arm.com
State Changes Requested, archived
Headers show

Commit Message

Juri Lelli Nov. 23, 2015, 2:28 p.m. UTC
ARM systems may be configured to have cpus with different power/performance
characteristics within the same chip. In this case, additional information
has to be made available to the kernel (the scheduler in particular) for it
to be aware of such differences and take decisions accordingly.

Therefore, this patch aims at standardizing cpu capacities device tree
bindings for ARM platforms. Bindings define cpu capacity parameter, to
allow operating systems to retrieve such information from the device tree
and initialize related kernel structures, paving the way for common code in
the kernel to deal with heterogeneity.

Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
Cc: Kumar Gala <galak@codeaurora.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: devicetree@vger.kernel.org
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 .../devicetree/bindings/arm/cpu-capacity.txt       | 227 +++++++++++++++++++++
 Documentation/devicetree/bindings/arm/cpus.txt     |  17 ++
 2 files changed, 244 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt

Comments

Rob Herring Nov. 24, 2015, 2:06 a.m. UTC | #1
On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> ARM systems may be configured to have cpus with different power/performance
> characteristics within the same chip. In this case, additional information
> has to be made available to the kernel (the scheduler in particular) for it
> to be aware of such differences and take decisions accordingly.
> 
> Therefore, this patch aims at standardizing cpu capacities device tree
> bindings for ARM platforms. Bindings define cpu capacity parameter, to
> allow operating systems to retrieve such information from the device tree
> and initialize related kernel structures, paving the way for common code in
> the kernel to deal with heterogeneity.
> 
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Pawel Moll <pawel.moll@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
> Cc: Kumar Gala <galak@codeaurora.org>
> Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
> Cc: Olof Johansson <olof@lixom.net>
> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
> Cc: Paul Walmsley <paul@pwsan.com>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Chen-Yu Tsai <wens@csie.org>
> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
> Cc: devicetree@vger.kernel.org
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> ---
>  .../devicetree/bindings/arm/cpu-capacity.txt       | 227 +++++++++++++++++++++
>  Documentation/devicetree/bindings/arm/cpus.txt     |  17 ++
>  2 files changed, 244 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt
> 
> diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> new file mode 100644
> index 0000000..2a00af0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> @@ -0,0 +1,227 @@
> +==========================================
> +ARM CPUs capacity bindings
> +==========================================
> +
> +==========================================
> +1 - Introduction
> +==========================================
> +
> +ARM systems may be configured to have cpus with different power/performance
> +characteristics within the same chip. In this case, additional information
> +has to be made available to the kernel (the scheduler in particular) for
> +it to be aware of such differences and take decisions accordingly.
> +
> +==========================================
> +2 - CPU capacity definition
> +==========================================
> +
> +CPU capacity is a number that provides the scheduler information about CPUs
> +heterogeneity. Such heterogeneity can come from micro-architectural differences
> +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> +context is about differing performance characteristics; this binding tries to
> +capture a first-order approximation of the relative performance of CPUs.
> +
> +One simple way to estimate CPU capacities is to iteratively run a well-known
> +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> +maximum frequency and then normalize values w.r.t.  the best performing CPU.
> +One can also do a statistically significant study of a wide collection of
> +benchmarks, but pros of such an approach are not really evident at the time of
> +writing.
> +
> +==========================================
> +3 - capacity-scale
> +==========================================
> +
> +CPUs capacities are defined with respect to capacity-scale property in the cpus
> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> +assumed. This property defines both the highest CPU capacity present in the
> +system and granularity of CPU capacity values.

I don't really see the point of this vs. having an absolute scale.

> +
> +==========================================
> +4 - capacity
> +==========================================
> +
> +capacity is an optional cpu node [1] property: u32 value representing CPU
> +capacity, relative to capacity-scale. It is required and enforced that capacity
> +<= capacity-scale.

I think you need something absolute and probably per MHz (like 
dynamic-power-coefficient property). Perhaps the IPC (instructions per 
clock) value?

In other words, I want to see these numbers have a defined method 
of determining them and don't want to see random values from every 
vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
value of 1024 is not what I want to see. Of course things like cache 
sizes can vary the performance, but is a baseline value good enough? 

However, no vendor will want to publish their values if these are 
absolute values relative to other vendors.

If you expect these to need frequent tuning, then don't put them in DT.

Rob

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Nov. 24, 2015, 10:54 a.m. UTC | #2
Hi,

On 23/11/15 20:06, Rob Herring wrote:
> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> > ARM systems may be configured to have cpus with different power/performance
> > characteristics within the same chip. In this case, additional information
> > has to be made available to the kernel (the scheduler in particular) for it
> > to be aware of such differences and take decisions accordingly.
> > 
> > Therefore, this patch aims at standardizing cpu capacities device tree
> > bindings for ARM platforms. Bindings define cpu capacity parameter, to
> > allow operating systems to retrieve such information from the device tree
> > and initialize related kernel structures, paving the way for common code in
> > the kernel to deal with heterogeneity.
> > 
> > Cc: Rob Herring <robh+dt@kernel.org>
> > Cc: Pawel Moll <pawel.moll@arm.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
> > Cc: Kumar Gala <galak@codeaurora.org>
> > Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
> > Cc: Olof Johansson <olof@lixom.net>
> > Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
> > Cc: Paul Walmsley <paul@pwsan.com>
> > Cc: Linus Walleij <linus.walleij@linaro.org>
> > Cc: Chen-Yu Tsai <wens@csie.org>
> > Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
> > Cc: devicetree@vger.kernel.org
> > Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> > ---
> >  .../devicetree/bindings/arm/cpu-capacity.txt       | 227 +++++++++++++++++++++
> >  Documentation/devicetree/bindings/arm/cpus.txt     |  17 ++
> >  2 files changed, 244 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> > new file mode 100644
> > index 0000000..2a00af0
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
> > @@ -0,0 +1,227 @@
> > +==========================================
> > +ARM CPUs capacity bindings
> > +==========================================
> > +
> > +==========================================
> > +1 - Introduction
> > +==========================================
> > +
> > +ARM systems may be configured to have cpus with different power/performance
> > +characteristics within the same chip. In this case, additional information
> > +has to be made available to the kernel (the scheduler in particular) for
> > +it to be aware of such differences and take decisions accordingly.
> > +
> > +==========================================
> > +2 - CPU capacity definition
> > +==========================================
> > +
> > +CPU capacity is a number that provides the scheduler information about CPUs
> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> > +context is about differing performance characteristics; this binding tries to
> > +capture a first-order approximation of the relative performance of CPUs.
> > +
> > +One simple way to estimate CPU capacities is to iteratively run a well-known
> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> > +maximum frequency and then normalize values w.r.t.  the best performing CPU.
> > +One can also do a statistically significant study of a wide collection of
> > +benchmarks, but pros of such an approach are not really evident at the time of
> > +writing.
> > +
> > +==========================================
> > +3 - capacity-scale
> > +==========================================
> > +
> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> > +assumed. This property defines both the highest CPU capacity present in the
> > +system and granularity of CPU capacity values.
> 
> I don't really see the point of this vs. having an absolute scale.
> 

IMHO, we need this for several reasons, one being to address one of your
concerns below: vendors are free to choose their scale without being
forced to publish absolute data. Another reason is that it might make
life easier in certain cases; for example, someone could implement a
system with a few clusters of, say, A57s, but some run at half the clock
of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
this case I think it is just easier to define capacity-scale as 1200 and
capacities as 1200 and 600. Last reason that I can think of right now is
that we don't probably want to bound ourself to some particular range
from the beginning, as that range might be enough now, but it could
change in the future (as in, right now [1-1024] looks fine for
scheduling purposes, but that might change).

> > +
> > +==========================================
> > +4 - capacity
> > +==========================================
> > +
> > +capacity is an optional cpu node [1] property: u32 value representing CPU
> > +capacity, relative to capacity-scale. It is required and enforced that capacity
> > +<= capacity-scale.
> 
> I think you need something absolute and probably per MHz (like 
> dynamic-power-coefficient property). Perhaps the IPC (instructions per 
> clock) value?
> 
> In other words, I want to see these numbers have a defined method 
> of determining them and don't want to see random values from every 
> vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
> value of 1024 is not what I want to see. Of course things like cache 
> sizes can vary the performance, but is a baseline value good enough? 
> 

A standard reference baseline is what we advocate with this set, but
making this baseline work for every vendor's implementation is hardly
achievable, IMHO. I don't think we can come up with any number that
applies to each and every implementation; you can have different
revisions of the same core and vendors might make implementation choices
that end up with different peak performance. 

> However, no vendor will want to publish their values if these are 
> absolute values relative to other vendors.
> 

Right. That is why I think we need to abstract numbers, as we do with
capacity-scale.

> If you expect these to need frequent tuning, then don't put them in DT.
> 

I expect that it is possible to come up with a sensible baseline number
for a specific platform implementation, so there is value in
standardizing how we specify this value and how it is then consumed.
Finer grained tuning might then happen both offline (with changes to the
mainline DT) and online (using the sysfs interface), but that should
only apply to a narrow set of use cases.

Thanks,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vincent Guittot Nov. 30, 2015, 9:59 a.m. UTC | #3
Hi Juri,

On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi,
>
> On 23/11/15 20:06, Rob Herring wrote:
>> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
>> > ARM systems may be configured to have cpus with different power/performance
>> > characteristics within the same chip. In this case, additional information
>> > has to be made available to the kernel (the scheduler in particular) for it
>> > to be aware of such differences and take decisions accordingly.
>> >

[snip]

>> > +==========================================
>> > +2 - CPU capacity definition
>> > +==========================================
>> > +
>> > +CPU capacity is a number that provides the scheduler information about CPUs
>> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
>> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
>> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
>> > +context is about differing performance characteristics; this binding tries to
>> > +capture a first-order approximation of the relative performance of CPUs.
>> > +
>> > +One simple way to estimate CPU capacities is to iteratively run a well-known
>> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
>> > +maximum frequency and then normalize values w.r.t.  the best performing CPU.
>> > +One can also do a statistically significant study of a wide collection of
>> > +benchmarks, but pros of such an approach are not really evident at the time of
>> > +writing.
>> > +
>> > +==========================================
>> > +3 - capacity-scale
>> > +==========================================
>> > +
>> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
>> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
>> > +assumed. This property defines both the highest CPU capacity present in the
>> > +system and granularity of CPU capacity values.
>>
>> I don't really see the point of this vs. having an absolute scale.
>>
>
> IMHO, we need this for several reasons, one being to address one of your
> concerns below: vendors are free to choose their scale without being
> forced to publish absolute data. Another reason is that it might make
> life easier in certain cases; for example, someone could implement a
> system with a few clusters of, say, A57s, but some run at half the clock
> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> this case I think it is just easier to define capacity-scale as 1200 and
> capacities as 1200 and 600. Last reason that I can think of right now is
> that we don't probably want to bound ourself to some particular range
> from the beginning, as that range might be enough now, but it could
> change in the future (as in, right now [1-1024] looks fine for
> scheduling purposes, but that might change).

Like Rob, i don't really see the benefit of this optional
capacity-scale property. Parsing the capacity of all cpu nodes should
give you a range as well.
IMHO, this property looks like an optimization of the code that will
parse the dt more than a HW description

>
>> > +
>> > +==========================================
>> > +4 - capacity
>> > +==========================================
>> > +
>> > +capacity is an optional cpu node [1] property: u32 value representing CPU
>> > +capacity, relative to capacity-scale. It is required and enforced that capacity
>> > +<= capacity-scale.
>>
>> I think you need something absolute and probably per MHz (like
>> dynamic-power-coefficient property). Perhaps the IPC (instructions per
>> clock) value?
>>
>> In other words, I want to see these numbers have a defined method
>> of determining them and don't want to see random values from every
>> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
>> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
>> value of 1024 is not what I want to see. Of course things like cache
>> sizes can vary the performance, but is a baseline value good enough?
>>
>
> A standard reference baseline is what we advocate with this set, but
> making this baseline work for every vendor's implementation is hardly
> achievable, IMHO. I don't think we can come up with any number that
> applies to each and every implementation; you can have different
> revisions of the same core and vendors might make implementation choices
> that end up with different peak performance.
>
>> However, no vendor will want to publish their values if these are
>> absolute values relative to other vendors.
>>
>
> Right. That is why I think we need to abstract numbers, as we do with
> capacity-scale.
>
>> If you expect these to need frequent tuning, then don't put them in DT.
>>
>
> I expect that it is possible to come up with a sensible baseline number
> for a specific platform implementation, so there is value in
> standardizing how we specify this value and how it is then consumed.
> Finer grained tuning might then happen both offline (with changes to the
> mainline DT) and online (using the sysfs interface), but that should
> only apply to a narrow set of use cases.
>
> Thanks,
>
> - Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Dec. 1, 2015, 11:20 a.m. UTC | #4
Hi Vincent,

On 30/11/15 10:59, Vincent Guittot wrote:
> Hi Juri,
> 
> On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi,
> >
> > On 23/11/15 20:06, Rob Herring wrote:
> >> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote:
> >> > ARM systems may be configured to have cpus with different power/performance
> >> > characteristics within the same chip. In this case, additional information
> >> > has to be made available to the kernel (the scheduler in particular) for it
> >> > to be aware of such differences and take decisions accordingly.
> >> >
> 
> [snip]
> 
> >> > +==========================================
> >> > +2 - CPU capacity definition
> >> > +==========================================
> >> > +
> >> > +CPU capacity is a number that provides the scheduler information about CPUs
> >> > +heterogeneity. Such heterogeneity can come from micro-architectural differences
> >> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
> >> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
> >> > +context is about differing performance characteristics; this binding tries to
> >> > +capture a first-order approximation of the relative performance of CPUs.
> >> > +
> >> > +One simple way to estimate CPU capacities is to iteratively run a well-known
> >> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
> >> > +maximum frequency and then normalize values w.r.t.  the best performing CPU.
> >> > +One can also do a statistically significant study of a wide collection of
> >> > +benchmarks, but pros of such an approach are not really evident at the time of
> >> > +writing.
> >> > +
> >> > +==========================================
> >> > +3 - capacity-scale
> >> > +==========================================
> >> > +
> >> > +CPUs capacities are defined with respect to capacity-scale property in the cpus
> >> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> >> > +assumed. This property defines both the highest CPU capacity present in the
> >> > +system and granularity of CPU capacity values.
> >>
> >> I don't really see the point of this vs. having an absolute scale.
> >>
> >
> > IMHO, we need this for several reasons, one being to address one of your
> > concerns below: vendors are free to choose their scale without being
> > forced to publish absolute data. Another reason is that it might make
> > life easier in certain cases; for example, someone could implement a
> > system with a few clusters of, say, A57s, but some run at half the clock
> > of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> > this case I think it is just easier to define capacity-scale as 1200 and
> > capacities as 1200 and 600. Last reason that I can think of right now is
> > that we don't probably want to bound ourself to some particular range
> > from the beginning, as that range might be enough now, but it could
> > change in the future (as in, right now [1-1024] looks fine for
> > scheduling purposes, but that might change).
> 
> Like Rob, i don't really see the benefit of this optional
> capacity-scale property. Parsing the capacity of all cpu nodes should
> give you a range as well.
> IMHO, this property looks like an optimization of the code that will
> parse the dt more than a HW description
> 

I agree that we can come up with the same information just looking at
the biggest capacity value of all CPUs and treat that value as
capacity-scale. I just thought that having that explicit made things
clearer, as it could be not easy to immediately see from a DT with many
CPUs which is the biggest capacity value. But, yes, we could remove that
anyway.

Thanks,

- Juri

> >
> >> > +
> >> > +==========================================
> >> > +4 - capacity
> >> > +==========================================
> >> > +
> >> > +capacity is an optional cpu node [1] property: u32 value representing CPU
> >> > +capacity, relative to capacity-scale. It is required and enforced that capacity
> >> > +<= capacity-scale.
> >>
> >> I think you need something absolute and probably per MHz (like
> >> dynamic-power-coefficient property). Perhaps the IPC (instructions per
> >> clock) value?
> >>
> >> In other words, I want to see these numbers have a defined method
> >> of determining them and don't want to see random values from every
> >> vendor. ARM, Ltd. says core X has a value of Y would be good enough for
> >> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a
> >> value of 1024 is not what I want to see. Of course things like cache
> >> sizes can vary the performance, but is a baseline value good enough?
> >>
> >
> > A standard reference baseline is what we advocate with this set, but
> > making this baseline work for every vendor's implementation is hardly
> > achievable, IMHO. I don't think we can come up with any number that
> > applies to each and every implementation; you can have different
> > revisions of the same core and vendors might make implementation choices
> > that end up with different peak performance.
> >
> >> However, no vendor will want to publish their values if these are
> >> absolute values relative to other vendors.
> >>
> >
> > Right. That is why I think we need to abstract numbers, as we do with
> > capacity-scale.
> >
> >> If you expect these to need frequent tuning, then don't put them in DT.
> >>
> >
> > I expect that it is possible to come up with a sensible baseline number
> > for a specific platform implementation, so there is value in
> > standardizing how we specify this value and how it is then consumed.
> > Finer grained tuning might then happen both offline (with changes to the
> > mainline DT) and online (using the sysfs interface), but that should
> > only apply to a narrow set of use cases.
> >
> > Thanks,
> >
> > - Juri
> 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dietmar Eggemann Dec. 10, 2015, 2:14 p.m. UTC | #5
On 01/12/15 11:20, Juri Lelli wrote:
> Hi Vincent,
> 
> On 30/11/15 10:59, Vincent Guittot wrote:
>> Hi Juri,
>>
>> On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote:

[...]

>>>>> +==========================================
>>>>> +3 - capacity-scale
>>>>> +==========================================
>>>>> +
>>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus
>>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
>>>>> +assumed. This property defines both the highest CPU capacity present in the
>>>>> +system and granularity of CPU capacity values.
>>>>
>>>> I don't really see the point of this vs. having an absolute scale.
>>>>
>>>
>>> IMHO, we need this for several reasons, one being to address one of your
>>> concerns below: vendors are free to choose their scale without being
>>> forced to publish absolute data. Another reason is that it might make
>>> life easier in certain cases; for example, someone could implement a
>>> system with a few clusters of, say, A57s, but some run at half the clock
>>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
>>> this case I think it is just easier to define capacity-scale as 1200 and
>>> capacities as 1200 and 600. Last reason that I can think of right now is
>>> that we don't probably want to bound ourself to some particular range
>>> from the beginning, as that range might be enough now, but it could
>>> change in the future (as in, right now [1-1024] looks fine for
>>> scheduling purposes, but that might change).
>>
>> Like Rob, i don't really see the benefit of this optional
>> capacity-scale property. Parsing the capacity of all cpu nodes should
>> give you a range as well.
>> IMHO, this property looks like an optimization of the code that will
>> parse the dt more than a HW description
>>
> 
> I agree that we can come up with the same information just looking at
> the biggest capacity value of all CPUs and treat that value as
> capacity-scale. I just thought that having that explicit made things
> clearer, as it could be not easy to immediately see from a DT with many
> CPUs which is the biggest capacity value. But, yes, we could remove that
> anyway.

+1! This capacity-scale complicates things unnecessarily. It was hard
for me to understand the meaning of it. Your 2. example sets
'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and
'capacity = <1>' for cpu[23]. This can be easily replaced by 'capacity =
<1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more
readable, as it was mentioned already in this thread.

I understand that we don't want to limit the range of capacity values in
the dt file to [1..1024] nor enforce that the cpu w/ the highest
capacity has to have the value of 1024 in the dt file so the scheduler
has to scale accordingly if we want to limit capacity to its supported
capacity range (like with EAS [1..1024]).

[...]

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 10, 2015, 3:30 p.m. UTC | #6
On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:

> I think you need something absolute and probably per MHz (like 
> dynamic-power-coefficient property). Perhaps the IPC (instructions per 
> clock) value?

> In other words, I want to see these numbers have a defined method 
> of determining them and don't want to see random values from every 
> vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
> value of 1024 is not what I want to see. Of course things like cache 
> sizes can vary the performance, but is a baseline value good enough? 

> However, no vendor will want to publish their values if these are 
> absolute values relative to other vendors.

> If you expect these to need frequent tuning, then don't put them in DT.

I agree strongly.  Putting what are essentially tuning numbers for the
system into the ABI is going to lead us into a mess long term since if
we change anything related to the performance of the system the numbers
may become invalid and we've no real way of recovering sensible
information.

There is of course also the issue where people are getting the numbers
from in the first place - were the numbers picked for some particular
use case or to optimise some particular benchmark, what other conditions
existed at the time (cpufreq setup for example), what tuning goals did
the people picking the numbers have and do any of those things
correspond to what a given user wants?  If detailed tuning the numbers
for specific systems matters much will we get competing users patching
the in kernel DTs over and over, and what do we do about ACPI systems?
Having an absolute definition doesn't really help with this since the
concrete effect DT authors see is that these are tuning numbers.

It would be better to have the DT describe concrete physical properties
of the system which we can then map onto numbers we like, that way if we
get better information in future or just decide that completely
different metrics are appropriate for tuning we can just do that without
having to worry about translating the old metrics into new ones.  We can
then expose the tuning knobs to userspace for override if that's needed.
If doing system specific tuning on vertically integrated systems really
is terribly important it's not going to matter too much where the tuning
is but we also have to consider more general purpose systems.

We're not going to get out of having to pick numbers at some point,
pushing them into DT doesn't get us out of that but it does make the
situation harder to manage long term and makes the performance for the
general user less relaible.  It's also just more work all round,
everyone doing the DT for a SoC is going to have to do some combination
of cargo culting or repeating the callibration.

I remember Peter remarking at one of the LPC discussions of this idea
that there had been some bad experiences with getting numbers from 
firmware on other systems.
Juri Lelli Dec. 10, 2015, 5:58 p.m. UTC | #7
Hi Mark,

I certainly understand your (and Rob's) concerns, but let me try anyway
to argument a bit more around this approach :-).

On 10/12/15 15:30, Mark Brown wrote:
> On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:
> 
> > I think you need something absolute and probably per MHz (like 
> > dynamic-power-coefficient property). Perhaps the IPC (instructions per 
> > clock) value?
> 
> > In other words, I want to see these numbers have a defined method 
> > of determining them and don't want to see random values from every 
> > vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
> > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
> > value of 1024 is not what I want to see. Of course things like cache 
> > sizes can vary the performance, but is a baseline value good enough? 
> 
> > However, no vendor will want to publish their values if these are 
> > absolute values relative to other vendors.
> 
> > If you expect these to need frequent tuning, then don't put them in DT.
> 
> I agree strongly.  Putting what are essentially tuning numbers for the
> system into the ABI is going to lead us into a mess long term since if
> we change anything related to the performance of the system the numbers
> may become invalid and we've no real way of recovering sensible
> information.
> 
> There is of course also the issue where people are getting the numbers
> from in the first place - were the numbers picked for some particular
> use case or to optimise some particular benchmark, what other conditions
> existed at the time (cpufreq setup for example), what tuning goals did
> the people picking the numbers have and do any of those things
> correspond to what a given user wants?  If detailed tuning the numbers
> for specific systems matters much will we get competing users patching
> the in kernel DTs over and over, and what do we do about ACPI systems?
> Having an absolute definition doesn't really help with this since the
> concrete effect DT authors see is that these are tuning numbers.
> 

I'm not entirely getting here why you consider capacity values to be
tunables. As part of the EAS effort, we are proposing ways in which users
should be able to fine tune their system as needed, when required
(don't know if you had a chance to have a look at the SchedTune posting
back in August for example [1]). This patch tries to only standardize
where do we get default values from and how we specify them. I'm not
seeing them changing much after an initial benchmarking phase has been
done. Tuning should happen using different methods, not by changing
these values, IMHO.

> It would be better to have the DT describe concrete physical properties
> of the system which we can then map onto numbers we like, that way if we
> get better information in future or just decide that completely
> different metrics are appropriate for tuning we can just do that without
> having to worry about translating the old metrics into new ones.  We can
> then expose the tuning knobs to userspace for override if that's needed.
> If doing system specific tuning on vertically integrated systems really
> is terribly important it's not going to matter too much where the tuning
> is but we also have to consider more general purpose systems.
> 

As replied to Rob, I'm not sure it is so easy to find any physical
property that expresses what we essentially need (without maybe relying
on a complex mix of hardware details and a model to extract numbers from
them). Instead, we propose to have reasonable, per SoC, default numbers;
and then let users fine tune their platform afterwards, without changing
those default values.

> We're not going to get out of having to pick numbers at some point,
> pushing them into DT doesn't get us out of that but it does make the
> situation harder to manage long term and makes the performance for the
> general user less relaible.  It's also just more work all round,
> everyone doing the DT for a SoC is going to have to do some combination
> of cargo culting or repeating the callibration.
> 

I'm most probably a bit naive here, but I see the calibration phase
happening only once, after the platform is stable. You get default
capacity values by running a pretty simple benchmark on a fixed
configuration; and you put them somewhere (DTs still seem to be a
sensible place to me). Then you'll be able to suit tuning needs using
different interfaces.

ABI changes have to be carefully considered, I know. But still, we need
to agree on a way to provide these default capacity values someway. So,
thanks for helping us carry on this discussion.

Best,

- Juri

> I remember Peter remarking at one of the LPC discussions of this idea
> that there had been some bad experiences with getting numbers from 
> firmware on other systems.

[1] https://lkml.org/lkml/2015/8/19/419
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Dec. 11, 2015, 10:09 a.m. UTC | #8
Hi,

On 10/12/15 14:14, Dietmar Eggemann wrote:
> On 01/12/15 11:20, Juri Lelli wrote:
> > Hi Vincent,
> > 
> > On 30/11/15 10:59, Vincent Guittot wrote:
> >> Hi Juri,
> >>
> >> On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote:
> 
> [...]
> 
> >>>>> +==========================================
> >>>>> +3 - capacity-scale
> >>>>> +==========================================
> >>>>> +
> >>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus
> >>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is
> >>>>> +assumed. This property defines both the highest CPU capacity present in the
> >>>>> +system and granularity of CPU capacity values.
> >>>>
> >>>> I don't really see the point of this vs. having an absolute scale.
> >>>>
> >>>
> >>> IMHO, we need this for several reasons, one being to address one of your
> >>> concerns below: vendors are free to choose their scale without being
> >>> forced to publish absolute data. Another reason is that it might make
> >>> life easier in certain cases; for example, someone could implement a
> >>> system with a few clusters of, say, A57s, but some run at half the clock
> >>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in
> >>> this case I think it is just easier to define capacity-scale as 1200 and
> >>> capacities as 1200 and 600. Last reason that I can think of right now is
> >>> that we don't probably want to bound ourself to some particular range
> >>> from the beginning, as that range might be enough now, but it could
> >>> change in the future (as in, right now [1-1024] looks fine for
> >>> scheduling purposes, but that might change).
> >>
> >> Like Rob, i don't really see the benefit of this optional
> >> capacity-scale property. Parsing the capacity of all cpu nodes should
> >> give you a range as well.
> >> IMHO, this property looks like an optimization of the code that will
> >> parse the dt more than a HW description
> >>
> > 
> > I agree that we can come up with the same information just looking at
> > the biggest capacity value of all CPUs and treat that value as
> > capacity-scale. I just thought that having that explicit made things
> > clearer, as it could be not easy to immediately see from a DT with many
> > CPUs which is the biggest capacity value. But, yes, we could remove that
> > anyway.
> 
> +1! This capacity-scale complicates things unnecessarily. It was hard
> for me to understand the meaning of it. Your 2. example sets
> 'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and
> 'capacity = <1>' for cpu[23]. This can be easily replaced by 'capacity =
> <1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more
> readable, as it was mentioned already in this thread.
> 
> I understand that we don't want to limit the range of capacity values in
> the dt file to [1..1024] nor enforce that the cpu w/ the highest
> capacity has to have the value of 1024 in the dt file so the scheduler
> has to scale accordingly if we want to limit capacity to its supported
> capacity range (like with EAS [1..1024]).
> 

OK, I guess I can easily remove capacity-value and simply normalize CPU
capacities w.r.t. the highest capacity in the DT.

Thanks,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 11, 2015, 5:49 p.m. UTC | #9
On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote:
> On 10/12/15 15:30, Mark Brown wrote:
> > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:

> > > In other words, I want to see these numbers have a defined method 
> > > of determining them and don't want to see random values from every 
> > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
> > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
> > > value of 1024 is not what I want to see. Of course things like cache 
> > > sizes can vary the performance, but is a baseline value good enough? 

> > > However, no vendor will want to publish their values if these are 
> > > absolute values relative to other vendors.

> > > If you expect these to need frequent tuning, then don't put them in DT.

> > I agree strongly.  Putting what are essentially tuning numbers for the
> > system into the ABI is going to lead us into a mess long term since if
> > we change anything related to the performance of the system the numbers
> > may become invalid and we've no real way of recovering sensible
> > information.

> I'm not entirely getting here why you consider capacity values to be
> tunables. As part of the EAS effort, we are proposing ways in which users

The purpose of the capacity values is to influence the scheduler
behaviour and hence performance.  Without a concrete definition they're
just magic numbers which have meaining only in terms of their effect on
the performance of the system.  That is a sufficiently complex outcome
to ensure that there will be an element of taste in what the desired
outcomes are.  Sounds like tuneables to me.

> should be able to fine tune their system as needed, when required
> (don't know if you had a chance to have a look at the SchedTune posting
> back in August for example [1]). This patch tries to only standardize
> where do we get default values from and how we specify them. I'm not
> seeing them changing much after an initial benchmarking phase has been
> done. Tuning should happen using different methods, not by changing
> these values, IMHO.

If you are saying people should use other, more sensible, ways of
specifying the final values that actually get used in production then
why take the defaults from direct numbers DT in the first place?  If you
are saying that people should tune and then put the values in here then
that's problematic for the reasons I outlined.

> > It would be better to have the DT describe concrete physical properties
> > of the system which we can then map onto numbers we like, that way if we
> > get better information in future or just decide that completely
> > different metrics are appropriate for tuning we can just do that without
> > having to worry about translating the old metrics into new ones.  We can
> > then expose the tuning knobs to userspace for override if that's needed.
> > If doing system specific tuning on vertically integrated systems really
> > is terribly important it's not going to matter too much where the tuning
> > is but we also have to consider more general purpose systems.

> As replied to Rob, I'm not sure it is so easy to find any physical
> property that expresses what we essentially need (without maybe relying
> on a complex mix of hardware details and a model to extract numbers from
> them). Instead, we propose to have reasonable, per SoC, default numbers;
> and then let users fine tune their platform afterwards, without changing
> those default values.

If users are supposed to do fine tuning elsewhere after the fact why
bother with this initial callibration?  Something that's ballpark good
enough like just knowing the core used and perhaps some important
options on it should give an adequate starting point and not have the
issues with having the tuning numbers present as magic numbers.  Perhaps
we might also feed cache information in at some point.  If in future
we're able to improve those default numbers (or just adapt at runtime)
then even better.

It also seems a bit strange to expect people to do some tuning in one
place initially and then additional tuning somewhere else later, from
a user point of view I'd expect to always do my tuning in the same
place.

> > We're not going to get out of having to pick numbers at some point,
> > pushing them into DT doesn't get us out of that but it does make the
> > situation harder to manage long term and makes the performance for the
> > general user less relaible.  It's also just more work all round,
> > everyone doing the DT for a SoC is going to have to do some combination
> > of cargo culting or repeating the callibration.

> I'm most probably a bit naive here, but I see the calibration phase
> happening only once, after the platform is stable. You get default
> capacity values by running a pretty simple benchmark on a fixed
> configuration; and you put them somewhere (DTs still seem to be a
> sensible place to me). Then you'll be able to suit tuning needs using
> different interfaces.

My point is that everyone making any kind of SoC with asymmetries is
expected to go and do some kind of callibration based on some unclear
criteria, if these are just ballpark accurate starting points that seems
like wasted effort - the kernel should be making a reasonable effort to
do something sensible without this information which is going to be less
effort all round.  It doesn't need to wait for real silicon (this seems
like the sort of core bit of DT which will be being written pre-tapeout)
and doesn't have marketing implications.

Doing that and then switching to some other interface for real tuning
seems especially odd and I'm not sure that's something that users are
going to expect or understand.
Juri Lelli Dec. 14, 2015, 12:36 p.m. UTC | #10
Hi Mark,

On 11/12/15 17:49, Mark Brown wrote:
> On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote:
> > On 10/12/15 15:30, Mark Brown wrote:
> > > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:
> 
> > > > In other words, I want to see these numbers have a defined method 
> > > > of determining them and don't want to see random values from every 
> > > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for 
> > > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a 
> > > > value of 1024 is not what I want to see. Of course things like cache 
> > > > sizes can vary the performance, but is a baseline value good enough? 
> 
> > > > However, no vendor will want to publish their values if these are 
> > > > absolute values relative to other vendors.
> 
> > > > If you expect these to need frequent tuning, then don't put them in DT.
> 
> > > I agree strongly.  Putting what are essentially tuning numbers for the
> > > system into the ABI is going to lead us into a mess long term since if
> > > we change anything related to the performance of the system the numbers
> > > may become invalid and we've no real way of recovering sensible
> > > information.
> 
> > I'm not entirely getting here why you consider capacity values to be
> > tunables. As part of the EAS effort, we are proposing ways in which users
> 
> The purpose of the capacity values is to influence the scheduler
> behaviour and hence performance.  Without a concrete definition they're
> just magic numbers which have meaining only in terms of their effect on
> the performance of the system.  That is a sufficiently complex outcome
> to ensure that there will be an element of taste in what the desired
> outcomes are.  Sounds like tuneables to me.
> 

Capacity values are meant to describe asymmetry (if any) of the system
CPUs to the scheduler. The scheduler can then use this additional bit of
information to try to do better scheduling decisions. Yes, having these
values available will end up giving you better performance, but I guess
this apply to any information we provide to the kernel (and scheduler);
the less dumb a subsystem is, the better we can make it work.

> > should be able to fine tune their system as needed, when required
> > (don't know if you had a chance to have a look at the SchedTune posting
> > back in August for example [1]). This patch tries to only standardize
> > where do we get default values from and how we specify them. I'm not
> > seeing them changing much after an initial benchmarking phase has been
> > done. Tuning should happen using different methods, not by changing
> > these values, IMHO.
> 
> If you are saying people should use other, more sensible, ways of
> specifying the final values that actually get used in production then
> why take the defaults from direct numbers DT in the first place?  If you
> are saying that people should tune and then put the values in here then
> that's problematic for the reasons I outlined.
> 

IMHO, people should come up with default values that describe
heterogeneity in their system. Then use other ways to tune the system at
run time (depending on the workload maybe).

As said, I understand your concerns; but, what I don't still get is
where CPU capacity values are so different from, say, idle states
min-residency-us. AFAIK there is a per-SoC benchmarking phase required
to come up with that values as well; you have to pick some benchmark
that stresses worst case entry/exit while measuring energy, then make
calculations that tells you when it is wise to enter a particular idle
state. Ideally we should derive min residency from specs, but I'm not
sure is how it works in practice.

> > > It would be better to have the DT describe concrete physical properties
> > > of the system which we can then map onto numbers we like, that way if we
> > > get better information in future or just decide that completely
> > > different metrics are appropriate for tuning we can just do that without
> > > having to worry about translating the old metrics into new ones.  We can
> > > then expose the tuning knobs to userspace for override if that's needed.
> > > If doing system specific tuning on vertically integrated systems really
> > > is terribly important it's not going to matter too much where the tuning
> > > is but we also have to consider more general purpose systems.
> 
> > As replied to Rob, I'm not sure it is so easy to find any physical
> > property that expresses what we essentially need (without maybe relying
> > on a complex mix of hardware details and a model to extract numbers from
> > them). Instead, we propose to have reasonable, per SoC, default numbers;
> > and then let users fine tune their platform afterwards, without changing
> > those default values.
> 
> If users are supposed to do fine tuning elsewhere after the fact why
> bother with this initial callibration?  Something that's ballpark good
> enough like just knowing the core used and perhaps some important
> options on it should give an adequate starting point and not have the
> issues with having the tuning numbers present as magic numbers.  Perhaps
> we might also feed cache information in at some point.  If in future
> we're able to improve those default numbers (or just adapt at runtime)
> then even better.
> 
> It also seems a bit strange to expect people to do some tuning in one
> place initially and then additional tuning somewhere else later, from
> a user point of view I'd expect to always do my tuning in the same
> place.
> 

I think that runtime tuning needs are much more complex and have finer
grained needs than what you can achieve by playing with CPU capacities.
And I agree with you, users should only play with these other methods
I'm referring to; they should not mess around with platform description
bits. They should provide information about runtime needs, then the
scheduler (in this case) will do its best to give them acceptable
performance using improved knowledge about the platform.

> > > We're not going to get out of having to pick numbers at some point,
> > > pushing them into DT doesn't get us out of that but it does make the
> > > situation harder to manage long term and makes the performance for the
> > > general user less relaible.  It's also just more work all round,
> > > everyone doing the DT for a SoC is going to have to do some combination
> > > of cargo culting or repeating the callibration.
> 
> > I'm most probably a bit naive here, but I see the calibration phase
> > happening only once, after the platform is stable. You get default
> > capacity values by running a pretty simple benchmark on a fixed
> > configuration; and you put them somewhere (DTs still seem to be a
> > sensible place to me). Then you'll be able to suit tuning needs using
> > different interfaces.
> 
> My point is that everyone making any kind of SoC with asymmetries is
> expected to go and do some kind of callibration based on some unclear
> criteria, if these are just ballpark accurate starting points that seems
> like wasted effort - the kernel should be making a reasonable effort to
> do something sensible without this information which is going to be less
> effort all round.  It doesn't need to wait for real silicon (this seems
> like the sort of core bit of DT which will be being written pre-tapeout)
> and doesn't have marketing implications.
> 
> Doing that and then switching to some other interface for real tuning
> seems especially odd and I'm not sure that's something that users are
> going to expect or understand.

As I'm saying above, users should not care about this first step of
platform description; not more than how much they care about other bits
in DTs that describe their platform.

Thanks,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 14, 2015, 4:59 p.m. UTC | #11
On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> On 11/12/15 17:49, Mark Brown wrote:

> > The purpose of the capacity values is to influence the scheduler
> > behaviour and hence performance.  Without a concrete definition they're
> > just magic numbers which have meaining only in terms of their effect on
> > the performance of the system.  That is a sufficiently complex outcome
> > to ensure that there will be an element of taste in what the desired
> > outcomes are.  Sounds like tuneables to me.

> Capacity values are meant to describe asymmetry (if any) of the system
> CPUs to the scheduler. The scheduler can then use this additional bit of
> information to try to do better scheduling decisions. Yes, having these
> values available will end up giving you better performance, but I guess
> this apply to any information we provide to the kernel (and scheduler);
> the less dumb a subsystem is, the better we can make it work.

This information is a magic number, there's never going to be a right
answer.  If it needs changing it's not like the kernel is modeling a
concrete thing like the relative performance of the A53 and A57 poorly
or whatever, it's just that the relative values of number A and number B
are not what the system integrator desires.

> > If you are saying people should use other, more sensible, ways of
> > specifying the final values that actually get used in production then
> > why take the defaults from direct numbers DT in the first place?  If you
> > are saying that people should tune and then put the values in here then
> > that's problematic for the reasons I outlined.

> IMHO, people should come up with default values that describe
> heterogeneity in their system. Then use other ways to tune the system at
> run time (depending on the workload maybe).

My argument is that they should be describing the hetrogeneity of their
system by describing concrete properties of their system rather than by
providing magic numbers.

> As said, I understand your concerns; but, what I don't still get is
> where CPU capacity values are so different from, say, idle states
> min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> to come up with that values as well; you have to pick some benchmark
> that stresses worst case entry/exit while measuring energy, then make
> calculations that tells you when it is wise to enter a particular idle
> state. Ideally we should derive min residency from specs, but I'm not
> sure is how it works in practice.

Those at least have a concrete physical value that it is possible to
measure in a describable way that is unlikely to change based on the
internals of the kernel.  It would be kind of nice to have the broken
down numbers for entry time, exit time and power burn in suspend but
it's not clear it's worth the bother.  It's also one of these things
where we don't have any real proxies that get us anywhere in the
ballpark of where we want to be.

> > It also seems a bit strange to expect people to do some tuning in one
> > place initially and then additional tuning somewhere else later, from
> > a user point of view I'd expect to always do my tuning in the same
> > place.

> I think that runtime tuning needs are much more complex and have finer
> grained needs than what you can achieve by playing with CPU capacities.
> And I agree with you, users should only play with these other methods
> I'm referring to; they should not mess around with platform description
> bits. They should provide information about runtime needs, then the
> scheduler (in this case) will do its best to give them acceptable
> performance using improved knowledge about the platform.

So then why isn't it adequate to just have things like the core types in
there and work from there?  Are we really expecting the tuning to be so
much better than it's possible to come up with something that's so much
better on the scale that we're expecting this to be accurate that it's
worth just jumping straight to magic numbers?

> > Doing that and then switching to some other interface for real tuning
> > seems especially odd and I'm not sure that's something that users are
> > going to expect or understand.

> As I'm saying above, users should not care about this first step of
> platform description; not more than how much they care about other bits
> in DTs that describe their platform.

That may be your intention but I don't see how it is realistic to expect
that this is what people will actually understand.  It's a number, it
has an effect and it's hard to see that people won't tune it, it's not
like people don't have to edit DTs during system integration.  People
won't reliably read documentation or look in mailing list threads and
other that that it has all the properties of a tuning interface.

There's a tension here between what you're saying about people not being
supposed to care much about the numbers for tuning and the very fact
that there's a need for the DT to carry explicit numbers.
Juri Lelli Dec. 15, 2015, 12:22 p.m. UTC | #12
On 14/12/15 16:59, Mark Brown wrote:
> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> > On 11/12/15 17:49, Mark Brown wrote:
> 
> > > The purpose of the capacity values is to influence the scheduler
> > > behaviour and hence performance.  Without a concrete definition they're
> > > just magic numbers which have meaining only in terms of their effect on
> > > the performance of the system.  That is a sufficiently complex outcome
> > > to ensure that there will be an element of taste in what the desired
> > > outcomes are.  Sounds like tuneables to me.
> 
> > Capacity values are meant to describe asymmetry (if any) of the system
> > CPUs to the scheduler. The scheduler can then use this additional bit of
> > information to try to do better scheduling decisions. Yes, having these
> > values available will end up giving you better performance, but I guess
> > this apply to any information we provide to the kernel (and scheduler);
> > the less dumb a subsystem is, the better we can make it work.
> 
> This information is a magic number, there's never going to be a right
> answer.  If it needs changing it's not like the kernel is modeling a
> concrete thing like the relative performance of the A53 and A57 poorly
> or whatever, it's just that the relative values of number A and number B
> are not what the system integrator desires.
> 
> > > If you are saying people should use other, more sensible, ways of
> > > specifying the final values that actually get used in production then
> > > why take the defaults from direct numbers DT in the first place?  If you
> > > are saying that people should tune and then put the values in here then
> > > that's problematic for the reasons I outlined.
> 
> > IMHO, people should come up with default values that describe
> > heterogeneity in their system. Then use other ways to tune the system at
> > run time (depending on the workload maybe).
> 
> My argument is that they should be describing the hetrogeneity of their
> system by describing concrete properties of their system rather than by
> providing magic numbers.
> 
> > As said, I understand your concerns; but, what I don't still get is
> > where CPU capacity values are so different from, say, idle states
> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
> > to come up with that values as well; you have to pick some benchmark
> > that stresses worst case entry/exit while measuring energy, then make
> > calculations that tells you when it is wise to enter a particular idle
> > state. Ideally we should derive min residency from specs, but I'm not
> > sure is how it works in practice.
> 
> Those at least have a concrete physical value that it is possible to
> measure in a describable way that is unlikely to change based on the
> internals of the kernel.  It would be kind of nice to have the broken
> down numbers for entry time, exit time and power burn in suspend but
> it's not clear it's worth the bother.  It's also one of these things
> where we don't have any real proxies that get us anywhere in the
> ballpark of where we want to be.
> 

I'm proposing to add a new value because I couldn't find any proxies in
the current bindings that bring us any close to what we need. If I
failed in looking for them, and they actually exists, I'll personally be
more then happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first
paragraph above almost as it is to describe how to come up with capacity
values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a
well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
each CPU at maximum frequency and then normalize values w.r.t.  the best
performing CPU."

I don't see why this should change if we decide that the scheduler has
to change in the future.

Also, looking again at section 2 of idle-states bindings docs, we have a
nice and accurate description of what min-residency is, but not much
info about how we can actually measure that. Maybe, expanding the docs
section regarding CPU capacity could help?

> > > It also seems a bit strange to expect people to do some tuning in one
> > > place initially and then additional tuning somewhere else later, from
> > > a user point of view I'd expect to always do my tuning in the same
> > > place.
> 
> > I think that runtime tuning needs are much more complex and have finer
> > grained needs than what you can achieve by playing with CPU capacities.
> > And I agree with you, users should only play with these other methods
> > I'm referring to; they should not mess around with platform description
> > bits. They should provide information about runtime needs, then the
> > scheduler (in this case) will do its best to give them acceptable
> > performance using improved knowledge about the platform.
> 
> So then why isn't it adequate to just have things like the core types in
> there and work from there?  Are we really expecting the tuning to be so
> much better than it's possible to come up with something that's so much
> better on the scale that we're expecting this to be accurate that it's
> worth just jumping straight to magic numbers?
> 

I take your point here that having fine grained values might not really
give us appreciable differences (that is also why I proposed the
capacity-scale in the first instance), but I'm not sure I'm getting what
you are proposing here.

Today, and for arm only, we have a static table representing CPUs
"efficiency":

 /*
  * Table of relative efficiency of each processors
  * The efficiency value must fit in 20bit and the final
  * cpu_scale value must be in the range
  *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
  * in order to return at most 1 when DIV_ROUND_CLOSEST
  * is used to compute the capacity of a CPU.
  * Processors that are not defined in the table,
  * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
  */
 static const struct cpu_efficiency table_efficiency[] = {
 	{"arm,cortex-a15", 3891},
 	{"arm,cortex-a7",  2048},
 	{NULL, },
 };

When clock-frequency property is defined in DT, we try to find a match
for the compatibility string in the table above and then use the
associate number to compute the capacity. Are you proposing to have
something like this for arm64 as well?

BTW, the only info I could find about those numbers is from this thread

 http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers
somewhere else?

If I understand how that table was created, how do we think we will
extend it in the future to allow newer core types (say we replicate this
solution for arm64)?  It seems that we have to change it, rescaling
values, each time we have a new core on the market. How can we come up
with relative numbers, in the future, comparing newer cores to old ones
(that might be already out of the market by that time)?

> > > Doing that and then switching to some other interface for real tuning
> > > seems especially odd and I'm not sure that's something that users are
> > > going to expect or understand.
> 
> > As I'm saying above, users should not care about this first step of
> > platform description; not more than how much they care about other bits
> > in DTs that describe their platform.
> 
> That may be your intention but I don't see how it is realistic to expect
> that this is what people will actually understand.  It's a number, it
> has an effect and it's hard to see that people won't tune it, it's not
> like people don't have to edit DTs during system integration.  People
> won't reliably read documentation or look in mailing list threads and
> other that that it has all the properties of a tuning interface.
> 

Eh, sad but true. I guess we can, as we usually do, put more effort in
documenting how things are supposed to be used. Then, if people think
that they can make their system perform better without looking at
documentation or asking around, I'm not sure there is much we could do
to prevent them to do things wrong. There are already lot of things
people shouldn't touch if they don't know what they are doing. :-/

> There's a tension here between what you're saying about people not being
> supposed to care much about the numbers for tuning and the very fact
> that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs,
but put all their efforts in describing (using appropriate APIs) their
needs and how they apply to the workload they care about. Our job is to
put together information coming from users and knowledge of system
configuration to provide people the desired outcomes.

Best,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 15, 2015, 1:39 p.m. UTC | #13
On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:

> I'm proposing to add a new value because I couldn't find any proxies in
> the current bindings that bring us any close to what we need. If I
> failed in looking for them, and they actually exists, I'll personally be
> more then happy to just rely on them instead of adding more stuff :-).

Well, the first pass is going to be the core types (possibly with some
other properties if there's interesting parameters people can tweak in
integration).

> Interestingly, to me it sounds like we could actually use your first
> paragraph above almost as it is to describe how to come up with capacity
> values. In the documentation I put the following:

> "One simple way to estimate CPU capacities is to iteratively run a
> well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
> each CPU at maximum frequency and then normalize values w.r.t.  the best
> performing CPU."

> I don't see why this should change if we decide that the scheduler has
> to change in the future.

You'd at least need to pick a particular benchmark there...

> Also, looking again at section 2 of idle-states bindings docs, we have a
> nice and accurate description of what min-residency is, but not much
> info about how we can actually measure that. Maybe, expanding the docs
> section regarding CPU capacity could help?

I'm dubious about this to be honest, if only on the basis of people
reading the docs in the first place.  It also seems that if we're really
just talking about some CPU microbenchmark then forcing every
implementor to do the benchmark and scaling is at best going to burn
people's time and be error prone since it doesn't seem so likely to have
dramatic system integration variation.

> > So then why isn't it adequate to just have things like the core types in
> > there and work from there?  Are we really expecting the tuning to be so
> > much better than it's possible to come up with something that's so much
> > better on the scale that we're expecting this to be accurate that it's
> > worth just jumping straight to magic numbers?

> I take your point here that having fine grained values might not really
> give us appreciable differences (that is also why I proposed the
> capacity-scale in the first instance), but I'm not sure I'm getting what
> you are proposing here.

Something like the existing solution for arm32.

>  static const struct cpu_efficiency table_efficiency[] = {
>  	{"arm,cortex-a15", 3891},
>  	{"arm,cortex-a7",  2048},
>  	{NULL, },
>  };

> When clock-frequency property is defined in DT, we try to find a match
> for the compatibility string in the table above and then use the
> associate number to compute the capacity. Are you proposing to have
> something like this for arm64 as well?

> BTW, the only info I could find about those numbers is from this thread

It was discussed in some other thread when I was sending the equivalent
stuff for arm64 (I never got round to finishing it off due to issues
with Catalin and Will being concerned about the specific numbers).
Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
numbers that ARM publish for the cores.  I'd independently done the same
thing for arm64.  It would probably help to put comments in there with
the base numbers before scaling, or just redo the table in terms of the
raw numbers.

This is, of course, an example of my concerns about magic number
configuration.

> If I understand how that table was created, how do we think we will
> extend it in the future to allow newer core types (say we replicate this
> solution for arm64)?  It seems that we have to change it, rescaling
> values, each time we have a new core on the market. How can we come up
> with relative numbers, in the future, comparing newer cores to old ones
> (that might be already out of the market by that time)?

It doesn't seem particularly challenging to add new numbers to the table
(and add additional properties to select on) TBH.  We can either rescale
by hand in the table when adding entries, script it as part of the
kernel build or do it at runtime (as the arm32 code already does to an
extent based on the particular set of cores we find).  What difficulties
do you see with this?

This is something that seems like an advantage to me - we can just
replace everything at any point, we're not tied to trusting the golden
benchmark someone did (or tweaked) if we come up with a better
methodology later on.

> Eh, sad but true. I guess we can, as we usually do, put more effort in
> documenting how things are supposed to be used. Then, if people think
> that they can make their system perform better without looking at
> documentation or asking around, I'm not sure there is much we could do
> to prevent them to do things wrong. There are already lot of things
> people shouldn't touch if they don't know what they are doing. :-/

The trouble with making people specify this in DT is that it becomes a
parameter that someone *has* to tweak at some point.

> > There's a tension here between what you're saying about people not being
> > supposed to care much about the numbers for tuning and the very fact
> > that there's a need for the DT to carry explicit numbers.

> My point is that people with tuning needs shouldn't even look at DTs,
> but put all their efforts in describing (using appropriate APIs) their
> needs and how they apply to the workload they care about. Our job is to
> put together information coming from users and knowledge of system
> configuration to provide people the desired outcomes.

You can't do a system integration for a smartphone or embedded board
without configuring things in DT, people integrating for those systems
are already looking at DT and they are the main current targets for
heterogeneous systems I'm aware of.  The people who don't need to look
at DT are mainly the enterprise type people but a lot of them won't be
able to use this as-is anyway since they'll be using ACPI.
Vincent Guittot Dec. 15, 2015, 1:55 p.m. UTC | #14
On 15 December 2015 at 13:22, Juri Lelli <juri.lelli@arm.com> wrote:
> On 14/12/15 16:59, Mark Brown wrote:
>> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
>> > On 11/12/15 17:49, Mark Brown wrote:
>>
>> > > The purpose of the capacity values is to influence the scheduler
>> > > behaviour and hence performance.  Without a concrete definition they're
>> > > just magic numbers which have meaining only in terms of their effect on
>> > > the performance of the system.  That is a sufficiently complex outcome
>> > > to ensure that there will be an element of taste in what the desired
>> > > outcomes are.  Sounds like tuneables to me.
>>
>> > Capacity values are meant to describe asymmetry (if any) of the system
>> > CPUs to the scheduler. The scheduler can then use this additional bit of
>> > information to try to do better scheduling decisions. Yes, having these
>> > values available will end up giving you better performance, but I guess
>> > this apply to any information we provide to the kernel (and scheduler);
>> > the less dumb a subsystem is, the better we can make it work.
>>
>> This information is a magic number, there's never going to be a right
>> answer.  If it needs changing it's not like the kernel is modeling a
>> concrete thing like the relative performance of the A53 and A57 poorly
>> or whatever, it's just that the relative values of number A and number B
>> are not what the system integrator desires.
>>
>> > > If you are saying people should use other, more sensible, ways of
>> > > specifying the final values that actually get used in production then
>> > > why take the defaults from direct numbers DT in the first place?  If you
>> > > are saying that people should tune and then put the values in here then
>> > > that's problematic for the reasons I outlined.
>>
>> > IMHO, people should come up with default values that describe
>> > heterogeneity in their system. Then use other ways to tune the system at
>> > run time (depending on the workload maybe).
>>
>> My argument is that they should be describing the hetrogeneity of their
>> system by describing concrete properties of their system rather than by
>> providing magic numbers.
>>
>> > As said, I understand your concerns; but, what I don't still get is
>> > where CPU capacity values are so different from, say, idle states
>> > min-residency-us. AFAIK there is a per-SoC benchmarking phase required
>> > to come up with that values as well; you have to pick some benchmark
>> > that stresses worst case entry/exit while measuring energy, then make
>> > calculations that tells you when it is wise to enter a particular idle
>> > state. Ideally we should derive min residency from specs, but I'm not
>> > sure is how it works in practice.
>>
>> Those at least have a concrete physical value that it is possible to
>> measure in a describable way that is unlikely to change based on the
>> internals of the kernel.  It would be kind of nice to have the broken
>> down numbers for entry time, exit time and power burn in suspend but
>> it's not clear it's worth the bother.  It's also one of these things
>> where we don't have any real proxies that get us anywhere in the
>> ballpark of where we want to be.
>>
>
> I'm proposing to add a new value because I couldn't find any proxies in
> the current bindings that bring us any close to what we need. If I
> failed in looking for them, and they actually exists, I'll personally be
> more then happy to just rely on them instead of adding more stuff :-).
>
> Interestingly, to me it sounds like we could actually use your first
> paragraph above almost as it is to describe how to come up with capacity
> values. In the documentation I put the following:
>
> "One simple way to estimate CPU capacities is to iteratively run a
> well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on
> each CPU at maximum frequency and then normalize values w.r.t.  the best
> performing CPU."
>
> I don't see why this should change if we decide that the scheduler has
> to change in the future.
>
> Also, looking again at section 2 of idle-states bindings docs, we have a
> nice and accurate description of what min-residency is, but not much
> info about how we can actually measure that. Maybe, expanding the docs
> section regarding CPU capacity could help?
>
>> > > It also seems a bit strange to expect people to do some tuning in one
>> > > place initially and then additional tuning somewhere else later, from
>> > > a user point of view I'd expect to always do my tuning in the same
>> > > place.
>>
>> > I think that runtime tuning needs are much more complex and have finer
>> > grained needs than what you can achieve by playing with CPU capacities.
>> > And I agree with you, users should only play with these other methods
>> > I'm referring to; they should not mess around with platform description
>> > bits. They should provide information about runtime needs, then the
>> > scheduler (in this case) will do its best to give them acceptable
>> > performance using improved knowledge about the platform.
>>
>> So then why isn't it adequate to just have things like the core types in
>> there and work from there?  Are we really expecting the tuning to be so
>> much better than it's possible to come up with something that's so much
>> better on the scale that we're expecting this to be accurate that it's
>> worth just jumping straight to magic numbers?
>>
>
> I take your point here that having fine grained values might not really
> give us appreciable differences (that is also why I proposed the
> capacity-scale in the first instance), but I'm not sure I'm getting what
> you are proposing here.
>
> Today, and for arm only, we have a static table representing CPUs
> "efficiency":
>
>  /*
>   * Table of relative efficiency of each processors
>   * The efficiency value must fit in 20bit and the final
>   * cpu_scale value must be in the range
>   *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
>   * in order to return at most 1 when DIV_ROUND_CLOSEST
>   * is used to compute the capacity of a CPU.
>   * Processors that are not defined in the table,
>   * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
>   */
>  static const struct cpu_efficiency table_efficiency[] = {
>         {"arm,cortex-a15", 3891},
>         {"arm,cortex-a7",  2048},
>         {NULL, },
>  };
>
> When clock-frequency property is defined in DT, we try to find a match
> for the compatibility string in the table above and then use the
> associate number to compute the capacity. Are you proposing to have
> something like this for arm64 as well?
>
> BTW, the only info I could find about those numbers is from this thread
>
>  http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html
>
> Vincent, do we have more precise information about these numbers
> somewhere else?

These numbers come from a document from ARM in which they compared A15
and A7 . I just used the number provided by this white paper and scale
it in a more appropriate range than DMIPS/Mhz

>
> If I understand how that table was created, how do we think we will
> extend it in the future to allow newer core types (say we replicate this
> solution for arm64)?  It seems that we have to change it, rescaling
> values, each time we have a new core on the market. How can we come up
> with relative numbers, in the future, comparing newer cores to old ones
> (that might be already out of the market by that time)?
>
>> > > Doing that and then switching to some other interface for real tuning
>> > > seems especially odd and I'm not sure that's something that users are
>> > > going to expect or understand.
>>
>> > As I'm saying above, users should not care about this first step of
>> > platform description; not more than how much they care about other bits
>> > in DTs that describe their platform.
>>
>> That may be your intention but I don't see how it is realistic to expect
>> that this is what people will actually understand.  It's a number, it
>> has an effect and it's hard to see that people won't tune it, it's not
>> like people don't have to edit DTs during system integration.  People
>> won't reliably read documentation or look in mailing list threads and
>> other that that it has all the properties of a tuning interface.
>>
>
> Eh, sad but true. I guess we can, as we usually do, put more effort in
> documenting how things are supposed to be used. Then, if people think
> that they can make their system perform better without looking at
> documentation or asking around, I'm not sure there is much we could do
> to prevent them to do things wrong. There are already lot of things
> people shouldn't touch if they don't know what they are doing. :-/
>
>> There's a tension here between what you're saying about people not being
>> supposed to care much about the numbers for tuning and the very fact
>> that there's a need for the DT to carry explicit numbers.
>
> My point is that people with tuning needs shouldn't even look at DTs,
> but put all their efforts in describing (using appropriate APIs) their
> needs and how they apply to the workload they care about. Our job is to
> put together information coming from users and knowledge of system
> configuration to provide people the desired outcomes.
>
> Best,
>
> - Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 2:01 p.m. UTC | #15
On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:
> > > So then why isn't it adequate to just have things like the core types in
> > > there and work from there?  Are we really expecting the tuning to be so
> > > much better than it's possible to come up with something that's so much
> > > better on the scale that we're expecting this to be accurate that it's
> > > worth just jumping straight to magic numbers?
> 
> > I take your point here that having fine grained values might not really
> > give us appreciable differences (that is also why I proposed the
> > capacity-scale in the first instance), but I'm not sure I'm getting what
> > you are proposing here.
> 
> Something like the existing solution for arm32.
> 
> >  static const struct cpu_efficiency table_efficiency[] = {
> >  	{"arm,cortex-a15", 3891},
> >  	{"arm,cortex-a7",  2048},
> >  	{NULL, },
> >  };
> 
> > When clock-frequency property is defined in DT, we try to find a match
> > for the compatibility string in the table above and then use the
> > associate number to compute the capacity. Are you proposing to have
> > something like this for arm64 as well?
> 
> > BTW, the only info I could find about those numbers is from this thread
> 
> It was discussed in some other thread when I was sending the equivalent
> stuff for arm64 (I never got round to finishing it off due to issues
> with Catalin and Will being concerned about the specific numbers).
> Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
> numbers that ARM publish for the cores.  I'd independently done the same
> thing for arm64.  It would probably help to put comments in there with
> the base numbers before scaling, or just redo the table in terms of the
> raw numbers.
> 
> This is, of course, an example of my concerns about magic number
> configuration.
> 
> > If I understand how that table was created, how do we think we will
> > extend it in the future to allow newer core types (say we replicate this
> > solution for arm64)?  It seems that we have to change it, rescaling
> > values, each time we have a new core on the market. How can we come up
> > with relative numbers, in the future, comparing newer cores to old ones
> > (that might be already out of the market by that time)?
> 
> It doesn't seem particularly challenging to add new numbers to the table
> (and add additional properties to select on) TBH.  We can either rescale
> by hand in the table when adding entries, script it as part of the
> kernel build or do it at runtime (as the arm32 code already does to an
> extent based on the particular set of cores we find).  What difficulties
> do you see with this?
> 
> This is something that seems like an advantage to me - we can just
> replace everything at any point, we're not tied to trusting the golden
> benchmark someone did (or tweaked) if we come up with a better
> methodology later on.

I really don't want to see a table of magic numbers in the kernel.

The relative performance and efficiency of cores will vary depending on
uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
to general uArch differences, and integration too (e.g. if the memory
system gives priority to one cluster over another for whatever reason).
I've heard of pseudo-heterogeneous platforms with different
configuration of the same uArch across clusters.

We also don't necessarily have the CPU clock frequencies, or the ability
to scale them. Maybe we simply give up in that case, though.

If we cannot rely on external information, and want this information to
be derived by the kernel, then we need to perform some dynamic
benchmark. That would work for future CPUs the kernel knows nothing
about yet, and would cater for the pseudo-heterogeneous cases too.

Thanks,
Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Dec. 15, 2015, 2:24 p.m. UTC | #16
Hi Mark,

On 15/12/15 14:01, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:
> > > > So then why isn't it adequate to just have things like the core types in
> > > > there and work from there?  Are we really expecting the tuning to be so
> > > > much better than it's possible to come up with something that's so much
> > > > better on the scale that we're expecting this to be accurate that it's
> > > > worth just jumping straight to magic numbers?
> > 
> > > I take your point here that having fine grained values might not really
> > > give us appreciable differences (that is also why I proposed the
> > > capacity-scale in the first instance), but I'm not sure I'm getting what
> > > you are proposing here.
> > 
> > Something like the existing solution for arm32.
> > 
> > >  static const struct cpu_efficiency table_efficiency[] = {
> > >  	{"arm,cortex-a15", 3891},
> > >  	{"arm,cortex-a7",  2048},
> > >  	{NULL, },
> > >  };
> > 
> > > When clock-frequency property is defined in DT, we try to find a match
> > > for the compatibility string in the table above and then use the
> > > associate number to compute the capacity. Are you proposing to have
> > > something like this for arm64 as well?
> > 
> > > BTW, the only info I could find about those numbers is from this thread
> > 
> > It was discussed in some other thread when I was sending the equivalent
> > stuff for arm64 (I never got round to finishing it off due to issues
> > with Catalin and Will being concerned about the specific numbers).
> > Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz
> > numbers that ARM publish for the cores.  I'd independently done the same
> > thing for arm64.  It would probably help to put comments in there with
> > the base numbers before scaling, or just redo the table in terms of the
> > raw numbers.
> > 
> > This is, of course, an example of my concerns about magic number
> > configuration.
> > 
> > > If I understand how that table was created, how do we think we will
> > > extend it in the future to allow newer core types (say we replicate this
> > > solution for arm64)?  It seems that we have to change it, rescaling
> > > values, each time we have a new core on the market. How can we come up
> > > with relative numbers, in the future, comparing newer cores to old ones
> > > (that might be already out of the market by that time)?
> > 
> > It doesn't seem particularly challenging to add new numbers to the table
> > (and add additional properties to select on) TBH.  We can either rescale
> > by hand in the table when adding entries, script it as part of the
> > kernel build or do it at runtime (as the arm32 code already does to an
> > extent based on the particular set of cores we find).  What difficulties
> > do you see with this?
> > 
> > This is something that seems like an advantage to me - we can just
> > replace everything at any point, we're not tied to trusting the golden
> > benchmark someone did (or tweaked) if we come up with a better
> > methodology later on.
> 
> I really don't want to see a table of magic numbers in the kernel.
> 

Doesn't seem to be a clean and scalable solution to me either. It is not
easy to reconfigure when new core types come around, as I don't think
relative data is always present or easy to derive, and it exposes some
sort of centralized global information where everyone is compared
against everyone. Where the DT solution is inherently per platform: no
need to expose absolute values and no problems with knowing data
regarding old core types.

> The relative performance and efficiency of cores will vary depending on
> uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
> to general uArch differences, and integration too (e.g. if the memory
> system gives priority to one cluster over another for whatever reason).
> I've heard of pseudo-heterogeneous platforms with different
> configuration of the same uArch across clusters.
> 
> We also don't necessarily have the CPU clock frequencies, or the ability
> to scale them. Maybe we simply give up in that case, though.
> 
> If we cannot rely on external information, and want this information to
> be derived by the kernel, then we need to perform some dynamic
> benchmark. That would work for future CPUs the kernel knows nothing
> about yet, and would cater for the pseudo-heterogeneous cases too.
> 

I've actually experimented a bit with this approch already, but I wasn't
convinced of its viability. It is true that we remove the burden of
coming up with default values from user/integrator, but I'm pretty sure
we will end up discussing endlessly about which particular benchmark to
pick and the fact that it impacts on boot time and such.

Best,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 2:50 p.m. UTC | #17
On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote:
> Hi Mark,

Hi Juri,

> On 15/12/15 14:01, Mark Rutland wrote:
> > I really don't want to see a table of magic numbers in the kernel.
> 
> Doesn't seem to be a clean and scalable solution to me either. It is not
> easy to reconfigure when new core types come around, as I don't think
> relative data is always present or easy to derive, and it exposes some
> sort of centralized global information where everyone is compared
> against everyone.

I'm also concerned that it will be difficult to curate this to avoid
deceptive marketing numbers. These may not reflect reality.

> Where the DT solution is inherently per platform: no need to expose
> absolute values and no problems with knowing data regarding old core
> types.

The DT approach certainly avoids tying the kernel to a given idea of
particular microarchitectures.

> > If we cannot rely on external information, and want this information to
> > be derived by the kernel, then we need to perform some dynamic
> > benchmark. That would work for future CPUs the kernel knows nothing
> > about yet, and would cater for the pseudo-heterogeneous cases too.
> 
> I've actually experimented a bit with this approch already, but I wasn't
> convinced of its viability. It is true that we remove the burden of
> coming up with default values from user/integrator, but I'm pretty sure
> we will end up discussing endlessly about which particular benchmark to
> pick 

Regardless of which direction we go there will be endless discussion as
to the benchmark. As Mark pointed out, that happened in the case of the
table, and it's happening now for the DT approach.

I think we agree that if this is something we can change later (i.e. we
don't rely on an external oracle like DT) the particular benchmark
matters less, as that can be changed given evidence of superiority.

> and the fact that it impacts on boot time and such.

I was under the impression that the kernel already did RAID algorithm
benchmarking as part of the boot process. Maybe we can find a set of
similarly brief benchmarks.

Thanks,
Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 15, 2015, 3:08 p.m. UTC | #18
On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:

> I really don't want to see a table of magic numbers in the kernel.

Right, there's pitfalls there too although not being part of an ABI
does make them more manageable.  

One thing it's probably helpful to establish here is how much the
specific numbers are going to matter in the grand scheme of things.  If
the specific numbers *are* super important then nobody is going to want
to touch them as they'll be prone to getting tweaked.  If instead the
numbers just need to be ballpark accurate so the scheduler starts off in
roughly the right place and the specific numbers don't matter it's a lot
easier and having a table in the kernel until we think of something
better (if that ever happens) gets a lot easier.

My expectation is that we just need good enough, not perfect, and that
seems to match what Juri is saying about the expectation that most of
the fine tuning is done via other knobs.

> The relative performance and efficiency of cores will vary depending on
> uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition
> to general uArch differences, and integration too (e.g. if the memory
> system gives priority to one cluster over another for whatever reason).
> I've heard of pseudo-heterogeneous platforms with different
> configuration of the same uArch across clusters.

> We also don't necessarily have the CPU clock frequencies, or the ability
> to scale them. Maybe we simply give up in that case, though.

These variables all sound like the sort of thing we can get people to
put in the DT where they matter, and the more we talk about multiple
subtle variables feeding into a single number the more this starts to
sound like tuning (with all the problems that brings, especially when
you're into things like the behaviour with the clusters competing for
resources).  Who knows, at some point the scheduler may even want to
directly act on some of these parameters?

> If we cannot rely on external information, and want this information to
> be derived by the kernel, then we need to perform some dynamic
> benchmark. That would work for future CPUs the kernel knows nothing
> about yet, and would cater for the pseudo-heterogeneous cases too.

Runtime benchmarking (or gradual tuning based on observed performance of
the scheduler) would work as well of course.  I'd expect something like
that to be part of any end point we get to here, the more the system is
able to self tune the better.
Mark Rutland Dec. 15, 2015, 3:32 p.m. UTC | #19
On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
> 
> > I really don't want to see a table of magic numbers in the kernel.
> 
> Right, there's pitfalls there too although not being part of an ABI
> does make them more manageable.  

I think that people are very likely to treat them exactly like an ABI,
w.r.t. any regressions in performance that result from their addition,
modification, or removal. That becomes really horrible when new CPUs
appear.

> One thing it's probably helpful to establish here is how much the
> specific numbers are going to matter in the grand scheme of things.  If
> the specific numbers *are* super important then nobody is going to want
> to touch them as they'll be prone to getting tweaked.  If instead the
> numbers just need to be ballpark accurate so the scheduler starts off in
> roughly the right place and the specific numbers don't matter it's a lot
> easier and having a table in the kernel until we think of something
> better (if that ever happens) gets a lot easier.

I agree that we first need to figure out the importance of these
numbers. I disagree that our first step should be to add a table.

> My expectation is that we just need good enough, not perfect, and that
> seems to match what Juri is saying about the expectation that most of
> the fine tuning is done via other knobs.

My expectation is that if a ballpark figure is good enough, it should be
possible to implement something trivial like bogomips / loop_per_jiffy
calculation.

Thanks,
Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Dec. 15, 2015, 3:36 p.m. UTC | #20
On 15/12/15 14:50, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote:
> > Hi Mark,
> 
> Hi Juri,
> 
> > On 15/12/15 14:01, Mark Rutland wrote:
> > > I really don't want to see a table of magic numbers in the kernel.
> > 
> > Doesn't seem to be a clean and scalable solution to me either. It is not
> > easy to reconfigure when new core types come around, as I don't think
> > relative data is always present or easy to derive, and it exposes some
> > sort of centralized global information where everyone is compared
> > against everyone.
> 
> I'm also concerned that it will be difficult to curate this to avoid
> deceptive marketing numbers. These may not reflect reality.
> 

Right.

> > Where the DT solution is inherently per platform: no need to expose
> > absolute values and no problems with knowing data regarding old core
> > types.
> 
> The DT approach certainly avoids tying the kernel to a given idea of
> particular microarchitectures.
> 
> > > If we cannot rely on external information, and want this information to
> > > be derived by the kernel, then we need to perform some dynamic
> > > benchmark. That would work for future CPUs the kernel knows nothing
> > > about yet, and would cater for the pseudo-heterogeneous cases too.
> > 
> > I've actually experimented a bit with this approch already, but I wasn't
> > convinced of its viability. It is true that we remove the burden of
> > coming up with default values from user/integrator, but I'm pretty sure
> > we will end up discussing endlessly about which particular benchmark to
> > pick 
> 
> Regardless of which direction we go there will be endless discussion as
> to the benchmark. As Mark pointed out, that happened in the case of the
> table, and it's happening now for the DT approach.
> 
> I think we agree that if this is something we can change later (i.e. we
> don't rely on an external oracle like DT) the particular benchmark
> matters less, as that can be changed given evidence of superiority.
> 

True, and in fact we already only point out suggestions for sensitive
benchmarks. I don't think we want to tie ourselves to any particular
one. And I think not having a particular benchmark implementation as
part of the solution will help changing our minds afterwards.

> > and the fact that it impacts on boot time and such.
> 
> I was under the impression that the kernel already did RAID algorithm
> benchmarking as part of the boot process. Maybe we can find a set of
> similarly brief benchmarks.
> 

Yeah, it's certainly doable, but I'm not really yet seeing what we gain
with this additional complexity. Also, AFAIK boot time is one important
metric for the mobile market (while it is less crucial for systems with
RAID configurations?) and we are going to add overhead there. Stability
of default values is probably another factor here, where you could be
rebooting your phone in many different conditions from time to time.

Best,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Juri Lelli Dec. 15, 2015, 3:46 p.m. UTC | #21
On 15/12/15 15:32, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
> > 
> > > I really don't want to see a table of magic numbers in the kernel.
> > 
> > Right, there's pitfalls there too although not being part of an ABI
> > does make them more manageable.  
> 
> I think that people are very likely to treat them exactly like an ABI,
> w.r.t. any regressions in performance that result from their addition,
> modification, or removal. That becomes really horrible when new CPUs
> appear.
> 

Yeah, and I guess the path towards out of three patches changing this
values for a specifica platform (without exposing the same changes
upstream) is not too far away.

> > One thing it's probably helpful to establish here is how much the
> > specific numbers are going to matter in the grand scheme of things.  If
> > the specific numbers *are* super important then nobody is going to want
> > to touch them as they'll be prone to getting tweaked.  If instead the
> > numbers just need to be ballpark accurate so the scheduler starts off in
> > roughly the right place and the specific numbers don't matter it's a lot
> > easier and having a table in the kernel until we think of something
> > better (if that ever happens) gets a lot easier.
> 
> I agree that we first need to figure out the importance of these
> numbers. I disagree that our first step should be to add a table.
> 

My take is that ballpark is fine, but it's a per platform/configuration
ballpark that we need. Not a per core-type one.

> > My expectation is that we just need good enough, not perfect, and that
> > seems to match what Juri is saying about the expectation that most of
> > the fine tuning is done via other knobs.
> 
> My expectation is that if a ballpark figure is good enough, it should be
> possible to implement something trivial like bogomips / loop_per_jiffy
> calculation.
> 

I didn't really followed that, so I might be wrong here, but isn't
already happened a discussion about how we want/like to stop exposing
bogomips info or rely on it for anything but in kernel delay loops?

Thanks,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 3:57 p.m. UTC | #22
On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> On 15/12/15 15:32, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > My expectation is that we just need good enough, not perfect, and that
> > > seems to match what Juri is saying about the expectation that most of
> > > the fine tuning is done via other knobs.
> > 
> > My expectation is that if a ballpark figure is good enough, it should be
> > possible to implement something trivial like bogomips / loop_per_jiffy
> > calculation.
> > 
> 
> I didn't really followed that, so I might be wrong here, but isn't
> already happened a discussion about how we want/like to stop exposing
> bogomips info or rely on it for anything but in kernel delay loops?

I meant that we could have a benchmark of that level of complexity,
rather than those specific values.

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Catalin Marinas Dec. 15, 2015, 4:23 p.m. UTC | #23
On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> > On 15/12/15 15:32, Mark Rutland wrote:
> > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > > My expectation is that we just need good enough, not perfect, and that
> > > > seems to match what Juri is saying about the expectation that most of
> > > > the fine tuning is done via other knobs.
> > > 
> > > My expectation is that if a ballpark figure is good enough, it should be
> > > possible to implement something trivial like bogomips / loop_per_jiffy
> > > calculation.
> > 
> > I didn't really followed that, so I might be wrong here, but isn't
> > already happened a discussion about how we want/like to stop exposing
> > bogomips info or rely on it for anything but in kernel delay loops?
> 
> I meant that we could have a benchmark of that level of complexity,
> rather than those specific values.

Or we could simply let user space use whatever benchmarks or hard-coded
values it wants and set the capacity via sysfs (during boot). By
default, the kernel would assume all CPUs equal.
Mark Rutland Dec. 15, 2015, 4:41 p.m. UTC | #24
On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> > > On 15/12/15 15:32, Mark Rutland wrote:
> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > > > My expectation is that we just need good enough, not perfect, and that
> > > > > seems to match what Juri is saying about the expectation that most of
> > > > > the fine tuning is done via other knobs.
> > > > 
> > > > My expectation is that if a ballpark figure is good enough, it should be
> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> > > > calculation.
> > > 
> > > I didn't really followed that, so I might be wrong here, but isn't
> > > already happened a discussion about how we want/like to stop exposing
> > > bogomips info or rely on it for anything but in kernel delay loops?
> > 
> > I meant that we could have a benchmark of that level of complexity,
> > rather than those specific values.
> 
> Or we could simply let user space use whatever benchmarks or hard-coded
> values it wants and set the capacity via sysfs (during boot). By
> default, the kernel would assume all CPUs equal.

I assume that a userspace override would be available regardless of
whatever mechanism the kernel uses to determine relative
performance/effinciency.

I am not opposed to that mechanism being "assume equal".

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vincent Guittot Dec. 15, 2015, 4:59 p.m. UTC | #25
On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
>> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
>> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
>> > > On 15/12/15 15:32, Mark Rutland wrote:
>> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
>> > > > > My expectation is that we just need good enough, not perfect, and that
>> > > > > seems to match what Juri is saying about the expectation that most of
>> > > > > the fine tuning is done via other knobs.
>> > > >
>> > > > My expectation is that if a ballpark figure is good enough, it should be
>> > > > possible to implement something trivial like bogomips / loop_per_jiffy
>> > > > calculation.
>> > >
>> > > I didn't really followed that, so I might be wrong here, but isn't
>> > > already happened a discussion about how we want/like to stop exposing
>> > > bogomips info or rely on it for anything but in kernel delay loops?
>> >
>> > I meant that we could have a benchmark of that level of complexity,
>> > rather than those specific values.
>>
>> Or we could simply let user space use whatever benchmarks or hard-coded
>> values it wants and set the capacity via sysfs (during boot). By
>> default, the kernel would assume all CPUs equal.
>
> I assume that a userspace override would be available regardless of
> whatever mechanism the kernel uses to determine relative
> performance/effinciency.

Don't you think that if we let a complete latitude to the userspace
to set whatever it wants, it will be used to abuse the kernel (and the
scheduler in particular ) and that this will finish in a real mess to
understand what is wrong when a task is not placed where it should be.
We can probably provide a debug mode to help soc manufacturer to
define their capacity value but IMHO we should not let complete
latitude in normal operation

In normal operation we need to give some methods to tweak the value to
reflect a memory bounded or integer calculation work  or other kind of
work that currently runs on the cpu but not more

Vincent
>
> I am not opposed to that mechanism being "assume equal".
>
> Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 5:15 p.m. UTC | #26
On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
> On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> >> > > On 15/12/15 15:32, Mark Rutland wrote:
> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> >> > > > > My expectation is that we just need good enough, not perfect, and that
> >> > > > > seems to match what Juri is saying about the expectation that most of
> >> > > > > the fine tuning is done via other knobs.
> >> > > >
> >> > > > My expectation is that if a ballpark figure is good enough, it should be
> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> >> > > > calculation.
> >> > >
> >> > > I didn't really followed that, so I might be wrong here, but isn't
> >> > > already happened a discussion about how we want/like to stop exposing
> >> > > bogomips info or rely on it for anything but in kernel delay loops?
> >> >
> >> > I meant that we could have a benchmark of that level of complexity,
> >> > rather than those specific values.
> >>
> >> Or we could simply let user space use whatever benchmarks or hard-coded
> >> values it wants and set the capacity via sysfs (during boot). By
> >> default, the kernel would assume all CPUs equal.
> >
> > I assume that a userspace override would be available regardless of
> > whatever mechanism the kernel uses to determine relative
> > performance/effinciency.
> 
> Don't you think that if we let a complete latitude to the userspace
> to set whatever it wants, it will be used to abuse the kernel (and the
> scheduler in particular ) and that this will finish in a real mess to
> understand what is wrong when a task is not placed where it should be.

I'm not sure I follow what you mean by "abuse" here. Userspace currently
can force the scheduler to make sub-optimal decisions in a number of
ways, e.g.

* Hot-unplugging the preferred CPUs
* Changing a task's affinity mask
* Setting the nice value of a task
* Using rlimits and/or cgroups
* Using a cpufreq governor
* Fork-bombing

Practically all of these are prvileged operations. I would envisage the
userspace interface for "capacity" management to be similar.

> We can probably provide a debug mode to help soc manufacturer to
> define their capacity value but IMHO we should not let complete
> latitude in normal operation

In normal operation userspace wouldn't mess with this, as with most of
the cases above. Userspace can already shooti tself in the foot.

> In normal operation we need to give some methods to tweak the value to
> reflect a memory bounded or integer calculation work  or other kind of
> work that currently runs on the cpu but not more

You can already do that with the mechanisms above, to some extent. I'm
not sure I follow.

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 15, 2015, 5:17 p.m. UTC | #27
On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:

> > > I really don't want to see a table of magic numbers in the kernel.

> > Right, there's pitfalls there too although not being part of an ABI
> > does make them more manageable.  

> I think that people are very likely to treat them exactly like an ABI,
> w.r.t. any regressions in performance that result from their addition,
> modification, or removal. That becomes really horrible when new CPUs
> appear.

Obviously people are going to get upset if we introduce performance
regressions - but that's true always, we can also introduce problems
with numbers people have put in DT.  It seems like it'd be harder to
manage regressions due to externally provided magic numbers since
there's inherently less information there.

> > One thing it's probably helpful to establish here is how much the
> > specific numbers are going to matter in the grand scheme of things.  If
> > the specific numbers *are* super important then nobody is going to want
> > to touch them as they'll be prone to getting tweaked.  If instead the
> > numbers just need to be ballpark accurate so the scheduler starts off in
> > roughly the right place and the specific numbers don't matter it's a lot
> > easier and having a table in the kernel until we think of something
> > better (if that ever happens) gets a lot easier.

> I agree that we first need to figure out the importance of these
> numbers. I disagree that our first step should be to add a table.

My point there is that if we're not that concerned about the specific
number something in kernel is safer.

> > My expectation is that we just need good enough, not perfect, and that
> > seems to match what Juri is saying about the expectation that most of
> > the fine tuning is done via other knobs.

> My expectation is that if a ballpark figure is good enough, it should be
> possible to implement something trivial like bogomips / loop_per_jiffy
> calculation.

That does have the issue that we need to scale with regard to the
frequency the benchmark gets run at.  That's not an insurmountable
obstacle but it's not completely trivial either.
Mark Rutland Dec. 15, 2015, 5:28 p.m. UTC | #28
On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> > > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
> 
> > > > I really don't want to see a table of magic numbers in the kernel.
> 
> > > Right, there's pitfalls there too although not being part of an ABI
> > > does make them more manageable.  
> 
> > I think that people are very likely to treat them exactly like an ABI,
> > w.r.t. any regressions in performance that result from their addition,
> > modification, or removal. That becomes really horrible when new CPUs
> > appear.
> 
> Obviously people are going to get upset if we introduce performance
> regressions - but that's true always, we can also introduce problems
> with numbers people have put in DT.  It seems like it'd be harder to
> manage regressions due to externally provided magic numbers since
> there's inherently less information there.

It's certainly still possible to have regressions in that case. Those
regressions would be due to code changes in the kernel, given the DT
didn't change.

I'm not sure I follow w.r.t. "inherently less information", unless you
mean trying to debug without access to that DTB?

> > > One thing it's probably helpful to establish here is how much the
> > > specific numbers are going to matter in the grand scheme of things.  If
> > > the specific numbers *are* super important then nobody is going to want
> > > to touch them as they'll be prone to getting tweaked.  If instead the
> > > numbers just need to be ballpark accurate so the scheduler starts off in
> > > roughly the right place and the specific numbers don't matter it's a lot
> > > easier and having a table in the kernel until we think of something
> > > better (if that ever happens) gets a lot easier.
> 
> > I agree that we first need to figure out the importance of these
> > numbers. I disagree that our first step should be to add a table.
> 
> My point there is that if we're not that concerned about the specific
> number something in kernel is safer.

I don't entirely disagree there. I think an in-kernel benchmark is
likely safer.

> > > My expectation is that we just need good enough, not perfect, and that
> > > seems to match what Juri is saying about the expectation that most of
> > > the fine tuning is done via other knobs.
> 
> > My expectation is that if a ballpark figure is good enough, it should be
> > possible to implement something trivial like bogomips / loop_per_jiffy
> > calculation.
> 
> That does have the issue that we need to scale with regard to the
> frequency the benchmark gets run at.  That's not an insurmountable
> obstacle but it's not completely trivial either.

If we change clock frequency, then regardless of where the information
comes from we need to perform scaling, no?

One nice thing about doing a benchmark to derive the numbers is that
when the kernel is that when the frequency is fixed but the kernel
cannot query it, the numbers will be representative.

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 15, 2015, 5:45 p.m. UTC | #29
On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:

> > Obviously people are going to get upset if we introduce performance
> > regressions - but that's true always, we can also introduce problems
> > with numbers people have put in DT.  It seems like it'd be harder to
> > manage regressions due to externally provided magic numbers since
> > there's inherently less information there.

> It's certainly still possible to have regressions in that case. Those
> regressions would be due to code changes in the kernel, given the DT
> didn't change.

> I'm not sure I follow w.r.t. "inherently less information", unless you
> mean trying to debug without access to that DTB?

If what the kernel knows about the system is that it's got a bunch of
cores with numbers assigned to them then all it's really got is those
numbers.  If something changes that causes problems for some systems
(eg, because the numbers have been picked poorly but in a way that
happened to work well with the old code) that's not a lot to go on, the
more we know about the system the more likely it is that we'll be able
to adjust the assumptions in whatever new thing we do that causes
problems for any particular systems where we run into trouble.

> > My point there is that if we're not that concerned about the specific
> > number something in kernel is safer.

> I don't entirely disagree there. I think an in-kernel benchmark is
> likely safer.

Yes, I think that something where we just observe the system performance
at runtime is likely one of the best solutions if we can get something
that gives reasonable results.

> > That does have the issue that we need to scale with regard to the
> > frequency the benchmark gets run at.  That's not an insurmountable
> > obstacle but it's not completely trivial either.

> If we change clock frequency, then regardless of where the information
> comes from we need to perform scaling, no?

Yes, it's just a question of making the benchmarking bit talk to the
scaling bit so we know where we're at when we do the benchmark.  Like I
say it should be doable.

> One nice thing about doing a benchmark to derive the numbers is that
> when the kernel is that when the frequency is fixed but the kernel
> cannot query it, the numbers will be representative.

Definitely.
Vincent Guittot Dec. 15, 2015, 5:47 p.m. UTC | #30
On 15 December 2015 at 18:15, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
>> On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote:
>> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
>> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
>> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
>> >> > > On 15/12/15 15:32, Mark Rutland wrote:
>> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
>> >> > > > > My expectation is that we just need good enough, not perfect, and that
>> >> > > > > seems to match what Juri is saying about the expectation that most of
>> >> > > > > the fine tuning is done via other knobs.
>> >> > > >
>> >> > > > My expectation is that if a ballpark figure is good enough, it should be
>> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
>> >> > > > calculation.
>> >> > >
>> >> > > I didn't really followed that, so I might be wrong here, but isn't
>> >> > > already happened a discussion about how we want/like to stop exposing
>> >> > > bogomips info or rely on it for anything but in kernel delay loops?
>> >> >
>> >> > I meant that we could have a benchmark of that level of complexity,
>> >> > rather than those specific values.
>> >>
>> >> Or we could simply let user space use whatever benchmarks or hard-coded
>> >> values it wants and set the capacity via sysfs (during boot). By
>> >> default, the kernel would assume all CPUs equal.
>> >
>> > I assume that a userspace override would be available regardless of
>> > whatever mechanism the kernel uses to determine relative
>> > performance/effinciency.
>>
>> Don't you think that if we let a complete latitude to the userspace
>> to set whatever it wants, it will be used to abuse the kernel (and the
>> scheduler in particular ) and that this will finish in a real mess to
>> understand what is wrong when a task is not placed where it should be.
>
> I'm not sure I follow what you mean by "abuse" here. Userspace currently
> can force the scheduler to make sub-optimal decisions in a number of
> ways, e.g.
>
> * Hot-unplugging the preferred CPUs
> * Changing a task's affinity mask
> * Setting the nice value of a task
> * Using rlimits and/or cgroups
> * Using a cpufreq governor
> * Fork-bombing

All these are parameters have a meaning (except the last one). By
abusing i mean setting the capacity of the most powerful cpu to 1 for
no good reason except to abuse the scheduler so the latter will not
put that much tasks on it just  because the current running use case
is more efficient if the big core is not used.


>
> Practically all of these are prvileged operations. I would envisage the
> userspace interface for "capacity" management to be similar.
>
>> We can probably provide a debug mode to help soc manufacturer to
>> define their capacity value but IMHO we should not let complete
>> latitude in normal operation
>
> In normal operation userspace wouldn't mess with this, as with most of
> the cases above. Userspace can already shooti tself in the foot.
>
>> In normal operation we need to give some methods to tweak the value to
>> reflect a memory bounded or integer calculation work  or other kind of
>> work that currently runs on the cpu but not more
>
> You can already do that with the mechanisms above, to some extent. I'm
> not sure I follow.
>
> Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 6:10 p.m. UTC | #31
On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
> 
> > > Obviously people are going to get upset if we introduce performance
> > > regressions - but that's true always, we can also introduce problems
> > > with numbers people have put in DT.  It seems like it'd be harder to
> > > manage regressions due to externally provided magic numbers since
> > > there's inherently less information there.
> 
> > It's certainly still possible to have regressions in that case. Those
> > regressions would be due to code changes in the kernel, given the DT
> > didn't change.
> 
> > I'm not sure I follow w.r.t. "inherently less information", unless you
> > mean trying to debug without access to that DTB?
> 
> If what the kernel knows about the system is that it's got a bunch of
> cores with numbers assigned to them then all it's really got is those
> numbers.  If something changes that causes problems for some systems
> (eg, because the numbers have been picked poorly but in a way that
> happened to work well with the old code) that's not a lot to go on, the
> more we know about the system the more likely it is that we'll be able
> to adjust the assumptions in whatever new thing we do that causes
> problems for any particular systems where we run into trouble.

Regardless of where the numbers live (DT or kernel), all we have are
numbers. I can see that changing the in-kernel numbers would be possible
when modifyign the DT is not, but I don't see how that gives you more
information.

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Rutland Dec. 15, 2015, 6:39 p.m. UTC | #32
On Tue, Dec 15, 2015 at 06:47:20PM +0100, Vincent Guittot wrote:
> On 15 December 2015 at 18:15, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote:
> >> On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote:
> >> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote:
> >> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote:
> >> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote:
> >> >> > > On 15/12/15 15:32, Mark Rutland wrote:
> >> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> >> >> > > > > My expectation is that we just need good enough, not perfect, and that
> >> >> > > > > seems to match what Juri is saying about the expectation that most of
> >> >> > > > > the fine tuning is done via other knobs.
> >> >> > > >
> >> >> > > > My expectation is that if a ballpark figure is good enough, it should be
> >> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy
> >> >> > > > calculation.
> >> >> > >
> >> >> > > I didn't really followed that, so I might be wrong here, but isn't
> >> >> > > already happened a discussion about how we want/like to stop exposing
> >> >> > > bogomips info or rely on it for anything but in kernel delay loops?
> >> >> >
> >> >> > I meant that we could have a benchmark of that level of complexity,
> >> >> > rather than those specific values.
> >> >>
> >> >> Or we could simply let user space use whatever benchmarks or hard-coded
> >> >> values it wants and set the capacity via sysfs (during boot). By
> >> >> default, the kernel would assume all CPUs equal.
> >> >
> >> > I assume that a userspace override would be available regardless of
> >> > whatever mechanism the kernel uses to determine relative
> >> > performance/effinciency.
> >>
> >> Don't you think that if we let a complete latitude to the userspace
> >> to set whatever it wants, it will be used to abuse the kernel (and the
> >> scheduler in particular ) and that this will finish in a real mess to
> >> understand what is wrong when a task is not placed where it should be.
> >
> > I'm not sure I follow what you mean by "abuse" here. Userspace currently
> > can force the scheduler to make sub-optimal decisions in a number of
> > ways, e.g.
> >
> > * Hot-unplugging the preferred CPUs
> > * Changing a task's affinity mask
> > * Setting the nice value of a task
> > * Using rlimits and/or cgroups
> > * Using a cpufreq governor
> > * Fork-bombing
> 
> All these are parameters have a meaning (except the last one). By
> abusing i mean setting the capacity of the most powerful cpu to 1 for
> no good reason except to abuse the scheduler so the latter will not
> put that much tasks on it just  because the current running use case
> is more efficient if the big core is not used.

Surely it's better to allow them to "abuse" the kernel in that manner
than to place otherwise insane values into a DT? Especially if they can
later change to a sane value?

For that particular case it's easy to hotplug out the big core, or to
set the affinity of tasks to avoid it.

Mark.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Brown Dec. 15, 2015, 6:45 p.m. UTC | #33
On Tue, Dec 15, 2015 at 06:10:03PM +0000, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote:

> > > I'm not sure I follow w.r.t. "inherently less information", unless you
> > > mean trying to debug without access to that DTB?

> > If what the kernel knows about the system is that it's got a bunch of
> > cores with numbers assigned to them then all it's really got is those
> > numbers.  If something changes that causes problems for some systems
> > (eg, because the numbers have been picked poorly but in a way that
> > happened to work well with the old code) that's not a lot to go on, the
> > more we know about the system the more likely it is that we'll be able
> > to adjust the assumptions in whatever new thing we do that causes
> > problems for any particular systems where we run into trouble.

> Regardless of where the numbers live (DT or kernel), all we have are
> numbers. I can see that changing the in-kernel numbers would be possible
> when modifyign the DT is not, but I don't see how that gives you more
> information.

It's mainly the modifying the DT case - you're not dealing with some
external misguided number selection method you'd never thought of and
you're not forcing some third party to redo benchmarks or adjust DTs
they may not want to adjust.  You're also able to readjust the numbers
based on feedback if you need to rather than having to adapt algorithms
to handle particular number selections, the algorithm and number
selection are done together rather than separately.
Juri Lelli Dec. 17, 2015, 9:07 a.m. UTC | #34
Hi,

On 15/12/15 17:45, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote:
> > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote:
> 
> > > Obviously people are going to get upset if we introduce performance
> > > regressions - but that's true always, we can also introduce problems
> > > with numbers people have put in DT.  It seems like it'd be harder to
> > > manage regressions due to externally provided magic numbers since
> > > there's inherently less information there.
> 
> > It's certainly still possible to have regressions in that case. Those
> > regressions would be due to code changes in the kernel, given the DT
> > didn't change.
> 
> > I'm not sure I follow w.r.t. "inherently less information", unless you
> > mean trying to debug without access to that DTB?
> 
> If what the kernel knows about the system is that it's got a bunch of
> cores with numbers assigned to them then all it's really got is those
> numbers.  If something changes that causes problems for some systems
> (eg, because the numbers have been picked poorly but in a way that
> happened to work well with the old code) that's not a lot to go on, the
> more we know about the system the more likely it is that we'll be able
> to adjust the assumptions in whatever new thing we do that causes
> problems for any particular systems where we run into trouble.
> 
> > > My point there is that if we're not that concerned about the specific
> > > number something in kernel is safer.
> 
> > I don't entirely disagree there. I think an in-kernel benchmark is
> > likely safer.
> 
> Yes, I think that something where we just observe the system performance
> at runtime is likely one of the best solutions if we can get something
> that gives reasonable results.
> 
> > > That does have the issue that we need to scale with regard to the
> > > frequency the benchmark gets run at.  That's not an insurmountable
> > > obstacle but it's not completely trivial either.
> 
> > If we change clock frequency, then regardless of where the information
> > comes from we need to perform scaling, no?
> 
> Yes, it's just a question of making the benchmarking bit talk to the
> scaling bit so we know where we're at when we do the benchmark.  Like I
> say it should be doable.
> 
> > One nice thing about doing a benchmark to derive the numbers is that
> > when the kernel is that when the frequency is fixed but the kernel
> > cannot query it, the numbers will be representative.
> 
> Definitely.

OK, let's see how a dynamic approach could look like. As said, since it
was actually our first thought too, I already have a possible
implementation of such a thing. I'll be OOO until early Jan, but I'll
try to rebase what I have and post it here as soon as I'm back; and then
we see which solution looks better.

Thanks a lot for the feedback!

Best,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
new file mode 100644
index 0000000..2a00af0
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
@@ -0,0 +1,227 @@ 
+==========================================
+ARM CPUs capacity bindings
+==========================================
+
+==========================================
+1 - Introduction
+==========================================
+
+ARM systems may be configured to have cpus with different power/performance
+characteristics within the same chip. In this case, additional information
+has to be made available to the kernel (the scheduler in particular) for
+it to be aware of such differences and take decisions accordingly.
+
+==========================================
+2 - CPU capacity definition
+==========================================
+
+CPU capacity is a number that provides the scheduler information about CPUs
+heterogeneity. Such heterogeneity can come from micro-architectural differences
+(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run
+(e.g., SMP systems with multiple frequency domains). Heterogeneity in this
+context is about differing performance characteristics; this binding tries to
+capture a first-order approximation of the relative performance of CPUs.
+
+One simple way to estimate CPU capacities is to iteratively run a well-known
+CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at
+maximum frequency and then normalize values w.r.t.  the best performing CPU.
+One can also do a statistically significant study of a wide collection of
+benchmarks, but pros of such an approach are not really evident at the time of
+writing.
+
+==========================================
+3 - capacity-scale
+==========================================
+
+CPUs capacities are defined with respect to capacity-scale property in the cpus
+node [1]. The property is optional; if not defined a 1024 capacity-scale is
+assumed. This property defines both the highest CPU capacity present in the
+system and granularity of CPU capacity values.
+
+==========================================
+4 - capacity
+==========================================
+
+capacity is an optional cpu node [1] property: u32 value representing CPU
+capacity, relative to capacity-scale. It is required and enforced that capacity
+<= capacity-scale.
+
+===========================================
+5 - Examples
+===========================================
+
+Example 1 (ARM 64-bit, 6-cpu system, two clusters):
+capacity-scale is not defined, so it is assumed to be 1024
+
+cpus {
+	#address-cells = <2>;
+	#size-cells = <0>;
+
+	cpu-map {
+		cluster0 {
+			core0 {
+				cpu = <&A57_0>;
+			};
+			core1 {
+				cpu = <&A57_1>;
+			};
+		};
+
+		cluster1 {
+			core0 {
+				cpu = <&A53_0>;
+			};
+			core1 {
+				cpu = <&A53_1>;
+			};
+			core2 {
+				cpu = <&A53_2>;
+			};
+			core3 {
+				cpu = <&A53_3>;
+			};
+		};
+	};
+
+	idle-states {
+		entry-method = "arm,psci";
+
+		CPU_SLEEP_0: cpu-sleep-0 {
+			compatible = "arm,idle-state";
+			arm,psci-suspend-param = <0x0010000>;
+			local-timer-stop;
+			entry-latency-us = <100>;
+			exit-latency-us = <250>;
+			min-residency-us = <150>;
+		};
+
+		CLUSTER_SLEEP_0: cluster-sleep-0 {
+			compatible = "arm,idle-state";
+			arm,psci-suspend-param = <0x1010000>;
+			local-timer-stop;
+			entry-latency-us = <800>;
+			exit-latency-us = <700>;
+			min-residency-us = <2500>;
+		};
+	};
+
+	A57_0: cpu@0 {
+		compatible = "arm,cortex-a57","arm,armv8";
+		reg = <0x0 0x0>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A57_L2>;
+		clocks = <&scpi_dvfs 0>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <1024>;
+	};
+
+	A57_1: cpu@1 {
+		compatible = "arm,cortex-a57","arm,armv8";
+		reg = <0x0 0x1>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A57_L2>;
+		clocks = <&scpi_dvfs 0>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <1024>;
+	};
+
+	A53_0: cpu@100 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x100>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_1: cpu@101 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x101>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_2: cpu@102 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x102>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_3: cpu@103 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x103>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A57_L2: l2-cache0 {
+		compatible = "cache";
+	};
+
+	A53_L2: l2-cache1 {
+		compatible = "cache";
+	};
+};
+
+Example 2 (ARM 32-bit, 4-cpu system, two clusters,
+	   cpus 0,1@1GHz, cpus 2,3@500MHz):
+capacity-scale is equal to 2, so first cluster is twice faster than second
+cluster (which matches with clock frequencies)
+
+cpus {
+	#address-cells = <1>;
+	#size-cells = <0>;
+	capacity-scale = <2>;
+
+	cpu0: cpu@0 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0>;
+		capacity = <2>;
+	};
+
+	cpu1: cpu@1 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <1>;
+		capacity = <2>;
+	};
+
+	cpu2: cpu@2 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0x100>;
+		capacity = <1>;
+	};
+
+	cpu3: cpu@3 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0x101>;
+		capacity = <1>;
+	};
+};
+
+===========================================
+6 - References
+===========================================
+
+[1] ARM Linux Kernel documentation - CPUs bindings
+    Documentation/devicetree/bindings/arm/cpus.txt
diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt
index 91e6e5c..7593584 100644
--- a/Documentation/devicetree/bindings/arm/cpus.txt
+++ b/Documentation/devicetree/bindings/arm/cpus.txt
@@ -62,6 +62,14 @@  nodes to be present and contain the properties described below.
 		Value type: <u32>
 		Definition: must be set to 0
 
+	A cpus node may also define the following optional property:
+
+	- capacity-scale
+		Usage: optional
+		Value type: <u32>
+		Definition: value used as a reference for CPU capacity [3]
+			    (see below).
+
 - cpu node
 
 	Description: Describes a CPU in an ARM based system
@@ -231,6 +239,13 @@  nodes to be present and contain the properties described below.
 			# List of phandles to idle state nodes supported
 			  by this cpu [3].
 
+	- capacity
+		Usage: Optional
+		Value type: <u32>
+		Definition:
+			# u32 value representing CPU capacity [3], relative to
+			  capacity-scale (see above).
+
 	- rockchip,pmu
 		Usage: optional for systems that have an "enable-method"
 		       property value of "rockchip,rk3066-smp"
@@ -437,3 +452,5 @@  cpus {
 [2] arm/msm/qcom,kpss-acc.txt
 [3] ARM Linux kernel documentation - idle states bindings
     Documentation/devicetree/bindings/arm/idle-states.txt
+[3] ARM Linux kernel documentation - cpu capacity bindings
+    Documentation/devicetree/bindings/arm/cpu-capacity.txt