Message ID: 1448288921-30307-3-git-send-email-juri.lelli@arm.com
State: Changes Requested, archived
On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote: > ARM systems may be configured to have cpus with different power/performance > characteristics within the same chip. In this case, additional information > has to be made available to the kernel (the scheduler in particular) for it > to be aware of such differences and take decisions accordingly. > > Therefore, this patch aims at standardizing cpu capacities device tree > bindings for ARM platforms. Bindings define cpu capacity parameter, to > allow operating systems to retrieve such information from the device tree > and initialize related kernel structures, paving the way for common code in > the kernel to deal with heterogeneity. > > Cc: Rob Herring <robh+dt@kernel.org> > Cc: Pawel Moll <pawel.moll@arm.com> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Ian Campbell <ijc+devicetree@hellion.org.uk> > Cc: Kumar Gala <galak@codeaurora.org> > Cc: Maxime Ripard <maxime.ripard@free-electrons.com> > Cc: Olof Johansson <olof@lixom.net> > Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> > Cc: Paul Walmsley <paul@pwsan.com> > Cc: Linus Walleij <linus.walleij@linaro.org> > Cc: Chen-Yu Tsai <wens@csie.org> > Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> > Cc: devicetree@vger.kernel.org > Signed-off-by: Juri Lelli <juri.lelli@arm.com> > --- > .../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++ > Documentation/devicetree/bindings/arm/cpus.txt | 17 ++ > 2 files changed, 244 insertions(+) > create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt > > diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt > new file mode 100644 > index 0000000..2a00af0 > --- /dev/null > +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt > @@ -0,0 +1,227 @@ > +========================================== > +ARM CPUs capacity bindings > +========================================== > + > 
+========================================== > +1 - Introduction > +========================================== > + > +ARM systems may be configured to have cpus with different power/performance > +characteristics within the same chip. In this case, additional information > +has to be made available to the kernel (the scheduler in particular) for > +it to be aware of such differences and take decisions accordingly. > + > +========================================== > +2 - CPU capacity definition > +========================================== > + > +CPU capacity is a number that provides the scheduler information about CPUs > +heterogeneity. Such heterogeneity can come from micro-architectural differences > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this > +context is about differing performance characteristics; this binding tries to > +capture a first-order approximation of the relative performance of CPUs. > + > +One simple way to estimate CPU capacities is to iteratively run a well-known > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at > +maximum frequency and then normalize values w.r.t. the best performing CPU. > +One can also do a statistically significant study of a wide collection of > +benchmarks, but pros of such an approach are not really evident at the time of > +writing. > + > +========================================== > +3 - capacity-scale > +========================================== > + > +CPUs capacities are defined with respect to capacity-scale property in the cpus > +node [1]. The property is optional; if not defined a 1024 capacity-scale is > +assumed. This property defines both the highest CPU capacity present in the > +system and granularity of CPU capacity values. I don't really see the point of this vs. having an absolute scale. 
> + > +========================================== > +4 - capacity > +========================================== > + > +capacity is an optional cpu node [1] property: u32 value representing CPU > +capacity, relative to capacity-scale. It is required and enforced that capacity > +<= capacity-scale. I think you need something absolute and probably per MHz (like dynamic-power-coefficient property). Perhaps the IPC (instructions per clock) value? In other words, I want to see these numbers have a defined method of determining them and don't want to see random values from every vendor. ARM, Ltd. says core X has a value of Y would be good enough for me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a value of 1024 is not what I want to see. Of course things like cache sizes can vary the performance, but is a baseline value good enough? However, no vendor will want to publish their values if these are absolute values relative to other vendors. If you expect these to need frequent tuning, then don't put them in DT. Rob -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On 23/11/15 20:06, Rob Herring wrote: > On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote: > > ARM systems may be configured to have cpus with different power/performance > > characteristics within the same chip. In this case, additional information > > has to be made available to the kernel (the scheduler in particular) for it > > to be aware of such differences and take decisions accordingly. > > > > Therefore, this patch aims at standardizing cpu capacities device tree > > bindings for ARM platforms. Bindings define cpu capacity parameter, to > > allow operating systems to retrieve such information from the device tree > > and initialize related kernel structures, paving the way for common code in > > the kernel to deal with heterogeneity. > > > > Cc: Rob Herring <robh+dt@kernel.org> > > Cc: Pawel Moll <pawel.moll@arm.com> > > Cc: Mark Rutland <mark.rutland@arm.com> > > Cc: Ian Campbell <ijc+devicetree@hellion.org.uk> > > Cc: Kumar Gala <galak@codeaurora.org> > > Cc: Maxime Ripard <maxime.ripard@free-electrons.com> > > Cc: Olof Johansson <olof@lixom.net> > > Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> > > Cc: Paul Walmsley <paul@pwsan.com> > > Cc: Linus Walleij <linus.walleij@linaro.org> > > Cc: Chen-Yu Tsai <wens@csie.org> > > Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> > > Cc: devicetree@vger.kernel.org > > Signed-off-by: Juri Lelli <juri.lelli@arm.com> > > --- > > .../devicetree/bindings/arm/cpu-capacity.txt | 227 +++++++++++++++++++++ > > Documentation/devicetree/bindings/arm/cpus.txt | 17 ++ > > 2 files changed, 244 insertions(+) > > create mode 100644 Documentation/devicetree/bindings/arm/cpu-capacity.txt > > > > diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt > > new file mode 100644 > > index 0000000..2a00af0 > > --- /dev/null > > +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt > > @@ -0,0 +1,227 @@ > > 
+========================================== > > +ARM CPUs capacity bindings > > +========================================== > > + > > +========================================== > > +1 - Introduction > > +========================================== > > + > > +ARM systems may be configured to have cpus with different power/performance > > +characteristics within the same chip. In this case, additional information > > +has to be made available to the kernel (the scheduler in particular) for > > +it to be aware of such differences and take decisions accordingly. > > + > > +========================================== > > +2 - CPU capacity definition > > +========================================== > > + > > +CPU capacity is a number that provides the scheduler information about CPUs > > +heterogeneity. Such heterogeneity can come from micro-architectural differences > > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run > > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this > > +context is about differing performance characteristics; this binding tries to > > +capture a first-order approximation of the relative performance of CPUs. > > + > > +One simple way to estimate CPU capacities is to iteratively run a well-known > > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at > > +maximum frequency and then normalize values w.r.t. the best performing CPU. > > +One can also do a statistically significant study of a wide collection of > > +benchmarks, but pros of such an approach are not really evident at the time of > > +writing. > > + > > +========================================== > > +3 - capacity-scale > > +========================================== > > + > > +CPUs capacities are defined with respect to capacity-scale property in the cpus > > +node [1]. The property is optional; if not defined a 1024 capacity-scale is > > +assumed. 
This property defines both the highest CPU capacity present in the > > +system and granularity of CPU capacity values. > > I don't really see the point of this vs. having an absolute scale. > IMHO, we need this for several reasons, one being to address one of your concerns below: vendors are free to choose their scale without being forced to publish absolute data. Another reason is that it might make life easier in certain cases; for example, someone could implement a system with a few clusters of, say, A57s, but some run at half the clock of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in this case I think it is just easier to define capacity-scale as 1200 and capacities as 1200 and 600. The last reason I can think of right now is that we probably don't want to bind ourselves to some particular range from the beginning, as that range might be enough now, but it could change in the future (as in, right now [1-1024] looks fine for scheduling purposes, but that might change). > > + > > +========================================== > > +4 - capacity > > +========================================== > > + > > +capacity is an optional cpu node [1] property: u32 value representing CPU > > +capacity, relative to capacity-scale. It is required and enforced that capacity > > +<= capacity-scale. > > I think you need something absolute and probably per MHz (like > dynamic-power-coefficient property). Perhaps the IPC (instructions per > clock) value? > > In other words, I want to see these numbers have a defined method > of determining them and don't want to see random values from every > vendor. ARM, Ltd. says core X has a value of Y would be good enough for > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a > value of 1024 is not what I want to see. Of course things like cache > sizes can vary the performance, but is a baseline value good enough?
> A standard reference baseline is what we advocate with this set, but making this baseline work for every vendor's implementation is hardly achievable, IMHO. I don't think we can come up with any number that applies to each and every implementation; you can have different revisions of the same core and vendors might make implementation choices that end up with different peak performance. > However, no vendor will want to publish their values if these are > absolute values relative to other vendors. > Right. That is why I think we need to abstract numbers, as we do with capacity-scale. > If you expect these to need frequent tuning, then don't put them in DT. > I expect that it is possible to come up with a sensible baseline number for a specific platform implementation, so there is value in standardizing how we specify this value and how it is then consumed. Finer grained tuning might then happen both offline (with changes to the mainline DT) and online (using the sysfs interface), but that should only apply to a narrow set of use cases. Thanks, - Juri
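As a sketch of Juri's half-clock example above (identical cores, one cluster at 1.2GHz and one at 600MHz), the proposed properties could be used roughly like this. The node layout, compatible strings and reg values are hypothetical; only capacity-scale and capacity come from the proposed binding.

```dts
cpus {
	#address-cells = <1>;
	#size-cells = <0>;
	capacity-scale = <1200>;	/* fast cluster's clock, in MHz */

	cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x0>;
		capacity = <1200>;	/* 1.2GHz cluster */
	};

	cpu@100 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x100>;
		capacity = <600>;	/* 600MHz cluster */
	};
};
```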
Hi Juri, On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote: > Hi, > > On 23/11/15 20:06, Rob Herring wrote: >> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote: >> > ARM systems may be configured to have cpus with different power/performance >> > characteristics within the same chip. In this case, additional information >> > has to be made available to the kernel (the scheduler in particular) for it >> > to be aware of such differences and take decisions accordingly. >> > [snip] >> > +========================================== >> > +2 - CPU capacity definition >> > +========================================== >> > + >> > +CPU capacity is a number that provides the scheduler information about CPUs >> > +heterogeneity. Such heterogeneity can come from micro-architectural differences >> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run >> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this >> > +context is about differing performance characteristics; this binding tries to >> > +capture a first-order approximation of the relative performance of CPUs. >> > + >> > +One simple way to estimate CPU capacities is to iteratively run a well-known >> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at >> > +maximum frequency and then normalize values w.r.t. the best performing CPU. >> > +One can also do a statistically significant study of a wide collection of >> > +benchmarks, but pros of such an approach are not really evident at the time of >> > +writing. >> > + >> > +========================================== >> > +3 - capacity-scale >> > +========================================== >> > + >> > +CPUs capacities are defined with respect to capacity-scale property in the cpus >> > +node [1]. The property is optional; if not defined a 1024 capacity-scale is >> > +assumed. 
This property defines both the highest CPU capacity present in the >> > +system and granularity of CPU capacity values. >> >> I don't really see the point of this vs. having an absolute scale. >> > > IMHO, we need this for several reasons, one being to address one of your > concerns below: vendors are free to choose their scale without being > forced to publish absolute data. Another reason is that it might make > life easier in certain cases; for example, someone could implement a > system with a few clusters of, say, A57s, but some run at half the clock > of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in > this case I think it is just easier to define capacity-scale as 1200 and > capacities as 1200 and 600. Last reason that I can think of right now is > that we don't probably want to bound ourself to some particular range > from the beginning, as that range might be enough now, but it could > change in the future (as in, right now [1-1024] looks fine for > scheduling purposes, but that might change). Like Rob, i don't really see the benefit of this optional capacity-scale property. Parsing the capacity of all cpu nodes should give you a range as well. IMHO, this property looks like an optimization of the code that will parse the dt more than a HW description > >> > + >> > +========================================== >> > +4 - capacity >> > +========================================== >> > + >> > +capacity is an optional cpu node [1] property: u32 value representing CPU >> > +capacity, relative to capacity-scale. It is required and enforced that capacity >> > +<= capacity-scale. >> >> I think you need something absolute and probably per MHz (like >> dynamic-power-coefficient property). Perhaps the IPC (instructions per >> clock) value? >> >> In other words, I want to see these numbers have a defined method >> of determining them and don't want to see random values from every >> vendor. ARM, Ltd. 
says core X has a value of Y would be good enough for >> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a >> value of 1024 is not what I want to see. Of course things like cache >> sizes can vary the performance, but is a baseline value good enough? >> > > A standard reference baseline is what we advocate with this set, but > making this baseline work for every vendor's implementation is hardly > achievable, IMHO. I don't think we can come up with any number that > applies to each and every implementation; you can have different > revisions of the same core and vendors might make implementation choices > that end up with different peak performance. > >> However, no vendor will want to publish their values if these are >> absolute values relative to other vendors. >> > > Right. That is why I think we need to abstract numbers, as we do with > capacity-scale. > >> If you expect these to need frequent tuning, then don't put them in DT. >> > > I expect that it is possible to come up with a sensible baseline number > for a specific platform implementation, so there is value in > standardizing how we specify this value and how it is then consumed. > Finer grained tuning might then happen both offline (with changes to the > mainline DT) and online (using the sysfs interface), but that should > only apply to a narrow set of use cases. > > Thanks, > > - Juri
Hi Vincent, On 30/11/15 10:59, Vincent Guittot wrote: > Hi Juri, > > On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote: > > Hi, > > > > On 23/11/15 20:06, Rob Herring wrote: > >> On Mon, Nov 23, 2015 at 02:28:35PM +0000, Juri Lelli wrote: > >> > ARM systems may be configured to have cpus with different power/performance > >> > characteristics within the same chip. In this case, additional information > >> > has to be made available to the kernel (the scheduler in particular) for it > >> > to be aware of such differences and take decisions accordingly. > >> > > > [snip] > > >> > +========================================== > >> > +2 - CPU capacity definition > >> > +========================================== > >> > + > >> > +CPU capacity is a number that provides the scheduler information about CPUs > >> > +heterogeneity. Such heterogeneity can come from micro-architectural differences > >> > +(e.g., ARM big.LITTLE systems) or maximum frequency at which CPUs can run > >> > +(e.g., SMP systems with multiple frequency domains). Heterogeneity in this > >> > +context is about differing performance characteristics; this binding tries to > >> > +capture a first-order approximation of the relative performance of CPUs. > >> > + > >> > +One simple way to estimate CPU capacities is to iteratively run a well-known > >> > +CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at > >> > +maximum frequency and then normalize values w.r.t. the best performing CPU. > >> > +One can also do a statistically significant study of a wide collection of > >> > +benchmarks, but pros of such an approach are not really evident at the time of > >> > +writing. > >> > + > >> > +========================================== > >> > +3 - capacity-scale > >> > +========================================== > >> > + > >> > +CPUs capacities are defined with respect to capacity-scale property in the cpus > >> > +node [1]. 
The property is optional; if not defined a 1024 capacity-scale is > >> > +assumed. This property defines both the highest CPU capacity present in the > >> > +system and granularity of CPU capacity values. > >> > >> I don't really see the point of this vs. having an absolute scale. > >> > > > > IMHO, we need this for several reasons, one being to address one of your > > concerns below: vendors are free to choose their scale without being > > forced to publish absolute data. Another reason is that it might make > > life easier in certain cases; for example, someone could implement a > > system with a few clusters of, say, A57s, but some run at half the clock > > of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in > > this case I think it is just easier to define capacity-scale as 1200 and > > capacities as 1200 and 600. Last reason that I can think of right now is > > that we don't probably want to bound ourself to some particular range > > from the beginning, as that range might be enough now, but it could > > change in the future (as in, right now [1-1024] looks fine for > > scheduling purposes, but that might change). > > Like Rob, i don't really see the benefit of this optional > capacity-scale property. Parsing the capacity of all cpu nodes should > give you a range as well. > IMHO, this property looks like an optimization of the code that will > parse the dt more than a HW description > I agree that we can come up with the same information just looking at the biggest capacity value of all CPUs and treat that value as capacity-scale. I just thought that making it explicit made things clearer, as it might not be easy to see immediately from a DT with many CPUs which is the biggest capacity value. But, yes, we could remove that anyway.
Thanks, - Juri > > > >> > + > >> > +========================================== > >> > +4 - capacity > >> > +========================================== > >> > + > >> > +capacity is an optional cpu node [1] property: u32 value representing CPU > >> > +capacity, relative to capacity-scale. It is required and enforced that capacity > >> > +<= capacity-scale. > >> > >> I think you need something absolute and probably per MHz (like > >> dynamic-power-coefficient property). Perhaps the IPC (instructions per > >> clock) value? > >> > >> In other words, I want to see these numbers have a defined method > >> of determining them and don't want to see random values from every > >> vendor. ARM, Ltd. says core X has a value of Y would be good enough for > >> me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a > >> value of 1024 is not what I want to see. Of course things like cache > >> sizes can vary the performance, but is a baseline value good enough? > >> > > > > A standard reference baseline is what we advocate with this set, but > > making this baseline work for every vendor's implementation is hardly > > achievable, IMHO. I don't think we can come up with any number that > > applies to each and every implementation; you can have different > > revisions of the same core and vendors might make implementation choices > > that end up with different peak performance. > > > >> However, no vendor will want to publish their values if these are > >> absolute values relative to other vendors. > >> > > > > Right. That is why I think we need to abstract numbers, as we do with > > capacity-scale. > > > >> If you expect these to need frequent tuning, then don't put them in DT. > >> > > > > I expect that it is possible to come up with a sensible baseline number > > for a specific platform implementation, so there is value in > > standardizing how we specify this value and how it is then consumed. 
> > Finer grained tuning might then happen both offline (with changes to the > > mainline DT) and online (using the sysfs interface), but that should > > only apply to a narrow set of use cases. > > > > Thanks, > > > > - Juri
On 01/12/15 11:20, Juri Lelli wrote: > Hi Vincent, > > On 30/11/15 10:59, Vincent Guittot wrote: >> Hi Juri, >> >> On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote: [...] >>>>> +========================================== >>>>> +3 - capacity-scale >>>>> +========================================== >>>>> + >>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus >>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is >>>>> +assumed. This property defines both the highest CPU capacity present in the >>>>> +system and granularity of CPU capacity values. >>>> >>>> I don't really see the point of this vs. having an absolute scale. >>>> >>> >>> IMHO, we need this for several reasons, one being to address one of your >>> concerns below: vendors are free to choose their scale without being >>> forced to publish absolute data. Another reason is that it might make >>> life easier in certain cases; for example, someone could implement a >>> system with a few clusters of, say, A57s, but some run at half the clock >>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in >>> this case I think it is just easier to define capacity-scale as 1200 and >>> capacities as 1200 and 600. Last reason that I can think of right now is >>> that we don't probably want to bound ourself to some particular range >>> from the beginning, as that range might be enough now, but it could >>> change in the future (as in, right now [1-1024] looks fine for >>> scheduling purposes, but that might change). >> >> Like Rob, i don't really see the benefit of this optional >> capacity-scale property. Parsing the capacity of all cpu nodes should >> give you a range as well. 
>> IMHO, this property looks like an optimization of the code that will >> parse the dt more than a HW description >> > > I agree that we can come up with the same information just looking at > the biggest capacity value of all CPUs and treat that value as > capacity-scale. I just thought that having that explicit made things > clearer, as it could be not easy to immediately see from a DT with many > CPUs which is the biggest capacity value. But, yes, we could remove that > anyway. +1! This capacity-scale complicates things unnecessarily. It was hard for me to understand the meaning of it. Your second example sets 'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and 'capacity = <1>' for cpu[23]. This can be easily replaced by 'capacity = <1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more readable, as was already mentioned in this thread. I understand that we don't want to limit the range of capacity values in the dt file to [1..1024], nor enforce that the cpu w/ the highest capacity has to have the value of 1024 in the dt file; the scheduler then has to scale accordingly if we want to limit capacity to its supported range (like with EAS, [1..1024]). [...]
On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote: > I think you need something absolute and probably per MHz (like > dynamic-power-coefficient property). Perhaps the IPC (instructions per > clock) value? > In other words, I want to see these numbers have a defined method > of determining them and don't want to see random values from every > vendor. ARM, Ltd. says core X has a value of Y would be good enough for > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a > value of 1024 is not what I want to see. Of course things like cache > sizes can vary the performance, but is a baseline value good enough? > However, no vendor will want to publish their values if these are > absolute values relative to other vendors. > If you expect these to need frequent tuning, then don't put them in DT. I agree strongly. Putting what are essentially tuning numbers for the system into the ABI is going to lead us into a mess long term, since if we change anything related to the performance of the system the numbers may become invalid and we've no real way of recovering sensible information. There is of course also the issue of where people are getting the numbers from in the first place - were the numbers picked for some particular use case or to optimise some particular benchmark, what other conditions existed at the time (cpufreq setup for example), what tuning goals did the people picking the numbers have, and do any of those things correspond to what a given user wants? If detailed tuning of the numbers for specific systems matters much, will we get competing users patching the in-kernel DTs over and over, and what do we do about ACPI systems? Having an absolute definition doesn't really help with this since the concrete effect DT authors see is that these are tuning numbers.
It would be better to have the DT describe concrete physical properties of the system which we can then map onto numbers we like; that way, if we get better information in future or just decide that completely different metrics are appropriate for tuning, we can just do that without having to worry about translating the old metrics into new ones. We can then expose the tuning knobs to userspace for override if that's needed. If doing system specific tuning on vertically integrated systems really is terribly important it's not going to matter too much where the tuning is, but we also have to consider more general purpose systems. We're not going to get out of having to pick numbers at some point, pushing them into DT doesn't get us out of that but it does make the situation harder to manage long term and makes the performance for the general user less reliable. It's also just more work all round, everyone doing the DT for a SoC is going to have to do some combination of cargo culting or repeating the calibration. I remember Peter remarking at one of the LPC discussions of this idea that there had been some bad experiences with getting numbers from firmware on other systems.
Hi Mark, I certainly understand your (and Rob's) concerns, but let me try anyway to argue a bit more for this approach :-). On 10/12/15 15:30, Mark Brown wrote: > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote: > > > I think you need something absolute and probably per MHz (like > > dynamic-power-coefficient property). Perhaps the IPC (instructions per > > clock) value? > > > In other words, I want to see these numbers have a defined method > > of determining them and don't want to see random values from every > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a > > value of 1024 is not what I want to see. Of course things like cache > > sizes can vary the performance, but is a baseline value good enough? > > > However, no vendor will want to publish their values if these are > > absolute values relative to other vendors. > > > If you expect these to need frequent tuning, then don't put them in DT. > > I agree strongly. Putting what are essentially tuning numbers for the > system into the ABI is going to lead us into a mess long term since if > we change anything related to the performance of the system the numbers > may become invalid and we've no real way of recovering sensible > information. > > There is of course also the issue where people are getting the numbers > from in the first place - were the numbers picked for some particular > use case or to optimise some particular benchmark, what other conditions > existed at the time (cpufreq setup for example), what tuning goals did > the people picking the numbers have and do any of those things > correspond to what a given user wants? If detailed tuning the numbers > for specific systems matters much will we get competing users patching > the in kernel DTs over and over, and what do we do about ACPI systems?
> Having an absolute definition doesn't really help with this since the > concrete effect DT authors see is that these are tuning numbers. > I'm not entirely getting here why you consider capacity values to be tunables. As part of the EAS effort, we are proposing ways in which users should be able to fine tune their system as needed, when required (don't know if you had a chance to have a look at the SchedTune posting back in August for example [1]). This patch tries to only standardize where do we get default values from and how we specify them. I'm not seeing them changing much after an initial benchmarking phase has been done. Tuning should happen using different methods, not by changing these values, IMHO. > It would be better to have the DT describe concrete physical properties > of the system which we can then map onto numbers we like, that way if we > get better information in future or just decide that completely > different metrics are appropriate for tuning we can just do that without > having to worry about translating the old metrics into new ones. We can > then expose the tuning knobs to userspace for override if that's needed. > If doing system specific tuning on vertically integrated systems really > is terribly important it's not going to matter too much where the tuning > is but we also have to consider more general purpose systems. > As replied to Rob, I'm not sure it is so easy to find any physical property that expresses what we essentially need (without maybe relying on a complex mix of hardware details and a model to extract numbers from them). Instead, we propose to have reasonable, per SoC, default numbers; and then let users fine tune their platform afterwards, without changing those default values. > We're not going to get out of having to pick numbers at some point, > pushing them into DT doesn't get us out of that but it does make the > situation harder to manage long term and makes the performance for the > general user less relaible. 
> It's also just more work all round, everyone doing the DT for a SoC is going to have to do some combination of cargo culting or repeating the calibration.
>

I'm most probably a bit naive here, but I see the calibration phase happening only once, after the platform is stable. You get default capacity values by running a pretty simple benchmark on a fixed configuration; and you put them somewhere (DTs still seem a sensible place to me). Then you'll be able to suit tuning needs using different interfaces.

ABI changes have to be carefully considered, I know. But still, we need to agree on some way of providing these default capacity values. So, thanks for helping us carry on this discussion.

Best,

- Juri

> I remember Peter remarking at one of the LPC discussions of this idea that there had been some bad experiences with getting numbers from firmware on other systems.

[1] https://lkml.org/lkml/2015/8/19/419

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On 10/12/15 14:14, Dietmar Eggemann wrote: > On 01/12/15 11:20, Juri Lelli wrote: > > Hi Vincent, > > > > On 30/11/15 10:59, Vincent Guittot wrote: > >> Hi Juri, > >> > >> On 24 November 2015 at 11:54, Juri Lelli <juri.lelli@arm.com> wrote: > > [...] > > >>>>> +========================================== > >>>>> +3 - capacity-scale > >>>>> +========================================== > >>>>> + > >>>>> +CPUs capacities are defined with respect to capacity-scale property in the cpus > >>>>> +node [1]. The property is optional; if not defined a 1024 capacity-scale is > >>>>> +assumed. This property defines both the highest CPU capacity present in the > >>>>> +system and granularity of CPU capacity values. > >>>> > >>>> I don't really see the point of this vs. having an absolute scale. > >>>> > >>> > >>> IMHO, we need this for several reasons, one being to address one of your > >>> concerns below: vendors are free to choose their scale without being > >>> forced to publish absolute data. Another reason is that it might make > >>> life easier in certain cases; for example, someone could implement a > >>> system with a few clusters of, say, A57s, but some run at half the clock > >>> of the others (e.g., you have a 1.2GHz cluster and a 600MHz cluster); in > >>> this case I think it is just easier to define capacity-scale as 1200 and > >>> capacities as 1200 and 600. Last reason that I can think of right now is > >>> that we don't probably want to bound ourself to some particular range > >>> from the beginning, as that range might be enough now, but it could > >>> change in the future (as in, right now [1-1024] looks fine for > >>> scheduling purposes, but that might change). > >> > >> Like Rob, i don't really see the benefit of this optional > >> capacity-scale property. Parsing the capacity of all cpu nodes should > >> give you a range as well. 
> >> IMHO, this property looks like an optimization of the code that will parse the dt more than a HW description
> >>
> >
> > I agree that we can come up with the same information just looking at the biggest capacity value of all CPUs and treat that value as capacity-scale. I just thought that having that explicit made things clearer, as it could be not easy to immediately see from a DT with many CPUs which is the biggest capacity value. But, yes, we could remove that anyway.
>
> +1! This capacity-scale complicates things unnecessarily. It was hard for me to understand the meaning of it. Your 2. example sets 'capacity-scale = <2>' but also 'capacity = <2>' for cpu[01] and 'capacity = <1>' for cpu[23]. This can be easily replaced by 'capacity = <1024>' for cpu[01] and 'capacity = <512>' for cpu[23]. Much more readable, as it was mentioned already in this thread.
>
> I understand that we don't want to limit the range of capacity values in the dt file to [1..1024] nor enforce that the cpu w/ the highest capacity has to have the value of 1024 in the dt file so the scheduler has to scale accordingly if we want to limit capacity to its supported capacity range (like with EAS [1..1024]).
>

OK, I guess I can easily remove capacity-scale and simply normalize CPU capacities w.r.t. the highest capacity in the DT.

Thanks,

- Juri
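For illustration of the normalized scheme discussed above, dropping capacity-scale and expressing the biggest CPUs as 1024 would make a DT look something like the following. This is a hypothetical sketch only: the core types, reg values and the final property name are placeholders, not the binding text as posted.

```dts
cpus {
	#address-cells = <1>;
	#size-cells = <0>;

	cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x0>;
		capacity = <1024>;	/* biggest capacity in the system */
	};

	cpu@1 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x1>;
		capacity = <1024>;
	};

	cpu@100 {
		device_type = "cpu";
		compatible = "arm,cortex-a53";
		reg = <0x100>;
		capacity = <512>;	/* half the capacity of the big cores */
	};

	cpu@101 {
		device_type = "cpu";
		compatible = "arm,cortex-a53";
		reg = <0x101>;
		capacity = <512>;
	};
};
```

With this form the parsing code can recover the scale itself (the maximum of all capacity values), which is exactly the simplification Vincent and Dietmar are asking for.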
On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote:
> On 10/12/15 15:30, Mark Brown wrote:
> > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote:
>
> > > In other words, I want to see these numbers have a defined method of determining them and don't want to see random values from every vendor. ARM, Ltd. says core X has a value of Y would be good enough for me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a value of 1024 is not what I want to see. Of course things like cache sizes can vary the performance, but is a baseline value good enough?
>
> > > However, no vendor will want to publish their values if these are absolute values relative to other vendors.
>
> > > If you expect these to need frequent tuning, then don't put them in DT.
>
> > I agree strongly. Putting what are essentially tuning numbers for the system into the ABI is going to lead us into a mess long term since if we change anything related to the performance of the system the numbers may become invalid and we've no real way of recovering sensible information.
>
> I'm not entirely getting here why you consider capacity values to be tunables. As part of the EAS effort, we are proposing ways in which users

The purpose of the capacity values is to influence the scheduler behaviour and hence performance. Without a concrete definition they're just magic numbers which have meaning only in terms of their effect on the performance of the system. That is a sufficiently complex outcome to ensure that there will be an element of taste in what the desired outcomes are. Sounds like tuneables to me.

> should be able to fine tune their system as needed, when required (don't know if you had a chance to have a look at the SchedTune posting back in August for example [1]). This patch tries to only standardize where we get default values from and how we specify them.
> I'm not seeing them changing much after an initial benchmarking phase has been done. Tuning should happen using different methods, not by changing these values, IMHO.

If you are saying people should use other, more sensible, ways of specifying the final values that actually get used in production then why take the defaults from direct numbers in DT in the first place? If you are saying that people should tune and then put the values in here then that's problematic for the reasons I outlined.

> > It would be better to have the DT describe concrete physical properties of the system which we can then map onto numbers we like, that way if we get better information in future or just decide that completely different metrics are appropriate for tuning we can just do that without having to worry about translating the old metrics into new ones. We can then expose the tuning knobs to userspace for override if that's needed. If doing system specific tuning on vertically integrated systems really is terribly important it's not going to matter too much where the tuning is but we also have to consider more general purpose systems.

> As replied to Rob, I'm not sure it is so easy to find any physical property that expresses what we essentially need (without maybe relying on a complex mix of hardware details and a model to extract numbers from them). Instead, we propose to have reasonable, per SoC, default numbers; and then let users fine tune their platform afterwards, without changing those default values.

If users are supposed to do fine tuning elsewhere after the fact why bother with this initial calibration? Something that's ballpark good enough like just knowing the core used and perhaps some important options on it should give an adequate starting point and not have the issues with having the tuning numbers present as magic numbers. Perhaps we might also feed cache information in at some point.
If in future we're able to improve those default numbers (or just adapt at runtime) then even better. It also seems a bit strange to expect people to do some tuning in one place initially and then additional tuning somewhere else later, from a user point of view I'd expect to always do my tuning in the same place.

> > We're not going to get out of having to pick numbers at some point, pushing them into DT doesn't get us out of that but it does make the situation harder to manage long term and makes the performance for the general user less reliable. It's also just more work all round, everyone doing the DT for a SoC is going to have to do some combination of cargo culting or repeating the calibration.

> I'm most probably a bit naive here, but I see the calibration phase happening only once, after the platform is stable. You get default capacity values by running a pretty simple benchmark on a fixed configuration; and you put them somewhere (DTs still seem to be a sensible place to me). Then you'll be able to suit tuning needs using different interfaces.

My point is that everyone making any kind of SoC with asymmetries is expected to go and do some kind of calibration based on some unclear criteria; if these are just ballpark accurate starting points that seems like wasted effort - the kernel should be making a reasonable effort to do something sensible without this information, which is going to be less effort all round. It doesn't need to wait for real silicon (this seems like the sort of core bit of DT which will be being written pre-tapeout) and doesn't have marketing implications.

Doing that and then switching to some other interface for real tuning seems especially odd and I'm not sure that's something that users are going to expect or understand.
Hi Mark, On 11/12/15 17:49, Mark Brown wrote: > On Thu, Dec 10, 2015 at 05:58:20PM +0000, Juri Lelli wrote: > > On 10/12/15 15:30, Mark Brown wrote: > > > On Mon, Nov 23, 2015 at 08:06:31PM -0600, Rob Herring wrote: > > > > > In other words, I want to see these numbers have a defined method > > > > of determining them and don't want to see random values from every > > > > vendor. ARM, Ltd. says core X has a value of Y would be good enough for > > > > me. Vendor X's A57 having a value of 2 and Vendor Y's A57 having a > > > > value of 1024 is not what I want to see. Of course things like cache > > > > sizes can vary the performance, but is a baseline value good enough? > > > > > However, no vendor will want to publish their values if these are > > > > absolute values relative to other vendors. > > > > > If you expect these to need frequent tuning, then don't put them in DT. > > > > I agree strongly. Putting what are essentially tuning numbers for the > > > system into the ABI is going to lead us into a mess long term since if > > > we change anything related to the performance of the system the numbers > > > may become invalid and we've no real way of recovering sensible > > > information. > > > I'm not entirely getting here why you consider capacity values to be > > tunables. As part of the EAS effort, we are proposing ways in which users > > The purpose of the capacity values is to influence the scheduler > behaviour and hence performance. Without a concrete definition they're > just magic numbers which have meaining only in terms of their effect on > the performance of the system. That is a sufficiently complex outcome > to ensure that there will be an element of taste in what the desired > outcomes are. Sounds like tuneables to me. > Capacity values are meant to describe asymmetry (if any) of the system CPUs to the scheduler. The scheduler can then use this additional bit of information to try to do better scheduling decisions. 
Yes, having these values available will end up giving you better performance, but I guess this applies to any information we provide to the kernel (and scheduler); the less dumb a subsystem is, the better we can make it work.

> > should be able to fine tune their system as needed, when required (don't know if you had a chance to have a look at the SchedTune posting back in August for example [1]). This patch tries to only standardize where we get default values from and how we specify them. I'm not seeing them changing much after an initial benchmarking phase has been done. Tuning should happen using different methods, not by changing these values, IMHO.
>
> If you are saying people should use other, more sensible, ways of specifying the final values that actually get used in production then why take the defaults from direct numbers in DT in the first place? If you are saying that people should tune and then put the values in here then that's problematic for the reasons I outlined.
>

IMHO, people should come up with default values that describe heterogeneity in their system. Then use other ways to tune the system at run time (depending on the workload maybe).

As said, I understand your concerns; but what I still don't get is how CPU capacity values are so different from, say, idle states min-residency-us. AFAIK there is a per-SoC benchmarking phase required to come up with those values as well; you have to pick some benchmark that stresses worst case entry/exit while measuring energy, then make calculations that tell you when it is wise to enter a particular idle state. Ideally we should derive min residency from specs, but I'm not sure that is how it works in practice.
> > > It would be better to have the DT describe concrete physical properties of the system which we can then map onto numbers we like, that way if we get better information in future or just decide that completely different metrics are appropriate for tuning we can just do that without having to worry about translating the old metrics into new ones. We can then expose the tuning knobs to userspace for override if that's needed. If doing system specific tuning on vertically integrated systems really is terribly important it's not going to matter too much where the tuning is but we also have to consider more general purpose systems.
> >
> > As replied to Rob, I'm not sure it is so easy to find any physical property that expresses what we essentially need (without maybe relying on a complex mix of hardware details and a model to extract numbers from them). Instead, we propose to have reasonable, per SoC, default numbers; and then let users fine tune their platform afterwards, without changing those default values.
>
> If users are supposed to do fine tuning elsewhere after the fact why bother with this initial calibration? Something that's ballpark good enough like just knowing the core used and perhaps some important options on it should give an adequate starting point and not have the issues with having the tuning numbers present as magic numbers. Perhaps we might also feed cache information in at some point. If in future we're able to improve those default numbers (or just adapt at runtime) then even better.
>
> It also seems a bit strange to expect people to do some tuning in one place initially and then additional tuning somewhere else later, from a user point of view I'd expect to always do my tuning in the same place.
>

I think that runtime tuning needs are much more complex and finer grained than what you can achieve by playing with CPU capacities.
And I agree with you, users should only play with these other methods I'm referring to; they should not mess around with platform description bits. They should provide information about runtime needs, then the scheduler (in this case) will do its best to give them acceptable performance using improved knowledge about the platform. > > > We're not going to get out of having to pick numbers at some point, > > > pushing them into DT doesn't get us out of that but it does make the > > > situation harder to manage long term and makes the performance for the > > > general user less relaible. It's also just more work all round, > > > everyone doing the DT for a SoC is going to have to do some combination > > > of cargo culting or repeating the callibration. > > > I'm most probably a bit naive here, but I see the calibration phase > > happening only once, after the platform is stable. You get default > > capacity values by running a pretty simple benchmark on a fixed > > configuration; and you put them somewhere (DTs still seem to be a > > sensible place to me). Then you'll be able to suit tuning needs using > > different interfaces. > > My point is that everyone making any kind of SoC with asymmetries is > expected to go and do some kind of callibration based on some unclear > criteria, if these are just ballpark accurate starting points that seems > like wasted effort - the kernel should be making a reasonable effort to > do something sensible without this information which is going to be less > effort all round. It doesn't need to wait for real silicon (this seems > like the sort of core bit of DT which will be being written pre-tapeout) > and doesn't have marketing implications. > > Doing that and then switching to some other interface for real tuning > seems especially odd and I'm not sure that's something that users are > going to expect or understand. 
As I'm saying above, users should not care about this first step of platform description; not more than how much they care about other bits in DTs that describe their platform.

Thanks,

- Juri
On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
> On 11/12/15 17:49, Mark Brown wrote:
>
> > The purpose of the capacity values is to influence the scheduler behaviour and hence performance. Without a concrete definition they're just magic numbers which have meaning only in terms of their effect on the performance of the system. That is a sufficiently complex outcome to ensure that there will be an element of taste in what the desired outcomes are. Sounds like tuneables to me.
>
> Capacity values are meant to describe asymmetry (if any) of the system CPUs to the scheduler. The scheduler can then use this additional bit of information to try to do better scheduling decisions. Yes, having these values available will end up giving you better performance, but I guess this applies to any information we provide to the kernel (and scheduler); the less dumb a subsystem is, the better we can make it work.

This information is a magic number, there's never going to be a right answer. If it needs changing it's not like the kernel is modeling a concrete thing like the relative performance of the A53 and A57 poorly or whatever, it's just that the relative values of number A and number B are not what the system integrator desires.

> > If you are saying people should use other, more sensible, ways of specifying the final values that actually get used in production then why take the defaults from direct numbers in DT in the first place? If you are saying that people should tune and then put the values in here then that's problematic for the reasons I outlined.
>
> IMHO, people should come up with default values that describe heterogeneity in their system. Then use other ways to tune the system at run time (depending on the workload maybe).

My argument is that they should be describing the heterogeneity of their system by describing concrete properties of their system rather than by providing magic numbers.
> As said, I understand your concerns; but, what I don't still get is > where CPU capacity values are so different from, say, idle states > min-residency-us. AFAIK there is a per-SoC benchmarking phase required > to come up with that values as well; you have to pick some benchmark > that stresses worst case entry/exit while measuring energy, then make > calculations that tells you when it is wise to enter a particular idle > state. Ideally we should derive min residency from specs, but I'm not > sure is how it works in practice. Those at least have a concrete physical value that it is possible to measure in a describable way that is unlikely to change based on the internals of the kernel. It would be kind of nice to have the broken down numbers for entry time, exit time and power burn in suspend but it's not clear it's worth the bother. It's also one of these things where we don't have any real proxies that get us anywhere in the ballpark of where we want to be. > > It also seems a bit strange to expect people to do some tuning in one > > place initially and then additional tuning somewhere else later, from > > a user point of view I'd expect to always do my tuning in the same > > place. > I think that runtime tuning needs are much more complex and have finer > grained needs than what you can achieve by playing with CPU capacities. > And I agree with you, users should only play with these other methods > I'm referring to; they should not mess around with platform description > bits. They should provide information about runtime needs, then the > scheduler (in this case) will do its best to give them acceptable > performance using improved knowledge about the platform. So then why isn't it adequate to just have things like the core types in there and work from there? 
Are we really expecting the tuning to be so much better than something it's possible to come up with automatically, on the scale that we're expecting this to be accurate, that it's worth just jumping straight to magic numbers?

> > Doing that and then switching to some other interface for real tuning seems especially odd and I'm not sure that's something that users are going to expect or understand.
>
> As I'm saying above, users should not care about this first step of platform description; not more than how much they care about other bits in DTs that describe their platform.

That may be your intention but I don't see how it is realistic to expect that this is what people will actually understand. It's a number, it has an effect and it's hard to see that people won't tune it, it's not like people don't have to edit DTs during system integration. People won't reliably read documentation or look in mailing list threads, and other than that it has all the properties of a tuning interface.

There's a tension here between what you're saying about people not being supposed to care much about the numbers for tuning and the very fact that there's a need for the DT to carry explicit numbers.
On 14/12/15 16:59, Mark Brown wrote: > On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote: > > On 11/12/15 17:49, Mark Brown wrote: > > > > The purpose of the capacity values is to influence the scheduler > > > behaviour and hence performance. Without a concrete definition they're > > > just magic numbers which have meaining only in terms of their effect on > > > the performance of the system. That is a sufficiently complex outcome > > > to ensure that there will be an element of taste in what the desired > > > outcomes are. Sounds like tuneables to me. > > > Capacity values are meant to describe asymmetry (if any) of the system > > CPUs to the scheduler. The scheduler can then use this additional bit of > > information to try to do better scheduling decisions. Yes, having these > > values available will end up giving you better performance, but I guess > > this apply to any information we provide to the kernel (and scheduler); > > the less dumb a subsystem is, the better we can make it work. > > This information is a magic number, there's never going to be a right > answer. If it needs changing it's not like the kernel is modeling a > concrete thing like the relative performance of the A53 and A57 poorly > or whatever, it's just that the relative values of number A and number B > are not what the system integrator desires. > > > > If you are saying people should use other, more sensible, ways of > > > specifying the final values that actually get used in production then > > > why take the defaults from direct numbers DT in the first place? If you > > > are saying that people should tune and then put the values in here then > > > that's problematic for the reasons I outlined. > > > IMHO, people should come up with default values that describe > > heterogeneity in their system. Then use other ways to tune the system at > > run time (depending on the workload maybe). 
>
> My argument is that they should be describing the heterogeneity of their system by describing concrete properties of their system rather than by providing magic numbers.
>
> > As said, I understand your concerns; but what I still don't get is how CPU capacity values are so different from, say, idle states min-residency-us. AFAIK there is a per-SoC benchmarking phase required to come up with those values as well; you have to pick some benchmark that stresses worst case entry/exit while measuring energy, then make calculations that tell you when it is wise to enter a particular idle state. Ideally we should derive min residency from specs, but I'm not sure that is how it works in practice.
>
> Those at least have a concrete physical value that it is possible to measure in a describable way that is unlikely to change based on the internals of the kernel. It would be kind of nice to have the broken down numbers for entry time, exit time and power burn in suspend but it's not clear it's worth the bother. It's also one of these things where we don't have any real proxies that get us anywhere in the ballpark of where we want to be.
>

I'm proposing to add a new value because I couldn't find any proxies in the current bindings that bring us any close to what we need. If I failed in looking for them, and they actually exist, I'll personally be more than happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first paragraph above almost as it is to describe how to come up with capacity values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a well-known CPU user space benchmark (e.g, sysbench, dhrystone, etc.) on each CPU at maximum frequency and then normalize values w.r.t. the best performing CPU."

I don't see why this should change if we decide that the scheduler has to change in the future.
Also, looking again at section 2 of the idle-states bindings docs, we have a nice and accurate description of what min-residency is, but not much info about how we can actually measure that. Maybe expanding the docs section regarding CPU capacity could help?

> > > It also seems a bit strange to expect people to do some tuning in one place initially and then additional tuning somewhere else later, from a user point of view I'd expect to always do my tuning in the same place.
> >
> > I think that runtime tuning needs are much more complex and finer grained than what you can achieve by playing with CPU capacities. And I agree with you, users should only play with these other methods I'm referring to; they should not mess around with platform description bits. They should provide information about runtime needs, then the scheduler (in this case) will do its best to give them acceptable performance using improved knowledge about the platform.
>
> So then why isn't it adequate to just have things like the core types in there and work from there? Are we really expecting the tuning to be so much better than something it's possible to come up with automatically, on the scale that we're expecting this to be accurate, that it's worth just jumping straight to magic numbers?
>

I take your point here that having fine grained values might not really give us appreciable differences (that is also why I proposed the capacity-scale in the first instance), but I'm not sure I'm getting what you are proposing here. Today, and for arm only, we have a static table representing CPUs' "efficiency":

/*
 * Table of relative efficiency of each processors
 * The efficiency value must fit in 20bit and the final
 * cpu_scale value must be in the range
 *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
 * in order to return at most 1 when DIV_ROUND_CLOSEST
 * is used to compute the capacity of a CPU.
 * Processors that are not defined in the table,
 * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
 */
static const struct cpu_efficiency table_efficiency[] = {
	{"arm,cortex-a15", 3891},
	{"arm,cortex-a7",  2048},
	{NULL, },
};

When the clock-frequency property is defined in DT, we try to find a match for the compatibility string in the table above and then use the associated number to compute the capacity. Are you proposing to have something like this for arm64 as well?

BTW, the only info I could find about those numbers is from this thread:

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers somewhere else?

If I understand how that table was created, how do we think we will extend it in the future to allow newer core types (say we replicate this solution for arm64)? It seems that we have to change it, rescaling values, each time we have a new core on the market. How can we come up with relative numbers, in the future, comparing newer cores to old ones (that might be already out of the market by that time)?

> > > Doing that and then switching to some other interface for real tuning seems especially odd and I'm not sure that's something that users are going to expect or understand.
> >
> > As I'm saying above, users should not care about this first step of platform description; not more than how much they care about other bits in DTs that describe their platform.
>
> That may be your intention but I don't see how it is realistic to expect that this is what people will actually understand. It's a number, it has an effect and it's hard to see that people won't tune it, it's not like people don't have to edit DTs during system integration. People won't reliably read documentation or look in mailing list threads and other than that it has all the properties of a tuning interface.
>

Eh, sad but true.
I guess we can, as we usually do, put more effort into documenting how things are supposed to be used. Then, if people think that they can make their system perform better without looking at documentation or asking around, I'm not sure there is much we could do to prevent them from doing things wrong. There are already a lot of things people shouldn't touch if they don't know what they are doing. :-/

> There's a tension here between what you're saying about people not being supposed to care much about the numbers for tuning and the very fact that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs, but put all their efforts into describing (using appropriate APIs) their needs and how they apply to the workload they care about. Our job is to put together information coming from users and knowledge of system configuration to provide people the desired outcomes.

Best,

- Juri
On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:
> I'm proposing to add a new value because I couldn't find any proxies in the current bindings that bring us any close to what we need. If I failed in looking for them, and they actually exist, I'll personally be more than happy to just rely on them instead of adding more stuff :-).

Well, the first pass is going to be the core types (possibly with some other properties if there's interesting parameters people can tweak in integration).

> Interestingly, to me it sounds like we could actually use your first paragraph above almost as it is to describe how to come up with capacity values. In the documentation I put the following:

> "One simple way to estimate CPU capacities is to iteratively run a well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on each CPU at maximum frequency and then normalize values w.r.t. the best performing CPU."

> I don't see why this should change if we decide that the scheduler has to change in the future.

You'd at least need to pick a particular benchmark there...

> Also, looking again at section 2 of idle-states bindings docs, we have a nice and accurate description of what min-residency is, but not much info about how we can actually measure that. Maybe, expanding the docs section regarding CPU capacity could help?

I'm dubious about this to be honest, if only on the basis of people reading the docs in the first place. It also seems that if we're really just talking about some CPU microbenchmark then forcing every implementor to do the benchmark and scaling is at best going to burn people's time and be error prone since it doesn't seem so likely to have dramatic system integration variation.

> > So then why isn't it adequate to just have things like the core types in there and work from there?
> > Are we really expecting the tuning to be so much better than it's possible to come up with something that's so much better on the scale that we're expecting this to be accurate that it's worth just jumping straight to magic numbers?

> I take your point here that having fine grained values might not really give us appreciable differences (that is also why I proposed the capacity-scale in the first instance), but I'm not sure I'm getting what you are proposing here.

Something like the existing solution for arm32.

> static const struct cpu_efficiency table_efficiency[] = {
> 	{"arm,cortex-a15", 3891},
> 	{"arm,cortex-a7", 2048},
> 	{NULL, },
> };

> When the clock-frequency property is defined in DT, we try to find a match for the compatibility string in the table above and then use the associated number to compute the capacity. Are you proposing to have something like this for arm64 as well?

> BTW, the only info I could find about those numbers is from this thread

It was discussed in some other thread when I was sending the equivalent stuff for arm64 (I never got round to finishing it off due to issues with Catalin and Will being concerned about the specific numbers). Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz numbers that ARM publish for the cores. I'd independently done the same thing for arm64. It would probably help to put comments in there with the base numbers before scaling, or just redo the table in terms of the raw numbers.

This is, of course, an example of my concerns about magic number configuration.

> If I understand how that table was created, how do we think we will extend it in the future to allow newer core types (say we replicate this solution for arm64)? It seems that we have to change it, rescaling values, each time we have a new core on the market. How can we come up with relative numbers, in the future, comparing newer cores to old ones (that might be already out of the market by that time)?
It doesn't seem particularly challenging to add new numbers to the table (and add additional properties to select on) TBH. We can either rescale by hand in the table when adding entries, script it as part of the kernel build or do it at runtime (as the arm32 code already does to an extent based on the particular set of cores we find). What difficulties do you see with this?

This is something that seems like an advantage to me - we can just replace everything at any point, we're not tied to trusting the golden benchmark someone did (or tweaked) if we come up with a better methodology later on.

> Eh, sad but true. I guess we can, as we usually do, put more effort in documenting how things are supposed to be used. Then, if people think that they can make their system perform better without looking at documentation or asking around, I'm not sure there is much we could do to prevent them from doing things wrong. There are already a lot of things people shouldn't touch if they don't know what they are doing. :-/

The trouble with making people specify this in DT is that it becomes a parameter that someone *has* to tweak at some point.

> > There's a tension here between what you're saying about people not being supposed to care much about the numbers for tuning and the very fact that there's a need for the DT to carry explicit numbers.

> My point is that people with tuning needs shouldn't even look at DTs, but put all their efforts in describing (using appropriate APIs) their needs and how they apply to the workload they care about. Our job is to put together information coming from users and knowledge of system configuration to provide people the desired outcomes.

You can't do a system integration for a smartphone or embedded board without configuring things in DT, people integrating for those systems are already looking at DT and they are the main current targets for heterogeneous systems I'm aware of.
The people who don't need to look at DT are mainly the enterprise type people but a lot of them won't be able to use this as-is anyway since they'll be using ACPI.
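For concreteness, a cpu node using the binding proposed in this series might look like the fragment below. This is a sketch only: the `capacity` property name follows the cpu-capacity.txt proposal under discussion (the final binding may differ), and the values and compatible strings are made up for illustration.

```dts
cpus {
	#address-cells = <1>;
	#size-cells = <0>;

	cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x0>;
		capacity = <1024>;	/* made-up value */
	};

	cpu@100 {
		device_type = "cpu";
		compatible = "arm,cortex-a53";
		reg = <0x100>;
		capacity = <578>;	/* made-up value */
	};
};
```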
On 15 December 2015 at 13:22, Juri Lelli <juri.lelli@arm.com> wrote:
> On 14/12/15 16:59, Mark Brown wrote:
>> On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:
>> > On 11/12/15 17:49, Mark Brown wrote:

>> > > The purpose of the capacity values is to influence the scheduler behaviour and hence performance. Without a concrete definition they're just magic numbers which have meaning only in terms of their effect on the performance of the system. That is a sufficiently complex outcome to ensure that there will be an element of taste in what the desired outcomes are. Sounds like tuneables to me.

>> > Capacity values are meant to describe asymmetry (if any) of the system CPUs to the scheduler. The scheduler can then use this additional bit of information to try to make better scheduling decisions. Yes, having these values available will end up giving you better performance, but I guess this applies to any information we provide to the kernel (and scheduler); the less dumb a subsystem is, the better we can make it work.

>> This information is a magic number, there's never going to be a right answer. If it needs changing it's not like the kernel is modeling a concrete thing like the relative performance of the A53 and A57 poorly or whatever, it's just that the relative values of number A and number B are not what the system integrator desires.

>> > > If you are saying people should use other, more sensible, ways of specifying the final values that actually get used in production then why take the defaults from direct numbers in DT in the first place? If you are saying that people should tune and then put the values in here then that's problematic for the reasons I outlined.

>> > IMHO, people should come up with default values that describe heterogeneity in their system.
Then use other ways to tune the system at run time (depending on the workload maybe).

>> My argument is that they should be describing the heterogeneity of their system by describing concrete properties of their system rather than by providing magic numbers.

>> > As said, I understand your concerns; but, what I still don't get is where CPU capacity values are so different from, say, idle states min-residency-us. AFAIK there is a per-SoC benchmarking phase required to come up with those values as well; you have to pick some benchmark that stresses worst case entry/exit while measuring energy, then make calculations that tell you when it is wise to enter a particular idle state. Ideally we should derive min residency from specs, but I'm not sure that is how it works in practice.

>> Those at least have a concrete physical value that it is possible to measure in a describable way that is unlikely to change based on the internals of the kernel. It would be kind of nice to have the broken down numbers for entry time, exit time and power burn in suspend but it's not clear it's worth the bother. It's also one of these things where we don't have any real proxies that get us anywhere in the ballpark of where we want to be.

> I'm proposing to add a new value because I couldn't find any proxies in the current bindings that bring us any close to what we need. If I failed in looking for them, and they actually exist, I'll personally be more than happy to just rely on them instead of adding more stuff :-).

> Interestingly, to me it sounds like we could actually use your first paragraph above almost as it is to describe how to come up with capacity values. In the documentation I put the following:

> "One simple way to estimate CPU capacities is to iteratively run a well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.)
on > each CPU at maximum frequency and then normalize values w.r.t. the best > performing CPU." > > I don't see why this should change if we decide that the scheduler has > to change in the future. > > Also, looking again at section 2 of idle-states bindings docs, we have a > nice and accurate description of what min-residency is, but not much > info about how we can actually measure that. Maybe, expanding the docs > section regarding CPU capacity could help? > >> > > It also seems a bit strange to expect people to do some tuning in one >> > > place initially and then additional tuning somewhere else later, from >> > > a user point of view I'd expect to always do my tuning in the same >> > > place. >> >> > I think that runtime tuning needs are much more complex and have finer >> > grained needs than what you can achieve by playing with CPU capacities. >> > And I agree with you, users should only play with these other methods >> > I'm referring to; they should not mess around with platform description >> > bits. They should provide information about runtime needs, then the >> > scheduler (in this case) will do its best to give them acceptable >> > performance using improved knowledge about the platform. >> >> So then why isn't it adequate to just have things like the core types in >> there and work from there? Are we really expecting the tuning to be so >> much better than it's possible to come up with something that's so much >> better on the scale that we're expecting this to be accurate that it's >> worth just jumping straight to magic numbers? >> > > I take your point here that having fine grained values might not really > give us appreciable differences (that is also why I proposed the > capacity-scale in the first instance), but I'm not sure I'm getting what > you are proposing here. 
> Today, and for arm only, we have a static table representing CPUs "efficiency":

> /*
>  * Table of relative efficiency of each processors
>  * The efficiency value must fit in 20bit and the final
>  * cpu_scale value must be in the range
>  * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
>  * in order to return at most 1 when DIV_ROUND_CLOSEST
>  * is used to compute the capacity of a CPU.
>  * Processors that are not defined in the table,
>  * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
>  */
> static const struct cpu_efficiency table_efficiency[] = {
> 	{"arm,cortex-a15", 3891},
> 	{"arm,cortex-a7", 2048},
> 	{NULL, },
> };

> When the clock-frequency property is defined in DT, we try to find a match for the compatibility string in the table above and then use the associated number to compute the capacity. Are you proposing to have something like this for arm64 as well?

> BTW, the only info I could find about those numbers is from this thread
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html
>
> Vincent, do we have more precise information about these numbers somewhere else?

These numbers come from a document from ARM in which they compared the A15 and the A7. I just used the numbers provided by this white paper and scaled them into a more appropriate range than DMIPS/MHz.

> If I understand how that table was created, how do we think we will extend it in the future to allow newer core types (say we replicate this solution for arm64)? It seems that we have to change it, rescaling values, each time we have a new core on the market. How can we come up with relative numbers, in the future, comparing newer cores to old ones (that might be already out of the market by that time)?

>> > > Doing that and then switching to some other interface for real tuning seems especially odd and I'm not sure that's something that users are going to expect or understand.
>> > As I'm saying above, users should not care about this first step of platform description; not more than how much they care about other bits in DTs that describe their platform.

>> That may be your intention but I don't see how it is realistic to expect that this is what people will actually understand. It's a number, it has an effect and it's hard to see that people won't tune it, it's not like people don't have to edit DTs during system integration. People won't reliably read documentation or look in mailing list threads and other than that it has all the properties of a tuning interface.

> Eh, sad but true. I guess we can, as we usually do, put more effort in documenting how things are supposed to be used. Then, if people think that they can make their system perform better without looking at documentation or asking around, I'm not sure there is much we could do to prevent them from doing things wrong. There are already a lot of things people shouldn't touch if they don't know what they are doing. :-/

>> There's a tension here between what you're saying about people not being supposed to care much about the numbers for tuning and the very fact that there's a need for the DT to carry explicit numbers.

> My point is that people with tuning needs shouldn't even look at DTs, but put all their efforts in describing (using appropriate APIs) their needs and how they apply to the workload they care about. Our job is to put together information coming from users and knowledge of system configuration to provide people the desired outcomes.

> Best,
>
> - Juri
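The normalization recipe quoted from the draft binding document ("run a well-known benchmark on each CPU at maximum frequency, then normalize w.r.t. the best performing CPU") can be sketched as below. This is an illustrative helper, not code from the series; the function name and interface are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Illustrative: turn per-CPU benchmark scores (higher is better,
 * each measured with the CPU at its maximum frequency) into capacity
 * values normalized against the best-performing CPU.
 */
static void normalize_capacities(const uint64_t *score, uint64_t *cap,
                                 size_t ncpus)
{
	uint64_t best = 0;
	size_t i;

	for (i = 0; i < ncpus; i++)
		if (score[i] > best)
			best = score[i];

	for (i = 0; i < ncpus; i++)
		cap[i] = best ? (score[i] * SCHED_CAPACITY_SCALE) / best : 0;
}
```

Note that the result is only as stable as the benchmark: a different workload mix would yield different ratios, which is exactly the "which benchmark" argument made elsewhere in this thread.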
On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:

> > > So then why isn't it adequate to just have things like the core types in there and work from there? Are we really expecting the tuning to be so much better than it's possible to come up with something that's so much better on the scale that we're expecting this to be accurate that it's worth just jumping straight to magic numbers?

> > I take your point here that having fine grained values might not really give us appreciable differences (that is also why I proposed the capacity-scale in the first instance), but I'm not sure I'm getting what you are proposing here.

> Something like the existing solution for arm32.

> > static const struct cpu_efficiency table_efficiency[] = {
> > 	{"arm,cortex-a15", 3891},
> > 	{"arm,cortex-a7", 2048},
> > 	{NULL, },
> > };

> > When the clock-frequency property is defined in DT, we try to find a match for the compatibility string in the table above and then use the associated number to compute the capacity. Are you proposing to have something like this for arm64 as well?

> > BTW, the only info I could find about those numbers is from this thread

> It was discussed in some other thread when I was sending the equivalent stuff for arm64 (I never got round to finishing it off due to issues with Catalin and Will being concerned about the specific numbers). Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz numbers that ARM publish for the cores. I'd independently done the same thing for arm64. It would probably help to put comments in there with the base numbers before scaling, or just redo the table in terms of the raw numbers.

> This is, of course, an example of my concerns about magic number configuration.
> > > If I understand how that table was created, how do we think we will > > extend it in the future to allow newer core types (say we replicate this > > solution for arm64)? It seems that we have to change it, rescaling > > values, each time we have a new core on the market. How can we come up > > with relative numbers, in the future, comparing newer cores to old ones > > (that might be already out of the market by that time)? > > It doesn't seem particularly challenging to add new numbers to the table > (and add additional properties to select on) TBH. We can either rescale > by hand in the table when adding entries, script it as part of the > kernel build or do it at runtime (as the arm32 code already does to an > extent based on the particular set of cores we find). What difficulties > do you see with this? > > This is something that seems like an advantage to me - we can just > replace everything at any point, we're not tied to trusting the golden > benchmark someone did (or tweaked) if we come up with a better > methodology later on. I really don't want to see a table of magic numbers in the kernel. The relative performance and efficiency of cores will vary depending on uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition to general uArch differences, and integration too (e.g. if the memory system gives priority to one cluster over another for whatever reason). I've heard of pseudo-heterogeneous platforms with different configuration of the same uArch across clusters. We also don't necessarily have the CPU clock frequencies, or the ability to scale them. Maybe we simply give up in that case, though. If we cannot rely on external information, and want this information to be derived by the kernel, then we need to perform some dynamic benchmark. That would work for future CPUs the kernel knows nothing about yet, and would cater for the pseudo-heterogeneous cases too. Thanks, Mark. 
Hi Mark,

On 15/12/15 14:01, Mark Rutland wrote:
> On Tue, Dec 15, 2015 at 01:39:51PM +0000, Mark Brown wrote:
> > On Tue, Dec 15, 2015 at 12:22:38PM +0000, Juri Lelli wrote:

> > > > So then why isn't it adequate to just have things like the core types in there and work from there? Are we really expecting the tuning to be so much better than it's possible to come up with something that's so much better on the scale that we're expecting this to be accurate that it's worth just jumping straight to magic numbers?

> > > I take your point here that having fine grained values might not really give us appreciable differences (that is also why I proposed the capacity-scale in the first instance), but I'm not sure I'm getting what you are proposing here.

> > Something like the existing solution for arm32.

> > > static const struct cpu_efficiency table_efficiency[] = {
> > > 	{"arm,cortex-a15", 3891},
> > > 	{"arm,cortex-a7", 2048},
> > > 	{NULL, },
> > > };

> > > When the clock-frequency property is defined in DT, we try to find a match for the compatibility string in the table above and then use the associated number to compute the capacity. Are you proposing to have something like this for arm64 as well?

> > > BTW, the only info I could find about those numbers is from this thread

> > It was discussed in some other thread when I was sending the equivalent stuff for arm64 (I never got round to finishing it off due to issues with Catalin and Will being concerned about the specific numbers). Vincent confirmed that the numbers came from the (IIRC) DMIPS/MHz numbers that ARM publish for the cores. I'd independently done the same thing for arm64. It would probably help to put comments in there with the base numbers before scaling, or just redo the table in terms of the raw numbers.
> > > > This is, of course, an example of my concerns about magic number > > configuration. > > > > > If I understand how that table was created, how do we think we will > > > extend it in the future to allow newer core types (say we replicate this > > > solution for arm64)? It seems that we have to change it, rescaling > > > values, each time we have a new core on the market. How can we come up > > > with relative numbers, in the future, comparing newer cores to old ones > > > (that might be already out of the market by that time)? > > > > It doesn't seem particularly challenging to add new numbers to the table > > (and add additional properties to select on) TBH. We can either rescale > > by hand in the table when adding entries, script it as part of the > > kernel build or do it at runtime (as the arm32 code already does to an > > extent based on the particular set of cores we find). What difficulties > > do you see with this? > > > > This is something that seems like an advantage to me - we can just > > replace everything at any point, we're not tied to trusting the golden > > benchmark someone did (or tweaked) if we come up with a better > > methodology later on. > > I really don't want to see a table of magic numbers in the kernel. > Doesn't seem to be a clean and scalable solution to me either. It is not easy to reconfigure when new core types come around, as I don't think relative data is always present or easy to derive, and it exposes some sort of centralized global information where everyone is compared against everyone. Where the DT solution is inherently per platform: no need to expose absolute values and no problems with knowing data regarding old core types. > The relative performance and efficiency of cores will vary depending on > uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition > to general uArch differences, and integration too (e.g. if the memory > system gives priority to one cluster over another for whatever reason). 
> I've heard of pseudo-heterogeneous platforms with different configuration of the same uArch across clusters.
>
> We also don't necessarily have the CPU clock frequencies, or the ability to scale them. Maybe we simply give up in that case, though.
>
> If we cannot rely on external information, and want this information to be derived by the kernel, then we need to perform some dynamic benchmark. That would work for future CPUs the kernel knows nothing about yet, and would cater for the pseudo-heterogeneous cases too.

I've actually experimented a bit with this approach already, but I wasn't convinced of its viability. It is true that we remove the burden of coming up with default values from the user/integrator, but I'm pretty sure we will end up discussing endlessly about which particular benchmark to pick and the fact that it impacts on boot time and such.

Best,

- Juri
On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote: > Hi Mark, Hi Juri, > On 15/12/15 14:01, Mark Rutland wrote: > > I really don't want to see a table of magic numbers in the kernel. > > Doesn't seem to be a clean and scalable solution to me either. It is not > easy to reconfigure when new core types come around, as I don't think > relative data is always present or easy to derive, and it exposes some > sort of centralized global information where everyone is compared > against everyone. I'm also concerned that it will be difficult to curate this to avoid deceptive marketing numbers. These may not reflect reality. > Where the DT solution is inherently per platform: no need to expose > absolute values and no problems with knowing data regarding old core > types. The DT approach certainly avoids tying the kernel to a given idea of particular microarchitectures. > > If we cannot rely on external information, and want this information to > > be derived by the kernel, then we need to perform some dynamic > > benchmark. That would work for future CPUs the kernel knows nothing > > about yet, and would cater for the pseudo-heterogeneous cases too. > > I've actually experimented a bit with this approch already, but I wasn't > convinced of its viability. It is true that we remove the burden of > coming up with default values from user/integrator, but I'm pretty sure > we will end up discussing endlessly about which particular benchmark to > pick Regardless of which direction we go there will be endless discussion as to the benchmark. As Mark pointed out, that happened in the case of the table, and it's happening now for the DT approach. I think we agree that if this is something we can change later (i.e. we don't rely on an external oracle like DT) the particular benchmark matters less, as that can be changed given evidence of superiority. > and the fact that it impacts on boot time and such. 
I was under the impression that the kernel already did RAID algorithm benchmarking as part of the boot process. Maybe we can find a set of similarly brief benchmarks.

Thanks,
Mark.
On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote: > I really don't want to see a table of magic numbers in the kernel. Right, there's pitfalls there too although not being part of an ABI does make them more manageable. One thing it's probably helpful to establish here is how much the specific numbers are going to matter in the grand scheme of things. If the specific numbers *are* super important then nobody is going to want to touch them as they'll be prone to getting tweaked. If instead the numbers just need to be ballpark accurate so the scheduler starts off in roughly the right place and the specific numbers don't matter it's a lot easier and having a table in the kernel until we think of something better (if that ever happens) gets a lot easier. My expectation is that we just need good enough, not perfect, and that seems to match what Juri is saying about the expectation that most of the fine tuning is done via other knobs. > The relative performance and efficiency of cores will vary depending on > uArch-specific configuration (e.g. sizing of L1/L2 caches) in addition > to general uArch differences, and integration too (e.g. if the memory > system gives priority to one cluster over another for whatever reason). > I've heard of pseudo-heterogeneous platforms with different > configuration of the same uArch across clusters. > We also don't necessarily have the CPU clock frequencies, or the ability > to scale them. Maybe we simply give up in that case, though. These variables all sound like the sort of thing we can get people to put in the DT where they matter, and the more we talk about multiple subtle variables feeding into a single number the more this starts to sound like tuning (with all the problems that brings, especially when you're into things like the behaviour with the clusters competing for resources). Who knows, at some point the scheduler may even want to directly act on some of these parameters? 
> If we cannot rely on external information, and want this information to > be derived by the kernel, then we need to perform some dynamic > benchmark. That would work for future CPUs the kernel knows nothing > about yet, and would cater for the pseudo-heterogeneous cases too. Runtime benchmarking (or gradual tuning based on observed performance of the scheduler) would work as well of course. I'd expect something like that to be part of any end point we get to here, the more the system is able to self tune the better.
On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote:
> On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote:
>
> > I really don't want to see a table of magic numbers in the kernel.
>
> Right, there's pitfalls there too although not being part of an ABI does make them more manageable.

I think that people are very likely to treat them exactly like an ABI, w.r.t. any regressions in performance that result from their addition, modification, or removal. That becomes really horrible when new CPUs appear.

> One thing it's probably helpful to establish here is how much the specific numbers are going to matter in the grand scheme of things. If the specific numbers *are* super important then nobody is going to want to touch them as they'll be prone to getting tweaked. If instead the numbers just need to be ballpark accurate so the scheduler starts off in roughly the right place and the specific numbers don't matter it's a lot easier and having a table in the kernel until we think of something better (if that ever happens) gets a lot easier.

I agree that we first need to figure out the importance of these numbers. I disagree that our first step should be to add a table.

> My expectation is that we just need good enough, not perfect, and that seems to match what Juri is saying about the expectation that most of the fine tuning is done via other knobs.

My expectation is that if a ballpark figure is good enough, it should be possible to implement something trivial like a bogomips / loops_per_jiffy calculation.

Thanks,
Mark.
On 15/12/15 14:50, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 02:24:58PM +0000, Juri Lelli wrote: > > Hi Mark, > > Hi Juri, > > > On 15/12/15 14:01, Mark Rutland wrote: > > > I really don't want to see a table of magic numbers in the kernel. > > > > Doesn't seem to be a clean and scalable solution to me either. It is not > > easy to reconfigure when new core types come around, as I don't think > > relative data is always present or easy to derive, and it exposes some > > sort of centralized global information where everyone is compared > > against everyone. > > I'm also concerned that it will be difficult to curate this to avoid > deceptive marketing numbers. These may not reflect reality. > Right. > > Where the DT solution is inherently per platform: no need to expose > > absolute values and no problems with knowing data regarding old core > > types. > > The DT approach certainly avoids tying the kernel to a given idea of > particular microarchitectures. > > > > If we cannot rely on external information, and want this information to > > > be derived by the kernel, then we need to perform some dynamic > > > benchmark. That would work for future CPUs the kernel knows nothing > > > about yet, and would cater for the pseudo-heterogeneous cases too. > > > > I've actually experimented a bit with this approch already, but I wasn't > > convinced of its viability. It is true that we remove the burden of > > coming up with default values from user/integrator, but I'm pretty sure > > we will end up discussing endlessly about which particular benchmark to > > pick > > Regardless of which direction we go there will be endless discussion as > to the benchmark. As Mark pointed out, that happened in the case of the > table, and it's happening now for the DT approach. > > I think we agree that if this is something we can change later (i.e. 
we > don't rely on an external oracle like DT) the particular benchmark > matters less, as that can be changed given evidence of superiority. > True, and in fact we already only point out suggestions for sensible benchmarks. I don't think we want to tie ourselves to any particular one. And I think not having a particular benchmark implementation as part of the solution will help us change our minds afterwards. > > and the fact that it impacts on boot time and such. > > I was under the impression that the kernel already did RAID algorithm > benchmarking as part of the boot process. Maybe we can find a set of > similarly brief benchmarks. > Yeah, it's certainly doable, but I'm not really seeing yet what we gain from this additional complexity. Also, AFAIK boot time is an important metric for the mobile market (while it is less crucial for systems with RAID configurations?) and we are going to add overhead there. Stability of default values is probably another factor here, since you could be rebooting your phone under many different conditions from time to time. Best, - Juri
On 15/12/15 15:32, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote: > > > > > I really don't want to see a table of magic numbers in the kernel. > > > > Right, there's pitfalls there too although not being part of an ABI > > does make them more manageable. > > I think that people are very likely to treat them exactly like an ABI, > w.r.t. any regressions in performance that result from their addition, > modification, or removal. That becomes really horrible when new CPUs > appear. > Yeah, and I guess the path towards out-of-tree patches changing these values for a specific platform (without exposing the same changes upstream) is not too far away. > > One thing it's probably helpful to establish here is how much the > > specific numbers are going to matter in the grand scheme of things. If > > the specific numbers *are* super important then nobody is going to want > > to touch them as they'll be prone to getting tweaked. If instead the > > numbers just need to be ballpark accurate so the scheduler starts off in > > roughly the right place and the specific numbers don't matter it's a lot > > easier and having a table in the kernel until we think of something > > better (if that ever happens) gets a lot easier. > > I agree that we first need to figure out the importance of these > numbers. I disagree that our first step should be to add a table. > My take is that ballpark is fine, but it's a per platform/configuration ballpark that we need. Not a per core-type one. > > My expectation is that we just need good enough, not perfect, and that > > seems to match what Juri is saying about the expectation that most of > > the fine tuning is done via other knobs. > > My expectation is that if a ballpark figure is good enough, it should be > possible to implement something trivial like bogomips / loop_per_jiffy > calculation. 
> I didn't really follow that, so I might be wrong here, but hasn't there already been a discussion about how we want to stop exposing bogomips info or relying on it for anything but in-kernel delay loops? Thanks, - Juri
On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: > On 15/12/15 15:32, Mark Rutland wrote: > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > > My expectation is that we just need good enough, not perfect, and that > > > seems to match what Juri is saying about the expectation that most of > > > the fine tuning is done via other knobs. > > > > My expectation is that if a ballpark figure is good enough, it should be > > possible to implement something trivial like bogomips / loop_per_jiffy > > calculation. > > > > I didn't really followed that, so I might be wrong here, but isn't > already happened a discussion about how we want/like to stop exposing > bogomips info or rely on it for anything but in kernel delay loops? I meant that we could have a benchmark of that level of complexity, rather than those specific values. Mark.
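For illustration only, a user-space sketch of the kind of trivial, bogomips-level calibration being discussed. Everything here is an assumption for the sake of example (the busy loop, the 50 ms window, the 1024 scale, the sample scores) and is not taken from the patch set:

```python
import time

def busy_loop_score(duration=0.05):
    # Count iterations of a trivial loop completed in a fixed wall-clock
    # window; similar in spirit to the kernel's loops_per_jiffy calibration.
    end = time.perf_counter() + duration
    iters = 0
    while time.perf_counter() < end:
        iters += 1
    return iters

def normalize(scores, scale=1024):
    # Express each CPU's score relative to the best-performing one.
    best = max(scores.values())
    return {cpu: round(s * scale / best) for cpu, s in scores.items()}

# Hypothetical raw scores; on a real system each one would be measured
# with the calibration thread pinned to that CPU (os.sched_setaffinity).
raw = {0: 1024000, 1: 1024000, 2: 425000, 3: 425000}
print(normalize(raw))  # {0: 1024, 1: 1024, 2: 425, 3: 425}
```

Whether a ballpark produced this way is good enough is exactly the open question in the thread.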
On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: > > On 15/12/15 15:32, Mark Rutland wrote: > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > > > My expectation is that we just need good enough, not perfect, and that > > > > seems to match what Juri is saying about the expectation that most of > > > > the fine tuning is done via other knobs. > > > > > > My expectation is that if a ballpark figure is good enough, it should be > > > possible to implement something trivial like bogomips / loop_per_jiffy > > > calculation. > > > > I didn't really followed that, so I might be wrong here, but isn't > > already happened a discussion about how we want/like to stop exposing > > bogomips info or rely on it for anything but in kernel delay loops? > > I meant that we could have a benchmark of that level of complexity, > rather than those specific values. Or we could simply let user space use whatever benchmarks or hard-coded values it wants and set the capacity via sysfs (during boot). By default, the kernel would assume all CPUs equal.
On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote: > On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: > > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: > > > On 15/12/15 15:32, Mark Rutland wrote: > > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > > > > My expectation is that we just need good enough, not perfect, and that > > > > > seems to match what Juri is saying about the expectation that most of > > > > > the fine tuning is done via other knobs. > > > > > > > > My expectation is that if a ballpark figure is good enough, it should be > > > > possible to implement something trivial like bogomips / loop_per_jiffy > > > > calculation. > > > > > > I didn't really followed that, so I might be wrong here, but isn't > > > already happened a discussion about how we want/like to stop exposing > > > bogomips info or rely on it for anything but in kernel delay loops? > > > > I meant that we could have a benchmark of that level of complexity, > > rather than those specific values. > > Or we could simply let user space use whatever benchmarks or hard-coded > values it wants and set the capacity via sysfs (during boot). By > default, the kernel would assume all CPUs equal. I assume that a userspace override would be available regardless of whatever mechanism the kernel uses to determine relative performance/efficiency. I am not opposed to that mechanism being "assume equal". Mark.
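A sketch of what the boot-time userspace override Catalin suggests could look like. The sysfs attribute name (`cpu_capacity`) and its writability are assumptions for illustration only, not an existing kernel interface:

```python
import os

def set_capacities(capacities, base="/sys/devices/system/cpu"):
    # Write per-CPU capacity values chosen by userspace (from whatever
    # benchmark or hard-coded table it prefers) into a hypothetical
    # writable sysfs attribute. Writing would be a privileged operation.
    applied = []
    for cpu, cap in sorted(capacities.items()):
        attr = os.path.join(base, "cpu%d" % cpu, "cpu_capacity")
        if os.access(attr, os.W_OK):
            with open(attr, "w") as f:
                f.write(str(cap))
        applied.append((cpu, cap))
    return applied

# Until something like this runs, the kernel would assume all CPUs equal.
print(set_capacities({0: 1024, 1: 1024, 2: 425, 3: 425}))
```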
On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote: > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote: >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: >> > > On 15/12/15 15:32, Mark Rutland wrote: >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: >> > > > > My expectation is that we just need good enough, not perfect, and that >> > > > > seems to match what Juri is saying about the expectation that most of >> > > > > the fine tuning is done via other knobs. >> > > > >> > > > My expectation is that if a ballpark figure is good enough, it should be >> > > > possible to implement something trivial like bogomips / loop_per_jiffy >> > > > calculation. >> > > >> > > I didn't really followed that, so I might be wrong here, but isn't >> > > already happened a discussion about how we want/like to stop exposing >> > > bogomips info or rely on it for anything but in kernel delay loops? >> > >> > I meant that we could have a benchmark of that level of complexity, >> > rather than those specific values. >> >> Or we could simply let user space use whatever benchmarks or hard-coded >> values it wants and set the capacity via sysfs (during boot). By >> default, the kernel would assume all CPUs equal. > > I assume that a userspace override would be available regardless of > whatever mechanism the kernel uses to determine relative > performance/effinciency. Don't you think that if we let a complete latitude to the userspace to set whatever it wants, it will be used to abuse the kernel (and the scheduler in particular ) and that this will finish in a real mess to understand what is wrong when a task is not placed where it should be. 
We can probably provide a debug mode to help SoC manufacturers define their capacity values, but IMHO we should not allow complete latitude in normal operation. In normal operation we need to give some methods to tweak the value to reflect memory-bound or integer-calculation work, or other kinds of work that currently run on the cpu, but no more. Vincent > > I am not opposed to that mechanism being "assume equal". > > Mark.
On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote: > On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote: > >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: > >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: > >> > > On 15/12/15 15:32, Mark Rutland wrote: > >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > >> > > > > My expectation is that we just need good enough, not perfect, and that > >> > > > > seems to match what Juri is saying about the expectation that most of > >> > > > > the fine tuning is done via other knobs. > >> > > > > >> > > > My expectation is that if a ballpark figure is good enough, it should be > >> > > > possible to implement something trivial like bogomips / loop_per_jiffy > >> > > > calculation. > >> > > > >> > > I didn't really followed that, so I might be wrong here, but isn't > >> > > already happened a discussion about how we want/like to stop exposing > >> > > bogomips info or rely on it for anything but in kernel delay loops? > >> > > >> > I meant that we could have a benchmark of that level of complexity, > >> > rather than those specific values. > >> > >> Or we could simply let user space use whatever benchmarks or hard-coded > >> values it wants and set the capacity via sysfs (during boot). By > >> default, the kernel would assume all CPUs equal. > > > > I assume that a userspace override would be available regardless of > > whatever mechanism the kernel uses to determine relative > > performance/effinciency. > > Don't you think that if we let a complete latitude to the userspace > to set whatever it wants, it will be used to abuse the kernel (and the > scheduler in particular ) and that this will finish in a real mess to > understand what is wrong when a task is not placed where it should be. I'm not sure I follow what you mean by "abuse" here. 
Userspace currently can force the scheduler to make sub-optimal decisions in a number of ways, e.g. * Hot-unplugging the preferred CPUs * Changing a task's affinity mask * Setting the nice value of a task * Using rlimits and/or cgroups * Using a cpufreq governor * Fork-bombing Practically all of these are privileged operations. I would envisage the userspace interface for "capacity" management to be similar. > We can probably provide a debug mode to help soc manufacturer to > define their capacity value but IMHO we should not let complete > latitude in normal operation In normal operation userspace wouldn't mess with this, as with most of the cases above. Userspace can already shoot itself in the foot. > In normal operation we need to give some methods to tweak the value to > reflect a memory bounded or integer calculation work or other kind of > work that currently runs on the cpu but not more You can already do that with the mechanisms above, to some extent. I'm not sure I follow. Mark.
On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote: > > > I really don't want to see a table of magic numbers in the kernel. > > Right, there's pitfalls there too although not being part of an ABI > > does make them more manageable. > I think that people are very likely to treat them exactly like an ABI, > w.r.t. any regressions in performance that result from their addition, > modification, or removal. That becomes really horrible when new CPUs > appear. Obviously people are going to get upset if we introduce performance regressions - but that's true always, we can also introduce problems with numbers people have put in DT. It seems like it'd be harder to manage regressions due to externally provided magic numbers since there's inherently less information there. > > One thing it's probably helpful to establish here is how much the > > specific numbers are going to matter in the grand scheme of things. If > > the specific numbers *are* super important then nobody is going to want > > to touch them as they'll be prone to getting tweaked. If instead the > > numbers just need to be ballpark accurate so the scheduler starts off in > > roughly the right place and the specific numbers don't matter it's a lot > > easier and having a table in the kernel until we think of something > > better (if that ever happens) gets a lot easier. > I agree that we first need to figure out the importance of these > numbers. I disagree that our first step should be to add a table. My point there is that if we're not that concerned about the specific number something in kernel is safer. > > My expectation is that we just need good enough, not perfect, and that > > seems to match what Juri is saying about the expectation that most of > > the fine tuning is done via other knobs. 
> My expectation is that if a ballpark figure is good enough, it should be > possible to implement something trivial like bogomips / loop_per_jiffy > calculation. That does have the issue that we need to scale with regard to the frequency the benchmark gets run at. That's not an insurmountable obstacle but it's not completely trivial either.
On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote: > On Tue, Dec 15, 2015 at 03:32:19PM +0000, Mark Rutland wrote: > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > > > On Tue, Dec 15, 2015 at 02:01:36PM +0000, Mark Rutland wrote: > > > > > I really don't want to see a table of magic numbers in the kernel. > > > > Right, there's pitfalls there too although not being part of an ABI > > > does make them more manageable. > > > I think that people are very likely to treat them exactly like an ABI, > > w.r.t. any regressions in performance that result from their addition, > > modification, or removal. That becomes really horrible when new CPUs > > appear. > > Obviously people are going to get upset if we introduce performance > regressions - but that's true always, we can also introduce problems > with numbers people have put in DT. It seems like it'd be harder to > manage regressions due to externally provided magic numbers since > there's inherently less information there. It's certainly still possible to have regressions in that case. Those regressions would be due to code changes in the kernel, given the DT didn't change. I'm not sure I follow w.r.t. "inherently less information", unless you mean trying to debug without access to that DTB? > > > One thing it's probably helpful to establish here is how much the > > > specific numbers are going to matter in the grand scheme of things. If > > > the specific numbers *are* super important then nobody is going to want > > > to touch them as they'll be prone to getting tweaked. If instead the > > > numbers just need to be ballpark accurate so the scheduler starts off in > > > roughly the right place and the specific numbers don't matter it's a lot > > > easier and having a table in the kernel until we think of something > > > better (if that ever happens) gets a lot easier. > > > I agree that we first need to figure out the importance of these > > numbers. 
I disagree that our first step should be to add a table. > > My point there is that if we're not that concerned about the specific > number something in kernel is safer. I don't entirely disagree there. I think an in-kernel benchmark is likely safer. > > > My expectation is that we just need good enough, not perfect, and that > > > seems to match what Juri is saying about the expectation that most of > > > the fine tuning is done via other knobs. > > > My expectation is that if a ballpark figure is good enough, it should be > > possible to implement something trivial like bogomips / loop_per_jiffy > > calculation. > > That does have the issue that we need to scale with regard to the > frequency the benchmark gets run at. That's not an insurmountable > obstacle but it's not completely trivial either. If we change clock frequency, then regardless of where the information comes from we need to perform scaling, no? One nice thing about doing a benchmark to derive the numbers is that when the frequency is fixed but the kernel cannot query it, the numbers will still be representative. Mark.
On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote: > > Obviously people are going to get upset if we introduce performance > > regressions - but that's true always, we can also introduce problems > > with numbers people have put in DT. It seems like it'd be harder to > > manage regressions due to externally provided magic numbers since > > there's inherently less information there. > It's certainly still possible to have regressions in that case. Those > regressions would be due to code changes in the kernel, given the DT > didn't change. > I'm not sure I follow w.r.t. "inherently less information", unless you > mean trying to debug without access to that DTB? If what the kernel knows about the system is that it's got a bunch of cores with numbers assigned to them then all it's really got is those numbers. If something changes that causes problems for some systems (eg, because the numbers have been picked poorly but in a way that happened to work well with the old code) that's not a lot to go on, the more we know about the system the more likely it is that we'll be able to adjust the assumptions in whatever new thing we do that causes problems for any particular systems where we run into trouble. > > My point there is that if we're not that concerned about the specific > > number something in kernel is safer. > I don't entirely disagree there. I think an in-kernel benchmark is > likely safer. Yes, I think that something where we just observe the system performance at runtime is likely one of the best solutions if we can get something that gives reasonable results. > > That does have the issue that we need to scale with regard to the > > frequency the benchmark gets run at. That's not an insurmountable > > obstacle but it's not completely trivial either. > If we change clock frequency, then regardless of where the information > comes from we need to perform scaling, no? 
Yes, it's just a question of making the benchmarking bit talk to the scaling bit so we know where we're at when we do the benchmark. Like I say it should be doable. > One nice thing about doing a benchmark to derive the numbers is that > when the frequency is fixed but the kernel cannot query it, the numbers > will be representative. Definitely.
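The scaling problem being discussed can be stated numerically; a minimal sketch, assuming (as a simplification) that benchmark performance tracks frequency linearly:

```python
def freq_invariant_score(raw_score, cur_freq_khz, max_freq_khz):
    # Project a score measured at cur_freq to what it would have been at
    # max_freq. Linear scaling is an assumption; memory-bound work
    # typically scales less than linearly with frequency.
    return raw_score * max_freq_khz // cur_freq_khz

# e.g. a core that was benchmarked while running at half its maximum rate
print(freq_invariant_score(200000, 700000, 1400000))  # 400000
```

This is the "benchmarking bit talks to the scaling bit" step: the calibration code has to know the frequency it ran at before its result can be compared across CPUs.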
On 15 December 2015 at 18:15, Mark Rutland <mark.rutland@arm.com> wrote: > On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote: >> On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote: >> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote: >> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: >> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: >> >> > > On 15/12/15 15:32, Mark Rutland wrote: >> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: >> >> > > > > My expectation is that we just need good enough, not perfect, and that >> >> > > > > seems to match what Juri is saying about the expectation that most of >> >> > > > > the fine tuning is done via other knobs. >> >> > > > >> >> > > > My expectation is that if a ballpark figure is good enough, it should be >> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy >> >> > > > calculation. >> >> > > >> >> > > I didn't really followed that, so I might be wrong here, but isn't >> >> > > already happened a discussion about how we want/like to stop exposing >> >> > > bogomips info or rely on it for anything but in kernel delay loops? >> >> > >> >> > I meant that we could have a benchmark of that level of complexity, >> >> > rather than those specific values. >> >> >> >> Or we could simply let user space use whatever benchmarks or hard-coded >> >> values it wants and set the capacity via sysfs (during boot). By >> >> default, the kernel would assume all CPUs equal. >> > >> > I assume that a userspace override would be available regardless of >> > whatever mechanism the kernel uses to determine relative >> > performance/effinciency. 
>> >> Don't you think that if we let a complete latitude to the userspace >> to set whatever it wants, it will be used to abuse the kernel (and the >> scheduler in particular ) and that this will finish in a real mess to >> understand what is wrong when a task is not placed where it should be. > > I'm not sure I follow what you mean by "abuse" here. Userspace currently > can force the scheduler to make sub-optimal decisions in a number of > ways, e.g. > > * Hot-unplugging the preferred CPUs > * Changing a task's affinity mask > * Setting the nice value of a task > * Using rlimits and/or cgroups > * Using a cpufreq governor > * Fork-bombing All of these parameters have a meaning (except the last one). By abusing I mean setting the capacity of the most powerful cpu to 1 for no good reason, except to abuse the scheduler so that it will not put as many tasks on it, just because the current running use case is more efficient if the big core is not used. > > Practically all of these are prvileged operations. I would envisage the > userspace interface for "capacity" management to be similar. > >> We can probably provide a debug mode to help soc manufacturer to >> define their capacity value but IMHO we should not let complete >> latitude in normal operation > > In normal operation userspace wouldn't mess with this, as with most of > the cases above. Userspace can already shooti tself in the foot. > >> In normal operation we need to give some methods to tweak the value to >> reflect a memory bounded or integer calculation work or other kind of >> work that currently runs on the cpu but not more > > You can already do that with the mechanisms above, to some extent. I'm > not sure I follow. > > Mark.
On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote: > On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote: > > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote: > > > > Obviously people are going to get upset if we introduce performance > > > regressions - but that's true always, we can also introduce problems > > > with numbers people have put in DT. It seems like it'd be harder to > > > manage regressions due to externally provided magic numbers since > > > there's inherently less information there. > > > It's certainly still possible to have regressions in that case. Those > > regressions would be due to code changes in the kernel, given the DT > > didn't change. > > > I'm not sure I follow w.r.t. "inherently less information", unless you > > mean trying to debug without access to that DTB? > > If what the kernel knows about the system is that it's got a bunch of > cores with numbers assigned to them then all it's really got is those > numbers. If something changes that causes problems for some systems > (eg, because the numbers have been picked poorly but in a way that > happened to work well with the old code) that's not a lot to go on, the > more we know about the system the more likely it is that we'll be able > to adjust the assumptions in whatever new thing we do that causes > problems for any particular systems where we run into trouble. Regardless of where the numbers live (DT or kernel), all we have are numbers. I can see that changing the in-kernel numbers would be possible when modifying the DT is not, but I don't see how that gives you more information. Mark.
On Tue, Dec 15, 2015 at 06:47:20PM +0100, Vincent Guittot wrote: > On 15 December 2015 at 18:15, Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Dec 15, 2015 at 05:59:34PM +0100, Vincent Guittot wrote: > >> On 15 December 2015 at 17:41, Mark Rutland <mark.rutland@arm.com> wrote: > >> > On Tue, Dec 15, 2015 at 04:23:18PM +0000, Catalin Marinas wrote: > >> >> On Tue, Dec 15, 2015 at 03:57:37PM +0000, Mark Rutland wrote: > >> >> > On Tue, Dec 15, 2015 at 03:46:51PM +0000, Juri Lelli wrote: > >> >> > > On 15/12/15 15:32, Mark Rutland wrote: > >> >> > > > On Tue, Dec 15, 2015 at 03:08:13PM +0000, Mark Brown wrote: > >> >> > > > > My expectation is that we just need good enough, not perfect, and that > >> >> > > > > seems to match what Juri is saying about the expectation that most of > >> >> > > > > the fine tuning is done via other knobs. > >> >> > > > > >> >> > > > My expectation is that if a ballpark figure is good enough, it should be > >> >> > > > possible to implement something trivial like bogomips / loop_per_jiffy > >> >> > > > calculation. > >> >> > > > >> >> > > I didn't really followed that, so I might be wrong here, but isn't > >> >> > > already happened a discussion about how we want/like to stop exposing > >> >> > > bogomips info or rely on it for anything but in kernel delay loops? > >> >> > > >> >> > I meant that we could have a benchmark of that level of complexity, > >> >> > rather than those specific values. > >> >> > >> >> Or we could simply let user space use whatever benchmarks or hard-coded > >> >> values it wants and set the capacity via sysfs (during boot). By > >> >> default, the kernel would assume all CPUs equal. > >> > > >> > I assume that a userspace override would be available regardless of > >> > whatever mechanism the kernel uses to determine relative > >> > performance/effinciency. 
> >> >> Don't you think that if we let a complete latitude to the userspace > >> to set whatever it wants, it will be used to abuse the kernel (and the > >> scheduler in particular ) and that this will finish in a real mess to > >> understand what is wrong when a task is not placed where it should be. > > > > I'm not sure I follow what you mean by "abuse" here. Userspace currently > > can force the scheduler to make sub-optimal decisions in a number of > > ways, e.g. > > > > * Hot-unplugging the preferred CPUs > > * Changing a task's affinity mask > > * Setting the nice value of a task > > * Using rlimits and/or cgroups > > * Using a cpufreq governor > > * Fork-bombing > > All these are parameters have a meaning (except the last one). By > abusing i mean setting the capacity of the most powerful cpu to 1 for > no good reason except to abuse the scheduler so the latter will not > put that much tasks on it just because the current running use case > is more efficient if the big core is not used. Surely it's better to allow them to "abuse" the kernel in that manner than to place otherwise insane values into a DT? Especially if they can later change to a sane value? For that particular case it's easy to hotplug out the big core, or to set the affinity of tasks to avoid it. Mark.
On Tue, Dec 15, 2015 at 06:10:03PM +0000, Mark Rutland wrote: > On Tue, Dec 15, 2015 at 05:45:16PM +0000, Mark Brown wrote: > > > I'm not sure I follow w.r.t. "inherently less information", unless you > > > mean trying to debug without access to that DTB? > > If what the kernel knows about the system is that it's got a bunch of > > cores with numbers assigned to them then all it's really got is those > > numbers. If something changes that causes problems for some systems > > (eg, because the numbers have been picked poorly but in a way that > > happened to work well with the old code) that's not a lot to go on, the > > more we know about the system the more likely it is that we'll be able > > to adjust the assumptions in whatever new thing we do that causes > > problems for any particular systems where we run into trouble. > Regardless of where the numbers live (DT or kernel), all we have are > numbers. I can see that changing the in-kernel numbers would be possible > when modifyign the DT is not, but I don't see how that gives you more > information. It's mainly the modifying the DT case - you're not dealing with some external misguided number selection method you'd never thought of and you're not forcing some third party to redo benchmarks or adjust DTs they may not want to adjust. You're also able to readjust the numbers based on feedback if you need to rather than having to adapt algorithms to handle particular number selections, the algorithm and number selection are done together rather than separately.
Hi, On 15/12/15 17:45, Mark Brown wrote: > On Tue, Dec 15, 2015 at 05:28:37PM +0000, Mark Rutland wrote: > > On Tue, Dec 15, 2015 at 05:17:13PM +0000, Mark Brown wrote: > > > > Obviously people are going to get upset if we introduce performance > > > regressions - but that's true always, we can also introduce problems > > > with numbers people have put in DT. It seems like it'd be harder to > > > manage regressions due to externally provided magic numbers since > > > there's inherently less information there. > > > It's certainly still possible to have regressions in that case. Those > > regressions would be due to code changes in the kernel, given the DT > > didn't change. > > > I'm not sure I follow w.r.t. "inherently less information", unless you > > mean trying to debug without access to that DTB? > > If what the kernel knows about the system is that it's got a bunch of > cores with numbers assigned to them then all it's really got is those > numbers. If something changes that causes problems for some systems > (eg, because the numbers have been picked poorly but in a way that > happened to work well with the old code) that's not a lot to go on, the > more we know about the system the more likely it is that we'll be able > to adjust the assumptions in whatever new thing we do that causes > problems for any particular systems where we run into trouble. > > > > My point there is that if we're not that concerned about the specific > > > number something in kernel is safer. > > > I don't entirely disagree there. I think an in-kernel benchmark is > > likely safer. > > Yes, I think that something where we just observe the system performance > at runtime is likely one of the best solutions if we can get something > that gives reasonable results. > > > > That does have the issue that we need to scale with regard to the > > > frequency the benchmark gets run at. That's not an insurmountable > > > obstacle but it's not completely trivial either. 
> > If we change clock frequency, then regardless of where the information
> > comes from we need to perform scaling, no?
>
> Yes, it's just a question of making the benchmarking bit talk to the
> scaling bit so we know where we're at when we do the benchmark. Like I
> say, it should be doable.
>
> > One nice thing about doing a benchmark to derive the numbers is that
> > when the frequency is fixed but the kernel cannot query it, the
> > numbers will be representative.
>
> Definitely.

OK, let's see what a dynamic approach could look like. As said, since it
was actually our first thought too, I already have a possible
implementation of such a thing. I'll be OOO until early January, but
I'll try to rebase what I have and post it here as soon as I'm back;
then we can see which solution looks better.

Thanks a lot for the feedback!

Best,

- Juri
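[Editorial note] The benchmark-then-normalize scheme discussed above can be sketched as follows. This is a minimal illustration, not the eventual in-kernel implementation; the function name, the linear frequency-scaling assumption, and all numbers are hypothetical:

```python
# Sketch of deriving capacity values from per-CPU benchmark scores.
# Scores are assumed to scale roughly linearly with clock frequency,
# which is itself an approximation (memory-bound work does not).

CAPACITY_SCALE = 1024  # assumed default scale, matching the binding


def normalized_capacities(scores):
    """scores: {cpu_id: (raw_score, run_freq_hz, max_freq_hz)}.

    Scale each score up to what it would be at the CPU's maximum
    frequency, then normalize to the best-performing CPU.
    """
    scaled = {cpu: score * (f_max / f_run)
              for cpu, (score, f_run, f_max) in scores.items()}
    best = max(scaled.values())
    return {cpu: round(CAPACITY_SCALE * s / best)
            for cpu, s in scaled.items()}


# Example: a big core and a LITTLE core, each benchmarked at its own
# maximum frequency (so no rescaling actually happens here).
caps = normalized_capacities({
    0: (1000, 1.1e9, 1.1e9),   # big core
    1: (400, 850e6, 850e6),    # LITTLE core
})
# caps == {0: 1024, 1: 410}
```

When the benchmark cannot run at maximum frequency, the `f_max / f_run` factor is where the "benchmarking bit talks to the scaling bit", as discussed above.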
diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
new file mode 100644
index 0000000..2a00af0
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt
@@ -0,0 +1,227 @@
+==========================================
+ARM CPUs capacity bindings
+==========================================
+
+==========================================
+1 - Introduction
+==========================================
+
+ARM systems may be configured to have cpus with different power/performance
+characteristics within the same chip. In this case, additional information
+has to be made available to the kernel (the scheduler in particular) for
+it to be aware of such differences and take decisions accordingly.
+
+==========================================
+2 - CPU capacity definition
+==========================================
+
+CPU capacity is a number that gives the scheduler information about CPU
+heterogeneity. Such heterogeneity can come from micro-architectural
+differences (e.g., ARM big.LITTLE systems) or from the maximum frequency
+at which CPUs can run (e.g., SMP systems with multiple frequency domains).
+Heterogeneity in this context is about differing performance
+characteristics; this binding tries to capture a first-order
+approximation of the relative performance of CPUs.
+
+One simple way to estimate CPU capacities is to iteratively run a
+well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on
+each CPU at maximum frequency and then normalize values w.r.t. the best
+performing CPU. One could also do a statistically significant study of a
+wide collection of benchmarks, but the pros of such an approach are not
+really evident at the time of writing.
+
+==========================================
+3 - capacity-scale
+==========================================
+
+CPU capacities are defined with respect to the capacity-scale property in
+the cpus node [1].
+The property is optional; if not defined, a capacity-scale of 1024 is
+assumed. This property defines both the highest CPU capacity present in
+the system and the granularity of CPU capacity values.
+
+==========================================
+4 - capacity
+==========================================
+
+capacity is an optional cpu node [1] property: a u32 value representing
+CPU capacity, relative to capacity-scale. It is required and enforced
+that capacity <= capacity-scale.
+
+===========================================
+5 - Examples
+===========================================
+
+Example 1 (ARM 64-bit, 6-cpu system, two clusters):
+capacity-scale is not defined, so it is assumed to be 1024.
+
+cpus {
+	#address-cells = <2>;
+	#size-cells = <0>;
+
+	cpu-map {
+		cluster0 {
+			core0 {
+				cpu = <&A57_0>;
+			};
+			core1 {
+				cpu = <&A57_1>;
+			};
+		};
+
+		cluster1 {
+			core0 {
+				cpu = <&A53_0>;
+			};
+			core1 {
+				cpu = <&A53_1>;
+			};
+			core2 {
+				cpu = <&A53_2>;
+			};
+			core3 {
+				cpu = <&A53_3>;
+			};
+		};
+	};
+
+	idle-states {
+		entry-method = "arm,psci";
+
+		CPU_SLEEP_0: cpu-sleep-0 {
+			compatible = "arm,idle-state";
+			arm,psci-suspend-param = <0x0010000>;
+			local-timer-stop;
+			entry-latency-us = <100>;
+			exit-latency-us = <250>;
+			min-residency-us = <150>;
+		};
+
+		CLUSTER_SLEEP_0: cluster-sleep-0 {
+			compatible = "arm,idle-state";
+			arm,psci-suspend-param = <0x1010000>;
+			local-timer-stop;
+			entry-latency-us = <800>;
+			exit-latency-us = <700>;
+			min-residency-us = <2500>;
+		};
+	};
+
+	A57_0: cpu@0 {
+		compatible = "arm,cortex-a57","arm,armv8";
+		reg = <0x0 0x0>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A57_L2>;
+		clocks = <&scpi_dvfs 0>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <1024>;
+	};
+
+	A57_1: cpu@1 {
+		compatible = "arm,cortex-a57","arm,armv8";
+		reg = <0x0 0x1>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A57_L2>;
+		clocks = <&scpi_dvfs 0>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <1024>;
+	};
+
+	A53_0: cpu@100 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x100>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_1: cpu@101 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x101>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_2: cpu@102 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x102>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A53_3: cpu@103 {
+		compatible = "arm,cortex-a53","arm,armv8";
+		reg = <0x0 0x103>;
+		device_type = "cpu";
+		enable-method = "psci";
+		next-level-cache = <&A53_L2>;
+		clocks = <&scpi_dvfs 1>;
+		cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
+		capacity = <447>;
+	};
+
+	A57_L2: l2-cache0 {
+		compatible = "cache";
+	};
+
+	A53_L2: l2-cache1 {
+		compatible = "cache";
+	};
+};
+
+Example 2 (ARM 32-bit, 4-cpu system, two clusters,
+	   cpus 0,1 @ 1GHz, cpus 2,3 @ 500MHz):
+capacity-scale is equal to 2, so the first cluster is twice as fast as
+the second cluster (which matches the clock frequencies).
+
+cpus {
+	#address-cells = <1>;
+	#size-cells = <0>;
+	capacity-scale = <2>;
+
+	cpu0: cpu@0 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0>;
+		capacity = <2>;
+	};
+
+	cpu1: cpu@1 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <1>;
+		capacity = <2>;
+	};
+
+	cpu2: cpu@100 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0x100>;
+		capacity = <1>;
+	};
+
+	cpu3: cpu@101 {
+		device_type = "cpu";
+		compatible = "arm,cortex-a15";
+		reg = <0x101>;
+		capacity = <1>;
+	};
+};
+
+===========================================
+6 - References
+===========================================
+
+[1] ARM Linux Kernel documentation - CPUs bindings
+    Documentation/devicetree/bindings/arm/cpus.txt
diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt
index 91e6e5c..7593584 100644
--- a/Documentation/devicetree/bindings/arm/cpus.txt
+++ b/Documentation/devicetree/bindings/arm/cpus.txt
@@ -62,6 +62,14 @@ nodes to be present and contain the properties described below.
 		Value type: <u32>
 		Definition: must be set to 0
 
+	A cpus node may also define the following optional property:
+
+	- capacity-scale
+		Usage: optional
+		Value type: <u32>
+		Definition: value used as a reference for CPU capacity [4]
+			    (see below).
+
 - cpu node
 
 	Description: Describes a CPU in an ARM based system
@@ -231,6 +239,13 @@ nodes to be present and contain the properties described below.
 			  # List of phandles to idle state nodes supported
 			    by this cpu [3].
 
+	- capacity
+		Usage: Optional
+		Value type: <u32>
+		Definition:
+			# u32 value representing CPU capacity [4], relative to
+			  capacity-scale (see above).
+
 	- rockchip,pmu
 		Usage: optional for systems that have an "enable-method"
 		       property value of "rockchip,rk3066-smp"
@@ -437,3 +452,5 @@ cpus {
 [2] arm/msm/qcom,kpss-acc.txt
 [3] ARM Linux kernel documentation - idle states bindings
     Documentation/devicetree/bindings/arm/idle-states.txt
+[4] ARM Linux kernel documentation - cpu capacity bindings
+    Documentation/devicetree/bindings/arm/cpu-capacity.txt
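[Editorial note] A consumer of the bindings above would read each cpu node's capacity against the cpus node's capacity-scale and renormalize to its own internal range. A minimal sketch follows; the helper name, the dict representation of DT nodes, and the internal scale of 1024 are all hypothetical, not the actual kernel parsing code:

```python
# Sketch: renormalize DT capacity values (relative to capacity-scale,
# default 1024) to a fixed internal 0..1024 range.

INTERNAL_SCALE = 1024


def parse_capacities(cpus_node):
    """cpus_node: dict mapping property names and cpu-node names to
    values; cpu nodes are themselves dicts of properties."""
    scale = cpus_node.get("capacity-scale", 1024)  # default per binding
    caps = {}
    for name, node in cpus_node.items():
        if not isinstance(node, dict):
            continue  # a plain property, not a cpu node
        cap = node.get("capacity")
        if cap is None:
            continue
        # The binding requires and enforces capacity <= capacity-scale.
        if cap > scale:
            raise ValueError(f"{name}: capacity {cap} > capacity-scale {scale}")
        caps[name] = cap * INTERNAL_SCALE // scale
    return caps


# Example 2's numbers: capacity-scale = 2, big cores 2, LITTLE cores 1.
caps = parse_capacities({
    "capacity-scale": 2,
    "cpu@0": {"capacity": 2},
    "cpu@100": {"capacity": 1},
})
# caps == {"cpu@0": 1024, "cpu@100": 512}
```

This shows why capacity-scale only fixes granularity: Example 1 (scale 1024) and Example 2 (scale 2) both renormalize to the same internal range.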