Patchwork [1/5] sched: fix capacity calculations for SMT4

login
register
mail settings
Submitter Michael Neuling
Date April 9, 2010, 6:21 a.m.
Message ID <20100409062118.D4096CBB6C@localhost.localdomain>
Download mbox | patch
Permalink /patch/49770/
State Not Applicable
Headers show

Comments

Michael Neuling - April 9, 2010, 6:21 a.m.
When calculating capacity we use the following calculation:

       capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);

In SMT2, power will be 1178/2 (provided we are not scaling power with
freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
being 1.

With SMT4 however, power will be 1178/4, hence capacity will end up as
0.

Fix this by ensuring capacity is always at least 1 after this
calculation.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
I'm not sure this is the correct fix but this works for me.  
Original post here: 
  http://lkml.org/lkml/2010/3/30/884

---

 kernel/sched_fair.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
Peter Zijlstra - April 13, 2010, 12:29 p.m.
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> When calculating capacity we use the following calculation:
> 
>        capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> 
> In SMT2, power will be 1178/2 (provided we are not scaling power with
> freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> being 1.
> 
> With SMT4 however, power will be 1178/4, hence capacity will end up as
> 0.
> 
> Fix this by ensuring capacity is always at least 1 after this
> calculation.
> 
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> ---
> I'm not sure this is the correct fix but this works for me.  

Right, so I suspect this will indeed break some things.

We initially allowed 0 capacity for when a cpu is consumed by an RT task
and there simply isn't much capacity left, in that case you really want
to try and move load to your sibling cpus if possible.

However you're right that this goes awry in your case.

One thing to look at is if that 15% increase is indeed representative
for the power7 cpu, it having 4 SMT threads seems to suggest there was
significant gains, otherwise they'd not have wasted the silicon.

(The broken x86 code was meant to actually compute the SMT gain, so that
we'd not have to guess the 15%)

Now, increasing this will only marginally fix the issue, since if you
end up with 512 per thread it only takes a very tiny amount of RT
workload to drop below and end up at 0 again.

One thing we could look at is using the cpu base power to compute
capacity from. We'd have to add another field to sched_group and store
power before we do the scale_rt_power() stuff.

Thoughts?


>  kernel/sched_fair.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
>  			}
>  
>  			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> +			capacity = max(capacity, 1UL);
>  
>  			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
>  				nr_running /= 2;
> @@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
>  
>  	sgs->group_capacity =
>  		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> +	sgs->group_capacity = max(sgs->group_capacity, 1UL);
>  }
>  
>  /**
> @@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
>  
>  	for_each_cpu(i, sched_group_cpus(group)) {
>  		unsigned long power = power_of(i);
> -		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> +		unsigned long capacity;
>  		unsigned long wl;
>  
> +		capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
> +
>  		if (!cpumask_test_cpu(i, cpus))
>  			continue;
>
Michael Neuling - April 14, 2010, 4:28 a.m.
In message <1271161766.4807.1280.camel@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > When calculating capacity we use the following calculation:
> >=20
> >        capacity =3D DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> >=20
> > In SMT2, power will be 1178/2 (provided we are not scaling power with
> > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> > being 1.
> >=20
> > With SMT4 however, power will be 1178/4, hence capacity will end up as
> > 0.
> >=20
> > Fix this by ensuring capacity is always at least 1 after this
> > calculation.
> >=20
> > Signed-off-by: Michael Neuling <mikey@neuling.org>
> > ---
> > I'm not sure this is the correct fix but this works for me. =20
> 
> Right, so I suspect this will indeed break some things.
> 
> We initially allowed 0 capacity for when a cpu is consumed by an RT task
> and there simply isn't much capacity left, in that case you really want
> to try and move load to your sibling cpus if possible.

Changing the CPU power based on what tasks are running on them seems a
bit wrong to me.  Shouldn't we keep those concepts separate?

> However you're right that this goes awry in your case.
> 
> One thing to look at is if that 15% increase is indeed representative
> for the power7 cpu, it having 4 SMT threads seems to suggest there was
> significant gains, otherwise they'd not have wasted the silicon.

There are certainly, for most workloads, per core gains for SMT4 over
SMT2 on P7.  My kernels certainly compile faster and that's the only
workload anyone who matters cares about.... ;-)

> (The broken x86 code was meant to actually compute the SMT gain, so that
> we'd not have to guess the 15%)
> 
> Now, increasing this will only marginally fix the issue, since if you
> end up with 512 per thread it only takes a very tiny amount of RT
> workload to drop below and end up at 0 again.

I tried initialled to make smt_gain programmable and at 2048 per core
(512 per thread), the packing became unstable, as you intimated.

> One thing we could look at is using the cpu base power to compute
> capacity from. We'd have to add another field to sched_group and store
> power before we do the scale_rt_power() stuff.

Separating capacity from what RT tasks are running seems like a good
idea to me.

This would fix the RT issue, but it's not clear to me how you are
suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
the RT tasks out from capacity and keep the max(1, capacity) that I've
added?  Or something else?

Would another possibility be changing capacity a scaled value (like
cpu_power is now) rather than a small integer as it is now.  For
example, a scaled capacity of 1024 would be equivalent to a capacity of
1 now.  This might enable us to handle partial capacities better?  We'd
probably have to scale a bunch of nr_running too.  

Mikey
Peter Zijlstra - April 16, 2010, 1:58 p.m.
On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:

> > Right, so I suspect this will indeed break some things.
> > 
> > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > and there simply isn't much capacity left, in that case you really want
> > to try and move load to your sibling cpus if possible.
> 
> Changing the CPU power based on what tasks are running on them seems a
> bit wrong to me.  Shouldn't we keep those concepts separate?

Well the thing cpu_power represents is a ratio of compute capacity
available to this cpu as compared to other cpus. By normalizing the
runqueue weights with this we end up with a fair balance.

The thing to realize here is that this is solely about SCHED_NORMAL
tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
fairness and available compute capacity.

So if we were to ignore RT tasks, you'd end up with a situation where,
assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
would end up on the cpu the RT tasks would be running on would not run
as fast -- is that fair?

Since RT tasks do not have a weight (FIFO/RR have no limit at all,
DEADLINE would have something equivalent to a max weight), it is
impossible to account them in the normal weight sense.

Therefore the current model takes them into account by lowering the
compute capacity according to their (avg) cpu usage. So if the RT task
would consume 66% cputime, we'd end up with a situation where the cpu
running the RT task would get 1 NORMAL task, and other cpu would have
the remaining 3, that way they'd all get 33% cpu.

> > However you're right that this goes awry in your case.
> > 
> > One thing to look at is if that 15% increase is indeed representative
> > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > significant gains, otherwise they'd not have wasted the silicon.
> 
> There are certainly, for most workloads, per core gains for SMT4 over
> SMT2 on P7.  My kernels certainly compile faster and that's the only
> workload anyone who matters cares about.... ;-)

For sure ;-)

Are there any numbers available on how much they gain? It might be worth
to stick in real numbers instead of this alleged 15%.

> > One thing we could look at is using the cpu base power to compute
> > capacity from. We'd have to add another field to sched_group and store
> > power before we do the scale_rt_power() stuff.
> 
> Separating capacity from what RT tasks are running seems like a good
> idea to me.

Well, per the above we cannot fully separate them.

> This would fix the RT issue, but it's not clear to me how you are
> suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> the RT tasks out from capacity and keep the max(1, capacity) that I've
> added?  Or something else?

I would think that 4 SMT threads are still slower than two full cores,
right? So cpu_power=2048 would not be appropriate.

> Would another possibility be changing capacity a scaled value (like
> cpu_power is now) rather than a small integer as it is now.  For
> example, a scaled capacity of 1024 would be equivalent to a capacity of
> 1 now.  This might enable us to handle partial capacities better?  We'd
> probably have to scale a bunch of nr_running too.  

Right, so my proposal was to scale down the capacity divider (currently
1024) to whatever would be the base capacity for that cpu. Trouble seems
to be that that makes group capacity a lot more complex, as you would
end up needing to average all the cpu's their base capacity.


Hrmm, my brain seems muddled but I might have another solution, let me
ponder this for a bit..
Michael Neuling - April 18, 2010, 9:34 p.m.
In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
> 
> > > Right, so I suspect this will indeed break some things.
> > > 
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> > 
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me.  Shouldn't we keep those concepts separate?
> 
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
> 
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
> 
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
> 
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
> 
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and other cpu would have
> the remaining 3, that way they'd all get 33% cpu.

Thanks for the explanation.  Your last example makes perfect sense for
me.  

> 
> > > However you're right that this goes awry in your case.
> > > 
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> > 
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7.  My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
> 
> For sure ;-)
> 
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.

I get some gain numbers but obviously the workloads makes a huge
difference.  From a scheduler perspective, I assume an
average/representative gain is best rather than an optimistic or
pessimistic one?

We'll have different gains for SMT2 and SMT4, so we could change the
gain dynamically based on which SMT mode we are in.  Does that seem like
something we should add as an arch hook?  

> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> > 
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
> 
> Well, per the above we cannot fully separate them.

Yep.

> 
> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> > we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added?  Or something else?
> 
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.

Yes for an average workload, 4 SMT threads are slower than 2 full
cores.  

> 
> > Would another possibility be changing capacity a scaled value (like
> > cpu_power is now) rather than a small integer as it is now.  For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now.  This might enable us to handle partial capacities better?  We'd
> > probably have to scale a bunch of nr_running too.  
> 
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that that makes group capacity a lot more complex, as you would
> end up needing to average all the cpu's their base capacity.
> 
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Great!

Mikey
Peter Zijlstra - April 19, 2010, 2:49 p.m.
On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > Are there any numbers available on how much they gain? It might be worth
> > to stick in real numbers instead of this alleged 15%.
> 
> I get some gain numbers but obviously the workloads makes a huge
> difference.  From a scheduler perspective, I assume an
> average/representative gain is best rather than an optimistic or
> pessimistic one?

Yeah, average would be best.

> We'll have different gains for SMT2 and SMT4, so we could change the
> gain dynamically based on which SMT mode we are in.  Does that seem like
> something we should add as an arch hook?  

That's the sort of thing you can use arch_scale_smt_power() for. But be
weary to not fall into the same trap I did with x86, where I confused
actual gain with capacity (When idle the actual gain is 0, but the
capacity is not).
Michael Neuling - April 19, 2010, 8:45 p.m.
In message <1271688543.1488.253.camel@laptop> you wrote:
> On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > > Are there any numbers available on how much they gain? It might be worth
> > > to stick in real numbers instead of this alleged 15%.
> > 
> > I get some gain numbers but obviously the workloads makes a huge
> > difference.  From a scheduler perspective, I assume an
> > average/representative gain is best rather than an optimistic or
> > pessimistic one?
> 
> Yeah, average would be best.

Ok.

> > We'll have different gains for SMT2 and SMT4, so we could change the
> > gain dynamically based on which SMT mode we are in.  Does that seem like
> > something we should add as an arch hook?  
> 
> That's the sort of thing you can use arch_scale_smt_power() for. But be
> weary to not fall into the same trap I did with x86, where I confused
> actual gain with capacity (When idle the actual gain is 0, but the
> capacity is not).

Oops, yes of course :-)

<from before>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Let me know if/when you come up this solution or if I can help.  

Mikey
Michael Neuling - April 29, 2010, 6:55 a.m.
In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
> 
> > > Right, so I suspect this will indeed break some things.
> > > 
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> > 
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me.  Shouldn't we keep those concepts separate?
> 
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
> 
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
> 
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
> 
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
> 
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and other cpu would have
> the remaining 3, that way they'd all get 33% cpu.
> 
> > > However you're right that this goes awry in your case.
> > > 
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> > 
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7.  My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
> 
> For sure ;-)
> 
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.
> 
> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> > 
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
> 
> Well, per the above we cannot fully separate them.
> 
> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> > we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added?  Or something else?
> 
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.
> 
> > Would another possibility be changing capacity a scaled value (like
> > cpu_power is now) rather than a small integer as it is now.  For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now.  This might enable us to handle partial capacities better?  We'd
> > probably have to scale a bunch of nr_running too.  
> 
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that that makes group capacity a lot more complex, as you would
> end up needing to average all the cpu's their base capacity.
> 
> 
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..
> 

Peter,

Did you manage to get anywhere on this capacity issue?

Mikey
Peter Zijlstra - May 31, 2010, 8:33 a.m.
On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> 
> 
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..
> 

Right, so the thing I was thinking about is taking the group capacity
into account when determining the capacity for a single cpu.

Say the group contains all the SMT siblings, then use the group capacity
(usually larger than 1024) and then distribute the capacity over the
group members, preferring CPUs with higher individual cpu_power over
those with less.

So suppose you've got 4 siblings with cpu_power=294 each, then we assign
capacity 1 to the first member, and the remaining 153 is insufficient,
and thus we stop and the rest lives with 0 capacity.

Now take the example that the first sibling would be running a heavy RT
load, and its cpu_power would be reduced to say, 50, then we still got
nearly 933 left over the others, which is still sufficient for one
capacity, but because the first sibling is low, we'll assign it 0 and
instead assign 1 to the second, again, leaving the third and fourth 0.

If the group were a core group, the total would be much higher and we'd
likely end up assigning 1 to each before we'd run out of capacity.


For power savings, we can lower the threshold and maybe use the maximal
individual cpu_power in the group to base 1 capacity from.

So, suppose the second example, where sibling0 has 50 and the others
have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
Vaidyanathan Srinivasan - June 1, 2010, 10:52 p.m.
* Peter Zijlstra <peterz@infradead.org> [2010-05-31 10:33:16]:

> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> > 
> > 
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> > 
> 
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.
> 
> Say the group contains all the SMT siblings, then use the group capacity
> (usually larger than 1024) and then distribute the capacity over the
> group members, preferring CPUs with higher individual cpu_power over
> those with less.
> 
> So suppose you've got 4 siblings with cpu_power=294 each, then we assign
> capacity 1 to the first member, and the remaining 153 is insufficient,
> and thus we stop and the rest lives with 0 capacity.
> 
> Now take the example that the first sibling would be running a heavy RT
> load, and its cpu_power would be reduced to say, 50, then we still got
> nearly 933 left over the others, which is still sufficient for one
> capacity, but because the first sibling is low, we'll assign it 0 and
> instead assign 1 to the second, again, leaving the third and fourth 0.

Hi Peter,

Thanks for the suggestion.

> If the group were a core group, the total would be much higher and we'd
> likely end up assigning 1 to each before we'd run out of capacity.

This is a tricky case because we are depending upon the
DIV_ROUND_CLOSEST to decide whether to flag capacity to 0 or 1.  We
will not have any task movement until capacity is depleted to quite
low value due to RT task.  Having a threshold to flag 0/1 instead of
DIV_ROUND_CLOSEST just like you have suggested in the power savings
case may help here as well to move tasks to other idle cores.

> For power savings, we can lower the threshold and maybe use the maximal
> individual cpu_power in the group to base 1 capacity from.
> 
> So, suppose the second example, where sibling0 has 50 and the others
> have 294, you'd end up with a capacity distribution of: {0,1,1,1}.

One challenge here is that if RT tasks run on more that one thread in
this group, we will have slightly different cpu powers.  Arranging
them from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity along with
cpu_power to simplify these thresholds.  Whenever we need to scale
group load with cpu power can take the product of cpu_power and
scale_rt_power but in these cases where we compute capacity, we can
mark a 0 or 1 just based on whether scale_rt_power was less than
SCHED_LOAD_SCALE or not.  Alternatively we can keep cpu_power as
a product of all scaling factors as it is today but save the component
scale factors also like scale_rt_power() and arch_scale_freq_power()
so that it can be used in load balance decisions.

Basically in power save balance we would give all threads a capacity
'1' unless the cpu_power was reduced due to RT task.  Similarly in
the non-power save case, we can have flag 1,0,0,0 unless first thread
had a RT scaling during the last interval.

I am suggesting to distinguish the reduction is cpu_power due to
architectural (hardware DVFS) reasons from RT tasks so that it is easy
to decide if moving tasks to sibling thread or core can help or not.

--Vaidy
Peter Zijlstra - June 3, 2010, 8:56 a.m.
On Wed, 2010-06-02 at 04:22 +0530, Vaidyanathan Srinivasan wrote:

> > If the group were a core group, the total would be much higher and we'd
> > likely end up assigning 1 to each before we'd run out of capacity.
> 
> This is a tricky case because we are depending upon the
> DIV_ROUND_CLOSEST to decide whether to flag capacity to 0 or 1.  We
> will not have any task movement until capacity is depleted to quite
> low value due to RT task.  Having a threshold to flag 0/1 instead of
> DIV_ROUND_CLOSEST just like you have suggested in the power savings
> case may help here as well to move tasks to other idle cores.

Right, well we could put the threshold higher than the 50%, say 90% or
so.

> > For power savings, we can lower the threshold and maybe use the maximal
> > individual cpu_power in the group to base 1 capacity from.
> > 
> > So, suppose the second example, where sibling0 has 50 and the others
> > have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
> 
> One challenge here is that if RT tasks run on more that one thread in
> this group, we will have slightly different cpu powers.  Arranging
> them from max to min and having a cutoff threshold should work.

Right, like the 90% above.

> Should we keep the RT scaling as a separate entity along with
> cpu_power to simplify these thresholds.  Whenever we need to scale
> group load with cpu power can take the product of cpu_power and
> scale_rt_power but in these cases where we compute capacity, we can
> mark a 0 or 1 just based on whether scale_rt_power was less than
> SCHED_LOAD_SCALE or not.  Alternatively we can keep cpu_power as
> a product of all scaling factors as it is today but save the component
> scale factors also like scale_rt_power() and arch_scale_freq_power()
> so that it can be used in load balance decisions.

Right, so the question is, do we only care about RT or should capacity
reflect the full asymmetric MP case.

I don't quite see why RT is special from any of the other scale factors,
if someone pegged one core at half the frequency of the others you'd
still want it to get 0 capacity so that we only try to populate it on
overload.

> Basically in power save balance we would give all threads a capacity
> '1' unless the cpu_power was reduced due to RT task.  Similarly in
> the non-power save case, we can have flag 1,0,0,0 unless first thread
> had a RT scaling during the last interval.
> 
> I am suggesting to distinguish the reduction is cpu_power due to
> architectural (hardware DVFS) reasons from RT tasks so that it is easy
> to decide if moving tasks to sibling thread or core can help or not.

For power savings such a special heuristic _might_ make sense.

Patch

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -1482,6 +1482,7 @@  static int select_task_rq_fair(struct ta
 			}
 
 			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+			capacity = max(capacity, 1UL);
 
 			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 				nr_running /= 2;
@@ -2488,6 +2489,7 @@  static inline void update_sg_lb_stats(st
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = max(sgs->group_capacity, 1UL);
 }
 
 /**
@@ -2795,9 +2797,11 @@  find_busiest_queue(struct sched_group *g
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity;
 		unsigned long wl;
 
+		capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
+
 		if (!cpumask_test_cpu(i, cpus))
 			continue;