RFC: An alternative -fsched-pressure implementation

Message ID g4k45nh5nk.fsf@richards-thinkpad.stglab.manchester.uk.ibm.com

Commit Message

Richard Sandiford Dec. 23, 2011, 11:46 a.m. UTC
So it looks like two pieces of work related to scheduling and register
pressure are being posted close together.  This one is an RFC for a less
aggressive form of -fsched-pressure.  I think it should complement
rather than compete with Bernd's IRA patch.  It seems like a good idea
to take register pressure into account during the first scheduling pass,
where we can still easily look at things like instruction latencies
and pipeline utilisation.  Better rematerialisation in the register
allocator would obviously be a good thing too though.

This patch started when we (Linaro) saw a big drop in performance
from vectorising an RGB to YIQ filter on ARM.  The first scheduling
pass was overly aggressive in creating a "wide" schedule, and caused
the newly-vectorised loop to contain lots of spills.  The loop grew so
big that it even had a constant pool in the middle of it.

-fsched-pressure did a very good job on this loop, creating far fewer
spills and consequently avoiding the constant pool.  However, it seemed
to make several other cases significantly worse.  The idea was therefore
to try to come up with a form of -fsched-pressure that could be turned
on for ARM by default.

Current -fsched-pressure works by assigning an "excess (pressure) cost change"
to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
changes the way that the main list scheduler handles stalls due to data
dependencies.  If an instruction would stall for N cycles, the scheduler
would normally add it to the "now+N" queue, then add it to the ready queue
after N cycles.  With -fsched-pressure, it instead adds the instruction
to the ready queue immediately, while still recording that the instruction
would require N stalls.  I'll write the number of stalls on X as delay(X).

This arrangement allows the scheduler to choose between increasing register
pressure and introducing a deliberate stall.  Instructions are ranked by:

  (a) lowest ECC(X) + delay(X)
  (b) lowest delay(X)
  (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)

Note that since delay(X) is measured in cycles, ECC(X) is effectively
measured in cycles too.
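
As a sketch, the three criteria above amount to a lexicographic comparator
along these lines (the struct and function names here are illustrative,
not GCC's actual haifa-sched data structures):

```c
#include <assert.h>

/* Hypothetical per-instruction summary; the fields mirror the quantities
   used in the ranking above.  */
struct insn_info {
  int ecc;      /* excess pressure cost change, ECC(X), in cycles */
  int delay;    /* stall cycles before X could issue, delay(X) */
  int priority; /* normal list-scheduler priority (e.g. INSN_PRIORITY) */
};

/* Return a negative value if A should be scheduled before B.
   Criteria (a), (b) and (c) are applied in order.  */
static int
rank_insns (const struct insn_info *a, const struct insn_info *b)
{
  /* (a) lowest ECC(X) + delay(X) first.  */
  if (a->ecc + a->delay != b->ecc + b->delay)
    return (a->ecc + a->delay) - (b->ecc + b->delay);
  /* (b) lowest delay(X) first.  */
  if (a->delay != b->delay)
    return a->delay - b->delay;
  /* (c) fall back to the normal ranking: higher priority first.  */
  return b->priority - a->priority;
}
```

Since both terms of (a) are in cycles, an instruction that raises pressure
can still win if the alternative would stall for longer than the pressure
cost.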

Several things seemed to be causing the degradations we were seeing
with -fsched-pressure:

  (1) The -fsched-pressure schedule is based purely on instruction latencies
      and issue rate; it doesn't take the DFA into account.  This means that
      we attempt to "dual issue" things like vector operations, loads and
      stores on Cortex A8 and A9.  In the examples I looked at, these sorts
      of inaccuracy seemed to accumulate, so that the value of delay(X)
      became based on slightly unrealistic cycle times.

      Note that this also affects code that doesn't have any pressure
      problems; it isn't limited to code that does.

      This may simply be historical.  It became much easier to use the
      DFA here after Bernd's introduction of prune_ready_list, but the
      original -fsched-pressure predates that.

  (2) We calculate ECC(X) by walking the unscheduled part of the block
      in its original order, then recording the pressure at each instruction.
      This seemed to make ECC(X) quite sensitive to that original order.
      I saw blocks that started out naturally "narrow" (not much ILP,
      e.g. from unrolled loops) and others that started naturally "wide"
      (a lot of ILP, such as in the libav h264 code), and walking the
      block in order meant that the two styles would be handled differently.

  (3) When calculating the pressure of the original block (as described
      in (2)), we ignore the deaths of registers that are used by more
      than one unscheduled instruction.  This tended to hurt long(ish)
      loops in things like filters, where the same value is often used
      as an input to two calculations.  The effect was that instructions
      towards the end of the block would appear to have very high pressure.
      This in turn made the algorithm very conservative; it wouldn't
      promote instructions from later in the block because those
      instructions seemed to have a prohibitively large cost.

      I asked Vlad about this, and he confirmed that it was a deliberate
      decision.  He'd tried honouring REG_DEAD notes instead, but it
      produced worse results on x86.  I'll return to this at the end.

  (4) ECC(X) is based on the pressure over and above ira_available_class_regs
      (the number of allocatable registers in a given class).  ARM has 14
      allocatable GENERAL_REGS: 16 minus the stack pointer and program
      counter.  So if 14 integer variables are live across a loop but
      not referenced within it, we end up scheduling that loop in a context
      of permanent pressure.  Pressure becomes the overriding concern,
      and we don't get much ILP.

      I suppose there are at least two ways of viewing this:

      (4a) We're giving an equal weight to:

           (Ra) registers that are live across a loop but not used within it
                (and which should therefore require no spill code in the
                loop itself)

           (Rb) registers that hold loop invariants (and so should only
                require loads in the loop itself)

           (Rc) registers that are used and set within the loop (and so
                would require both loads and stores in the loop itself)

           We effectively treat everything as (Rc).

      (4b) We always work to ira_available_class_regs, rather than to
           an estimate of what maximum pressure is realistically
           achievable.  If a block has something like:

               A: R2 = *R1
                  ...
               B: R3 = *(R1 + 4)
                  ...
               C: R1 = R1 + 8
                  ...
               D: R4 = R2 + R3
                  ...

           then it can't help but have three registers live at once.

The first thing I tried was to change just (4a) in isolation.
Taking (Ra) out of the pressure estimation produced some benefits,
but not enough.  I then tried a "staging" of costs, so that:

  (Ra) had the cost of a load and store in the outer loop (if any)
  (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
  (Rc) had the cost of a load and store in the inner loop ( " " )

These costs were still based on MEMORY_MOVE_COST, just like the current
ones are.  However, MEMORY_MOVE_COST is defined to be relative to
REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
"simple" move.  Since ECC(X) is effectively a cycle count, it seemed
better to divide the cost by 2.
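
As a worked example of that conversion (the function name is mine, not
GCC's): MEMORY_MOVE_COST-style values are relative to REGISTER_MOVE_COST,
and a simple move is 2, so halving the sum of a load and a store cost
gives a figure roughly comparable with delay(X):

```c
#include <assert.h>

/* Convert MEMORY_MOVE_COST-style load and store costs (relative to a
   simple register move of 2) into an approximate cycle count for a
   full spill-and-reload.  */
static int
spill_cost_in_cycles (int load_cost, int store_cost)
{
  return (load_cost + store_cost) / 2;
}
```

With ARM's costs of 5 and 5 this makes a full spill 10 moves, or about
5 cycles; the reduced 4-and-2 experiment mentioned below comes out at 3.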

This again produced some benefit, but not enough.  I think part of the
problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
loads and stores cost the equivalent of 5 register moves, making the cost
of a full spill equivalent to 10 register moves.  ARM also makes these
costs independent of the mode, so that a single SImode load and store
has the same cost as a 4-quad-vector load and store.  This might be
something that's worth looking at, especially since the same costs
influence spilling decisions in the register allocator.

However, even reducing the cost of a load to 4 moves and a store to 2
moves didn't bring enough benefit (although it was better than 5 and 5).
The division of costs above is also a little artificial.  Because we don't
have an integrated register allocator and scheduler, there's not really
any telling how many times the block will need to load or store a value.
An invariant might be used many times, and in such a way that several
loads are needed, while another variable might be involved in a single
read-modify-write sequence.  The true cost also depends on how easily
the spill instructions fit into the surrounding code.

On a different tack, I tried to tackle (1) by taking the DFA into account.
If delay(X) is zero, but X cannot issue due to resource hazards, I set
delay(X) to 1.  Again this wasn't enough in itself, although it does
still form an important part of the "final" proposed version.
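
A minimal sketch of that adjustment, with the DFA query abstracted into a
flag (in GCC it would come from testing a transition on a copy of the
current automaton state):

```c
#include <assert.h>
#include <stdbool.h>

/* Adjust delay(X) for structural hazards.  A data-dependence stall
   already dominates; only when the instruction is otherwise ready do we
   account for a resource hazard, and then only as a single cycle.  */
static int
effective_delay (int delay, bool dfa_hazard)
{
  if (delay == 0 && dfa_hazard)
    return 1;
  return delay;
}
```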

I tried the algorithms from a few papers, but they too tended to be
overly conservative or to suffer from degenerate behaviour in filters.

In the end I tried an ad-hoc approach in an attempt to do something
about (2), (3) and (4b).  The idea was to construct a preliminary
"model" schedule in which the only objective is to keep register
pressure to a minimum.  This schedule ignores pipeline characteristics,
latencies, and the number of available registers.  The maximum pressure
seen in this initial model schedule (MP) is then the benchmark for ECC(X).

During the main scheduling, an instruction X that occurs at or before
the next point of maximum pressure in the model schedule is measured
based on the current register pressure.  If X doesn't increase the
current pressure beyond the current maximum, its ECC(X) is zero,
otherwise ECC(X) is the cost of going from MP to the new maximum.
The idea is that the final net pressure of scheduling a given set of
instructions is going to be the same regardless of the order; we simply
want to keep the intermediate pressure under control.  An ECC(X) of zero
usually[*] means that scheduling X next won't send the rest of the
sequence beyond the current maximum pressure.

  [*] but not always.  There's more about this in the patch comments.

If an instruction X occurs _after_ the next point of maximum pressure,
X is measured based on that maximum pressure.  If the current maximum
pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
going from MP to MP' + dP.
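
Both cases can be folded into one sketch, where base_pressure is either
the current pressure (for instructions at or before the next point of
maximum pressure) or MP' (for later ones).  The names and the flat
per-register cost of 1 are illustrative, reflecting the fixed spill cost
this message eventually settles on:

```c
#include <assert.h>

/* ECC(X) for one pressure class under the model-schedule scheme.
   MP is the benchmark maximum pressure from the model schedule;
   DP is the pressure change that scheduling X would cause.  */
static int
model_ecc (int mp, int base_pressure, int dp)
{
  int new_max = base_pressure + dp;
  /* Staying at or below the benchmark maximum costs nothing.  */
  if (new_max <= mp)
    return 0;
  /* Otherwise the cost is that of going from MP to the new maximum.  */
  return new_max - mp;
}
```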

Of course, this all depends on how good a value MP is, and therefore
on how good the model schedule is.  I tried a few variations before
settling on the one in the patch (which I hope makes conceptual sense).

I initially stayed with the idea above about assigning different costs to
(Ra), (Rb) and (Rc).  This produces some good results, but was still a
little too conservative in general, in that other tests were still worse
with -fsched-pressure than without.  I described some of the problems
with these costs above.  Another is that if an instruction X has a spill
cost of 6, say, then:

  ECC(X) + delay(X)

will only allow X to be scheduled if the next instruction without
a spill cost has a delay of 6 cycles or more.  This is overly harsh,
especially seeing as few ARM instructions have such a high latency.
The benefit of spilling is often to avoid a series of short
(e.g. single-cycle) stalls, rather than to avoid a single long one.

I then adjusted positive ECC(X) values based on the priority of X
relative to the highest-priority zero-cost instruction.  This was
better, but a DES filter in particular still suffered from the
"lots of short stalls" problem.

Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
and simply treating each extra spill as having a cost of 1.  Somewhat
to my surprise, there was only one test in which some improvement was
lost; that test was now only slightly better than without -fsched-pressure.
But the fixed cost of 1 per spill achieved the goal of being as good as
or better than -fno-sched-pressure in almost all cases, while still being
significantly better in some of the cases we care about.

Assigning a cost of just 1 to each excess spill is clearly going to
interfere with the normal list scheduler less often than assigning each
spill a higher cost.  Given that this approach also gives the most important
benefits, it felt like a good default.

All in all, the patch is a complicated way of being quite timid,
but I hope it could also be used as a basis for more aggressive
versions in future.  Things I haven't looked at include whether
it would make sense to disable the priority-based adjustment of
ECC for -Os.  (That's a question of whether this patch can improve
size still further over -fno-sched-pressure.)

Being timid ought to make it fit quite well with Bernd's patch,
although I haven't had a chance to try the two together.

I talked with Vlad about this a couple of months ago (thanks again for
the help, Vlad).  He explained that some of the things I was seeing --
especially (3) -- were deliberate decisions that had improved the
results for x86.  And I think it's probably true that the patch below
is only going to work well on fairly regular RISCy machines.

For that reason, the patch adds an -fsched-pressure-algorithm=
option to choose between the two versions.  For historical reasons
I called the current algorithm "weighted" and the proposed one "model",
but I'm sure there are better names.  I've kept the option undocumented
because it's really an internal thing.  "weighted" remains the default.

Of course, adding an alternative like this would only be acceptable if
-fsched-pressure and -fsched-pressure-algorithm=model became the default
for ARM (or another target).  That's a decision for the ARM maintainers;
I've sent Ramana the results I got, which unfortunately I can't share
more widely.

If anyone does have time to test this on their favourite port,
I'd really appreciate it.  I already know that it doesn't perform
well on SPECFP for rs6000 because of the choice of pressure classes;
I'll send a separate note about that so that it doesn't get lost in
this long essay.

Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
are enabled.  I'm planning to send a patch for that at some point;
it's a regression from 4.6, so seems like stage 3 material.
In the meantime, a simple workaround is to stick "return;"
at the beginning of the function.

Richard

Comments

Richard Biener Dec. 23, 2011, 12:12 p.m. UTC | #1
On Fri, Dec 23, 2011 at 12:46 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together.  This one is an RFC for a less
> aggressive form of -fsched-pressure.
>
> [...]
>
>  (2) We calculate ECC(X) by walking the unscheduled part of the block
>      in its original order, then recording the pressure at each instruction.
>      This seemed to make ECC(X) quite sensitive to that original order.
>
> [...]
>
> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
> and simply treating each extra spill as having a cost of 1.  Somewhat
> to my surprise, there was only one test in which some improvement was
> lost; that test was now only slightly better than without -fsched-pressure.
>
> [...]

Without looking at the patch itself I think that addressing (2)
independently of the sched-pressure algorithm would be a good idea.  That said,
even on x86_64 -fsched-pressure isn't an overall win - is it enabled by
default on any architecture?  I too think a magic constant of '1' for each
spill is good (though maybe we'd want to have a target hook that
specifies spill cost [for a specific mode] in terms of cycles).

Richard.

Vladimir Makarov Dec. 23, 2011, 4:31 p.m. UTC | #2
On 12/23/2011 06:46 AM, Richard Sandiford wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together.  This one is an RFC for a less
> aggressive form of -fsched-pressure.  I think it should complement
> rather than compete with Bernd's IRA patch.  It seems like a good idea
> to take register pressure into account during the first scheduling pass,
> where we can still easily look at things like instruction latencies
> and pipeline utilisation.  Better rematerialisation in the register
> allocator would obviously be a good thing too though.
>
> This patch started when we (Linaro) saw a big drop in performance
> from vectorising an RGB to YIQ filter on ARM.  The first scheduling
> pass was overly aggressive in creating a "wide" schedule, and caused
> the newly-vectorised loop to contain lots of spills.  The loop grew so
> big that it even had a constant pool in the middle of it.
>
> -fsched-pressure did a very good job on this loop, creating far fewer
> spills and consequently avoiding the constant pool.  However, it seemed
> to make several other cases significantly worse.  The idea was therefore
> to try to come up with a form of -fsched-pressure that could be turned
> on for ARM by default.
>
> Current -fsched-pressure works by assigning an "excess (pressure) cost change"
> to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
> changes the way that the main list scheduler handles stalls due to data
> dependencies.  If an instruction would stall for N cycles, the scheduler
> would normally add it to the "now+N" queue, then add it to the ready queue
> after N cycles.  With -fsched-pressure, it instead adds the instruction
> to the ready queue immediately, while still recording that the instruction
> would require N stalls.  I'll write the number of stalls on X as delay(X).
>
> This arrangement allows the scheduler to choose between increasing register
> pressure and introducing a deliberate stall.  Instructions are ranked by:
>
>    (a) lowest ECC(X) + delay(X)
>    (b) lowest delay(X)
>    (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)
>
> Note that since delay(X) is measured in cycles, ECC(X) is effectively
> measured in cycles too.
>
> Several things seemed to be causing the degradations we were seeing
> with -fsched-pressure:
>
>    (1) The -fsched-pressure schedule is based purely on instruction latencies
>        and issue rate; it doesn't take the DFA into account.  This means that
>        we attempt to "dual issue" things like vector operations, loads and
>        stores on Cortex A8 and A9.  In the examples I looked at, these sorts
>        of inaccuracy seemed to accumulate, so that the value of delay(X)
>        became based on slightly unrealistic cycle times.
>
>        Note that this also affects code that doesn't have any pressure
>        problems; it isn't limited to code that does.
>
>        This may simply be historical.  It became much easier to use the
>        DFA here after Bernd's introduction of prune_ready_list, but the
>        original -fsched-pressure predates that.
>
>    (2) We calculate ECC(X) by walking the unscheduled part of the block
>        in its original order, then recording the pressure at each instruction.
>        This seemed to make ECC(X) quite sensitive to that original order.
>        I saw blocks that started out naturally "narrow" (not much ILP,
>        e.g. from unrolled loops) and others that started naturally "wide"
>        (a lot of ILP, such as in the libav h264 code), and walking the
>        block in order meant that the two styles would be handled differently.
>
>    (3) When calculating the pressure of the original block (as described
>        in (2)), we ignore the deaths of registers that are used by more
>        than one unscheduled instruction.  This tended to hurt long(ish)
>        loops in things like filters, where the same value is often used
>        as an input to two calculations.  The effect was that instructions
>        towards the end of the block would appear to have very high pressure.
>        This in turn made the algorithm very conservative; it wouldn't
>        promote instructions from later in the block because those
>        instructions seemed to have a prohibitively large cost.
>
>        I asked Vlad about this, and he confirmed that it was a deliberate
>        decision.  He'd tried honouring REG_DEAD notes instead, but it
>        produced worse results on x86.  I'll return to this at the end.
>
>    (4) ECC(X) is based on the pressure over and above ira_available_class_regs
>        (the number of allocatable registers in a given class).  ARM has 14
>        allocatable GENERAL_REGS: 16 minus the stack pointer and program
>        counter.  So if 14 integer variables are live across a loop but
>        not referenced within it, we end up scheduling that loop in a context
>        of permanent pressure.  Pressure becomes the overriding concern,
>        and we don't get much ILP.
>
>        I suppose there are at least two ways of viewing this:
>
>        (4a) We're giving an equal weight to:
>
>             (Ra) registers that are live across a loop but not used
>                  within it (and which should therefore require no spill
>                  code in the loop itself)
>
>             (Rb) registers that hold loop invariants (and so should only
>                  require loads in the loop itself)
>
>             (Rc) registers that are used and set within the loop (and so
>                  would require both loads and stores in the loop itself)
>
>             We effectively treat everything as (Rc).
>
>        (4b) We always work to ira_available_class_regs, rather than to
>             an estimate of what maximum pressure is realistically
>             achievable.  If a block has something like:
>
>                 A: R2 = *R1
>                    ...
>                 B: R3 = *(R1 + 4)
>                    ...
>                 C: R1 = R1 + 8
>                    ...
>                 D: R4 = R2 + R3
>                    ...
>
>             then it can't help but have three registers live at once.
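To illustrate the A..D example, here is a toy liveness simulation (not GCC's df machinery; the `toy_insn` representation and `peak_pressure` helper are invented for this sketch) showing the block peaking at three live registers:

```c
#include <assert.h>

#define NREGS 5  /* index 0 unused; registers R1..R4 */

struct toy_insn {
  int uses[2];   /* registers read (0 = unused slot) */
  int def;       /* register written (0 = none) */
  int kills[2];  /* uses that die here, like REG_DEAD notes (0 = none) */
};

/* Walk a straight-line sequence and return the peak number of
   simultaneously live registers.  LIVE_IN marks registers live on entry.  */
static int
peak_pressure (const struct toy_insn *seq, int n, const int *live_in)
{
  int live[NREGS];
  int i, j, count = 0, peak;

  for (j = 0; j < NREGS; j++)
    {
      live[j] = live_in[j];
      count += live[j];
    }
  peak = count;
  for (i = 0; i < n; i++)
    {
      /* Dying uses free their registers.  */
      for (j = 0; j < 2; j++)
	if (seq[i].kills[j] && live[seq[i].kills[j]])
	  {
	    live[seq[i].kills[j]] = 0;
	    count--;
	  }
      /* A definition makes its register live.  */
      if (seq[i].def && !live[seq[i].def])
	{
	  live[seq[i].def] = 1;
	  count++;
	}
      if (count > peak)
	peak = count;
    }
  return peak;
}
```

Both loads must complete before D can consume R2 and R3, so no reordering gets below three.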
>
> The first thing I tried was to change just (4a) in isolation.
> Taking (Ra) out of the pressure estimation produced some benefits,
> but not enough.  I then tried a "staging" of costs, so that:
>
>    (Ra) had the cost of a load and store in the outer loop (if any)
>    (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
>    (Rc) had the cost of a load and store in the inner loop ( " " )
>
> These costs were still based on MEMORY_MOVE_COST, just like the current
> ones are.  However, MEMORY_MOVE_COST is defined to be relative to
> REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
> "simple" move.  Since ECC(X) is effectively a cycle count, it seemed
> better to divide the cost by 2.
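A hedged sketch of that staging, using illustrative cost numbers rather than ARM's real MEMORY_MOVE_COST hooks (the halving reflects REGISTER_MOVE_COST being defined as 2 for a simple move):

```c
#include <assert.h>

/* The three register categories from the text.  */
enum reg_kind { LIVE_THROUGH /* Ra */, INVARIANT /* Rb */, MODIFIED /* Rc */ };

/* Return the staged spill cost in cycles.  The *_cost arguments are
   MEMORY_MOVE_COST-style values (relative to REGISTER_MOVE_COST == 2),
   hence the division by 2.  This helper is an invented illustration.  */
static int
staged_spill_cost (enum reg_kind kind, int load_cost, int store_cost,
		   int outer_load_cost, int outer_store_cost)
{
  int ra = (outer_load_cost + outer_store_cost) / 2;  /* outer load+store */
  int rb = load_cost / 2;                             /* inner load only */
  int rc = (load_cost + store_cost) / 2;              /* inner load+store */

  switch (kind)
    {
    case LIVE_THROUGH:
      return ra;
    case INVARIANT:
      return rb > ra ? rb : ra;   /* "or the cost of (Ra) if larger" */
    default:
      return rc > ra ? rc : ra;
    }
}
```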
>
> This again produced some benefit, but not enough.  I think part of the
> problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
> loads and stores cost the equivalent of 5 register moves, making the cost
> of a full spill equivalent to 10 register moves.  ARM also makes these
> costs independent of the mode, so that a single SImode load and store
> has the same cost as a 4-quad-vector load and store.  This might be
> something that's worth looking at, especially since the same costs
> influence spilling decisions in the register allocator.
>
Yes, the accuracy of the costs was always a problem.  I saw it for some 
targets.
> However, even reducing the cost of a load to 4 moves and a store to 2
> moves didn't bring enough benefit (although it was better than 5 and 5).
> The division of costs above is also a little artificial.  Because we don't
> have an integrated register allocator and scheduler, there's not really
> any telling how many times the block will need to load or store a value.
> An invariant might be used many times, and in such a way that several
> loads are needed, while another variable might be involved in a single
> read-modify-write sequence.  The true cost also depends on how easily
> the spill instructions fit into the surrounding code.
>
Yes, there is also inheritance in reload.  It is hard to say whether 
reload will reuse an already-loaded value.
> On a different tack, I tried to tackle (1) by taking the DFA into account.
> If delay(X) is zero, but X cannot issue due to resource hazards, I set
> delay(X) to 1.  Again this wasn't enough in itself, although it does
> still form an important part of the "final" proposed version.
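A minimal sketch of that adjustment; `dfa_can_issue_p` here is a made-up stand-in for what would really be a state_transition () query on a copy of the automaton state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical DFA query: in GCC this would test state_transition ()
   on a copy of the current cycle's automaton state.  The toy model says
   reservation 1 conflicts with state 1 (e.g. the load/store pipe is
   already busy this cycle).  */
static bool
dfa_can_issue_p (int insn_reservation, int cycle_state)
{
  return !(insn_reservation == 1 && cycle_state == 1);
}

/* If an insn has no data-dependence stall but cannot issue this cycle
   because of a structural hazard, count it as a one-cycle stall.  */
static int
effective_delay (int data_delay, int insn_reservation, int cycle_state)
{
  if (data_delay == 0 && !dfa_can_issue_p (insn_reservation, cycle_state))
    return 1;
  return data_delay;
}
```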
>
> I tried the algorithms from a few papers, but they too tended to be
> overly conservative or to suffer from degenerate behaviour in filters.
>
This area has had no really serious attention from researchers, although 
it is quite important for some targets.  I got this impression long ago.  
Some algorithms (e.g. one mentioned in Muchnick's book) would degrade the 
code instead, I believe.
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b).  The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum.  This schedule ignores pipeline characteristics,
> latencies, and the number of available registers.  The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>
I always had the impression that the code before the scheduler is close 
to minimal register pressure because of the way expressions are generated.  
Maybe I was wrong and some optimizations (global ones like PRE) change 
this a lot.
> During the main scheduling, an instruction X that occurs at or before
> the next point of maximum pressure in the model schedule is measured
> based on the current register pressure.  If X doesn't increase the
> current pressure beyond the current maximum, its ECC(X) is zero,
> otherwise ECC(X) is the cost of going from MP to the new maximum.
> The idea is that the final net pressure of scheduling a given set of
> instructions is going to be the same regardless of the order; we simply
> want to keep the intermediate pressure under control.  An ECC(X) of zero
> usually[*] means that scheduling X next won't send the rest of the
> sequence beyond the current maximum pressure.
>
>    [*] but not always.  There's more about this in the patch comments.
>
> If an instruction X occurs _after_ the next point of maximum pressure,
> X is measured based on that maximum pressure.  If the current maximum
> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
> going from MP to MP' + dP.
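A simplified sketch of the two cases; the flat cost of 1 per excess register anticipates the choice described later in this message, and all names here are invented:

```c
#include <assert.h>

/* Cost of pressure P relative to the model schedule's maximum MODEL_MAX,
   at a flat cost of 1 per excess register.  */
static int
pressure_cost (int p, int model_max)
{
  return p > model_max ? p - model_max : 0;
}

/* ECC(X) under the "model" algorithm, heavily simplified.
   BEFORE_MAX_POINT is nonzero if X occurs at or before the next point of
   maximum pressure in the model schedule; CURR_PRESSURE and CURR_MAX are
   the current pressure and running maximum (MP') in the main schedule;
   DP is the pressure change of X.  */
static int
model_ecc (int before_max_point, int curr_pressure, int curr_max,
	   int model_max, int dp)
{
  if (before_max_point)
    {
      int new_pressure = curr_pressure + dp;
      /* Staying under the running maximum is free.  */
      if (new_pressure <= curr_max)
	return 0;
      return pressure_cost (new_pressure, model_max);
    }
  /* After the maximum point, measure against MP' + dP.  */
  return pressure_cost (curr_max + dp, model_max);
}
```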
>
> Of course, this all depends on how good a value MP is, and therefore
> on how good the model schedule is.  I tried a few variations before
> settling on the one in the patch (which I hope makes conceptual sense).
>
> I initially stayed with the idea above about assigning different costs to
> (Ra), (Rb) and (Rc).  This produces some good results, but was still a
> little too conservative in general, in that other tests were still worse
> with -fsched-pressure than without.  I described some of the problems
> with these costs above.  Another is that if an instruction X has a spill
> cost of 6, say, then:
>
>    ECC(X) + delay(X)
>
> will only allow X to be scheduled if the next instruction without
> a spill cost has a delay of 6 cycles or more.  This is overly harsh,
> especially seeing as few ARM instructions have such a high latency.
> The benefit of spilling is often to avoid a series of short
> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>
> I then adjusted positive ECC(X) values based on the priority of X
> relative to the highest-priority zero-cost instruction.  This was
> better, but a DES filter in particular still suffered from the
> "lots of short stalls" problem.
>
> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
> and simply treating each extra spill as having a cost of 1.  Somewhat
> to my surprise, there was only one test in which some improvement was
> lost; that test was now only slightly better than without -fsched-pressure.
> But the fixed cost of 1 per spill achieved the goal of being as good as
> or better than -fno-sched-pressure in almost all cases, while still being
> significantly better in some of the cases we care about.
>
> Assigning a cost of just 1 to each excess spill is clearly going to
> interfere with the normal list scheduler less often than assigning each
> spill a higher cost.  Given that this approach also gives the most important
> benefits, it felt like a good default.
>
We should use whatever works better, but I always try to find an 
explanation and reason for unexpected results.  If it is an OOO 
processor, I have no problem understanding this.
> All in all, the patch is a complicated way of being quite timid,
Yes, it scares me a bit that the pace of changes in the scheduler could 
make it as complicated as reload.  On the other hand, the pass is one of 
the most target-dependent ones and solves an NP-hard problem.  I think 
it is hard to generate good code without some complications, except of 
course with an optimal-solution approach, but that is hardly ever going 
to be acceptable from a compilation-time point of view.
> but I hope it could also be used as a basis for more aggressive
> versions in future.  Things I haven't looked at include whether
> it would make sense to disable the priority-based adjustment of
> ECC for -Os.  (That's a question of whether this patch can improve
> size still further over -fno-sched-pressure.)
>
> Being timid ought to make it fit quite well with Bernd's patch,
> although I haven't had chance to try the two together.
>
> I talked with Vlad about this a couple of months ago (thanks again for
> the help, Vlad).  He explained that some of the things I was seeing --
> especially (3) -- were deliberate decisions that had improved the
> results for x86.  And I think it's probably true that the patch below
> is only going to work well on fairly regular RISCy machines.
>
> For that reason, the patch adds an -fsched-pressure-algorithm=
> option to choose between the two versions.  For historical reasons
> I called the current algorithm "weighted" and the proposed one "model",
> but I'm sure there are better names.  I've kept the option undocumented
> because it's really an internal thing.  "weighted" remains the default.
>
> Of course, adding an alternative like this would only be acceptable if
> -fsched-pressure and -fsched-pressure-algorithm=model became the default
> for ARM (or another target).  That's a decision for the ARM maintainers;
> I've sent Ramana the results I got, which unfortunately I can't share
> more widely.
>
I guess the result is positive, otherwise you would not have submitted 
the patch.  Although I always place performance improvement first, it 
would be interesting to know how the compilation time changes.
> If anyone does have time to test this on their favourite port,
> I'd really appreciate it.  I already know that it doesn't perform
> well on SPECFP for rs6000 because of the choice of pressure classes;
> I'll send a separate note about that so that it doesn't get lost in
> this long essay.
>
Yes, before I submitted coloring without cover classes, we had a 1-1 
relation between cover classes and register pressure classes.  Cover 
classes were an approximation of the classes that should be used for 
pseudos.  Coloring without cover classes gives more accurate coloring, 
but there is no 1-1 relation with pressure classes anymore.  It is no 
surprise to me that the biggest performance improvement I saw when I 
submitted the patch was on SPECFP2000 on ppc.
> Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
> are enabled.  I'm planning to send a patch for that at some point;
> it's a regression from 4.6, so seems like stage 3 material.
> In the meantime, a simple workaround is to stick "return;"
> at the beginning of the function.
>
>
Richard, thanks for working on these problems.  I thought about some of 
them but I never had time to address them.  Overall, I have a very 
positive opinion of your systematic approach.

As for the patch itself, I will look at it with more attention at the 
beginning of next year.  As I understand it, there is no rush because 
we are not yet at stage 1.

Once again, thanks for your work.
Richard Sandiford Dec. 28, 2011, 1:51 p.m. UTC | #3
Vladimir Makarov <vmakarov@redhat.com> writes:
>> In the end I tried an ad-hoc approach in an attempt to do something
>> about (2), (3) and (4b).  The idea was to construct a preliminary
>> "model" schedule in which the only objective is to keep register
>> pressure to a minimum.  This schedule ignores pipeline characteristics,
>> latencies, and the number of available registers.  The maximum pressure
>> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>>
> I always had the impression that the code before the scheduler is close 
> to minimal register pressure because of the way expressions are generated.  
> Maybe I was wrong and some optimizations (global ones like PRE) change 
> this a lot.

One of the examples I was looking at was:

-----------------------------------------------------------------------------
#include <stdint.h>

#define COUNT 8

void
loop (uint8_t *__restrict dst, uint8_t *__restrict src, uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
{
  const int w = COUNT;
  uint8_t *cm = ff_cropTbl + 1024;
  for(int i=0; i<w; i++)
    {
      const int srcB = src[-2*srcStride];
      const int srcA = src[-1*srcStride];
      const int src0 = src[0 *srcStride];
      const int src1 = src[1 *srcStride];
      const int src2 = src[2 *srcStride];
      const int src3 = src[3 *srcStride];
      const int src4 = src[4 *srcStride];
      const int src5 = src[5 *srcStride];
      const int src6 = src[6 *srcStride];
      const int src7 = src[7 *srcStride];
      const int src8 = src[8 *srcStride];
      const int src9 = src[9 *srcStride];
      const int src10 = src[10*srcStride];

      dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
      dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
      dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
      dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
      dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
      dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
      dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
      dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
      dst++;
      src++;
    }
}
-----------------------------------------------------------------------------

(based on the libav h264 code).  In this example the loads from src and
stores to dst are still in their original order by the time we reach sched1,
so src, dst, srcA, srcB, and src0..10 are all live at once.  There's no
aliasing reason why they can't be reordered, and we do that during
scheduling.

>> During the main scheduling, an instruction X that occurs at or before
>> the next point of maximum pressure in the model schedule is measured
>> based on the current register pressure.  If X doesn't increase the
>> current pressure beyond the current maximum, its ECC(X) is zero,
>> otherwise ECC(X) is the cost of going from MP to the new maximum.
>> The idea is that the final net pressure of scheduling a given set of
>> instructions is going to be the same regardless of the order; we simply
>> want to keep the intermediate pressure under control.  An ECC(X) of zero
>> usually[*] means that scheduling X next won't send the rest of the
>> sequence beyond the current maximum pressure.
>>
>>    [*] but not always.  There's more about this in the patch comments.
>>
>> If an instruction X occurs _after_ the next point of maximum pressure,
>> X is measured based on that maximum pressure.  If the current maximum
>> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
>> going from MP to MP' + dP.
>>
>> Of course, this all depends on how good a value MP is, and therefore
>> on how good the model schedule is.  I tried a few variations before
>> settling on the one in the patch (which I hope makes conceptual sense).
>>
>> I initially stayed with the idea above about assigning different costs to
>> (Ra), (Rb) and (Rc).  This produces some good results, but was still a
>> little too conservative in general, in that other tests were still worse
>> with -fsched-pressure than without.  I described some of the problems
>> with these costs above.  Another is that if an instruction X has a spill
>> cost of 6, say, then:
>>
>>    ECC(X) + delay(X)
>>
>> will only allow X to be scheduled if the next instruction without
>> a spill cost has a delay of 6 cycles or more.  This is overly harsh,
>> especially seeing as few ARM instructions have such a high latency.
>> The benefit of spilling is often to avoid a series of short
>> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>>
>> I then adjusted positive ECC(X) values based on the priority of X
>> relative to the highest-priority zero-cost instruction.  This was
>> better, but a DES filter in particular still suffered from the
>> "lots of short stalls" problem.
>>
>> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
>> and simply treating each extra spill as having a cost of 1.  Somewhat
>> to my surprise, there was only one test in which some improvement was
>> lost; that test was now only slightly better than without -fsched-pressure.
>> But the fixed cost of 1 per spill achieved the goal of being as good as
>> or better than -fno-sched-pressure in almost all cases, while still being
>> significantly better in some of the cases we care about.
>>
>> Assigning a cost of just 1 to each excess spill is clearly going to
>> interfere with the normal list scheduler less often than assigning each
>> spill a higher cost.  Given that this approach also gives the most important
>> benefits, it felt like a good default.
>>
> We should use whatever works better, but I always try to find an 
> explanation and reason for unexpected results.  If it is an OOO 
> processor, I have no problem understanding this.

Yeah, it was an OOO processor, but the behaviour appeared to be
independent of that.  The differences in performance were often
also reflected in the execution times that GCC itself predicted
during sched2.

The reasons appeared to be the ones I mentioned above:

 (1) Using a memory-based spill cost is arbitrary because we don't
     know how many or what type of spill instructions will be needed,
     and where they will be placed in relation to the other instructions.

     E.g. I saw a few cases where keeping the number of live GPRs under
     14 + X, but closer to 14 + X more of the time, produced more spills
     and worse sched2 code than allowing the number of live GPRs to peak
     at a higher value and go down sooner.  And that's no surprise, of
     course: this is very much a heuristic.

 (2) If the load+store cost is 6 (which seems a reasonable value for
     ARM GPRs) then with the current costs, we will only schedule an
     instruction that increases pressure by 1 if all instructions
     that don't increase pressure have a latency of 6 cycles or more.

     I tried to get around this by adjusting the costs based on priorities.
     However, I still saw situations where we had several instructions
     of a similar priority that all increased pressure.  Once we scheduled
     one of those -- X say -- we wouldn't schedule another (Y) until X's
     dependents themselves increased pressure or until those dependents
     reached a low enough priority to cancel out the load+store cost
     in ECC(Y).  If we schedule Y in that situation, we end up with the
     same pressure as if we'd scheduled Y straight after X, making those
     stalls unnecessary.

Maybe it was just the choice of examples I looked at in detail,
but (2) seemed like the bigger factor.

>> All in all, the patch is a complicated way of being quite timid,
> Yes, it scares me a bit that the pace of changes in the scheduler could 
> make it as complicated as reload.  On the other hand, the pass is one of 
> the most target-dependent ones and solves an NP-hard problem.  I think 
> it is hard to generate good code without some complications, except of 
> course with an optimal-solution approach, but that is hardly ever going 
> to be acceptable from a compilation-time point of view.

Yeah, complication is definitely a problem here.  We did try running
benchmarks with the first scheduling pass disabled, but unfortunately,
the pass is just too important to consider dropping.

>> but I hope it could also be used as a basis for more aggressive
>> versions in future.  Things I haven't looked at include whether
>> it would make sense to disable the priority-based adjustment of
>> ECC for -Os.  (That's a question of whether this patch can improve
>> size still further over -fno-sched-pressure.)
>>
>> Being timid ought to make it fit quite well with Bernd's patch,
>> although I haven't had chance to try the two together.
>>
>> I talked with Vlad about this a couple of months ago (thanks again for
>> the help, Vlad).  He explained that some of the things I was seeing --
>> especially (3) -- were deliberate decisions that had improved the
>> results for x86.  And I think it's probably true that the patch below
>> is only going to work well on fairly regular RISCy machines.
>>
>> For that reason, the patch adds an -fsched-pressure-algorithm=
>> option to choose between the two versions.  For historical reasons
>> I called the current algorithm "weighted" and the proposed one "model",
>> but I'm sure there are better names.  I've kept the option undocumented
>> because it's really an internal thing.  "weighted" remains the default.
>>
>> Of course, adding an alternative like this would only be acceptable if
>> -fsched-pressure and -fsched-pressure-algorithm=model became the default
>> for ARM (or another target).  That's a decision for the ARM maintainers;
>> I've sent Ramana the results I got, which unfortunately I can't share
>> more widely.
>>
> I guess the result is positive, otherwise you would not have submitted 
> the patch.  Although I always place performance improvement first, it 
> would be interesting to know how the compilation time changes.

I tried compiling SPEC2000 (CINT + CFP) for arm-linux-gnueabi on
x86_64-linux-gnu without -fsched-pressure and then with both
-fsched-pressure and -fsched-pressure-algorithm=model.
I couldn't detect a difference outside the noise.

> As for the patch itself, I will look at it with more attention at the 
> beginning of next year.  As I understand it, there is no rush because 
> we are not yet at stage 1.

Thanks, appreciate it.  And yeah, there's definitely no rush: there's
no way this can go in 4.7.

Richard
Bernd Schmidt Jan. 9, 2012, 12:45 p.m. UTC | #4
On 12/23/2011 12:46 PM, Richard Sandiford wrote:
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b).  The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum.  This schedule ignores pipeline characteristics,
> latencies, and the number of available registers.  The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).

Interesting. I had also thought about trying to address the problem in
the scheduler, so I'll just share some of my thoughts (not intended as a
negative comment on your approach).

Essentially the problem I usually see is that the wider your machine
gets, the happier sched1 will be to fill unused slots with instructions
that could just as well be scheduled 200 cycles later. That's really an
artifact of forward scheduling, so why not schedule both forwards and
then backwards (or the other way round)? Produce an initial schedule,
then perform a fixup phase: start at the other end and look for
instructions that can be scheduled in a wide range of cycles, and move
them if doing so can be shown to reduce register pressure, and we retain
a correct schedule. This can be done without trying to have a global
pressure estimate which has the problems you noted.

Doing this would require constructing a backwards DFA, and I gave up on
this for the moment after a day or so spent with genautomata, but it
should be possible.


Bernd
Vladimir Makarov Jan. 10, 2012, 4:38 p.m. UTC | #5
On 12/28/2011 08:51 AM, Richard Sandiford wrote:
Vladimir Makarov <vmakarov@redhat.com> writes:
>>> In the end I tried an ad-hoc approach in an attempt to do something
>>> about (2), (3) and (4b).  The idea was to construct a preliminary
>>> "model" schedule in which the only objective is to keep register
>>> pressure to a minimum.  This schedule ignores pipeline characteristics,
>>> latencies, and the number of available registers.  The maximum pressure
>>> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>>>
>> I always had the impression that the code before the scheduler is close
>> to minimal register pressure because of the way expressions are generated.
>> Maybe I was wrong and some optimizations (global ones like PRE) change
>> this a lot.
> One of the examples I was looking at was:
>
> -----------------------------------------------------------------------------
> #include<stdint.h>
>
> #define COUNT 8
>
> void
> loop (uint8_t *__restrict dst, uint8_t *__restrict src, uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
> {
>    const int w = COUNT;
>    uint8_t *cm = ff_cropTbl + 1024;
>    for(int i=0; i<w; i++)
>      {
>        const int srcB = src[-2*srcStride];
>        const int srcA = src[-1*srcStride];
>        const int src0 = src[0 *srcStride];
>        const int src1 = src[1 *srcStride];
>        const int src2 = src[2 *srcStride];
>        const int src3 = src[3 *srcStride];
>        const int src4 = src[4 *srcStride];
>        const int src5 = src[5 *srcStride];
>        const int src6 = src[6 *srcStride];
>        const int src7 = src[7 *srcStride];
>        const int src8 = src[8 *srcStride];
>        const int src9 = src[9 *srcStride];
>        const int src10 = src[10*srcStride];
>
>        dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
>        dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
>        dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
>        dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
>        dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
>        dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
>        dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
>        dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
>        dst++;
>        src++;
>      }
> }
> -----------------------------------------------------------------------------
>
> (based on the libav h264 code).  In this example the loads from src and
> stores to dst are still in their original order by the time we reach sched1,
> so src, dst, srcA, srcB, and src0..10 are all live at once.  There's no
> aliasing reason why they can't be reordered, and we do that during
> scheduling.
>
Thanks for the example.

>
>> As for the patch itself, I will look at it with more attention at the
>> beginning of next year.  As I understand it, there is no rush because
>> we are not yet at stage 1.
> Thanks, appreciate it.  And yeah, there's definitely no rush: there's
> no way this can go in 4.7.
>
Ok.  Thanks, Richard.
Vladimir Makarov Jan. 10, 2012, 4:41 p.m. UTC | #6
On 01/09/2012 07:45 AM, Bernd Schmidt wrote:
> On 12/23/2011 12:46 PM, Richard Sandiford wrote:
>> In the end I tried an ad-hoc approach in an attempt to do something
>> about (2), (3) and (4b).  The idea was to construct a preliminary
>> "model" schedule in which the only objective is to keep register
>> pressure to a minimum.  This schedule ignores pipeline characteristics,
>> latencies, and the number of available registers.  The maximum pressure
>> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
> Interesting. I had also thought about trying to address the problem in
> the scheduler, so I'll just share some of my thoughts (not intended as a
> negative comment on your approach).
>
> Essentially the problem I usually see is that the wider your machine
> gets, the happier sched1 will be to fill unused slots with instructions
> that could just as well be scheduled 200 cycles later. That's really an
> artifact of forward scheduling, so why not schedule both forwards and
> then backwards (or the other way round)? Produce an initial schedule,
> then perform a fixup phase: start at the other end and look for
> instructions that can be scheduled in a wide range of cycles, and move
> them if doing so can be shown to reduce register pressure, and we retain
> a correct schedule. This can be done without trying to have a global
> pressure estimate which has the problems you noted.
>
I have seen such approaches in the literature.  It would be interesting 
to see how they work in GCC.

Although, I believe that a register pressure minimization pass before RA, 
with the selective insn scheduler used only after RA, could solve all 
these problems.  But it needs a lot of investigation to confirm this.  
It is also hard for me to say which approach would require more effort 
to implement.
> Doing this would require constructing a backwards DFA, and I gave up on
> this for the moment after a day or so spent with genautomata, but it
> should be possible.
>
Long ago (10 years ago) I considered doing exactly this, in order to 
remove the 2nd separate insn scheduler by inserting spill code into an 
already existing insn schedule.  But I decided against it.

I also considered generating DFAs with repeated insn issue to better 
serve the modulo scheduler.

There are many possible directions for developing automatically 
generated pipeline hazard recognizers that go beyond (N)DFAs, e.g. 
simulating hidden register renaming or the look-ahead buffers/queues of 
different OOO processors.  All of these directions could be interesting 
research projects.  But even in its current state, IMHO the GCC (N)DFA 
pipeline hazard recognizer, after its 10 years of existence, is the most 
powerful one used in industrial compilers.
diff mbox

Patch

gcc/
	* sched-deps.c (fixup_sched_groups): Rename to...
	(chain_to_prev_insn): ...this.
	(chain_to_prev_insn_p): New function.
	(deps_analyze_insn): Use it instead of SCHED_GROUP_P.

Index: gcc/sched-deps.c
===================================================================
--- gcc/sched-deps.c	2011-12-22 14:15:38.000000000 +0000
+++ gcc/sched-deps.c	2011-12-22 14:22:40.749212237 +0000
@@ -477,7 +477,7 @@  static void add_dependence_list (rtx, rt
 static void add_dependence_list_and_free (struct deps_desc *, rtx,
 					  rtx *, int, enum reg_note);
 static void delete_all_dependences (rtx);
-static void fixup_sched_groups (rtx);
+static void chain_to_prev_insn (rtx);
 
 static void flush_pending_lists (struct deps_desc *, rtx, int, int);
 static void sched_analyze_1 (struct deps_desc *, rtx, rtx);
@@ -1655,7 +1655,7 @@  delete_all_dependences (rtx insn)
    the previous nonnote insn.  */
 
 static void
-fixup_sched_groups (rtx insn)
+chain_to_prev_insn (rtx insn)
 {
   sd_iterator_def sd_it;
   dep_t dep;
@@ -3294,7 +3294,7 @@  sched_analyze_insn (struct deps_desc *de
 	       instructions that follow seem like they should be part
 	       of the call group.
 
-	       Also, if we did, fixup_sched_groups() would move the
+	       Also, if we did, chain_to_prev_insn would move the
 	       deps of the debug insn to the call insn, modifying
 	       non-debug post-dependency counts of the debug insn
 	       dependencies and otherwise messing with the scheduling
@@ -3440,6 +3440,37 @@  call_may_noreturn_p (rtx insn)
   return true;
 }
 
+/* Return true if INSN should be made dependent on the previous instruction
+   group, and if all INSN's dependencies should be moved to the first
+   instruction of that group.  */
+
+static bool
+chain_to_prev_insn_p (rtx insn)
+{
+  rtx prev, x;
+
+  /* INSN forms a group with the previous instruction.  */
+  if (SCHED_GROUP_P (insn))
+    return true;
+
+  /* If the previous instruction clobbers a register R and this one sets
+     part of R, the clobber was added specifically to help us track the
+     liveness of R.  There's no point scheduling the clobber and leaving
+     INSN behind, especially if we move the clobber to another block.  */
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (prev
+      && INSN_P (prev)
+      && BLOCK_FOR_INSN (prev) == BLOCK_FOR_INSN (insn)
+      && GET_CODE (PATTERN (prev)) == CLOBBER)
+    {
+      x = XEXP (PATTERN (prev), 0);
+      if (set_of (x, insn))
+	return true;
+    }
+
+  return false;
+}
+
 /* Analyze INSN with DEPS as a context.  */
 void
 deps_analyze_insn (struct deps_desc *deps, rtx insn)
@@ -3607,8 +3638,9 @@  deps_analyze_insn (struct deps_desc *dep
 
   /* Fixup the dependencies in the sched group.  */
   if ((NONJUMP_INSN_P (insn) || JUMP_P (insn))
-      && SCHED_GROUP_P (insn) && !sel_sched_p ())
-    fixup_sched_groups (insn);
+      && chain_to_prev_insn_p (insn)
+      && !sel_sched_p ())
+    chain_to_prev_insn (insn);
 }
 
 /* Initialize DEPS for the new block beginning with HEAD.  */
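
To make the case that chain_to_prev_insn_p checks for concrete: the 
clobber/set pair it detects typically comes from multiword moves and 
arithmetic, where expand clobbers the full register before setting it 
piecewise.  A hypothetical sketch of the RTL (pseudo-register numbers 
invented):

```lisp
;; Clobber added purely so that dataflow sees pseudo 100 as dead
;; before the partial subreg writes below.
(insn (clobber (reg:DI 100)))
;; The two word-sized halves are then set separately.
(insn (set (subreg:SI (reg:DI 100) 0) (reg:SI 101)))
(insn (set (subreg:SI (reg:DI 100) 4) (reg:SI 102)))
```

Scheduling the clobber away from the first subreg set (or into another 
block) would defeat the liveness tracking it exists for, hence the new 
predicate chains them together.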