RFC: An alternative -fsched-pressure implementation

So it looks like two pieces of work related to scheduling and register
pressure are being posted close together.  This one is an RFC for a less
aggressive form of -fsched-pressure.  I think it should complement
rather than compete with Bernd's IRA patch.  It seems like a good idea
to take register pressure into account during the first scheduling pass,
where we can still easily look at things like instruction latencies
and pipeline utilisation.  Better rematerialisation in the register
allocator would obviously be a good thing too though.

This patch started when we (Linaro) saw a big drop in performance
from vectorising an RGB to YIQ filter on ARM.  The first scheduling
pass was overly aggressive in creating a "wide" schedule, and caused
the newly-vectorised loop to contain lots of spills.  The loop grew so
big that it even had a constant pool in the middle of it.

-fsched-pressure did a very good job on this loop, creating far fewer
spills and consequently avoiding the constant pool.  However, it seemed
to make several other cases significantly worse.  The idea was therefore
to try to come up with a form of -fsched-pressure that could be turned
on for ARM by default.

Current -fsched-pressure works by assigning an "excess (pressure) cost change"
to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
changes the way that the main list scheduler handles stalls due to data
dependencies.  If an instruction would stall for N cycles, the scheduler
would normally add it to the "now+N" queue, then add it to the ready queue
after N cycles.  With -fsched-pressure, it instead adds the instruction
to the ready queue immediately, while still recording that the instruction
would require N stalls.  I'll write the number of stalls on X as delay(X).

This arrangement allows the scheduler to choose between increasing register
pressure and introducing a deliberate stall.  Instructions are ranked by:

  (a) lowest ECC(X) + delay(X)
  (b) lowest delay(X)
  (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)

Note that since delay(X) is measured in cycles, ECC(X) is effectively
measured in cycles too.

Several things seemed to be causing the degradations we were seeing
with -fsched-pressure:

  (1) The -fsched-pressure schedule is based purely on instruction latencies
      and issue rate; it doesn't take the DFA into account.  This means that
      we attempt to "dual issue" things like vector operations, loads and
      stores on Cortex A8 and A9.  In the examples I looked at, these sorts
      of inaccuracy seemed to accumulate, so that the value of delay(X)
      became based on slightly unrealistic cycle times.

      Note that this also affects code that doesn't have any pressure
      problems; it isn't limited to code that does.

      This may simply be historical.  It became much easier to use the
      DFA here after Bernd's introduction of prune_ready_list, but the
      original -fsched-pressure predates that.

  (2) We calculate ECC(X) by walking the unscheduled part of the block
      in its original order, then recording the pressure at each instruction.
      This seemed to make ECC(X) quite sensitive to that original order.
      I saw blocks that started out naturally "narrow" (not much ILP,
      e.g. from unrolled loops) and others that started naturally "wide"
      (a lot of ILP, such as in the libav h264 code), and walking the
      block in order meant that the two styles would be handled differently.

  (3) When calculating the pressure of the original block (as described
      in (2)), we ignore the deaths of registers that are used by more
      than one unscheduled instruction.  This tended to hurt long(ish)
      loops in things like filters, where the same value is often used
      as an input to two calculations.  The effect was that instructions
      towards the end of the block would appear to have very high pressure.
      This in turn made the algorithm very conservative; it wouldn't
      promote instructions from later in the block because those
      instructions seemed to have a prohibitively large cost.

      I asked Vlad about this, and he confirmed that it was a deliberate
      decision.  He'd tried honouring REG_DEAD notes instead, but it
      produced worse results on x86.  I'll return to this at the end.

  (4) ECC(X) is based on the pressure over and above ira_available_class_regs
      (the number of allocatable registers in a given class).  ARM has 14
      allocatable GENERAL_REGS: 16 minus the stack pointer and program
      counter.  So if 14 integer variables are live across a loop but
      not referenced within it, we end up scheduling that loop in a context
      of permanent pressure.  Pressure becomes the overriding concern,
      and we don't get much ILP.

      I suppose there are at least two ways of viewing this:

      (4a) We're giving an equal weight to:

      	   (Ra) registers that are live across a loop but not used within it
	        (and which should therefore require no spill code in the
		loop itself)

	   (Rb) registers that hold loop invariants (and so should only
	   	require loads in the loop itself)

	   (Rc) registers that are used and set within the loop (and so would
	        require both loads and stores in the loop itself)

	   We effectively treat everything as (Rc).

      (4b) We always work to ira_available_class_regs, rather than to
      	   an estimate of what maximum pressure is realistically achievable.
	   If a block has something like:

	       A: R2 = *R1
	          ...
	       B: R3 = *(R1 + 4)
	          ...
	       C: R1 = R1 + 8
	          ...
	       D: R4 = R2 + R3
	          ...

	   then it can't help but have three registers live at once.

The first thing I tried was to change just (4a) in isolation.
Taking (Ra) out of the pressure estimation produced some benefits,
but not enough.  I then tried a "staging" of costs, so that:

  (Ra) had the cost of a load and store in the outer loop (if any)
  (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
  (Rc) had the cost of a load and store in the inner loop ( " " )

These costs were still based on MEMORY_MOVE_COST, just like the current
ones are.  However, MEMORY_MOVE_COST is defined to be relative to
REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
"simple" move.  Since ECC(X) is effectively a cycle count, it seemed
better to divide the cost by 2.

This again produced some benefit, but not enough.  I think part of the
problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
loads and stores cost the equivalent of 5 register moves, making the cost
of a full spill equivalent to 10 register moves.  ARM also makes these
costs independent of the mode, so that a single SImode load and store
has the same cost as a 4-quad-vector load and store.  This might be
something that's worth looking at, especially since the same costs
influence spilling decisions in the register allocator.

However, even reducing the cost of a load to 4 moves and a store to 2
moves didn't bring enough benefit (although it was better than 5 and 5).
The division of costs above is also a little artificial.  Because we don't
have an integrated register allocator and scheduler, there's not really
any telling how many times the block will need to load or store a value.
An invariant might be used many times, and in such a way that several
loads are needed, while another variable might be involved in a single
read-modify-write sequence.  The true cost also depends on how easily
the spill instructions fit into the surrounding code.

On a different tack, I tried to tackle (1) by taking the DFA into account.
If delay(X) is zero, but X cannot issue due to resource hazards, I set
delay(X) to 1.  Again this wasn't enough in itself, although it does
still form an important part of the "final" proposed version.

I tried the algorithms from a few papers, but they too tended to be
overly conservative or to suffer from degenerate behaviour in filters.

In the end I tried an ad-hoc approach in an attempt to do something
about (2), (3) and (4b).  The idea was to construct a preliminary
"model" schedule in which the only objective is to keep register
pressure to a minimum.  This schedule ignores pipeline characteristics,
latencies, and the number of available registers.  The maximum pressure
seen in this initial model schedule (MP) is then the benchmark for ECC(X).

During the main scheduling, an instruction X that occurs at or before
the next point of maximum pressure in the model schedule is measured
based on the current register pressure.  If X doesn't increase the
current pressure beyond the current maximum, its ECC(X) is zero,
otherwise ECC(X) is the cost of going from MP to the new maximum.
The idea is that the final net pressure of scheduling a given set of
instructions is going to be the same regardless of the order; we simply
want to keep the intermediate pressure under control.  An ECC(X) of zero
usually[*] means that scheduling X next won't send the rest of the
sequence beyond the current maximum pressure.

  [*] but not always.  There's more about this in the patch comments.

If an instruction X occurs _after_ the next point of maximum pressure,
X is measured based on that maximum pressure.  If the current maximum
pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
going from MP to MP' + dP.

Of course, this all depends on how good a value MP is, and therefore
on how good the model schedule is.  I tried a few variations before
settling on the one in the patch (which I hope makes conceptual sense).

I initially stayed with the idea above about assigning different costs to
(Ra), (Rb) and (Rc).  This produces some good results, but was still a
little too conservative in general, in that other tests were still worse
with -fsched-pressure than without.  I described some of the problems
with these costs above.  Another is that if an instruction X has a spill
cost of 6, say, then:

  ECC(X) + delay(X)

will only allow X to be scheduled if the next instruction without
a spill cost has a delay of 6 cycles or more.  This is overly harsh,
especially seeing as few ARM instructions have such a high latency.
The benefit of spilling is often to avoid a series of short
(e.g. single-cycle) stalls, rather than to avoid a single long one.

I then adjusted positive ECC(X) values based on the priority of X
relative to the highest-priority zero-cost instruction.  This was
better, but a DES filter in particular still suffered from the
"lots of short stalls" problem.

Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
and simply treating each extra spill as having a cost of 1.  Somewhat
to my surprise, there was only one test in which some improvement was
lost; that test was now only slightly better than without -fsched-pressure.
But the fixed cost of 1 per spill achieved the goal of being as good as
or better than -fno-sched-pressure in almost all cases, while still being
significantly better in some of the cases we care about.

Assigning a cost of just 1 to each excess spill is clearly going to
interfere with the normal list scheduler less often than assigning each
spill a higher cost.  Given that this appraoch also gives the most important
benefits, it felt like a good default.

All in all, the patch is a complicated way of being quite timid,
but I hope it could also be used as a basis for more aggressive
versions in future.  Things I haven't looked at include whether
it would make sense to disable the priority-based adjustment of
ECC for -Os.  (That's a question of whether this patch can improve
size still further over -fno-sched-pressure.)

Being timid ought to make it fit quite well with Bernd's patch,
although I haven't had chance to try the two together.

I talked with Vlad about this a couple of months ago (thanks again for
the help, Vlad).  He explained that some of the things I was seeing --
especially (3) -- were deliberate decisions that had improved the
results for x86.  And I think it's probably true that the patch below
is only going to work well on fairly regular RISCy machines.

For that reason, the patch adds an -fsched-pressure-algorithm=
option to choose between the two versions.  For historical reasons
I called the current algorithm "weighted" and the proposed one "model",
but I'm sure there are better names.  I've kept the option undocumented
because it's really an internal thing.  "weighted" remains the default.

Of course, adding an alternative like this would only be acceptable if
-fsched-pressure and -fsched-pressure-algorithm=model became the default
for ARM (or another target).  That's a decision for the ARM maintainers;
I've sent Ramana the results I got, which unfortunately I can't share
more widely.

If anyone does have time to test this on their favourite port,
I'd really appreciate it.  I already know that it doesn't perform
well on SPECFP for rs6000 because of the choice of pressure classes;
I'll send a separate note about that so that it doesn't get lost in
this long essay.

Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
are enabled.  I'm planning to send a patch for that at some point;
it's a regression from 4.6, so seems like stage 3 material.
In the meantime, a simple workaround is to stick "return;"
at the beginning of the function.

Richard

RFC: An alternative -fsched-pressure implementation

Commit Message

Comments

Patch