Add a new combine pass

Message ID mpteey6xeip.fsf@arm.com
State New

Commit Message

Richard Sandiford Nov. 17, 2019, 11:35 p.m. UTC
(It's 23:35 local time, so it's still just about stage 1. :-))

While working on SVE, I've noticed several cases in which we fail
to combine instructions because the combined form would need to be
placed earlier in the instruction stream than the last of the
instructions being combined.  This includes one very important
case in the handling of the first fault register (FFR).

Combine currently requires the combined instruction to live at the same
location as i3.  I thought about trying to relax that restriction, but it
would be difficult to do with the current pass structure while keeping
everything linear-ish time.

So this patch instead goes for an option that has been talked about
several times over the years: writing a new combine pass that just
does instruction combination, and not all the other optimisations
that have been bolted onto combine over time.  E.g. it deliberately
doesn't do things like nonzero-bits tracking, since that really ought
to be a separate, more global, optimisation.

This is still far from being a realistic replacement for even
the combine parts of the current combine pass.  E.g.:

- it only handles combinations that can be built up from individual
  two-instruction combinations.

- it doesn't allow new hard register clobbers to be added.

- it doesn't have the special treatment of CC operations.

- etc.

But we have to start somewhere.

On a more positive note, the pass handles things that the current
combine pass doesn't:

- the main motivating feature mentioned above: it works out where
  the combined instruction could validly live and moves it there
  if necessary.  If there is a range of valid places, it tries
  to pick the best one based on register pressure (although only
  with a simple heuristic for now).

- once it has combined two instructions, it can try combining the
  result with both later and earlier code, i.e. it can combine
  in both directions.

- it tries using REG_EQUAL notes for the final instruction.

- it can parallelise two independent instructions that both read from
  the same register or both read from memory.

This last feature is useful for generating more load-pair combinations
on AArch64.  In some cases it can also produce more store-pair combinations,
but only for consecutive stores.  However, since the pass currently does
this in a very greedy, peephole way, it only allows load/store-pair
combinations if the first memory access has a higher alignment than
the second, i.e. if we can be sure that the combined access is naturally
aligned.  This should help it to make better decisions than the post-RA
peephole pass in some cases while not being too aggressive.
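
As a rough illustration of that alignment condition (a sketch only, not
code from the patch): the pass works from the alignment it knows for each
access at compile time, but concrete addresses can stand in for that
knowledge.  The helper names below are hypothetical, and the sketch assumes
two adjacent accesses of the same power-of-two size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest power of two that divides ADDR (ADDR must be nonzero).
   This stands in for the alignment the compiler knows statically.  */
static uintptr_t
known_alignment (uintptr_t addr)
{
  return addr & (~addr + 1);
}

/* The condition described above: allow the pair only if the first
   access is more aligned than the second.  SIZE is the size in bytes
   of each individual access.  */
static bool
pair_allowed (uintptr_t first_addr, unsigned int size)
{
  uintptr_t second_addr = first_addr + size;
  return known_alignment (first_addr) > known_alignment (second_addr);
}

/* Whether an access of BYTES bytes at ADDR is naturally aligned.  */
static bool
naturally_aligned (uintptr_t addr, unsigned int bytes)
{
  return addr % bytes == 0;
}
```

With 8-byte accesses, a first address of 32 passes the check and the
16-byte pair is naturally aligned, whereas a first address of 8 fails the
check and the pair would not be naturally aligned.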

The pass is supposed to be linear time without debug insns.
It only tries a constant number C of combinations per instruction
and its bookkeeping updates are constant-time.  Once it has combined two
instructions, it'll try up to C combinations on the result, but this can
be counted against the instruction that was deleted by the combination
and so effectively just doubles the constant.  (Note that C depends
on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
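
One way to read that counting argument, as a sketch: let n be the number
of instructions the pass starts with (a symbol introduced here, not one
used in the patch).  Each of those instructions gets at most C attempts.
Each successful combination deletes one instruction, so there can be at
most n - 1 combined results, each of which also gets at most C attempts:

```latex
\[
  \underbrace{C \cdot n}_{\text{original instructions}}
  \;+\;
  \underbrace{C \cdot (n - 1)}_{\text{results of combinations}}
  \;<\; 2Cn
\]
```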

Unfortunately, debug updates via propagate_for_debug are more expensive.
This could probably be fixed if the pass did more to track debug insns
itself, but using propagate_for_debug matches combine's behaviour.

The patch adds two instances of the new pass: one before combine and
one after it.  By default both are disabled, but this can be changed
using the new 3-bit run-combine param, where:

- bit 0 selects the new pre-combine pass
- bit 1 selects the main combine pass
- bit 2 selects the new post-combine pass

The idea is that run-combine=3 can be used to see which combinations
are missed by the new pass, while run-combine=6 (which I hope to be
the production setting for AArch64 at -O2+) just uses the new pass
to mop up cases that normal combine misses.  Maybe in some distant
future, the pass will be good enough for run-combine=1 or run-combine=4
(i.e. the new pass on its own) to be a realistic option.
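
For illustration, the bit layout can be decoded as in the following
sketch.  The enumerator and function names are hypothetical and do not
appear in the patch; only the bit positions come from the description
above.

```c
#include <stdbool.h>

/* Bit assignments as described above.  The names are illustrative.  */
enum run_combine_bits
{
  RUN_COMBINE_PRE = 1 << 0,   /* new pass, run before combine */
  RUN_COMBINE_MAIN = 1 << 1,  /* the existing combine pass */
  RUN_COMBINE_POST = 1 << 2   /* new pass, run after combine */
};

/* Whether the new pre-combine pass is selected.  */
static bool
runs_pre_pass (int run_combine)
{
  return (run_combine & RUN_COMBINE_PRE) != 0;
}

/* Whether the main combine pass is selected.  */
static bool
runs_main_pass (int run_combine)
{
  return (run_combine & RUN_COMBINE_MAIN) != 0;
}

/* Whether the new post-combine pass is selected.  */
static bool
runs_post_pass (int run_combine)
{
  return (run_combine & RUN_COMBINE_POST) != 0;
}
```

So the default value of 2 selects only the main pass, 3 selects the new
pass followed by combine, and 6 selects combine followed by the new pass.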

I ended up having to add yet another validate_simplify_* routine,
this time to do the equivalent of:

   newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
   validate_change (insn, loc, newx, 1);

but in a more memory-efficient way.  validate_replace_rtx isn't suitable
because it deliberately only tries simplifications in limited cases:

  /* Do changes needed to keep rtx consistent.  Don't do any other
     simplifications, as it is not our job.  */

And validate_simplify_insn isn't useful for this case because it works
on patterns that have already had changes made to them and expects
those patterns to be valid rtxes.  simplify-replace operations instead
need to simplify as they go, when the original modes are still to hand.

As far as compile-time goes, I tried compiling optabs.ii at -O2
with an --enable-checking=release compiler:

run-combine=2 (normal combine):  100.0% (baseline)
run-combine=4 (new pass only):    98.0%
run-combine=6 (both passes):     100.3%

where the results are easily outside the noise.  So the pass on
its own is quicker than combine, but that's not a fair comparison
when it doesn't do everything combine does.  Running both passes
only has a slight overhead.

To get a feel for the effect on multiple targets, I did my usual
bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
and g++.dg, this time comparing run-combine=2 and run-combine=6
using -O2 -ftree-vectorize:

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu       3974  -39393   -2275      90      -2
aarch64_be-linux-gnu    3389  -36683   -2275     165      -2
alpha-linux-gnu         4154  -62860   -2132     335      -2
amdgcn-amdhsa           4818    9079   -7987   51850      -2
arc-elf                 2868  -63710  -18998     286      -1
arm-linux-gnueabi       4053  -80404  -10019     605      -2
arm-linux-gnueabihf     4053  -80404  -10019     605      -2
avr-elf                 3620   38513   -2386   23364       2
bfin-elf                2691  -32973   -1483    1127      -2
bpf-elf                 5581  -78105  -11064     113      -3
c6x-elf                 3915  -31710   -2441    1560      -2
cr16-elf                6030  192102   -1757   60009      12
cris-elf                2217  -30794   -1716     294      -2
csky-elf                2003  -24989   -9999    1468      -2
epiphany-elf            3345  -19416   -1803    4594      -2
fr30-elf                3562  -15077   -1921    2334      -1
frv-linux-gnu           2423  -16589   -1736     999      -1
ft32-elf                2246  -46337  -15988     433      -2
h8300-elf               2581  -33553   -1403     168      -2
hppa64-hp-hpux11.23     3926 -120876  -50134    1056      -2
i686-apple-darwin       3562  -46851   -1764     310      -2
i686-pc-linux-gnu       2902   -3639   -4809    6848      -2
ia64-linux-gnu          2900 -158870  -14006     428      -7
iq2000-elf              2929  -54690   -2904    2576      -3
lm32-elf                5265  162519   -1918    8004       5
m32r-elf                1861  -25296   -2713    1004      -2
m68k-linux-gnu          2520 -241573  -21879     200      -3
mcore-elf               2378  -28532   -1810    1635      -2
microblaze-elf          2782 -137363   -9516    1986      -2
mipsel-linux-gnu        2443  -38422   -8331     458      -1
mipsisa64-linux-gnu     2287  -60294  -12214     432      -2
mmix                    4910 -136549  -13616     599      -2
mn10300-elf             2944  -29151   -2488     132      -1
moxie-rtems             1935  -12364   -1002     125      -1
msp430-elf              2379  -37007   -2163     176      -2
nds32le-elf             2356  -27551   -2126     163      -1
nios2-linux-gnu         1572  -44828  -23613      92      -2
nvptx-none              1014  -17337   -1590      16      -3
or1k-elf                2724  -92816  -14144      56      -3
pdp11                   1897  -27296   -1370     534      -2
powerpc-ibm-aix7.0      2909  -58829  -10026    2001      -2
powerpc64-linux-gnu     3685  -60551  -12158    2001      -1
powerpc64le-linux-gnu   3501  -61846  -10024     765      -2
pru-elf                 1574  -29734  -19998    1718      -1
riscv32-elf             2357  -22506  -10002   10175      -1
riscv64-elf             3320  -56777  -10002     226      -2
rl78-elf                2113 -232328  -18607    4065      -3
rx-elf                  2800  -38515    -896     491      -2
s390-linux-gnu          3582  -75626  -12098    3999      -2
s390x-linux-gnu         3761  -73473  -13748    3999      -2
sh-linux-gnu            2350  -26401   -1003     522      -2
sparc-linux-gnu         3279  -49518   -2175    2223      -2
sparc64-linux-gnu       3849 -123084  -30200    2141      -2
tilepro-linux-gnu       2737  -35562   -3458    2848      -2
v850-elf                9002 -169126  -49996      76      -4
vax-netbsdelf           3325  -57734  -10000    1989      -2
visium-elf              1860  -17006   -1006    1066      -2
x86_64-darwin           3278  -48933   -9999    1408      -2
x86_64-linux-gnu        3008  -43887   -9999    3248      -2
xstormy16-elf           2497  -26569   -2051      89      -2
xtensa-elf              2161  -31231   -6910     138      -2

So running both passes does seem to have a significant benefit
on most targets, but there are some nasty-looking outliers.
The usual caveat applies: the number of lines is a very poor measurement;
it's just to get a feel.

Bootstrapped & regression-tested on aarch64-linux-gnu and
x86_64-linux-gnu with both run-combine=3 as the default (so that the new
pass runs first) and with run-combine=6 as the default (so that the new
pass runs second).  There were no new execution failures.  A couple of
guality.exp tests that already failed for most options started failing
for a couple more.  Enabling the pass fixes the XFAILs in:

gcc.target/aarch64/sve/acle/general/ptrue_pat_[234].c

Inevitably there was some scan-assembler fallout for other tests.
E.g. in gcc.target/aarch64/vmov_n_1.c:

#define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
  ...
  INHIB_OPTIMIZATION;							\
  (a) = TEST (test, data_len);						\
  INHIB_OPTIMIZATION;							\
  (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\

is no longer effective for preventing move (a) from being merged
into (b), because the pass can merge at the point of (a).  I think
this is a valid thing to do -- the asm semantics are still satisfied,
and asm volatile ("" : : : "memory") never acted as a register barrier.
But perhaps we should deal with this as a special case?

Richard


2019-11-17  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* Makefile.in (OBJS): Add combine2.o.
	* params.opt (--param=run-combine): New option.
	* doc/invoke.texi: Document it.
	* tree-pass.h (make_pass_combine2_before): Declare.
	(make_pass_combine2_after): Likewise.
	* passes.def: Add them.
	* timevar.def (TV_COMBINE2): New timevar.
	* cfgrtl.h (update_cfg_for_uncondjump): Declare.
	* combine.c (update_cfg_for_uncondjump): Move to...
	* cfgrtl.c (update_cfg_for_uncondjump): ...here.
	* simplify-rtx.c (simplify_truncation): Handle comparisons.
	* recog.h (validate_simplify_replace_rtx): Declare.
	* recog.c (validate_simplify_replace_rtx_1): New function.
	(validate_simplify_replace_rtx_uses): Likewise.
	(validate_simplify_replace_rtx): Likewise.
	* combine2.c: New file.

Comments

Andrew Pinski Nov. 18, 2019, 12:45 a.m. UTC | #1
On Sun, Nov 17, 2019 at 3:35 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Inevitably there was some scan-assembler fallout for other tests.
> E.g. in gcc.target/aarch64/vmov_n_1.c:
>
> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>   ...
>   INHIB_OPTIMIZATION;                                                   \
>   (a) = TEST (test, data_len);                                          \
>   INHIB_OPTIMIZATION;                                                   \
>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));        \
>
> is no longer effective for preventing move (a) from being merged
> into (b), because the pass can merge at the point of (a).  I think
> this is a valid thing to do -- the asm semantics are still satisfied,
> and asm volatile ("" : : : "memory") never acted as a register barrier.
> But perhaps we should deal with this as a special case?

Not really.  I think the testcase should be changed to:
INHIB_OPT_VAR(a)

instead.

Where INHIB_OPT_VAR should be:
#define INHIB_OPT_VAR(a) asm("":"+X"(a));

Since it is obviously not doing the correct testing in the first
place.  Even then, this testcase is huge and really should be broken
up into different testcases.
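
A minimal sketch of how such a macro might be used, assuming GCC-style
extended asm.  The function `opaque_copy` is a hypothetical example and is
not taken from the testcase; the macro is spelled with `__asm__` here only
so that the sketch also compiles in strict ISO C modes.

```c
/* The "+X" operand makes A both an input and an output of the empty
   asm, so the compiler has to assume that A may have been changed at
   this point.  A "memory" clobber alone says nothing about values that
   live in registers.  */
#define INHIB_OPT_VAR(a) __asm__ ("" : "+X" (a));

/* Hypothetical example: the value returned is the one "produced" by
   the asm, so the compiler cannot treat it as a known copy of X.  */
static int
opaque_copy (int x)
{
  int a = x;
  INHIB_OPT_VAR (a)
  return a;
}
```

The asm emits no instructions, so the value is unchanged at run time; the
point is only that the optimisers cannot see through it.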


Thanks,
Andrew

>
> Richard
>
>
> Index: gcc/Makefile.in
> ===================================================================
> --- gcc/Makefile.in     2019-11-14 14:34:27.599783740 +0000
> +++ gcc/Makefile.in     2019-11-17 23:15:31.188500613 +0000
> @@ -1261,6 +1261,7 @@ OBJS = \
>         cgraphunit.o \
>         cgraphclones.o \
>         combine.o \
> +       combine2.o \
>         combine-stack-adj.o \
>         compare-elim.o \
>         context.o \
> Index: gcc/params.opt
> ===================================================================
> --- gcc/params.opt      2019-11-14 14:34:26.339792215 +0000
> +++ gcc/params.opt      2019-11-17 23:15:31.200500531 +0000
> @@ -768,6 +768,10 @@ Use internal function id in profile look
>  Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
>  Maximum depth of a loop nest to fully value-number optimistically.
>
> +-param=run-combine=
> +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
> +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
> +
>  -param=sccvn-max-alias-queries-per-access=
>  Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
>  Maximum number of disambiguations to perform per memory access.
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi 2019-11-16 10:43:45.597105823 +0000
> +++ gcc/doc/invoke.texi 2019-11-17 23:15:31.200500531 +0000
> @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las
>  @item max-combine-insns
>  The maximum number of instructions the RTL combiner tries to combine.
>
> +@item run-combine
> +Choose which of the 3 available combine passes to run: bit 1 for the main
> +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
> +for a later variant of the combine pass.
> +
>  @item integer-share-limit
>  Small integer constants can use a shared data structure, reducing the
>  compiler's memory usage and increasing its speed.  This sets the maximum
> Index: gcc/tree-pass.h
> ===================================================================
> --- gcc/tree-pass.h     2019-10-29 08:29:03.096444049 +0000
> +++ gcc/tree-pass.h     2019-11-17 23:15:31.204500501 +0000
> @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
>  extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
> Index: gcc/passes.def
> ===================================================================
> --- gcc/passes.def      2019-10-29 08:29:03.224443133 +0000
> +++ gcc/passes.def      2019-11-17 23:15:31.200500531 +0000
> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_inc_dec);
>        NEXT_PASS (pass_initialize_regs);
>        NEXT_PASS (pass_ud_rtl_dce);
> +      NEXT_PASS (pass_combine2_before);
>        NEXT_PASS (pass_combine);
> +      NEXT_PASS (pass_combine2_after);
>        NEXT_PASS (pass_if_after_combine);
>        NEXT_PASS (pass_jump_after_combine);
>        NEXT_PASS (pass_partition_blocks);
> Index: gcc/timevar.def
> ===================================================================
> --- gcc/timevar.def     2019-10-11 15:43:53.403498517 +0100
> +++ gcc/timevar.def     2019-11-17 23:15:31.204500501 +0000
> @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC          , "
>  DEFTIMEVAR (TV_CSE2                  , "CSE 2")
>  DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
>  DEFTIMEVAR (TV_COMBINE               , "combiner")
> +DEFTIMEVAR (TV_COMBINE2              , "second combiner")
>  DEFTIMEVAR (TV_IFCVT                , "if-conversion")
>  DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
>  DEFTIMEVAR (TV_SMS                  , "sms modulo scheduling")
> Index: gcc/cfgrtl.h
> ===================================================================
> --- gcc/cfgrtl.h        2019-03-08 18:15:39.320730391 +0000
> +++ gcc/cfgrtl.h        2019-11-17 23:15:31.192500584 +0000
> @@ -47,6 +47,7 @@ extern void fixup_partitions (void);
>  extern bool purge_dead_edges (basic_block);
>  extern bool purge_all_dead_edges (void);
>  extern bool fixup_abnormal_edges (void);
> +extern void update_cfg_for_uncondjump (rtx_insn *);
>  extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
>  extern void relink_block_chain (bool);
>  extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
> Index: gcc/combine.c
> ===================================================================
> --- gcc/combine.c       2019-11-13 08:42:45.537368745 +0000
> +++ gcc/combine.c       2019-11-17 23:15:31.192500584 +0000
> @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg)
>          && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
>  }
>
> -/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
> -   Note that the INSN should be deleted *after* removing dead edges, so
> -   that the kept edge is the fallthrough edge for a (set (pc) (pc))
> -   but not for a (set (pc) (label_ref FOO)).  */
> -
> -static void
> -update_cfg_for_uncondjump (rtx_insn *insn)
> -{
> -  basic_block bb = BLOCK_FOR_INSN (insn);
> -  gcc_assert (BB_END (bb) == insn);
> -
> -  purge_dead_edges (bb);
> -
> -  delete_insn (insn);
> -  if (EDGE_COUNT (bb->succs) == 1)
> -    {
> -      rtx_insn *insn;
> -
> -      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
> -
> -      /* Remove barriers from the footer if there are any.  */
> -      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
> -       if (BARRIER_P (insn))
> -         {
> -           if (PREV_INSN (insn))
> -             SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
> -           else
> -             BB_FOOTER (bb) = NEXT_INSN (insn);
> -           if (NEXT_INSN (insn))
> -             SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
> -         }
> -       else if (LABEL_P (insn))
> -         break;
> -    }
> -}
> -
>  /* Return whether PAT is a PARALLEL of exactly N register SETs followed
>     by an arbitrary number of CLOBBERs.  */
>  static bool
> @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine =
>    {}
>
>    /* opt_pass methods: */
> -  virtual bool gate (function *) { return (optimize > 0); }
> +  virtual bool gate (function *)
> +    {
> +      return optimize > 0 && (param_run_combine & 2) != 0;
> +    }
>    virtual unsigned int execute (function *)
>      {
>        return rest_of_handle_combine ();
> Index: gcc/cfgrtl.c
> ===================================================================
> --- gcc/cfgrtl.c        2019-10-17 14:22:55.523309009 +0100
> +++ gcc/cfgrtl.c        2019-11-17 23:15:31.188500613 +0000
> @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void)
>    return inserted;
>  }
>
> +/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
> +   Note that the INSN should be deleted *after* removing dead edges, so
> +   that the kept edge is the fallthrough edge for a (set (pc) (pc))
> +   but not for a (set (pc) (label_ref FOO)).  */
> +
> +void
> +update_cfg_for_uncondjump (rtx_insn *insn)
> +{
> +  basic_block bb = BLOCK_FOR_INSN (insn);
> +  gcc_assert (BB_END (bb) == insn);
> +
> +  purge_dead_edges (bb);
> +
> +  delete_insn (insn);
> +  if (EDGE_COUNT (bb->succs) == 1)
> +    {
> +      rtx_insn *insn;
> +
> +      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
> +
> +      /* Remove barriers from the footer if there are any.  */
> +      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
> +       if (BARRIER_P (insn))
> +         {
> +           if (PREV_INSN (insn))
> +             SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
> +           else
> +             BB_FOOTER (bb) = NEXT_INSN (insn);
> +           if (NEXT_INSN (insn))
> +             SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
> +         }
> +       else if (LABEL_P (insn))
> +         break;
> +    }
> +}
> +
>  /* Cut the insns from FIRST to LAST out of the insns stream.  */
>
>  rtx_insn *
> Index: gcc/simplify-rtx.c
> ===================================================================
> --- gcc/simplify-rtx.c  2019-11-16 15:33:36.642840131 +0000
> +++ gcc/simplify-rtx.c  2019-11-17 23:15:31.204500501 +0000
> @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode,
>        && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
>      return constm1_rtx;
>
> +  /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
> +     in a narrower mode if useful.  */
> +  if (COMPARISON_P (op))
> +    return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
> +                                   XEXP (op, 0), XEXP (op, 1));
> +
>    return NULL_RTX;
>  }
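
For anyone following along, here is a minimal sketch of why the new truncation rule is safe, assuming a comparison only ever produces zero or a single "true" value that fits in the narrower mode. The function names and the `flag` parameter are hypothetical stand-ins for illustration (roughly playing the role of the target's flag value); nothing here is taken from the patch itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of (truncate:SI (lt:DI x y)): compute the comparison
// result in the wide mode, then truncate it.
int32_t compare_then_truncate(int64_t x, int64_t y, int64_t flag)
{
  int64_t wide = (x < y) ? flag : 0;
  return static_cast<int32_t>(wide);
}

// Hypothetical model of (lt:SI x y): produce the result directly in the
// narrow mode.  The operands keep their original width; only the result
// mode changes.
int32_t compare_in_narrow_mode(int64_t x, int64_t y, int32_t flag)
{
  return (x < y) ? flag : 0;
}
```

The two forms agree whenever the "true" value survives truncation unchanged, which holds for both 1 and -1.
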
>
> Index: gcc/recog.h
> ===================================================================
> --- gcc/recog.h 2019-09-09 18:58:28.860430363 +0100
> +++ gcc/recog.h 2019-11-17 23:15:31.204500501 +0000
> @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos
>  extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
>  extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
>  extern bool validate_simplify_insn (rtx_insn *insn);
> +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
>  extern int num_changes_pending (void);
>  extern int next_insn_tests_no_inequality (rtx_insn *);
>  extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
> Index: gcc/recog.c
> ===================================================================
> --- gcc/recog.c 2019-10-01 09:55:35.150088599 +0100
> +++ gcc/recog.c 2019-11-17 23:15:31.204500501 +0000
> @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn)
>        }
>    return ((num_changes_pending () > 0) && (apply_change_group () > 0));
>  }
> +
> +/* A subroutine of validate_simplify_replace_rtx.  Apply the replacement
> +   described by R to LOC.  Return true on success; leave the caller
> +   to clean up on failure.  */
> +
> +static bool
> +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
> +{
> +  rtx x = *loc;
> +  enum rtx_code code = GET_CODE (x);
> +  machine_mode mode = GET_MODE (x);
> +
> +  if (rtx_equal_p (x, r.from))
> +    {
> +      validate_unshare_change (r.insn, loc, r.to, 1);
> +      return true;
> +    }
> +
> +  /* Recursively apply the substitution and see if we can simplify
> +     the result.  This specifically shouldn't use simplify_gen_*,
> +     since we want to avoid generating new expressions where possible.  */
> +  int old_num_changes = num_validated_changes ();
> +  rtx newx = NULL_RTX;
> +  bool recurse_p = false;
> +  switch (GET_RTX_CLASS (code))
> +    {
> +    case RTX_UNARY:
> +      {
> +       machine_mode op0_mode = GET_MODE (XEXP (x, 0));
> +       if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
> +         return false;
> +
> +       newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
> +       break;
> +      }
> +
> +    case RTX_BIN_ARITH:
> +    case RTX_COMM_ARITH:
> +      {
> +       if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +           || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +         return false;
> +
> +       newx = simplify_binary_operation (code, mode,
> +                                         XEXP (x, 0), XEXP (x, 1));
> +       break;
> +      }
> +
> +    case RTX_COMPARE:
> +    case RTX_COMM_COMPARE:
> +      {
> +       machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
> +                               ? GET_MODE (XEXP (x, 0))
> +                               : GET_MODE (XEXP (x, 1)));
> +       if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +           || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +         return false;
> +
> +       newx = simplify_relational_operation (code, mode, op_mode,
> +                                             XEXP (x, 0), XEXP (x, 1));
> +       break;
> +      }
> +
> +    case RTX_TERNARY:
> +    case RTX_BITFIELD_OPS:
> +      {
> +       machine_mode op0_mode = GET_MODE (XEXP (x, 0));
> +       if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +           || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
> +           || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
> +         return false;
> +
> +       newx = simplify_ternary_operation (code, mode, op0_mode,
> +                                          XEXP (x, 0), XEXP (x, 1),
> +                                          XEXP (x, 2));
> +       break;
> +      }
> +
> +    case RTX_EXTRA:
> +      if (code == SUBREG)
> +       {
> +         machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
> +         if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
> +           return false;
> +
> +         rtx inner = SUBREG_REG (x);
> +         newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
> +         /* Reject the same cases that simplify_gen_subreg would.  */
> +         if (!newx
> +             && (GET_CODE (inner) == SUBREG
> +                 || GET_CODE (inner) == CONCAT
> +                 || GET_MODE (inner) == VOIDmode
> +                 || !validate_subreg (mode, inner_mode,
> +                                      inner, SUBREG_BYTE (x))))
> +           return false;
> +         break;
> +       }
> +      else
> +       recurse_p = true;
> +      break;
> +
> +    case RTX_OBJ:
> +      if (code == LO_SUM)
> +       {
> +         if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +             || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +           return false;
> +
> +         /* (lo_sum (high x) y) -> y where x and y have the same base.  */
> +         rtx op0 = XEXP (x, 0);
> +         rtx op1 = XEXP (x, 1);
> +         if (GET_CODE (op0) == HIGH)
> +           {
> +             rtx base0, base1, offset0, offset1;
> +             split_const (XEXP (op0, 0), &base0, &offset0);
> +             split_const (op1, &base1, &offset1);
> +             if (rtx_equal_p (base0, base1))
> +               newx = op1;
> +           }
> +       }
> +      else if (code == REG)
> +       {
> +         if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
> +           return false;
> +       }
> +      else
> +       recurse_p = true;
> +      break;
> +
> +    case RTX_CONST_OBJ:
> +      break;
> +
> +    case RTX_AUTOINC:
> +      if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
> +       return false;
> +      recurse_p = true;
> +      break;
> +
> +    case RTX_MATCH:
> +    case RTX_INSN:
> +      gcc_unreachable ();
> +    }
> +
> +  if (recurse_p)
> +    {
> +      const char *fmt = GET_RTX_FORMAT (code);
> +      for (int i = 0; fmt[i]; i++)
> +       switch (fmt[i])
> +         {
> +         case 'E':
> +           for (int j = 0; j < XVECLEN (x, i); j++)
> +             if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
> +               return false;
> +           break;
> +
> +         case 'e':
> +           if (XEXP (x, i)
> +               && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
> +             return false;
> +           break;
> +         }
> +    }
> +
> +  if (newx && !rtx_equal_p (x, newx))
> +    {
> +      /* There's no longer any point unsharing the substitutions made
> +        for subexpressions, since we'll just copy this one instead.  */
> +      for (int i = old_num_changes; i < num_changes; ++i)
> +       changes[i].unshare = false;
> +      validate_unshare_change (r.insn, loc, newx, 1);
> +    }
> +
> +  return true;
> +}
> +
> +/* A note_uses callback for validate_simplify_replace_rtx.
> +   DATA points to a validate_replace_src_data object.  */
> +
> +static void
> +validate_simplify_replace_rtx_uses (rtx *loc, void *data)
> +{
> +  validate_replace_src_data &r = *(validate_replace_src_data *) data;
> +  if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
> +    r.insn = NULL;
> +}
> +
> +/* Try to perform the equivalent of:
> +
> +      newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
> +      validate_change (INSN, LOC, newx, 1);
> +
> +   but without generating as much garbage rtl when the resulting
> +   pattern doesn't match.
> +
> +   Return true if we were able to replace all uses of OLD_RTX in *LOC
> +   and if the result conforms to general rtx rules (e.g. for whether
> +   subregs are meaningful).
> +
> +   When returning true, add all replacements to the current validation group,
> +   leaving the caller to test it in the normal way.  Leave both *LOC and the
> +   validation group unchanged on failure.  */
> +
> +bool
> +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
> +                              rtx old_rtx, rtx new_rtx)
> +{
> +  validate_replace_src_data r;
> +  r.from = old_rtx;
> +  r.to = new_rtx;
> +  r.insn = insn;
> +
> +  unsigned int num_changes = num_validated_changes ();
> +  note_uses (loc, validate_simplify_replace_rtx_uses, &r);
> +  if (!r.insn)
> +    {
> +      cancel_changes (num_changes);
> +      return false;
> +    }
> +  return true;
> +}
>
>  /* Return 1 if the insn using CC0 set by INSN does not contain
>     any ordered tests applied to the condition codes.
> Index: gcc/combine2.c
> ===================================================================
> --- /dev/null   2019-09-17 11:41:18.176664108 +0100
> +++ gcc/combine2.c      2019-11-17 23:15:31.196500559 +0000
> @@ -0,0 +1,1576 @@
> +/* Combine instructions
> +   Copyright (C) 2019 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "tree-pass.h"
> +#include "memmodel.h"
> +#include "emit-rtl.h"
> +#include "insn-config.h"
> +#include "recog.h"
> +#include "print-rtl.h"
> +#include "rtl-iter.h"
> +#include "predict.h"
> +#include "cfgcleanup.h"
> +#include "cfghooks.h"
> +#include "cfgrtl.h"
> +#include "alias.h"
> +#include "valtrack.h"
> +
> +/* This pass tries to combine instructions in the following ways:
> +
> +   (1) If we have two dependent instructions:
> +
> +        I1: (set DEST1 SRC1)
> +        I2: (...DEST1...)
> +
> +       and I2 is the only user of DEST1, the pass tries to combine them into:
> +
> +        I2: (...SRC1...)
> +
> +   (2) If we have two dependent instructions:
> +
> +        I1: (set DEST1 SRC1)
> +        I2: (...DEST1...)
> +
> +       the pass tries to combine them into:
> +
> +        I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
> +
> +       or:
> +
> +        I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
> +
> +   (3) If we have two independent instructions:
> +
> +        I1: (set DEST1 SRC1)
> +        I2: (set DEST2 SRC2)
> +
> +       that read from memory or from the same register, the pass tries to
> +       combine them into:
> +
> +        I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
> +
> +       or:
> +
> +        I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
> +
> +   If the combined form is a valid instruction, the pass tries to find a
> +   place between I1 and I2 inclusive for the new instruction.  If there
> +   are multiple valid locations, it tries to pick the best one by taking
> +   the effect on register pressure into account.
> +
> +   If a combination succeeds and produces a single set, the pass tries to
> +   combine the new form with earlier or later instructions.
> +
> +   The pass currently optimizes each basic block separately.  It walks
> +   the instructions in reverse order, building up live ranges for registers
> +   and memory.  It then uses these live ranges to look for possible
> +   combination opportunities and to decide where the combined instructions
> +   could be placed.
> +
> +   The pass represents positions in the block using point numbers,
> +   with higher numbers indicating earlier instructions.  The numbering
> +   scheme is that:
> +
> +   - the end of the current instruction sequence has an even base point B.
> +
> +   - instructions initially have odd-numbered points B + 1, B + 3, etc.
> +     with B + 1 being the final instruction in the sequence.
> +
> +   - even points after B represent gaps between instructions where combined
> +     instructions could be placed.
> +
> +   Thus even points initially represent no instructions and odd points
> +   initially represent single instructions.  However, when picking a
> +   place for a combined instruction, the pass may choose somewhere
> +   in between the original two instructions, so that over time a point
> +   may come to represent several instructions.  When this happens,
> +   the pass maintains the invariant that all instructions with the same
> +   point number are independent of each other and thus can be treated as
> +   acting in parallel (or as acting in any arbitrary sequence).
> +
> +   TODOs:
> +
> +   - Handle 3-instruction combinations, and possibly more.
> +
> +   - Handle existing clobbers more efficiently.  At the moment we can't
> +     move an instruction that clobbers R across another instruction that
> +     clobbers R.
> +
> +   - Allow hard register clobbers to be added, like combine does.
> +
> +   - Perhaps work on EBBs, or SESE regions.  */
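
To check my reading of the point-numbering scheme described in the comment above, here is a small standalone sketch. The helper names (`initial_points`, `is_gap`, `earlier_than`) are mine and do not appear in the patch; the sketch only models the initial numbering, before any combined instruction has been placed at a shared point.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: assign initial point numbers to N instructions,
// given an even base point BASE for the end of the sequence.  The result
// is indexed by program order (index 0 is the first instruction).
std::vector<unsigned> initial_points(unsigned base, unsigned n)
{
  std::vector<unsigned> points(n);
  for (unsigned i = 0; i < n; ++i)
    // The final instruction gets BASE + 1, the one before it BASE + 3, etc.
    points[n - 1 - i] = base + 1 + 2 * i;
  return points;
}

// True if POINT is a gap between instructions: an even point above BASE.
bool is_gap(unsigned base, unsigned point)
{
  return point > base && point % 2 == 0;
}

// True if point A is earlier in program order than point B.
// Higher point numbers indicate earlier instructions.
bool earlier_than(unsigned a, unsigned b)
{
  return a > b;
}
```

With a base of 2 and three instructions, the points in program order come out as 7, 5, 3, leaving 4 and 6 as the gaps where a combined instruction could initially be placed.
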
> +
> +namespace {
> +
> +/* The number of explicit uses to record in a live range.  */
> +const unsigned int NUM_RANGE_USERS = 4;
> +
> +/* The maximum number of instructions that we can combine at once.  */
> +const unsigned int MAX_COMBINE_INSNS = 2;
> +
> +/* A fake cost for instructions that we haven't costed yet.  */
> +const unsigned int UNKNOWN_COST = ~0U;
> +
> +class combine2
> +{
> +public:
> +  combine2 (function *);
> +  ~combine2 ();
> +
> +  void execute ();
> +
> +private:
> +  struct insn_info_rec;
> +
> +  /* Describes the live range of a register or of memory.  For simplicity,
> +     we treat memory as a single entity.
> +
> +     If we had a fully-accurate live range, updating it to account for a
> +     moved instruction would be a linear-time operation.  Doing this for
> +     each combination would then make the pass quadratic.  We therefore
> +     just maintain a list of NUM_RANGE_USERS use insns and use simple,
> +     conservatively-correct behavior for the rest.  */
> +  struct live_range_rec
> +  {
> +    /* Which instruction provides the dominating definition, or null if
> +       we don't know yet.  */
> +    insn_info_rec *producer;
> +
> +    /* A selection of instructions that use the resource, in program order.  */
> +    insn_info_rec *users[NUM_RANGE_USERS];
> +
> +    /* An inclusive range of points that covers instructions not mentioned
> +       in USERS.  Both values are zero if there are no such instructions.
> +
> +       Once we've included a use U at point P in this range, we continue
> +       to assume that some kind of use exists at P whatever happens to U
> +       afterwards.  */
> +    unsigned int first_extra_use;
> +    unsigned int last_extra_use;
> +
> +    /* The register number this range describes, or INVALID_REGNUM
> +       for memory.  */
> +    unsigned int regno;
> +
> +    /* Forms a linked list of ranges for the same resource, in program
> +       order.  */
> +    live_range_rec *prev_range;
> +    live_range_rec *next_range;
> +  };
> +
> +  /* Pass-specific information about an instruction.  */
> +  struct insn_info_rec
> +  {
> +    /* The instruction itself.  */
> +    rtx_insn *insn;
> +
> +    /* A null-terminated list of live ranges for the things that this
> +       instruction defines.  */
> +    live_range_rec **defs;
> +
> +    /* A null-terminated list of live ranges for the things that this
> +       instruction uses.  */
> +    live_range_rec **uses;
> +
> +    /* The point at which the instruction appears.  */
> +    unsigned int point;
> +
> +    /* The cost of the instruction, or UNKNOWN_COST if we haven't
> +       measured it yet.  */
> +    unsigned int cost;
> +  };
> +
> +  /* Describes one attempt to combine instructions.  */
> +  struct combination_attempt_rec
> +  {
> +    /* The instruction that we're currently trying to optimize.
> +       If the combination succeeds, we'll use this insn_info_rec
> +       to describe the new instruction.  */
> +    insn_info_rec *new_home;
> +
> +    /* The instructions we're combining, in program order.  */
> +    insn_info_rec *sequence[MAX_COMBINE_INSNS];
> +
> +    /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
> +       live range that describes the substituted register.  */
> +    live_range_rec *def_use_range;
> +
> +    /* The earliest and latest points at which we could insert the
> +       combined instruction.  */
> +    unsigned int earliest_point;
> +    unsigned int latest_point;
> +
> +    /* The cost of the new instruction, once we have a successful match.  */
> +    unsigned int new_cost;
> +  };
> +
> +  /* Pass-specific information about a register.  */
> +  struct reg_info_rec
> +  {
> +    /* The live range associated with the last reference to the register.  */
> +    live_range_rec *range;
> +
> +    /* The point at which the last reference occurred.  */
> +    unsigned int next_ref;
> +
> +    /* True if the register is currently live.  We record this here rather
> +       than in a separate bitmap because (a) there's a natural hole for
> +       it on LP64 hosts and (b) we only refer to it when updating the
> +       other fields, and so recording it here should give better locality.  */
> +    unsigned int live_p : 1;
> +  };
> +
> +  live_range_rec *new_live_range (unsigned int, live_range_rec *);
> +  live_range_rec *reg_live_range (unsigned int);
> +  live_range_rec *mem_live_range ();
> +  bool add_range_use (live_range_rec *, insn_info_rec *);
> +  void remove_range_use (live_range_rec *, insn_info_rec *);
> +  bool has_single_use_p (live_range_rec *);
> +  bool known_last_use_p (live_range_rec *, insn_info_rec *);
> +  unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
> +  unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
> +  bool start_combination (combination_attempt_rec &, insn_info_rec *,
> +                         insn_info_rec *, live_range_rec * = NULL);
> +  bool verify_combination (combination_attempt_rec &);
> +  int estimate_reg_pressure_delta (insn_info_rec *);
> +  void commit_combination (combination_attempt_rec &, bool);
> +  bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
> +  bool try_parallelize_insns (combination_attempt_rec &);
> +  bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
> +  bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
> +  bool try_combine_two_uses (combination_attempt_rec &);
> +  bool try_combine (insn_info_rec *, rtx, unsigned int);
> +  bool optimize_insn (insn_info_rec *);
> +  void record_defs (insn_info_rec *);
> +  void record_reg_use (insn_info_rec *, df_ref);
> +  void record_uses (insn_info_rec *);
> +  void process_insn (insn_info_rec *);
> +  void start_sequence ();
> +
> +  /* The function we're optimizing.  */
> +  function *m_fn;
> +
> +  /* The highest pseudo register number plus one.  */
> +  unsigned int m_num_regs;
> +
> +  /* The current basic block.  */
> +  basic_block m_bb;
> +
> +  /* True if we should optimize the current basic block for speed.  */
> +  bool m_optimize_for_speed_p;
> +
> +  /* The point number to allocate to the next instruction we visit
> +     in the backward traversal.  */
> +  unsigned int m_point;
> +
> +  /* The point number corresponding to the end of the current
> +     instruction sequence, i.e. the lowest point number about which
> +     we still have valid information.  */
> +  unsigned int m_end_of_sequence;
> +
> +  /* The point number corresponding to the end of the current basic block.
> +     This is the same as M_END_OF_SEQUENCE when processing the last
> +     instruction sequence in a basic block.  */
> +  unsigned int m_end_of_bb;
> +
> +  /* The memory live range, or null if we haven't yet found a memory
> +     reference in the current instruction sequence.  */
> +  live_range_rec *m_mem_range;
> +
> +  /* Gives information about each register.  We track both hard and
> +     pseudo registers.  */
> +  auto_vec<reg_info_rec> m_reg_info;
> +
> +  /* A bitmap of registers whose entry in m_reg_info is valid.  */
> +  auto_sbitmap m_valid_regs;
> +
> +  /* If nonnull, an unused 2-element PARALLEL that we can use to test
> +     instruction combinations.  */
> +  rtx m_spare_parallel;
> +
> +  /* A bitmap of instructions that we've already tried to combine with.  */
> +  auto_bitmap m_tried_insns;
> +
> +  /* A temporary bitmap used to hold register numbers.  */
> +  auto_bitmap m_true_deps;
> +
> +  /* An obstack used for allocating insn_info_recs and for building
> +     up their lists of definitions and uses.  */
> +  obstack m_insn_obstack;
> +
> +  /* An obstack used for allocating live_range_recs.  */
> +  obstack m_range_obstack;
> +
> +  /* Start-of-object pointers for the two obstacks.  */
> +  char *m_insn_obstack_start;
> +  char *m_range_obstack_start;
> +
> +  /* A list of instructions that we've optimized and whose new forms
> +     change the cfg.  */
> +  auto_vec<rtx_insn *> m_cfg_altering_insns;
> +
> +  /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS.  */
> +  auto_bitmap m_cfg_altering_insn_ids;
> +
> +  /* We can insert new instructions at point P * 2 by inserting them
> +     after M_POINTS[P - M_END_OF_SEQUENCE / 2].  We can insert new
> +     instructions at point P * 2 + 1 by inserting them before
> +     M_POINTS[P - M_END_OF_SEQUENCE / 2].  */
> +  auto_vec<rtx_insn *, 256> m_points;
> +};
> +
> +combine2::combine2 (function *fn)
> +  : m_fn (fn),
> +    m_num_regs (max_reg_num ()),
> +    m_bb (NULL),
> +    m_optimize_for_speed_p (false),
> +    m_point (2),
> +    m_end_of_sequence (m_point),
> +    m_end_of_bb (m_point),
> +    m_mem_range (NULL),
> +    m_reg_info (m_num_regs),
> +    m_valid_regs (m_num_regs),
> +    m_spare_parallel (NULL_RTX)
> +{
> +  gcc_obstack_init (&m_insn_obstack);
> +  gcc_obstack_init (&m_range_obstack);
> +  m_reg_info.quick_grow (m_num_regs);
> +  bitmap_clear (m_valid_regs);
> +  m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
> +  m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
> +}
> +
> +combine2::~combine2 ()
> +{
> +  obstack_free (&m_insn_obstack, NULL);
> +  obstack_free (&m_range_obstack, NULL);
> +}
> +
> +/* Return true if it's possible in principle to combine INSN with
> +   other instructions.  ALLOW_ASMS_P is true if the caller can cope
> +   with asm statements.  */
> +
> +static bool
> +combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
> +{
> +  rtx pattern = PATTERN (insn);
> +
> +  if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
> +    return false;
> +
> +  if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
> +    return false;
> +
> +  if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0)
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Return true if it's possible in principle to move INSN somewhere else,
> +   as long as all dependencies are satisfied.  */
> +
> +static bool
> +movable_insn_p (rtx_insn *insn)
> +{
> +  if (JUMP_P (insn))
> +    return false;
> +
> +  if (volatile_refs_p (PATTERN (insn)))
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Create and return a new live range for REGNO.  NEXT is the next range
> +   in program order, or null if this is the first live range in the
> +   sequence.  */
> +
> +combine2::live_range_rec *
> +combine2::new_live_range (unsigned int regno, live_range_rec *next)
> +{
> +  live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
> +  memset (range, 0, sizeof (*range));
> +
> +  range->regno = regno;
> +  range->next_range = next;
> +  if (next)
> +    next->prev_range = range;
> +  return range;
> +}
> +
> +/* Return the current live range for register REGNO, creating a new
> +   one if necessary.  */
> +
> +combine2::live_range_rec *
> +combine2::reg_live_range (unsigned int regno)
> +{
> +  /* Initialize the liveness flag, if it isn't already valid for this BB.  */
> +  bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
> +  if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
> +    m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
> +
> +  /* See if we already have a live range associated with the current
> +     instruction sequence.  */
> +  live_range_rec *range = NULL;
> +  if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
> +    range = m_reg_info[regno].range;
> +
> +  /* Create a new range if this is the first reference to REGNO in the
> +     current instruction sequence or if the current range has been closed
> +     off by a definition.  */
> +  if (!range || range->producer)
> +    {
> +      range = new_live_range (regno, range);
> +
> +      /* If the register is live after the current sequence, treat that
> +        as a fake use at the end of the sequence.  */
> +      if (!range->next_range && m_reg_info[regno].live_p)
> +       range->first_extra_use = range->last_extra_use = m_end_of_sequence;
> +
> +      /* Record that this is now the current range for REGNO.  */
> +      if (first_ref_p)
> +       bitmap_set_bit (m_valid_regs, regno);
> +      m_reg_info[regno].range = range;
> +      m_reg_info[regno].next_ref = m_point;
> +    }
> +  return range;
> +}
> +
> +/* Return the current live range for memory, treating memory as a single
> +   entity.  Create a new live range if necessary.  */
> +
> +combine2::live_range_rec *
> +combine2::mem_live_range ()
> +{
> +  if (!m_mem_range || m_mem_range->producer)
> +    m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
> +  return m_mem_range;
> +}
> +
> +/* Record that instruction USER uses the resource described by RANGE.
> +   Return true if this is new information.  */
> +
> +bool
> +combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
> +{
> +  /* See if we've already recorded the instruction, or if there's a
> +     spare use slot we can use.  */
> +  unsigned int i = 0;
> +  for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
> +    if (range->users[i] == user)
> +      return false;
> +
> +  if (i == NUM_RANGE_USERS)
> +    {
> +      /* Since we've processed USER recently, assume that it's more
> +        interesting to record explicitly than the last user in the
> +        current list.  Evict that last user and describe it in the
> +        overflow "extra use" range instead.  */
> +      insn_info_rec *ousted_user = range->users[--i];
> +      if (range->first_extra_use < ousted_user->point)
> +       range->first_extra_use = ousted_user->point;
> +      if (range->last_extra_use > ousted_user->point)
> +       range->last_extra_use = ousted_user->point;
> +    }
> +
> +  /* Insert USER while keeping the list sorted.  */
> +  for (; i > 0 && range->users[i - 1]->point < user->point; --i)
> +    range->users[i] = range->users[i - 1];
> +  range->users[i] = user;
> +  return true;
> +}
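
The bounded user list with an overflow range is the core trick for keeping updates cheap, so here is a simplified standalone model of it. This is only a sketch under assumptions: users are identified by their point number rather than by an insn_info_rec pointer (so two users can never share a point here), zero means "empty slot", and a zero `last_extra` is treated as "unset" when the first user is evicted. That last choice is my interpretation of "both values are zero if there are no such instructions", not something the patch spells out. `Range`, `add_use` and `kMaxUsers` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kMaxUsers = 4;  // plays the role of NUM_RANGE_USERS

struct Range
{
  // Explicitly recorded user points, highest (earliest) first; 0 = empty.
  unsigned users[kMaxUsers] = {0, 0, 0, 0};
  // Inclusive range of points covering evicted users; both 0 if none.
  unsigned first_extra = 0;  // earliest evicted point (highest number)
  unsigned last_extra = 0;   // latest evicted point (lowest number)
};

// Record a use at POINT.  Return true if this is new information.
bool add_use(Range &r, unsigned point)
{
  std::size_t i = 0;
  for (; i < kMaxUsers && r.users[i]; ++i)
    if (r.users[i] == point)
      return false;

  if (i == kMaxUsers)
    {
      // The list is full: evict the last (latest) user and fold its
      // point into the overflow range instead.
      unsigned ousted = r.users[--i];
      if (r.first_extra < ousted)
        r.first_extra = ousted;
      if (r.last_extra == 0 || r.last_extra > ousted)
        r.last_extra = ousted;
    }

  // Insert POINT while keeping the list sorted, highest point first.
  for (; i > 0 && r.users[i - 1] < point; --i)
    r.users[i] = r.users[i - 1];
  r.users[i] = point;
  return true;
}
```

Since the pass walks backwards, uses arrive with increasing point numbers, so each new user lands at the front of the list and the user that falls off the end is always the latest one in program order.
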
> +
> +/* Remove USER from the uses recorded for RANGE, if we can.
> +   There's nothing we can do if USER was described in the
> +   overflow "extra use" range.  */
> +
> +void
> +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
> +{
> +  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +    if (range->users[i] == user)
> +      {
> +       for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
> +         range->users[j] = range->users[j + 1];
> +       range->users[NUM_RANGE_USERS - 1] = NULL;
> +       break;
> +      }
> +}
> +
> +/* Return true if RANGE has a single known user.  */
> +
> +bool
> +combine2::has_single_use_p (live_range_rec *range)
> +{
> +  return range->users[0] && !range->users[1] && !range->first_extra_use;
> +}
> +
> +/* Return true if we know that USER is the last user of RANGE.  */
> +
> +bool
> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
> +{
> +  if (range->last_extra_use <= user->point)
> +    return false;
> +
> +  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
> +    if (range->users[i] == user)
> +      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
> +    else if (range->users[i]->point == user->point)
> +      return false;
> +
> +  gcc_unreachable ();
> +}
> +
> +/* Find the earliest point that we could move I2 up in order to combine
> +   it with I1.  Ignore any dependencies between I1 and I2; leave the
> +   caller to deal with those instead.  */
> +
> +unsigned int
> +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
> +{
> +  if (!movable_insn_p (i2->insn))
> +    return i2->point;
> +
> +  /* Start by optimistically assuming that we can move the instruction
> +     all the way up to I1.  */
> +  unsigned int point = i1->point;
> +
> +  /* Make sure that the new position preserves all necessary true dependencies
> +     on earlier instructions.  */
> +  for (live_range_rec **use = i2->uses; *use; ++use)
> +    {
> +      live_range_rec *range = *use;
> +      if (range->producer
> +         && range->producer != i1
> +         && point >= range->producer->point)
> +       point = range->producer->point - 1;
> +    }
> +
> +  /* Make sure that the new position preserves all necessary output and
> +     anti dependencies on earlier instructions.  */
> +  for (live_range_rec **def = i2->defs; *def; ++def)
> +    if (live_range_rec *range = (*def)->prev_range)
> +      {
> +       if (range->producer
> +           && range->producer != i1
> +           && point >= range->producer->point)
> +         point = range->producer->point - 1;
> +
> +       for (unsigned int i = NUM_RANGE_USERS; i-- > 0;)
> +         if (range->users[i] && range->users[i] != i1)
> +           {
> +             if (point >= range->users[i]->point)
> +               point = range->users[i]->point - 1;
> +             break;
> +           }
> +
> +       if (range->last_extra_use && point >= range->last_extra_use)
> +         point = range->last_extra_use - 1;
> +      }
> +
> +  return point;
> +}
> +
> +/* Find the latest point that we could move I1 down in order to combine
> +   it with I2.  Ignore any dependencies between I1 and I2; leave the
> +   caller to deal with those instead.  */
> +
> +unsigned int
> +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
> +{
> +  if (!movable_insn_p (i1->insn))
> +    return i1->point;
> +
> +  /* Start by optimistically assuming that we can move the instruction
> +     all the way down to I2.  */
> +  unsigned int point = i2->point;
> +
> +  /* Make sure that the new position preserves all necessary anti dependencies
> +     on later instructions.  */
> +  for (live_range_rec **use = i1->uses; *use; ++use)
> +    if (live_range_rec *range = (*use)->next_range)
> +      if (range->producer != i2 && point <= range->producer->point)
> +       point = range->producer->point + 1;
> +
> +  /* Make sure that the new position preserves all necessary output and
> +     true dependencies on later instructions.  */
> +  for (live_range_rec **def = i1->defs; *def; ++def)
> +    {
> +      live_range_rec *range = *def;
> +
> +      for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +       if (range->users[i] != i2)
> +         {
> +           if (range->users[i] && point <= range->users[i]->point)
> +             point = range->users[i]->point + 1;
> +           break;
> +         }
> +
> +      if (range->first_extra_use && point <= range->first_extra_use)
> +       point = range->first_extra_use + 1;
> +
> +      live_range_rec *next_range = range->next_range;
> +      if (next_range
> +         && next_range->producer != i2
> +         && point <= next_range->producer->point)
> +       point = next_range->producer->point + 1;
> +    }
> +
> +  return point;
> +}
> +
> +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
> +   where I1 is the instruction that we're currently trying to optimize.
> +   If DEF_USE_RANGE is nonnull, I1 defines the value described by
> +   DEF_USE_RANGE and I2 uses it.  */
> +
> +bool
> +combine2::start_combination (combination_attempt_rec &attempt,
> +                            insn_info_rec *i1, insn_info_rec *i2,
> +                            live_range_rec *def_use_range)
> +{
> +  attempt.new_home = i1;
> +  attempt.sequence[0] = i1;
> +  attempt.sequence[1] = i2;
> +  if (attempt.sequence[0]->point < attempt.sequence[1]->point)
> +    std::swap (attempt.sequence[0], attempt.sequence[1]);
> +  attempt.def_use_range = def_use_range;
> +
> +  /* Check that the instructions have no true dependencies other than
> +     DEF_USE_RANGE.  */
> +  bitmap_clear (m_true_deps);
> +  for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
> +    if (*def != def_use_range)
> +      bitmap_set_bit (m_true_deps, (*def)->regno);
> +  for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
> +    if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
> +      return false;
> +
> +  /* Calculate the range of points at which the combined instruction
> +     could live.  */
> +  attempt.earliest_point = find_earliest_point (attempt.sequence[1],
> +                                               attempt.sequence[0]);
> +  attempt.latest_point = find_latest_point (attempt.sequence[0],
> +                                           attempt.sequence[1]);
> +  if (attempt.earliest_point < attempt.latest_point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "cannot combine %d and %d: no suitable"
> +                " location for combined insn\n",
> +                INSN_UID (attempt.sequence[0]->insn),
> +                INSN_UID (attempt.sequence[1]->insn));
> +      return false;
> +    }
> +
> +  /* Make sure we have valid costs for the original instructions before
> +     we start changing their patterns.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    if (attempt.sequence[i]->cost == UNKNOWN_COST)
> +      attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
> +                                            m_optimize_for_speed_p);
> +  return true;
> +}
> +
> +/* Check whether the combination attempt described by ATTEMPT matches
> +   an .md instruction (or matches its constraints, in the case of an
> +   asm statement).  If so, calculate the cost of the new instruction
> +   and check whether it's cheap enough.  */
> +
> +bool
> +combine2::verify_combination (combination_attempt_rec &attempt)
> +{
> +  rtx_insn *insn = attempt.sequence[1]->insn;
> +
> +  bool ok_p = verify_changes (0);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      if (!ok_p)
> +       fprintf (dump_file, "failed to match this instruction:\n");
> +      else if (const char *name = get_insn_name (INSN_CODE (insn)))
> +       fprintf (dump_file, "successfully matched this instruction to %s:\n",
> +                name);
> +      else
> +       fprintf (dump_file, "successfully matched this instruction:\n");
> +      print_rtl_single (dump_file, PATTERN (insn));
> +    }
> +  if (!ok_p)
> +    return false;
> +
> +  unsigned int cost1 = attempt.sequence[0]->cost;
> +  unsigned int cost2 = attempt.sequence[1]->cost;
> +  attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
> +  ok_p = (attempt.new_cost <= cost1 + cost2);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n",
> +            cost1, cost2, attempt.new_cost,
> +            ok_p ? "keeping replacement" : "rejecting replacement");
> +  if (!ok_p)
> +    return false;
> +
> +  confirm_change_group ();
> +  return true;
> +}
> +
> +/* Return true if we should consider register REGNO when calculating
> +   register pressure estimates.  */
> +
> +static bool
> +count_reg_pressure_p (unsigned int regno)
> +{
> +  if (regno == INVALID_REGNUM)
> +    return false;
> +
> +  /* Unallocatable registers aren't interesting.  */
> +  if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Try to estimate the effect that the original form of INSN_INFO
> +   had on register pressure, in the form "born - dying".  */
> +
> +int
> +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
> +{
> +  int delta = 0;
> +
> +  for (live_range_rec **def = insn_info->defs; *def; ++def)
> +    if (count_reg_pressure_p ((*def)->regno))
> +      delta += 1;
> +
> +  for (live_range_rec **use = insn_info->uses; *use; ++use)
> +    if (count_reg_pressure_p ((*use)->regno)
> +       && known_last_use_p (*use, insn_info))
> +      delta -= 1;
> +
> +  return delta;
> +}
> +
> +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
> +   FROM_INSN.  Copy any useful information to TO_INSN before doing that.  */
> +
> +static void
> +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
> +{
> +  INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
> +  INSN_CODE (to_insn) = INSN_CODE (from_insn);
> +  REG_NOTES (to_insn) = REG_NOTES (from_insn);
> +}
> +
> +/* The combination attempt in ATTEMPT has succeeded and is currently
> +   part of an open validate_change group.  Commit to making the change
> +   and decide where the new instruction should go.
> +
> +   KEPT_DEF_P is true if the new instruction continues to perform
> +   the definition described by ATTEMPT.def_use_range.  */
> +
> +void
> +combine2::commit_combination (combination_attempt_rec &attempt,
> +                             bool kept_def_p)
> +{
> +  insn_info_rec *new_home = attempt.new_home;
> +  rtx_insn *old_insn = attempt.sequence[0]->insn;
> +  rtx_insn *new_insn = attempt.sequence[1]->insn;
> +
> +  /* Remove any notes that are no longer relevant.  */
> +  bool single_set_p = single_set (new_insn);
> +  for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; )
> +    {
> +      rtx note = *note_ptr;
> +      bool keep_p = true;
> +      switch (REG_NOTE_KIND (note))
> +       {
> +       case REG_EQUAL:
> +       case REG_EQUIV:
> +       case REG_NOALIAS:
> +         keep_p = single_set_p;
> +         break;
> +
> +       case REG_UNUSED:
> +         keep_p = false;
> +         break;
> +
> +       default:
> +         break;
> +       }
> +      if (keep_p)
> +       note_ptr = &XEXP (*note_ptr, 1);
> +      else
> +       {
> +         *note_ptr = XEXP (*note_ptr, 1);
> +         free_EXPR_LIST_node (note);
> +       }
> +    }
> +
> +  /* Complete the open validate_change group.  */
> +  confirm_change_group ();
> +
> +  /* Decide where the new instruction should go.  */
> +  unsigned int new_point = attempt.latest_point;
> +  if (new_point != attempt.earliest_point
> +      && prev_real_insn (new_insn) != old_insn)
> +    {
> +      /* Prefer the earlier point if the combined instruction reduces
> +        register pressure and the latest point if it increases register
> +        pressure.
> +
> +        The choice isn't obvious in the event of a tie, but picking
> +        the earliest point should reduce the number of times that
> +        we need to invalidate debug insns.  */
> +      int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
> +      int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
> +      bool move_up_p = (delta1 + delta2 <= 0);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "register pressure delta = %d + %d; using %s position\n",
> +                delta1, delta2, move_up_p ? "earliest" : "latest");
> +      if (move_up_p)
> +       new_point = attempt.earliest_point;
> +    }
> +
> +  /* Translate inserting at NEW_POINT into inserting before or after
> +     a particular insn.  */
> +  rtx_insn *anchor = NULL;
> +  bool before_p = (new_point & 1);
> +  if (new_point != attempt.sequence[1]->point
> +      && new_point != attempt.sequence[0]->point)
> +    {
> +      anchor = m_points[(new_point - m_end_of_sequence) / 2];
> +      rtx_insn *other_side = (before_p
> +                             ? prev_real_insn (anchor)
> +                             : next_real_insn (anchor));
> +      /* Inserting next to an insn X and then deleting X is just a
> +        roundabout way of using X as the insertion point.  */
> +      if (anchor == new_insn || other_side == new_insn)
> +       new_point = attempt.sequence[1]->point;
> +      else if (anchor == old_insn || other_side == old_insn)
> +       new_point = attempt.sequence[0]->point;
> +    }
> +
> +  /* Actually perform the move.  */
> +  if (new_point == attempt.sequence[1]->point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "using insn %d to hold the combined pattern\n",
> +                INSN_UID (new_insn));
> +      set_insn_deleted (old_insn);
> +    }
> +  else if (new_point == attempt.sequence[0]->point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "using insn %d to hold the combined pattern\n",
> +                INSN_UID (old_insn));
> +      PATTERN (old_insn) = PATTERN (new_insn);
> +      transfer_insn (old_insn, new_insn);
> +      std::swap (old_insn, new_insn);
> +      set_insn_deleted (old_insn);
> +    }
> +  else
> +    {
> +      /* We need to insert a new instruction.  We can't simply move
> +        NEW_INSN because it acts as an insertion anchor in m_points.  */
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "inserting combined insn %s insn %d\n",
> +                before_p ? "before" : "after", INSN_UID (anchor));
> +
> +      rtx_insn *added_insn = (before_p
> +                             ? emit_insn_before (PATTERN (new_insn), anchor)
> +                             : emit_insn_after (PATTERN (new_insn), anchor));
> +      transfer_insn (added_insn, new_insn);
> +      set_insn_deleted (old_insn);
> +      set_insn_deleted (new_insn);
> +      new_insn = added_insn;
> +    }
> +  df_insn_rescan (new_insn);
> +
> +  /* Unlink the old uses.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
> +      remove_range_use (*use, attempt.sequence[i]);
> +
> +  /* Work out which registers the new pattern uses.  */
> +  bitmap_clear (m_true_deps);
> +  df_ref use;
> +  FOR_EACH_INSN_USE (use, new_insn)
> +    {
> +      rtx reg = DF_REF_REAL_REG (use);
> +      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
> +    }
> +  FOR_EACH_INSN_EQ_USE (use, new_insn)
> +    {
> +      rtx reg = DF_REF_REAL_REG (use);
> +      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
> +    }
> +
> +  /* Describe the combined instruction in NEW_HOME.  */
> +  new_home->insn = new_insn;
> +  new_home->point = new_point;
> +  new_home->cost = attempt.new_cost;
> +
> +  /* Build up a list of definitions for the combined instructions
> +     and update all the ranges accordingly.  It shouldn't matter
> +     which order we do this in.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
> +      if (kept_def_p || *def != attempt.def_use_range)
> +       {
> +         obstack_ptr_grow (&m_insn_obstack, *def);
> +         (*def)->producer = new_home;
> +       }
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +
> +  /* Build up a list of uses for the combined instructions and update
> +     all the ranges accordingly.  Again, it shouldn't matter which
> +     order we do this in.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
> +      if (*use != attempt.def_use_range
> +         && add_range_use (*use, new_home))
> +       obstack_ptr_grow (&m_insn_obstack, *use);
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +
> +  /* There shouldn't be any remaining references to other instructions
> +     in the combination.  Invalidate their contents to make lingering
> +     references a noisy failure.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    if (attempt.sequence[i] != new_home)
> +      {
> +       attempt.sequence[i]->insn = NULL;
> +       attempt.sequence[i]->point = ~0U;
> +      }
> +
> +  /* Unlink the def-use range.  */
> +  if (!kept_def_p && attempt.def_use_range)
> +    {
> +      live_range_rec *range = attempt.def_use_range;
> +      if (range->prev_range)
> +       range->prev_range->next_range = range->next_range;
> +      else
> +       m_reg_info[range->regno].range = range->next_range;
> +      if (range->next_range)
> +       range->next_range->prev_range = range->prev_range;
> +    }
> +
> +  /* Record instructions whose new form alters the cfg.  */
> +  rtx pattern = PATTERN (new_insn);
> +  if ((returnjump_p (new_insn)
> +       || any_uncondjump_p (new_insn)
> +       || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
> +      && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
> +    m_cfg_altering_insns.safe_push (new_insn);
> +}
> +
> +/* Return true if X1 and X2 are memories and if X1 does not have
> +   a higher alignment than X2.  */
> +
> +static bool
> +dubious_mem_pair_p (rtx x1, rtx x2)
> +{
> +  return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
> +}
> +
> +/* Try to implement ATTEMPT using (parallel [SET1 SET2]).  */
> +
> +bool
> +combine2::try_parallel_sets (combination_attempt_rec &attempt,
> +                            rtx set1, rtx set2)
> +{
> +  rtx_insn *insn = attempt.sequence[1]->insn;
> +
> +  /* Combining two loads or two stores can be useful on targets that
> +     allow them to be treated as a single access.  However, we use a
> +     very peephole approach to picking the pairs, so we need to be
> +     relatively confident that we're making a good choice.
> +
> +     For now just aim for cases in which the memory references are
> +     consecutive and the first reference has a higher alignment.
> +     We can leave the target to test the consecutive part; whatever test
> +     we added here might be different from the target's, and in any case
> +     it's fine if the target accepts other well-aligned cases too.  */
> +  if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
> +      || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
> +    return false;
> +
> +  /* Cache the PARALLEL rtx between attempts so that we don't generate
> +     too much garbage rtl.  */
> +  if (!m_spare_parallel)
> +    {
> +      rtvec vec = gen_rtvec (2, set1, set2);
> +      m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
> +    }
> +  else
> +    {
> +      XVECEXP (m_spare_parallel, 0, 0) = set1;
> +      XVECEXP (m_spare_parallel, 0, 1) = set2;
> +    }
> +
> +  unsigned int num_changes = num_validated_changes ();
> +  validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
> +  if (verify_combination (attempt))
> +    {
> +      m_spare_parallel = NULL_RTX;
> +      return true;
> +    }
> +  cancel_changes (num_changes);
> +  return false;
> +}
> +
> +/* Try to parallelize the two instructions in ATTEMPT.  */
> +
> +bool
> +combine2::try_parallelize_insns (combination_attempt_rec &attempt)
> +{
> +  rtx_insn *i1_insn = attempt.sequence[0]->insn;
> +  rtx_insn *i2_insn = attempt.sequence[1]->insn;
> +
> +  /* Can't parallelize asm statements.  */
> +  if (asm_noperands (PATTERN (i1_insn)) >= 0
> +      || asm_noperands (PATTERN (i2_insn)) >= 0)
> +    return false;
> +
> +  /* For now, just handle the case in which both instructions are
> +     single sets.  We could handle more than 2 sets as well, but few
> +     targets support that anyway.  */
> +  rtx set1 = single_set (i1_insn);
> +  if (!set1)
> +    return false;
> +  rtx set2 = single_set (i2_insn);
> +  if (!set2)
> +    return false;
> +
> +  /* Make sure that we have structural proof that the destinations
> +     are independent.  Things like alias analysis rely on semantic
> +     information and assume no undefined behavior, which is rarely a
> +     good enough guarantee to allow a useful instruction combination.  */
> +  rtx dest1 = SET_DEST (set1);
> +  rtx dest2 = SET_DEST (set2);
> +  if (MEM_P (dest1)
> +      ? MEM_P (dest2) && nonoverlapping_memrefs_p (dest1, dest2, false)
> +      : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
> +    return false;
> +
> +  /* Try the sets in both orders.  */
> +  if (try_parallel_sets (attempt, set1, set2)
> +      || try_parallel_sets (attempt, set2, set1))
> +    {
> +      commit_combination (attempt, true);
> +      if (MAY_HAVE_DEBUG_BIND_INSNS
> +         && attempt.new_home->insn != i1_insn)
> +       propagate_for_debug (i1_insn, attempt.new_home->insn,
> +
Richard Sandiford Nov. 18, 2019, 5:55 p.m. UTC | #2
Richard Sandiford <richard.sandiford@arm.com> writes:
> (It's 23:35 local time, so it's still just about stage 1. :-))

Or actually, just under 1 day after end of stage 1.  Oops.
Could have sworn stage 1 ended on the 17th :-(  Only realised
I'd got it wrong when catching up on Saturday's email traffic.

And inevitably, I introduced a couple of stupid mistakes while
trying to clean the patch up for submission by that (non-)deadline.
Here's a version that fixes an inverted overlapping memref check
and that correctly prunes the use list for combined instructions.
(This last one is just a compile-time saving -- the old code was
correct, just suboptimal.)
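
For illustration only, the intended polarity of that destination check
can be sketched outside GCC.  The sketch assumes a toy model in which a
destination is either a byte range in memory or a range of register
numbers; Dest, ranges_overlap and must_reject are hypothetical names,
not functions from the patch.  A pair is rejected only when both
destinations are of the same kind and overlap (for memory, read that as
"cannot be shown to be nonoverlapping").

```cpp
// Toy model (an assumption, not GCC's rtx representation): a destination
// is a half-open range [start, start + size) of bytes or register numbers.
struct Dest
{
  bool is_mem;      // true for a memory destination, false for a register
  unsigned start;   // first byte offset or first register number
  unsigned size;    // number of bytes or registers covered
};

// Return true if the two ranges share at least one element.
static bool
ranges_overlap (const Dest &a, const Dest &b)
{
  return a.start < b.start + b.size && b.start < a.start + a.size;
}

// Return true if the two destinations must NOT be put in one parallel.
// Memory/memory: reject unless the ranges are provably disjoint.
// Register/register: reject if the registers overlap.
// Mixed pairs are left alone here, as in the shape of the patch's test.
static bool
must_reject (const Dest &d1, const Dest &d2)
{
  if (d1.is_mem)
    return d2.is_mem && ranges_overlap (d1, d2);
  return !d2.is_mem && ranges_overlap (d1, d2);
}
```

With the polarity inverted, the memory/memory arm would reject exactly
the disjoint pairs and accept the overlapping ones, which is the kind of
mistake described above.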

And those comparisons that looked too good to be true were:
I'd bodged the choice of run-combine parameters when setting
up the tests.  All in all, not a great day.
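
As a reading aid, here is a minimal sketch of how a run-combine value
breaks down, going only by the params.opt text in the patch below
(bit 0: earlier variant, bit 1: main combine pass, bit 2: later
variant; default 2, range 0 to 7).  CombineSelection and
decode_run_combine are hypothetical names used for illustration; the
patch itself tests the bits directly in each pass's gate function.

```cpp
// Bit assignments as described in the params.opt entry of this patch.
constexpr unsigned RUN_EARLY_VARIANT = 1u << 0;  // variant before combine
constexpr unsigned RUN_MAIN_COMBINE = 1u << 1;   // the existing combine pass
constexpr unsigned RUN_LATE_VARIANT = 1u << 2;   // variant after combine

// Hypothetical helper type: which of the three passes a value enables.
struct CombineSelection
{
  bool early;
  bool main_pass;
  bool late;
};

// Decode a --param=run-combine value into the passes it selects.
static CombineSelection
decode_run_combine (unsigned value)
{
  CombineSelection sel;
  sel.early = (value & RUN_EARLY_VARIANT) != 0;
  sel.main_pass = (value & RUN_MAIN_COMBINE) != 0;
  sel.late = (value & RUN_LATE_VARIANT) != 0;
  return sel;
}
```

Under this reading the default of 2 runs only the existing pass, 6 adds
the later variant, and 5 runs both new variants without the main pass.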

Here are the (much less impressive) real values:

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu        412    -786    -270     520      -1
aarch64_be-linux-gnu     288   -3314    -270      33      -1
alpha-linux-gnu          399   -2721    -370      22      -2
amdgcn-amdhsa            201    1938    -484    1259      -1
arc-elf                  530   -5901   -1529     356      -1
arm-linux-gnueabi        193   -1167    -612     680      -1
arm-linux-gnueabihf      193   -1167    -612     680      -1
avr-elf                 1331 -111093  -13824     680      -9
bfin-elf                1347  -18928   -8461     465      -2
bpf-elf                   63    -475     -60       6      -2
c6x-elf                  183  -10508  -10084      41      -2
cr16-elf                1610  -51360  -10657      42     -13
cris-elf                 143   -1534    -702       4      -2
csky-elf                 136   -3371    -474       6      -2
epiphany-elf             178    -389    -149      84      -1
fr30-elf                 161   -1756    -756     289      -2
frv-linux-gnu            807  -13324   -2074      67      -1
ft32-elf                 282   -1666    -111       5      -2
h8300-elf                522  -11451   -1747      68      -3
hppa64-hp-hpux11.23      186    -848    -142      34      -1
i686-apple-darwin        344   -1298     -56      44      -1
i686-pc-linux-gnu        242   -1953    -556      33      -1
ia64-linux-gnu           150   -4834   -1134      40      -4
iq2000-elf               177   -1333     -61       3      -2
lm32-elf                 193   -1792    -316      47      -2
m32r-elf                  73    -595     -98      11      -2
m68k-linux-gnu           210   -2351    -332     148      -2
mcore-elf                133   -1213    -146       7      -1
microblaze-elf           445   -4493   -2094      32      -2
mipsel-linux-gnu         134   -2038    -222      60      -2
mmix                     108    -233     -26       4      -1
mn10300-elf              224   -1024    -234      80      -1
moxie-rtems              154    -743     -79       4      -2
msp430-elf               182    -586     -63      19      -1
nds32le-elf              267    -485     -37     136      -1
nios2-linux-gnu           83    -323     -66       5      -1
nvptx-none               568   -1124    -208      16       1
or1k-elf                  61    -281     -25       4      -1
pdp11                    248   -1292    -182      83      -1
powerpc-ibm-aix7.0      1288   -3031    -370    2046      -1
powerpc64-linux-gnu     1118     692    -274    2934      -2
powerpc64le-linux-gnu   1044   -4719    -688     156      -1
pru-elf                   48   -7014   -6921       6      -1
riscv32-elf               63   -1364    -139       7      -2
riscv64-elf               91   -1557    -264       7      -1
rl78-elf                 354  -16805   -1665      42      -6
rx-elf                    95    -186     -53       8      -1
s390-linux-gnu           184   -2282   -1485      63      -1
s390x-linux-gnu          257    -363    -159     522      -1
sh-linux-gnu             225    -405    -108      68      -1
sparc-linux-gnu          164    -859     -99      18      -1
sparc64-linux-gnu        169    -791    -102      15      -1
tilepro-linux-gnu       1037   -4896    -315     332      -2
v850-elf                  54    -408     -53       3      -2
vax-netbsdelf            251   -3315    -400       2      -2
visium-elf               101    -693    -138      16      -1
x86_64-darwin            350   -2145    -490      72      -1
x86_64-linux-gnu         311    -853    -288     210      -1
xstormy16-elf            219    -770    -156      59      -1
xtensa-elf               201   -1418    -322      36       1

Also, the number of LDPs on aarch64-linux-gnu went up from
3543 to 5235.  The number of STPs went up from 10494 to 12151.
All the new pairs should be aligned ones.
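
The alignment filter behind that last claim can be sketched in
isolation.  This is a simplified model, assuming one operand per set and
alignment expressed in bits; MemRef, dubious_mem_pair and order_allowed
are hypothetical stand-ins for the patch's dubious_mem_pair_p test, which
is applied to both the destinations and the sources of the two sets.

```cpp
// Toy operand (an assumption): either a memory reference with a known
// alignment in bits, or something that is not memory at all.
struct MemRef
{
  bool is_mem;
  unsigned align_bits;
};

// Mirror of the patch's rule: a pair is dubious if both operands are
// memories and the first does not have a higher alignment than the second.
static bool
dubious_mem_pair (const MemRef &x1, const MemRef &x2)
{
  return x1.is_mem && x2.is_mem && x1.align_bits <= x2.align_bits;
}

// Return true if placing FIRST before SECOND in the parallel passes
// the alignment filter.  The target still has to recognize the result.
static bool
order_allowed (const MemRef &first, const MemRef &second)
{
  return !dubious_mem_pair (first, second);
}
```

Since the pass tries the two sets in both orders, a 128-bit-aligned
access paired with a 64-bit-aligned one passes in exactly one order,
while two accesses with equal alignment are filtered out in both.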

Retested on aarch64-linux-gnu and x86_64-linux-gnu.  It missed the
deadline, but I thought I'd post it anyway to put the record straight.

Thanks,
Richard


2019-11-18  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* Makefile.in (OBJS): Add combine2.o.
	* params.opt (--param=run-combine): New option.
	* doc/invoke.texi: Document it.
	* tree-pass.h (make_pass_combine2_before): Declare.
	(make_pass_combine2_after): Likewise.
	* passes.def: Add them.
	* timevar.def (TV_COMBINE2): New timevar.
	* cfgrtl.h (update_cfg_for_uncondjump): Declare.
	* combine.c (update_cfg_for_uncondjump): Move to...
	* cfgrtl.c (update_cfg_for_uncondjump): ...here.
	* simplify-rtx.c (simplify_truncation): Handle comparisons.
	* recog.h (validate_simplify_replace_rtx): Declare.
	* recog.c (validate_simplify_replace_rtx_1): New function.
	(validate_simplify_replace_rtx_uses): Likewise.
	(validate_simplify_replace_rtx): Likewise.
	* combine2.c: New file.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2019-11-18 15:12:34.000000000 +0000
+++ gcc/Makefile.in	2019-11-18 17:43:14.245303327 +0000
@@ -1261,6 +1261,7 @@ OBJS = \
 	cgraphunit.o \
 	cgraphclones.o \
 	combine.o \
+	combine2.o \
 	combine-stack-adj.o \
 	compare-elim.o \
 	context.o \
Index: gcc/params.opt
===================================================================
--- gcc/params.opt	2019-11-18 15:12:34.000000000 +0000
+++ gcc/params.opt	2019-11-18 17:43:14.257303244 +0000
@@ -768,6 +768,10 @@ Use internal function id in profile look
 Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
 Maximum depth of a loop nest to fully value-number optimistically.
 
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
 -param=sccvn-max-alias-queries-per-access=
 Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
 Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-11-18 15:12:34.000000000 +0000
+++ gcc/doc/invoke.texi	2019-11-18 17:43:14.257303244 +0000
@@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2019-11-18 15:12:34.000000000 +0000
+++ gcc/tree-pass.h	2019-11-18 17:43:14.257303244 +0000
@@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
 extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2019-11-18 15:12:34.000000000 +0000
+++ gcc/passes.def	2019-11-18 17:43:14.257303244 +0000
@@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
+      NEXT_PASS (pass_combine2_before);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_combine2_after);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2019-11-18 15:12:34.000000000 +0000
+++ gcc/timevar.def	2019-11-18 17:43:14.257303244 +0000
@@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC          , "
 DEFTIMEVAR (TV_CSE2                  , "CSE 2")
 DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
 DEFTIMEVAR (TV_COMBINE               , "combiner")
+DEFTIMEVAR (TV_COMBINE2              , "second combiner")
 DEFTIMEVAR (TV_IFCVT		     , "if-conversion")
 DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
Index: gcc/cfgrtl.h
===================================================================
--- gcc/cfgrtl.h	2019-11-18 15:12:34.000000000 +0000
+++ gcc/cfgrtl.h	2019-11-18 17:43:14.245303327 +0000
@@ -47,6 +47,7 @@ extern void fixup_partitions (void);
 extern bool purge_dead_edges (basic_block);
 extern bool purge_all_dead_edges (void);
 extern bool fixup_abnormal_edges (void);
+extern void update_cfg_for_uncondjump (rtx_insn *);
 extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
 extern void relink_block_chain (bool);
 extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
Index: gcc/combine.c
===================================================================
--- gcc/combine.c	2019-11-18 15:12:34.000000000 +0000
+++ gcc/combine.c	2019-11-18 17:43:14.249303299 +0000
@@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg)
 	 && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
 }
 
-/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
-   Note that the INSN should be deleted *after* removing dead edges, so
-   that the kept edge is the fallthrough edge for a (set (pc) (pc))
-   but not for a (set (pc) (label_ref FOO)).  */
-
-static void
-update_cfg_for_uncondjump (rtx_insn *insn)
-{
-  basic_block bb = BLOCK_FOR_INSN (insn);
-  gcc_assert (BB_END (bb) == insn);
-
-  purge_dead_edges (bb);
-
-  delete_insn (insn);
-  if (EDGE_COUNT (bb->succs) == 1)
-    {
-      rtx_insn *insn;
-
-      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
-
-      /* Remove barriers from the footer if there are any.  */
-      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
-	if (BARRIER_P (insn))
-	  {
-	    if (PREV_INSN (insn))
-	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
-	    else
-	      BB_FOOTER (bb) = NEXT_INSN (insn);
-	    if (NEXT_INSN (insn))
-	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
-	  }
-	else if (LABEL_P (insn))
-	  break;
-    }
-}
-
 /* Return whether PAT is a PARALLEL of exactly N register SETs followed
    by an arbitrary number of CLOBBERs.  */
 static bool
@@ -15096,7 +15060,10 @@ const pass_data pass_data_combine =
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return (optimize > 0); }
+  virtual bool gate (function *)
+    {
+      return optimize > 0 && (param_run_combine & 2) != 0;
+    }
   virtual unsigned int execute (function *)
     {
       return rest_of_handle_combine ();
Index: gcc/cfgrtl.c
===================================================================
--- gcc/cfgrtl.c	2019-11-18 15:12:34.000000000 +0000
+++ gcc/cfgrtl.c	2019-11-18 17:43:14.245303327 +0000
@@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void)
   return inserted;
 }
 
+/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
+   Note that the INSN should be deleted *after* removing dead edges, so
+   that the kept edge is the fallthrough edge for a (set (pc) (pc))
+   but not for a (set (pc) (label_ref FOO)).  */
+
+void
+update_cfg_for_uncondjump (rtx_insn *insn)
+{
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  gcc_assert (BB_END (bb) == insn);
+
+  purge_dead_edges (bb);
+
+  delete_insn (insn);
+  if (EDGE_COUNT (bb->succs) == 1)
+    {
+      rtx_insn *insn;
+
+      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
+
+      /* Remove barriers from the footer if there are any.  */
+      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
+	if (BARRIER_P (insn))
+	  {
+	    if (PREV_INSN (insn))
+	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
+	    else
+	      BB_FOOTER (bb) = NEXT_INSN (insn);
+	    if (NEXT_INSN (insn))
+	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
+	  }
+	else if (LABEL_P (insn))
+	  break;
+    }
+}
+
 /* Cut the insns from FIRST to LAST out of the insns stream.  */
 
 rtx_insn *
Index: gcc/simplify-rtx.c
===================================================================
--- gcc/simplify-rtx.c	2019-11-18 15:28:59.916793401 +0000
+++ gcc/simplify-rtx.c	2019-11-18 17:43:14.257303244 +0000
@@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode,
       && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
     return constm1_rtx;
 
+  /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
+     in a narrower mode if useful.  */
+  if (COMPARISON_P (op))
+    return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
+				    XEXP (op, 0), XEXP (op, 1));
+
   return NULL_RTX;
 }
 
Index: gcc/recog.h
===================================================================
--- gcc/recog.h	2019-11-18 15:12:34.000000000 +0000
+++ gcc/recog.h	2019-11-18 17:43:14.257303244 +0000
@@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos
 extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
 extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
 extern bool validate_simplify_insn (rtx_insn *insn);
+extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
 extern int num_changes_pending (void);
 extern int next_insn_tests_no_inequality (rtx_insn *);
 extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
Index: gcc/recog.c
===================================================================
--- gcc/recog.c	2019-11-18 15:12:34.000000000 +0000
+++ gcc/recog.c	2019-11-18 17:43:14.257303244 +0000
@@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn)
       }
   return ((num_changes_pending () > 0) && (apply_change_group () > 0));
 }
+
+/* A subroutine of validate_simplify_replace_rtx.  Apply the replacement
+   described by R to LOC.  Return true on success; leave the caller
+   to clean up on failure.  */
+
+static bool
+validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
+{
+  rtx x = *loc;
+  enum rtx_code code = GET_CODE (x);
+  machine_mode mode = GET_MODE (x);
+
+  if (rtx_equal_p (x, r.from))
+    {
+      validate_unshare_change (r.insn, loc, r.to, 1);
+      return true;
+    }
+
+  /* Recursively apply the substitution and see if we can simplify
+     the result.  This specifically shouldn't use simplify_gen_*,
+     since we want to avoid generating new expressions where possible.  */
+  int old_num_changes = num_validated_changes ();
+  rtx newx = NULL_RTX;
+  bool recurse_p = false;
+  switch (GET_RTX_CLASS (code))
+    {
+    case RTX_UNARY:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
+	  return false;
+
+	newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
+	break;
+      }
+
+    case RTX_BIN_ARITH:
+    case RTX_COMM_ARITH:
+      {
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_binary_operation (code, mode,
+					  XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_COMPARE:
+    case RTX_COMM_COMPARE:
+      {
+	machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
+				? GET_MODE (XEXP (x, 0))
+				: GET_MODE (XEXP (x, 1)));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_relational_operation (code, mode, op_mode,
+					      XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_TERNARY:
+    case RTX_BITFIELD_OPS:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
+	  return false;
+
+	newx = simplify_ternary_operation (code, mode, op0_mode,
+					   XEXP (x, 0), XEXP (x, 1),
+					   XEXP (x, 2));
+	break;
+      }
+
+    case RTX_EXTRA:
+      if (code == SUBREG)
+	{
+	  machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
+	  if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
+	    return false;
+
+	  rtx inner = SUBREG_REG (x);
+	  newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
+	  /* Reject the same cases that simplify_gen_subreg would.  */
+	  if (!newx
+	      && (GET_CODE (inner) == SUBREG
+		  || GET_CODE (inner) == CONCAT
+		  || GET_MODE (inner) == VOIDmode
+		  || !validate_subreg (mode, inner_mode,
+				       inner, SUBREG_BYTE (x))))
+	    return false;
+	  break;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_OBJ:
+      if (code == LO_SUM)
+	{
+	  if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	      || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	    return false;
+
+	  /* (lo_sum (high x) y) -> y where x and y have the same base.  */
+	  rtx op0 = XEXP (x, 0);
+	  rtx op1 = XEXP (x, 1);
+	  if (GET_CODE (op0) == HIGH)
+	    {
+	      rtx base0, base1, offset0, offset1;
+	      split_const (XEXP (op0, 0), &base0, &offset0);
+	      split_const (op1, &base1, &offset1);
+	      if (rtx_equal_p (base0, base1))
+		newx = op1;
+	    }
+	}
+      else if (code == REG)
+	{
+	  if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
+	    return false;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_CONST_OBJ:
+      break;
+
+    case RTX_AUTOINC:
+      if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
+	return false;
+      recurse_p = true;
+      break;
+
+    case RTX_MATCH:
+    case RTX_INSN:
+      gcc_unreachable ();
+    }
+
+  if (recurse_p)
+    {
+      const char *fmt = GET_RTX_FORMAT (code);
+      for (int i = 0; fmt[i]; i++)
+	switch (fmt[i])
+	  {
+	  case 'E':
+	    for (int j = 0; j < XVECLEN (x, i); j++)
+	      if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
+		return false;
+	    break;
+
+	  case 'e':
+	    if (XEXP (x, i)
+		&& !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
+	      return false;
+	    break;
+	  }
+    }
+
+  if (newx && !rtx_equal_p (x, newx))
+    {
+      /* There's no longer any point unsharing the substitutions made
+	 for subexpressions, since we'll just copy this one instead.  */
+      for (int i = old_num_changes; i < num_changes; ++i)
+	changes[i].unshare = false;
+      validate_unshare_change (r.insn, loc, newx, 1);
+    }
+
+  return true;
+}
+
+/* A note_uses callback for validate_simplify_replace_rtx.
+   DATA points to a validate_replace_src_data object.  */
+
+static void
+validate_simplify_replace_rtx_uses (rtx *loc, void *data)
+{
+  validate_replace_src_data &r = *(validate_replace_src_data *) data;
+  if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
+    r.insn = NULL;
+}
+
+/* Try to perform the equivalent of:
+
+      newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
+      validate_change (INSN, LOC, newx, 1);
+
+   but without generating as much garbage rtl when the resulting
+   pattern doesn't match.
+
+   Return true if we were able to replace all uses of OLD_RTX in *LOC
+   and if the result conforms to general rtx rules (e.g. for whether
+   subregs are meaningful).
+
+   When returning true, add all replacements to the current validation group,
+   leaving the caller to test it in the normal way.  Leave both *LOC and the
+   validation group unchanged on failure.  */
+
+bool
+validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
+			       rtx old_rtx, rtx new_rtx)
+{
+  validate_replace_src_data r;
+  r.from = old_rtx;
+  r.to = new_rtx;
+  r.insn = insn;
+
+  unsigned int num_changes = num_validated_changes ();
+  note_uses (loc, validate_simplify_replace_rtx_uses, &r);
+  if (!r.insn)
+    {
+      cancel_changes (num_changes);
+      return false;
+    }
+  return true;
+}
 
 /* Return 1 if the insn using CC0 set by INSN does not contain
    any ordered tests applied to the condition codes.
Index: gcc/combine2.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/combine2.c	2019-11-18 17:43:14.249303299 +0000
@@ -0,0 +1,1598 @@
+/* Combine instructions
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+#include "predict.h"
+#include "cfgcleanup.h"
+#include "cfghooks.h"
+#include "cfgrtl.h"
+#include "alias.h"
+#include "valtrack.h"
+
+/* This pass tries to combine instructions in the following ways:
+
+   (1) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       and I2 is the only user of DEST1, the pass tries to combine them into:
+
+	 I2: (...SRC1...)
+
+   (2) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       the pass tries to combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
+
+       or:
+
+	 I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
+
+   (3) If we have two independent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (set DEST2 SRC2)
+
+       that read from memory or from the same register, the pass tries to
+       combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
+
+       or:
+
+	 I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
+
+   If the combined form is a valid instruction, the pass tries to find a
+   place between I1 and I2 inclusive for the new instruction.  If there
+   are multiple valid locations, it tries to pick the best one by taking
+   the effect on register pressure into account.
+
+   If a combination succeeds and produces a single set, the pass tries to
+   combine the new form with earlier or later instructions.
+
+   The pass currently optimizes each basic block separately.  It walks
+   the instructions in reverse order, building up live ranges for registers
+   and memory.  It then uses these live ranges to look for possible
+   combination opportunities and to decide where the combined instructions
+   could be placed.
+
+   The pass represents positions in the block using point numbers,
+   with higher numbers indicating earlier instructions.  The numbering
+   scheme is that:
+
+   - the end of the current instruction sequence has an even base point B.
+
+   - instructions initially have odd-numbered points B + 1, B + 3, etc.
+     with B + 1 being the final instruction in the sequence.
+
+   - even points after B represent gaps between instructions where combined
+     instructions could be placed.
+
+   Thus even points initially represent no instructions and odd points
+   initially represent single instructions.  However, when picking a
+   place for a combined instruction, the pass may choose somewhere
+   in between the original two instructions, so that over time a point
+   may come to represent several instructions.  When this happens,
+   the pass maintains the invariant that all instructions with the same
+   point number are independent of each other and thus can be treated as
+   acting in parallel (or as acting in any arbitrary sequence).
+
+   TODOs:
+
+   - Handle 3-instruction combinations, and possibly more.
+
+   - Handle existing clobbers more efficiently.  At the moment we can't
+     move an instruction that clobbers R across another instruction that
+     clobbers R.
+
+   - Allow hard register clobbers to be added, like combine does.
+
+   - Perhaps work on EBBs, or SESE regions.  */
+
+namespace {
+
+/* The number of explicit uses to record in a live range.  */
+const unsigned int NUM_RANGE_USERS = 4;
+
+/* The maximum number of instructions that we can combine at once.  */
+const unsigned int MAX_COMBINE_INSNS = 2;
+
+/* A fake cost for instructions that we haven't costed yet.  */
+const unsigned int UNKNOWN_COST = ~0U;
+
+class combine2
+{
+public:
+  combine2 (function *);
+  ~combine2 ();
+
+  void execute ();
+
+private:
+  struct insn_info_rec;
+
+  /* Describes the live range of a register or of memory.  For simplicity,
+     we treat memory as a single entity.
+
+     If we had a fully-accurate live range, updating it to account for a
+     moved instruction would be a linear-time operation.  Doing this for
+     each combination would then make the pass quadratic.  We therefore
+     just maintain a list of NUM_RANGE_USERS use insns and use simple,
+     conservatively-correct behavior for the rest.  */
+  struct live_range_rec
+  {
+    /* Which instruction provides the dominating definition, or null if
+       we don't know yet.  */
+    insn_info_rec *producer;
+
+    /* A selection of instructions that use the resource, in program order.  */
+    insn_info_rec *users[NUM_RANGE_USERS];
+
+    /* An inclusive range of points that covers instructions not mentioned
+       in USERS.  Both values are zero if there are no such instructions.
+
+       Once we've included a use U at point P in this range, we continue
+       to assume that some kind of use exists at P whatever happens to U
+       afterwards.  */
+    unsigned int first_extra_use;
+    unsigned int last_extra_use;
+
+    /* The register number this range describes, or INVALID_REGNUM
+       for memory.  */
+    unsigned int regno;
+
+    /* Forms a linked list of ranges for the same resource, in program
+       order.  */
+    live_range_rec *prev_range;
+    live_range_rec *next_range;
+  };
+
+  /* Pass-specific information about an instruction.  */
+  struct insn_info_rec
+  {
+    /* The instruction itself.  */
+    rtx_insn *insn;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction defines.  */
+    live_range_rec **defs;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction uses.  */
+    live_range_rec **uses;
+
+    /* The point at which the instruction appears.  */
+    unsigned int point;
+
+    /* The cost of the instruction, or UNKNOWN_COST if we haven't
+       measured it yet.  */
+    unsigned int cost;
+  };
+
+  /* Describes one attempt to combine instructions.  */
+  struct combination_attempt_rec
+  {
+    /* The instruction that we're currently trying to optimize.
+       If the combination succeeds, we'll use this insn_info_rec
+       to describe the new instruction.  */
+    insn_info_rec *new_home;
+
+    /* The instructions we're combining, in program order.  */
+    insn_info_rec *sequence[MAX_COMBINE_INSNS];
+
+    /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
+       live range that describes the substituted register.  */
+    live_range_rec *def_use_range;
+
+    /* The earliest and latest points at which we could insert the
+       combined instruction.  */
+    unsigned int earliest_point;
+    unsigned int latest_point;
+
+    /* The cost of the new instruction, once we have a successful match.  */
+    unsigned int new_cost;
+  };
+
+  /* Pass-specific information about a register.  */
+  struct reg_info_rec
+  {
+    /* The live range associated with the last reference to the register.  */
+    live_range_rec *range;
+
+    /* The point at which the last reference occurred.  */
+    unsigned int next_ref;
+
+    /* True if the register is currently live.  We record this here rather
+       than in a separate bitmap because (a) there's a natural hole for
+       it on LP64 hosts and (b) we only refer to it when updating the
+       other fields, and so recording it here should give better locality.  */
+    unsigned int live_p : 1;
+  };
+
+  live_range_rec *new_live_range (unsigned int, live_range_rec *);
+  live_range_rec *reg_live_range (unsigned int);
+  live_range_rec *mem_live_range ();
+  bool add_range_use (live_range_rec *, insn_info_rec *);
+  void remove_range_use (live_range_rec *, insn_info_rec *);
+  bool has_single_use_p (live_range_rec *);
+  bool known_last_use_p (live_range_rec *, insn_info_rec *);
+  unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
+  unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
+  bool start_combination (combination_attempt_rec &, insn_info_rec *,
+			  insn_info_rec *, live_range_rec * = NULL);
+  bool verify_combination (combination_attempt_rec &);
+  int estimate_reg_pressure_delta (insn_info_rec *);
+  void commit_combination (combination_attempt_rec &, bool);
+  bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
+  bool try_parallelize_insns (combination_attempt_rec &);
+  bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
+  bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
+  bool try_combine_two_uses (combination_attempt_rec &);
+  bool try_combine (insn_info_rec *, rtx, unsigned int);
+  bool optimize_insn (insn_info_rec *);
+  void record_defs (insn_info_rec *);
+  void record_reg_use (insn_info_rec *, df_ref);
+  void record_uses (insn_info_rec *);
+  void process_insn (insn_info_rec *);
+  void start_sequence ();
+
+  /* The function we're optimizing.  */
+  function *m_fn;
+
+  /* The highest pseudo register number plus one.  */
+  unsigned int m_num_regs;
+
+  /* The current basic block.  */
+  basic_block m_bb;
+
+  /* True if we should optimize the current basic block for speed.  */
+  bool m_optimize_for_speed_p;
+
+  /* The point number to allocate to the next instruction we visit
+     in the backward traversal.  */
+  unsigned int m_point;
+
+  /* The point number corresponding to the end of the current
+     instruction sequence, i.e. the lowest point number about which
+     we still have valid information.  */
+  unsigned int m_end_of_sequence;
+
+  /* The point number corresponding to the end of the current basic block.
+     This is the same as M_END_OF_SEQUENCE when processing the last
+     instruction sequence in a basic block.  */
+  unsigned int m_end_of_bb;
+
+  /* The memory live range, or null if we haven't yet found a memory
+     reference in the current instruction sequence.  */
+  live_range_rec *m_mem_range;
+
+  /* Gives information about each register.  We track both hard and
+     pseudo registers.  */
+  auto_vec<reg_info_rec> m_reg_info;
+
+  /* A bitmap of registers whose entry in m_reg_info is valid.  */
+  auto_sbitmap m_valid_regs;
+
+  /* If nonnull, an unused 2-element PARALLEL that we can use to test
+     instruction combinations.  */
+  rtx m_spare_parallel;
+
+  /* A bitmap of instructions that we've already tried to combine with.  */
+  auto_bitmap m_tried_insns;
+
+  /* A temporary bitmap used to hold register numbers.  */
+  auto_bitmap m_true_deps;
+
+  /* An obstack used for allocating insn_info_recs and for building
+     up their lists of definitions and uses.  */
+  obstack m_insn_obstack;
+
+  /* An obstack used for allocating live_range_recs.  */
+  obstack m_range_obstack;
+
+  /* Start-of-object pointers for the two obstacks.  */
+  char *m_insn_obstack_start;
+  char *m_range_obstack_start;
+
+  /* A list of instructions that we've optimized and whose new forms
+     change the cfg.  */
+  auto_vec<rtx_insn *> m_cfg_altering_insns;
+
+  /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS.  */
+  auto_bitmap m_cfg_altering_insn_ids;
+
+  /* We can insert new instructions at point P * 2 by inserting them
+     after M_POINTS[P - M_END_OF_SEQUENCE / 2].  We can insert new
+     instructions at point P * 2 + 1 by inserting them before
+     M_POINTS[P - M_END_OF_SEQUENCE / 2].  */
+  auto_vec<rtx_insn *, 256> m_points;
+};
+
+combine2::combine2 (function *fn)
+  : m_fn (fn),
+    m_num_regs (max_reg_num ()),
+    m_bb (NULL),
+    m_optimize_for_speed_p (false),
+    m_point (2),
+    m_end_of_sequence (m_point),
+    m_end_of_bb (m_point),
+    m_mem_range (NULL),
+    m_reg_info (m_num_regs),
+    m_valid_regs (m_num_regs),
+    m_spare_parallel (NULL_RTX)
+{
+  gcc_obstack_init (&m_insn_obstack);
+  gcc_obstack_init (&m_range_obstack);
+  m_reg_info.quick_grow (m_num_regs);
+  bitmap_clear (m_valid_regs);
+  m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
+  m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
+}
+
+combine2::~combine2 ()
+{
+  obstack_free (&m_insn_obstack, NULL);
+  obstack_free (&m_range_obstack, NULL);
+}
+
+/* Return true if it's possible in principle to combine INSN with
+   other instructions.  ALLOW_ASMS_P is true if the caller can cope
+   with asm statements.  */
+
+static bool
+combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
+{
+  rtx pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
+    return false;
+
+  if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
+    return false;
+
+  if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0)
+    return false;
+
+  return true;
+}
+
+/* Return true if it's possible in principle to move INSN somewhere else,
+   as long as all dependencies are satisfied.  */
+
+static bool
+movable_insn_p (rtx_insn *insn)
+{
+  if (JUMP_P (insn))
+    return false;
+
+  if (volatile_refs_p (PATTERN (insn)))
+    return false;
+
+  return true;
+}
+
+/* A note_stores callback.  Set the bool at *DATA to true if DEST is in
+   memory.  */
+
+static void
+find_mem_def (rtx dest, const_rtx, void *data)
+{
+  /* note_stores has stripped things like subregs and zero_extracts,
+     so we don't need to worry about them here.  */
+  if (MEM_P (dest))
+    *(bool *) data = true;
+}
+
+/* Return true if instruction INSN writes to memory.  */
+
+static bool
+insn_writes_mem_p (rtx_insn *insn)
+{
+  bool saw_mem_p = false;
+  note_stores (insn, find_mem_def, &saw_mem_p);
+  return saw_mem_p;
+}
+
+/* A note_uses callback.  Set the bool at DATA to true if *LOC reads
+   from variable memory.  */
+
+static void
+find_mem_use (rtx *loc, void *data)
+{
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, *loc, NONCONST)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	*(bool *) data = true;
+	break;
+      }
+}
+
+/* Return true if instruction INSN reads memory, including via notes.  */
+
+static bool
+insn_reads_mem_p (rtx_insn *insn)
+{
+  bool saw_mem_p = false;
+  note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p);
+  for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p);
+  return saw_mem_p;
+}
+
+/* Create and return a new live range for REGNO.  NEXT is the next range
+   in program order, or null if this is the first live range in the
+   sequence.  */
+
+combine2::live_range_rec *
+combine2::new_live_range (unsigned int regno, live_range_rec *next)
+{
+  live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
+  memset (range, 0, sizeof (*range));
+
+  range->regno = regno;
+  range->next_range = next;
+  if (next)
+    next->prev_range = range;
+  return range;
+}
+
+/* Return the current live range for register REGNO, creating a new
+   one if necessary.  */
+
+combine2::live_range_rec *
+combine2::reg_live_range (unsigned int regno)
+{
+  /* Initialize the liveness flag, if it isn't already valid for this BB.  */
+  bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
+  if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
+    m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
+
+  /* See if we already have a live range associated with the current
+     instruction sequence.  */
+  live_range_rec *range = NULL;
+  if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
+    range = m_reg_info[regno].range;
+
+  /* Create a new range if this is the first reference to REGNO in the
+     current instruction sequence or if the current range has been closed
+     off by a definition.  */
+  if (!range || range->producer)
+    {
+      range = new_live_range (regno, range);
+
+      /* If the register is live after the current sequence, treat that
+	 as a fake use at the end of the sequence.  */
+      if (!range->next_range && m_reg_info[regno].live_p)
+	range->first_extra_use = range->last_extra_use = m_end_of_sequence;
+
+      /* Record that this is now the current range for REGNO.  */
+      if (first_ref_p)
+	bitmap_set_bit (m_valid_regs, regno);
+      m_reg_info[regno].range = range;
+      m_reg_info[regno].next_ref = m_point;
+    }
+  return range;
+}
+
+/* Return the current live range for memory, treating memory as a single
+   entity.  Create a new live range if necessary.  */
+
+combine2::live_range_rec *
+combine2::mem_live_range ()
+{
+  if (!m_mem_range || m_mem_range->producer)
+    m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
+  return m_mem_range;
+}
+
+/* Record that instruction USER uses the resource described by RANGE.
+   Return true if this is new information.  */
+
+bool
+combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  /* See if we've already recorded the instruction, or if there's a
+     spare use slot we can use.  */
+  unsigned int i = 0;
+  for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return false;
+
+  if (i == NUM_RANGE_USERS)
+    {
+      /* Since we've processed USER recently, assume that it's more
+	 interesting to record explicitly than the last user in the
+	 current list.  Evict that last user and describe it in the
+	 overflow "extra use" range instead.  */
+      insn_info_rec *ousted_user = range->users[--i];
+      if (range->first_extra_use < ousted_user->point)
+	range->first_extra_use = ousted_user->point;
+      if (range->last_extra_use > ousted_user->point)
+	range->last_extra_use = ousted_user->point;
+    }
+
+  /* Insert USER while keeping the list sorted.  */
+  for (; i > 0 && range->users[i - 1]->point < user->point; --i)
+    range->users[i] = range->users[i - 1];
+  range->users[i] = user;
+  return true;
+}
+
+/* Remove USER from the uses recorded for RANGE, if we can.
+   There's nothing we can do if USER was described in the
+   overflow "extra use" range.  */
+
+void
+combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+    if (range->users[i] == user)
+      {
+	for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
+	  range->users[j] = range->users[j + 1];
+	range->users[NUM_RANGE_USERS - 1] = NULL;
+	break;
+      }
+}
+
+/* Return true if RANGE has a single known user.  */
+
+bool
+combine2::has_single_use_p (live_range_rec *range)
+{
+  return range->users[0] && !range->users[1] && !range->first_extra_use;
+}
+
+/* Return true if we know that USER is the last user of RANGE.  */
+
+bool
+combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
+{
+  if (range->last_extra_use <= user->point)
+    return false;
+
+  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
+    else if (range->users[i]->point == user->point)
+      return false;
+
+  gcc_unreachable ();
+}
+
+/* Find the earliest point that we could move I2 up in order to combine
+   it with I1.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
+{
+  if (!movable_insn_p (i2->insn))
+    return i2->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way up to I1.  */
+  unsigned int point = i1->point;
+
+  /* Make sure that the new position preserves all necessary true dependencies
+     on earlier instructions.  */
+  for (live_range_rec **use = i2->uses; *use; ++use)
+    {
+      live_range_rec *range = *use;
+      if (range->producer
+	  && range->producer != i1
+	  && point >= range->producer->point)
+	point = range->producer->point - 1;
+    }
+
+  /* Make sure that the new position preserves all necessary output and
+     anti dependencies on earlier instructions.  */
+  for (live_range_rec **def = i2->defs; *def; ++def)
+    if (live_range_rec *range = (*def)->prev_range)
+      {
+	if (range->producer
+	    && range->producer != i1
+	    && point >= range->producer->point)
+	  point = range->producer->point - 1;
+
+	for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;)
+	  if (range->users[i] && range->users[i] != i1)
+	    {
+	      if (point >= range->users[i]->point)
+		point = range->users[i]->point - 1;
+	      break;
+	    }
+
+	if (range->last_extra_use && point >= range->last_extra_use)
+	  point = range->last_extra_use - 1;
+      }
+
+  return point;
+}
+
+/* Find the latest point that we could move I1 down in order to combine
+   it with I2.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
+{
+  if (!movable_insn_p (i1->insn))
+    return i1->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way down to I2.  */
+  unsigned int point = i2->point;
+
+  /* Make sure that the new position preserves all necessary anti dependencies
+     on later instructions.  */
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    if (live_range_rec *range = (*use)->next_range)
+      if (range->producer != i2 && point <= range->producer->point)
+	point = range->producer->point + 1;
+
+  /* Make sure that the new position preserves all necessary output and
+     true dependencies on later instructions.  */
+  for (live_range_rec **def = i1->defs; *def; ++def)
+    {
+      live_range_rec *range = *def;
+
+      for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	if (range->users[i] != i2)
+	  {
+	    if (range->users[i] && point <= range->users[i]->point)
+	      point = range->users[i]->point + 1;
+	    break;
+	  }
+
+      if (range->first_extra_use && point <= range->first_extra_use)
+	point = range->first_extra_use + 1;
+
+      live_range_rec *next_range = range->next_range;
+      if (next_range
+	  && next_range->producer != i2
+	  && point <= next_range->producer->point)
+	point = next_range->producer->point + 1;
+    }
+
+  return point;
+}
+
+/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
+   where I1 is the instruction that we're currently trying to optimize.
+   If DEF_USE_RANGE is nonnull, I1 defines the value described by
+   DEF_USE_RANGE and I2 uses it.  */
+
+bool
+combine2::start_combination (combination_attempt_rec &attempt,
+			     insn_info_rec *i1, insn_info_rec *i2,
+			     live_range_rec *def_use_range)
+{
+  attempt.new_home = i1;
+  attempt.sequence[0] = i1;
+  attempt.sequence[1] = i2;
+  if (attempt.sequence[0]->point < attempt.sequence[1]->point)
+    std::swap (attempt.sequence[0], attempt.sequence[1]);
+  attempt.def_use_range = def_use_range;
+
+  /* Check that the instructions have no true dependencies other than
+     DEF_USE_RANGE.  */
+  bitmap_clear (m_true_deps);
+  for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
+    if (*def != def_use_range)
+      bitmap_set_bit (m_true_deps, (*def)->regno);
+  for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
+    if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
+      return false;
+
+  /* Calculate the range of points at which the combined instruction
+     could live.  */
+  attempt.earliest_point = find_earliest_point (attempt.sequence[1],
+						attempt.sequence[0]);
+  attempt.latest_point = find_latest_point (attempt.sequence[0],
+					    attempt.sequence[1]);
+  if (attempt.earliest_point < attempt.latest_point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "cannot combine %d and %d: no suitable"
+		 " location for combined insn\n",
+		 INSN_UID (attempt.sequence[0]->insn),
+		 INSN_UID (attempt.sequence[1]->insn));
+      return false;
+    }
+
+  /* Make sure we have valid costs for the original instructions before
+     we start changing their patterns.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i]->cost == UNKNOWN_COST)
+      attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
+					     m_optimize_for_speed_p);
+  return true;
+}
+
+/* Check whether the combination attempt described by ATTEMPT matches
+   an .md instruction (or matches its constraints, in the case of an
+   asm statement).  If so, calculate the cost of the new instruction
+   and check whether it's cheap enough.  */
+
+bool
+combine2::verify_combination (combination_attempt_rec &attempt)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  bool ok_p = verify_changes (0);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      if (!ok_p)
+	fprintf (dump_file, "failed to match this instruction:\n");
+      else if (const char *name = get_insn_name (INSN_CODE (insn)))
+	fprintf (dump_file, "successfully matched this instruction to %s:\n",
+		 name);
+      else
+	fprintf (dump_file, "successfully matched this instruction:\n");
+      print_rtl_single (dump_file, PATTERN (insn));
+    }
+  if (!ok_p)
+    return false;
+
+  unsigned int cost1 = attempt.sequence[0]->cost;
+  unsigned int cost2 = attempt.sequence[1]->cost;
+  attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
+  ok_p = (attempt.new_cost <= cost1 + cost2);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n",
+	     cost1, cost2, attempt.new_cost,
+	     ok_p ? "keeping replacement" : "rejecting replacement");
+  if (!ok_p)
+    return false;
+
+  confirm_change_group ();
+  return true;
+}
+
+/* Return true if we should consider register REGNO when calculating
+   register pressure estimates.  */
+
+static bool
+count_reg_pressure_p (unsigned int regno)
+{
+  if (regno == INVALID_REGNUM)
+    return false;
+
+  /* Unallocatable registers aren't interesting.  */
+  if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
+    return false;
+
+  return true;
+}
+
+/* Try to estimate the effect that the original form of INSN_INFO
+   had on register pressure, in the form "born - dying".  */
+
+int
+combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
+{
+  int delta = 0;
+
+  for (live_range_rec **def = insn_info->defs; *def; ++def)
+    if (count_reg_pressure_p ((*def)->regno))
+      delta += 1;
+
+  for (live_range_rec **use = insn_info->uses; *use; ++use)
+    if (count_reg_pressure_p ((*use)->regno)
+	&& known_last_use_p (*use, insn_info))
+      delta -= 1;
+
+  return delta;
+}
+
+/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
+   FROM_INSN.  Copy any useful information to TO_INSN before doing that.  */
+
+static void
+transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
+{
+  INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
+  INSN_CODE (to_insn) = INSN_CODE (from_insn);
+  REG_NOTES (to_insn) = REG_NOTES (from_insn);
+}
+
+/* The combination attempt in ATTEMPT has succeeded and is currently
+   part of an open validate_change group.  Commit to making the change
+   and decide where the new instruction should go.
+
+   KEPT_DEF_P is true if the new instruction continues to perform
+   the definition described by ATTEMPT.def_use_range.  */
+
+void
+combine2::commit_combination (combination_attempt_rec &attempt,
+			      bool kept_def_p)
+{
+  insn_info_rec *new_home = attempt.new_home;
+  rtx_insn *old_insn = attempt.sequence[0]->insn;
+  rtx_insn *new_insn = attempt.sequence[1]->insn;
+
+  /* Remove any notes that are no longer relevant.  */
+  bool single_set_p = single_set (new_insn);
+  for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	case REG_NOALIAS:
+	  keep_p = single_set_p;
+	  break;
+
+	case REG_UNUSED:
+	  keep_p = false;
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+
+  /* Complete the open validate_change group.  */
+  confirm_change_group ();
+
+  /* Decide where the new instruction should go.  */
+  unsigned int new_point = attempt.latest_point;
+  if (new_point != attempt.earliest_point
+      && prev_real_insn (new_insn) != old_insn)
+    {
+      /* Prefer the earliest point if the combined instruction reduces
+	 register pressure and the latest point if it increases register
+	 pressure.
+
+	 The choice isn't obvious in the event of a tie, but picking
+	 the earliest point should reduce the number of times that
+	 we need to invalidate debug insns.  */
+      int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
+      int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
+      bool move_up_p = (delta1 + delta2 <= 0);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "register pressure delta = %d + %d; using %s position\n",
+		 delta1, delta2, move_up_p ? "earliest" : "latest");
+      if (move_up_p)
+	new_point = attempt.earliest_point;
+    }
+
+  /* Translate inserting at NEW_POINT into inserting before or after
+     a particular insn.  */
+  rtx_insn *anchor = NULL;
+  bool before_p = (new_point & 1);
+  if (new_point != attempt.sequence[1]->point
+      && new_point != attempt.sequence[0]->point)
+    {
+      anchor = m_points[(new_point - m_end_of_sequence) / 2];
+      rtx_insn *other_side = (before_p
+			      ? prev_real_insn (anchor)
+			      : next_real_insn (anchor));
+      /* Inserting next to an insn X and then deleting X is just a
+	 roundabout way of using X as the insertion point.  */
+      if (anchor == new_insn || other_side == new_insn)
+	new_point = attempt.sequence[1]->point;
+      else if (anchor == old_insn || other_side == old_insn)
+	new_point = attempt.sequence[0]->point;
+    }
+
+  /* Actually perform the move.  */
+  if (new_point == attempt.sequence[1]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (new_insn));
+      set_insn_deleted (old_insn);
+    }
+  else if (new_point == attempt.sequence[0]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (old_insn));
+      PATTERN (old_insn) = PATTERN (new_insn);
+      transfer_insn (old_insn, new_insn);
+      std::swap (old_insn, new_insn);
+      set_insn_deleted (old_insn);
+    }
+  else
+    {
+      /* We need to insert a new instruction.  We can't simply move
+	 NEW_INSN because it acts as an insertion anchor in m_points.  */
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "inserting combined insn %s insn %d\n",
+		 before_p ? "before" : "after", INSN_UID (anchor));
+
+      rtx_insn *added_insn = (before_p
+			      ? emit_insn_before (PATTERN (new_insn), anchor)
+			      : emit_insn_after (PATTERN (new_insn), anchor));
+      transfer_insn (added_insn, new_insn);
+      set_insn_deleted (old_insn);
+      set_insn_deleted (new_insn);
+      new_insn = added_insn;
+    }
+  df_insn_rescan (new_insn);
+
+  /* Unlink the old uses.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      remove_range_use (*use, attempt.sequence[i]);
+
+  /* Work out which registers the new pattern uses.  */
+  bitmap_clear (m_true_deps);
+  df_ref use;
+  FOR_EACH_INSN_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+  FOR_EACH_INSN_EQ_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+
+  /* Describe the combined instruction in NEW_HOME.  */
+  new_home->insn = new_insn;
+  new_home->point = new_point;
+  new_home->cost = attempt.new_cost;
+
+  /* Build up a list of definitions for the combined instructions
+     and update all the ranges accordingly.  It shouldn't matter
+     which order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
+      if (kept_def_p || *def != attempt.def_use_range)
+	{
+	  obstack_ptr_grow (&m_insn_obstack, *def);
+	  (*def)->producer = new_home;
+	}
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* Build up a list of uses for the combined instructions and update
+     all the ranges accordingly.  Again, it shouldn't matter which
+     order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      {
+	live_range_rec *range = *use;
+	if (range != attempt.def_use_range
+	    && (range->regno == INVALID_REGNUM
+		? insn_reads_mem_p (new_insn)
+		: bitmap_bit_p (m_true_deps, range->regno))
+	    && add_range_use (range, new_home))
+	  obstack_ptr_grow (&m_insn_obstack, range);
+      }
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* There shouldn't be any remaining references to other instructions
+     in the combination.  Invalidate their contents to make lingering
+     references a noisy failure.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i] != new_home)
+      {
+	attempt.sequence[i]->insn = NULL;
+	attempt.sequence[i]->point = ~0U;
+      }
+
+  /* Unlink the def-use range.  */
+  if (!kept_def_p && attempt.def_use_range)
+    {
+      live_range_rec *range = attempt.def_use_range;
+      if (range->prev_range)
+	range->prev_range->next_range = range->next_range;
+      else
+	m_reg_info[range->regno].range = range->next_range;
+      if (range->next_range)
+	range->next_range->prev_range = range->prev_range;
+    }
+
+  /* Record instructions whose new form alters the cfg.  */
+  rtx pattern = PATTERN (new_insn);
+  if ((returnjump_p (new_insn)
+       || any_uncondjump_p (new_insn)
+       || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
+      && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
+    m_cfg_altering_insns.safe_push (new_insn);
+}
+
+/* Return true if X1 and X2 are memories and if X1 does not have
+   a higher alignment than X2.  */
+
+static bool
+dubious_mem_pair_p (rtx x1, rtx x2)
+{
+  return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
+}
+
+/* Try to implement ATTEMPT using (parallel [SET1 SET2]).  */
+
+bool
+combine2::try_parallel_sets (combination_attempt_rec &attempt,
+			     rtx set1, rtx set2)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  /* Combining two loads or two stores can be useful on targets that
+     allow them to be treated as a single access.  However, we use a
+     very peephole approach to picking the pairs, so we need to be
+     relatively confident that we're making a good choice.
+
+     For now just aim for cases in which the memory references are
+     consecutive and the first reference has a higher alignment.
+     We can leave the target to test the consecutive part; whatever test
+     we added here might be different from the target's, and in any case
+     it's fine if the target accepts other well-aligned cases too.  */
+  if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
+      || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
+    return false;
+
+  /* Cache the PARALLEL rtx between attempts so that we don't generate
+     too much garbage rtl.  */
+  if (!m_spare_parallel)
+    {
+      rtvec vec = gen_rtvec (2, set1, set2);
+      m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
+    }
+  else
+    {
+      XVECEXP (m_spare_parallel, 0, 0) = set1;
+      XVECEXP (m_spare_parallel, 0, 1) = set2;
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
+  if (verify_combination (attempt))
+    {
+      m_spare_parallel = NULL_RTX;
+      return true;
+    }
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* Try to parallelize the two instructions in ATTEMPT.  */
+
+bool
+combine2::try_parallelize_insns (combination_attempt_rec &attempt)
+{
+  rtx_insn *i1_insn = attempt.sequence[0]->insn;
+  rtx_insn *i2_insn = attempt.sequence[1]->insn;
+
+  /* Can't parallelize asm statements.  */
+  if (asm_noperands (PATTERN (i1_insn)) >= 0
+      || asm_noperands (PATTERN (i2_insn)) >= 0)
+    return false;
+
+  /* For now, just handle the case in which both instructions are
+     single sets.  We could handle more than 2 sets as well, but few
+     targets support that anyway.  */
+  rtx set1 = single_set (i1_insn);
+  if (!set1)
+    return false;
+  rtx set2 = single_set (i2_insn);
+  if (!set2)
+    return false;
+
+  /* Make sure that we have structural proof that the destinations
+     are independent.  Things like alias analysis rely on semantic
+     information and assume no undefined behavior, which is rarely a
+     good enough guarantee to allow a useful instruction combination.  */
+  rtx dest1 = SET_DEST (set1);
+  rtx dest2 = SET_DEST (set2);
+  if (MEM_P (dest1)
+      ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false)
+      : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
+    return false;
+
+  /* Try the sets in both orders.  */
+  if (try_parallel_sets (attempt, set1, set2)
+      || try_parallel_sets (attempt, set2, set1))
+    {
+      commit_combination (attempt, true);
+      if (MAY_HAVE_DEBUG_BIND_INSNS
+	  && attempt.new_home->insn != i1_insn)
+	propagate_for_debug (i1_insn, attempt.new_home->insn,
+			     SET_DEST (set1), SET_SRC (set1), m_bb);
+      return true;
+    }
+  return false;
+}
+
+/* Replace DEST with SRC in the register notes for INSN.  */
+
+static void
+substitute_into_note (rtx_insn *insn, rtx dest, rtx src)
+{
+  for (rtx *note_ptr = &REG_NOTES (insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	  keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0),
+						  dest, src);
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+}
+
+/* A subroutine of try_combine_def_use.  Try replacing DEST with SRC
+   in ATTEMPT.  SRC might be either the original SET_SRC passed to the
+   parent routine or a value pulled from a note; SRC_IS_NOTE_P is true
+   in the latter case.  */
+
+bool
+combine2::try_combine_def_use_1 (combination_attempt_rec &attempt,
+				 rtx dest, rtx src, bool src_is_note_p)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+
+  /* Mimic combine's behavior by not combining moves from allocatable hard
+     registers (e.g. when copying parameters or function return values).  */
+  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
+    return false;
+
+  /* Don't mess with volatile references.  For one thing, we don't yet
+     know how many copies of SRC we'll need.  */
+  if (volatile_refs_p (src))
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to combine %d and %d%s:\n",
+	       INSN_UID (def_insn), INSN_UID (use_insn),
+	       src_is_note_p ? " using equal/equiv note" : "");
+      dump_insn_slim (dump_file, def_insn);
+      dump_insn_slim (dump_file, use_insn);
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn),
+				      dest, src))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "combination failed -- unable to substitute"
+		 " all uses\n");
+      return false;
+    }
+
+  /* Try matching the instruction on its own if DEST isn't used elsewhere.  */
+  if (has_single_use_p (attempt.def_use_range)
+      && verify_combination (attempt))
+    {
+      live_range_rec *next_range = attempt.def_use_range->next_range;
+      substitute_into_note (use_insn, dest, src);
+      commit_combination (attempt, false);
+      if (MAY_HAVE_DEBUG_BIND_INSNS)
+	{
+	  rtx_insn *end_of_range = (next_range
+				    ? next_range->producer->insn
+				    : BB_END (m_bb));
+	  propagate_for_debug (def_insn, end_of_range, dest, src, m_bb);
+	}
+      return true;
+    }
+
+  /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN
+     pattern.  */
+  if (try_parallelize_insns (attempt))
+    return true;
+
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* ATTEMPT describes an attempt to substitute the result of the first
+   instruction into the second instruction.  Try to implement it,
+   given that the first instruction sets DEST to SRC.  */
+
+bool
+combine2::try_combine_def_use (combination_attempt_rec &attempt,
+			       rtx dest, rtx src)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+  rtx def_note = find_reg_equal_equiv_note (def_insn);
+
+  /* First try combining the instructions in their original form.  */
+  if (try_combine_def_use_1 (attempt, dest, src, false))
+    return true;
+
+  /* Try to replace DEST with a REG_EQUAL/EQUIV value instead.  */
+  if (def_note
+      && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+    return true;
+
+  /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try
+     using that instead of the main pattern.  */
+  for (rtx *link_ptr = &REG_NOTES (use_insn); *link_ptr;
+       link_ptr = &XEXP (*link_ptr, 1))
+    {
+      rtx use_note = *link_ptr;
+      if (REG_NOTE_KIND (use_note) != REG_EQUAL
+	  && REG_NOTE_KIND (use_note) != REG_EQUIV)
+	continue;
+
+      rtx use_set = single_set (use_insn);
+      if (!use_set)
+	break;
+
+      if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0)))
+	continue;
+
+      /* Try snipping out the note and putting it in the SET instead.  */
+      validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1);
+      validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1);
+
+      if (try_combine_def_use_1 (attempt, dest, src, false))
+	return true;
+
+      if (def_note
+	  && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+	return true;
+
+      cancel_changes (0);
+    }
+
+  return false;
+}
+
+/* ATTEMPT describes an attempt to combine two instructions that use
+   the same resource.  Try to implement it, returning true on success.  */
+
+bool
+combine2::try_combine_two_uses (combination_attempt_rec &attempt)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to parallelize %d and %d:\n",
+	       INSN_UID (attempt.sequence[0]->insn),
+	       INSN_UID (attempt.sequence[1]->insn));
+      dump_insn_slim (dump_file, attempt.sequence[0]->insn);
+      dump_insn_slim (dump_file, attempt.sequence[1]->insn);
+    }
+
+  return try_parallelize_insns (attempt);
+}
+
+/* Try to optimize instruction I1.  Return true on success.  */
+
+bool
+combine2::optimize_insn (insn_info_rec *i1)
+{
+  combination_attempt_rec attempt;
+
+  if (!combinable_insn_p (i1->insn, false))
+    return false;
+
+  rtx set = single_set (i1->insn);
+  if (!set)
+    return false;
+
+  /* First try combining I1 with a user of its result.  */
+  rtx dest = SET_DEST (set);
+  rtx src = SET_SRC (set);
+  if (REG_P (dest) && REG_NREGS (dest) == 1)
+    for (live_range_rec **def = i1->defs; *def; ++def)
+      if ((*def)->regno == REGNO (dest))
+	{
+	  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	    {
+	      insn_info_rec *use = (*def)->users[i];
+	      if (use
+		  && combinable_insn_p (use->insn, has_single_use_p (*def))
+		  && start_combination (attempt, i1, use, *def)
+		  && try_combine_def_use (attempt, dest, src))
+		return true;
+	    }
+	  break;
+	}
+
+  /* Try parallelizing I1 and another instruction that uses the same
+     resource.  */
+  bitmap_clear (m_tried_insns);
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+      {
+	insn_info_rec *i2 = (*use)->users[i];
+	if (i2
+	    && i2 != i1
+	    && combinable_insn_p (i2->insn, false)
+	    && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn))
+	    && start_combination (attempt, i1, i2)
+	    && try_combine_two_uses (attempt))
+	  return true;
+      }
+
+  return false;
+}
+
+/* Record all register and memory definitions in INSN_INFO and fill in its
+   "defs" list.  */
+
+void
+combine2::record_defs (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register definitions.  */
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, insn)
+    {
+      rtx reg = DF_REF_REAL_REG (def);
+      unsigned int end_regno = END_REGNO (reg);
+      for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+	{
+	  live_range_rec *range = reg_live_range (regno);
+	  range->producer = insn_info;
+	  m_reg_info[regno].live_p = false;
+	  obstack_ptr_grow (&m_insn_obstack, range);
+	}
+    }
+
+  /* If the instruction writes to memory, record that too.  */
+  if (insn_writes_mem_p (insn))
+    {
+      live_range_rec *range = mem_live_range ();
+      range->producer = insn_info;
+      obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of definitions.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Record that INSN_INFO contains register use USE.  If this requires
+   new entries to be added to INSN_INFO->uses, add those entries to the
+   list we're building in m_insn_obstack.  */
+
+void
+combine2::record_reg_use (insn_info_rec *insn_info, df_ref use)
+{
+  rtx reg = DF_REF_REAL_REG (use);
+  unsigned int end_regno = END_REGNO (reg);
+  for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+    {
+      live_range_rec *range = reg_live_range (regno);
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+      m_reg_info[regno].live_p = true;
+    }
+}
+
+/* Record all register and memory uses in INSN_INFO and fill in its
+   "uses" list.  */
+
+void
+combine2::record_uses (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register uses in the main pattern.  */
+  df_ref use;
+  FOR_EACH_INSN_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Treat REG_EQUAL uses as first-class uses.  We don't lose much
+     by doing that, since it's rare for a REG_EQUAL note to mention
+     registers that the main pattern doesn't.  It also gives us the
+     maximum freedom to use REG_EQUAL notes in place of the main pattern.  */
+  FOR_EACH_INSN_EQ_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Record a memory use if either the pattern or the notes read from
+     memory.  */
+  if (insn_reads_mem_p (insn))
+    {
+      live_range_rec *range = mem_live_range ();
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of uses.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Start a new instruction sequence, discarding all information about
+   the previous one.  */
+
+void
+combine2::start_sequence (void)
+{
+  m_end_of_sequence = m_point;
+  m_mem_range = NULL;
+  m_points.truncate (0);
+  obstack_free (&m_insn_obstack, m_insn_obstack_start);
+  obstack_free (&m_range_obstack, m_range_obstack_start);
+}
+
+/* Run the pass on the current function.  */
+
+void
+combine2::execute (void)
+{
+  df_analyze ();
+  FOR_EACH_BB_FN (m_bb, cfun)
+    {
+      m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb);
+      m_end_of_bb = m_point;
+      start_sequence ();
+
+      rtx_insn *insn, *prev;
+      FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  /* The current m_point represents the end of the sequence if
+	     INSN is the last instruction in the sequence, otherwise it
+	     represents the gap between INSN and the next instruction.
+	     m_point + 1 represents INSN itself.
+
+	     Instructions can be added to m_point by inserting them
+	     after INSN.  They can be added to m_point + 1 by inserting
+	     them before INSN.  */
+	  m_points.safe_push (insn);
+	  m_point += 1;
+
+	  insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec);
+	  insn_info->insn = insn;
+	  insn_info->point = m_point;
+	  insn_info->cost = UNKNOWN_COST;
+
+	  record_defs (insn_info);
+	  record_uses (insn_info);
+
+	  /* Set up m_point for the next instruction.  */
+	  m_point += 1;
+
+	  if (CALL_P (insn))
+	    start_sequence ();
+	  else
+	    while (optimize_insn (insn_info))
+	      gcc_assert (insn_info->insn);
+	}
+    }
+
+  /* If an instruction changes the cfg, update the containing block
+     accordingly.  */
+  rtx_insn *insn;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn)
+    if (JUMP_P (insn))
+      {
+	mark_jump_label (PATTERN (insn), insn, 0);
+	update_cfg_for_uncondjump (insn);
+      }
+    else
+      {
+	remove_edge (split_block (BLOCK_FOR_INSN (insn), insn));
+	emit_barrier_after_bb (BLOCK_FOR_INSN (insn));
+      }
+
+  /* Propagate the above block-local cfg changes to the rest of the cfg.  */
+  if (!m_cfg_altering_insns.is_empty ())
+    {
+      if (dom_info_available_p (CDI_DOMINATORS))
+	free_dominance_info (CDI_DOMINATORS);
+      timevar_push (TV_JUMP);
+      rebuild_jump_labels (get_insns ());
+      cleanup_cfg (0);
+      timevar_pop (TV_JUMP);
+    }
+}
+
+const pass_data pass_data_combine2 =
+{
+  RTL_PASS, /* type */
+  "combine2", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_COMBINE2, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_combine2 : public rtl_opt_pass
+{
+public:
+  pass_combine2 (gcc::context *ctxt, int flag)
+    : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag)
+  {}
+
+  bool
+  gate (function *) OVERRIDE
+  {
+    return optimize && (param_run_combine & m_flag) != 0;
+  }
+
+  unsigned int
+  execute (function *f) OVERRIDE
+  {
+    combine2 (f).execute ();
+    return 0;
+  }
+
+private:
+  unsigned int m_flag;
+}; // class pass_combine2
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_combine2_before (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 1);
+}
+
+rtl_opt_pass *
+make_pass_combine2_after (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 4);
+}
Segher Boessenkool Nov. 18, 2019, 11:56 p.m. UTC | #3
Hi!

On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).

Do you have an example of that?

> Combine currently requires the combined instruction to live at the same
> location as i3.

Or i2 and i3.

> I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.

s/difficult/impossible/, yes.

A long time ago we even had to move insns only forward for correctness,
but that should no longer be required: combine is always finite by other
means now.

> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.

In my dreams tracking nonzero bits would be a dataflow problem.

> This is still far from being a realistic replacement for the even
> the combine parts of the current combine pass.  E.g.:
> 
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.

And combine does any of {2,3,4}->{1,2} combinations, and it also can
modify a third insn ("other_insn").  For the bigger ->1 combos, if it
*can* be decomposed into a bunch of 2->1 combinations, then those result
in insns that have greater cost than those we started with (or else those
combinations *would* be done).  For the ->2 combinations, there are many
ways those two insns can be formed: it can be the two arms of a parallel,
or combine can break a non-matching insn into two at what looks like a
good spot for that, or it can use a define_split for it.

All those things lead to many more successful combinations :-)
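
To make the cost argument concrete, here is a toy sketch.  It is not
combine's actual algorithm; the cost numbers and the pairwise_accepts /
direct_accepts helpers are made up for illustration.  It models three
insns where the first 2->1 step is rejected because the intermediate
insn costs more than the two it replaces, even though the direct 3->1
form would be cheaper than the three originals:

```cpp
#include <cassert>

// Hypothetical costs, chosen only to illustrate the point.
struct toy_costs
{
  int insn_a, insn_b, insn_c;   // the three original insns
  int ab;                       // cost of the intermediate a+b insn
  int abc;                      // cost of the fully combined insn
};

// A greedy 2->1 scheme accepts a step only if the new insn is no more
// expensive than the two insns it replaces.
static bool
pairwise_accepts (const toy_costs &c)
{
  if (c.ab > c.insn_a + c.insn_b)
    return false;               // first step rejected; abc is never tried
  return c.abc <= c.ab + c.insn_c;
}

// A direct 3->1 attempt compares against all three originals at once.
static bool
direct_accepts (const toy_costs &c)
{
  return c.abc <= c.insn_a + c.insn_b + c.insn_c;
}
```

With costs of 4 for each original insn, 12 for the intermediate and 8
for the final form, the pairwise scheme gives up at the first step while
the direct attempt succeeds.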

> On a more positive note, the pass handles things that the current
> combine pass doesn't:
> 
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).

How are dependencies represented in your new pass?  If it just walks
over the insn stream for everything, you get quadratic complexity
if you move insns backwards.  We have that in combine already, mostly
from modified_between_p, but that is limited because of how LOG_LINKS
work, and we have been doing this for so long with no problems found
that it must work in practice.  But I am worried about it when moving
insns back an unlimited distance.
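
For reference, find_latest_point in the patch above appears to avoid
such walks by consulting per-range records (producer and user points)
rather than the insns in between.  A much simplified sketch of that
idea follows; toy_range and toy_latest_point are hypothetical names, and
the real function also handles definitions, extra uses and the
"producer is I2" exception:

```cpp
#include <cassert>
#include <vector>

// Points count upwards from the end of the block, as in the patch, so a
// larger point means an earlier position.
struct toy_range
{
  unsigned int next_def_point;  // point of the next redefinition, 0 if none
};

// Return the latest (numerically smallest) point to which an instruction
// that reads USES can sink, starting from the optimistic TARGET_POINT.
// Only the instruction's own operands are inspected; the instruction
// stream between the two points is never walked.
static unsigned int
toy_latest_point (const std::vector<toy_range> &uses,
                  unsigned int target_point)
{
  unsigned int point = target_point;
  for (const toy_range &range : uses)
    if (range.next_def_point != 0 && point <= range.next_def_point)
      point = range.next_def_point + 1;   // stay just before the redefinition
  return point;
}
```

The work done is proportional to the number of operands, not to the
distance moved, which is the property the question above is about.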

If combine results in two insns it puts them at i2 and i3, and it can
actually move a SET to i2 that was at i3 before the combination.

> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.

That is what combine does, too.

> - it tries using REG_EQUAL notes for the final instruction.

And that.

> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.

That happens only if there is somehow a link between the two (so essentially
never).  The only combinations tried by combine are those via LOG_LINKs,
which are between a SET and the first corresponding use.  This is a key
factor that makes it kind of linear (instead of exponential) complexity.

> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.

But how many other insns does it look at, say by modified_between_p or
the like?

> The patch adds two instances of the new pass: one before combine and
> one after it.

One thing I want to do is some mini-combine after every split, probably
only with the insns new from the split.  But we have no cfglayout mode
anymore then, and only hard regs (except in the first split pass, which
is just a little later than your new pass).

> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
> 
> run-combine=2 (normal combine):  100.0% (baseline)
> run-combine=4 (new pass only)     98.0%
> run-combine=6 (both passes)      100.3%
> 
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.

And amount of garbage produced?

> To get a feel for the effect on multiple targets, I did my usual
> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> and g++.dg, this time comparing run-combine=2 and run-combine=6
> using -O2 -ftree-vectorize:

One problem with this is that these are very short functions on average.

What is the kind of changes you see for other targets?

Wait, does this combine sets with a hard reg source as well?  It
shouldn't do that, that is RA's job; doing this in a greedy way is a
bad idea.  (I haven't yet verified if you do this, fwiw).

> Inevitably there was some scan-assembler fallout for other tests.
> E.g. in gcc.target/aarch64/vmov_n_1.c:
> 
> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>   ...
>   INHIB_OPTIMIZATION;							\
>   (a) = TEST (test, data_len);						\
>   INHIB_OPTIMIZATION;							\
>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\
> 
> is no longer effective for preventing move (a) from being merged
> into (b), because the pass can merge at the point of (a).

It never was effective for that.  Unless (b) lives in memory, in which
case your new pass has a bug here.

> I think
> this is a valid thing to do -- the asm semantics are still satisfied,
> and asm volatile ("" : : : "memory") never acted as a register barrier.
> But perhaps we should deal with this as a special case?

I don't think we should, no.  What does "register barrier" even mean,
*exactly*?

(I'll look at the new version of your patch, but not today).


Segher
Richard Sandiford Nov. 19, 2019, 11:33 a.m. UTC | #4
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
>> While working on SVE, I've noticed several cases in which we fail
>> to combine instructions because the combined form would need to be
>> placed earlier in the instruction stream than the last of the
>> instructions being combined.  This includes one very important
>> case in the handling of the first fault register (FFR).
>
> Do you have an example of that?

It's difficult to share realistic examples at this stage since this
isn't really the right forum for making them public for the first time.
But in rtl terms we have:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

and want to combine the first and third instruction at the site of the
first instruction.  Current combine gives:

Trying 18 -> 24:
   18: r102:VNx16BI=ffrt:VNx16BI
   24: cc:CC_NZC=unspec[r106:VNx16BI,r106:VNx16BI,0x1,r102:VNx16BI] 104
Can't combine i2 into i3

because of:

      /* Make sure that the value that is to be substituted for the register
	 does not use any registers whose values alter in between.  However,
	 If the insns are adjacent, a use can't cross a set even though we
	 think it might (this can happen for a sequence of insns each setting
	 the same destination; last_set of that register might point to
	 a NOTE).  If INSN has a REG_EQUIV note, the register is always
	 equivalent to the memory so the substitution is valid even if there
	 are intervening stores.  Also, don't move a volatile asm or
	 UNSPEC_VOLATILE across any other insns.  */
      || (! all_adjacent
	  && (((!MEM_P (src)
		|| ! find_reg_note (insn, REG_EQUIV, src))
	       && modified_between_p (src, insn, i3))
	      || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
	      || GET_CODE (src) == UNSPEC_VOLATILE))

>> Combine currently requires the combined instruction to live at the same
>> location as i3.
>
> Or i2 and i3.
>
>> I thought about trying to relax that restriction, but it
>> would be difficult to do with the current pass structure while keeping
>> everything linear-ish time.
>
> s/difficult/impossible/, yes.
>
> A long time ago we had to only move insns forward for correctness even,
> but that should no longer be required, combine always is finite by other
> means now.
>
>> So this patch instead goes for an option that has been talked about
>> several times over the years: writing a new combine pass that just
>> does instruction combination, and not all the other optimisations
>> that have been bolted onto combine over time.  E.g. it deliberately
>> doesn't do things like nonzero-bits tracking, since that really ought
>> to be a separate, more global, optimisation.
>
> In my dreams tracking nonzero bits would be a dataflow problem.
>
>> This is still far from being a realistic replacement for even
>> the combine parts of the current combine pass.  E.g.:
>> 
>> - it only handles combinations that can be built up from individual
>>   two-instruction combinations.
>
> And combine does any of {2,3,4}->{1,2} combinations, and it also can
> modify a third insn ("other_insn").  For the bigger ->1 combos, if it
> *can* be decomposed in a bunch of 2->1, then those result in insns that
> are greater cost than those we started with (or else those combinations
> *would* be done).  For the ->2 combinations, there are many ways those
> two insns can be formed: it can be the two arms of a parallel, or
> combine can break a non-matching insn into two at what looks like a good
> spot for that, or it can use a define_split for it.
>
> All those things lead to many more successful combinations :-)

Right.  I definitely want to support multi-insn combos too.  It's one of
the TODOs in the head comment, along with the other points in this list.
Like I say, it's not yet a realistic replacement for even the combine parts
of the current pass.

>> On a more positive note, the pass handles things that the current
>> combine pass doesn't:
>> 
>> - the main motivating feature mentioned above: it works out where
>>   the combined instruction could validly live and moves it there
>>   if necessary.  If there is a range of valid places, it tries
>>   to pick the best one based on register pressure (although only
>>   with a simple heuristic for now).
>
> How are dependencies represented in your new pass?  If it just does
> walks over the insn stream for everything, you get quadratic complexity
> if you move insns backwards.  We have that in combine already, mostly
> from modified_between_p, but that is limited because of how LOG_LINKS
> work, and we have been doing this for so long and there are no problems
> found with it, so it must work in practice.  But I am worried about it
> when moving insns back an unlimited distance.

It builds def-use chains, but using a constant limit on the number of
explicitly-recorded uses.  All other uses go in a numerical live range
from which they (conservatively) never escape.  The def-use chains
represent memory as a single entity, a bit like in gimple.

I avoided the rtlanal.c dependency routines for exactly this reason. :-)
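
As a rough illustration of the bookkeeping described above, here is a toy
model in C.  It is not code from the patch: the names (toy_def,
MAX_RECORDED_USES and so on) and the plain-integer program points are
assumptions made for the sketch.  Each definition records up to a constant
number of uses individually; any further uses are folded into one
conservative numerical range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constant: how many uses are recorded individually.  */
#define MAX_RECORDED_USES 4

/* Toy record for one definition of a register, or of memory treated
   as a single entity.  Program points are plain increasing integers.  */
struct toy_def
{
  int def_point;                  /* where the value is defined */
  int uses[MAX_RECORDED_USES];    /* individually recorded use points */
  size_t num_uses;                /* number of valid entries in USES */
  bool has_range;                 /* true once some uses overflowed */
  int range_first;                /* first point of the overflow range */
  int range_last;                 /* last point of the overflow range */
};

/* Start tracking a definition made at DEF_POINT.  */
static void
toy_def_init (struct toy_def *def, int def_point)
{
  assert (def != NULL);
  def->def_point = def_point;
  def->num_uses = 0;
  def->has_range = false;
  def->range_first = 0;
  def->range_last = 0;
}

/* Record a use at POINT.  This is constant time: either fill a free
   slot or widen the overflow range.  */
static void
toy_def_record_use (struct toy_def *def, int point)
{
  assert (def != NULL);
  if (def->num_uses < MAX_RECORDED_USES)
    {
      def->uses[def->num_uses++] = point;
      return;
    }
  if (!def->has_range)
    {
      def->has_range = true;
      def->range_first = point;
      def->range_last = point;
      return;
    }
  if (point < def->range_first)
    def->range_first = point;
  if (point > def->range_last)
    def->range_last = point;
}

/* Return true if the value might be used somewhere in [FIRST, LAST].
   The answer is exact for recorded uses and conservative for the
   overflow range.  The loop is bounded by MAX_RECORDED_USES.  */
static bool
toy_def_maybe_used_in_p (const struct toy_def *def, int first, int last)
{
  assert (def != NULL && first <= last);
  for (size_t i = 0; i < def->num_uses; i++)
    if (def->uses[i] >= first && def->uses[i] <= last)
      return true;
  return (def->has_range
          && def->range_first <= last
          && first <= def->range_last);
}
```

With uses at points 12, 15, 20, 22, 30 and 41, the first four are recorded
individually and the last two collapse into the range [30, 41].  A query
for [31, 35] then answers "maybe used" even though no use sits there,
which is the conservative behaviour described above, and no query ever
walks the insn stream.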

> If combine results in two insns it puts them at i2 and i3, and it can
> actually move a SET to i2 that was at i3 before the combination.
>
>> - once it has combined two instructions, it can try combining the
>>   result with both later and earlier code, i.e. it can combine
>>   in both directions.
>
> That is what combine does, too.

Yeah, that part was bogus, sorry.

>> - it tries using REG_EQUAL notes for the final instruction.
>
> And that.

I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
with i3's REG_EQUAL note and combining into that.  Does combine do that?
I couldn't see it, and in:

   https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html

you seemed to reject the idea of allowing it.

>> - it can parallelise two independent instructions that both read from
>>   the same register or both read from memory.
>
> That only if somehow there is a link between the two (so essentially
> never).  The only combinations tried by combine are those via LOG_LINKs,
> which are between a SET and the first corresponding use.  This is a key
> factor that makes it kind of linear (instead of exponential) complexity.

Tracking limited def-use chains is what makes this last bit easy.
We can just try parallelising two instructions from the (bounded) list
of uses.  And for this case there's not any garbage rtl involved, since
we reuse the same PARALLEL rtx between attempts.  The cost is basically
all in the recog call (which would obviously mount up if we went
overboard).

The new pass also tries combining definitions with uses later than the
first, but of course in that case we need to keep the original set in
parallel.
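
A minimal sketch of the "bounded list of uses" idea, again as a toy model
rather than code from the patch.  Everything named toy_* is hypothetical,
the recognizer callback merely stands in for recog, and trying every pair
from the list is an assumption: the thread does not say exactly which
pairs the pass attempts.  The point is that one scratch object is reused
for each attempt, so the cost is dominated by the recognizer calls, and
the number of calls is bounded once the list length is bounded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Scratch pair that is overwritten for every attempt, so a failed
   attempt allocates nothing.  */
struct toy_parallel
{
  int first_insn;
  int second_insn;
};

/* Hypothetical recognizer callback standing in for recog.  */
typedef bool (*toy_recog_fn) (const struct toy_parallel *);

/* Example recognizer for the sketch: accept only insns 3 and 7.  */
static bool
toy_accept_3_and_7 (const struct toy_parallel *par)
{
  return par->first_insn == 3 && par->second_insn == 7;
}

/* Try each pair of insns from USES in order.  On success, leave the
   recognized pair in *SCRATCH and return true.  *NUM_ATTEMPTS is set
   to the number of recognizer calls that were made.  */
static bool
toy_try_parallel_uses (const int *uses, size_t num_uses,
                       toy_recog_fn recog, struct toy_parallel *scratch,
                       size_t *num_attempts)
{
  assert (recog != NULL && scratch != NULL && num_attempts != NULL);
  *num_attempts = 0;
  for (size_t i = 0; i < num_uses; i++)
    for (size_t j = i + 1; j < num_uses; j++)
      {
        scratch->first_insn = uses[i];
        scratch->second_insn = uses[j];
        ++*num_attempts;
        if (recog (scratch))
          return true;
      }
  return false;
}
```

For a list of n uses this makes at most n * (n - 1) / 2 recognizer calls,
which is a constant per definition once n is capped.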

>> The pass is supposed to be linear time without debug insns.
>> It only tries a constant number C of combinations per instruction
>> and its bookkeeping updates are constant-time.
>
> But how many other insns does it look at, say by modified_between_p or
> the like?

Hope the above answers this.

>> The patch adds two instances of the new pass: one before combine and
>> one after it.
>
> One thing I want to do is some mini-combine after every split, probably
> only with the insns new from the split.  But we have no cfglayout mode
> anymore then, and only hard regs (except in the first split pass, which
> is just a little later than your new pass).

Yeah, sounds like it could be useful.  I guess there'd need to be
an extra condition on the combination that the new insn can't be
immediately split.

>> As far as compile-time goes, I tried compiling optabs.ii at -O2
>> with an --enable-checking=release compiler:
>> 
>> run-combine=2 (normal combine):  100.0% (baseline)
>> run-combine=4 (new pass only)     98.0%
>> run-combine=6 (both passes)      100.3%
>> 
>> where the results are easily outside the noise.  So the pass on
>> its own is quicker than combine, but that's not a fair comparison
>> when it doesn't do everything combine does.  Running both passes
>> only has a slight overhead.
>
> And amount of garbage produced?

If -ftime-report stats are accurate, then the total amount of
memory allocated is:

run-combine=2 (normal combine): 1793 kB
run-combine=4 (new pass only):    98 kB
run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)

But again that's not a fair comparison when the main combine pass does more.

I did try hard to keep the amount of garbage rtl down though.  This is
why I added validate_simplify_replace_rtx rather than trying to make
do with existing routines.  It should only create new rtl if the
simplification routines did something useful.  (Of course, that's mostly
true of combine as well, but things like the make_compound_operation/
expand_compound_operation wrangler can create expressions that are never
actually useful.)

>> To get a feel for the effect on multiple targets, I did my usual
>> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
>> and g++.dg, this time comparing run-combine=2 and run-combine=6
>> using -O2 -ftree-vectorize:
>
> One problem with this is that these are very short functions on average.

There are some long ones too :-)

> What is the kind of changes you see for other targets?

On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
in favour of other flag-setting instructions and making more use of
post-increments.  Not sure the last one is actually a win, but the
target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:

@@ -48,9 +48,8 @@
        blr
        .align 4
 .L19:
-       cmpdi 0,10,0
+       mr. 9,10
        mr 3,8
-       mr 9,10
        bne 0,.L9
        b .L3
        .align 4

and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:

@@ -16,24 +16,22 @@
        mflr 0
        li 9,50
        mtctr 9
-       li 8,1
+       li 10,1
        li 7,1
        std 0,16(1)
        stdu 1,-32(1)
 .LCFI0:
 .L2:
-       addi 10,8,1
-       extsw 8,10
+       addi 9,10,1
+       extsw 10,9
        bdz .L11
-       slw 9,7,10
-       rlwinm 9,9,0,0xff
-       cmpwi 0,9,0
+       slw 8,7,9
+       andi. 9,8,0xff
        beq 0,.L3
-       addi 10,8,1
-       slw 9,7,10
-       extsw 8,10
-       rlwinm 9,9,0,0xff
-       cmpwi 0,9,0
+       addi 9,10,1
+       slw 8,7,9
+       extsw 10,9
+       andi. 9,8,0xff
        bne 0,.L2
 .L3:
        li 3,0

gcc.c-torture/execute/20081218-1.c is an example where we make more use
of post-increment:

 .L9:
-       lbz 10,1(9)
-       addi 9,9,1
+       lbzu 10,1(9)
        cmpwi 0,10,38
        bne 0,.L8
-       lbz 10,1(9)
-       addi 9,9,1
+       lbzu 10,1(9)
        cmpwi 0,10,38
        bne 0,.L8
        bdnz .L9

The changes for s390x-linux-gnu are also often flag-related.  E.g.
gcc.c-torture/execute/pr68624.c:

@@ -27,9 +27,8 @@
 .L9:
        larl    %r2,d
        larl    %r3,.LANCHOR0
-       l       %r2,0(%r2)
+       icm     %r2,15,0(%r2)
        st      %r2,0(%r3)
-       ltr     %r2,%r2
        jne     .L11
        lhi     %r2,-4
        st      %r2,0(%r1)

where we move the flag-setting up and gcc.c-torture/execute/20050826-2.c:

@@ -62,8 +62,7 @@
        lgr     %r3,%r9
        lghi    %r2,0
        brasl   %r14,inet_check_attr
-       ltr     %r2,%r2
-       lr      %r12,%r2
+       ltr     %r12,%r2
        jne     .L16
        lgr     %r1,%r9
        lhi     %r3,-7

where we eliminate a separate move, like in the first powerpc64le
example above.

> Wait, does this combine sets with a hard reg source as well?  It
> shouldn't do that, that is RA's job; doing this in a greedy way is a
> bad idea.  (I haven't yet verified if you do this, fwiw).

No:

  /* Mimic combine's behavior by not combining moves from allocatable hard
     registers (e.g. when copying parameters or function return values).  */
  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
    return false;

Although if that could have accounted for the difference, it sounds like
we're leaving a lot on the table by doing this :-)

>> Inevitably there was some scan-assembler fallout for other tests.
>> E.g. in gcc.target/aarch64/vmov_n_1.c:
>> 
>> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>>   ...
>>   INHIB_OPTIMIZATION;							\
>>   (a) = TEST (test, data_len);						\
>>   INHIB_OPTIMIZATION;							\
>>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\
>> 
>> is no longer effective for preventing move (a) from being merged
>> into (b), because the pass can merge at the point of (a).
>
> It never was effective for that.  Unless (b) lives in memory, in which
> case your new pass has a bug here.

The target of the vmov is a register.

>> I think
>> this is a valid thing to do -- the asm semantics are still satisfied,
>> and asm volatile ("" : : : "memory") never acted as a register barrier.
>> But perhaps we should deal with this as a special case?
>
> I don't think we should, no.  What does "register barrier" even mean,
> *exactly*?

Yeah, agree with you and Andrew that we shouldn't, was just checking
that there was agreement.

Thanks,
Richard
Segher Boessenkool Nov. 19, 2019, 8:39 p.m. UTC | #5
On Tue, Nov 19, 2019 at 11:33:13AM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
> >> While working on SVE, I've noticed several cases in which we fail
> >> to combine instructions because the combined form would need to be
> >> placed earlier in the instruction stream than the last of the
> >> instructions being combined.  This includes one very important
> >> case in the handling of the first fault register (FFR).
> >
> > Do you have an example of that?
> 
> It's difficult to share realistic examples at this stage since this
> isn't really the right forum for making them public for the first time.

Oh I'm very sorry.  In the future, just say "Future" and I know what
you mean :-)

>       /* Make sure that the value that is to be substituted for the register
> 	 does not use any registers whose values alter in between.  However,
> 	 If the insns are adjacent, a use can't cross a set even though we
> 	 think it might (this can happen for a sequence of insns each setting
> 	 the same destination; last_set of that register might point to
> 	 a NOTE).  If INSN has a REG_EQUIV note, the register is always
> 	 equivalent to the memory so the substitution is valid even if there
> 	 are intervening stores.  Also, don't move a volatile asm or
> 	 UNSPEC_VOLATILE across any other insns.  */
>       || (! all_adjacent
> 	  && (((!MEM_P (src)
> 		|| ! find_reg_note (insn, REG_EQUIV, src))
> 	       && modified_between_p (src, insn, i3))
> 	      || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
> 	      || GET_CODE (src) == UNSPEC_VOLATILE))

So this would work if you had pseudos here, instead of the hard reg?
Because it is a hard reg it is the same number in both places, making it
hard to move.

> > How are dependencies represented in your new pass?  If it just does
> > walks over the insn stream for everything, you get quadratic complexity
> > if you move insns backwards.  We have that in combine already, mostly
> > from modified_between_p, but that is limited because of how LOG_LINKS
> > work, and we have been doing this for so long and there are no problems
> > found with it, so it must work in practice.  But I am worried about it
> > when moving insns back an unlimited distance.
> 
> It builds def-use chains, but using a constant limit on the number of
> explicitly-recorded uses.  All other uses go in a numerical live range
> from which they (conservatively) never escape.  The def-use chains
> represent memory as a single entity, a bit like in gimple.

Ah.  So that range thing ensures correctness.

Why don't you use DF for the DU chains?

> >> - it tries using REG_EQUAL notes for the final instruction.
> >
> > And that.
> 
> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
> with i3's REG_EQUAL note and combining into that.  Does combine do that?
> I couldn't see it, and in:
> 
>    https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
> 
> you seemed to reject the idea of allowing it.

Yes, I still do.  Do you have an example where it helps?

> >> - it can parallelise two independent instructions that both read from
> >>   the same register or both read from memory.
> >
> > That only if somehow there is a link between the two (so essentially
> > never).  The only combinations tried by combine are those via LOG_LINKs,
> > which are between a SET and the first corresponding use.  This is a key
> > factor that makes it kind of linear (instead of exponential) complexity.
> 
> Tracking limited def-use chains is what makes this last bit easy.
> We can just try parallelising two instructions from the (bounded) list
> of uses.  And for this case there's not any garbage rtl involved, since
> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> all in the recog call (which would obviously mount up if we went
> overboard).

*All* examples above and below are just this.

If you disable everything else, what do the statistics look like then?

> > One thing I want to do is some mini-combine after every split, probably
> > only with the insns new from the split.  But we have no cfglayout mode
> > anymore then, and only hard regs (except in the first split pass, which
> > is just a little later than your new pass).
> 
> Yeah, sounds like it could be useful.  I guess there'd need to be
> an extra condition on the combination that the new insn can't be
> immediately split.

It would run *after* split.  Not interleaved with it.

> > And amount of garbage produced?
> 
> If -ftime-report stats are accurate, then the total amount of
> memory allocated is:
> 
> run-combine=2 (normal combine): 1793 kB
> run-combine=4 (new pass only):    98 kB
> run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)
> 
> But again that's not a fair comparison when the main combine pass does more.

The way combine does SUBST is pretty fundamental to how it works (it can
be ripped out, and probably we'll have to at some point, but that will be
very invasive).  Originally all this temporary RTL was on obstacks and
reaping it was cheap, but everything is GCed now (fixing the bugs was not
cheap :-) )

If you look at even really bad cases, combine is still only a few percent
of total, so it isn't too bad.

> I did try hard to keep the amount of garbage rtl down though.  This is
> why I added validate_simplify_replace_rtx rather than trying to make
> do with existing routines.  It should only create new rtl if the
> simplification routines did something useful.  (Of course, that's mostly
> true of combine as well, but things like the make_compound_operation/
> expand_compound_operation wrangler can create expressions that are never
> actually useful.)

Don't mention those, thanks :-)

> >> To get a feel for the effect on multiple targets, I did my usual
> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> using -O2 -ftree-vectorize:
> >
> > One problem with this is that these are very short functions on average.
> 
> There are some long ones too :-)

Yes, but this isn't a good stand-in for representative programs.

> > What is the kind of changes you see for other targets?
> 
> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
> in favour of other flag-setting instructions and making more use of
> post-increments.  Not sure the last one is actually a win, but the
> target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:
> 
> @@ -48,9 +48,8 @@
>         blr
>         .align 4
>  .L19:
> -       cmpdi 0,10,0
> +       mr. 9,10
>         mr 3,8
> -       mr 9,10
>         bne 0,.L9
>         b .L3
>         .align 4

Okay, so this is combining two uses of r10 into one insn.

This isn't necessarily a good idea: the combined insn cannot be moved as
much as one of its components could, which can also immediately prevent
further combinations.

But doing this after combine, as you do, is probably beneficial.

> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:

This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
because this is cheaper on p7 and p8).

> gcc.c-torture/execute/20081218-1.c is an example where we make more use
> of post-increment:
> 
>  .L9:
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
>         bdnz .L9

Pre-increment (we only *have* pre-modify memory accesses).

>   /* Mimic combine's behavior by not combining moves from allocatable hard
>      registers (e.g. when copying parameters or function return values).  */
>   if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>     return false;
> 
> Although if that could have accounted for the difference, it sounds like
> we're leaving a lot on the table by doing this :-)

It actually helps (and quite a bit).  But if your test cases are mainly
tiny functions, anything can happen.  But since you see this across all
targets, it must be doing something good :-)


So I'd love to see statistics for *only* combining two uses of the same
thing, this is something combine cannot do, and arguably *shouldn't* do!


Segher
Richard Sandiford Nov. 20, 2019, 6:20 p.m. UTC | #6
Segher Boessenkool <segher@kernel.crashing.org> writes:
>>       /* Make sure that the value that is to be substituted for the register
>> 	 does not use any registers whose values alter in between.  However,
>> 	 If the insns are adjacent, a use can't cross a set even though we
>> 	 think it might (this can happen for a sequence of insns each setting
>> 	 the same destination; last_set of that register might point to
>> 	 a NOTE).  If INSN has a REG_EQUIV note, the register is always
>> 	 equivalent to the memory so the substitution is valid even if there
>> 	 are intervening stores.  Also, don't move a volatile asm or
>> 	 UNSPEC_VOLATILE across any other insns.  */
>>       || (! all_adjacent
>> 	  && (((!MEM_P (src)
>> 		|| ! find_reg_note (insn, REG_EQUIV, src))
>> 	       && modified_between_p (src, insn, i3))
>> 	      || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>> 	      || GET_CODE (src) == UNSPEC_VOLATILE))
>
> So this would work if you had pseudos here, instead of the hard reg?
> Because it is a hard reg it is the same number in both places, making it
> hard to move.

Yeah, probably.  But the hard reg is a critical part of this.
Going back to the example:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

FFR is the real first fault register.  FFRT is actually a fake register
whose only purpose is to describe the dependencies (in rtl) between writes
to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
depends on having only one fixed FFRT register.
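
To make the placement problem concrete, here is a toy model of the three
insns above.  It is only a sketch: the toy_* names and the simplified
checks are assumptions for illustration and are not the tests that either
pass really performs.  It models why substituting at the position of the
last insn is rejected (FFRT is set in between) while placing the combined
insn at the position of the first insn, in parallel with the existing
definition, is acceptable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SRCS 2
#define TOY_NO_REG (-1)

/* Toy insn: one destination register and up to two source registers.  */
struct toy_insn
{
  int dest;
  int srcs[TOY_MAX_SRCS];
};

/* The three insns from the example, with the same register numbers:
   85 is FFRT, 102 is "ok", 106 is the predicate and 66 is CC.  */
static const struct toy_insn toy_ffr_example[3] = {
  { 102, { 85, TOY_NO_REG } },   /* r102 = ffrt */
  { 85, { 85, TOY_NO_REG } },    /* ffrt = update (ffrt) */
  { 66, { 106, 102 } },          /* cc = ptest (r106, r102) */
};

/* Return true if INSN reads REG.  */
static bool
toy_reads_p (const struct toy_insn *insn, int reg)
{
  for (size_t i = 0; i < TOY_MAX_SRCS; i++)
    if (insn->srcs[i] == reg)
      return true;
  return false;
}

/* Return true if REG is set strictly between insns FROM and TO.  */
static bool
toy_set_between_p (const struct toy_insn *insns, int reg,
                   size_t from, size_t to)
{
  for (size_t i = from + 1; i < to; i++)
    if (insns[i].dest == reg)
      return true;
  return false;
}

/* Return true if REG is read strictly between insns FROM and TO.  */
static bool
toy_used_between_p (const struct toy_insn *insns, int reg,
                    size_t from, size_t to)
{
  for (size_t i = from + 1; i < to; i++)
    if (toy_reads_p (&insns[i], reg))
      return true;
  return false;
}

/* Substitute DEF's source into USE at USE's position: every source of
   DEF must still hold the same value when USE is reached.  */
static bool
toy_can_combine_at_use_p (const struct toy_insn *insns,
                          size_t def, size_t use)
{
  assert (def < use);
  for (size_t i = 0; i < TOY_MAX_SRCS; i++)
    if (insns[def].srcs[i] != TOY_NO_REG
        && toy_set_between_p (insns, insns[def].srcs[i], def, use))
      return false;
  return true;
}

/* Place the combined insn at DEF's position, keeping DEF's set in
   parallel: USE's result becomes live earlier, so nothing in between
   may read or write it, and USE's other inputs must already hold
   their final values at DEF.  */
static bool
toy_can_combine_at_def_p (const struct toy_insn *insns,
                          size_t def, size_t use)
{
  assert (def < use);
  int new_dest = insns[use].dest;
  if (toy_set_between_p (insns, new_dest, def, use)
      || toy_used_between_p (insns, new_dest, def, use))
    return false;
  for (size_t i = 0; i < TOY_MAX_SRCS; i++)
    {
      int src = insns[use].srcs[i];
      if (src == TOY_NO_REG || src == insns[def].dest)
        continue;
      if (toy_set_between_p (insns, src, def, use))
        return false;
    }
  return true;
}
```

On toy_ffr_example, toy_can_combine_at_use_p (toy_ffr_example, 0, 2) is
false because register 85 is set by the middle insn, while
toy_can_combine_at_def_p (toy_ffr_example, 0, 2) is true because neither
register 106 nor register 66 is touched in between.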

>> > How are dependencies represented in your new pass?  If it just does
>> > walks over the insn stream for everything, you get quadratic complexity
>> > if you move insns backwards.  We have that in combine already, mostly
>> > from modified_between_p, but that is limited because of how LOG_LINKS
>> > work, and we have been doing this for so long and there are no problems
>> > found with it, so it must work in practice.  But I am worried about it
>> > when moving insns back an unlimited distance.
>> 
>> It builds def-use chains, but using a constant limit on the number of
>> explicitly-recorded uses.  All other uses go in a numerical live range
>> from which they (conservatively) never escape.  The def-use chains
>> represent memory as a single entity, a bit like in gimple.
>
> Ah.  So that range thing ensures correctness.

Yeah.

> Why don't you use DF for the DU chains?

The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
fwprop.c gets around that by using the MD problem and having its own
dominator walker to calculate limited def-use chains:

  /* We use the multiple definitions problem to compute our restricted
     use-def chains.  */

So taking that approach here would still require some amount of
roll-your-own.  Other reasons are:

* Even what fwprop does is more elaborate than we need for now.

* We need to handle memory too, and it's nice to be able to handle
  it in the same way as registers.

* Updating a full, ordered def-use chain after a move is a linear-time
  operation, so whatever happens, we'd need to apply some kind of limit
  on the number of uses we maintain, with something like that integer
  point range for the rest.

* Once we've analysed the insn and built its def-use chains, we don't
  look at the df_refs again until we update the chains after a successful
  combination.  So it should be more efficient to maintain a small array
  of insn_info_rec pointers alongside the numerical range, rather than
  walk and pollute chains of df_refs and then link back the insn uids
  to the pass-local info.

>> >> - it tries using REG_EQUAL notes for the final instruction.
>> >
>> > And that.
>> 
>> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
>> with i3's REG_EQUAL note and combining into that.  Does combine do that?
>> I couldn't see it, and in:
>> 
>>    https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>> 
>> you seemed to reject the idea of allowing it.
>
> Yes, I still do.  Do you have an example where it helps?

I'll run another set of tests for that.

>> >> - it can parallelise two independent instructions that both read from
>> >>   the same register or both read from memory.
>> >
>> > That only if somehow there is a link between the two (so essentially
>> > never).  The only combinations tried by combine are those via LOG_LINKs,
>> > which are between a SET and the first corresponding use.  This is a key
>> > factor that makes it kind of linear (instead of exponential) complexity.
>> 
>> Tracking limited def-use chains is what makes this last bit easy.
>> We can just try parallelising two instructions from the (bounded) list
>> of uses.  And for this case there's not any garbage rtl involved, since
>> we reuse the same PARALLEL rtx between attempts.  The cost is basically
>> all in the recog call (which would obviously mount up if we went
>> overboard).
>
> *All* examples above and below are just this.

Yeah, the powerpc and s390x examples were.  The motivating FFR example
above isn't though: it's a def-use combination in parallel with the
existing definition.

> If you disable everything else, what do the statistics look like then?

Had no idea how this would turn out -- which is a good sign it was
worth doing -- but: results below.

>> > One thing I want to do is some mini-combine after every split, probably
>> > only with the insns new from the split.  But we have no cfglayout mode
>> > anymore then, and only hard regs (except in the first split pass, which
>> > is just a little later than your new pass).
>> 
>> Yeah, sounds like it could be useful.  I guess there'd need to be
>> an extra condition on the combination that the new insn can't be
>> immediately split.
>
> It would run *after* split.  Not interleaved with it.

Yeah.  But what I meant was: a lot of insns that are split after
reload are combined for RA purposes and the split form is really
the preferred form (especially for scheduling).  So if we have
a combine pass *after* split, I think it should avoid using any
combination that matches a split.

>> > And amount of garbage produced?
>> 
>> If -ftime-report stats are accurate, then the total amount of
>> memory allocated is:
>> 
>> run-combine=2 (normal combine): 1793 kB
>> run-combine=4 (new pass only):    98 kB
>> run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)
>> 
>> But again that's not a fair comparison when the main combine pass does more.
>
> The way combine does SUBST is pretty fundamental to how it works (it can
> be ripped out, and probably we'll have to at some point, but that will be
> very invasive).  Originally all this temporary RTL was on obstacks and
> reaping it was cheap, but everything is GCed now (fixing the bugs was not
> cheap :-) )

Yeah, I remember :-)

> If you look at even really bad cases, combine is still only a few
> percent of total, so it isn't too bad.
>
>> I did try hard to keep the amount of garbage rtl down though.  This is
>> why I added validate_simplify_replace_rtx rather than trying to make
>> do with existing routines.  It should only create new rtl if the
>> simplification routines did something useful.  (Of course, that's mostly
>> true of combine as well, but things like the make_compound_operation/
>> expand_compound_operation wrangler can create expressions that are never
>> actually useful.)
>
> Don't mention those, thanks :-)
>
>> >> To get a feel for the effect on multiple targets, I did my usual
>> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
>> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
>> >> using -O2 -ftree-vectorize:
>> >
>> > One problem with this is that these are very short functions on average.
>> 
>> There are some long ones too :-)
>
> Yes, but this isn't a good stand-in for representative programs.

Right.  And number of lines of asm isn't a good stand-in for anything much.
Like I say, the whole thing is just to get a feel, on tests that are readily
to hand and are easy to compile without a full toolchain.

>> > What is the kind of changes you see for other targets?
>> 
>> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
>> in favour of other flag-setting instructions and making more use of
>> post-increments.  Not sure the last one is actually a win, but the
>> target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:
>> 
>> @@ -48,9 +48,8 @@
>>         blr
>>         .align 4
>>  .L19:
>> -       cmpdi 0,10,0
>> +       mr. 9,10
>>         mr 3,8
>> -       mr 9,10
>>         bne 0,.L9
>>         b .L3
>>         .align 4
>
> Okay, so this is combining two uses of r10 into one insn.
>
> This isn't necessarily a good idea: the combined insn cannot be moved as
> much as one of its components could, which can also immediately prevent
> further combinations.
>
> But doing this after combine, as you do, is probably beneficial.
>
>> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:
>
> This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
> because this is cheaper on p7 and p8).
>
>> gcc.c-torture/execute/20081218-1.c is an example where we make more use
>> of post-increment:
>> 
>>  .L9:
>> -       lbz 10,1(9)
>> -       addi 9,9,1
>> +       lbzu 10,1(9)
>>         cmpwi 0,10,38
>>         bne 0,.L8
>> -       lbz 10,1(9)
>> -       addi 9,9,1
>> +       lbzu 10,1(9)
>>         cmpwi 0,10,38
>>         bne 0,.L8
>>         bdnz .L9
>
> Pre-increment (we only *have* pre-modify memory accesses).

Oops, yes.

>>   /* Mimic combine's behavior by not combining moves from allocatable hard
>>      registers (e.g. when copying parameters or function return values).  */
>>   if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>>     return false;
>> 
>> Although if that could have accounted for the difference, it sounds like
>> we're leaving a lot on the table by doing this :-)
>
> It actually helps (and quite a bit).  But if your test cases are mainly
> tiny functions, anything can happen.  But since you see this across all
> targets, it must be doing something good :-)
>
>
> So I'd love to see statistics for *only* combining two uses of the same
> thing, this is something combine cannot do, and arguably *shouldn't* do!

OK, here are two sets of results.  The first is for:

(A) --param run-combine=2 (current combine only)
(B) --param run-combine=6 (both passes), use-use combinations only

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu        158    3060     -72     520      -1
aarch64_be-linux-gnu     111      24     -57     324      -1
alpha-linux-gnu            3       3       1       1       1
amdgcn-amdhsa             18      71     -17      26       1
arc-elf                  310   -4414   -1516     356       1
arm-linux-gnueabi         28     -50     -13       3      -1
arm-linux-gnueabihf       28     -50     -13       3      -1
avr-elf                   26     308      -1      36      12
bfin-elf                   6       8      -1       3       1
bpf-elf                   10      21      -1       6       1
c6x-elf                    7       9      -6       6       1
cr16-elf                  13     102       1      27       2
cris-elf                  35   -1001    -700       3      -2
csky-elf                   9      28       1       6       2
epiphany-elf              29     -29      -2       1      -1
fr30-elf                  12      17      -1       5       1
frv-linux-gnu              1      -2      -2      -2      -2
ft32-elf                  10      22      -1       5       2
h8300-elf                 29      56     -22      14       2
hppa64-hp-hpux11.23        9      17      -1       4       2
i686-apple-darwin         10     -33     -20      12      -2
i686-pc-linux-gnu         41     243     -12      33       3
ia64-linux-gnu            28     -32     -29      39      -4
iq2000-elf                 6       8       1       2       1
lm32-elf                  10      12      -3       5       1
m32r-elf                   3       2      -2       2       2
m68k-linux-gnu            19      27      -2       5       2
mcore-elf                 14      23     -10       6       2
microblaze-elf             5       5       1       1       1
mipsel-linux-gnu           9      12      -5       6       2
mipsisa64-linux-gnu        7       1      -3       1       1
mmix                       6       6      -2       4       1
mn10300-elf               20      15      -4       5       1
moxie-rtems                8      11      -2       3       1
msp430-elf                 8      24       1       6       2
nds32le-elf               91    -188     -24     136      -1
nios2-linux-gnu            2       6       1       5       1
nvptx-none               396     756       1      16       1
or1k-elf                   8      20       1       4       2
pdp11                     65     149     -10      45       2
powerpc-ibm-aix7.0      1039    1114    -366    2124      -1
powerpc64-linux-gnu      854    2753    -274    3094      -2
powerpc64le-linux-gnu    648    -551    -340     208      -1
pru-elf                    5       5      -2       3       1
riscv32-elf                7       6      -2       5       1
riscv64-elf                2       5       2       3       2
rl78-elf                  80    -648     -98      13      -4
rx-elf                    16       2      -4       5      -1
s390-linux-gnu            60    -174     -39      14      -1
s390x-linux-gnu          152    -781    -159      14      -1
sh-linux-gnu              13       5     -15       7       1
sparc-linux-gnu           29       7      -3      11      -1
sparc64-linux-gnu         51       1      -8      15      -1
tilepro-linux-gnu        119    -567    -164      15      -2
v850-elf                   4       4      -1       3       1
vax-netbsdelf             10      13      -4       5       1
visium-elf                 4       0      -5       3       1
x86_64-darwin              7     -12      -9       4      -2
x86_64-linux-gnu           6     -11      -6       4      -2
xstormy16-elf             10      13       1       2       1
xtensa-elf                 6       8      -1       2       2

which definitely shows up some outliers I need to look at.  The second
set is for:

(B) --param run-combine=6 (both passes), use-use combinations only
(C) --param run-combine=6 (both passes), no restrictions

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu        272   -3844    -585      18      -1
aarch64_be-linux-gnu     190   -3336    -370      18      -1
alpha-linux-gnu          401   -2735    -370      22      -2
amdgcn-amdhsa            188    1867    -484    1259      -1
arc-elf                  257   -1498    -650      54      -1
arm-linux-gnueabi        168   -1117    -612     680      -1
arm-linux-gnueabihf      168   -1117    -612     680      -1
avr-elf                 1341 -111401  -13824     680     -10
bfin-elf                1346  -18950   -8461     465      -2
bpf-elf                   63    -496     -60       3      -2
c6x-elf                  179  -10527  -10084      41      -2
cr16-elf                1616  -51479  -10657      42     -13
cris-elf                 113    -533     -84       4      -2
csky-elf                 129   -3399    -474       1      -2
epiphany-elf             151    -375    -149      84      -1
fr30-elf                 155   -1773    -756     289      -2
frv-linux-gnu            808  -13332   -2074      67      -1
ft32-elf                 276   -1688    -111      -1      -2
h8300-elf                527  -11522   -1747      68      -3
hppa64-hp-hpux11.23      179    -865    -142      34      -1
i686-apple-darwin        335   -1266     -56      44      -1
i686-pc-linux-gnu        222   -2216    -556      32      -1
ia64-linux-gnu           122   -4793   -1134      40      -5
iq2000-elf               171   -1341     -61       3      -2
lm32-elf                 187   -1814    -316      47      -2
m32r-elf                  70    -597     -98      11      -2
m68k-linux-gnu           197   -2375    -332     148      -2
mcore-elf                125   -1236    -146       7      -1
microblaze-elf           442   -4498   -2094      32      -2
mipsel-linux-gnu         125   -2050    -222      60      -2
mipsisa64-linux-gnu      107   -2015    -130      14      -2
mmix                     103    -239     -26       4      -1
mn10300-elf              215   -1039    -234      80      -1
moxie-rtems              149    -754     -79       4      -2
msp430-elf               180    -600     -63      19      -1
nds32le-elf              183    -287     -37      32      -1
nios2-linux-gnu           81    -329     -66       4      -1
nvptx-none               200   -1882    -208      -2      -2
or1k-elf                  57    -317     -25       2      -1
pdp11                    207   -1441    -182      83      -2
powerpc-ibm-aix7.0       400   -4145    -271      14      -2
powerpc64-linux-gnu      375   -2062    -160     117      -2
powerpc64le-linux-gnu    491   -4169    -700     156      -2
pru-elf                   47   -7020   -6921       6      -1
riscv32-elf               59   -1379    -139       7      -2
riscv64-elf               89   -1562    -264       7      -1
rl78-elf                 289  -16157   -1665      42      -6
rx-elf                    82    -195     -53       8      -1
s390-linux-gnu           128   -2108   -1485      63      -1
s390x-linux-gnu          112     418     -32     522      -1
sh-linux-gnu             218    -410    -108      68      -1
sparc-linux-gnu          141    -866     -99      18      -1
sparc64-linux-gnu        129    -792    -102       3      -2
tilepro-linux-gnu        953   -4331    -297     332      -2
v850-elf                  50    -412     -53       2      -3
vax-netbsdelf            254   -3328    -400       4      -2
visium-elf               100    -693    -138      16      -1
x86_64-darwin            345   -2134    -490      72      -1
x86_64-linux-gnu         307    -843    -288     210      -1
xstormy16-elf            218    -788    -156      59      -1
xtensa-elf               195   -1426    -322      36       1

So the main benefit does seem to come from the def-use part.
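
For anyone wanting to reproduce this kind of table, here is a minimal
sketch of how one row could be computed.  It assumes (this reading is
not spelled out above) that each test contributes the change in its
number of asm lines, that "Tests" counts the tests whose output changed,
and that Delta, Best, Worst and Median are the sum, minimum, maximum and
median of those changes.  The single-test frv-linux-gnu row in the first
table (-2 in every column) is at least consistent with that reading.
The names delta_summary and summarize are made up for illustration.

```c
#include <stdlib.h>

/* Summary of per-test changes in asm line count (after minus before).
   The column meanings are an assumption, not taken from the thread.  */
struct delta_summary {
  int tests;   /* number of tests whose line count changed */
  long delta;  /* sum of all changes */
  int best;    /* most negative change, i.e. the biggest saving */
  int worst;   /* most positive change, i.e. the biggest growth */
  int median;  /* median change */
};

/* qsort comparison function for ints, written to avoid overflow.  */
static int cmp_int (const void *a, const void *b)
{
  int x = *(const int *) a, y = *(const int *) b;
  return (x > y) - (x < y);
}

/* Summarise the N changes in CHANGES.  Sorts CHANGES in place.
   N must be greater than zero.  */
struct delta_summary summarize (int *changes, int n)
{
  struct delta_summary s = { n, 0, 0, 0, 0 };
  qsort (changes, n, sizeof (int), cmp_int);
  for (int i = 0; i < n; i++)
    s.delta += changes[i];
  s.best = changes[0];
  s.worst = changes[n - 1];
  s.median = changes[n / 2];  /* upper median when N is even */
  return s;
}
```

Whether the real script picks the upper or lower median for an even
number of tests is not something the tables can tell us.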

Here are some powerpc64le-linux-gnu examples of (B)->(C):

gcc.c-torture/execute/20171008-1.c:

@@ -79,8 +79,7 @@
        stdu 1,-32(1)
 .LCFI5:
        bl foo
-       rlwinm 3,3,0,0xff
-       cmpwi 0,3,0
+       andi. 9,3,0xff
        bne 0,.L13
        addi 1,1,32

gcc.c-torture/execute/pr28982a.c:

@@ -427,15 +427,13 @@
        stxvd2x 0,7,6
        .align 4
 .L9:
-       xxlor 12,32,32
+       xvcvsxwsp 0,32
        vadduwm 0,0,1
        addi 10,9,16
-       xvcvsxwsp 0,12
-       xxlor 12,32,32
+       stxvd2x 0,0,9
+       xvcvsxwsp 0,32
+       addi 9,9,32
        vadduwm 0,0,1
-       stxvd2x 0,0,9
-       xvcvsxwsp 0,12
-       addi 9,9,32
        stxvd2x 0,0,10
        bdnz .L9
        li 3,4

(Disclaimer: I have no idea if that's correct.)

gcc.c-torture/execute/pr65215-3.c:

@@ -56,11 +56,10 @@
        srdi 10,3,32
        srdi 9,3,56
        slwi 6,10,24
-       srwi 7,10,8
+       rlwinm 7,10,24,16,23
        or 9,9,6
-       rlwinm 7,7,0,16,23
+       rlwinm 10,10,8,8,15
        or 9,9,7
-       rlwinm 10,10,8,8,15
        or 9,9,10
        cmpw 0,9,8
        bne 0,.L4
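
As a sanity check on the r7 part of this diff, the old pair
"srwi 7,10,8; rlwinm 7,7,0,16,23" and the new single
"rlwinm 7,10,24,16,23" can be modelled in C.  This is only a sketch:
it models the low 32 bits of the registers, handles only non-wrapping
masks, and the helper names (rotl32, ppc_mask, old_sequence,
new_sequence) are made up for illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by SH bits (0 <= SH < 32).  */
static uint32_t rotl32 (uint32_t x, unsigned sh)
{
  return sh ? (x << sh) | (x >> (32 - sh)) : x;
}

/* Mask with bits MB..ME set, using Power's numbering in which bit 0 is
   the most significant bit.  Only the non-wrapping case MB <= ME is
   handled here.  */
static uint32_t ppc_mask (unsigned mb, unsigned me)
{
  uint32_t from_mb = 0xffffffffu >> mb;
  uint32_t to_me = 0xffffffffu << (31 - me);
  return from_mb & to_me;
}

/* Model of "rlwinm ra,rs,sh,mb,me": rotate left, then AND with mask.  */
uint32_t rlwinm (uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
  return rotl32 (rs, sh) & ppc_mask (mb, me);
}

/* Old sequence: srwi 7,10,8 ; rlwinm 7,7,0,16,23.  */
uint32_t old_sequence (uint32_t r10)
{
  uint32_t r7 = r10 >> 8;
  return rlwinm (r7, 0, 16, 23);
}

/* New sequence: rlwinm 7,10,24,16,23.  */
uint32_t new_sequence (uint32_t r10)
{
  return rlwinm (r10, 24, 16, 23);
}
```

Both functions reduce to (r10 >> 8) & 0xff00: rotating left by 24 is the
same as rotating right by 8, and the mask 16..23 discards the bits that
the rotate wrapped around.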

Just to emphasise though: I'm not proposing that we switch this on for
all targets yet.  It would be opt-in until the pass is more mature.
But that FFR case is really important for the situation it handles.

Thanks,
Richard
Segher Boessenkool Nov. 20, 2019, 8:45 p.m. UTC | #7
On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > So this would work if you had pseudos here, instead of the hard reg?
> > Because it is a hard reg it is the same number in both places, making it
> > hard to move.
> 
> Yeah, probably.  But the hard reg is a critical part of this.
> Going back to the example:
> 
> (set (reg/v:VNx16BI 102 [ ok ])
>      (reg:VNx16BI 85 ffrt))
> (set (reg:VNx16BI 85 ffrt)
>      (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
> (set (reg:CC_NZC 66 cc)
>      (unspec:CC_NZC
>        [(reg:VNx16BI 106) repeated x2
>         (const_int 1 [0x1])
>         (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))
> 
> FFR is the real first fault register.  FFRT is actually a fake register
> whose only purpose is to describe the dependencies (in rtl) between writes
> to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
> depends on having only one fixed FFRT register.

Right.  The reason this cannot work in combine is that combine always
combines to just *one* insn, at i3; later, if it turns out that it needs
to split it, it can put something at i2.  But that doesn't even apply
here: only the first and the last of those three insns are being
combined.

It is important combine only moves things forward in the insn stream, to
make sure this whole process is finite.  Or this was true years ago, at
least :-)

> > Why don't you use DF for the DU chains?
> 
> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.

Oh, wow.

> fwprop.c gets around that by using the MD problem and having its own
> dominator walker to calculate limited def-use chains:
> 
>   /* We use the multiple definitions problem to compute our restricted
>      use-def chains.  */

It's not great if every pass invents its own version of some common
infrastructure thing because that common one is not suitable.

I.e., can this be fixed somehow?  Maybe just by having a restricted DU
chains df problem?

> So taking that approach here would still require some amount of
> roll-your-own.  Other reasons are:
> 
> * Even what fwprop does is more elaborate than we need for now.
> 
> * We need to handle memory too, and it's nice to be able to handle
>   it in the same way as registers.
> 
> * Updating a full, ordered def-use chain after a move is a linear-time
>   operation, so whatever happens, we'd need to apply some kind of limit
>   on the number of uses we maintain, with something like that integer
>   point range for the rest.
> 
> * Once we've analysed the insn and built its def-use chains, we don't
>   look at the df_refs again until we update the chains after a successful
>   combination.  So it should be more efficient to maintain a small array
>   of insn_info_rec pointers alongside the numerical range, rather than
>   walk and pollute chains of df_refs and then link back the insn uids
>   to the pass-local info.

So you need something like combine's LOG_LINKS?  Handling those is
quadratic in the worst case too, but in practice it works well.  And
it *could* be made linear.
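
To make the quoted bullets a bit more concrete, the bounded bookkeeping
could be pictured roughly as below.  This is a guess at the shape, not
code from the patch: MAX_TRACKED_USES, struct use_list and the functions
are hypothetical names, and instructions are represented by plain
integer program points rather than insn_info_rec pointers.

```c
#include <stdbool.h>

#define MAX_TRACKED_USES 4  /* hypothetical limit on individually tracked uses */

/* Uses of one definition.  The point range is always maintained; the
   individual uses are only meaningful while there are at most
   MAX_TRACKED_USES of them.  */
struct use_list {
  int first_point;                /* earliest use, valid if num_uses > 0 */
  int last_point;                 /* latest use, valid if num_uses > 0 */
  unsigned num_uses;              /* total number of uses recorded */
  int tracked[MAX_TRACKED_USES];  /* the first few uses, in recording order */
};

void use_list_init (struct use_list *u)
{
  u->first_point = u->last_point = 0;
  u->num_uses = 0;
}

/* Record a use at POINT.  Constant time regardless of how many uses
   there already are.  */
void record_use (struct use_list *u, int point)
{
  if (u->num_uses == 0)
    u->first_point = u->last_point = point;
  else
    {
      if (point < u->first_point)
        u->first_point = point;
      if (point > u->last_point)
        u->last_point = point;
    }
  if (u->num_uses < MAX_TRACKED_USES)
    u->tracked[u->num_uses] = point;
  u->num_uses++;
}

/* True if every use is listed in TRACKED, so that a pass could try
   combining with each of them individually.  */
bool uses_are_enumerable (const struct use_list *u)
{
  return u->num_uses <= MAX_TRACKED_USES;
}
```

Once the array overflows, only the range is left, which is enough to
answer "is it safe to move across this point?" but not to enumerate
combination candidates.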

> >> Tracking limited def-use chains is what makes this last bit easy.
> >> We can just try parallelising two instructions from the (bounded) list
> >> of uses.  And for this case there's not any garbage rtl involved, since
> >> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> >> all in the recog call (which would obviously mount up if we went
> >> overboard).
> >
> > *All* examples above and below are just this.
> 
> Yeah, the powerpc and s390x examples were.  The motivating FFR example
> above isn't though: it's a def-use combination in parallel with the
> existing definition.

Right, good point :-)

> >> >> To get a feel for the effect on multiple targets, I did my usual
> >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> >> using -O2 -ftree-vectorize:
> >> >
> >> > One problem with this is that these are very short functions on average.
> >> 
> >> There are some long ones too :-)
> >
> > Yes, but this isn't a good stand-in for representative programs.
> 
> Right.  And number of lines of asm isn't a good stand-in for anything much.

For combine, number of insns generated is a surprisingly good measure of
how it performed.  Sometimes not, when it goes over a border of an
inlining decision, say, or bb-reorder decides to duplicate more because
it is cheaper now.

> Like I say, the whole thing is just to get a feel, on tests that are readily
> to hand and are easy to compile without a full toolchain.

Absolutely.  But I have no experience with using your test set, so the
numbers do not necessarily mean so much to me :-)

> > So I'd love to see statistics for *only* combining two uses of the same
> > thing, this is something combine cannot do, and arguably *shouldn't* do!
> 
> OK, here are two sets of results.  The first is for:
> 
> (A) --param run-combine=2 (current combine only)
> (B) --param run-combine=6 (both passes), use-use combinations only
> 
> Target                 Tests   Delta    Best   Worst  Median
> ======                 =====   =====    ====   =====  ======
> aarch64-linux-gnu        158    3060     -72     520      -1
> aarch64_be-linux-gnu     111      24     -57     324      -1
> alpha-linux-gnu            3       3       1       1       1
> amdgcn-amdhsa             18      71     -17      26       1
> arc-elf                  310   -4414   -1516     356       1
> arm-linux-gnueabi         28     -50     -13       3      -1
> arm-linux-gnueabihf       28     -50     -13       3      -1
> avr-elf                   26     308      -1      36      12
> bfin-elf                   6       8      -1       3       1
> bpf-elf                   10      21      -1       6       1
> c6x-elf                    7       9      -6       6       1
> cr16-elf                  13     102       1      27       2
> cris-elf                  35   -1001    -700       3      -2
> csky-elf                   9      28       1       6       2
> epiphany-elf              29     -29      -2       1      -1
> fr30-elf                  12      17      -1       5       1
> frv-linux-gnu              1      -2      -2      -2      -2
> ft32-elf                  10      22      -1       5       2
> h8300-elf                 29      56     -22      14       2
> hppa64-hp-hpux11.23        9      17      -1       4       2
> i686-apple-darwin         10     -33     -20      12      -2
> i686-pc-linux-gnu         41     243     -12      33       3
> ia64-linux-gnu            28     -32     -29      39      -4
> iq2000-elf                 6       8       1       2       1
> lm32-elf                  10      12      -3       5       1
> m32r-elf                   3       2      -2       2       2
> m68k-linux-gnu            19      27      -2       5       2
> mcore-elf                 14      23     -10       6       2
> microblaze-elf             5       5       1       1       1
> mipsel-linux-gnu           9      12      -5       6       2
> mipsisa64-linux-gnu        7       1      -3       1       1
> mmix                       6       6      -2       4       1
> mn10300-elf               20      15      -4       5       1
> moxie-rtems                8      11      -2       3       1
> msp430-elf                 8      24       1       6       2
> nds32le-elf               91    -188     -24     136      -1
> nios2-linux-gnu            2       6       1       5       1
> nvptx-none               396     756       1      16       1
> or1k-elf                   8      20       1       4       2
> pdp11                     65     149     -10      45       2
> powerpc-ibm-aix7.0      1039    1114    -366    2124      -1
> powerpc64-linux-gnu      854    2753    -274    3094      -2
> powerpc64le-linux-gnu    648    -551    -340     208      -1
> pru-elf                    5       5      -2       3       1
> riscv32-elf                7       6      -2       5       1
> riscv64-elf                2       5       2       3       2
> rl78-elf                  80    -648     -98      13      -4
> rx-elf                    16       2      -4       5      -1
> s390-linux-gnu            60    -174     -39      14      -1
> s390x-linux-gnu          152    -781    -159      14      -1
> sh-linux-gnu              13       5     -15       7       1
> sparc-linux-gnu           29       7      -3      11      -1
> sparc64-linux-gnu         51       1      -8      15      -1
> tilepro-linux-gnu        119    -567    -164      15      -2
> v850-elf                   4       4      -1       3       1
> vax-netbsdelf             10      13      -4       5       1
> visium-elf                 4       0      -5       3       1
> x86_64-darwin              7     -12      -9       4      -2
> x86_64-linux-gnu           6     -11      -6       4      -2
> xstormy16-elf             10      13       1       2       1
> xtensa-elf                 6       8      -1       2       2
> 
> which definitely shows up some outliers I need to look at.

Yeah, huh, it's all over the map.

> The second set is for:
> 
> (B) --param run-combine=6 (both passes), use-use combinations only
> (C) --param run-combine=6 (both passes), no restrictions
> 
> Target                 Tests   Delta    Best   Worst  Median
> ======                 =====   =====    ====   =====  ======
> aarch64-linux-gnu        272   -3844    -585      18      -1
> aarch64_be-linux-gnu     190   -3336    -370      18      -1
> alpha-linux-gnu          401   -2735    -370      22      -2
> amdgcn-amdhsa            188    1867    -484    1259      -1
> arc-elf                  257   -1498    -650      54      -1
> arm-linux-gnueabi        168   -1117    -612     680      -1
> arm-linux-gnueabihf      168   -1117    -612     680      -1
> avr-elf                 1341 -111401  -13824     680     -10

Things like this are kind of suspicious :-)

> bfin-elf                1346  -18950   -8461     465      -2
> bpf-elf                   63    -496     -60       3      -2
> c6x-elf                  179  -10527  -10084      41      -2
> cr16-elf                1616  -51479  -10657      42     -13
> cris-elf                 113    -533     -84       4      -2
> csky-elf                 129   -3399    -474       1      -2
> epiphany-elf             151    -375    -149      84      -1
> fr30-elf                 155   -1773    -756     289      -2
> frv-linux-gnu            808  -13332   -2074      67      -1
> ft32-elf                 276   -1688    -111      -1      -2
> h8300-elf                527  -11522   -1747      68      -3
> hppa64-hp-hpux11.23      179    -865    -142      34      -1
> i686-apple-darwin        335   -1266     -56      44      -1
> i686-pc-linux-gnu        222   -2216    -556      32      -1
> ia64-linux-gnu           122   -4793   -1134      40      -5
> iq2000-elf               171   -1341     -61       3      -2
> lm32-elf                 187   -1814    -316      47      -2
> m32r-elf                  70    -597     -98      11      -2
> m68k-linux-gnu           197   -2375    -332     148      -2
> mcore-elf                125   -1236    -146       7      -1
> microblaze-elf           442   -4498   -2094      32      -2
> mipsel-linux-gnu         125   -2050    -222      60      -2
> mipsisa64-linux-gnu      107   -2015    -130      14      -2
> mmix                     103    -239     -26       4      -1
> mn10300-elf              215   -1039    -234      80      -1
> moxie-rtems              149    -754     -79       4      -2
> msp430-elf               180    -600     -63      19      -1
> nds32le-elf              183    -287     -37      32      -1
> nios2-linux-gnu           81    -329     -66       4      -1
> nvptx-none               200   -1882    -208      -2      -2
> or1k-elf                  57    -317     -25       2      -1
> pdp11                    207   -1441    -182      83      -2
> powerpc-ibm-aix7.0       400   -4145    -271      14      -2
> powerpc64-linux-gnu      375   -2062    -160     117      -2
> powerpc64le-linux-gnu    491   -4169    -700     156      -2
> pru-elf                   47   -7020   -6921       6      -1
> riscv32-elf               59   -1379    -139       7      -2
> riscv64-elf               89   -1562    -264       7      -1
> rl78-elf                 289  -16157   -1665      42      -6
> rx-elf                    82    -195     -53       8      -1
> s390-linux-gnu           128   -2108   -1485      63      -1
> s390x-linux-gnu          112     418     -32     522      -1
> sh-linux-gnu             218    -410    -108      68      -1
> sparc-linux-gnu          141    -866     -99      18      -1
> sparc64-linux-gnu        129    -792    -102       3      -2
> tilepro-linux-gnu        953   -4331    -297     332      -2
> v850-elf                  50    -412     -53       2      -3
> vax-netbsdelf            254   -3328    -400       4      -2
> visium-elf               100    -693    -138      16      -1
> x86_64-darwin            345   -2134    -490      72      -1
> x86_64-linux-gnu         307    -843    -288     210      -1
> xstormy16-elf            218    -788    -156      59      -1
> xtensa-elf               195   -1426    -322      36       1
> 
> So the main benefit does seem to come from the def-use part.
> 
> Here are some powerpc64le-linux-gnu examples of (B)->(C):
> 
> gcc.c-torture/execute/20171008-1.c:
> 
> @@ -79,8 +79,7 @@
>         stdu 1,-32(1)
>  .LCFI5:
>         bl foo
> -       rlwinm 3,3,0,0xff
> -       cmpwi 0,3,0
> +       andi. 9,3,0xff
>         bne 0,.L13
>         addi 1,1,32

So this starts as

insn_cost 4 for     6: r118:SI=r124:SI
      REG_DEAD r124:SI
insn_cost 4 for     8: r121:SI=zero_extend(r118:SI#0)
      REG_DEAD r118:SI
insn_cost 4 for     9: r122:CC=cmp(r121:SI,0)
      REG_DEAD r121:SI

and then it combines 6->8 of course, but then

Trying 8 -> 9:
    8: r121:SI=zero_extend(r124:SI#0)
      REG_DEAD r124:SI
    9: r122:CC=cmp(r121:SI,0)
      REG_DEAD r121:SI
Failed to match this instruction:
(set (reg:CC 122)
    (compare:CC (subreg:QI (reg:SI 124) 0)
        (const_int 0 [0])))

Hrm, that is a bad idea in general; why do we do that?

> gcc.c-torture/execute/pr28982a.c:
> 
> @@ -427,15 +427,13 @@
>         stxvd2x 0,7,6
>         .align 4
>  .L9:
> -       xxlor 12,32,32
> +       xvcvsxwsp 0,32
>         vadduwm 0,0,1
>         addi 10,9,16
> -       xvcvsxwsp 0,12
> -       xxlor 12,32,32
> +       stxvd2x 0,0,9
> +       xvcvsxwsp 0,32
> +       addi 9,9,32
>         vadduwm 0,0,1
> -       stxvd2x 0,0,9
> -       xvcvsxwsp 0,12
> -       addi 9,9,32
>         stxvd2x 0,0,10
>         bdnz .L9
>         li 3,4
> 
> (Disclaimer: I have no idea if that's correct.)

This seems to be -O3 -mcpu=power8.

It looks correct to me.  It saves the two xxlor insns (which are
just register move instructions).  RA couldn't fix this up because there
are two uses of the register (the xvcvsxwsp -- convert V4SI to V4SF; and
the vadduwm -- V4SI addition), and RA doesn't reorder code.

I would say your new code is better.

> gcc.c-torture/execute/pr65215-3.c:
> 
> @@ -56,11 +56,10 @@
>         srdi 10,3,32
>         srdi 9,3,56
>         slwi 6,10,24
> -       srwi 7,10,8
> +       rlwinm 7,10,24,16,23
>         or 9,9,6
> -       rlwinm 7,7,0,16,23
> +       rlwinm 10,10,8,8,15
>         or 9,9,7
> -       rlwinm 10,10,8,8,15
>         or 9,9,10
>         cmpw 0,9,8
>         bne 0,.L4

insn_cost 4 for    15: r139:SI=r118:DI#0 0>>0x8
insn_cost 4 for    16: r140:SI=r139:SI&0xff00
      REG_DEAD r139:SI

(that's both of those insns setting r7).  r118 does not die here yet,
that's only in the 10,10 insn.

Trying 15 -> 16:
   15: r139:SI=r118:DI#0 0>>0x8
   16: r140:SI=r139:SI&0xff00
      REG_DEAD r139:SI
Failed to match this instruction:
(set (reg:SI 140)
    (and:SI (subreg:SI (zero_extract:DI (reg:DI 118 [ _2 ])
                (const_int 32 [0x20])
                (const_int 24 [0x18])) 0)
        (const_int 65280 [0xff00])))
Failed to match this instruction:
(set (reg:SI 140)
    (and:SI (subreg:SI (and:DI (lshiftrt:DI (reg:DI 118 [ _2 ])
                    (const_int 8 [0x8]))
                (const_int 4294967295 [0xffffffff])) 0)
        (const_int 65280 [0xff00])))

Yeah, it's one of those, make_compound_operation :-/  (*^%$(*^$(*@^

> Just to emphasise though: I'm not proposing that we switch this on for
> all targets yet.  It would be opt-in until the pass is more mature.

I do have to wonder if it is a bit late for stage 1.  But opt-in in the
sense that the user has to pass some flag should be fine, I guess?
Enabling it by default for some targets might not be so great, though,
esp. for primary targets.

> But that FFR case is really important for the situation it handles.

Yeah.

I hope to have some time to review your actual patch soon.  Should be
less depressing than some of the combine failures :-)


Segher
Nicholas Krause Nov. 21, 2019, 6:37 p.m. UTC | #8
On 11/17/19 6:35 PM, Richard Sandiford wrote:
> (It's 23:35 local time, so it's still just about stage 1. :-))
>
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).
>
> Combine currently requires the combined instruction to live at the same
> location as i3.  I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.
>
> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.
>
> This is still far from being a realistic replacement for even
> the combine parts of the current combine pass.  E.g.:
>
> - it only handles combinations that can be built up from individual
>    two-instruction combinations.
>
> - it doesn't allow new hard register clobbers to be added.
>
> - it doesn't have the special treatment of CC operations.
>
> - etc.
>
> But we have to start somewhere.
>
> On a more positive note, the pass handles things that the current
> combine pass doesn't:
>
> - the main motivating feature mentioned above: it works out where
>    the combined instruction could validly live and moves it there
>    if necessary.  If there are a range of valid places, it tries
>    to pick the best one based on register pressure (although only
>    with a simple heuristic for now).
>
> - once it has combined two instructions, it can try combining the
>    result with both later and earlier code, i.e. it can combine
>    in both directions.
>
> - it tries using REG_EQUAL notes for the final instruction.
>
> - it can parallelise two independent instructions that both read from
>    the same register or both read from memory.
>
> This last feature is useful for generating more load-pair combinations
> on AArch64.  In some cases it can also produce more store-pair combinations,
> but only for consecutive stores.  However, since the pass currently does
> this in a very greedy, peephole way, it only allows load/store-pair
> combinations if the first memory access has a higher alignment than
> the second, i.e. if we can be sure that the combined access is naturally
> aligned.  This should help it to make better decisions than the post-RA
> peephole pass in some cases while not being too aggressive.
>
> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.  Once it has combined two
> instructions, it'll try up to C combinations on the result, but this can
> be counted against the instruction that was deleted by the combination
> and so effectively just doubles the constant.  (Note that C depends
> on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
>
> Unfortunately, debug updates via propagate_for_debug are more expensive.
> This could probably be fixed if the pass did more to track debug insns
> itself, but using propagate_for_debug matches combine's behaviour.
>
> The patch adds two instances of the new pass: one before combine and
> one after it.  By default both are disabled, but this can be changed
> using the new 3-bit run-combine param, where:
>
> - bit 0 selects the new pre-combine pass
> - bit 1 selects the main combine pass
> - bit 2 selects the new post-combine pass
>
> The idea is that run-combine=3 can be used to see which combinations
> are missed by the new pass, while run-combine=6 (which I hope to be
> the production setting for AArch64 at -O2+) just uses the new pass
> to mop up cases that normal combine misses.  Maybe in some distant
> future, the pass will be good enough for run-combine=[14] to be a
> realistic option.
>
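To make the bit encoding concrete for other readers, here is a minimal standalone sketch of how a run-combine value maps onto the three passes.  Only the bit meanings are taken from the description above; the enum, the struct and decode_run_combine are hypothetical names used for illustration and are not part of the patch.

```cpp
#include <cassert>

// Hypothetical names; only the bit assignments come from the patch
// description (bit 0 = pre-combine, bit 1 = main combine, bit 2 = post-combine).
enum run_combine_bit : unsigned int
{
  RUN_PRE_COMBINE = 1u << 0,   // new pass, run before combine
  RUN_MAIN_COMBINE = 1u << 1,  // existing combine pass
  RUN_POST_COMBINE = 1u << 2   // new pass, run after combine
};

struct combine_selection
{
  bool pre_combine;
  bool main_combine;
  bool post_combine;
};

// Decode a 3-bit run-combine value into one flag per pass.
inline combine_selection
decode_run_combine (unsigned int param)
{
  return { (param & RUN_PRE_COMBINE) != 0,
	   (param & RUN_MAIN_COMBINE) != 0,
	   (param & RUN_POST_COMBINE) != 0 };
}
```

With this reading, 2 (the Init value in params.opt) runs only the existing pass, 3 runs the new pass followed by combine, and 6 runs combine followed by the new pass.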
> I ended up having to add yet another validate_simplify_* routine,
> this time to do the equivalent of:
>
>     newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
>     validate_change (insn, loc, newx, 1);
>
> but in a more memory-efficient way.  validate_replace_rtx isn't suitable
> because it deliberately only tries simplifications in limited cases:
>
>    /* Do changes needed to keep rtx consistent.  Don't do any other
>       simplifications, as it is not our job.  */
>
> And validate_simplify_insn isn't useful for this case because it works
> on patterns that have already had changes made to them and expects
> those patterns to be valid rtxes.  simplify-replace operations instead
> need to simplify as they go, when the original modes are still to hand.
>
> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
>
> run-combine=2 (normal combine):  100.0% (baseline)
> run-combine=4 (new pass only)     98.0%
> run-combine=6 (both passes)      100.3%
>
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.
>
> To get a feel for the effect on multiple targets, I did my usual
> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> and g++.dg, this time comparing run-combine=2 and run-combine=6
> using -O2 -ftree-vectorize:
>
> Target                 Tests   Delta    Best   Worst  Median
> ======                 =====   =====    ====   =====  ======
> aarch64-linux-gnu       3974  -39393   -2275      90      -2
> aarch64_be-linux-gnu    3389  -36683   -2275     165      -2
> alpha-linux-gnu         4154  -62860   -2132     335      -2
> amdgcn-amdhsa           4818    9079   -7987   51850      -2
> arc-elf                 2868  -63710  -18998     286      -1
> arm-linux-gnueabi       4053  -80404  -10019     605      -2
> arm-linux-gnueabihf     4053  -80404  -10019     605      -2
> avr-elf                 3620   38513   -2386   23364       2
> bfin-elf                2691  -32973   -1483    1127      -2
> bpf-elf                 5581  -78105  -11064     113      -3
> c6x-elf                 3915  -31710   -2441    1560      -2
> cr16-elf                6030  192102   -1757   60009      12
> cris-elf                2217  -30794   -1716     294      -2
> csky-elf                2003  -24989   -9999    1468      -2
> epiphany-elf            3345  -19416   -1803    4594      -2
> fr30-elf                3562  -15077   -1921    2334      -1
> frv-linux-gnu           2423  -16589   -1736     999      -1
> ft32-elf                2246  -46337  -15988     433      -2
> h8300-elf               2581  -33553   -1403     168      -2
> hppa64-hp-hpux11.23     3926 -120876  -50134    1056      -2
> i686-apple-darwin       3562  -46851   -1764     310      -2
> i686-pc-linux-gnu       2902   -3639   -4809    6848      -2
> ia64-linux-gnu          2900 -158870  -14006     428      -7
> iq2000-elf              2929  -54690   -2904    2576      -3
> lm32-elf                5265  162519   -1918    8004       5
> m32r-elf                1861  -25296   -2713    1004      -2
> m68k-linux-gnu          2520 -241573  -21879     200      -3
> mcore-elf               2378  -28532   -1810    1635      -2
> microblaze-elf          2782 -137363   -9516    1986      -2
> mipsel-linux-gnu        2443  -38422   -8331     458      -1
> mipsisa64-linux-gnu     2287  -60294  -12214     432      -2
> mmix                    4910 -136549  -13616     599      -2
> mn10300-elf             2944  -29151   -2488     132      -1
> moxie-rtems             1935  -12364   -1002     125      -1
> msp430-elf              2379  -37007   -2163     176      -2
> nds32le-elf             2356  -27551   -2126     163      -1
> nios2-linux-gnu         1572  -44828  -23613      92      -2
> nvptx-none              1014  -17337   -1590      16      -3
> or1k-elf                2724  -92816  -14144      56      -3
> pdp11                   1897  -27296   -1370     534      -2
> powerpc-ibm-aix7.0      2909  -58829  -10026    2001      -2
> powerpc64-linux-gnu     3685  -60551  -12158    2001      -1
> powerpc64le-linux-gnu   3501  -61846  -10024     765      -2
> pru-elf                 1574  -29734  -19998    1718      -1
> riscv32-elf             2357  -22506  -10002   10175      -1
> riscv64-elf             3320  -56777  -10002     226      -2
> rl78-elf                2113 -232328  -18607    4065      -3
> rx-elf                  2800  -38515    -896     491      -2
> s390-linux-gnu          3582  -75626  -12098    3999      -2
> s390x-linux-gnu         3761  -73473  -13748    3999      -2
> sh-linux-gnu            2350  -26401   -1003     522      -2
> sparc-linux-gnu         3279  -49518   -2175    2223      -2
> sparc64-linux-gnu       3849 -123084  -30200    2141      -2
> tilepro-linux-gnu       2737  -35562   -3458    2848      -2
> v850-elf                9002 -169126  -49996      76      -4
> vax-netbsdelf           3325  -57734  -10000    1989      -2
> visium-elf              1860  -17006   -1006    1066      -2
> x86_64-darwin           3278  -48933   -9999    1408      -2
> x86_64-linux-gnu        3008  -43887   -9999    3248      -2
> xstormy16-elf           2497  -26569   -2051      89      -2
> xtensa-elf              2161  -31231   -6910     138      -2
>
> So running both passes does seem to have a significant benefit
> on most targets, but there are some nasty-looking outliers.
> The usual caveat applies: number of lines is a very poor measurement,
> it's just to get a feel.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu with both run-combine=3 as the default (so that the new
> pass runs first) and with run-combine=6 as the default (so that the new
> pass runs second).  There were no new execution failures.  A couple of
> guality.exp tests that already failed for most options started failing
> for a couple more.  Enabling the pass fixes the XFAILs in:
>
> gcc.target/aarch64/sve/acle/general/ptrue_pat_[234].c
>
> Inevitably there was some scan-assembler fallout for other tests.
> E.g. in gcc.target/aarch64/vmov_n_1.c:
>
> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>    ...
>    INHIB_OPTIMIZATION;							\
>    (a) = TEST (test, data_len);						\
>    INHIB_OPTIMIZATION;							\
>    (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\
>
> is no longer effective for preventing move (a) from being merged
> into (b), because the pass can merge at the point of (a).  I think
> this is a valid thing to do -- the asm semantics are still satisfied,
> and asm volatile ("" : : : "memory") never acted as a register barrier.
> But perhaps we should deal with this as a special case?
>
> Richard
I've reviewed the patch, but I'm not an expert on combine, so I only
have a few small comments.  Segher probably has more comments
than I do anyhow.
Nick
>
>
> 2019-11-17  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
> 	* Makefile.in (OBJS): Add combine2.o
> 	* params.opt (--param=run-combine): New option.
> 	* doc/invoke.texi: Document it.
> 	* tree-pass.h (make_pass_combine2_before): Declare.
> 	(make_pass_combine2_after): Likewise.
> 	* passes.def: Add them.
> 	* timevar.def (TV_COMBINE2): New timevar.
> 	* cfgrtl.h (update_cfg_for_uncondjump): Declare.
> 	* combine.c (update_cfg_for_uncondjump): Move to...
> 	* cfgrtl.c (update_cfg_for_uncondjump): ...here.
> 	* simplify-rtx.c (simplify_truncation): Handle comparisons.
> 	* recog.h (validate_simplify_replace_rtx): Declare.
> 	* recog.c (validate_simplify_replace_rtx_1): New function.
> 	(validate_simplify_replace_rtx_uses): Likewise.
> 	(validate_simplify_replace_rtx): Likewise.
> 	* combine2.c: New file.
>
> Index: gcc/Makefile.in
> ===================================================================
> --- gcc/Makefile.in	2019-11-14 14:34:27.599783740 +0000
> +++ gcc/Makefile.in	2019-11-17 23:15:31.188500613 +0000
> @@ -1261,6 +1261,7 @@ OBJS = \
>   	cgraphunit.o \
>   	cgraphclones.o \
>   	combine.o \
> +	combine2.o \
>   	combine-stack-adj.o \
>   	compare-elim.o \
>   	context.o \
> Index: gcc/params.opt
> ===================================================================
> --- gcc/params.opt	2019-11-14 14:34:26.339792215 +0000
> +++ gcc/params.opt	2019-11-17 23:15:31.200500531 +0000
> @@ -768,6 +768,10 @@ Use internal function id in profile look
>   Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
>   Maximum depth of a loop nest to fully value-number optimistically.
>   
> +-param=run-combine=
> +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
> +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
> +
>   -param=sccvn-max-alias-queries-per-access=
>   Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
>   Maximum number of disambiguations to perform per memory access.
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi	2019-11-16 10:43:45.597105823 +0000
> +++ gcc/doc/invoke.texi	2019-11-17 23:15:31.200500531 +0000
> @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las
>   @item max-combine-insns
>   The maximum number of instructions the RTL combiner tries to combine.
>   
> +@item run-combine
> +Choose which of the 3 available combine passes to run: bit 1 for the main
> +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
> +for a later variant of the combine pass.
> +
>   @item integer-share-limit
>   Small integer constants can use a shared data structure, reducing the
>   compiler's memory usage and increasing its speed.  This sets the maximum
> Index: gcc/tree-pass.h
> ===================================================================
> --- gcc/tree-pass.h	2019-10-29 08:29:03.096444049 +0000
> +++ gcc/tree-pass.h	2019-11-17 23:15:31.204500501 +0000
> @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
>   extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
>   extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
> Index: gcc/passes.def
> ===================================================================
> --- gcc/passes.def	2019-10-29 08:29:03.224443133 +0000
> +++ gcc/passes.def	2019-11-17 23:15:31.200500531 +0000
> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
>         NEXT_PASS (pass_inc_dec);
>         NEXT_PASS (pass_initialize_regs);
>         NEXT_PASS (pass_ud_rtl_dce);
> +      NEXT_PASS (pass_combine2_before);
>         NEXT_PASS (pass_combine);
> +      NEXT_PASS (pass_combine2_after);
>         NEXT_PASS (pass_if_after_combine);
>         NEXT_PASS (pass_jump_after_combine);
>         NEXT_PASS (pass_partition_blocks);
> Index: gcc/timevar.def
This really seems to be two passes, or at least two functions.  Just a nit,
but you may want to state that explicitly, as I don't recall reading it.
> ===================================================================
> --- gcc/timevar.def	2019-10-11 15:43:53.403498517 +0100
> +++ gcc/timevar.def	2019-11-17 23:15:31.204500501 +0000
> @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC          , "
>   DEFTIMEVAR (TV_CSE2                  , "CSE 2")
>   DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
>   DEFTIMEVAR (TV_COMBINE               , "combiner")
> +DEFTIMEVAR (TV_COMBINE2              , "second combiner")
>   DEFTIMEVAR (TV_IFCVT		     , "if-conversion")
>   DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
>   DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
> Index: gcc/cfgrtl.h
> ===================================================================
> --- gcc/cfgrtl.h	2019-03-08 18:15:39.320730391 +0000
> +++ gcc/cfgrtl.h	2019-11-17 23:15:31.192500584 +0000
> @@ -47,6 +47,7 @@ extern void fixup_partitions (void);
>   extern bool purge_dead_edges (basic_block);
>   extern bool purge_all_dead_edges (void);
>   extern bool fixup_abnormal_edges (void);
> +extern void update_cfg_for_uncondjump (rtx_insn *);
>   extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
>   extern void relink_block_chain (bool);
>   extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
> Index: gcc/combine.c
> ===================================================================
> --- gcc/combine.c	2019-11-13 08:42:45.537368745 +0000
> +++ gcc/combine.c	2019-11-17 23:15:31.192500584 +0000
> @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg)
>   	 && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
>   }
>   
> -/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
> -   Note that the INSN should be deleted *after* removing dead edges, so
> -   that the kept edge is the fallthrough edge for a (set (pc) (pc))
> -   but not for a (set (pc) (label_ref FOO)).  */
> -
> -static void
> -update_cfg_for_uncondjump (rtx_insn *insn)
> -{
> -  basic_block bb = BLOCK_FOR_INSN (insn);
> -  gcc_assert (BB_END (bb) == insn);
> -
> -  purge_dead_edges (bb);
> -
> -  delete_insn (insn);
> -  if (EDGE_COUNT (bb->succs) == 1)
> -    {
> -      rtx_insn *insn;
> -
> -      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
> -
> -      /* Remove barriers from the footer if there are any.  */
> -      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
> -	if (BARRIER_P (insn))
> -	  {
> -	    if (PREV_INSN (insn))
> -	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
> -	    else
> -	      BB_FOOTER (bb) = NEXT_INSN (insn);
> -	    if (NEXT_INSN (insn))
> -	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
> -	  }
> -	else if (LABEL_P (insn))
> -	  break;
> -    }
> -}
> -
>   /* Return whether PAT is a PARALLEL of exactly N register SETs followed
>      by an arbitrary number of CLOBBERs.  */
>   static bool
> @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine =
>     {}
>   
>     /* opt_pass methods: */
> -  virtual bool gate (function *) { return (optimize > 0); }
> +  virtual bool gate (function *)
> +    {
> +      return optimize > 0 && (param_run_combine & 2) != 0;
> +    }
>     virtual unsigned int execute (function *)
>       {
>         return rest_of_handle_combine ();
> Index: gcc/cfgrtl.c
> ===================================================================
> --- gcc/cfgrtl.c	2019-10-17 14:22:55.523309009 +0100
> +++ gcc/cfgrtl.c	2019-11-17 23:15:31.188500613 +0000
> @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void)
>     return inserted;
>   }
>   
> +/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
> +   Note that the INSN should be deleted *after* removing dead edges, so
> +   that the kept edge is the fallthrough edge for a (set (pc) (pc))
> +   but not for a (set (pc) (label_ref FOO)).  */
> +
> +void
> +update_cfg_for_uncondjump (rtx_insn *insn)
> +{
> +  basic_block bb = BLOCK_FOR_INSN (insn);
> +  gcc_assert (BB_END (bb) == insn);
> +
> +  purge_dead_edges (bb);
> +
> +  delete_insn (insn);
> +  if (EDGE_COUNT (bb->succs) == 1)
> +    {
> +      rtx_insn *insn;
> +
> +      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
> +
> +      /* Remove barriers from the footer if there are any.  */
> +      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
> +	if (BARRIER_P (insn))
> +	  {
> +	    if (PREV_INSN (insn))
> +	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
> +	    else
> +	      BB_FOOTER (bb) = NEXT_INSN (insn);
> +	    if (NEXT_INSN (insn))
> +	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
> +	  }
> +	else if (LABEL_P (insn))
> +	  break;
> +    }
> +}
> +
>   /* Cut the insns from FIRST to LAST out of the insns stream.  */
>   
>   rtx_insn *
> Index: gcc/simplify-rtx.c
> ===================================================================
> --- gcc/simplify-rtx.c	2019-11-16 15:33:36.642840131 +0000
> +++ gcc/simplify-rtx.c	2019-11-17 23:15:31.204500501 +0000
> @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode,
>         && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
>       return constm1_rtx;
>   
> +  /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
> +     in a narrower mode if useful.  */
> +  if (COMPARISON_P (op))
> +    return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
> +				    XEXP (op, 0), XEXP (op, 1));
> +
>     return NULL_RTX;
>   }
>   
> Index: gcc/recog.h
> ===================================================================
> --- gcc/recog.h	2019-09-09 18:58:28.860430363 +0100
> +++ gcc/recog.h	2019-11-17 23:15:31.204500501 +0000
> @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos
>   extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
>   extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
>   extern bool validate_simplify_insn (rtx_insn *insn);
> +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
>   extern int num_changes_pending (void);
>   extern int next_insn_tests_no_inequality (rtx_insn *);
>   extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
> Index: gcc/recog.c
> ===================================================================
> --- gcc/recog.c	2019-10-01 09:55:35.150088599 +0100
> +++ gcc/recog.c	2019-11-17 23:15:31.204500501 +0000
> @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn)
>         }
>     return ((num_changes_pending () > 0) && (apply_change_group () > 0));
>   }
> +
> +/* A subroutine of validate_simplify_replace_rtx.  Apply the replacement
> +   described by R to LOC.  Return true on success; leave the caller
> +   to clean up on failure.  */
> +
> +static bool
> +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
> +{
> +  rtx x = *loc;
> +  enum rtx_code code = GET_CODE (x);
> +  machine_mode mode = GET_MODE (x);
> +
> +  if (rtx_equal_p (x, r.from))
> +    {
> +      validate_unshare_change (r.insn, loc, r.to, 1);
> +      return true;
> +    }
> +
> +  /* Recursively apply the substitution and see if we can simplify
> +     the result.  This specifically shouldn't use simplify_gen_*,
> +     since we want to avoid generating new expressions where possible.  */
> +  int old_num_changes = num_validated_changes ();
> +  rtx newx = NULL_RTX;
> +  bool recurse_p = false;
> +  switch (GET_RTX_CLASS (code))
> +    {
> +    case RTX_UNARY:
> +      {
> +	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
> +	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
> +	  return false;
> +
> +	newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
> +	break;
> +      }
> +
> +    case RTX_BIN_ARITH:
> +    case RTX_COMM_ARITH:
> +      {
> +	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +	  return false;
> +
> +	newx = simplify_binary_operation (code, mode,
> +					  XEXP (x, 0), XEXP (x, 1));
> +	break;
> +      }
> +
> +    case RTX_COMPARE:
> +    case RTX_COMM_COMPARE:
> +      {
> +	machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
> +				? GET_MODE (XEXP (x, 0))
> +				: GET_MODE (XEXP (x, 1)));
> +	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +	  return false;
> +
> +	newx = simplify_relational_operation (code, mode, op_mode,
> +					      XEXP (x, 0), XEXP (x, 1));
> +	break;
> +      }
> +
> +    case RTX_TERNARY:
> +    case RTX_BITFIELD_OPS:
> +      {
> +	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
> +	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
> +	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
> +	  return false;
> +
> +	newx = simplify_ternary_operation (code, mode, op0_mode,
> +					   XEXP (x, 0), XEXP (x, 1),
> +					   XEXP (x, 2));
> +	break;
> +      }
> +
> +    case RTX_EXTRA:
> +      if (code == SUBREG)
> +	{
> +	  machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
> +	  if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
> +	    return false;
> +
> +	  rtx inner = SUBREG_REG (x);
> +	  newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
> +	  /* Reject the same cases that simplify_gen_subreg would.  */
> +	  if (!newx
> +	      && (GET_CODE (inner) == SUBREG
> +		  || GET_CODE (inner) == CONCAT
> +		  || GET_MODE (inner) == VOIDmode
> +		  || !validate_subreg (mode, inner_mode,
> +				       inner, SUBREG_BYTE (x))))
> +	    return false;
> +	  break;
> +	}
> +      else
> +	recurse_p = true;
> +      break;
> +
> +    case RTX_OBJ:
> +      if (code == LO_SUM)
> +	{
> +	  if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
> +	      || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
> +	    return false;
> +
> +	  /* (lo_sum (high x) y) -> y where x and y have the same base.  */
> +	  rtx op0 = XEXP (x, 0);
> +	  rtx op1 = XEXP (x, 1);
> +	  if (GET_CODE (op0) == HIGH)
> +	    {
> +	      rtx base0, base1, offset0, offset1;
> +	      split_const (XEXP (op0, 0), &base0, &offset0);
> +	      split_const (op1, &base1, &offset1);
> +	      if (rtx_equal_p (base0, base1))
> +		newx = op1;
> +	    }
> +	}
> +      else if (code == REG)
> +	{
> +	  if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
> +	    return false;
> +	}
> +      else
> +	recurse_p = true;
> +      break;
> +
> +    case RTX_CONST_OBJ:
> +      break;
> +
> +    case RTX_AUTOINC:
> +      if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
> +	return false;
> +      recurse_p = true;
> +      break;
> +
> +    case RTX_MATCH:
> +    case RTX_INSN:
> +      gcc_unreachable ();
> +    }
> +
> +  if (recurse_p)
> +    {
> +      const char *fmt = GET_RTX_FORMAT (code);
> +      for (int i = 0; fmt[i]; i++)
> +	switch (fmt[i])
> +	  {
> +	  case 'E':
> +	    for (int j = 0; j < XVECLEN (x, i); j++)
> +	      if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
> +		return false;
> +	    break;
> +
> +	  case 'e':
> +	    if (XEXP (x, i)
> +		&& !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
> +	      return false;
> +	    break;
> +	  }
> +    }
> +
> +  if (newx && !rtx_equal_p (x, newx))
> +    {
> +      /* There's no longer any point unsharing the substitutions made
> +	 for subexpressions, since we'll just copy this one instead.  */
> +      for (int i = old_num_changes; i < num_changes; ++i)
> +	changes[i].unshare = false;
> +      validate_unshare_change (r.insn, loc, newx, 1);
> +    }
> +
> +  return true;
> +}
> +
> +/* A note_uses callback for validate_simplify_replace_rtx.
> +   DATA points to a validate_replace_src_data object.  */
> +
> +static void
> +validate_simplify_replace_rtx_uses (rtx *loc, void *data)
> +{
> +  validate_replace_src_data &r = *(validate_replace_src_data *) data;
> +  if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
> +    r.insn = NULL;
> +}
> +
> +/* Try to perform the equivalent of:
> +
> +      newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
> +      validate_change (INSN, LOC, newx, 1);
> +
> +   but without generating as much garbage rtl when the resulting
> +   pattern doesn't match.
> +
> +   Return true if we were able to replace all uses of OLD_RTX in *LOC
> +   and if the result conforms to general rtx rules (e.g. for whether
> +   subregs are meaningful).
> +
> +   When returning true, add all replacements to the current validation group,
> +   leaving the caller to test it in the normal way.  Leave both *LOC and the
> +   validation group unchanged on failure.  */
> +
> +bool
> +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
> +			       rtx old_rtx, rtx new_rtx)
> +{
> +  validate_replace_src_data r;
> +  r.from = old_rtx;
> +  r.to = new_rtx;
> +  r.insn = insn;
> +
> +  unsigned int num_changes = num_validated_changes ();
> +  note_uses (loc, validate_simplify_replace_rtx_uses, &r);
> +  if (!r.insn)
> +    {
> +      cancel_changes (num_changes);
> +      return false;
> +    }
> +  return true;
> +}
>   
>   /* Return 1 if the insn using CC0 set by INSN does not contain
>      any ordered tests applied to the condition codes.
> Index: gcc/combine2.c
> ===================================================================
> --- /dev/null	2019-09-17 11:41:18.176664108 +0100
> +++ gcc/combine2.c	2019-11-17 23:15:31.196500559 +0000
> @@ -0,0 +1,1576 @@
> +/* Combine instructions
> +   Copyright (C) 2019 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "tree-pass.h"
> +#include "memmodel.h"
> +#include "emit-rtl.h"
> +#include "insn-config.h"
> +#include "recog.h"
> +#include "print-rtl.h"
> +#include "rtl-iter.h"
> +#include "predict.h"
> +#include "cfgcleanup.h"
> +#include "cfghooks.h"
> +#include "cfgrtl.h"
> +#include "alias.h"
> +#include "valtrack.h"
> +
> +/* This pass tries to combine instructions in the following ways:
> +
> +   (1) If we have two dependent instructions:
> +
> +	 I1: (set DEST1 SRC1)
> +	 I2: (...DEST1...)
> +
> +       and I2 is the only user of DEST1, the pass tries to combine them into:
> +
> +	 I2: (...SRC1...)
> +
> +   (2) If we have two dependent instructions:
> +
> +	 I1: (set DEST1 SRC1)
> +	 I2: (...DEST1...)
> +
> +       the pass tries to combine them into:
> +
> +	 I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
> +
> +       or:
> +
> +	 I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
> +
> +   (3) If we have two independent instructions:
> +
> +	 I1: (set DEST1 SRC1)
> +	 I2: (set DEST2 SRC2)
> +
> +       that read from memory or from the same register, the pass tries to
> +       combine them into:
> +
> +	 I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
> +
> +       or:
> +
> +	 I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
> +
> +   If the combined form is a valid instruction, the pass tries to find a
> +   place between I1 and I2 inclusive for the new instruction.  If there
> +   are multiple valid locations, it tries to pick the best one by taking
> +   the effect on register pressure into account.
> +
> +   If a combination succeeds and produces a single set, the pass tries to
> +   combine the new form with earlier or later instructions.
> +
> +   The pass currently optimizes each basic block separately.  It walks
> +   the instructions in reverse order, building up live ranges for registers
> +   and memory.  It then uses these live ranges to look for possible
> +   combination opportunities and to decide where the combined instructions
> +   could be placed.
> +
> +   The pass represents positions in the block using point numbers,
> +   with higher numbers indicating earlier instructions.  The numbering
> +   scheme is that:
> +
> +   - the end of the current instruction sequence has an even base point B.
> +
> +   - instructions initially have odd-numbered points B + 1, B + 3, etc.
> +     with B + 1 being the final instruction in the sequence.
> +
> +   - even points after B represent gaps between instructions where combined
> +     instructions could be placed.
> +
> +   Thus even points initially represent no instructions and odd points
> +   initially represent single instructions.  However, when picking a
> +   place for a combined instruction, the pass may choose somewhere
> +   inbetween the original two instructions, so that over time a point
> +   may come to represent several instructions.  When this happens,
> +   the pass maintains the invariant that all instructions with the same
> +   point number are independent of each other and thus can be treated as
> +   acting in parallel (or as acting in any arbitrary sequence).
> +
> +   TODOs:
> +
> +   - Handle 3-instruction combinations, and possibly more.
> +
> +   - Handle existing clobbers more efficiently.  At the moment we can't
> +     move an instruction that clobbers R across another instruction that
> +     clobbers R.
> +
> +   - Allow hard register clobbers to be added, like combine does.
> +
> +   - Perhaps work on EBBs, or SESE regions.  */
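The point numbering took me a moment to picture, so here is a small standalone sketch of how I read it.  It assumes that "even points after B" means even numbers greater than B; initial_points and gap_point_p are hypothetical helpers written for this example and do not appear in the patch.

```cpp
#include <cassert>
#include <vector>

// Assign initial points to COUNT instructions, indexed in program order,
// given an even base point BASE.  Higher points mean earlier instructions,
// so the final instruction gets BASE + 1 and each earlier one is 2 higher.
std::vector<unsigned int>
initial_points (unsigned int base, unsigned int count)
{
  std::vector<unsigned int> points (count);
  for (unsigned int i = 0; i < count; ++i)
    points[i] = base + 1 + 2 * (count - 1 - i);
  return points;
}

// Return true if POINT is a gap between instructions, i.e. a place where
// a combined instruction could initially be inserted.  BASE is even, so
// this is the same as asking whether POINT is even and greater than BASE.
inline bool
gap_point_p (unsigned int base, unsigned int point)
{
  return point > base && (point - base) % 2 == 0;
}
```

For a three-instruction block with B = 10, this gives points 15, 13 and 11 in program order, with 14 and 12 as the gaps between them.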
> +
> +namespace {
> +
> +/* The number of explicit uses to record in a live range.  */
> +const unsigned int NUM_RANGE_USERS = 4;
> +
> +/* The maximum number of instructions that we can combine at once.  */
> +const unsigned int MAX_COMBINE_INSNS = 2;
> +
> +/* A fake cost for instructions that we haven't costed yet.  */
> +const unsigned int UNKNOWN_COST = ~0U;
> +
> +class combine2
> +{
> +public:
> +  combine2 (function *);
> +  ~combine2 ();
> +
> +  void execute ();
> +
> +private:
> +  struct insn_info_rec;
> +
> +  /* Describes the live range of a register or of memory.  For simplicity,
> +     we treat memory as a single entity.
> +
> +     If we had a fully-accurate live range, updating it to account for a
> +     moved instruction would be a linear-time operation.  Doing this for
> +     each combination would then make the pass quadratic.  We therefore
> +     just maintain a list of NUM_RANGE_USERS use insns and use simple,
> +     conservatively-correct behavior for the rest.  */
> +  struct live_range_rec
> +  {
> +    /* Which instruction provides the dominating definition, or null if
> +       we don't know yet.  */
> +    insn_info_rec *producer;
> +
> +    /* A selection of instructions that use the resource, in program order.  */
> +    insn_info_rec *users[NUM_RANGE_USERS];
> +
> +    /* An inclusive range of points that covers instructions not mentioned
> +       in USERS.  Both values are zero if there are no such instructions.
> +
> +       Once we've included a use U at point P in this range, we continue
> +       to assume that some kind of use exists at P whatever happens to U
> +       afterwards.  */
> +    unsigned int first_extra_use;
> +    unsigned int last_extra_use;
> +
> +    /* The register number this range describes, or INVALID_REGNUM
> +       for memory.  */
> +    unsigned int regno;
> +
> +    /* Forms a linked list of ranges for the same resource, in program
> +       order.  */
> +    live_range_rec *prev_range;
> +    live_range_rec *next_range;
> +  };
> +
> +  /* Pass-specific information about an instruction.  */
> +  struct insn_info_rec
> +  {
> +    /* The instruction itself.  */
> +    rtx_insn *insn;
> +
> +    /* A null-terminated list of live ranges for the things that this
> +       instruction defines.  */
> +    live_range_rec **defs;
> +
> +    /* A null-terminated list of live ranges for the things that this
> +       instruction uses.  */
> +    live_range_rec **uses;
> +
> +    /* The point at which the instruction appears.  */
> +    unsigned int point;
> +
> +    /* The cost of the instruction, or UNKNOWN_COST if we haven't
> +       measured it yet.  */
> +    unsigned int cost;
> +  };
> +
> +  /* Describes one attempt to combine instructions.  */
> +  struct combination_attempt_rec
> +  {
> +    /* The instruction that we're currently trying to optimize.
> +       If the combination succeeds, we'll use this insn_info_rec
> +       to describe the new instruction.  */
> +    insn_info_rec *new_home;
> +
> +    /* The instructions we're combining, in program order.  */
> +    insn_info_rec *sequence[MAX_COMBINE_INSNS];
Can't we make this a vec so that it can grow to arbitrary lengths, and
then just loop through the vec, merging instructions as required?
> +
> +    /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
> +       live range that describes the substituted register.  */
> +    live_range_rec *def_use_range;
> +
> +    /* The earliest and latest points at which we could insert the
> +       combined instruction.  */
> +    unsigned int earliest_point;
> +    unsigned int latest_point;
> +
> +    /* The cost of the new instruction, once we have a successful match.  */
> +    unsigned int new_cost;
> +  };
> +
> +  /* Pass-specific information about a register.  */
> +  struct reg_info_rec
> +  {
> +    /* The live range associated with the last reference to the register.  */
> +    live_range_rec *range;
> +
> +    /* The point at which the last reference occurred.  */
> +    unsigned int next_ref;
> +
> +    /* True if the register is currently live.  We record this here rather
> +       than in a separate bitmap because (a) there's a natural hole for
> +       it on LP64 hosts and (b) we only refer to it when updating the
> +       other fields, and so recording it here should give better locality.  */
> +    unsigned int live_p : 1;
> +  };
> +
> +  live_range_rec *new_live_range (unsigned int, live_range_rec *);
> +  live_range_rec *reg_live_range (unsigned int);
> +  live_range_rec *mem_live_range ();
> +  bool add_range_use (live_range_rec *, insn_info_rec *);
> +  void remove_range_use (live_range_rec *, insn_info_rec *);
> +  bool has_single_use_p (live_range_rec *);
> +  bool known_last_use_p (live_range_rec *, insn_info_rec *);
> +  unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
> +  unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
> +  bool start_combination (combination_attempt_rec &, insn_info_rec *,
> +			  insn_info_rec *, live_range_rec * = NULL);
> +  bool verify_combination (combination_attempt_rec &);
> +  int estimate_reg_pressure_delta (insn_info_rec *);
> +  void commit_combination (combination_attempt_rec &, bool);
> +  bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
> +  bool try_parallelize_insns (combination_attempt_rec &);
> +  bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
> +  bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
> +  bool try_combine_two_uses (combination_attempt_rec &);
> +  bool try_combine (insn_info_rec *, rtx, unsigned int);
> +  bool optimize_insn (insn_info_rec *);
> +  void record_defs (insn_info_rec *);
> +  void record_reg_use (insn_info_rec *, df_ref);
> +  void record_uses (insn_info_rec *);
> +  void process_insn (insn_info_rec *);
> +  void start_sequence ();
> +
> +  /* The function we're optimizing.  */
> +  function *m_fn;
> +
> +  /* The highest pseudo register number plus one.  */
> +  unsigned int m_num_regs;
> +
> +  /* The current basic block.  */
> +  basic_block m_bb;
> +
> +  /* True if we should optimize the current basic block for speed.  */
> +  bool m_optimize_for_speed_p;
> +
> +  /* The point number to allocate to the next instruction we visit
> +     in the backward traversal.  */
> +  unsigned int m_point;
> +
> +  /* The point number corresponding to the end of the current
> +     instruction sequence, i.e. the lowest point number about which
> +     we still have valid information.  */
> +  unsigned int m_end_of_sequence;
> +
> +  /* The point number corresponding to the end of the current basic block.
> +     This is the same as M_END_OF_SEQUENCE when processing the last
> +     instruction sequence in a basic block.  */
> +  unsigned int m_end_of_bb;
> +
> +  /* The memory live range, or null if we haven't yet found a memory
> +     reference in the current instruction sequence.  */
> +  live_range_rec *m_mem_range;
> +
> +  /* Gives information about each register.  We track both hard and
> +     pseudo registers.  */
> +  auto_vec<reg_info_rec> m_reg_info;
> +
> +  /* A bitmap of registers whose entry in m_reg_info is valid.  */
> +  auto_sbitmap m_valid_regs;
> +
> +  /* If nonnull, an unused 2-element PARALLEL that we can use to test
> +     instruction combinations.  */
> +  rtx m_spare_parallel;
> +
> +  /* A bitmap of instructions that we've already tried to combine with.  */
> +  auto_bitmap m_tried_insns;
> +
> +  /* A temporary bitmap used to hold register numbers.  */
> +  auto_bitmap m_true_deps;
> +
> +  /* An obstack used for allocating insn_info_recs and for building
> +     up their lists of definitions and uses.  */
> +  obstack m_insn_obstack;
> +
> +  /* An obstack used for allocating live_range_recs.  */
> +  obstack m_range_obstack;
> +
> +  /* Start-of-object pointers for the two obstacks.  */
> +  char *m_insn_obstack_start;
> +  char *m_range_obstack_start;
> +
> +  /* A list of instructions that we've optimized and whose new forms
> +     change the cfg.  */
> +  auto_vec<rtx_insn *> m_cfg_altering_insns;
> +
> +  /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS.  */
> +  auto_bitmap m_cfg_altering_insn_ids;
> +
> +  /* We can insert new instructions at point P * 2 by inserting them
> +     after M_POINTS[P - M_END_OF_SEQUENCE / 2].  We can insert new
> +     instructions at point P * 2 + 1 by inserting them before
> +     M_POINTS[P - M_END_OF_SEQUENCE / 2].  */
> +  auto_vec<rtx_insn *, 256> m_points;
> +};
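A rough sketch of how a point number could be translated into an
m_points index plus a before/after flag, mirroring the arithmetic that
commit_combination uses later in the patch.  The struct and function
names are assumptions made for illustration only.

```cpp
#include <cassert>

// Hypothetical result type: which m_points entry to anchor on, and
// whether to insert before it (odd point) or after it (even point).
struct placement_model
{
  unsigned int index;
  bool before_p;
};

placement_model
locate_point (unsigned int point, unsigned int end_of_sequence)
{
  // The end-of-sequence point is even and is the lowest valid point.
  assert (end_of_sequence % 2 == 0 && point >= end_of_sequence);
  placement_model result;
  result.index = (point - end_of_sequence) / 2;
  result.before_p = (point & 1) != 0;
  return result;
}
```

For example, with an end-of-sequence point of 2, point 4 maps to
"after entry 1" and point 3 maps to "before entry 0".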
> +
> +combine2::combine2 (function *fn)
> +  : m_fn (fn),
> +    m_num_regs (max_reg_num ()),
> +    m_bb (NULL),
> +    m_optimize_for_speed_p (false),
> +    m_point (2),
> +    m_end_of_sequence (m_point),
> +    m_end_of_bb (m_point),
> +    m_mem_range (NULL),
> +    m_reg_info (m_num_regs),
> +    m_valid_regs (m_num_regs),
> +    m_spare_parallel (NULL_RTX)
> +{
> +  gcc_obstack_init (&m_insn_obstack);
> +  gcc_obstack_init (&m_range_obstack);
> +  m_reg_info.quick_grow (m_num_regs);
> +  bitmap_clear (m_valid_regs);
> +  m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
> +  m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
> +}
> +
> +combine2::~combine2 ()
> +{
> +  obstack_free (&m_insn_obstack, NULL);
> +  obstack_free (&m_range_obstack, NULL);
> +}
> +
> +/* Return true if it's possible in principle to combine INSN with
> +   other instructions.  ALLOW_ASMS_P is true if the caller can cope
> +   with asm statements.  */
> +
> +static bool
> +combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
> +{
> +  rtx pattern = PATTERN (insn);
> +
> +  if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
> +    return false;
> +
> +  if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
> +    return false;
> +
> +  if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0)
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Return true if it's possible in principle to move INSN somewhere else,
> +   as long as all dependencies are satisfied.  */
> +
> +static bool
> +movable_insn_p (rtx_insn *insn)
> +{
> +  if (JUMP_P (insn))
> +    return false;
> +
> +  if (volatile_refs_p (PATTERN (insn)))
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Create and return a new live range for REGNO.  NEXT is the next range
> +   in program order, or null if this is the first live range in the
> +   sequence.  */
> +
> +combine2::live_range_rec *
> +combine2::new_live_range (unsigned int regno, live_range_rec *next)
> +{
> +  live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
> +  memset (range, 0, sizeof (*range));
> +
> +  range->regno = regno;
> +  range->next_range = next;
> +  if (next)
> +    next->prev_range = range;
> +  return range;
> +}
> +
> +/* Return the current live range for register REGNO, creating a new
> +   one if necessary.  */
> +
> +combine2::live_range_rec *
> +combine2::reg_live_range (unsigned int regno)
> +{
> +  /* Initialize the liveness flag, if it isn't already valid for this BB.  */
> +  bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
> +  if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
> +    m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
> +
> +  /* See if we already have a live range associated with the current
> +     instruction sequence.  */
> +  live_range_rec *range = NULL;
> +  if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
> +    range = m_reg_info[regno].range;
> +
> +  /* Create a new range if this is the first reference to REGNO in the
> +     current instruction sequence or if the current range has been closed
> +     off by a definition.  */
> +  if (!range || range->producer)
> +    {
> +      range = new_live_range (regno, range);
> +
> +      /* If the register is live after the current sequence, treat that
> +	 as a fake use at the end of the sequence.  */
> +      if (!range->next_range && m_reg_info[regno].live_p)
> +	range->first_extra_use = range->last_extra_use = m_end_of_sequence;
> +
> +      /* Record that this is now the current range for REGNO.  */
> +      if (first_ref_p)
> +	bitmap_set_bit (m_valid_regs, regno);
> +      m_reg_info[regno].range = range;
> +      m_reg_info[regno].next_ref = m_point;
> +    }
> +  return range;
> +}
> +
> +/* Return the current live range for memory, treating memory as a single
> +   entity.  Create a new live range if necessary.  */
> +
> +combine2::live_range_rec *
> +combine2::mem_live_range ()
> +{
> +  if (!m_mem_range || m_mem_range->producer)
> +    m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
> +  return m_mem_range;
> +}
> +
> +/* Record that instruction USER uses the resource described by RANGE.
> +   Return true if this is new information.  */
> +
> +bool
> +combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
> +{
> +  /* See if we've already recorded the instruction, or if there's a
> +     spare use slot we can use.  */
> +  unsigned int i = 0;
> +  for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
> +    if (range->users[i] == user)
> +      return false;
> +
> +  if (i == NUM_RANGE_USERS)
> +    {
> +      /* Since we've processed USER recently, assume that it's more
> +	 interesting to record explicitly than the last user in the
> +	 current list.  Evict that last user and describe it in the
> +	 overflow "extra use" range instead.  */
> +      insn_info_rec *ousted_user = range->users[--i];
> +      if (range->first_extra_use < ousted_user->point)
> +	range->first_extra_use = ousted_user->point;
> +      if (range->last_extra_use > ousted_user->point)
> +	range->last_extra_use = ousted_user->point;
> +    }
> +
> +  /* Insert USER while keeping the list sorted.  */
> +  for (; i > 0 && range->users[i - 1]->point < user->point; --i)
> +    range->users[i] = range->users[i - 1];
> +  range->users[i] = user;
> +  return true;
> +}
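A simplified model of the insertion logic in add_range_use, in case it
helps with review.  Users are represented only by their point numbers
(0 meaning an empty slot), and the overflow range is reduced to
remembering the evicted point; both simplifications are mine rather
than the patch's.

```cpp
#include <array>
#include <cassert>

const unsigned int NUM_USERS = 4;

// Stand-in for live_range_rec.  USERS is kept in program order, which
// means descending point numbers.
struct range_model
{
  std::array<unsigned int, NUM_USERS> users {};
  unsigned int evicted = 0;  // Most recently evicted point, or 0.
};

// Returns true if POINT was not already recorded.
bool
add_use (range_model &range, unsigned int point)
{
  // Look for an existing entry or the first free slot.
  unsigned int i = 0;
  for (; i < NUM_USERS && range.users[i]; ++i)
    if (range.users[i] == point)
      return false;

  // No free slot: evict the last (lowest-point) user.
  if (i == NUM_USERS)
    range.evicted = range.users[--i];

  // Shift later entries down until POINT's sorted position is free.
  for (; i > 0 && range.users[i - 1] < point; --i)
    range.users[i] = range.users[i - 1];
  range.users[i] = point;
  return true;
}
```

Adding points 9, 5, 7, 3 and then 11 leaves the list as 11, 9, 7, 5,
with 3 evicted into the overflow description.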
> +
> +/* Remove USER from the uses recorded for RANGE, if we can.
> +   There's nothing we can do if USER was described in the
> +   overflow "extra use" range.  */
> +
> +void
> +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
> +{
> +  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +    if (range->users[i] == user)
> +      {
> +	for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
> +	  range->users[j] = range->users[j + 1];
> +	range->users[NUM_RANGE_USERS - 1] = NULL;
> +	break;
> +      }
> +}
> +
> +/* Return true if RANGE has a single known user.  */
> +
> +bool
> +combine2::has_single_use_p (live_range_rec *range)
> +{
> +  return range->users[0] && !range->users[1] && !range->first_extra_use;
> +}
> +
> +/* Return true if we know that USER is the last user of RANGE.  */
> +
> +bool
> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
> +{
> +  if (range->last_extra_use <= user->point)
> +    return false;
> +
> +  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
> +    if (range->users[i] == user)
> +      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
Small nit, and I could be wrong, but how about:

return !range->users[i + 1] || i == NUM_RANGE_USERS - 1;

Based on your code, it seems that reaching NUM_RANGE_USERS is far
less likely.
> +    else if (range->users[i]->point == user->point)
> +      return false;
> +
> +  gcc_unreachable ();
> +}
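On the operand order discussed in known_last_use_p: since || evaluates
left to right and stops at the first true operand, the index test in
the patch also guards the array read.  A minimal sketch, using a
hypothetical array of plain pointers in place of users[]:

```cpp
#include <cassert>

const unsigned int NUM_SLOTS = 4;

// Returns true if slot I is the last occupied slot of USERS, where a
// null pointer marks an empty slot.  USERS has NUM_SLOTS elements.
bool
last_occupied_p (const int *const *users, unsigned int i)
{
  assert (i < NUM_SLOTS);
  // Checking the index first means users[i + 1] is only read when
  // i + 1 is still a valid index.  With the operands swapped, the
  // final slot would trigger a read one past the end of the array.
  return i == NUM_SLOTS - 1 || !users[i + 1];
}
```

So the existing order looks deliberate for correctness rather than
only for branch likelihood.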
> +
> +/* Find the earliest point that we could move I2 up in order to combine
> +   it with I1.  Ignore any dependencies between I1 and I2; leave the
> +   caller to deal with those instead.  */
> +
> +unsigned int
> +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
> +{
> +  if (!movable_insn_p (i2->insn))
> +    return i2->point;
> +
> +  /* Start by optimistically assuming that we can move the instruction
> +     all the way up to I1.  */
> +  unsigned int point = i1->point;
> +
> +  /* Make sure that the new position preserves all necessary true dependencies
> +     on earlier instructions.  */
> +  for (live_range_rec **use = i2->uses; *use; ++use)
> +    {
> +      live_range_rec *range = *use;
> +      if (range->producer
> +	  && range->producer != i1
> +	  && point >= range->producer->point)
> +	point = range->producer->point - 1;
> +    }
> +
> +  /* Make sure that the new position preserves all necessary output and
> +     anti dependencies on earlier instructions.  */
> +  for (live_range_rec **def = i2->defs; *def; ++def)
> +    if (live_range_rec *range = (*def)->prev_range)
> +      {
> +	if (range->producer
> +	    && range->producer != i1
> +	    && point >= range->producer->point)
> +	  point = range->producer->point - 1;
> +
> +	for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;)
> +	  if (range->users[i] && range->users[i] != i1)
> +	    {
> +	      if (point >= range->users[i]->point)
> +		point = range->users[i]->point - 1;
> +	      break;
> +	    }
> +
> +	if (range->last_extra_use && point >= range->last_extra_use)
> +	  point = range->last_extra_use - 1;
> +      }
> +
> +  return point;
> +}
> +
> +/* Find the latest point that we could move I1 down in order to combine
> +   it with I2.  Ignore any dependencies between I1 and I2; leave the
> +   caller to deal with those instead.  */
> +
> +unsigned int
> +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
> +{
> +  if (!movable_insn_p (i1->insn))
> +    return i1->point;
> +
> +  /* Start by optimistically assuming that we can move the instruction
> +     all the way down to I2.  */
> +  unsigned int point = i2->point;
> +
> +  /* Make sure that the new position preserves all necessary anti dependencies
> +     on later instructions.  */
> +  for (live_range_rec **use = i1->uses; *use; ++use)
> +    if (live_range_rec *range = (*use)->next_range)
> +      if (range->producer != i2 && point <= range->producer->point)
> +	point = range->producer->point + 1;
> +
> +  /* Make sure that the new position preserves all necessary output and
> +     true dependencies on later instructions.  */
> +  for (live_range_rec **def = i1->defs; *def; ++def)
> +    {
> +      live_range_rec *range = *def;
> +
> +      for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +	if (range->users[i] != i2)
> +	  {
> +	    if (range->users[i] && point <= range->users[i]->point)
> +	      point = range->users[i]->point + 1;
> +	    break;
> +	  }
> +
> +      if (range->first_extra_use && point <= range->first_extra_use)
> +	point = range->first_extra_use + 1;
> +
> +      live_range_rec *next_range = range->next_range;
> +      if (next_range
> +	  && next_range->producer != i2
> +	  && point <= next_range->producer->point)
> +	point = next_range->producer->point + 1;
> +    }
> +
> +  return point;
> +}
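The two functions above both follow a "start optimistic, then clamp"
pattern.  Here is a small sketch of that pattern with the dependencies
flattened into plain lists of blocking points.  The names and the
flattening are assumptions for illustration; the real code walks live
ranges instead.

```cpp
#include <cassert>
#include <vector>

// BLOCKERS holds points of instructions that the moved instruction must
// stay after, i.e. it must end up at a strictly lower point.
unsigned int
clamp_earliest (unsigned int start, const std::vector<unsigned int> &blockers)
{
  unsigned int point = start;
  for (unsigned int blocker : blockers)
    {
      assert (blocker > 0);
      if (point >= blocker)
        point = blocker - 1;
    }
  return point;
}

// Mirror image: BLOCKERS holds points of instructions that the moved
// instruction must stay before, i.e. at a strictly higher point.
unsigned int
clamp_latest (unsigned int start, const std::vector<unsigned int> &blockers)
{
  unsigned int point = start;
  for (unsigned int blocker : blockers)
    if (point <= blocker)
      point = blocker + 1;
  return point;
}

// Higher points are earlier, so the window of valid positions is
// non-empty only if EARLIEST is not below LATEST.
bool
window_ok_p (unsigned int earliest, unsigned int latest)
{
  return earliest >= latest;
}
```

For instance, with I1 at point 9 and I2 at point 3, a blocker at 7 for
moving I2 up and a blocker at 5 for moving I1 down both clamp to point
6, which leaves exactly one valid position.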
> +
> +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
> +   where I1 is the instruction that we're currently trying to optimize.
> +   If DEF_USE_RANGE is nonnull, I1 defines the value described by
> +   DEF_USE_RANGE and I2 uses it.  */
> +
> +bool
> +combine2::start_combination (combination_attempt_rec &attempt,
> +			     insn_info_rec *i1, insn_info_rec *i2,
> +			     live_range_rec *def_use_range)
> +{
> +  attempt.new_home = i1;
> +  attempt.sequence[0] = i1;
> +  attempt.sequence[1] = i2;
> +  if (attempt.sequence[0]->point < attempt.sequence[1]->point)
> +    std::swap (attempt.sequence[0], attempt.sequence[1]);
> +  attempt.def_use_range = def_use_range;
> +
> +  /* Check that the instructions have no true dependencies other than
> +     DEF_USE_RANGE.  */
> +  bitmap_clear (m_true_deps);
> +  for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
> +    if (*def != def_use_range)
> +      bitmap_set_bit (m_true_deps, (*def)->regno);
> +  for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
> +    if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
> +      return false;
> +
> +  /* Calculate the range of points at which the combined instruction
> +     could live.  */
> +  attempt.earliest_point = find_earliest_point (attempt.sequence[1],
> +						attempt.sequence[0]);
> +  attempt.latest_point = find_latest_point (attempt.sequence[0],
> +					    attempt.sequence[1]);
> +  if (attempt.earliest_point < attempt.latest_point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "cannot combine %d and %d: no suitable"
> +		 " location for combined insn\n",
> +		 INSN_UID (attempt.sequence[0]->insn),
> +		 INSN_UID (attempt.sequence[1]->insn));
> +      return false;
> +    }
> +
> +  /* Make sure we have valid costs for the original instructions before
> +     we start changing their patterns.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    if (attempt.sequence[i]->cost == UNKNOWN_COST)
> +      attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
> +					     m_optimize_for_speed_p);
> +  return true;
> +}
> +
> +/* Check whether the combination attempt described by ATTEMPT matches
> +   an .md instruction (or matches its constraints, in the case of an
> +   asm statement).  If so, calculate the cost of the new instruction
> +   and check whether it's cheap enough.  */
> +
> +bool
> +combine2::verify_combination (combination_attempt_rec &attempt)
> +{
> +  rtx_insn *insn = attempt.sequence[1]->insn;
> +
> +  bool ok_p = verify_changes (0);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      if (!ok_p)
> +	fprintf (dump_file, "failed to match this instruction:\n");
> +      else if (const char *name = get_insn_name (INSN_CODE (insn)))
> +	fprintf (dump_file, "successfully matched this instruction to %s:\n",
> +		 name);
> +      else
> +	fprintf (dump_file, "successfully matched this instruction:\n");
> +      print_rtl_single (dump_file, PATTERN (insn));
> +    }
> +  if (!ok_p)
> +    return false;
> +
> +  unsigned int cost1 = attempt.sequence[0]->cost;
> +  unsigned int cost2 = attempt.sequence[1]->cost;
> +  attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
> +  ok_p = (attempt.new_cost <= cost1 + cost2);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n",
> +	     cost1, cost2, attempt.new_cost,
> +	     ok_p ? "keeping replacement" : "rejecting replacement");
> +  if (!ok_p)
> +    return false;
> +
> +  confirm_change_group ();
> +  return true;
> +}
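A toy model of the cost gate, assuming costs are cached lazily with an
all-ones sentinel as in the patch.  The cost_model type and its
"measured" field are hypothetical stand-ins for a real cost query.

```cpp
#include <cassert>

const unsigned int UNKNOWN = ~0U;

struct cost_model
{
  unsigned int measured;  // What the cost query would return.
  unsigned int cached;    // UNKNOWN until first queried.
};

// Query the cost once and remember the answer.
unsigned int
cached_cost (cost_model &insn)
{
  if (insn.cached == UNKNOWN)
    insn.cached = insn.measured;
  return insn.cached;
}

// Accept the replacement if it is no more expensive than the two
// original instructions together; ties are accepted.
bool
keep_replacement_p (cost_model &i1, cost_model &i2, unsigned int new_cost)
{
  return new_cost <= cached_cost (i1) + cached_cost (i2);
}
```

Caching the original costs up front matters because the original
patterns are overwritten while a combination is being tried.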
> +
> +/* Return true if we should consider register REGNO when calculating
> +   register pressure estimates.  */
> +
> +static bool
> +count_reg_pressure_p (unsigned int regno)
> +{
> +  if (regno == INVALID_REGNUM)
> +    return false;
> +
> +  /* Unallocatable registers aren't interesting.  */
> +  if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
> +    return false;
> +
> +  return true;
> +}
> +
> +/* Try to estimate the effect that the original form of INSN_INFO
> +   had on register pressure, in the form "born - dying".  */
> +
> +int
> +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
> +{
> +  int delta = 0;
> +
> +  for (live_range_rec **def = insn_info->defs; *def; ++def)
> +    if (count_reg_pressure_p ((*def)->regno))
> +      delta += 1;
> +
> +  for (live_range_rec **use = insn_info->uses; *use; ++use)
> +    if (count_reg_pressure_p ((*use)->regno)
> +	&& known_last_use_p (*use, insn_info))
> +      delta -= 1;
> +
> +  return delta;
> +}
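The "born - dying" estimate can be modelled in a few lines.  In this
sketch each def or use is reduced to two flags; the operand_model type
is hypothetical, with counted_p standing in for count_reg_pressure_p
and last_use_p for known_last_use_p.

```cpp
#include <cassert>
#include <vector>

struct operand_model
{
  bool counted_p;   // Register is relevant to pressure estimates.
  bool last_use_p;  // Known to be the final use of its range.
};

int
pressure_delta (const std::vector<operand_model> &defs,
                const std::vector<operand_model> &uses)
{
  int delta = 0;
  // Each counted definition starts a live range.
  for (const operand_model &def : defs)
    if (def.counted_p)
      delta += 1;
  // Each counted use that is known to be last ends a live range.
  for (const operand_model &use : uses)
    if (use.counted_p && use.last_use_p)
      delta -= 1;
  return delta;
}
```

An instruction with one counted def and one counted last use comes out
at zero, i.e. neutral for register pressure under this estimate.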
> +
> +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
> +   FROM_INSN.  Copy any useful information to TO_INSN before doing that.  */
> +
> +static void
> +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
> +{
> +  INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
> +  INSN_CODE (to_insn) = INSN_CODE (from_insn);
> +  REG_NOTES (to_insn) = REG_NOTES (from_insn);
> +}
> +
> +/* The combination attempt in ATTEMPT has succeeded and is currently
> +   part of an open validate_change group.  Commit to making the change
> +   and decide where the new instruction should go.
> +
> +   KEPT_DEF_P is true if the new instruction continues to perform
> +   the definition described by ATTEMPT.def_use_range.  */
> +
> +void
> +combine2::commit_combination (combination_attempt_rec &attempt,
> +			      bool kept_def_p)
> +{
> +  insn_info_rec *new_home = attempt.new_home;
> +  rtx_insn *old_insn = attempt.sequence[0]->insn;
> +  rtx_insn *new_insn = attempt.sequence[1]->insn;
> +
> +  /* Remove any notes that are no longer relevant.  */
> +  bool single_set_p = single_set (new_insn);
> +  for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; )
> +    {
> +      rtx note = *note_ptr;
> +      bool keep_p = true;
> +      switch (REG_NOTE_KIND (note))
> +	{
> +	case REG_EQUAL:
> +	case REG_EQUIV:
> +	case REG_NOALIAS:
> +	  keep_p = single_set_p;
> +	  break;
> +
> +	case REG_UNUSED:
> +	  keep_p = false;
> +	  break;
> +
> +	default:
> +	  break;
> +	}
> +      if (keep_p)
> +	note_ptr = &XEXP (*note_ptr, 1);
> +      else
> +	{
> +	  *note_ptr = XEXP (*note_ptr, 1);
> +	  free_EXPR_LIST_node (note);
> +	}
> +    }
> +
> +  /* Complete the open validate_change group.  */
> +  confirm_change_group ();
> +
> +  /* Decide where the new instruction should go.  */
> +  unsigned int new_point = attempt.latest_point;
> +  if (new_point != attempt.earliest_point
> +      && prev_real_insn (new_insn) != old_insn)
> +    {
> +      /* Prefer the earlier point if the combined instruction reduces
> +	 register pressure and the latest point if it increases register
> +	 pressure.
> +
> +	 The choice isn't obvious in the event of a tie, but picking
> +	 the earliest point should reduce the number of times that
> +	 we need to invalidate debug insns.  */
> +      int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
> +      int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
> +      bool move_up_p = (delta1 + delta2 <= 0);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file,
> +		 "register pressure delta = %d + %d; using %s position\n",
> +		 delta1, delta2, move_up_p ? "earliest" : "latest");
> +      if (move_up_p)
> +	new_point = attempt.earliest_point;
> +    }
> +
> +  /* Translate inserting at NEW_POINT into inserting before or after
> +     a particular insn.  */
> +  rtx_insn *anchor = NULL;
> +  bool before_p = (new_point & 1);
> +  if (new_point != attempt.sequence[1]->point
> +      && new_point != attempt.sequence[0]->point)
> +    {
> +      anchor = m_points[(new_point - m_end_of_sequence) / 2];
> +      rtx_insn *other_side = (before_p
> +			      ? prev_real_insn (anchor)
> +			      : next_real_insn (anchor));
> +      /* Inserting next to an insn X and then deleting X is just a
> +	 roundabout way of using X as the insertion point.  */
> +      if (anchor == new_insn || other_side == new_insn)
> +	new_point = attempt.sequence[1]->point;
> +      else if (anchor == old_insn || other_side == old_insn)
> +	new_point = attempt.sequence[0]->point;
> +    }
> +
> +  /* Actually perform the move.  */
> +  if (new_point == attempt.sequence[1]->point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
> +		 INSN_UID (new_insn));
> +      set_insn_deleted (old_insn);
> +    }
> +  else if (new_point == attempt.sequence[0]->point)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
> +		 INSN_UID (old_insn));
> +      PATTERN (old_insn) = PATTERN (new_insn);
> +      transfer_insn (old_insn, new_insn);
> +      std::swap (old_insn, new_insn);
> +      set_insn_deleted (old_insn);
> +    }
> +  else
> +    {
> +      /* We need to insert a new instruction.  We can't simply move
> +	 NEW_INSN because it acts as an insertion anchor in m_points.  */
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "inserting combined insn %s insn %d\n",
> +		 before_p ? "before" : "after", INSN_UID (anchor));
> +
> +      rtx_insn *added_insn = (before_p
> +			      ? emit_insn_before (PATTERN (new_insn), anchor)
> +			      : emit_insn_after (PATTERN (new_insn), anchor));
> +      transfer_insn (added_insn, new_insn);
> +      set_insn_deleted (old_insn);
> +      set_insn_deleted (new_insn);
> +      new_insn = added_insn;
> +    }
> +  df_insn_rescan (new_insn);
> +
> +  /* Unlink the old uses.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
> +      remove_range_use (*use, attempt.sequence[i]);
> +
> +  /* Work out which registers the new pattern uses.  */
> +  bitmap_clear (m_true_deps);
> +  df_ref use;
> +  FOR_EACH_INSN_USE (use, new_insn)
> +    {
> +      rtx reg = DF_REF_REAL_REG (use);
> +      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
> +    }
> +  FOR_EACH_INSN_EQ_USE (use, new_insn)
> +    {
> +      rtx reg = DF_REF_REAL_REG (use);
> +      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
> +    }
> +
> +  /* Describe the combined instruction in NEW_HOME.  */
> +  new_home->insn = new_insn;
> +  new_home->point = new_point;
> +  new_home->cost = attempt.new_cost;
> +
> +  /* Build up a list of definitions for the combined instructions
> +     and update all the ranges accordingly.  It shouldn't matter
> +     which order we do this in.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
> +      if (kept_def_p || *def != attempt.def_use_range)
> +	{
> +	  obstack_ptr_grow (&m_insn_obstack, *def);
> +	  (*def)->producer = new_home;
> +	}
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +
> +  /* Build up a list of uses for the combined instructions and update
> +     all the ranges accordingly.  Again, it shouldn't matter which
> +     order we do this in.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
> +      if (*use != attempt.def_use_range
> +	  && add_range_use (*use, new_home))
> +	obstack_ptr_grow (&m_insn_obstack, *use);
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +
> +  /* There shouldn't be any remaining references to other instructions
> +     in the combination.  Invalidate their contents to make lingering
> +     references a noisy failure.  */
> +  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
> +    if (attempt.sequence[i] != new_home)
> +      {
> +	attempt.sequence[i]->insn = NULL;
> +	attempt.sequence[i]->point = ~0U;
> +      }
> +
> +  /* Unlink the def-use range.  */
> +  if (!kept_def_p && attempt.def_use_range)
> +    {
> +      live_range_rec *range = attempt.def_use_range;
> +      if (range->prev_range)
> +	range->prev_range->next_range = range->next_range;
> +      else
> +	m_reg_info[range->regno].range = range->next_range;
> +      if (range->next_range)
> +	range->next_range->prev_range = range->prev_range;
> +    }
> +
> +  /* Record instructions whose new form alters the cfg.  */
> +  rtx pattern = PATTERN (new_insn);
> +  if ((returnjump_p (new_insn)
> +       || any_uncondjump_p (new_insn)
> +       || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
> +      && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
> +    m_cfg_altering_insns.safe_push (new_insn);
> +}
> +
> +/* Return true if X1 and X2 are memories and if X1 does not have
> +   a higher alignment than X2.  */
> +
> +static bool
> +dubious_mem_pair_p (rtx x1, rtx x2)
> +{
> +  return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
> +}
> +
> +/* Try to implement ATTEMPT using (parallel [SET1 SET2]).  */
> +
> +bool
> +combine2::try_parallel_sets (combination_attempt_rec &attempt,
> +			     rtx set1, rtx set2)
> +{
> +  rtx_insn *insn = attempt.sequence[1]->insn;
> +
> +  /* Combining two loads or two stores can be useful on targets that
> +     allow them to be treated as a single access.  However, we use a
> +     very peephole approach to picking the pairs, so we need to be
> +     relatively confident that we're making a good choice.
> +
> +     For now just aim for cases in which the memory references are
> +     consecutive and the first reference has a higher alignment.
> +     We can leave the target to test the consecutive part; whatever test
> +     we added here might be different from the target's, and in any case
> +     it's fine if the target accepts other well-aligned cases too.  */
> +  if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
> +      || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
> +    return false;
> +
> +  /* Cache the PARALLEL rtx between attempts so that we don't generate
> +     too much garbage rtl.  */
> +  if (!m_spare_parallel)
> +    {
> +      rtvec vec = gen_rtvec (2, set1, set2);
> +      m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
> +    }
> +  else
> +    {
> +      XVECEXP (m_spare_parallel, 0, 0) = set1;
> +      XVECEXP (m_spare_parallel, 0, 1) = set2;
> +    }
> +
> +  unsigned int num_changes = num_validated_changes ();
> +  validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
> +  if (verify_combination (attempt))
> +    {
> +      m_spare_parallel = NULL_RTX;
> +      return true;
> +    }
> +  cancel_changes (num_changes);
> +  return false;
> +}
> +
> +/* Try to parallelize the two instructions in ATTEMPT.  */
> +
> +bool
> +combine2::try_parallelize_insns (combination_attempt_rec &attempt)
> +{
> +  rtx_insn *i1_insn = attempt.sequence[0]->insn;
> +  rtx_insn *i2_insn = attempt.sequence[1]->insn;
> +
> +  /* Can't parallelize asm statements.  */
> +  if (asm_noperands (PATTERN (i1_insn)) >= 0
> +      || asm_noperands (PATTERN (i2_insn)) >= 0)
> +    return false;
> +
> +  /* For now, just handle the case in which both instructions are
> +     single sets.  We could handle more than 2 sets as well, but few
> +     targets support that anyway.  */
> +  rtx set1 = single_set (i1_insn);
> +  if (!set1)
> +    return false;
> +  rtx set2 = single_set (i2_insn);
> +  if (!set2)
> +    return false;
> +
> +  /* Make sure that we have structural proof that the destinations
> +     are independent.  Things like alias analysis rely on semantic
> +     information and assume no undefined behavior, which is rarely a
> +     good enough guarantee to allow a useful instruction combination.  */
> +  rtx dest1 = SET_DEST (set1);
> +  rtx dest2 = SET_DEST (set2);
> +  if (MEM_P (dest1)
> +      ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false)
> +      : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
> +    return false;
> +
> +  /* Try the sets in both orders.  */
> +  if (try_parallel_sets (attempt, set1, set2)
> +      || try_parallel_sets (attempt, set2, set1))
> +    {
> +      commit_combination (attempt, true);
> +      if (MAY_HAVE_DEBUG_BIND_INSNS
> +	  && attempt.new_home->insn != i1_insn)
> +	propagate_for_debug (i1_insn, attempt.new_home->insn,
> +			     SET_DEST (set1), SET_SRC (set1), m_bb);
> +      return true;
> +    }
> +  return false;
> +}
> +
> +/* Replace DEST with SRC in the register notes for INSN.  */
> +
> +static void
> +substitute_into_note (rtx_insn *insn, rtx dest, rtx src)
> +{
> +  for (rtx *note_ptr = &REG_NOTES (insn); *note_ptr; )
> +    {
> +      rtx note = *note_ptr;
> +      bool keep_p = true;
> +      switch (REG_NOTE_KIND (note))
> +	{
> +	case REG_EQUAL:
> +	case REG_EQUIV:
> +	  keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0),
> +						  dest, src);
> +	  break;
> +
> +	default:
> +	  break;
> +	}
> +      if (keep_p)
> +	note_ptr = &XEXP (*note_ptr, 1);
> +      else
> +	{
> +	  *note_ptr = XEXP (*note_ptr, 1);
> +	  free_EXPR_LIST_node (note);
> +	}
> +    }
> +}
> +
> +/* A subroutine of try_combine_def_use.  Try replacing DEST with SRC
> +   in ATTEMPT.  SRC might be either the original SET_SRC passed to the
> +   parent routine or a value pulled from a note; SRC_IS_NOTE_P is true
> +   in the latter case.  */
> +
> +bool
> +combine2::try_combine_def_use_1 (combination_attempt_rec &attempt,
> +				 rtx dest, rtx src, bool src_is_note_p)
> +{
> +  rtx_insn *def_insn = attempt.sequence[0]->insn;
> +  rtx_insn *use_insn = attempt.sequence[1]->insn;
> +
> +  /* Mimic combine's behavior by not combining moves from allocatable hard
> +     registers (e.g. when copying parameters or function return values).  */
> +  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
> +    return false;
> +
> +  /* Don't mess with volatile references.  For one thing, we don't yet
> +     know how many copies of SRC we'll need.  */
> +  if (volatile_refs_p (src))
> +    return false;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "trying to combine %d and %d%s:\n",
> +	       INSN_UID (def_insn), INSN_UID (use_insn),
> +	       src_is_note_p ? " using equal/equiv note" : "");
> +      dump_insn_slim (dump_file, def_insn);
> +      dump_insn_slim (dump_file, use_insn);
> +    }
> +
> +  unsigned int num_changes = num_validated_changes ();
> +  if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn),
> +				      dest, src))
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "combination failed -- unable to substitute"
> +		 " all uses\n");
> +      return false;
> +    }
> +
> +  /* Try matching the instruction on its own if DEST isn't used elsewhere.  */
> +  if (has_single_use_p (attempt.def_use_range)
> +      && verify_combination (attempt))
> +    {
> +      live_range_rec *next_range = attempt.def_use_range->next_range;
> +      substitute_into_note (use_insn, dest, src);
> +      commit_combination (attempt, false);
> +      if (MAY_HAVE_DEBUG_BIND_INSNS)
> +	{
> +	  rtx_insn *end_of_range = (next_range
> +				    ? next_range->producer->insn
> +				    : BB_END (m_bb));
> +	  propagate_for_debug (def_insn, end_of_range, dest, src, m_bb);
> +	}
> +      return true;
> +    }
> +
> +  /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN
> +     pattern.  */
> +  if (try_parallelize_insns (attempt))
> +    return true;
> +
> +  cancel_changes (num_changes);
> +  return false;
> +}
> +
> +/* ATTEMPT describes an attempt to substitute the result of the first
> +   instruction into the second instruction.  Try to implement it,
> +   given that the first instruction sets DEST to SRC.  */
> +
> +bool
> +combine2::try_combine_def_use (combination_attempt_rec &attempt,
> +			       rtx dest, rtx src)
> +{
> +  rtx_insn *def_insn = attempt.sequence[0]->insn;
> +  rtx_insn *use_insn = attempt.sequence[1]->insn;
> +  rtx def_note = find_reg_equal_equiv_note (def_insn);
> +
> +  /* First try combining the instructions in their original form.  */
> +  if (try_combine_def_use_1 (attempt, dest, src, false))
> +    return true;
> +
> +  /* Try to replace DEST with a REG_EQUAL/EQUIV value instead.  */
> +  if (def_note
> +      && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
> +    return true;
> +
> +  /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try
> +     using that instead of the main pattern.  */
> +  for (rtx *link_ptr = &REG_NOTES (use_insn); *link_ptr;
> +       link_ptr = &XEXP (*link_ptr, 1))
> +    {
> +      rtx use_note = *link_ptr;
> +      if (REG_NOTE_KIND (use_note) != REG_EQUAL
> +	  && REG_NOTE_KIND (use_note) != REG_EQUIV)
> +	continue;
> +
> +      rtx use_set = single_set (use_insn);
> +      if (!use_set)
> +	break;
> +
> +      if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0)))
> +	continue;
> +
> +      /* Try snipping out the note and putting it in the SET instead.  */
> +      validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1);
> +      validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1);
> +
> +      if (try_combine_def_use_1 (attempt, dest, src, false))
> +	return true;
> +
> +      if (def_note
> +	  && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
> +	return true;
> +
> +      cancel_changes (0);
> +    }
> +
> +  return false;
> +}
> +
> +/* ATTEMPT describes an attempt to combine two instructions that use
> +   the same resource.  Try to implement it, returning true on success.  */
> +
> +bool
> +combine2::try_combine_two_uses (combination_attempt_rec &attempt)
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "trying to parallelize %d and %d:\n",
> +	       INSN_UID (attempt.sequence[0]->insn),
> +	       INSN_UID (attempt.sequence[1]->insn));
> +      dump_insn_slim (dump_file, attempt.sequence[0]->insn);
> +      dump_insn_slim (dump_file, attempt.sequence[1]->insn);
> +    }
> +
> +  return try_parallelize_insns (attempt);
> +}
> +
> +/* Try to optimize instruction I1.  Return true on success.  */
> +
> +bool
> +combine2::optimize_insn (insn_info_rec *i1)
> +{
> +  combination_attempt_rec attempt;
> +
> +  if (!combinable_insn_p (i1->insn, false))
> +    return false;
> +
> +  rtx set = single_set (i1->insn);
> +  if (!set)
> +    return false;
> +
> +  /* First try combining INSN with a user of its result.  */
> +  rtx dest = SET_DEST (set);
> +  rtx src = SET_SRC (set);
> +  if (REG_P (dest) && REG_NREGS (dest) == 1)
> +    for (live_range_rec **def = i1->defs; *def; ++def)
> +      if ((*def)->regno == REGNO (dest))
> +	{
> +	  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +	    {
> +	      insn_info_rec *use = (*def)->users[i];
> +	      if (use
> +		  && combinable_insn_p (use->insn, has_single_use_p (*def))
> +		  && start_combination (attempt, i1, use, *def)
> +		  && try_combine_def_use (attempt, dest, src))
> +		return true;
> +	    }
> +	  break;
> +	}
> +
> +  /* Try parallelizing INSN and another instruction that uses the same
> +     resource.  */
> +  bitmap_clear (m_tried_insns);
> +  for (live_range_rec **use = i1->uses; *use; ++use)
> +    for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
> +      {
> +	insn_info_rec *i2 = (*use)->users[i];
> +	if (i2
> +	    && i2 != i1
> +	    && combinable_insn_p (i2->insn, false)
> +	    && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn))
> +	    && start_combination (attempt, i1, i2)
> +	    && try_combine_two_uses (attempt))
> +	  return true;
> +      }
> +
> +  return false;
> +}
> +
> +/* A note_stores callback.  Set the bool at *DATA to true if DEST is in
> +   memory.  */
> +
> +static void
> +find_mem_def (rtx dest, const_rtx, void *data)
> +{
> +  /* note_stores has stripped things like subregs and zero_extracts,
> +     so we don't need to worry about them here.  */
> +  if (MEM_P (dest))
> +    *(bool *) data = true;
> +}
> +
> +/* Record all register and memory definitions in INSN_INFO and fill in its
> +   "defs" list.  */
> +
> +void
> +combine2::record_defs (insn_info_rec *insn_info)
> +{
> +  rtx_insn *insn = insn_info->insn;
> +
> +  /* Record register definitions.  */
> +  df_ref def;
> +  FOR_EACH_INSN_DEF (def, insn)
> +    {
> +      rtx reg = DF_REF_REAL_REG (def);
> +      unsigned int end_regno = END_REGNO (reg);
> +      for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
> +	{
> +	  live_range_rec *range = reg_live_range (regno);
> +	  range->producer = insn_info;
> +	  m_reg_info[regno].live_p = false;
> +	  obstack_ptr_grow (&m_insn_obstack, range);
> +	}
> +    }
> +
> +  /* If the instruction writes to memory, record that too.  */
> +  bool saw_mem_p = false;
> +  note_stores (insn, find_mem_def, &saw_mem_p);
> +  if (saw_mem_p)
> +    {
> +      live_range_rec *range = mem_live_range ();
> +      range->producer = insn_info;
> +      obstack_ptr_grow (&m_insn_obstack, range);
> +    }
> +
> +  /* Complete the list of definitions.  */
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +}
> +
> +/* Record that INSN_INFO contains register use USE.  If this requires
> +   new entries to be added to INSN_INFO->uses, add those entries to the
> +   list we're building in m_insn_obstack.  */
> +
> +void
> +combine2::record_reg_use (insn_info_rec *insn_info, df_ref use)
> +{
> +  rtx reg = DF_REF_REAL_REG (use);
> +  unsigned int end_regno = END_REGNO (reg);
> +  for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
> +    {
> +      live_range_rec *range = reg_live_range (regno);
> +      if (add_range_use (range, insn_info))
> +	obstack_ptr_grow (&m_insn_obstack, range);
> +      m_reg_info[regno].live_p = true;
> +    }
> +}
> +
> +/* A note_uses callback.  Set the bool at DATA to true if *LOC reads
> +   from variable memory.  */
> +
> +static void
> +find_mem_use (rtx *loc, void *data)
> +{
> +  subrtx_iterator::array_type array;
> +  FOR_EACH_SUBRTX (iter, array, *loc, NONCONST)
> +    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
> +      {
> +	*(bool *) data = true;
> +	break;
> +      }
> +}
> +
> +/* Record all register and memory uses in INSN_INFO and fill in its
> +   "uses" list.  */
> +
> +void
> +combine2::record_uses (insn_info_rec *insn_info)
> +{
> +  rtx_insn *insn = insn_info->insn;
> +
> +  /* Record register uses in the main pattern.  */
> +  df_ref use;
> +  FOR_EACH_INSN_USE (use, insn)
> +    record_reg_use (insn_info, use);
> +
> +  /* Treat REG_EQUAL uses as first-class uses.  We don't lose much
> +     by doing that, since it's rare for a REG_EQUAL note to mention
> +     registers that the main pattern doesn't.  It also gives us the
> +     maximum freedom to use REG_EQUAL notes in place of the main pattern.  */
> +  FOR_EACH_INSN_EQ_USE (use, insn)
> +    record_reg_use (insn_info, use);
> +
> +  /* Record a memory use if either the pattern or the notes read from
> +     memory.  */
> +  bool saw_mem_p = false;
> +  note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p);
> +  for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1))
> +    if (REG_NOTE_KIND (note) == REG_EQUAL
> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
> +      note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p);
> +  if (saw_mem_p)
> +    {
> +      live_range_rec *range = mem_live_range ();
> +      if (add_range_use (range, insn_info))
> +	obstack_ptr_grow (&m_insn_obstack, range);
> +    }
> +
> +  /* Complete the list of uses.  */
> +  obstack_ptr_grow (&m_insn_obstack, NULL);
> +  insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
> +}
> +
> +/* Start a new instruction sequence, discarding all information about
> +   the previous one.  */
> +
> +void
> +combine2::start_sequence (void)
> +{
> +  m_end_of_sequence = m_point;
> +  m_mem_range = NULL;
> +  m_points.truncate (0);
> +  obstack_free (&m_insn_obstack, m_insn_obstack_start);
> +  obstack_free (&m_range_obstack, m_range_obstack_start);
> +}
> +
> +/* Run the pass on the current function.  */
> +
> +void
> +combine2::execute (void)
> +{
> +  df_analyze ();
> +  FOR_EACH_BB_FN (m_bb, cfun)
> +    {
> +      m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb);
> +      m_end_of_bb = m_point;
> +      start_sequence ();
> +
> +      rtx_insn *insn, *prev;
> +      FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev)
> +	{
> +	  if (!NONDEBUG_INSN_P (insn))
> +	    continue;
> +
> +	  /* The current m_point represents the end of the sequence if
> +	     INSN is the last instruction in the sequence, otherwise it
> +	     represents the gap between INSN and the next instruction.
> +	     m_point + 1 represents INSN itself.
> +
> +	     Instructions can be added to m_point by inserting them
> +	     after INSN.  They can be added to m_point + 1 by inserting
> +	     them before INSN.  */
> +	  m_points.safe_push (insn);
> +	  m_point += 1;
> +
> +	  insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec);
> +	  insn_info->insn = insn;
> +	  insn_info->point = m_point;
> +	  insn_info->cost = UNKNOWN_COST;
> +
> +	  record_defs (insn_info);
> +	  record_uses (insn_info);
> +
> +	  /* Set up m_point for the next instruction.  */
> +	  m_point += 1;
> +
> +	  if (CALL_P (insn))
> +	    start_sequence ();
> +	  else
> +	    while (optimize_insn (insn_info))
> +	      gcc_assert (insn_info->insn);
> +	}
> +    }
> +
> +  /* If an instruction changes the cfg, update the containing block
> +     accordingly.  */
> +  rtx_insn *insn;
> +  unsigned int i;
> +  FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn)
> +    if (JUMP_P (insn))
> +      {
> +	mark_jump_label (PATTERN (insn), insn, 0);
> +	update_cfg_for_uncondjump (insn);
> +      }
> +    else
> +      {
> +	remove_edge (split_block (BLOCK_FOR_INSN (insn), insn));
> +	emit_barrier_after_bb (BLOCK_FOR_INSN (insn));
> +      }
> +
> +  /* Propagate the above block-local cfg changes to the rest of the cfg.  */
> +  if (!m_cfg_altering_insns.is_empty ())
> +    {
> +      if (dom_info_available_p (CDI_DOMINATORS))
> +	free_dominance_info (CDI_DOMINATORS);
> +      timevar_push (TV_JUMP);
> +      rebuild_jump_labels (get_insns ());
> +      cleanup_cfg (0);
> +      timevar_pop (TV_JUMP);
> +    }
> +}
> +
> +const pass_data pass_data_combine2 =
> +{
> +  RTL_PASS, /* type */
> +  "combine2", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_COMBINE2, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_combine2 : public rtl_opt_pass
> +{
> +public:
> +  pass_combine2 (gcc::context *ctxt, int flag)
> +    : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag)
> +  {}
> +
> +  bool
> +  gate (function *) OVERRIDE
> +  {
> +    return optimize && (param_run_combine & m_flag) != 0;
> +  }
> +
> +  unsigned int
> +  execute (function *f) OVERRIDE
> +  {
> +    combine2 (f).execute ();
> +    return 0;
> +  }
> +
> +private:
> +  unsigned int m_flag;
> +}; // class pass_combine2
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_combine2_before (gcc::context *ctxt)
> +{
> +  return new pass_combine2 (ctxt, 1);
> +}
> +
> +rtl_opt_pass *
> +make_pass_combine2_after (gcc::context *ctxt)
> +{
> +  return new pass_combine2 (ctxt, 4);
> +}
Richard Sandiford Nov. 21, 2019, 7:41 p.m. UTC | #9
Hi Nick,

Thanks for the comments.

Nicholas Krause <xerofoify@gmail.com> writes:
>> Index: gcc/passes.def
>> ===================================================================
>> --- gcc/passes.def	2019-10-29 08:29:03.224443133 +0000
>> +++ gcc/passes.def	2019-11-17 23:15:31.200500531 +0000
>> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
>>         NEXT_PASS (pass_inc_dec);
>>         NEXT_PASS (pass_initialize_regs);
>>         NEXT_PASS (pass_ud_rtl_dce);
>> +      NEXT_PASS (pass_combine2_before);
>>         NEXT_PASS (pass_combine);
>> +      NEXT_PASS (pass_combine2_after);
>>         NEXT_PASS (pass_if_after_combine);
>>         NEXT_PASS (pass_jump_after_combine);
>>         NEXT_PASS (pass_partition_blocks);
>> Index: gcc/timevar.def
> This is really two passes it seems or at least functions. Just a nit but you
> may want to state that as I don't recall reading that.

It's really two instances of the same pass, but yeah, each instance
goes under a different name.  This is because each instance needs to
know which bit of the run-combine value it should be testing:

>> The patch adds two instances of the new pass: one before combine and
>> one after it.  By default both are disabled, but this can be changed
>> using the new 3-bit run-combine param, where:
>>
>> - bit 0 selects the new pre-combine pass
>> - bit 1 selects the main combine pass
>> - bit 2 selects the new post-combine pass

So bit 0 is pass_combine2_before, bit 1 is pass_combine and bit 2 is
pass_combine2_after.  But the passes are identical apart from the choice
of bit they test.
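
To make the bit assignment concrete, here is a minimal standalone sketch
of the gate test.  The enum names and pass_enabled_p are made up for
illustration (the patch itself just passes 1 and 4 to the pass constructor
and tests param_run_combine & m_flag); only the bit positions come from
the description quoted above.

```cpp
#include <cassert>

// Hypothetical names for the three bits of the run-combine param.
enum run_combine_bit : unsigned int
{
  RUN_COMBINE2_BEFORE = 1u << 0,  // new pre-combine pass
  RUN_COMBINE = 1u << 1,          // main combine pass
  RUN_COMBINE2_AFTER = 1u << 2    // new post-combine pass
};

// Same shape as the gate in the patch: a pass instance runs only when
// optimizing and when its own bit is set in the param value.
static bool
pass_enabled_p (bool optimize, unsigned int run_combine, unsigned int flag)
{
  // Each instance is expected to test exactly one bit.
  assert (flag != 0 && (flag & (flag - 1)) == 0);
  return optimize && (run_combine & flag) != 0;
}
```

With this encoding, --param run-combine=6 enables the main combine pass
and the post-combine instance but not the pre-combine one, and a value
of 2 leaves only the main combine pass.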

>> +  /* Describes one attempt to combine instructions.  */
>> +  struct combination_attempt_rec
>> +  {
>> +    /* The instruction that we're currently trying to optimize.
>> +       If the combination succeeds, we'll use this insn_info_rec
>> +       to describe the new instruction.  */
>> +    insn_info_rec *new_home;
>> +
>> +    /* The instructions we're combining, in program order.  */
>> +    insn_info_rec *sequence[MAX_COMBINE_INSNS];
> Can't we make this a vec so that it can grow as needed, and just loop through
> merging the instructions in the vec as required?

Yeah, extending this to combining more than 2 instructions would be
future work.  When that happens, this would likely end up becoming an
auto_vec<insn_info_rec *, MAX_COMBINE_INSNS>.  I imagine there would
still be a fairly low compile-time limit on the number of combinations
though.  E.g. current combine has a limit of 4, with even 4 being
restricted to certain high-value cases.  I don't think I've ever
seen a case where 5 or more would help.

>> +/* Return true if we know that USER is the last user of RANGE.  */
>> +
>> +bool
>> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
>> +{
>> +  if (range->last_extra_use <= user->point)
>> +    return false;
>> +
>> +  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
>> +    if (range->users[i] == user)
>> +      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
> Small nit and I could be wrong but do:
>
> return !range->users[i + 1] || i == NUM_RANGE_USERS - 1;
>
> Based on your code it seems that getting to NUM_RANGE_USERS is far
> less likely.

The problem is that we'll then be accessing outside the users[] array
when i == NUM_RANGE_USERS - 1, so we have to check the limit first.
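
A cut-down illustration of why the order matters.  This is only a sketch:
NUM_USERS, range and last_user_p are stand-ins invented for the example,
not the types in the patch.  The point is that || evaluates left to right
and stops as soon as the left operand is true, so the limit check guards
the array read.

```cpp
#include <cassert>

// Stand-in for NUM_RANGE_USERS.
const unsigned int NUM_USERS = 4;

// Stand-in for the users[] array in a live range: a null-terminated
// list unless all NUM_USERS slots are in use.
struct range
{
  const int *users[NUM_USERS];
};

// Return true if USER is the last recorded entry in R.users.
static bool
last_user_p (const range &r, const int *user)
{
  assert (user);
  for (unsigned int i = 0; i < NUM_USERS && r.users[i]; ++i)
    if (r.users[i] == user)
      // Check the limit first.  When i == NUM_USERS - 1 the left
      // operand is true, so users[i + 1] (one past the end) is
      // never read.
      return i == NUM_USERS - 1 || !r.users[i + 1];
  return false;
}
```

Swapping the operands would read users[NUM_USERS] whenever USER sits in
the final slot, which is undefined behavior regardless of how rarely
that slot is reached.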

Thanks,
Richard
Richard Sandiford Nov. 21, 2019, 8:32 p.m. UTC | #10
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote:
>> > Why don't you use DF for the DU chains?
>> 
>> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
>
> Oh, wow.
>
>> fwprop.c gets around that by using the MD problem and having its own
>> dominator walker to calculate limited def-use chains:
>> 
>>   /* We use the multiple definitions problem to compute our restricted
>>      use-def chains.  */
>
> It's not great if every pass invents its own version of some common
> infrastructure thing because that common one is not suitable.
>
> I.e., can this be fixed somehow?  Maybe just by having a restricted DU
> chains df problem?

Well, it'd probably make sense to make fwprop.c's approach available
as a "proper" df interface at some point.  Hopefully if anyone wants the
same thing as fwprop.c, they'd do that rather than copy the code. :-)

>> So taking that approach here would still require some amount of
>> roll-your-own.  Other reasons are:
>> 
>> * Even what fwprop does is more elaborate than we need for now.
>> 
>> * We need to handle memory too, and it's nice to be able to handle
>>   it in the same way as registers.
>> 
>> * Updating a full, ordered def-use chain after a move is a linear-time
>>   operation, so whatever happens, we'd need to apply some kind of limit
>>   on the number of uses we maintain, with something like that integer
>>   point range for the rest.
>> 
>> * Once we've analysed the insn and built its def-use chains, we don't
>>   look at the df_refs again until we update the chains after a successful
>>   combination.  So it should be more efficient to maintain a small array
>>   of insn_info_rec pointers alongside the numerical range, rather than
>>   walk and pollute chains of df_refs and then link back the insn uids
>>   to the pass-local info.
>
> So you need something like combine's LOG_LINKS?  Not that handling those
> is not quadratic in the worst case, but in practice it works well.  And
> it *could* be made linear.

Not sure why what I've used isn't what I need though :-)  If it's an
array vs. linked-list thing, then for the multi-use case, we need
two sets of link pointers, one for "next use of the same resource"
and one for "next use in this instruction".  Then we need the payload of
the list node itself.  For the small number of entries we're talking about,
using null-terminated arrays of "things that this instruction uses"
and "instructions that use this resource" should be more efficient than
pointer-chasing, and occupies the same space as the link pointers
(i.e. saves the extra payload).

We also need to be able to walk in both directions, to answer the
questions:

- which insns can I combine with this definition?
- where is this value of a resource defined?
- where are the uses of this resource?
- where was the previous definition of this resource, and where
  was its last use?

So if we're comparing it to existing linked-list GCC structures,
it's more similar to df_ref (see above for why that seemed like
a bad idea) or -- more light-weight -- dep_link_t in the scheduler.

And both the array and linked-list approaches still need to fall back to
the simple live range once a certain threshold is hit.
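
A rough sketch of the array-plus-fallback idea, for illustration only.
This is not the patch's live_range_rec: MAX_TRACKED_USERS, insn_rec,
range_rec and add_user are hypothetical names, and the details in the
patch may well differ.  It just shows a small fixed array of tracked
users that degrades to a coarse range of points once the array is full.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical limit on individually tracked users.
const unsigned int MAX_TRACKED_USERS = 2;

// Hypothetical stand-in for the per-instruction record.
struct insn_rec
{
  unsigned int point;
};

struct range_rec
{
  // Individually tracked users, null-terminated while there is room.
  insn_rec *users[MAX_TRACKED_USERS] = {};
  // Coarse summary of any users that did not fit in the array.
  bool has_extra_uses = false;
  unsigned int min_extra_point = 0;
  unsigned int max_extra_point = 0;
};

// Record USER as a user of RANGE.  Return true if USER is tracked
// individually, false if it was folded into the coarse point range.
static bool
add_user (range_rec &range, insn_rec *user)
{
  assert (user);
  for (unsigned int i = 0; i < MAX_TRACKED_USERS; ++i)
    {
      if (range.users[i] == user)
        return true;
      if (!range.users[i])
        {
          range.users[i] = user;
          return true;
        }
    }

  // Threshold hit: only remember the span of points that the
  // untracked users cover.
  if (!range.has_extra_uses)
    {
      range.has_extra_uses = true;
      range.min_extra_point = range.max_extra_point = user->point;
    }
  else
    {
      range.min_extra_point = std::min (range.min_extra_point, user->point);
      range.max_extra_point = std::max (range.max_extra_point, user->point);
    }
  return false;
}
```

Walking the tracked users is then a short linear scan over adjacent
memory, and anything beyond the limit costs a constant amount of state
however many uses there are.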

>> The second set is for:
>> 
>> (B) --param run-combine=6 (both passes), use-use combinations only
>> (C) --param run-combine=6 (both passes), no restrictions
>> 
>> Target                 Tests   Delta    Best   Worst  Median
>> ======                 =====   =====    ====   =====  ======
>> aarch64-linux-gnu        272   -3844    -585      18      -1
>> aarch64_be-linux-gnu     190   -3336    -370      18      -1
>> alpha-linux-gnu          401   -2735    -370      22      -2
>> amdgcn-amdhsa            188    1867    -484    1259      -1
>> arc-elf                  257   -1498    -650      54      -1
>> arm-linux-gnueabi        168   -1117    -612     680      -1
>> arm-linux-gnueabihf      168   -1117    -612     680      -1
>> avr-elf                 1341 -111401  -13824     680     -10
>
> Things like this are kind of suspicious :-)

Yeah.  This mostly seems to come from mopping up the extra moves created
by make_more_copies.  So we have combinations like:

   58: r70:SF=r94:SF
      REG_DEAD r94:SF
   60: r22:SF=r70:SF
      REG_DEAD r70:SF

(r22 is a hard reg, the others are pseudos) which produces:

        std Y+1,r22
        std Y+2,r23
        std Y+3,r24
        std Y+4,r25
-       ldd r22,Y+1
-       ldd r23,Y+2
-       ldd r24,Y+3
-       ldd r25,Y+4

On the REG_EQUAL thing: you're right that it doesn't make much difference
for run-combine=6:

Target             Tests   Delta    Best   Worst  Median
======             =====   =====    ====   =====  ======
arc-elf                1      -1      -1      -1      -1
avr-elf                1      -1      -1      -1      -1
bfin-elf               1      -1      -1      -1      -1
bpf-elf                2      -6      -5      -1      -5
c6x-elf                1      -2      -2      -2      -2
cr16-elf               1       7       7       7       7
epiphany-elf           5     -15      -4      -1      -4
fr30-elf               2     -16     -11      -5     -11
frv-linux-gnu          2     -20     -16      -4     -16
h8300-elf              2      -2      -1      -1      -1
i686-apple-darwin      1      -3      -3      -3      -3
ia64-linux-gnu         3     -39     -26      -6      -7
m32r-elf               3     -17     -10      -2      -5
mcore-elf              4      -7      -3      -1      -2
mn10300-elf            1      -2      -2      -2      -2
moxie-rtems            4     -15      -5      -2      -4
nds32le-elf            1      -1      -1      -1      -1
nios2-linux-gnu        1      -1      -1      -1      -1
or1k-elf               3     -18     -12      -2      -4
s390-linux-gnu         6     -28      -9      -1      -7
s390x-linux-gnu        1      -1      -1      -1      -1
sh-linux-gnu           1      -1      -1      -1      -1
sparc-linux-gnu        4     -24     -14      -2      -5
xstormy16-elf          9     -27     -10      -1      -2

So there's only one case in which it isn't a win, but the number of
tests is tiny.  So I agree there's no justification for trying this in
combine proper as things stand (and I wasn't arguing otherwise FWIW).
I'd still like to keep it in the new pass because it does help
*sometimes* and there's no sign yet that it has a noticeable
compile-time cost.

It might also be interesting to see how much difference it makes for
run-combine=4 (e.g. to see how much it makes up for the current 2-insn
limit)...

Thanks,
Richard
Segher Boessenkool Nov. 22, 2019, 4:16 p.m. UTC | #11
On Thu, Nov 21, 2019 at 07:41:56PM +0000, Richard Sandiford wrote:
> Nicholas Krause <xerofoify@gmail.com> writes:
> >> +    /* The instructions we're combining, in program order.  */
> >> +    insn_info_rec *sequence[MAX_COMBINE_INSNS];
> > > Can't we make this a vec so that it can grow as needed, and just loop through
> > > merging the instructions in the vec as required?
> 
> Yeah, extending this to combining more than 2 instructions would be
> future work.  When that happens, this would likely end up becoming an
> auto_vec<insn_info_rec *, MAX_COMBINE_INSNS>.  I imagine there would
> still be a fairly low compile-time limit on the number of combinations
> though.  E.g. current combine has a limit of 4, with even 4 being
> restricted to certain high-value cases.  I don't think I've ever
> seen a case where 5 or more would help.

And sometimes it looks like 4 would help, but often this is because of a
limitation elsewhere (like, it should have done a 2->2 before, for example).

4 _does_ help quite a bit with irregular instruction sets.  It could
sometimes help with RMW insns, too, but there are other problems with
that.

What you see a lot where 4 "helps" is where it really should combine
with just 3 of them, but something prevents that, often cost, while
throwing in a 4th insn tilts the balance just enough.  We used to have
a lot of that with 3-insn combinations as well, and probably still have
some.


Segher
Segher Boessenkool Nov. 22, 2019, 4:39 p.m. UTC | #12
On Thu, Nov 21, 2019 at 08:32:14PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > It's not great if every pass invents its own version of some common
> > infrastructure thing because that common one is not suitable.
> >
> > I.e., can this be fixed somehow?  Maybe just by having a restricted DU
> > chains df problem?
> 
> Well, it'd probably make sense to make fwprop.c's approach available
> as a "proper" df interface at some point.  Hopefully if anyone wants the
> same thing as fwprop.c, they'd do that rather than copy the code. :-)

> >> * Updating a full, ordered def-use chain after a move is a linear-time
> >>   operation, so whatever happens, we'd need to apply some kind of limit
> >>   on the number of uses we maintain, with something like that integer
> >>   point range for the rest.

Yeah.

> >> * Once we've analysed the insn and built its def-use chains, we don't
> >>   look at the df_refs again until we update the chains after a successful
> >>   combination.  So it should be more efficient to maintain a small array
> >>   of insn_info_rec pointers alongside the numerical range, rather than
> >>   walk and pollute chains of df_refs and then link back the insn uids
> >>   to the pass-local info.
> >
> > So you need something like combine's LOG_LINKS?  Not that handling those
> > is not quadratic in the worst case, but in practice it works well.  And
> > it *could* be made linear.
> 
> Not sure why what I've used isn't what I need though :-)

I am wondering the other way around :-)  Is what you do for combine2
something that would be more generally applicable/useful?  That's what
I'm trying to find out :-)

What combine does could use some improvement, if you want to hear a
more direct motivation.  LOG_LINKS just skip references we cannot
handle (and some more), so we always have to do modified_between etc.,
which hurts.

> >> Target                 Tests   Delta    Best   Worst  Median
> >> avr-elf                 1341 -111401  -13824     680     -10
> >
> > Things like this are kind of suspicious :-)
> 
> Yeah.  This mostly seems to come from mopping up the extra moves created
> by make_more_copies.  So we have combinations like:
> 
>    58: r70:SF=r94:SF
>       REG_DEAD r94:SF
>    60: r22:SF=r70:SF
>       REG_DEAD r70:SF

Why didn't combine do this?  A target problem?

> So there's only one case in which it isn't a win, but the number of
> tests is tiny.  So I agree there's no justification for trying this in
> combine proper as things stand (and I wasn't arguing otherwise FWIW).
> I'd still like to keep it in the new pass because it does help
> *sometimes* and there's no sign yet that it has a noticeable
> compile-time cost.

So when does it help?  I can only think of cases where there are
problems elsewhere.

> It might also be interesting to see how much difference it makes for
> run-combine=4 (e.g. to see how much it makes up for the current 2-insn
> limit)...

Numbers are good :-)


Segher
Segher Boessenkool Nov. 23, 2019, 10:34 p.m. UTC | #13
Hi!

On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote:
> Richard Sandiford <richard.sandiford@arm.com> writes:
> > (It's 23:35 local time, so it's still just about stage 1. :-))
> 
> Or actually, just under 1 day after end of stage 1.  Oops.
> Could have sworn stage 1 ended on the 17th :-(  Only realised
> I'd got it wrong when catching up on Saturday's email traffic.
> 
> And inevitably, I introduced a couple of stupid mistakes while
> trying to clean the patch up for submission by that (non-)deadline.
> Here's a version that fixes an inverted overlapping memref check
> and that correctly prunes the use list for combined instructions.
> (This last one is just a compile-time saving -- the old code was
> correct, just suboptimal.)

I've built the Linux kernel with the previous version, as well as this
one.  R0 is unmodified GCC, R1 is the first patch, R2 is this one:

(I've forced --param=run-combine=6 for R1 and R2):
(Percentages are relative to R0):

                    R0        R1        R2     R1/R0     R2/R0
       alpha   6107088   6101088   6101088   99.902%   99.902%
         arc   4008224   4006568   4006568   99.959%   99.959%
         arm   9206728   9200936   9201000   99.937%   99.938%
       arm64  13056174  13018174  13018194   99.709%   99.709%
       armhf         0         0         0         0         0
         c6x   2337237   2337077   2337077   99.993%   99.993%
        csky   3356602         0         0         0         0
       h8300   1166996   1166776   1166776   99.981%   99.981%
        i386  11352159         0         0         0         0
        ia64  18230640  18167000  18167000   99.651%   99.651%
        m68k   3714271         0         0         0         0
  microblaze   4982749   4979945   4979945   99.944%   99.944%
        mips   8499309   8495205   8495205   99.952%   99.952%
      mips64   7042036   7039816   7039816   99.968%   99.968%
       nds32   4486663         0         0         0         0
       nios2   3680001   3679417   3679417   99.984%   99.984%
    openrisc   4226076   4225868   4225868   99.995%   99.995%
      parisc   7681895   7680063   7680063   99.976%   99.976%
    parisc64   8677077   8676581   8676581   99.994%   99.994%
     powerpc  10687611  10682199  10682199   99.949%   99.949%
   powerpc64  17671082  17658570  17658570   99.929%   99.929%
 powerpc64le  17671082  17658570  17658570   99.929%   99.929%
     riscv32   1554938   1554758   1554758   99.988%   99.988%
     riscv64   6634342   6632788   6632788   99.977%   99.977%
        s390  13049643  13014939  13014939   99.734%   99.734%
          sh   3254743         0         0         0         0
     shnommu   1632364   1632124   1632124   99.985%   99.985%
       sparc   4404993   4399593   4399593   99.877%   99.877%
     sparc64   6796711   6797491   6797491  100.011%  100.011%
      x86_64  19713174  19712817  19712817   99.998%   99.998%
      xtensa         0         0         0         0         0

0 means it didn't build.
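
As a quick check of how the percentage columns are derived, here is a minimal sketch that recomputes the size relative to R0 for a few rows copied from the table.  The helper name `relative_size` is only illustrative and not part of any tool used in this thread.

```python
# Sizes copied from the table above: target -> (R0, R1, R2).
SIZES = {
    "arm64": (13056174, 13018174, 13018194),
    "ia64": (18230640, 18167000, 18167000),
    "sparc64": (6796711, 6797491, 6797491),
    "csky": (3356602, 0, 0),  # 0 means the build failed
}


def relative_size(patched, baseline):
    """Return patched size as a percentage of baseline, or None if a build failed."""
    if patched == 0 or baseline == 0:
        return None
    return 100.0 * patched / baseline


for target, (r0, r1, r2) in SIZES.items():
    pct = relative_size(r2, r0)
    print(target, "n/a" if pct is None else f"{pct:.3f}%")
```

Treating a size of 0 as "no data" rather than as 0% keeps failed builds out of any later comparison.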

armhf is probably my own problem, not sure yet.

xtensa starts with
/tmp/ccmJoY7l.s: Assembler messages:
/tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object file
and it doesn't get better.

My powerpc64 config actually built the powerpc64le config, since for a
while now the kernel has looked at what the host system is when picking
its defconfig.  Oh well, fixed now.

There are five new failures, with either of the combine2 patches.  And
all five are actually different (different symptoms, at least):

- csky fails on libgcc build:

/home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
/home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for:
 1405 | }
      | ^
(insn 199 86 87 8 (parallel [
            (set (reg:SI 101)
                (plus:SI (reg:SI 98)
                    (const_int -32 [0xffffffffffffffe0])))
            (set (reg:CC 33 c)
                (lt:CC (plus:SI (reg:SI 98)
                        (const_int -32 [0xffffffffffffffe0]))
                    (const_int 0 [0])))
        ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
     (nil))
during RTL pass: reload

Target problem?

- i386 goes into an infinite loop compiling, or at least an hour or so...
Erm I forgot to record what it was compiling.  I did attach a GDB...  It
is something from lra_create_live_ranges.

- m68k:

/home/segher/src/kernel/fs/exec.c: In function 'copy_strings':
/home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048
  590 | }
      | ^
0x10408307 final_scan_insn_1
        /home/segher/src/gcc/gcc/final.c:3048
0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
        /home/segher/src/gcc/gcc/final.c:3152
0x10408797 final_1
        /home/segher/src/gcc/gcc/final.c:2020
0x104091f7 rest_of_handle_final
        /home/segher/src/gcc/gcc/final.c:4658
0x104091f7 execute
        /home/segher/src/gcc/gcc/final.c:4736

and that line is
            gcc_assert (prev_nonnote_insn (insn) == last_ignored_compare);

- nds32:

/tmp/ccC8Czca.s: Assembler messages:
/tmp/ccC8Czca.s:3144: Error: Unrecognized operand/register, lmw.bi [$fp+(-60)],[$fp],$r11,0x0.

/tmp/ccl8o20c.s: Assembler messages:
/tmp/ccl8o20c.s:2449: Error: Unrecognized operand/register, lmw.bi $r9,[$fp],[$fp+(-132)],0x0.

/tmp/ccZxjwHd.s: Assembler messages:
/tmp/ccZxjwHd.s:4776: Error: Unrecognized operand/register, lmw.bi [$fp+(-52)],[$fp],[$fp+(-56)],0x0.

/tmp/cczjOS3d.s: Assembler messages:
/tmp/cczjOS3d.s:2336: Error: Unrecognized operand/register, lmw.bi $r16,[$fp],$r7,0x0.

and more.  All lmw.bi...  target issue?

- sh (that's sh4-linux):

/home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
/home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
 1638 | }
      | ^
/home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
(insn 18 17 19 2 (set (reg:SI 0 r0)
        (mem:SI (plus:SI (reg:SI 4 r4 [178])
                (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
     (expr_list:REG_DEAD (reg:SI 4 r4 [178])
        (expr_list:REG_DEAD (reg:SI 6 r6 [171])
            (nil))))
/home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out


Looking at just binary size, which is a good stand-in for how many insns
it combined:

                    R2
       arm64   99.709%
        ia64   99.651%
        s390   99.734%
       sparc   99.877%
     sparc64  100.011%

(These are those that are not between 99.9% and 100.0%).
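
The selection above can be reproduced with a short sketch.  It assumes the same R0 and R2 sizes as the full table (only a subset of rows is copied here) and skips targets that did not build; the function name `outliers` is made up for the example.

```python
# (R0, R2) sizes copied from the full table; 0 means the build failed.
SIZES = {
    "alpha": (6107088, 6101088),
    "arm64": (13056174, 13018194),
    "csky": (3356602, 0),
    "ia64": (18230640, 18167000),
    "s390": (13049643, 13014939),
    "sparc": (4404993, 4399593),
    "sparc64": (6796711, 6797491),
    "x86_64": (19713174, 19712817),
}


def outliers(sizes, low=99.9, high=100.0):
    """Return {target: pct} for builds whose R2/R0 percentage is outside [low, high]."""
    result = {}
    for target, (r0, r2) in sizes.items():
        if r0 == 0 or r2 == 0:
            continue  # skip targets that did not build
        pct = 100.0 * r2 / r0
        if not (low <= pct <= high):
            result[target] = round(pct, 3)
    return result


for target, pct in sorted(outliers(SIZES).items()):
    print(f"{target:>12} {pct:8.3f}%")
```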

So only sparc64 regressed, and just a tiny bit (I can look at what that
is, if there is interest).  But 32-bit sparc improved, and s390, arm64,
and ia64 got actual benefit.

Again this is just code size, not analysing the actually changed code.


I did look at the powerpc64le changes.  It is almost completely load-
with-update (and store-with-update) insns that make the difference, but
there are also some dot insns.  The extra mr. are usually not a good
idea, but the extsw. are.  Sometimes this causes *more* insns in the end
(register move insns), but that is the exception.

This mr. problem is there with combine already, btw.  In the end it is
caused by this just not being something good to do on pseudos, it would
be better to do this after RA, in a peephole or similar.  OTOH it isn't
actually really important for performance either way.

Btw, does the new pass use TARGET_LEGITIMATE_COMBINED_INSN?  It probably
should.  (That would be the hook where we would probably want to prevent
generating mr. insns).


Segher
Nicholas Krause Nov. 23, 2019, 11:01 p.m. UTC | #14
Segher,

Please just CC me on this conversation, as I keep getting removed.

Thanks,
Nick
Segher Boessenkool Nov. 23, 2019, 11:09 p.m. UTC | #15
On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote:
> Please just CC me on this conversation, as I keep getting removed.

Everyone who was on Cc: for this thread still is.  This is how email
works.  If you want to see everything on the list, subscribe to the
mailing list?


Segher
Nicholas Krause Nov. 23, 2019, 11:43 p.m. UTC | #16
On 11/23/19 6:09 PM, Segher Boessenkool wrote:
> On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote:
>> Please just CC me on this conversation, as I keep getting removed.
> Everyone who was on Cc: for this thread still is.  This is how email
> works.  If you want to see everything on the list, subscribe to the
> mailing list?
>
>
> Segher

I was on the original Cc list for my comments, but there seemed to be
two diverging versions of the thread.  I would just like to keep all
comments merged in one thread, is all.

Sorry for the confusion Segher,
Nick
Richard Sandiford Nov. 25, 2019, 9:16 p.m. UTC | #17
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Thu, Nov 21, 2019 at 08:32:14PM +0000, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > It's not great if every pass invents its own version of some common
>> > infrastructure thing because that common one is not suitable.
>> >
>> > I.e., can this be fixed somehow?  Maybe just by having a restricted DU
>> > chains df problem?
>> 
>> Well, it'd probably make sense to make fwprop.c's approach available
>> as a "proper" df interface at some point.  Hopefully if anyone wants the
>> same thing as fwprop.c, they'd do that rather than copy the code. :-)
>
>> >> * Updating a full, ordered def-use chain after a move is a linear-time
>> >>   operation, so whatever happens, we'd need to apply some kind of limit
>> >>   on the number of uses we maintain, with something like that integer
>> >>   point range for the rest.
>
> Yeah.
>
>> >> * Once we've analysed the insn and built its def-use chains, we don't
>> >>   look at the df_refs again until we update the chains after a successful
>> >>   combination.  So it should be more efficient to maintain a small array
>> >>   of insn_info_rec pointers alongside the numerical range, rather than
>> >>   walk and pollute chains of df_refs and then link back the insn uids
>> >>   to the pass-local info.
>> >
>> > So you need something like combine's LOG_LINKS?  Not that handling those
>> > is not quadratic in the worst case, but in practice it works well.  And
>> > it *could* be made linear.
>> 
>> Not sure why what I've used isn't what I need though :-)
>
> I am wondering the other way around :-)  Is what you do for combine2
> something that would be more generally applicable/useful?  That's what
> I'm trying to find out :-)
>
> What combine does could use some improvement, if you want to hear a
> more direct motivation.  LOG_LINKS just skip references we cannot
> handle (and some more), so we always have to do modified_between etc.,
> which hurts.

The trade-offs behind the choice of representation are very specific
to the pass.  You'd only pick this if you wanted both to propagate
definitions into uses and to move insns around.  You'd also only pick
it if you were happy with tracking a small number of named uses per
definition.  I can't think of any other passes that would prefer this
over what they already use.  (Combine itself is an exception, since
the new pass started out as a deliberate attempt to start from scratch.)
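
To make that trade-off concrete, here is a rough sketch of the kind of record being described: a definition that tracks a few named uses individually and falls back to a point range for everything else.  All names (`DefRecord`, `MAX_NAMED_USES`, `fully_tracked`) and the limit of 4 are hypothetical and are not taken from the patch.

```python
from bisect import insort
from dataclasses import dataclass, field
from typing import List, Optional

MAX_NAMED_USES = 4  # hypothetical cap; the real pass may use a different value


@dataclass
class DefRecord:
    """One definition, with a few named uses and a point range covering all uses."""
    def_point: int
    named_uses: List[int] = field(default_factory=list)  # sorted instruction points
    first_use: Optional[int] = None
    last_use: Optional[int] = None
    overflowed: bool = False

    def add_use(self, point: int) -> None:
        # The range always covers every use, named or not.
        self.first_use = point if self.first_use is None else min(self.first_use, point)
        self.last_use = point if self.last_use is None else max(self.last_use, point)
        if len(self.named_uses) < MAX_NAMED_USES:
            insort(self.named_uses, point)
        else:
            self.overflowed = True  # further uses are known only through the range

    def fully_tracked(self) -> bool:
        """True if every use is still known individually."""
        return not self.overflowed
```

With a cap like this, adding a use only ever touches a bounded list, and a definition that overflows can still be reasoned about conservatively through its range.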

>> >> Target                 Tests   Delta    Best   Worst  Median
>> >> avr-elf                 1341 -111401  -13824     680     -10
>> >
>> > Things like this are kind of suspicious :-)
>> 
>> Yeah.  This mostly seems to come from mopping up the extra moves created
>> by make_more_copies.  So we have combinations like:
>> 
>>    58: r70:SF=r94:SF
>>       REG_DEAD r94:SF
>>    60: r22:SF=r70:SF
>>       REG_DEAD r70:SF
>
> Why didn't combine do this?  A target problem?

Seems to be because combine rejects hard-reg destinations whose classes
are likely spilled (cant_combine_insn_p).  This SF argument register
happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
the combination based on POINTER_X_REGS being likely spilled.

I think the same thing could happen on other targets, e.g. for
TAILCALL_ADDR_REGS on aarch64.

>> So there's only one case in which it isn't a win, but the number of
>> tests is tiny.  So I agree there's no justification for trying this in
>> combine proper as things stand (and I wasn't arguing otherwise FWIW).
>> I'd still like to keep it in the new pass because it does help
>> *sometimes* and there's no sign yet that it has a noticeable
>> compile-time cost.
>
> So when does it help?  I can only think of cases where there are
> problems elsewhere.

The full list of affected tests (all at -O2 -ftree-vectorize) is:

    arc-elf             gcc.c-torture/compile/pr67506.c
    avr-elf             gcc.dg/torture/pr77916.c
    bpf-elf             gcc.dg/torture/vshuf-v8hi.c
    bpf-elf             gcc.dg/torture/vshuf-v4si.c
    bfin-elf            gcc.dg/torture/vshuf-v8qi.c
    c6x-elf             gcc.c-torture/execute/991118-1.c
    cr16-elf            gcc.c-torture/compile/pr82052.c
    epiphany-elf        gcc.c-torture/execute/991118-1.c
    epiphany-elf        gcc.dg/pr77664.c
    epiphany-elf        gcc.dg/vect/vect-mult-pattern-2.c
    epiphany-elf        gcc.dg/torture/vshuf-v8hi.c
    epiphany-elf        gcc.dg/tree-ssa/pr77664.c
    epiphany-elf        gcc.dg/tree-ssa/negneg-3.c
    fr30-elf            gcc.dg/torture/vshuf-v4hi.c
    fr30-elf            gcc.dg/torture/vshuf-v8hi.c
    frv-linux-gnu       gcc.dg/torture/vshuf-v4hi.c
    frv-linux-gnu       gcc.dg/torture/vshuf-v8hi.c
    h8300-elf           gcc.c-torture/execute/20000422-1.c
    h8300-elf           gcc.dg/torture/pr77916.c
    ia64-linux-gnu      gcc.c-torture/execute/ieee/pr30704.c
    ia64-linux-gnu      gcc.dg/vect/pr49478.c
    ia64-linux-gnu      gcc.dg/tree-ssa/ldist-16.c
    i686-apple-darwin   gcc.dg/vect/vect-mult-pattern-2.c
    m32r-elf            gcc.dg/store_merging_8.c
    m32r-elf            gcc.dg/torture/vshuf-v4hi.c
    m32r-elf            gcc.dg/torture/vshuf-v8hi.c
    m32r-elf            gcc.dg/tree-ssa/vrp61.c
    mcore-elf           gcc.c-torture/execute/991118-1.c
    mcore-elf           gcc.dg/torture/vshuf-v4hi.c
    mcore-elf           gcc.dg/torture/vshuf-v8hi.c
    mcore-elf           gcc.dg/torture/vshuf-v8qi.c
    mmix                gcc.dg/torture/20181024-1.c
    mn10300-elf         g++.dg/warn/Warray-bounds-6.C
    moxie-rtems         gcc.c-torture/execute/930718-1.c
    moxie-rtems         gcc.c-torture/compile/pr70263-1.c
    moxie-rtems         gcc.dg/graphite/scop-5.c
    moxie-rtems         g++.dg/pr80707.C
    nds32le-elf         gcc.dg/torture/vshuf-v16qi.c
    nios2-linux-gnu     gcc.dg/torture/vshuf-v8qi.c
    or1k-elf            gcc.dg/torture/vshuf-v4hi.c
    or1k-elf            gcc.dg/torture/vshuf-v8hi.c
    or1k-elf            gcc.dg/tree-ssa/vrp61.c
    powerpc-ibm-aix7.0  g++.dg/warn/Wunused-3.C
    powerpc-ibm-aix7.0  g++.dg/lto/pr88049_0.C
    powerpc-ibm-aix7.0  g++.dg/other/cxa-atexit1.C
    s390-linux-gnu      gcc.c-torture/compile/20020304-1.c
    s390-linux-gnu      gcc.dg/atomic-op-1.c
    s390-linux-gnu      gcc.dg/atomic/stdatomic-op-1.c
    s390-linux-gnu      gcc.dg/atomic/c11-atomic-exec-2.c
    s390-linux-gnu      gcc.dg/atomic/c11-atomic-exec-3.c
    s390-linux-gnu      gcc.dg/ubsan/float-cast-overflow-atomic.c
    s390x-linux-gnu     gcc.c-torture/compile/20020304-1.c
    sh-linux-gnu        gcc.c-torture/execute/991118-1.c
    sh-linux-gnu        gcc.dg/torture/vshuf-v8qi.c
    sparc-linux-gnu     gcc.dg/pr56890-2.c
    sparc-linux-gnu     gcc.dg/torture/vshuf-v4hi.c
    sparc-linux-gnu     gcc.dg/torture/vshuf-v8hi.c
    sparc-linux-gnu     gcc.dg/torture/20181024-1.c
    sparc64-linux-gnu   gcc.dg/torture/20181024-1.c
    xstormy16-elf       gcc.c-torture/execute/strlen-5.c
    xstormy16-elf       gcc.c-torture/execute/20080424-1.c
    xstormy16-elf       gcc.c-torture/compile/pr60655-1.c
    xstormy16-elf       gcc.c-torture/compile/pr60655-2.c
    xstormy16-elf       gcc.dg/Wrestrict-9.c
    xstormy16-elf       gcc.dg/graphite/scop-15.c
    xstormy16-elf       gcc.dg/guality/pr43051-1.c
    xstormy16-elf       gcc.dg/torture/pr68955.c
    xstormy16-elf       gcc.dg/torture/pr58955-2.c
    xstormy16-elf       gcc.dg/tree-ssa/builtin-sprintf-warn-23.c

The s390x-linux-gnu test is one in which we have:

  116: {r167:DI=r86:DI-0x1000;clobber %cc:CC;}
      REG_DEAD r86:DI
      REG_UNUSED %cc:CC
  118: %r2:DI=[r167:DI+r155:DI+0x5]
      REG_DEAD r167:DI
      REG_DEAD r155:DI
      REG_EQUAL [r167:DI+0x1005]

and so the 0x1000s cancel each other out.  And yeah, you could
definitely argue that it's a problem elsewhere. :-)  Expand has:

;; _32 = BGl_equalzf3zf3zz__r4_equivalence_6_2z00 (_31, 2B);

(insn 113 112 114 (set (reg:DI 164)
        (const_int -4096 [0xfffffffffffff000])) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))

(insn 114 113 115 (set (reg:DI 165)
        (reg:DI 164)) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))

(insn 115 114 116 (set (reg:DI 166)
        (const_int 4096 [0x1000])) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))

(insn 116 115 117 (parallel [
            (set (reg:DI 167)
                (plus:DI (reg:DI 86 [ BgL_cdrzd21994zd2_959.10_27 ])
                    (reg:DI 165)))
            (clobber (reg:CC 33 %cc))
        ]) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))

(insn 117 116 118 (set (reg:DI 3 %r3)
        (const_int 2 [0x2])) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))

(insn 118 117 119 (set (reg:DI 2 %r2)
        (mem/f/j:DI (plus:DI (plus:DI (reg:DI 167)
                    (reg:DI 166))
                (const_int 5 [0x5])) [2 _30->pair_t.cdr+0 S8 A64])) "gcc.c-torture/compile/20020304-1.c":161:9 -1
     (nil))
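
The cancellation mentioned above can be checked with ordinary 64-bit wraparound arithmetic.  This is only a sanity check of the address computation in insns 113 to 118, with the pseudo registers modelled as Python integers; the function names are made up.

```python
MASK64 = (1 << 64) - 1  # DImode values wrap at 64 bits


def address_before(r86):
    """Address computed by insns 116 and 118 as expanded."""
    r165 = -4096 & MASK64              # insns 113/114: constant -4096
    r166 = 4096                        # insn 115: constant 4096
    r167 = (r86 + r165) & MASK64       # insn 116: r167 = r86 + r165
    return (r167 + r166 + 5) & MASK64  # insn 118: [r167 + r166 + 5]


def address_after(r86):
    """Address once the two constants have cancelled."""
    return (r86 + 5) & MASK64


for base in (0, 0xFFF, 0x1000, 123456789, MASK64):
    assert address_before(base) == address_after(base)
```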

>> It might also be interesting to see how much difference it makes for
>> run-combine=4 (e.g. to see how much it makes up for the current 2-insn
>> limit)...
>
> Numbers are good :-)

FWIW, it does make more of a difference there, but not massively:

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu          5     -15      -5      -1      -3
aarch64_be-linux-gnu       4     -14      -5      -2      -4
arc-elf                    1      -4      -4      -4      -4
arm-linux-gnueabi          4     -22     -10      -2      -8
arm-linux-gnueabihf        4     -22     -10      -2      -8
avr-elf                    1      -1      -1      -1      -1
bfin-elf                  25    -592    -223       3      -5
bpf-elf                   47    -508     -95      -1      -3
c6x-elf                   26    -388     -74       1      -4
cr16-elf                  18    -142     -82      -1      -2
csky-elf                   5     -10      -4      -1      -2
epiphany-elf              30    -514    -155      -1      -4
fr30-elf                  28    -416    -140      -1      -3
frv-linux-gnu             45   -1274    -209      -1      -4
ft32-elf                   7     -17      -6      -1      -2
h8300-elf                  3      -7      -5      -1      -1
hppa64-hp-hpux11.23        1      -1      -1      -1      -1
i686-apple-darwin          1      -3      -3      -3      -3
ia64-linux-gnu             8     -86     -26      -5     -10
iq2000-elf                 1      -2      -2      -2      -2
m32r-elf                  78   -1692    -308      -2      -4
mcore-elf                 58   -1117    -174       3      -5
mipsel-linux-gnu           7     -26      -8      -2      -3
mipsisa64-linux-gnu       30    -136     -18      -2      -3
mmix                       5      -7      -2      -1      -1
mn10300-elf                1      -2      -2      -2      -2
moxie-rtems               11     -35      -5      -2      -3
msp430-elf                 1      -1      -1      -1      -1
nds32le-elf               15    -142     -88      -1      -2
nios2-linux-gnu           22    -259    -110      -1      -4
nvptx-none                 2      -8      -4      -4      -4
or1k-elf                  34    -592    -160      -1      -3
powerpc64le-linux-gnu      1      -8      -8      -8      -8
riscv32-elf                4     -11      -6      -1      -2
riscv64-elf                2      -7      -6      -1      -6
rl78-elf                   1      -7      -7      -7      -7
rx-elf                     1      -2      -2      -2      -2
s390-linux-gnu            35     708     -12     292      -1
s390x-linux-gnu           15     -53      -6      -2      -3
sh-linux-gnu              38    -741    -141       2      -6
sparc-linux-gnu           26    -478    -156      -1      -7
sparc64-linux-gnu         10     -86     -28      -2      -4
vax-netbsdelf              1      -4      -4      -4      -4
visium-elf                30    -467    -159      -1      -4
x86_64-darwin              7     -24     -10      -1      -2
x86_64-linux-gnu           7     -26     -12      -1      -2
xstormy16-elf             15     -70     -45       2      -2
xtensa-elf                26    -682    -226      -2      -4
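
For reading the table, here is a small sketch of how such a summary row could be produced from per-test deltas.  It assumes that Tests counts the tests whose output changed, Delta is the summed change, and Best and Worst are the most negative and most positive per-test changes.  The input values are made up so that the result matches the aarch64-linux-gnu row; the real per-test numbers are not given in this thread.

```python
from statistics import median


def summarise(deltas):
    """Summarise per-test deltas (negative means smaller output)."""
    changed = [d for d in deltas if d != 0]
    if not changed:
        return None  # no test was affected
    return {
        "Tests": len(changed),
        "Delta": sum(changed),
        "Best": min(changed),
        "Worst": max(changed),
        "Median": median(changed),
    }


# Hypothetical per-test deltas; unchanged tests (0) are ignored.
print(summarise([-5, -1, -3, -4, -2, 0]))
```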

Thanks,
Richard
Richard Sandiford Nov. 25, 2019, 9:40 p.m. UTC | #18
Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote:
>> Richard Sandiford <richard.sandiford@arm.com> writes:
>> > (It's 23:35 local time, so it's still just about stage 1. :-))
>> 
>> Or actually, just under 1 day after end of stage 1.  Oops.
>> Could have sworn stage 1 ended on the 17th :-(  Only realised
>> I'd got it wrong when catching up on Saturday's email traffic.
>> 
>> And inevitably, I introduced a couple of stupid mistakes while
>> trying to clean the patch up for submission by that (non-)deadline.
>> Here's a version that fixes an inverted overlapping memref check
>> and that correctly prunes the use list for combined instructions.
>> (This last one is just a compile-time saving -- the old code was
>> correct, just suboptimal.)
>
> I've built the Linux kernel with the previous version, as well as this
> one.  R0 is unmodified GCC, R1 is the first patch, R2 is this one:
>
> (I've forced --param=run-combine=6 for R1 and R2):
> (Percentages are relative to R0):
>
>                     R0        R1        R2     R1/R0     R2/R0
>        alpha   6107088   6101088   6101088   99.902%   99.902%
>          arc   4008224   4006568   4006568   99.959%   99.959%
>          arm   9206728   9200936   9201000   99.937%   99.938%
>        arm64  13056174  13018174  13018194   99.709%   99.709%
>        armhf         0         0         0         0         0
>          c6x   2337237   2337077   2337077   99.993%   99.993%
>         csky   3356602         0         0         0         0
>        h8300   1166996   1166776   1166776   99.981%   99.981%
>         i386  11352159         0         0         0         0
>         ia64  18230640  18167000  18167000   99.651%   99.651%
>         m68k   3714271         0         0         0         0
>   microblaze   4982749   4979945   4979945   99.944%   99.944%
>         mips   8499309   8495205   8495205   99.952%   99.952%
>       mips64   7042036   7039816   7039816   99.968%   99.968%
>        nds32   4486663         0         0         0         0
>        nios2   3680001   3679417   3679417   99.984%   99.984%
>     openrisc   4226076   4225868   4225868   99.995%   99.995%
>       parisc   7681895   7680063   7680063   99.976%   99.976%
>     parisc64   8677077   8676581   8676581   99.994%   99.994%
>      powerpc  10687611  10682199  10682199   99.949%   99.949%
>    powerpc64  17671082  17658570  17658570   99.929%   99.929%
>  powerpc64le  17671082  17658570  17658570   99.929%   99.929%
>      riscv32   1554938   1554758   1554758   99.988%   99.988%
>      riscv64   6634342   6632788   6632788   99.977%   99.977%
>         s390  13049643  13014939  13014939   99.734%   99.734%
>           sh   3254743         0         0         0         0
>      shnommu   1632364   1632124   1632124   99.985%   99.985%
>        sparc   4404993   4399593   4399593   99.877%   99.877%
>      sparc64   6796711   6797491   6797491  100.011%  100.011%
>       x86_64  19713174  19712817  19712817   99.998%   99.998%
>       xtensa         0         0         0         0         0

Thanks for running these.
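
For anyone reproducing the table, here is a minimal sketch of how the percentage columns appear to be derived: patched size divided by the R0 size.  The function name is made up for illustration, and treating a size of 0 as "build failed" is my reading of the table rather than something stated above.

```c
/* Size of a patched build as a percentage of the baseline (R0) build.
   A size of 0 is assumed to mark a failed build, in which case there
   is nothing to compare and 0 is returned.  */
double
relative_size_percent (unsigned long baseline, unsigned long patched)
{
  if (baseline == 0 || patched == 0)
    return 0.0;
  return 100.0 * (double) patched / (double) baseline;
}
```

For example, the alpha row (6107088 vs. 6101088) comes out just under 99.902%, and the sparc64 row (6796711 vs. 6797491) just over 100.011%.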

> There are five new failures, with either of the combine2 patches.  And
> all five are actually different (different symptoms, at least):
>
> - csky fails on libgcc build:
>
> /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
> /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for:
>  1405 | }
>       | ^
> (insn 199 86 87 8 (parallel [
>             (set (reg:SI 101)
>                 (plus:SI (reg:SI 98)
>                     (const_int -32 [0xffffffffffffffe0])))
>             (set (reg:CC 33 c)
>                 (lt:CC (plus:SI (reg:SI 98)
>                         (const_int -32 [0xffffffffffffffe0]))
>                     (const_int 0 [0])))
>         ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
>      (nil))
> during RTL pass: reload
>
> Target problem?

Yeah, looks like it.  The pattern is:

(define_insn "*cskyv2_declt"
  [(set (match_operand:SI 0 "register_operand" "=r")
	(plus:SI (match_operand:SI 1 "register_operand" "r")
		 (match_operand:SI 2 "const_int_operand" "Uh")))
   (set (reg:CC CSKY_CC_REGNUM)
	(lt:CC (plus:SI (match_dup 1) (match_dup 2))
	       (const_int 0)))]
  "CSKY_ISA_FEATURE (2E3)"
  "declt\t%0, %1, %M2"
)

So the predicate accepts all const_ints but the constraint doesn't.
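
To make the mismatch concrete, here is a rough stand-alone sketch in C.  The range used for "Uh" is purely hypothetical (the real definition lives in the C-SKY backend and isn't quoted in this thread), and the function names are invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical bounds standing in for the "Uh" constraint; the real
   range is defined by the C-SKY backend and is not shown in this thread.  */
#define ASSUMED_UH_MIN (-31)
#define ASSUMED_UH_MAX 0

/* Predicate stage: const_int_operand lets any integer constant match.  */
bool
predicate_accepts (long value)
{
  (void) value;
  return true;
}

/* Constraint stage: only constants inside the assumed range are usable.  */
bool
constraint_accepts (long value)
{
  return value >= ASSUMED_UH_MIN && value <= ASSUMED_UH_MAX;
}

/* True if VALUE is recognised by the pattern but cannot satisfy its
   only alternative, which is the kind of mismatch described above.  */
bool
recognised_but_unsatisfiable (long value)
{
  return predicate_accepts (value) && !constraint_accepts (value);
}
```

Under that assumed range, -32 (the constant in the failing insn) would be recognised but unsatisfiable, whereas -1 would be fine.  Tightening the predicate so that it accepts only what the constraint accepts would remove the gap.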

> - i386 goes into an infinite loop compiling, or at least an hour or so...
> Erm I forgot to record what it was compiling.  I did attach a GDB...  It
> is something from lra_create_live_ranges.

Hmm.

> - m68k:
>
> /home/segher/src/kernel/fs/exec.c: In function 'copy_strings':
> /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048
>   590 | }
>       | ^
> 0x10408307 final_scan_insn_1
>         /home/segher/src/gcc/gcc/final.c:3048
> 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
>         /home/segher/src/gcc/gcc/final.c:3152
> 0x10408797 final_1
>         /home/segher/src/gcc/gcc/final.c:2020
> 0x104091f7 rest_of_handle_final
>         /home/segher/src/gcc/gcc/final.c:4658
> 0x104091f7 execute
>         /home/segher/src/gcc/gcc/final.c:4736
>
> and that line is
>             gcc_assert (prev_nonnote_insn (insn) == last_ignored_compare);

Ah, this'll be while m68k was still a cc0 target.  Yeah, I should probably
just skip the whole pass for cc0.

> - nds32:
>
> /tmp/ccC8Czca.s: Assembler messages:
> /tmp/ccC8Czca.s:3144: Error: Unrecognized operand/register, lmw.bi [$fp+(-60)],[$fp],$r11,0x0.
>
> /tmp/ccl8o20c.s: Assembler messages:
> /tmp/ccl8o20c.s:2449: Error: Unrecognized operand/register, lmw.bi $r9,[$fp],[$fp+(-132)],0x0.
>
> /tmp/ccZxjwHd.s: Assembler messages:
> /tmp/ccZxjwHd.s:4776: Error: Unrecognized operand/register, lmw.bi [$fp+(-52)],[$fp],[$fp+(-56)],0x0.
>
> /tmp/cczjOS3d.s: Assembler messages:
> /tmp/cczjOS3d.s:2336: Error: Unrecognized operand/register, lmw.bi $r16,[$fp],$r7,0x0.
>
> and more.  All lmw.bi...  target issue?

Yeah, looks like it wasn't expecting this pattern to be generated
automatically before RA, so it doesn't have constraints (and probably
couldn't, since the registers need to be consecutive).

> - sh (that's sh4-linux):
>
> /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
>  1638 | }
>       | ^
> /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> (insn 18 17 19 2 (set (reg:SI 0 r0)
>         (mem:SI (plus:SI (reg:SI 4 r4 [178])
>                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
>      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
>         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
>             (nil))))
> /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out

Would have to look more at this one.  Seems odd that it can't allocate
R0 when it's already the destination and when R0 can't be live before
the insn.  But there again, this is reload, so my enthusiasm for looking
is a bit limited :-)

> Looking at just binary size, which is a good stand-in for how many insns
> it combined:
>
>                     R2
>        arm64   99.709%
>         ia64   99.651%
>         s390   99.734%
>        sparc   99.877%
>      sparc64  100.011%
>
> (These are those that are not between 99.9% and 100.0%).
>
> So only sparc64 regressed, and just a tiny bit (I can look at what that
> is, if there is interest).  But 32-bit sparc improved, and s390, arm64,
> and ia64 got actual benefit.
>
> Again this is just code size, not analysing the actually changed code.

OK.  Certainly not an earth-shattering improvement then, but not
entirely worthless either.

> I did look at the powerpc64le changes.  It is almost completely load-
> with-update (and store-with-update) insns that make the difference, but
> there are also some dot insns.  The extra mr. are usually not a good
> idea, but the extsw. are.  Sometimes this causes *more* insns in the end
> (register move insns), but that is the exception.
>
> This mr. problem is there with combine already, btw.  In the end it is
> caused by this just not being something good to do on pseudos, it would
> be better to do this after RA, in a peephole or similar.  OTOH it isn't
> actually really important for performance either way.
>
> Btw, does the new pass use TARGET_LEGITIMATE_COMBINED_INSN?  It probably
> should.  (That would be the hook where we would probably want to prevent
> generating mr. insns).

No, it doesn't use that yet, but I agree it should.  Will fix.

I see combine also tests cannot_copy_insn_p.  I'm not sure whether that's
appropriate for the new pass or not.  Arguably it's not copying the
instruction, it's just moving it to be in parallel with something else.
(But then that's largely true of the combine case too.)

Thanks,
Richard
Segher Boessenkool Nov. 25, 2019, 10:13 p.m. UTC | #19
Hi!

On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > I am wondering the other way around :-)  Is what you do for combine2
> > something that would be more generally applicable/useful?  That's what
> > I'm trying to find out :-)
> >
> > What combine does could use some improvement, if you want to hear a
> > more direct motivation.  LOG_LINKS just skip references we cannot
> > handle (and some more), so we always have to do modified_between etc.,
> > which hurts.
> 
> The trade-offs behind the choice of representation are very specific
> to the pass.

Yes, but hopefully not so specific that every pass needs a completely
different representation ;-)

> >> >> Target                 Tests   Delta    Best   Worst  Median
> >> >> avr-elf                 1341 -111401  -13824     680     -10
> >> >
> >> > Things like this are kind of suspicious :-)
> >> 
> >> Yeah.  This mostly seems to come from mopping up the extra moves created
> >> by make_more_copies.  So we have combinations like:
> >> 
> >>    58: r70:SF=r94:SF
> >>       REG_DEAD r94:SF
> >>    60: r22:SF=r70:SF
> >>       REG_DEAD r70:SF
> >
> > Why didn't combine do this?  A target problem?
> 
> Seems to be because combine rejects hard-reg destinations whose classes
> are likely spilled (cant_combine_insn_p).

Ah, okay.  And that is required to prevent ICEs, in combine2 as well
then -- ICEs in RA.

There should be a better way to do this.

> This SF argument register
> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
> the combination based on POINTER_X_REGS being likely spilled.

static bool
avr_class_likely_spilled_p (reg_class_t c)
{
  return (c != ALL_REGS &&
           (AVR_TINY ? 1 : c != ADDW_REGS));
}

So this target severely shackles combine.  Does it have to?  If so, why
not with combine2?

> >> So there's only one case in which it isn't a win, but the number of
> >> tests is tiny.  So I agree there's no justification for trying this in
> >> combine proper as things stand (and I wasn't arguing otherwise FWIW).
> >> I'd still like to keep it in the new pass because it does help
> >> *sometimes* and there's no sign yet that it has a noticeable
> >> compile-time cost.
> >
> > So when does it help?  I can only think of cases where there are
> > problems elsewhere.
> 
> The full list of affected tests (all at -O2 -ftree-vectorize) are:

I'll have to look at this closer later, sorry.


Segher
Segher Boessenkool Nov. 25, 2019, 10:47 p.m. UTC | #20
On Mon, Nov 25, 2019 at 09:40:36PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > - i386 goes into an infinite loop compiling, or at least an hour or so...
> > Erm I forgot to record what it was compiling.  I did attach a GDB...  It
> > is something from lra_create_live_ranges.
> 
> Hmm.

This one is actually worrying me -- it's not obviously a simple problem,
or a target problem, or a pre-existing problem.

> Ah, this'll be while m68k was still a cc0 target.  Yeah, I should probably
> just skip the whole pass for cc0.

Yes, a tree from last Friday or Saturday or so.

And yup if you don't handle cc0 yet, yes you want to skip it completely :-)

> > - sh (that's sh4-linux):
> >
> > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> >  1638 | }
> >       | ^
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > (insn 18 17 19 2 (set (reg:SI 0 r0)
> >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> >             (nil))))
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> 
> Would have to look more at this one.  Seems odd that it can't allocate
> R0 when it's already the destination and when R0 can't be live before
> the insn.  But there again, this is reload, so my enthusiasm for looking
> is a bit limited :-)

It wants to use r0 in some other insn, so it needs to spill it here, but
cannot.  This is what class_likely_spilled is for.

> > Looking at just binary size, which is a good stand-in for how many insns
> > it combined:
> >
> >                     R2
> >        arm64   99.709%
> >         ia64   99.651%
> >         s390   99.734%
> >        sparc   99.877%
> >      sparc64  100.011%
> >
> > (These are those that are not between 99.9% and 100.0%).
> >
> > So only sparc64 regressed, and just a tiny bit (I can look at what that
> > is, if there is interest).  But 32-bit sparc improved, and s390, arm64,
> > and ia64 got actual benefit.
> >
> > Again this is just code size, not analysing the actually changed code.
> 
> OK.  Certainly not an earth-shattering improvement then, but not
> entirely worthless either.

I usually take 0.2% as "definitely useful" for combine improvements, so
there are a few targets that have that.  There can be improvements that
are important for a target even if they do not improve code size much,
of course, and it can identify weaknesses in the backend code, so you
always need to look at what really changes.

> I see combine also tests cannot_copy_insn_p.  I'm not sure whether that's
> appropriate for the new pass or not.  Arguably it's not copying the
> instruction, it's just moving it to be in parallel with something else.
> (But then that's largely true of the combine case too.)

combine tests this only for the cases where it *does* have to copy an
insn: when the dest of i0, i1, or i2 doesn't die, it is added as another
arm to the (parallel) result.


Segher
Richard Sandiford Nov. 25, 2019, 11:08 p.m. UTC | #21
Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > I am wondering the other way around :-)  Is what you do for combine2
>> > something that would be more generally applicable/useful?  That's what
>> > I'm trying to find out :-)
>> >
>> > What combine does could use some improvement, if you want to hear a
>> > more direct motivation.  LOG_LINKS just skip references we cannot
>> > handle (and some more), so we always have to do modified_between etc.,
>> > which hurts.
>> 
>> The trade-offs behind the choice of representation are very specific
>> to the pass.
>
> Yes, but hopefully not so specific that every pass needs a completely
> different representation ;-)

Well, it depends.  Most passes make do with df (without DU/UD-chains).
But since DU/UD-chains are naturally quadratic in the general case,
and are expensive to keep up to date, each DU/UD pass is going to have to
make some compromises.  It doesn't seem too bad that passes make
different compromises based on what they're trying to do.  (combine:
single use per definition; fwprop.c: track all uses, but for dominating
definitions only; sched: fudged via a param; regrename: single
definition/multiple use chains optimised for renaming; combine2: full
live range information, but limited use list; etc.)

So yeah, if passes want to make roughly the same compromises, it would
obviously be good if they shared a representation.  But since each pass
does something different, I don't think it's a bad sign that they make
different compromises and use different representations.

So I don't think a new pass with a new representation is in itself a
sign of failure.

>> >> >> Target                 Tests   Delta    Best   Worst  Median
>> >> >> avr-elf                 1341 -111401  -13824     680     -10
>> >> >
>> >> > Things like this are kind of suspicious :-)
>> >> 
>> >> Yeah.  This mostly seems to come from mopping up the extra moves created
>> >> by make_more_copies.  So we have combinations like:
>> >> 
>> >>    58: r70:SF=r94:SF
>> >>       REG_DEAD r94:SF
>> >>    60: r22:SF=r70:SF
>> >>       REG_DEAD r70:SF
>> >
>> > Why didn't combine do this?  A target problem?
>> 
>> Seems to be because combine rejects hard-reg destinations whose classes
>> are likely spilled (cant_combine_insn_p).
>
> Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> then -- ICEs in RA.

Not in this case though.  The final instruction is a hardreg<-pseudo move
whatever happens.  There's nothing special about r70 compared to r94.

> There should be a better way to do this.

ISTM we should be checking for whichever cases actually cause the
RA failures.  E.g. to take an extreme example, if all the following
are true:

- an insn has a single alternative
- an insn has a single non-earlyclobber output
- an insn has no parallel clobbers
- an insn has no auto-inc/decs
- an insn has a hard register destination that satisfies its constraints
- the hard register is defined in its original location

then there should be no problem.  The insn shouldn't need any output
reloads that would conflict with the hard register.  It also doesn't
extend the live range of the output.

Obviously that's a lot of conditions :-)  And IMO they should be built
up the other way around: reject specific cases that are known to cause
problems, based on information about the matched insn.  But I think the
avr example shows that there's a real problem with using REGNO_REG_CLASS
for this too.  REGNO_REG_CLASS gives the smallest enclosing class, which
might not be the most relevant one in context.  (It isn't here, since
we're just passing arguments to functions.)
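
The extreme-case conditions listed above could be written as a single conservative test, roughly as follows.  This is only a sketch: struct insn_summary and its fields are hypothetical stand-ins for information that would have to be gathered from the matched insn, not an existing GCC structure.

```c
#include <stdbool.h>

/* Hypothetical summary of a matched insn.  GCC has no such structure;
   the fields simply mirror the conditions in the list above.  */
struct insn_summary
{
  int num_alternatives;
  int num_outputs;
  bool output_is_earlyclobber;
  bool has_parallel_clobbers;
  bool has_auto_inc_dec;
  bool dest_is_hard_reg;
  bool dest_satisfies_constraints;
  bool dest_defined_in_original_location;
};

/* Return true only in the "obviously safe" case: no output reload
   should be needed and the hard register's live range is not extended.  */
bool
hard_reg_dest_obviously_safe_p (const struct insn_summary *insn)
{
  return (insn->num_alternatives == 1
          && insn->num_outputs == 1
          && !insn->output_is_earlyclobber
          && !insn->has_parallel_clobbers
          && !insn->has_auto_inc_dec
          && insn->dest_is_hard_reg
          && insn->dest_satisfies_constraints
          && insn->dest_defined_in_original_location);
}
```

As said, a real check would more likely be phrased the other way around, rejecting the specific cases known to cause trouble.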

>> This SF argument register
>> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
>> the combination based on POINTER_X_REGS being likely spilled.
>
> static bool
> avr_class_likely_spilled_p (reg_class_t c)
> {
>   return (c != ALL_REGS &&
>            (AVR_TINY ? 1 : c != ADDW_REGS));
> }
>
> So this target severely shackles combine.  Does it have to?  If so, why
> not with combine2?

As far as the above example goes, I think returning true for
POINTER_X_REGS is the right thing to do.  It only has two 8-bit
registers, and they act as a pair when used as a pointer.

Thanks,
Richard
Segher Boessenkool Nov. 26, 2019, 1:42 a.m. UTC | #22
On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote:
> >> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> > I am wondering the other way around :-)  Is what you do for combine2
> >> > something that would be more generally applicable/useful?  That's what
> >> > I'm trying to find out :-)
> >> >
> >> > What combine does could use some improvement, if you want to hear a
> >> > more direct motivation.  LOG_LINKS just skip references we cannot
> >> > handle (and some more), so we always have to do modified_between etc.,
> >> > which hurts.
> >> 
> >> The trade-offs behind the choice of representation are very specific
> >> to the pass.
> >
> > Yes, but hopefully not so specific that every pass needs a completely
> > different representation ;-)
> 
> Well, it depends.  Most passes make do with df (without DU/UD-chains).
> But since DU/UD-chains are naturally quadratic in the general case,
> and are expensive to keep up to date, each DU/UD pass is going to have to
> make some compromises.  It doesn't seem too bad that passes make
> different compromises based on what they're trying to do.  (combine:
> single use per definition; fwprop.c: track all uses, but for dominating
> definitions only; sched: fudged via a param; regrename: single
> definition/multiple use chains optimised for renaming; combine2: full
> live range information, but limited use list; etc.)

combine actually *calculates* DU chains almost completely, it just throws
away most of that information (it wants to have LOG_LINKS, as it did ages
ago).  The only thing stopping us from doing that right now is that not
all uses are counted (some are skipped).

Since combine works only within BBs, DU chains are linear to compute, and
UD chains are trivial (and just linear to compute).

Updating is quadratic in general, sure.  Luckily in most realistic cases
it is cheap (most, sigh) (insns aren't combined to very far away).

> So yeah, if passes want to make roughly the same compromises, it would
> obviously be good if they shared a representation.  But since each pass
> does something different, I don't think it's a bad sign that they make
> different compromises and use different representations.
> 
> So I don't think a new pass with a new representation is in itself a
> sign of failure.

Oh, I don't think so either.  I just wonder if it would be useful more
generically :-)

> >> >> >> Target                 Tests   Delta    Best   Worst  Median
> >> >> >> avr-elf                 1341 -111401  -13824     680     -10
> >> >> >
> >> >> > Things like this are kind of suspicious :-)
> >> >> 
> >> >> Yeah.  This mostly seems to come from mopping up the extra moves created
> >> >> by make_more_copies.  So we have combinations like:
> >> >> 
> >> >>    58: r70:SF=r94:SF
> >> >>       REG_DEAD r94:SF
> >> >>    60: r22:SF=r70:SF
> >> >>       REG_DEAD r70:SF
> >> >
> >> > Why didn't combine do this?  A target problem?
> >> 
> >> Seems to be because combine rejects hard-reg destinations whose classes
> >> are likely spilled (cant_combine_insn_p).
> >
> > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > then -- ICEs in RA.
> 
> Not in this case though.  The final instruction is a hardreg<-pseudo move
> whatever happens.  There's nothing special about r70 compared to r94.

So the target hook could be improved?  Or, this doesn't matter anyway,
the extra register move does not prevent any combinations, and RA should
get rid of it when that is beneficial.

But you see smaller code in the end, hrm.


Segher
Richard Biener Nov. 27, 2019, 8:29 a.m. UTC | #23
On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote:
> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote:
> > >> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > >> > I am wondering the other way around :-)  Is what you do for combine2
> > >> > something that would be more generally applicable/useful?  That's what
> > >> > I'm trying to find out :-)
> > >> >
> > >> > What combine does could use some improvement, if you want to hear a
> > >> > more direct motivation.  LOG_LINKS just skip references we cannot
> > >> > handle (and some more), so we always have to do modified_between etc.,
> > >> > which hurts.
> > >>
> > >> The trade-offs behind the choice of representation are very specific
> > >> to the pass.
> > >
> > > Yes, but hopefully not so specific that every pass needs a completely
> > > different representation ;-)
> >
> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
> > But since DU/UD-chains are naturally quadratic in the general case,
> > > and are expensive to keep up to date, each DU/UD pass is going to have to
> > make some compromises.  It doesn't seem too bad that passes make
> > different compromises based on what they're trying to do.  (combine:
> > single use per definition; fwprop.c: track all uses, but for dominating
> > definitions only; sched: fudged via a param; regrename: single
> > > definition/multiple use chains optimised for renaming; combine2: full
> > live range information, but limited use list; etc.)
>
> combine actually *calculates* DU chains almost completely, it just throws
> away most of that information (it wants to have LOG_LINKS, as it did ages
> ago).  The only thing stopping us from doing that right now is that not
> all uses are counted (some are skipped).
>
> Since combine works only within BBs, DU chains are linear to compute, and
> UD chains are trivial (and just linear to compute).

quadraticness appears for RTL DU/UD chains because of partial definitions,
that doesn't change for BBs so even there computing them is quadratic
(because recording them is).  The situation is simply having N partial
defs all reaching M uses which gives you a chain of size N * M.
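
A toy illustration of that count, with made-up function names: every partial def is linked to every use, whereas collapsing the partial defs into one "fake" def (my loose reading of the alternative described further down) leaves one link per use.

```c
#include <stddef.h>

/* Count def-use links when each of NUM_PARTIAL_DEFS partial definitions
   of a register reaches each of NUM_USES uses.  The nested loop mirrors
   recording one link per (def, use) pair.  */
size_t
count_du_links (size_t num_partial_defs, size_t num_uses)
{
  size_t links = 0;
  for (size_t def = 0; def < num_partial_defs; def++)
    for (size_t use = 0; use < num_uses; use++)
      links++;
  return links;
}

/* With the partial defs collapsed into a single "fake" def, each use
   needs only one link, so the total is linear in the number of uses.  */
size_t
count_pruned_du_links (size_t num_uses)
{
  return num_uses;
}
```

E.g. four partial writes to a register followed by eight uses gives 32 links in the full representation and 8 in the pruned one.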

Now - for combine you don't want partial defs, so for simplicity we could
choose to _not_ record DU/UD chains whenever we see a partial def for
a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
handle DU/UD chains for regions where there is no partial definition
and add a "fake" D denoting (there are [multiple] defs beyond that
might be partial).  Depending on the use-case that should suffice and
make the problem linear.

I think you want to ask sth like "is REG changed [partially] between
its use in insn A and the def in insn B" and you want to answer that by using
REGs UD chain for that.  If you only ever reached the def in insn B via the
"pruned" chain then this would work, likewise for the chain we do not compute
any UD chain for REG.

> Updating is quadratic in general, sure.  Luckily in most realistic cases
> it is cheap (most, sigh) (insns aren't combined to very far away).

Updating is linear as well if you can disregard partial defs.  Updating cannot
be quadratic if compute is linear ;)

> > So yeah, if passes want to make roughly the same compromises, it would
> > obviously be good if they shared a representation.  But since each pass
> > does something different, I don't think it's a bad sign that they make
> > different compromises and use different representations.
> >
> > So I don't think a new pass with a new representation is in itself a
> > sign of failure.
>
> Oh, I don't think so either.  I just wonder if it would be useful more
> generically :-)
>
> > >> >> >> Target                 Tests   Delta    Best   Worst  Median
> > >> >> >> avr-elf                 1341 -111401  -13824     680     -10
> > >> >> >
> > >> >> > Things like this are kind of suspicious :-)
> > >> >>
> > >> >> Yeah.  This mostly seems to come from mopping up the extra moves created
> > >> >> by make_more_copies.  So we have combinations like:
> > >> >>
> > >> >>    58: r70:SF=r94:SF
> > >> >>       REG_DEAD r94:SF
> > >> >>    60: r22:SF=r70:SF
> > >> >>       REG_DEAD r70:SF
> > >> >
> > >> > Why didn't combine do this?  A target problem?
> > >>
> > >> Seems to be because combine rejects hard-reg destinations whose classes
> > >> are likely spilled (cant_combine_insn_p).
> > >
> > > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > > then -- ICEs in RA.
> >
> > Not in this case though.  The final instruction is a hardreg<-pseudo move
> > whatever happens.  There's nothing special about r70 compared to r94.
>
> So the target hook could be improved?  Or, this doesn't matter anyway,
> the extra register move does not prevent any combinations, and RA should
> get rid of it when that is beneficial.
>
> But you see smaller code in the end, hrm.
>
>
> Segher
Richard Sandiford Nov. 27, 2019, 10:08 a.m. UTC | #24
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>>
>> On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote:
>> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote:
>> > >> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > >> > I am wondering the other way around :-)  Is what you do for combine2
>> > >> > something that would be more generally applicable/useful?  That's what
>> > >> > I'm trying to find out :-)
>> > >> >
>> > >> > What combine does could use some improvement, if you want to hear a
>> > >> > more direct motivation.  LOG_LINKS just skip references we cannot
>> > >> > handle (and some more), so we always have to do modified_between etc.,
>> > >> > which hurts.
>> > >>
>> > >> The trade-offs behind the choice of representation are very specific
>> > >> to the pass.
>> > >
>> > > Yes, but hopefully not so specific that every pass needs a completely
>> > > different representation ;-)
>> >
>> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
>> > But since DU/UD-chains are naturally quadratic in the general case,
>> > and are expensive to keep up to date, each DU/UD pass is going to have to
>> > make some compromises.  It doesn't seem too bad that passes make
>> > different compromises based on what they're trying to do.  (combine:
>> > single use per definition; fwprop.c: track all uses, but for dominating
>> > definitions only; sched: fudged via a param; regrename: single
>> > definition/multiple use chains optimised for renaming; combine2: full
>> > live range information, but limited use list; etc.)
>>
>> combine actually *calculates* DU chains almost completely, it just throws
>> away most of that information (it wants to have LOG_LINKS, as it did ages
>> ago).  The only thing stopping us from doing that right now is that not
>> all uses are counted (some are skipped).
>>
>> Since combine works only within BBs, DU chains are linear to compute, and
>> UD chains are trivial (and just linear to compute).
>
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses which gives you a chain of size N * M.
>
> Now - for combine you don't want partial defs, so for simplicity we could
> choose to _not_ record DU/UD chains whenever we see a partial def for
> a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
> handle DU/UD chains for regions where there is no partial definition
> and add a "fake" D denoting (there are [multiple] defs beyond that
> might be partial).  Depending on the use-case that should suffice and
> make the problem linear.
>
> I think you want to ask sth like "is REG changed [partially] between
> its use in insn A and the def in insn B" and you want to answer that by using
> REGs UD chain for that.  If you only ever reached the def in insn B via the
> "pruned" chain then this would work, likewise for the chain we do not compute
> any UD chain for REG.

(Passing over this as I think it's about what current combine wants.)

>> Updating is quadratic in general, sure.  Luckily in most realistic cases
>> it is cheap (most, sigh) (insns aren't combined to very far away).
>
> Updating is linear as well if you can disregard partial defs.
> Updating cannot be quadratic if compute is linear ;)

This was based on the assumption that we'd do an update after each
combination, so that the pass still sees correct info.  That then makes
the updates across one run of the pass quadratic, since the number of
successful combinations is O(ninsns).

As far as the new pass goes: the pass would be quadratic if we tried
to combine each use in a single-def DU chain with its definition.  It would
also be quadratic if we tried to parallelise each pair of uses in a DU chain.
So if we did have full DU chains in the new pass, we'd also need some
limit N on the number of uses we try to combine with.

And if we're only going to try combining with N uses, then it seemed
better to track only N uses "by name", rather than pay the cost of
tracking all uses by name but ignoring the information for some of them.
All we care about for other uses is whether they would prevent a move.
We can track that using a simple point-based live range, where points
are LUIDs with gaps in between for new insns.

So the new pass uses a list of N specific uses and a single live range.
Querying whether a particular definition is live at a particular point
is then a constant-time operation.  So is updating the info after a
successful combination (potentially including a move).

That still seems like a reasonable way of representing this, given what
the pass wants to do.  Moving to full DU chains would IMO just make the
pass more expensive with no obvious benefit.
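
As a rough illustration of the shape of that representation (not the code in the patch: the names, the gap of 4 and the limit of 4 tracked uses are all invented for this sketch):

```c
#include <stdbool.h>

/* Hypothetical spacing between the points of adjacent original insns,
   leaving room for insns that are moved or created later.  */
#define POINT_GAP 4u

/* Hypothetical limit N on the number of uses tracked "by name".  */
#define MAX_TRACKED_USES 4u

struct live_range
{
  unsigned int def_point;       /* Point of the definition.  */
  unsigned int last_use_point;  /* Point of the last use, tracked or not.  */
  unsigned int tracked_uses[MAX_TRACKED_USES];
  unsigned int num_tracked_uses;
};

/* Map an insn LUID to a point, leaving POINT_GAP - 1 free points
   after each original insn.  */
unsigned int
luid_to_point (unsigned int luid)
{
  return luid * POINT_GAP;
}

void
live_range_init (struct live_range *range, unsigned int def_luid)
{
  range->def_point = luid_to_point (def_luid);
  range->last_use_point = range->def_point;
  range->num_tracked_uses = 0;
}

/* Every use extends the live range, but only the first
   MAX_TRACKED_USES of them are remembered individually.  */
void
live_range_add_use (struct live_range *range, unsigned int use_luid)
{
  unsigned int point = luid_to_point (use_luid);
  if (point > range->last_use_point)
    range->last_use_point = point;
  if (range->num_tracked_uses < MAX_TRACKED_USES)
    range->tracked_uses[range->num_tracked_uses++] = point;
}

/* Constant-time query: is the value live at POINT, i.e. after its
   definition and no later than its last use?  */
bool
live_at_point_p (const struct live_range *range, unsigned int point)
{
  return point > range->def_point && point <= range->last_use_point;
}
```

The point is only that the liveness query is two comparisons regardless of how many uses exist, while the untracked uses still constrain where an insn can be moved through last_use_point.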

Thanks,
Richard
Segher Boessenkool Nov. 27, 2019, 7:31 p.m. UTC | #25
On Wed, Nov 27, 2019 at 09:29:27AM +0100, Richard Biener wrote:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > combine actually *calculates* DU chains almost completely, it just throws
> > away most of that information (it wants to have LOG_LINKS, as it did ages
> > ago).  The only thing stopping us from doing that right now is that not
> > all uses are counted (some are skipped).
> >
> > Since combine works only within BBs, DU chains are linear to compute, and
> > UD chains are trivial (and just linear to compute).
> 
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses which gives you a chain of size N * M.

And both N and M are constants here (bounded by a constant).  The only
dimensions we care about are those the user can grow unlimited: number
of registers, number of instructions, number of functions, that kind of
thing.

The control flow graph in a basic block is a DAG, making most of this
linear to compute.  Only updating it after every separate change is not
easily linear in total.

> Updating is linear as well if you can disregard partial defs.  Updating cannot
> be quadratic if compute is linear ;)

Sure it can.  Updating has to be O(1) (amortized) per change for the whole
pass to be O(n).  If it is O(n) per change you are likely O(n^2) in total.
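
As a rough illustration of that arithmetic, here is a sketch under a
made-up cost model in which touching one instruction costs one unit.  The
function names are hypothetical and nothing here is measured from GCC.

```cpp
#include <cstdint>

// Hypothetical cost model: one unit of work per instruction touched.

// Total work if each change needs only a constant-time update.
std::uint64_t total_cost_incremental(std::uint64_t n_changes) {
  std::uint64_t total = 0;
  for (std::uint64_t i = 0; i < n_changes; ++i)
    total += 1;  // O(1) per change
  return total;
}

// Total work if each change needs a pass over all N_INSNS instructions.
std::uint64_t total_cost_rescan(std::uint64_t n_insns,
                                std::uint64_t n_changes) {
  std::uint64_t total = 0;
  for (std::uint64_t i = 0; i < n_changes; ++i)
    total += n_insns;  // O(n) per change
  return total;
}
```

With the number of changes proportional to the number of instructions n,
the first total grows as n and the second as n * n.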

I don't see how to make combine itself O(1) per change, but yeah I can
see how that can work (or almost work) for something simpler (and less
weighed down by history :-) ).


Segher
Oleg Endo Dec. 3, 2019, 1:33 p.m. UTC | #26
On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> 
> > > - sh (that's sh4-linux):
> > > 
> > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > >  1638 | }
> > >       | ^
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > >             (nil))))
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> > 
> > Would have to look more at this one.  Seems odd that it can't allocate
> > R0 when it's already the destination and when R0 can't be live before
> > the insn.  But there again, this is reload, so my enthusiasm for looking
> > is a bit limited :-)
> 
> It wants to use r0 in some other insn, so it needs to spill it here, but
> cannot.  This is what class_likely_spilled is for.
> 

Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
explicitly, but it's got a R0_REGS class with only one said reg in it. 
So the default impl of class_likely_spilled should do its thing.

LRA is available on SH and often fixes the R0 problems -- but not
always.  Maybe it got better over time, haven't checked.

Could you re-run the SH build tests with -mlra, please?

Cheers,
Oleg
Segher Boessenkool Dec. 3, 2019, 6:05 p.m. UTC | #27
On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > 
> > > > - sh (that's sh4-linux):
> > > > 
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > > >  1638 | }
> > > >       | ^
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > >             (nil))))
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> > > 
> > > Would have to look more at this one.  Seems odd that it can't allocate
> > > R0 when it's already the destination and when R0 can't be live before
> > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > > is a bit limited :-)
> > 
> > It wants to use r0 in some other insn, so it needs to spill it here, but
> > cannot.  This is what class_likely_spilled is for.
> 
> Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> explicitly, but it's got a R0_REGS class with only one said reg in it. 
> So the default impl of class_likely_spilled should do its thing.

Yes, good point.  So what happened here?  Is it just RA messing things
up, unrelated to the new pass?


Segher
Oleg Endo Dec. 4, 2019, 10:43 a.m. UTC | #28
On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> > On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > > 
> > > > > - sh (that's sh4-linux):
> > > > > 
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > > > >  1638 | }
> > > > >       | ^
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > > >             (nil))))
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> > > > 
> > > > Would have to look more at this one.  Seems odd that it can't allocate
> > > > R0 when it's already the destination and when R0 can't be live before
> > > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > > > is a bit limited :-)
> > > 
> > > It wants to use r0 in some other insn, so it needs to spill it here, but
> > > cannot.  This is what class_likely_spilled is for.
> > 
> > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > explicitly, but it's got a R0_REGS class with only one said reg in it. 
> > So the default impl of class_likely_spilled should do its thing.
> 
> Yes, good point.  So what happened here?

"Something, somewhere, went terribly wrong"...

insn 18 wants to do

    mov.l @(r4,r6),r0

But it can't because the reg+reg address mode has an R0 constraint
itself.  So it needs to be changed to

    mov   r4,r0
    mov.l @(r0,r6),r0

And it can't handle that.  Or only sometimes?  Don't remember.


>   Is it just RA messing things
> up, unrelated to the new pass?
> 

Yep, I think so.  The additional pass seems to create "tougher" code so
reload passes out earlier than usual.  We've had the same issue when
trying address mode selection optimization.  In fact that was one huge
showstopper.

Cheers,
Oleg
Richard Sandiford Dec. 5, 2019, 10:16 a.m. UTC | #29
Here's a revised version based on the feedback so far.  Changes in v2:
- Don't move instructions that set or use allocatable hard registers.
- Check legitimate_combined_insn
- Check cannot_copy_insn_p when keeping the original insn in parallel
- Disable the pass if HAVE_cc0

I compared v1 and v2 in the same way as before and the new restrictions
didn't make much difference (as hoped).  Also bootstrapped & regression-
tested on aarch64-linux-gnu and x86_64-linux-gnu with run-combine
defaulting to 6 (unlike in the patch, where the new pass is disabled
by default).
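
For reference, the run-combine value is a bitmask.  The sketch below decodes
it using the bit assignments given in the parameter description in this
patch (bit 0 for the earlier variant, bit 1 for the main pass, bit 2 for
the later variant).  The constant names, the function and the returned
strings are hypothetical and only for illustration.

```cpp
#include <string>
#include <vector>

// Bit assignments as described for --param=run-combine in this patch.
// The names below are hypothetical; they do not appear in the patch.
constexpr unsigned kEarlierVariant = 1u << 0;
constexpr unsigned kMainCombine = 1u << 1;
constexpr unsigned kLaterVariant = 1u << 2;

// Return the passes that a given run-combine value enables, in pass order.
std::vector<std::string> enabled_passes(unsigned run_combine) {
  std::vector<std::string> passes;
  if (run_combine & kEarlierVariant) passes.push_back("earlier variant");
  if (run_combine & kMainCombine) passes.push_back("main combine");
  if (run_combine & kLaterVariant) passes.push_back("later variant");
  return passes;
}
```

Under that reading, 6 enables the main pass followed by the later variant,
and the patch's default of 2 enables the main pass only.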

Thanks,
Richard


2019-12-05  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* Makefile.in (OBJS): Add combine2.o
	* params.opt (--param=run-combine): New option.
	* doc/invoke.texi: Document it.
	* tree-pass.h (make_pass_combine2_before): Declare.
	(make_pass_combine2_after): Likewise.
	* passes.def: Add them.
	* timevar.def (TV_COMBINE2): New timevar.
	* cfgrtl.h (update_cfg_for_uncondjump): Declare.
	* combine.c (update_cfg_for_uncondjump): Move to...
	* cfgrtl.c (update_cfg_for_uncondjump): ...here.
	* simplify-rtx.c (simplify_truncation): Handle comparisons.
	* recog.h (validate_simplify_replace_rtx): Declare.
	* recog.c (validate_simplify_replace_rtx_1): New function.
	(validate_simplify_replace_rtx_uses): Likewise.
	(validate_simplify_replace_rtx): Likewise.
	* combine2.c: New file.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2019-12-03 18:06:09.885650522 +0000
+++ gcc/Makefile.in	2019-12-05 10:11:50.637631870 +0000
@@ -1261,6 +1261,7 @@ OBJS = \
 	cgraphunit.o \
 	cgraphclones.o \
 	combine.o \
+	combine2.o \
 	combine-stack-adj.o \
 	compare-elim.o \
 	context.o \
Index: gcc/params.opt
===================================================================
--- gcc/params.opt	2019-12-02 17:38:20.072423250 +0000
+++ gcc/params.opt	2019-12-05 10:11:50.653631761 +0000
@@ -760,6 +760,10 @@ Use internal function id in profile look
 Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
 Maximum depth of a loop nest to fully value-number optimistically.
 
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
 -param=sccvn-max-alias-queries-per-access=
 Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
 Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-12-02 17:38:18.364434903 +0000
+++ gcc/doc/invoke.texi	2019-12-05 10:11:50.653631761 +0000
@@ -11797,6 +11797,11 @@ in combiner for a pseudo register as las
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2019-11-19 16:25:28.000000000 +0000
+++ gcc/tree-pass.h	2019-12-05 10:11:50.657631731 +0000
@@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
 extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2019-11-19 16:25:28.000000000 +0000
+++ gcc/passes.def	2019-12-05 10:11:50.653631761 +0000
@@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
+      NEXT_PASS (pass_combine2_before);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_combine2_after);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2019-11-19 16:25:28.000000000 +0000
+++ gcc/timevar.def	2019-12-05 10:11:50.657631731 +0000
@@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC          , "
 DEFTIMEVAR (TV_CSE2                  , "CSE 2")
 DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
 DEFTIMEVAR (TV_COMBINE               , "combiner")
+DEFTIMEVAR (TV_COMBINE2              , "second combiner")
 DEFTIMEVAR (TV_IFCVT		     , "if-conversion")
 DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
Index: gcc/cfgrtl.h
===================================================================
--- gcc/cfgrtl.h	2019-11-19 16:25:28.000000000 +0000
+++ gcc/cfgrtl.h	2019-12-05 10:11:50.641631840 +0000
@@ -47,6 +47,7 @@ extern void fixup_partitions (void);
 extern bool purge_dead_edges (basic_block);
 extern bool purge_all_dead_edges (void);
 extern bool fixup_abnormal_edges (void);
+extern void update_cfg_for_uncondjump (rtx_insn *);
 extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
 extern void relink_block_chain (bool);
 extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
Index: gcc/combine.c
===================================================================
--- gcc/combine.c	2019-11-29 13:04:14.458669072 +0000
+++ gcc/combine.c	2019-12-05 10:11:50.645631815 +0000
@@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg)
 	 && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
 }
 
-/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
-   Note that the INSN should be deleted *after* removing dead edges, so
-   that the kept edge is the fallthrough edge for a (set (pc) (pc))
-   but not for a (set (pc) (label_ref FOO)).  */
-
-static void
-update_cfg_for_uncondjump (rtx_insn *insn)
-{
-  basic_block bb = BLOCK_FOR_INSN (insn);
-  gcc_assert (BB_END (bb) == insn);
-
-  purge_dead_edges (bb);
-
-  delete_insn (insn);
-  if (EDGE_COUNT (bb->succs) == 1)
-    {
-      rtx_insn *insn;
-
-      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
-
-      /* Remove barriers from the footer if there are any.  */
-      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
-	if (BARRIER_P (insn))
-	  {
-	    if (PREV_INSN (insn))
-	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
-	    else
-	      BB_FOOTER (bb) = NEXT_INSN (insn);
-	    if (NEXT_INSN (insn))
-	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
-	  }
-	else if (LABEL_P (insn))
-	  break;
-    }
-}
-
 /* Return whether PAT is a PARALLEL of exactly N register SETs followed
    by an arbitrary number of CLOBBERs.  */
 static bool
@@ -15098,7 +15062,10 @@ const pass_data pass_data_combine =
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return (optimize > 0); }
+  virtual bool gate (function *)
+    {
+      return optimize > 0 && (param_run_combine & 2) != 0;
+    }
   virtual unsigned int execute (function *)
     {
       return rest_of_handle_combine ();
Index: gcc/cfgrtl.c
===================================================================
--- gcc/cfgrtl.c	2019-11-19 16:25:28.000000000 +0000
+++ gcc/cfgrtl.c	2019-12-05 10:11:50.641631840 +0000
@@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void)
   return inserted;
 }
 
+/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
+   Note that the INSN should be deleted *after* removing dead edges, so
+   that the kept edge is the fallthrough edge for a (set (pc) (pc))
+   but not for a (set (pc) (label_ref FOO)).  */
+
+void
+update_cfg_for_uncondjump (rtx_insn *insn)
+{
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  gcc_assert (BB_END (bb) == insn);
+
+  purge_dead_edges (bb);
+
+  delete_insn (insn);
+  if (EDGE_COUNT (bb->succs) == 1)
+    {
+      rtx_insn *insn;
+
+      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
+
+      /* Remove barriers from the footer if there are any.  */
+      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
+	if (BARRIER_P (insn))
+	  {
+	    if (PREV_INSN (insn))
+	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
+	    else
+	      BB_FOOTER (bb) = NEXT_INSN (insn);
+	    if (NEXT_INSN (insn))
+	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
+	  }
+	else if (LABEL_P (insn))
+	  break;
+    }
+}
+
 /* Cut the insns from FIRST to LAST out of the insns stream.  */
 
 rtx_insn *
Index: gcc/simplify-rtx.c
===================================================================
--- gcc/simplify-rtx.c	2019-11-19 16:31:13.504240251 +0000
+++ gcc/simplify-rtx.c	2019-12-05 10:11:50.657631731 +0000
@@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode,
       && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
     return constm1_rtx;
 
+  /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
+     in a narrower mode if useful.  */
+  if (COMPARISON_P (op))
+    return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
+				    XEXP (op, 0), XEXP (op, 1));
+
   return NULL_RTX;
 }
 
Index: gcc/recog.h
===================================================================
--- gcc/recog.h	2019-11-26 22:04:57.419370912 +0000
+++ gcc/recog.h	2019-12-05 10:11:50.657631731 +0000
@@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos
 extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
 extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
 extern bool validate_simplify_insn (rtx_insn *insn);
+extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
 extern int num_changes_pending (void);
 extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
 
Index: gcc/recog.c
===================================================================
--- gcc/recog.c	2019-11-29 13:04:13.978672241 +0000
+++ gcc/recog.c	2019-12-05 10:11:50.657631731 +0000
@@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn)
       }
   return ((num_changes_pending () > 0) && (apply_change_group () > 0));
 }
+
+/* A subroutine of validate_simplify_replace_rtx.  Apply the replacement
+   described by R to LOC.  Return true on success; leave the caller
+   to clean up on failure.  */
+
+static bool
+validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
+{
+  rtx x = *loc;
+  enum rtx_code code = GET_CODE (x);
+  machine_mode mode = GET_MODE (x);
+
+  if (rtx_equal_p (x, r.from))
+    {
+      validate_unshare_change (r.insn, loc, r.to, 1);
+      return true;
+    }
+
+  /* Recursively apply the substitution and see if we can simplify
+     the result.  This specifically shouldn't use simplify_gen_*,
+     since we want to avoid generating new expressions where possible.  */
+  int old_num_changes = num_validated_changes ();
+  rtx newx = NULL_RTX;
+  bool recurse_p = false;
+  switch (GET_RTX_CLASS (code))
+    {
+    case RTX_UNARY:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
+	  return false;
+
+	newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
+	break;
+      }
+
+    case RTX_BIN_ARITH:
+    case RTX_COMM_ARITH:
+      {
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_binary_operation (code, mode,
+					  XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_COMPARE:
+    case RTX_COMM_COMPARE:
+      {
+	machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
+				? GET_MODE (XEXP (x, 0))
+				: GET_MODE (XEXP (x, 1)));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_relational_operation (code, mode, op_mode,
+					      XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_TERNARY:
+    case RTX_BITFIELD_OPS:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
+	  return false;
+
+	newx = simplify_ternary_operation (code, mode, op0_mode,
+					   XEXP (x, 0), XEXP (x, 1),
+					   XEXP (x, 2));
+	break;
+      }
+
+    case RTX_EXTRA:
+      if (code == SUBREG)
+	{
+	  machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
+	  if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
+	    return false;
+
+	  rtx inner = SUBREG_REG (x);
+	  newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
+	  /* Reject the same cases that simplify_gen_subreg would.  */
+	  if (!newx
+	      && (GET_CODE (inner) == SUBREG
+		  || GET_CODE (inner) == CONCAT
+		  || GET_MODE (inner) == VOIDmode
+		  || !validate_subreg (mode, inner_mode,
+				       inner, SUBREG_BYTE (x))))
+	    return false;
+	  break;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_OBJ:
+      if (code == LO_SUM)
+	{
+	  if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	      || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	    return false;
+
+	  /* (lo_sum (high x) y) -> y where x and y have the same base.  */
+	  rtx op0 = XEXP (x, 0);
+	  rtx op1 = XEXP (x, 1);
+	  if (GET_CODE (op0) == HIGH)
+	    {
+	      rtx base0, base1, offset0, offset1;
+	      split_const (XEXP (op0, 0), &base0, &offset0);
+	      split_const (op1, &base1, &offset1);
+	      if (rtx_equal_p (base0, base1))
+		newx = op1;
+	    }
+	}
+      else if (code == REG)
+	{
+	  if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
+	    return false;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_CONST_OBJ:
+      break;
+
+    case RTX_AUTOINC:
+      if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
+	return false;
+      recurse_p = true;
+      break;
+
+    case RTX_MATCH:
+    case RTX_INSN:
+      gcc_unreachable ();
+    }
+
+  if (recurse_p)
+    {
+      const char *fmt = GET_RTX_FORMAT (code);
+      for (int i = 0; fmt[i]; i++)
+	switch (fmt[i])
+	  {
+	  case 'E':
+	    for (int j = 0; j < XVECLEN (x, i); j++)
+	      if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
+		return false;
+	    break;
+
+	  case 'e':
+	    if (XEXP (x, i)
+		&& !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
+	      return false;
+	    break;
+	  }
+    }
+
+  if (newx && !rtx_equal_p (x, newx))
+    {
+      /* There's no longer any point unsharing the substitutions made
+	 for subexpressions, since we'll just copy this one instead.  */
+      for (int i = old_num_changes; i < num_changes; ++i)
+	changes[i].unshare = false;
+      validate_unshare_change (r.insn, loc, newx, 1);
+    }
+
+  return true;
+}
+
+/* A note_uses callback for validate_simplify_replace_rtx.
+   DATA points to a validate_replace_src_data object.  */
+
+static void
+validate_simplify_replace_rtx_uses (rtx *loc, void *data)
+{
+  validate_replace_src_data &r = *(validate_replace_src_data *) data;
+  if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
+    r.insn = NULL;
+}
+
+/* Try to perform the equivalent of:
+
+      newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
+      validate_change (INSN, LOC, newx, 1);
+
+   but without generating as much garbage rtl when the resulting
+   pattern doesn't match.
+
+   Return true if we were able to replace all uses of OLD_RTX in *LOC
+   and if the result conforms to general rtx rules (e.g. for whether
+   subregs are meaningful).
+
+   When returning true, add all replacements to the current validation group,
+   leaving the caller to test it in the normal way.  Leave both *LOC and the
+   validation group unchanged on failure.  */
+
+bool
+validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
+			       rtx old_rtx, rtx new_rtx)
+{
+  validate_replace_src_data r;
+  r.from = old_rtx;
+  r.to = new_rtx;
+  r.insn = insn;
+
+  unsigned int num_changes = num_validated_changes ();
+  note_uses (loc, validate_simplify_replace_rtx_uses, &r);
+  if (!r.insn)
+    {
+      cancel_changes (num_changes);
+      return false;
+    }
+  return true;
+}
 
 /* Return 1 if OP is a valid general operand for machine mode MODE.
    This is either a register reference, a memory reference,
Index: gcc/combine2.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/combine2.c	2019-12-05 10:11:50.645631815 +0000
@@ -0,0 +1,1658 @@
+/* Combine instructions
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+#include "predict.h"
+#include "cfgcleanup.h"
+#include "cfghooks.h"
+#include "cfgrtl.h"
+#include "alias.h"
+#include "valtrack.h"
+#include "target.h"
+
+/* This pass tries to combine instructions in the following ways:
+
+   (1) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       and I2 is the only user of DEST1, the pass tries to combine them into:
+
+	 I2: (...SRC1...)
+
+   (2) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       the pass tries to combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
+
+       or:
+
+	 I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
+
+   (3) If we have two independent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (set DEST2 SRC2)
+
+       that read from memory or from the same register, the pass tries to
+       combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
+
+       or:
+
+	 I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
+
+   If the combined form is a valid instruction, the pass tries to find a
+   place between I1 and I2 inclusive for the new instruction.  If there
+   are multiple valid locations, it tries to pick the best one by taking
+   the effect on register pressure into account.
+
+   If a combination succeeds and produces a single set, the pass tries to
+   combine the new form with earlier or later instructions.
+
+   The pass currently optimizes each basic block separately.  It walks
+   the instructions in reverse order, building up live ranges for registers
+   and memory.  It then uses these live ranges to look for possible
+   combination opportunities and to decide where the combined instructions
+   could be placed.
+
+   The pass represents positions in the block using point numbers,
+   with higher numbers indicating earlier instructions.  The numbering
+   scheme is that:
+
+   - the end of the current instruction sequence has an even base point B.
+
+   - instructions initially have odd-numbered points B + 1, B + 3, etc.
+     with B + 1 being the final instruction in the sequence.
+
+   - even points after B represent gaps between instructions where combined
+     instructions could be placed.
+
+   Thus even points initially represent no instructions and odd points
+   initially represent single instructions.  However, when picking a
+   place for a combined instruction, the pass may choose somewhere
+   in between the original two instructions, so that over time a point
+   may come to represent several instructions.  When this happens,
+   the pass maintains the invariant that all instructions with the same
+   point number are independent of each other and thus can be treated as
+   acting in parallel (or as acting in any arbitrary sequence).
+
+   TODOs:
+
+   - Handle 3-instruction combinations, and possibly more.
+
+   - Handle existing clobbers more efficiently.  At the moment we can't
+     move an instruction that clobbers R across another instruction that
+     clobbers R.
+
+   - Allow hard register clobbers to be added, like combine does.
+
+   - Perhaps work on EBBs, or SESE regions.  */
+
+namespace {
+
+/* The number of explicit uses to record in a live range.  */
+const unsigned int NUM_RANGE_USERS = 4;
+
+/* The maximum number of instructions that we can combine at once.  */
+const unsigned int MAX_COMBINE_INSNS = 2;
+
+/* A fake cost for instructions that we haven't costed yet.  */
+const unsigned int UNKNOWN_COST = ~0U;
+
+class combine2
+{
+public:
+  combine2 (function *);
+  ~combine2 ();
+
+  void execute ();
+
+private:
+  struct insn_info_rec;
+
+  /* Describes the live range of a register or of memory.  For simplicity,
+     we treat memory as a single entity.
+
+     If we had a fully-accurate live range, updating it to account for a
+     moved instruction would be a linear-time operation.  Doing this for
+     each combination would then make the pass quadratic.  We therefore
+     just maintain a list of NUM_RANGE_USERS use insns and use simple,
+     conservatively-correct behavior for the rest.  */
+  struct live_range_rec
+  {
+    /* Which instruction provides the dominating definition, or null if
+       we don't know yet.  */
+    insn_info_rec *producer;
+
+    /* A selection of instructions that use the resource, in program order.  */
+    insn_info_rec *users[NUM_RANGE_USERS];
+
+    /* An inclusive range of points that covers instructions not mentioned
+       in USERS.  Both values are zero if there are no such instructions.
+
+       Once we've included a use U at point P in this range, we continue
+       to assume that some kind of use exists at P whatever happens to U
+       afterwards.  */
+    unsigned int first_extra_use;
+    unsigned int last_extra_use;
+
+    /* The register number this range describes, or INVALID_REGNUM
+       for memory.  */
+    unsigned int regno;
+
+    /* Forms a linked list of ranges for the same resource, in program
+       order.  */
+    live_range_rec *prev_range;
+    live_range_rec *next_range;
+  };
+
+  /* Pass-specific information about an instruction.  */
+  struct insn_info_rec
+  {
+    /* The instruction itself.  */
+    rtx_insn *insn;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction defines.  */
+    live_range_rec **defs;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction uses.  */
+    live_range_rec **uses;
+
+    /* The point at which the instruction appears.  */
+    unsigned int point;
+
+    /* The cost of the instruction, or UNKNOWN_COST if we haven't
+       measured it yet.  */
+    unsigned int cost;
+  };
+
+  /* Describes one attempt to combine instructions.  */
+  struct combination_attempt_rec
+  {
+    /* The instruction that we're currently trying to optimize.
+       If the combination succeeds, we'll use this insn_info_rec
+       to describe the new instruction.  */
+    insn_info_rec *new_home;
+
+    /* The instructions we're combining, in program order.  */
+    insn_info_rec *sequence[MAX_COMBINE_INSNS];
+
+    /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
+       live range that describes the substituted register.  */
+    live_range_rec *def_use_range;
+
+    /* The earliest and latest points at which we could insert the
+       combined instruction.  */
+    unsigned int earliest_point;
+    unsigned int latest_point;
+
+    /* The cost of the new instruction, once we have a successful match.  */
+    unsigned int new_cost;
+  };
+
+  /* Pass-specific information about a register.  */
+  struct reg_info_rec
+  {
+    /* The live range associated with the last reference to the register.  */
+    live_range_rec *range;
+
+    /* The point at which the last reference occurred.  */
+    unsigned int next_ref;
+
+    /* True if the register is currently live.  We record this here rather
+       than in a separate bitmap because (a) there's a natural hole for
+       it on LP64 hosts and (b) we only refer to it when updating the
+       other fields, and so recording it here should give better locality.  */
+    unsigned int live_p : 1;
+  };
+
+  live_range_rec *new_live_range (unsigned int, live_range_rec *);
+  live_range_rec *reg_live_range (unsigned int);
+  live_range_rec *mem_live_range ();
+  bool add_range_use (live_range_rec *, insn_info_rec *);
+  void remove_range_use (live_range_rec *, insn_info_rec *);
+  bool has_single_use_p (live_range_rec *);
+  bool known_last_use_p (live_range_rec *, insn_info_rec *);
+  unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
+  unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
+  bool start_combination (combination_attempt_rec &, insn_info_rec *,
+			  insn_info_rec *, live_range_rec * = NULL);
+  bool verify_combination (combination_attempt_rec &);
+  int estimate_reg_pressure_delta (insn_info_rec *);
+  void commit_combination (combination_attempt_rec &, bool);
+  bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
+  bool try_parallelize_insns (combination_attempt_rec &);
+  bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
+  bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
+  bool try_combine_two_uses (combination_attempt_rec &);
+  bool try_combine (insn_info_rec *, rtx, unsigned int);
+  bool optimize_insn (insn_info_rec *);
+  void record_defs (insn_info_rec *);
+  void record_reg_use (insn_info_rec *, df_ref);
+  void record_uses (insn_info_rec *);
+  void process_insn (insn_info_rec *);
+  void start_sequence ();
+
+  /* The function we're optimizing.  */
+  function *m_fn;
+
+  /* The highest pseudo register number plus one.  */
+  unsigned int m_num_regs;
+
+  /* The current basic block.  */
+  basic_block m_bb;
+
+  /* True if we should optimize the current basic block for speed.  */
+  bool m_optimize_for_speed_p;
+
+  /* The point number to allocate to the next instruction we visit
+     in the backward traversal.  */
+  unsigned int m_point;
+
+  /* The point number corresponding to the end of the current
+     instruction sequence, i.e. the lowest point number about which
+     we still have valid information.  */
+  unsigned int m_end_of_sequence;
+
+  /* The point number corresponding to the end of the current basic block.
+     This is the same as M_END_OF_SEQUENCE when processing the last
+     instruction sequence in a basic block.  */
+  unsigned int m_end_of_bb;
+
+  /* The memory live range, or null if we haven't yet found a memory
+     reference in the current instruction sequence.  */
+  live_range_rec *m_mem_range;
+
+  /* Gives information about each register.  We track both hard and
+     pseudo registers.  */
+  auto_vec<reg_info_rec> m_reg_info;
+
+  /* A bitmap of registers whose entry in m_reg_info is valid.  */
+  auto_sbitmap m_valid_regs;
+
+  /* If nonnull, an unused 2-element PARALLEL that we can use to test
+     instruction combinations.  */
+  rtx m_spare_parallel;
+
+  /* A bitmap of instructions that we've already tried to combine with.  */
+  auto_bitmap m_tried_insns;
+
+  /* A temporary bitmap used to hold register numbers.  */
+  auto_bitmap m_true_deps;
+
+  /* An obstack used for allocating insn_info_recs and for building
+     up their lists of definitions and uses.  */
+  obstack m_insn_obstack;
+
+  /* An obstack used for allocating live_range_recs.  */
+  obstack m_range_obstack;
+
+  /* Start-of-object pointers for the two obstacks.  */
+  char *m_insn_obstack_start;
+  char *m_range_obstack_start;
+
+  /* A list of instructions that we've optimized and whose new forms
+     change the cfg.  */
+  auto_vec<rtx_insn *> m_cfg_altering_insns;
+
+  /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS.  */
+  auto_bitmap m_cfg_altering_insn_ids;
+
+  /* We can insert new instructions at point P * 2 by inserting them
+     after M_POINTS[P - M_END_OF_SEQUENCE / 2].  We can insert new
+     instructions at point P * 2 + 1 by inserting them before
+     M_POINTS[P - M_END_OF_SEQUENCE / 2].  */
+  auto_vec<rtx_insn *, 256> m_points;
+};
+
+combine2::combine2 (function *fn)
+  : m_fn (fn),
+    m_num_regs (max_reg_num ()),
+    m_bb (NULL),
+    m_optimize_for_speed_p (false),
+    m_point (2),
+    m_end_of_sequence (m_point),
+    m_end_of_bb (m_point),
+    m_mem_range (NULL),
+    m_reg_info (m_num_regs),
+    m_valid_regs (m_num_regs),
+    m_spare_parallel (NULL_RTX)
+{
+  gcc_obstack_init (&m_insn_obstack);
+  gcc_obstack_init (&m_range_obstack);
+  m_reg_info.quick_grow (m_num_regs);
+  bitmap_clear (m_valid_regs);
+  m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
+  m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
+}
+
+combine2::~combine2 ()
+{
+  obstack_free (&m_insn_obstack, NULL);
+  obstack_free (&m_range_obstack, NULL);
+}
+
+/* Return true if extending the live range of REGNO might introduce a
+   spill failure during register allocation.  We deliberately don't check
+   targetm.class_likely_spilled_p since:
+
+   (a) in the right circumstances, any allocatable hard register could
+       trigger a spill failure;
+
+   (b) using REGNO_REG_CLASS to get the class would on many targets lead
+       to an artificial distinction between general registers that happen
+       to be in a small class for a rarely-used constraint and those
+       whose class is GENERAL_REGS itself; and
+
+   (c) there should be few cases in which moving references to allocatable
+       hard registers is important before RA.  */
+
+static bool
+move_could_cause_spill_failure_p (unsigned int regno)
+{
+  return (regno != INVALID_REGNUM
+	  && HARD_REGISTER_NUM_P (regno)
+	  && !fixed_regs[regno]);
+}
+
+/* Return true if it's possible in principle to combine INSN with
+   other instructions.  ALLOW_ASMS_P is true if the caller can cope
+   with asm statements.  */
+
+static bool
+combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
+{
+  rtx pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
+    return false;
+
+  if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
+    return false;
+
+  if (!allow_asms_p && asm_noperands (pattern) >= 0)
+    return false;
+
+  return true;
+}
+
+/* Return true if it's possible in principle to move INSN somewhere else,
+   as long as all dependencies are satisfied.  */
+
+static bool
+movable_insn_p (rtx_insn *insn)
+{
+  if (JUMP_P (insn))
+    return false;
+
+  if (volatile_refs_p (PATTERN (insn)))
+    return false;
+
+  return true;
+}
+
+/* A note_stores callback.  Set the bool at *DATA to true if DEST is in
+   memory.  */
+
+static void
+find_mem_def (rtx dest, const_rtx, void *data)
+{
+  /* note_stores has stripped things like subregs and zero_extracts,
+     so we don't need to worry about them here.  */
+  if (MEM_P (dest))
+    *(bool *) data = true;
+}
+
+/* Return true if instruction INSN writes to memory.  */
+
+static bool
+insn_writes_mem_p (rtx_insn *insn)
+{
+  bool saw_mem_p = false;
+  note_stores (insn, find_mem_def, &saw_mem_p);
+  return saw_mem_p;
+}
+
+/* A note_uses callback.  Set the bool at DATA to true if *LOC reads
+   from variable memory.  */
+
+static void
+find_mem_use (rtx *loc, void *data)
+{
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, *loc, NONCONST)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	*(bool *) data = true;
+	break;
+      }
+}
+
+/* Return true if instruction INSN reads memory, including via notes.  */
+
+static bool
+insn_reads_mem_p (rtx_insn *insn)
+{
+  bool saw_mem_p = false;
+  note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p);
+  for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p);
+  return saw_mem_p;
+}
+
+/* Create and return a new live range for REGNO.  NEXT is the next range
+   in program order, or null if this is the first live range in the
+   sequence.  */
+
+combine2::live_range_rec *
+combine2::new_live_range (unsigned int regno, live_range_rec *next)
+{
+  live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
+  memset (range, 0, sizeof (*range));
+
+  range->regno = regno;
+  range->next_range = next;
+  if (next)
+    next->prev_range = range;
+  return range;
+}
+
+/* Return the current live range for register REGNO, creating a new
+   one if necessary.  */
+
+combine2::live_range_rec *
+combine2::reg_live_range (unsigned int regno)
+{
+  /* Initialize the liveness flag, if it isn't already valid for this BB.  */
+  bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
+  if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
+    m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
+
+  /* See if we already have a live range associated with the current
+     instruction sequence.  */
+  live_range_rec *range = NULL;
+  if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
+    range = m_reg_info[regno].range;
+
+  /* Create a new range if this is the first reference to REGNO in the
+     current instruction sequence or if the current range has been closed
+     off by a definition.  */
+  if (!range || range->producer)
+    {
+      range = new_live_range (regno, range);
+
+      /* If the register is live after the current sequence, treat that
+	 as a fake use at the end of the sequence.  */
+      if (!range->next_range && m_reg_info[regno].live_p)
+	range->first_extra_use = range->last_extra_use = m_end_of_sequence;
+
+      /* Record that this is now the current range for REGNO.  */
+      if (first_ref_p)
+	bitmap_set_bit (m_valid_regs, regno);
+      m_reg_info[regno].range = range;
+      m_reg_info[regno].next_ref = m_point;
+    }
+  return range;
+}
+
+/* Return the current live range for memory, treating memory as a single
+   entity.  Create a new live range if necessary.  */
+
+combine2::live_range_rec *
+combine2::mem_live_range ()
+{
+  if (!m_mem_range || m_mem_range->producer)
+    m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
+  return m_mem_range;
+}
+
+/* Record that instruction USER uses the resource described by RANGE.
+   Return true if this is new information.  */
+
+bool
+combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  /* See if we've already recorded the instruction, or if there's a
+     spare use slot we can use.  */
+  unsigned int i = 0;
+  for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return false;
+
+  if (i == NUM_RANGE_USERS)
+    {
+      /* Since we've processed USER recently, assume that it's more
+	 interesting to record explicitly than the last user in the
+	 current list.  Evict that last user and describe it in the
+	 overflow "extra use" range instead.  */
+      insn_info_rec *ousted_user = range->users[--i];
+      if (range->first_extra_use < ousted_user->point)
+	range->first_extra_use = ousted_user->point;
+      if (range->last_extra_use > ousted_user->point)
+	range->last_extra_use = ousted_user->point;
+    }
+
+  /* Insert USER while keeping the list sorted.  */
+  for (; i > 0 && range->users[i - 1]->point < user->point; --i)
+    range->users[i] = range->users[i - 1];
+  range->users[i] = user;
+  return true;
+}
+
+/* Remove USER from the uses recorded for RANGE, if we can.
+   There's nothing we can do if USER was described in the
+   overflow "extra use" range.  */
+
+void
+combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+    if (range->users[i] == user)
+      {
+	for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
+	  range->users[j] = range->users[j + 1];
+	range->users[NUM_RANGE_USERS - 1] = NULL;
+	break;
+      }
+}
+
+/* Return true if RANGE has a single known user.  */
+
+bool
+combine2::has_single_use_p (live_range_rec *range)
+{
+  return range->users[0] && !range->users[1] && !range->first_extra_use;
+}
+
+/* Return true if we know that USER is the last user of RANGE.  */
+
+bool
+combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
+{
+  if (range->last_extra_use <= user->point)
+    return false;
+
+  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
+    else if (range->users[i]->point == user->point)
+      return false;
+
+  gcc_unreachable ();
+}
+
+/* Find the earliest point to which we could move I2 up in order to combine
+   it with I1.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
+{
+  if (!movable_insn_p (i2->insn))
+    return i2->point;
+
+  /* Don't allow sets to be moved earlier if doing so could introduce
+     a spill failure.  */
+  if (prev_real_insn (i2->insn) != i1->insn)
+    for (live_range_rec **def = i2->defs; *def; ++def)
+      if (move_could_cause_spill_failure_p ((*def)->regno))
+	return i2->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way up to I1.  */
+  unsigned int point = i1->point;
+
+  /* Make sure that the new position preserves all necessary true dependencies
+     on earlier instructions.  */
+  for (live_range_rec **use = i2->uses; *use; ++use)
+    {
+      live_range_rec *range = *use;
+      if (range->producer
+	  && range->producer != i1
+	  && point >= range->producer->point)
+	point = range->producer->point - 1;
+    }
+
+  /* Make sure that the new position preserves all necessary output and
+     anti dependencies on earlier instructions.  */
+  for (live_range_rec **def = i2->defs; *def; ++def)
+    if (live_range_rec *range = (*def)->prev_range)
+      {
+	if (range->producer
+	    && range->producer != i1
+	    && point >= range->producer->point)
+	  point = range->producer->point - 1;
+
+	for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;)
+	  if (range->users[i] && range->users[i] != i1)
+	    {
+	      if (point >= range->users[i]->point)
+		point = range->users[i]->point - 1;
+	      break;
+	    }
+
+	if (range->last_extra_use && point >= range->last_extra_use)
+	  point = range->last_extra_use - 1;
+      }
+
+  return point;
+}
+
+/* Find the latest point to which we could move I1 down in order to combine
+   it with I2.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
+{
+  if (!movable_insn_p (i1->insn))
+    return i1->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way down to I2.  */
+  unsigned int point = i2->point;
+
+  /* Make sure that the new position preserves all necessary anti dependencies
+     on later instructions.  */
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    if (live_range_rec *range = (*use)->next_range)
+      if (range->producer != i2 && point <= range->producer->point)
+	point = range->producer->point + 1;
+
+  /* Make sure that the new position preserves all necessary output and
+     true dependencies on later instructions.  */
+  for (live_range_rec **def = i1->defs; *def; ++def)
+    {
+      live_range_rec *range = *def;
+
+      for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	if (range->users[i] != i2)
+	  {
+	    if (range->users[i] && point <= range->users[i]->point)
+	      point = range->users[i]->point + 1;
+	    break;
+	  }
+
+      if (range->first_extra_use && point <= range->first_extra_use)
+	point = range->first_extra_use + 1;
+
+      live_range_rec *next_range = range->next_range;
+      if (next_range
+	  && next_range->producer != i2
+	  && point <= next_range->producer->point)
+	point = next_range->producer->point + 1;
+    }
+
+  /* Don't allow the live range of a register to be extended if doing
+     so could introduce a spill failure.  */
+  if (prev_real_insn (i2->insn) != i1->insn)
+    for (live_range_rec **use = i1->uses; *use; ++use)
+      {
+	live_range_rec *range = *use;
+	if (move_could_cause_spill_failure_p (range->regno))
+	  {
+	    for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;)
+	      if (range->users[i])
+		{
+		  if (point < range->users[i]->point)
+		    point = range->users[i]->point;
+		  break;
+		}
+
+	    if (range->last_extra_use && point < range->last_extra_use)
+	      point = range->last_extra_use;
+	  }
+      }
+
+  return point;
+}
+
+/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
+   where I1 is the instruction that we're currently trying to optimize.
+   If DEF_USE_RANGE is nonnull, I1 defines the value described by
+   DEF_USE_RANGE and I2 uses it.  */
+
+bool
+combine2::start_combination (combination_attempt_rec &attempt,
+			     insn_info_rec *i1, insn_info_rec *i2,
+			     live_range_rec *def_use_range)
+{
+  attempt.new_home = i1;
+  attempt.sequence[0] = i1;
+  attempt.sequence[1] = i2;
+  if (attempt.sequence[0]->point < attempt.sequence[1]->point)
+    std::swap (attempt.sequence[0], attempt.sequence[1]);
+  attempt.def_use_range = def_use_range;
+
+  /* Check that the instructions have no true dependencies other than
+     DEF_USE_RANGE.  */
+  bitmap_clear (m_true_deps);
+  for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
+    if (*def != def_use_range)
+      bitmap_set_bit (m_true_deps, (*def)->regno);
+  for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
+    if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
+      return false;
+
+  /* Calculate the range of points at which the combined instruction
+     could live.  */
+  attempt.earliest_point = find_earliest_point (attempt.sequence[1],
+						attempt.sequence[0]);
+  attempt.latest_point = find_latest_point (attempt.sequence[0],
+					    attempt.sequence[1]);
+  if (attempt.earliest_point < attempt.latest_point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "cannot combine %d and %d: no suitable"
+		 " location for combined insn\n",
+		 INSN_UID (attempt.sequence[0]->insn),
+		 INSN_UID (attempt.sequence[1]->insn));
+      return false;
+    }
+
+  /* Make sure we have valid costs for the original instructions before
+     we start changing their patterns.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i]->cost == UNKNOWN_COST)
+      attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
+					     m_optimize_for_speed_p);
+  return true;
+}
+
+/* Check whether the combination attempt described by ATTEMPT matches
+   an .md instruction (or matches its constraints, in the case of an
+   asm statement).  If so, calculate the cost of the new instruction
+   and check whether it's cheap enough.  */
+
+bool
+combine2::verify_combination (combination_attempt_rec &attempt)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  bool ok_p = verify_changes (0);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      if (!ok_p)
+	fprintf (dump_file, "failed to match this instruction:\n");
+      else if (const char *name = get_insn_name (INSN_CODE (insn)))
+	fprintf (dump_file, "successfully matched this instruction to %s:\n",
+		 name);
+      else
+	fprintf (dump_file, "successfully matched this instruction:\n");
+      print_rtl_single (dump_file, PATTERN (insn));
+    }
+  if (!ok_p)
+    return false;
+
+  if (INSN_CODE (insn) >= 0 && !targetm.legitimate_combined_insn (insn))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "instruction rejected by target\n");
+      return false;
+    }
+
+  unsigned int cost1 = attempt.sequence[0]->cost;
+  unsigned int cost2 = attempt.sequence[1]->cost;
+  attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
+  ok_p = (attempt.new_cost <= cost1 + cost2);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "original cost = %u + %u, replacement cost = %u; %s\n",
+	     cost1, cost2, attempt.new_cost,
+	     ok_p ? "keeping replacement" : "rejecting replacement");
+  if (!ok_p)
+    return false;
+
+  confirm_change_group ();
+  return true;
+}
+
+/* Return true if we should consider register REGNO when calculating
+   register pressure estimates.  */
+
+static bool
+count_reg_pressure_p (unsigned int regno)
+{
+  if (regno == INVALID_REGNUM)
+    return false;
+
+  /* Unallocatable registers aren't interesting.  */
+  if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
+    return false;
+
+  return true;
+}
+
+/* Try to estimate the effect that the original form of INSN_INFO
+   had on register pressure, in the form "born - dying".  */
+
+int
+combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
+{
+  int delta = 0;
+
+  for (live_range_rec **def = insn_info->defs; *def; ++def)
+    if (count_reg_pressure_p ((*def)->regno))
+      delta += 1;
+
+  for (live_range_rec **use = insn_info->uses; *use; ++use)
+    if (count_reg_pressure_p ((*use)->regno)
+	&& known_last_use_p (*use, insn_info))
+      delta -= 1;
+
+  return delta;
+}
+
+/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
+   FROM_INSN.  Copy any useful information to TO_INSN before doing that.  */
+
+static void
+transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
+{
+  INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
+  INSN_CODE (to_insn) = INSN_CODE (from_insn);
+  REG_NOTES (to_insn) = REG_NOTES (from_insn);
+}
+
+/* The combination attempt in ATTEMPT has succeeded and is currently
+   part of an open validate_change group.  Commit to making the change
+   and decide where the new instruction should go.
+
+   KEPT_DEF_P is true if the new instruction continues to perform
+   the definition described by ATTEMPT.def_use_range.  */
+
+void
+combine2::commit_combination (combination_attempt_rec &attempt,
+			      bool kept_def_p)
+{
+  insn_info_rec *new_home = attempt.new_home;
+  rtx_insn *old_insn = attempt.sequence[0]->insn;
+  rtx_insn *new_insn = attempt.sequence[1]->insn;
+
+  /* Remove any notes that are no longer relevant.  */
+  bool single_set_p = single_set (new_insn);
+  for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	case REG_NOALIAS:
+	  keep_p = single_set_p;
+	  break;
+
+	case REG_UNUSED:
+	  keep_p = false;
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+
+  /* Complete the open validate_change group.  */
+  confirm_change_group ();
+
+  /* Decide where the new instruction should go.  */
+  unsigned int new_point = attempt.latest_point;
+  if (new_point != attempt.earliest_point
+      && prev_real_insn (new_insn) != old_insn)
+    {
+      /* Prefer the earliest point if the combined instruction reduces
+	 register pressure and the latest point if it increases register
+	 pressure.
+
+	 The choice isn't obvious in the event of a tie, but picking
+	 the earliest point should reduce the number of times that
+	 we need to invalidate debug insns.  */
+      int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
+      int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
+      bool move_up_p = (delta1 + delta2 <= 0);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "register pressure delta = %d + %d; using %s position\n",
+		 delta1, delta2, move_up_p ? "earliest" : "latest");
+      if (move_up_p)
+	new_point = attempt.earliest_point;
+    }
+
+  /* Translate inserting at NEW_POINT into inserting before or after
+     a particular insn.  */
+  rtx_insn *anchor = NULL;
+  bool before_p = (new_point & 1);
+  if (new_point != attempt.sequence[1]->point
+      && new_point != attempt.sequence[0]->point)
+    {
+      anchor = m_points[(new_point - m_end_of_sequence) / 2];
+      rtx_insn *other_side = (before_p
+			      ? prev_real_insn (anchor)
+			      : next_real_insn (anchor));
+      /* Inserting next to an insn X and then deleting X is just a
+	 roundabout way of using X as the insertion point.  */
+      if (anchor == new_insn || other_side == new_insn)
+	new_point = attempt.sequence[1]->point;
+      else if (anchor == old_insn || other_side == old_insn)
+	new_point = attempt.sequence[0]->point;
+    }
+
+  /* Actually perform the move.  */
+  if (new_point == attempt.sequence[1]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (new_insn));
+      set_insn_deleted (old_insn);
+    }
+  else if (new_point == attempt.sequence[0]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (old_insn));
+      PATTERN (old_insn) = PATTERN (new_insn);
+      transfer_insn (old_insn, new_insn);
+      std::swap (old_insn, new_insn);
+      set_insn_deleted (old_insn);
+    }
+  else
+    {
+      /* We need to insert a new instruction.  We can't simply move
+	 NEW_INSN because it acts as an insertion anchor in m_points.  */
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "inserting combined insn %s insn %d\n",
+		 before_p ? "before" : "after", INSN_UID (anchor));
+
+      rtx_insn *added_insn = (before_p
+			      ? emit_insn_before (PATTERN (new_insn), anchor)
+			      : emit_insn_after (PATTERN (new_insn), anchor));
+      transfer_insn (added_insn, new_insn);
+      set_insn_deleted (old_insn);
+      set_insn_deleted (new_insn);
+      new_insn = added_insn;
+    }
+  df_insn_rescan (new_insn);
+
+  /* Unlink the old uses.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      remove_range_use (*use, attempt.sequence[i]);
+
+  /* Work out which registers the new pattern uses.  */
+  bitmap_clear (m_true_deps);
+  df_ref use;
+  FOR_EACH_INSN_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+  FOR_EACH_INSN_EQ_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+
+  /* Describe the combined instruction in NEW_HOME.  */
+  new_home->insn = new_insn;
+  new_home->point = new_point;
+  new_home->cost = attempt.new_cost;
+
+  /* Build up a list of definitions for the combined instructions
+     and update all the ranges accordingly.  It shouldn't matter
+     which order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
+      if (kept_def_p || *def != attempt.def_use_range)
+	{
+	  obstack_ptr_grow (&m_insn_obstack, *def);
+	  (*def)->producer = new_home;
+	}
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* Build up a list of uses for the combined instructions and update
+     all the ranges accordingly.  Again, it shouldn't matter which
+     order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      {
+	live_range_rec *range = *use;
+	if (range != attempt.def_use_range
+	    && (range->regno == INVALID_REGNUM
+		? insn_reads_mem_p (new_insn)
+		: bitmap_bit_p (m_true_deps, range->regno))
+	    && add_range_use (range, new_home))
+	  obstack_ptr_grow (&m_insn_obstack, range);
+      }
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* There shouldn't be any remaining references to other instructions
+     in the combination.  Invalidate their contents to make lingering
+     references a noisy failure.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i] != new_home)
+      {
+	attempt.sequence[i]->insn = NULL;
+	attempt.sequence[i]->point = ~0U;
+      }
+
+  /* Unlink the def-use range.  */
+  if (!kept_def_p && attempt.def_use_range)
+    {
+      live_range_rec *range = attempt.def_use_range;
+      if (range->prev_range)
+	range->prev_range->next_range = range->next_range;
+      else
+	m_reg_info[range->regno].range = range->next_range;
+      if (range->next_range)
+	range->next_range->prev_range = range->prev_range;
+    }
+
+  /* Record instructions whose new form alters the cfg.  */
+  rtx pattern = PATTERN (new_insn);
+  if ((returnjump_p (new_insn)
+       || any_uncondjump_p (new_insn)
+       || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
+      && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
+    m_cfg_altering_insns.safe_push (new_insn);
+}
+
+/* Return true if X1 and X2 are memories and if X1 does not have
+   a higher alignment than X2.  */
+
+static bool
+dubious_mem_pair_p (rtx x1, rtx x2)
+{
+  return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
+}
+
+/* Try to implement ATTEMPT using (parallel [SET1 SET2]).  */
+
+bool
+combine2::try_parallel_sets (combination_attempt_rec &attempt,
+			     rtx set1, rtx set2)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  /* Combining two loads or two stores can be useful on targets that
+     allow them to be treated as a single access.  However, we use a
+     very peephole approach to picking the pairs, so we need to be
+     relatively confident that we're making a good choice.
+
+     For now just aim for cases in which the memory references are
+     consecutive and the first reference has a higher alignment.
+     We can leave the target to test the consecutive part; whatever test
+     we added here might be different from the target's, and in any case
+     it's fine if the target accepts other well-aligned cases too.  */
+  if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
+      || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
+    return false;
+
+  /* Cache the PARALLEL rtx between attempts so that we don't generate
+     too much garbage rtl.  */
+  if (!m_spare_parallel)
+    {
+      rtvec vec = gen_rtvec (2, set1, set2);
+      m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
+    }
+  else
+    {
+      XVECEXP (m_spare_parallel, 0, 0) = set1;
+      XVECEXP (m_spare_parallel, 0, 1) = set2;
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
+  if (verify_combination (attempt))
+    {
+      m_spare_parallel = NULL_RTX;
+      return true;
+    }
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* Try to parallelize the two instructions in ATTEMPT.  */
+
+bool
+combine2::try_parallelize_insns (combination_attempt_rec &attempt)
+{
+  rtx_insn *i1_insn = attempt.sequence[0]->insn;
+  rtx_insn *i2_insn = attempt.sequence[1]->insn;
+
+  /* Can't parallelize asm statements.  */
+  if (asm_noperands (PATTERN (i1_insn)) >= 0
+      || asm_noperands (PATTERN (i2_insn)) >= 0)
+    return false;
+
+  /* For now, just handle the case in which both instructions are
+     single sets.  We could handle more than 2 sets as well, but few
+     targets support that anyway.  */
+  rtx set1 = single_set (i1_insn);
+  if (!set1)
+    return false;
+  rtx set2 = single_set (i2_insn);
+  if (!set2)
+    return false;
+
+  /* Make sure that we have structural proof that the destinations
+     are independent.  Things like alias analysis rely on semantic
+     information and assume no undefined behavior, which is rarely a
+     good enough guarantee to allow a useful instruction combination.  */
+  rtx dest1 = SET_DEST (set1);
+  rtx dest2 = SET_DEST (set2);
+  if (MEM_P (dest1)
+      ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false)
+      : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
+    return false;
+
+  /* Try the sets in both orders.  */
+  if (try_parallel_sets (attempt, set1, set2)
+      || try_parallel_sets (attempt, set2, set1))
+    {
+      commit_combination (attempt, true);
+      if (MAY_HAVE_DEBUG_BIND_INSNS
+	  && attempt.new_home->insn != i1_insn)
+	propagate_for_debug (i1_insn, attempt.new_home->insn,
+			     SET_DEST (set1), SET_SRC (set1), m_bb);
+      return true;
+    }
+  return false;
+}
+
+/* Replace DEST with SRC in the register notes for INSN.  */
+
+static void
+substitute_into_note (rtx_insn *insn, rtx dest, rtx src)
+{
+  for (rtx *note_ptr = &REG_NOTES (insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	  keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0),
+						  dest, src);
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+}
+
+/* A subroutine of try_combine_def_use.  Try replacing DEST with SRC
+   in ATTEMPT.  SRC might be either the original SET_SRC passed to the
+   parent routine or a value pulled from a note; SRC_IS_NOTE_P is true
+   in the latter case.  */
+
+bool
+combine2::try_combine_def_use_1 (combination_attempt_rec &attempt,
+				 rtx dest, rtx src, bool src_is_note_p)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+
+  /* Mimic combine's behavior by not combining moves from allocatable hard
+     registers (e.g. when copying parameters or function return values).  */
+  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
+    return false;
+
+  /* Don't mess with volatile references.  For one thing, we don't yet
+     know how many copies of SRC we'll need.  */
+  if (volatile_refs_p (src))
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to combine %d and %d%s:\n",
+	       INSN_UID (def_insn), INSN_UID (use_insn),
+	       src_is_note_p ? " using equal/equiv note" : "");
+      dump_insn_slim (dump_file, def_insn);
+      dump_insn_slim (dump_file, use_insn);
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn),
+				      dest, src))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "combination failed -- unable to substitute"
+		 " all uses\n");
+      return false;
+    }
+
+  /* Try matching the instruction on its own if DEST isn't used elsewhere.  */
+  if (has_single_use_p (attempt.def_use_range)
+      && verify_combination (attempt))
+    {
+      live_range_rec *next_range = attempt.def_use_range->next_range;
+      substitute_into_note (use_insn, dest, src);
+      commit_combination (attempt, false);
+      if (MAY_HAVE_DEBUG_BIND_INSNS)
+	{
+	  rtx_insn *end_of_range = (next_range
+				    ? next_range->producer->insn
+				    : BB_END (m_bb));
+	  propagate_for_debug (def_insn, end_of_range, dest, src, m_bb);
+	}
+      return true;
+    }
+
+  /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN
+     pattern.  */
+  if ((!targetm.cannot_copy_insn_p || !targetm.cannot_copy_insn_p (def_insn))
+      && try_parallelize_insns (attempt))
+    return true;
+
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* ATTEMPT describes an attempt to substitute the result of the first
+   instruction into the second instruction.  Try to implement it,
+   given that the first instruction sets DEST to SRC.  */
+
+bool
+combine2::try_combine_def_use (combination_attempt_rec &attempt,
+			       rtx dest, rtx src)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+  rtx def_note = find_reg_equal_equiv_note (def_insn);
+
+  /* First try combining the instructions in their original form.  */
+  if (try_combine_def_use_1 (attempt, dest, src, false))
+    return true;
+
+  /* Try to replace DEST with a REG_EQUAL/EQUIV value instead.  */
+  if (def_note
+      && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+    return true;
+
+  /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try
+     using that instead of the main pattern.  */
+  for (rtx *link_ptr = &REG_NOTES (use_insn); *link_ptr;
+       link_ptr = &XEXP (*link_ptr, 1))
+    {
+      rtx use_note = *link_ptr;
+      if (REG_NOTE_KIND (use_note) != REG_EQUAL
+	  && REG_NOTE_KIND (use_note) != REG_EQUIV)
+	continue;
+
+      rtx use_set = single_set (use_insn);
+      if (!use_set)
+	break;
+
+      if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0)))
+	continue;
+
+      /* Try snipping out the note and putting it in the SET instead.  */
+      validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1);
+      validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1);
+
+      if (try_combine_def_use_1 (attempt, dest, src, false))
+	return true;
+
+      if (def_note
+	  && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+	return true;
+
+      cancel_changes (0);
+    }
+
+  return false;
+}
+
+/* ATTEMPT describes an attempt to combine two instructions that use
+   the same resource.  Try to implement it, returning true on success.  */
+
+bool
+combine2::try_combine_two_uses (combination_attempt_rec &attempt)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to parallelize %d and %d:\n",
+	       INSN_UID (attempt.sequence[0]->insn),
+	       INSN_UID (attempt.sequence[1]->insn));
+      dump_insn_slim (dump_file, attempt.sequence[0]->insn);
+      dump_insn_slim (dump_file, attempt.sequence[1]->insn);
+    }
+
+  return try_parallelize_insns (attempt);
+}
+
+/* Try to optimize instruction I1.  Return true on success.  */
+
+bool
+combine2::optimize_insn (insn_info_rec *i1)
+{
+  combination_attempt_rec attempt;
+
+  if (!combinable_insn_p (i1->insn, false))
+    return false;
+
+  rtx set = single_set (i1->insn);
+  if (!set)
+    return false;
+
+  /* First try combining I1 with a user of its result.  */
+  rtx dest = SET_DEST (set);
+  rtx src = SET_SRC (set);
+  if (REG_P (dest) && REG_NREGS (dest) == 1)
+    for (live_range_rec **def = i1->defs; *def; ++def)
+      if ((*def)->regno == REGNO (dest))
+	{
+	  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	    {
+	      insn_info_rec *use = (*def)->users[i];
+	      if (use
+		  && combinable_insn_p (use->insn, has_single_use_p (*def))
+		  && start_combination (attempt, i1, use, *def)
+		  && try_combine_def_use (attempt, dest, src))
+		return true;
+	    }
+	  break;
+	}
+
+  /* Try parallelizing I1 and another instruction that uses the same
+     resource.  */
+  bitmap_clear (m_tried_insns);
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+      {
+	insn_info_rec *i2 = (*use)->users[i];
+	if (i2
+	    && i2 != i1
+	    && combinable_insn_p (i2->insn, false)
+	    && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn))
+	    && start_combination (attempt, i1, i2)
+	    && try_combine_two_uses (attempt))
+	  return true;
+      }
+
+  return false;
+}
+
+/* Record all register and memory definitions in INSN_INFO and fill in its
+   "defs" list.  */
+
+void
+combine2::record_defs (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register definitions.  */
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, insn)
+    {
+      rtx reg = DF_REF_REAL_REG (def);
+      unsigned int end_regno = END_REGNO (reg);
+      for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+	{
+	  live_range_rec *range = reg_live_range (regno);
+	  range->producer = insn_info;
+	  m_reg_info[regno].live_p = false;
+	  obstack_ptr_grow (&m_insn_obstack, range);
+	}
+    }
+
+  /* If the instruction writes to memory, record that too.  */
+  if (insn_writes_mem_p (insn))
+    {
+      live_range_rec *range = mem_live_range ();
+      range->producer = insn_info;
+      obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of definitions.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Record that INSN_INFO contains register use USE.  If this requires
+   new entries to be added to INSN_INFO->uses, add those entries to the
+   list we're building in m_insn_obstack.  */
+
+void
+combine2::record_reg_use (insn_info_rec *insn_info, df_ref use)
+{
+  rtx reg = DF_REF_REAL_REG (use);
+  unsigned int end_regno = END_REGNO (reg);
+  for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+    {
+      live_range_rec *range = reg_live_range (regno);
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+      m_reg_info[regno].live_p = true;
+    }
+}
+
+/* Record all register and memory uses in INSN_INFO and fill in its
+   "uses" list.  */
+
+void
+combine2::record_uses (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register uses in the main pattern.  */
+  df_ref use;
+  FOR_EACH_INSN_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Treat REG_EQUAL uses as first-class uses.  We don't lose much
+     by doing that, since it's rare for a REG_EQUAL note to mention
+     registers that the main pattern doesn't.  It also gives us the
+     maximum freedom to use REG_EQUAL notes in place of the main pattern.  */
+  FOR_EACH_INSN_EQ_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Record a memory use if either the pattern or the notes read from
+     memory.  */
+  if (insn_reads_mem_p (insn))
+    {
+      live_range_rec *range = mem_live_range ();
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of uses.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Start a new instruction sequence, discarding all information about
+   the previous one.  */
+
+void
+combine2::start_sequence (void)
+{
+  m_end_of_sequence = m_point;
+  m_mem_range = NULL;
+  m_points.truncate (0);
+  obstack_free (&m_insn_obstack, m_insn_obstack_start);
+  obstack_free (&m_range_obstack, m_range_obstack_start);
+}
+
+/* Run the pass on the current function.  */
+
+void
+combine2::execute (void)
+{
+  df_analyze ();
+  FOR_EACH_BB_FN (m_bb, cfun)
+    {
+      m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb);
+      m_end_of_bb = m_point;
+      start_sequence ();
+
+      rtx_insn *insn, *prev;
+      FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  /* The current m_point represents the end of the sequence if
+	     INSN is the last instruction in the sequence, otherwise it
+	     represents the gap between INSN and the next instruction.
+	     m_point + 1 represents INSN itself.
+
+	     Instructions can be added to m_point by inserting them
+	     after INSN.  They can be added to m_point + 1 by inserting
+	     them before INSN.  */
+	  m_points.safe_push (insn);
+	  m_point += 1;
+
+	  insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec);
+	  insn_info->insn = insn;
+	  insn_info->point = m_point;
+	  insn_info->cost = UNKNOWN_COST;
+
+	  record_defs (insn_info);
+	  record_uses (insn_info);
+
+	  /* Set up m_point for the next instruction.  */
+	  m_point += 1;
+
+	  if (CALL_P (insn))
+	    start_sequence ();
+	  else
+	    while (optimize_insn (insn_info))
+	      gcc_assert (insn_info->insn);
+	}
+    }
+
+  /* If an instruction changes the cfg, update the containing block
+     accordingly.  */
+  rtx_insn *insn;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn)
+    if (JUMP_P (insn))
+      {
+	mark_jump_label (PATTERN (insn), insn, 0);
+	update_cfg_for_uncondjump (insn);
+      }
+    else
+      {
+	remove_edge (split_block (BLOCK_FOR_INSN (insn), insn));
+	emit_barrier_after_bb (BLOCK_FOR_INSN (insn));
+      }
+
+  /* Propagate the above block-local cfg changes to the rest of the cfg.  */
+  if (!m_cfg_altering_insns.is_empty ())
+    {
+      if (dom_info_available_p (CDI_DOMINATORS))
+	free_dominance_info (CDI_DOMINATORS);
+      timevar_push (TV_JUMP);
+      rebuild_jump_labels (get_insns ());
+      cleanup_cfg (0);
+      timevar_pop (TV_JUMP);
+    }
+}
+
+const pass_data pass_data_combine2 =
+{
+  RTL_PASS, /* type */
+  "combine2", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_COMBINE2, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_combine2 : public rtl_opt_pass
+{
+public:
+  pass_combine2 (gcc::context *ctxt, int flag)
+    : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag)
+  {}
+
+  bool
+  gate (function *) OVERRIDE
+  {
+    return optimize && (param_run_combine & m_flag) != 0 && !HAVE_cc0;
+  }
+
+  unsigned int
+  execute (function *f) OVERRIDE
+  {
+    combine2 (f).execute ();
+    return 0;
+  }
+
+private:
+  unsigned int m_flag;
+}; // class pass_combine2
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_combine2_before (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 1);
+}
+
+rtl_opt_pass *
+make_pass_combine2_after (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 4);
+}
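
As a rough illustration of the point numbering that execute () sets up
above: the sketch below walks a list of instructions in reverse and hands
out points in the same way, so that the last instruction gets base + 1 and
the even points in between are left as gaps.  It is a standalone C++ model
rather than code from the patch; assign_points and the string instruction
names are invented for the example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Assign point numbers to INSNS, which are listed in program order.
// BASE is the even point that represents the end of the sequence.
// Return (name, point) pairs in the order the reverse walk visits them,
// i.e. last instruction first.  (Hypothetical helper, not in the patch.)
inline std::vector<std::pair<std::string, unsigned int>>
assign_points (const std::vector<std::string> &insns, unsigned int base)
{
  std::vector<std::pair<std::string, unsigned int>> result;
  unsigned int point = base;
  for (auto it = insns.rbegin (); it != insns.rend (); ++it)
    {
      // Odd point: the instruction itself.
      point += 1;
      result.emplace_back (*it, point);
      // Even point: the gap between this instruction and the one that
      // precedes it in program order.
      point += 1;
    }
  return result;
}
```

With a base of 0 and instructions i1, i2, i3 in program order, the model
gives i3 point 1, i2 point 3 and i1 point 5, so higher numbers mean
earlier instructions.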

Patch

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2019-11-14 14:34:27.599783740 +0000
+++ gcc/Makefile.in	2019-11-17 23:15:31.188500613 +0000
@@ -1261,6 +1261,7 @@  OBJS = \
 	cgraphunit.o \
 	cgraphclones.o \
 	combine.o \
+	combine2.o \
 	combine-stack-adj.o \
 	compare-elim.o \
 	context.o \
Index: gcc/params.opt
===================================================================
--- gcc/params.opt	2019-11-14 14:34:26.339792215 +0000
+++ gcc/params.opt	2019-11-17 23:15:31.200500531 +0000
@@ -768,6 +768,10 @@  Use internal function id in profile look
 Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
 Maximum depth of a loop nest to fully value-number optimistically.
 
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
 -param=sccvn-max-alias-queries-per-access=
 Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
 Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-11-16 10:43:45.597105823 +0000
+++ gcc/doc/invoke.texi	2019-11-17 23:15:31.200500531 +0000
@@ -11807,6 +11807,11 @@  in combiner for a pseudo register as las
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
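
To make the run-combine bit assignments documented above concrete, here is
a small standalone C++ sketch of how the three gates read the parameter.
The enumerator names and pass_enabled_p are hypothetical; the patch itself
passes the literal values 1, 2 and 4, and its gates also test "optimize"
(and HAVE_cc0 for the new pass), which the sketch leaves out.

```cpp
#include <cassert>

// Hypothetical names for the three bits of --param=run-combine.
enum run_combine_bit : unsigned int
{
  RUN_COMBINE2_BEFORE = 1u << 0,  // bit 0: earlier variant (new pass)
  RUN_COMBINE_MAIN    = 1u << 1,  // bit 1: the existing combine pass
  RUN_COMBINE2_AFTER  = 1u << 2   // bit 2: later variant (new pass)
};

// Same shape as the mask test in the gate functions, without the
// checks on the optimization level or cc0.
inline bool
pass_enabled_p (unsigned int param_run_combine, unsigned int flag)
{
  return (param_run_combine & flag) != 0;
}
```

Under this reading, the default value of 2 runs only the existing combine
pass, --param=run-combine=6 runs the existing pass followed by the later
variant of the new pass, and --param=run-combine=5 runs both variants of
the new pass without the existing one.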
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2019-10-29 08:29:03.096444049 +0000
+++ gcc/tree-pass.h	2019-11-17 23:15:31.204500501 +0000
@@ -562,7 +562,9 @@  extern rtl_opt_pass *make_pass_reginfo_i
 extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2019-10-29 08:29:03.224443133 +0000
+++ gcc/passes.def	2019-11-17 23:15:31.200500531 +0000
@@ -437,7 +437,9 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
+      NEXT_PASS (pass_combine2_before);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_combine2_after);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	2019-10-11 15:43:53.403498517 +0100
+++ gcc/timevar.def	2019-11-17 23:15:31.204500501 +0000
@@ -251,6 +251,7 @@  DEFTIMEVAR (TV_AUTO_INC_DEC          , "
 DEFTIMEVAR (TV_CSE2                  , "CSE 2")
 DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
 DEFTIMEVAR (TV_COMBINE               , "combiner")
+DEFTIMEVAR (TV_COMBINE2              , "second combiner")
 DEFTIMEVAR (TV_IFCVT		     , "if-conversion")
 DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
 DEFTIMEVAR (TV_SMS		     , "sms modulo scheduling")
Index: gcc/cfgrtl.h
===================================================================
--- gcc/cfgrtl.h	2019-03-08 18:15:39.320730391 +0000
+++ gcc/cfgrtl.h	2019-11-17 23:15:31.192500584 +0000
@@ -47,6 +47,7 @@  extern void fixup_partitions (void);
 extern bool purge_dead_edges (basic_block);
 extern bool purge_all_dead_edges (void);
 extern bool fixup_abnormal_edges (void);
+extern void update_cfg_for_uncondjump (rtx_insn *);
 extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
 extern void relink_block_chain (bool);
 extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
Index: gcc/combine.c
===================================================================
--- gcc/combine.c	2019-11-13 08:42:45.537368745 +0000
+++ gcc/combine.c	2019-11-17 23:15:31.192500584 +0000
@@ -2530,42 +2530,6 @@  reg_subword_p (rtx x, rtx reg)
 	 && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
 }
 
-/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
-   Note that the INSN should be deleted *after* removing dead edges, so
-   that the kept edge is the fallthrough edge for a (set (pc) (pc))
-   but not for a (set (pc) (label_ref FOO)).  */
-
-static void
-update_cfg_for_uncondjump (rtx_insn *insn)
-{
-  basic_block bb = BLOCK_FOR_INSN (insn);
-  gcc_assert (BB_END (bb) == insn);
-
-  purge_dead_edges (bb);
-
-  delete_insn (insn);
-  if (EDGE_COUNT (bb->succs) == 1)
-    {
-      rtx_insn *insn;
-
-      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
-
-      /* Remove barriers from the footer if there are any.  */
-      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
-	if (BARRIER_P (insn))
-	  {
-	    if (PREV_INSN (insn))
-	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
-	    else
-	      BB_FOOTER (bb) = NEXT_INSN (insn);
-	    if (NEXT_INSN (insn))
-	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
-	  }
-	else if (LABEL_P (insn))
-	  break;
-    }
-}
-
 /* Return whether PAT is a PARALLEL of exactly N register SETs followed
    by an arbitrary number of CLOBBERs.  */
 static bool
@@ -15096,7 +15060,10 @@  const pass_data pass_data_combine =
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return (optimize > 0); }
+  virtual bool gate (function *)
+    {
+      return optimize > 0 && (param_run_combine & 2) != 0;
+    }
   virtual unsigned int execute (function *)
     {
       return rest_of_handle_combine ();
Index: gcc/cfgrtl.c
===================================================================
--- gcc/cfgrtl.c	2019-10-17 14:22:55.523309009 +0100
+++ gcc/cfgrtl.c	2019-11-17 23:15:31.188500613 +0000
@@ -3409,6 +3409,42 @@  fixup_abnormal_edges (void)
   return inserted;
 }
 
+/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
+   Note that the INSN should be deleted *after* removing dead edges, so
+   that the kept edge is the fallthrough edge for a (set (pc) (pc))
+   but not for a (set (pc) (label_ref FOO)).  */
+
+void
+update_cfg_for_uncondjump (rtx_insn *insn)
+{
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  gcc_assert (BB_END (bb) == insn);
+
+  purge_dead_edges (bb);
+
+  delete_insn (insn);
+  if (EDGE_COUNT (bb->succs) == 1)
+    {
+      rtx_insn *insn;
+
+      single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
+
+      /* Remove barriers from the footer if there are any.  */
+      for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
+	if (BARRIER_P (insn))
+	  {
+	    if (PREV_INSN (insn))
+	      SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
+	    else
+	      BB_FOOTER (bb) = NEXT_INSN (insn);
+	    if (NEXT_INSN (insn))
+	      SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
+	  }
+	else if (LABEL_P (insn))
+	  break;
+    }
+}
+
 /* Cut the insns from FIRST to LAST out of the insns stream.  */
 
 rtx_insn *
Index: gcc/simplify-rtx.c
===================================================================
--- gcc/simplify-rtx.c	2019-11-16 15:33:36.642840131 +0000
+++ gcc/simplify-rtx.c	2019-11-17 23:15:31.204500501 +0000
@@ -851,6 +851,12 @@  simplify_truncation (machine_mode mode,
       && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
     return constm1_rtx;
 
+  /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
+     in a narrower mode if useful.  */
+  if (COMPARISON_P (op))
+    return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
+				    XEXP (op, 0), XEXP (op, 1));
+
   return NULL_RTX;
 }
 
Index: gcc/recog.h
===================================================================
--- gcc/recog.h	2019-09-09 18:58:28.860430363 +0100
+++ gcc/recog.h	2019-11-17 23:15:31.204500501 +0000
@@ -111,6 +111,7 @@  extern int validate_replace_rtx_part_nos
 extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
 extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
 extern bool validate_simplify_insn (rtx_insn *insn);
+extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
 extern int num_changes_pending (void);
 extern int next_insn_tests_no_inequality (rtx_insn *);
 extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
Index: gcc/recog.c
===================================================================
--- gcc/recog.c	2019-10-01 09:55:35.150088599 +0100
+++ gcc/recog.c	2019-11-17 23:15:31.204500501 +0000
@@ -922,6 +922,226 @@  validate_simplify_insn (rtx_insn *insn)
       }
   return ((num_changes_pending () > 0) && (apply_change_group () > 0));
 }
+
+/* A subroutine of validate_simplify_replace_rtx.  Apply the replacement
+   described by R to LOC.  Return true on success; leave the caller
+   to clean up on failure.  */
+
+static bool
+validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
+{
+  rtx x = *loc;
+  enum rtx_code code = GET_CODE (x);
+  machine_mode mode = GET_MODE (x);
+
+  if (rtx_equal_p (x, r.from))
+    {
+      validate_unshare_change (r.insn, loc, r.to, 1);
+      return true;
+    }
+
+  /* Recursively apply the substitution and see if we can simplify
+     the result.  This specifically shouldn't use simplify_gen_*,
+     since we want to avoid generating new expressions where possible.  */
+  int old_num_changes = num_validated_changes ();
+  rtx newx = NULL_RTX;
+  bool recurse_p = false;
+  switch (GET_RTX_CLASS (code))
+    {
+    case RTX_UNARY:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
+	  return false;
+
+	newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
+	break;
+      }
+
+    case RTX_BIN_ARITH:
+    case RTX_COMM_ARITH:
+      {
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_binary_operation (code, mode,
+					  XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_COMPARE:
+    case RTX_COMM_COMPARE:
+      {
+	machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
+				? GET_MODE (XEXP (x, 0))
+				: GET_MODE (XEXP (x, 1)));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	  return false;
+
+	newx = simplify_relational_operation (code, mode, op_mode,
+					      XEXP (x, 0), XEXP (x, 1));
+	break;
+      }
+
+    case RTX_TERNARY:
+    case RTX_BITFIELD_OPS:
+      {
+	machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+	if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
+	    || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
+	  return false;
+
+	newx = simplify_ternary_operation (code, mode, op0_mode,
+					   XEXP (x, 0), XEXP (x, 1),
+					   XEXP (x, 2));
+	break;
+      }
+
+    case RTX_EXTRA:
+      if (code == SUBREG)
+	{
+	  machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
+	  if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
+	    return false;
+
+	  rtx inner = SUBREG_REG (x);
+	  newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
+	  /* Reject the same cases that simplify_gen_subreg would.  */
+	  if (!newx
+	      && (GET_CODE (inner) == SUBREG
+		  || GET_CODE (inner) == CONCAT
+		  || GET_MODE (inner) == VOIDmode
+		  || !validate_subreg (mode, inner_mode,
+				       inner, SUBREG_BYTE (x))))
+	    return false;
+	  break;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_OBJ:
+      if (code == LO_SUM)
+	{
+	  if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+	      || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+	    return false;
+
+	  /* (lo_sum (high x) y) -> y where x and y have the same base.  */
+	  rtx op0 = XEXP (x, 0);
+	  rtx op1 = XEXP (x, 1);
+	  if (GET_CODE (op0) == HIGH)
+	    {
+	      rtx base0, base1, offset0, offset1;
+	      split_const (XEXP (op0, 0), &base0, &offset0);
+	      split_const (op1, &base1, &offset1);
+	      if (rtx_equal_p (base0, base1))
+		newx = op1;
+	    }
+	}
+      else if (code == REG)
+	{
+	  if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
+	    return false;
+	}
+      else
+	recurse_p = true;
+      break;
+
+    case RTX_CONST_OBJ:
+      break;
+
+    case RTX_AUTOINC:
+      if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
+	return false;
+      recurse_p = true;
+      break;
+
+    case RTX_MATCH:
+    case RTX_INSN:
+      gcc_unreachable ();
+    }
+
+  if (recurse_p)
+    {
+      const char *fmt = GET_RTX_FORMAT (code);
+      for (int i = 0; fmt[i]; i++)
+	switch (fmt[i])
+	  {
+	  case 'E':
+	    for (int j = 0; j < XVECLEN (x, i); j++)
+	      if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
+		return false;
+	    break;
+
+	  case 'e':
+	    if (XEXP (x, i)
+		&& !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
+	      return false;
+	    break;
+	  }
+    }
+
+  if (newx && !rtx_equal_p (x, newx))
+    {
+      /* There's no longer any point unsharing the substitutions made
+	 for subexpressions, since we'll just copy this one instead.  */
+      for (int i = old_num_changes; i < num_changes; ++i)
+	changes[i].unshare = false;
+      validate_unshare_change (r.insn, loc, newx, 1);
+    }
+
+  return true;
+}
+
+/* A note_uses callback for validate_simplify_replace_rtx.
+   DATA points to a validate_replace_src_data object.  */
+
+static void
+validate_simplify_replace_rtx_uses (rtx *loc, void *data)
+{
+  validate_replace_src_data &r = *(validate_replace_src_data *) data;
+  if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
+    r.insn = NULL;
+}
+
+/* Try to perform the equivalent of:
+
+      newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
+      validate_change (INSN, LOC, newx, 1);
+
+   but without generating as much garbage rtl when the resulting
+   pattern doesn't match.
+
+   Return true if we were able to replace all uses of OLD_RTX in *LOC
+   and if the result conforms to general rtx rules (e.g. for whether
+   subregs are meaningful).
+
+   When returning true, add all replacements to the current validation group,
+   leaving the caller to test it in the normal way.  Leave both *LOC and the
+   validation group unchanged on failure.  */
+
+bool
+validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
+			       rtx old_rtx, rtx new_rtx)
+{
+  validate_replace_src_data r;
+  r.from = old_rtx;
+  r.to = new_rtx;
+  r.insn = insn;
+
+  unsigned int num_changes = num_validated_changes ();
+  note_uses (loc, validate_simplify_replace_rtx_uses, &r);
+  if (!r.insn)
+    {
+      cancel_changes (num_changes);
+      return false;
+    }
+  return true;
+}
 
 /* Return 1 if the insn using CC0 set by INSN does not contain
    any ordered tests applied to the condition codes.
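
The checkpoint-and-cancel pattern that validate_simplify_replace_rtx uses
(record the number of validated changes, queue replacements, and cancel
back to the checkpoint on failure) can be modelled outside GCC roughly as
follows.  This is a toy sketch over plain ints, assuming that queued
changes are applied immediately and undone newest-first; change_group and
replace_all are invented names and do not correspond to recog.c's real
data structures.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a validation group: each queued change is applied
// immediately, and the old value is remembered so it can be restored.
struct change_group
{
  struct change { int *loc; int old_value; };
  std::vector<change> changes;

  unsigned int num_changes () const { return changes.size (); }

  void queue (int *loc, int new_value)
  {
    changes.push_back ({loc, *loc});
    *loc = new_value;
  }

  // Undo every change queued after the first NUM, newest first.
  void cancel (unsigned int num)
  {
    while (changes.size () > num)
      {
        *changes.back ().loc = changes.back ().old_value;
        changes.pop_back ();
      }
  }
};

// Replace every FROM in VALUES with TO.  Fail, leaving VALUES and GROUP
// as they were on entry, if any element is equal to BLOCKER.
inline bool
replace_all (change_group &group, std::vector<int> &values,
             int from, int to, int blocker)
{
  unsigned int checkpoint = group.num_changes ();
  for (int &value : values)
    {
      if (value == blocker)
        {
          group.cancel (checkpoint);
          return false;
        }
      if (value == from)
        group.queue (&value, to);
    }
  return true;
}
```

Because the checkpoint is taken on entry, a failed call only discards its
own changes; anything the caller queued earlier stays in the group, which
is what lets callers such as try_combine_def_use_1 stack several attempts
on top of one another.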
Index: gcc/combine2.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/combine2.c	2019-11-17 23:15:31.196500559 +0000
@@ -0,0 +1,1576 @@ 
+/* Combine instructions
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+#include "predict.h"
+#include "cfgcleanup.h"
+#include "cfghooks.h"
+#include "cfgrtl.h"
+#include "alias.h"
+#include "valtrack.h"
+
+/* This pass tries to combine instructions in the following ways:
+
+   (1) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       and I2 is the only user of DEST1, the pass tries to combine them into:
+
+	 I2: (...SRC1...)
+
+   (2) If we have two dependent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (...DEST1...)
+
+       the pass tries to combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
+
+       or:
+
+	 I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
+
+   (3) If we have two independent instructions:
+
+	 I1: (set DEST1 SRC1)
+	 I2: (set DEST2 SRC2)
+
+       that read from memory or from the same register, the pass tries to
+       combine them into:
+
+	 I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
+
+       or:
+
+	 I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
+
+   If the combined form is a valid instruction, the pass tries to find a
+   place between I1 and I2 inclusive for the new instruction.  If there
+   are multiple valid locations, it tries to pick the best one by taking
+   the effect on register pressure into account.
+
+   If a combination succeeds and produces a single set, the pass tries to
+   combine the new form with earlier or later instructions.
+
+   The pass currently optimizes each basic block separately.  It walks
+   the instructions in reverse order, building up live ranges for registers
+   and memory.  It then uses these live ranges to look for possible
+   combination opportunities and to decide where the combined instructions
+   could be placed.
+
+   The pass represents positions in the block using point numbers,
+   with higher numbers indicating earlier instructions.  The numbering
+   scheme is that:
+
+   - the end of the current instruction sequence has an even base point B.
+
+   - instructions initially have odd-numbered points B + 1, B + 3, etc.
+     with B + 1 being the final instruction in the sequence.
+
+   - even points after B represent gaps between instructions where combined
+     instructions could be placed.
+
+   Thus even points initially represent no instructions and odd points
+   initially represent single instructions.  However, when picking a
+   place for a combined instruction, the pass may choose somewhere
+   in between the original two instructions, so that over time a point
+   may come to represent several instructions.  When this happens,
+   the pass maintains the invariant that all instructions with the same
+   point number are independent of each other and thus can be treated as
+   acting in parallel (or as acting in any arbitrary sequence).
+
+   TODOs:
+
+   - Handle 3-instruction combinations, and possibly more.
+
+   - Handle existing clobbers more efficiently.  At the moment we can't
+     move an instruction that clobbers R across another instruction that
+     clobbers R.
+
+   - Allow hard register clobbers to be added, like combine does.
+
+   - Perhaps work on EBBs, or SESE regions.  */
+
+namespace {
+
+/* The number of explicit uses to record in a live range.  */
+const unsigned int NUM_RANGE_USERS = 4;
+
+/* The maximum number of instructions that we can combine at once.  */
+const unsigned int MAX_COMBINE_INSNS = 2;
+
+/* A fake cost for instructions that we haven't costed yet.  */
+const unsigned int UNKNOWN_COST = ~0U;
+
+class combine2
+{
+public:
+  combine2 (function *);
+  ~combine2 ();
+
+  void execute ();
+
+private:
+  struct insn_info_rec;
+
+  /* Describes the live range of a register or of memory.  For simplicity,
+     we treat memory as a single entity.
+
+     If we had a fully-accurate live range, updating it to account for a
+     moved instruction would be a linear-time operation.  Doing this for
+     each combination would then make the pass quadratic.  We therefore
+     just maintain a list of NUM_RANGE_USERS use insns and use simple,
+     conservatively-correct behavior for the rest.  */
+  struct live_range_rec
+  {
+    /* Which instruction provides the dominating definition, or null if
+       we don't know yet.  */
+    insn_info_rec *producer;
+
+    /* A selection of instructions that use the resource, in program order.  */
+    insn_info_rec *users[NUM_RANGE_USERS];
+
+    /* An inclusive range of points that covers instructions not mentioned
+       in USERS.  Both values are zero if there are no such instructions.
+
+       Once we've included a use U at point P in this range, we continue
+       to assume that some kind of use exists at P whatever happens to U
+       afterwards.  */
+    unsigned int first_extra_use;
+    unsigned int last_extra_use;
+
+    /* The register number this range describes, or INVALID_REGNUM
+       for memory.  */
+    unsigned int regno;
+
+    /* Forms a linked list of ranges for the same resource, in program
+       order.  */
+    live_range_rec *prev_range;
+    live_range_rec *next_range;
+  };
+
+  /* Pass-specific information about an instruction.  */
+  struct insn_info_rec
+  {
+    /* The instruction itself.  */
+    rtx_insn *insn;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction defines.  */
+    live_range_rec **defs;
+
+    /* A null-terminated list of live ranges for the things that this
+       instruction uses.  */
+    live_range_rec **uses;
+
+    /* The point at which the instruction appears.  */
+    unsigned int point;
+
+    /* The cost of the instruction, or UNKNOWN_COST if we haven't
+       measured it yet.  */
+    unsigned int cost;
+  };
+
+  /* Describes one attempt to combine instructions.  */
+  struct combination_attempt_rec
+  {
+    /* The instruction that we're currently trying to optimize.
+       If the combination succeeds, we'll use this insn_info_rec
+       to describe the new instruction.  */
+    insn_info_rec *new_home;
+
+    /* The instructions we're combining, in program order.  */
+    insn_info_rec *sequence[MAX_COMBINE_INSNS];
+
+    /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
+       live range that describes the substituted register.  */
+    live_range_rec *def_use_range;
+
+    /* The earliest and latest points at which we could insert the
+       combined instruction.  */
+    unsigned int earliest_point;
+    unsigned int latest_point;
+
+    /* The cost of the new instruction, once we have a successful match.  */
+    unsigned int new_cost;
+  };
+
+  /* Pass-specific information about a register.  */
+  struct reg_info_rec
+  {
+    /* The live range associated with the last reference to the register.  */
+    live_range_rec *range;
+
+    /* The point at which the last reference occurred.  */
+    unsigned int next_ref;
+
+    /* True if the register is currently live.  We record this here rather
+       than in a separate bitmap because (a) there's a natural hole for
+       it on LP64 hosts and (b) we only refer to it when updating the
+       other fields, and so recording it here should give better locality.  */
+    unsigned int live_p : 1;
+  };
+
+  live_range_rec *new_live_range (unsigned int, live_range_rec *);
+  live_range_rec *reg_live_range (unsigned int);
+  live_range_rec *mem_live_range ();
+  bool add_range_use (live_range_rec *, insn_info_rec *);
+  void remove_range_use (live_range_rec *, insn_info_rec *);
+  bool has_single_use_p (live_range_rec *);
+  bool known_last_use_p (live_range_rec *, insn_info_rec *);
+  unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
+  unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
+  bool start_combination (combination_attempt_rec &, insn_info_rec *,
+			  insn_info_rec *, live_range_rec * = NULL);
+  bool verify_combination (combination_attempt_rec &);
+  int estimate_reg_pressure_delta (insn_info_rec *);
+  void commit_combination (combination_attempt_rec &, bool);
+  bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
+  bool try_parallelize_insns (combination_attempt_rec &);
+  bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
+  bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
+  bool try_combine_two_uses (combination_attempt_rec &);
+  bool try_combine (insn_info_rec *, rtx, unsigned int);
+  bool optimize_insn (insn_info_rec *);
+  void record_defs (insn_info_rec *);
+  void record_reg_use (insn_info_rec *, df_ref);
+  void record_uses (insn_info_rec *);
+  void process_insn (insn_info_rec *);
+  void start_sequence ();
+
+  /* The function we're optimizing.  */
+  function *m_fn;
+
+  /* The highest pseudo register number plus one.  */
+  unsigned int m_num_regs;
+
+  /* The current basic block.  */
+  basic_block m_bb;
+
+  /* True if we should optimize the current basic block for speed.  */
+  bool m_optimize_for_speed_p;
+
+  /* The point number to allocate to the next instruction we visit
+     in the backward traversal.  */
+  unsigned int m_point;
+
+  /* The point number corresponding to the end of the current
+     instruction sequence, i.e. the lowest point number about which
+     we still have valid information.  */
+  unsigned int m_end_of_sequence;
+
+  /* The point number corresponding to the end of the current basic block.
+     This is the same as M_END_OF_SEQUENCE when processing the last
+     instruction sequence in a basic block.  */
+  unsigned int m_end_of_bb;
+
+  /* The memory live range, or null if we haven't yet found a memory
+     reference in the current instruction sequence.  */
+  live_range_rec *m_mem_range;
+
+  /* Gives information about each register.  We track both hard and
+     pseudo registers.  */
+  auto_vec<reg_info_rec> m_reg_info;
+
+  /* A bitmap of registers whose entry in m_reg_info is valid.  */
+  auto_sbitmap m_valid_regs;
+
+  /* If nonnull, an unused 2-element PARALLEL that we can use to test
+     instruction combinations.  */
+  rtx m_spare_parallel;
+
+  /* A bitmap of instructions that we've already tried to combine with.  */
+  auto_bitmap m_tried_insns;
+
+  /* A temporary bitmap used to hold register numbers.  */
+  auto_bitmap m_true_deps;
+
+  /* An obstack used for allocating insn_info_recs and for building
+     up their lists of definitions and uses.  */
+  obstack m_insn_obstack;
+
+  /* An obstack used for allocating live_range_recs.  */
+  obstack m_range_obstack;
+
+  /* Start-of-object pointers for the two obstacks.  */
+  char *m_insn_obstack_start;
+  char *m_range_obstack_start;
+
+  /* A list of instructions that we've optimized and whose new forms
+     change the cfg.  */
+  auto_vec<rtx_insn *> m_cfg_altering_insns;
+
+  /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS.  */
+  auto_bitmap m_cfg_altering_insn_ids;
+
+  /* We can insert new instructions at point P * 2 by inserting them
+     after M_POINTS[P - M_END_OF_SEQUENCE / 2].  We can insert new
+     instructions at point P * 2 + 1 by inserting them before
+     M_POINTS[P - M_END_OF_SEQUENCE / 2].  */
+  auto_vec<rtx_insn *, 256> m_points;
+};
+
+combine2::combine2 (function *fn)
+  : m_fn (fn),
+    m_num_regs (max_reg_num ()),
+    m_bb (NULL),
+    m_optimize_for_speed_p (false),
+    m_point (2),
+    m_end_of_sequence (m_point),
+    m_end_of_bb (m_point),
+    m_mem_range (NULL),
+    m_reg_info (m_num_regs),
+    m_valid_regs (m_num_regs),
+    m_spare_parallel (NULL_RTX)
+{
+  gcc_obstack_init (&m_insn_obstack);
+  gcc_obstack_init (&m_range_obstack);
+  m_reg_info.quick_grow (m_num_regs);
+  bitmap_clear (m_valid_regs);
+  m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
+  m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
+}
+
+combine2::~combine2 ()
+{
+  obstack_free (&m_insn_obstack, NULL);
+  obstack_free (&m_range_obstack, NULL);
+}
+
+/* Return true if it's possible in principle to combine INSN with
+   other instructions.  ALLOW_ASMS_P is true if the caller can cope
+   with asm statements.  */
+
+static bool
+combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
+{
+  rtx pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
+    return false;
+
+  if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
+    return false;
+
+  if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0)
+    return false;
+
+  return true;
+}
+
+/* Return true if it's possible in principle to move INSN somewhere else,
+   as long as all dependencies are satisfied.  */
+
+static bool
+movable_insn_p (rtx_insn *insn)
+{
+  if (JUMP_P (insn))
+    return false;
+
+  if (volatile_refs_p (PATTERN (insn)))
+    return false;
+
+  return true;
+}
+
+/* Create and return a new live range for REGNO.  NEXT is the next range
+   in program order, or null if this is the first live range in the
+   sequence.  */
+
+combine2::live_range_rec *
+combine2::new_live_range (unsigned int regno, live_range_rec *next)
+{
+  live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
+  memset (range, 0, sizeof (*range));
+
+  range->regno = regno;
+  range->next_range = next;
+  if (next)
+    next->prev_range = range;
+  return range;
+}
+
+/* Return the current live range for register REGNO, creating a new
+   one if necessary.  */
+
+combine2::live_range_rec *
+combine2::reg_live_range (unsigned int regno)
+{
+  /* Initialize the liveness flag, if it isn't already valid for this BB.  */
+  bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
+  if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
+    m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
+
+  /* See if we already have a live range associated with the current
+     instruction sequence.  */
+  live_range_rec *range = NULL;
+  if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
+    range = m_reg_info[regno].range;
+
+  /* Create a new range if this is the first reference to REGNO in the
+     current instruction sequence or if the current range has been closed
+     off by a definition.  */
+  if (!range || range->producer)
+    {
+      range = new_live_range (regno, range);
+
+      /* If the register is live after the current sequence, treat that
+	 as a fake use at the end of the sequence.  */
+      if (!range->next_range && m_reg_info[regno].live_p)
+	range->first_extra_use = range->last_extra_use = m_end_of_sequence;
+
+      /* Record that this is now the current range for REGNO.  */
+      if (first_ref_p)
+	bitmap_set_bit (m_valid_regs, regno);
+      m_reg_info[regno].range = range;
+      m_reg_info[regno].next_ref = m_point;
+    }
+  return range;
+}
+
+/* Return the current live range for memory, treating memory as a single
+   entity.  Create a new live range if necessary.  */
+
+combine2::live_range_rec *
+combine2::mem_live_range ()
+{
+  if (!m_mem_range || m_mem_range->producer)
+    m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
+  return m_mem_range;
+}
+
+/* Record that instruction USER uses the resource described by RANGE.
+   Return true if this is new information.  */
+
+bool
+combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  /* See if we've already recorded the instruction, or if there's a
+     spare use slot we can use.  */
+  unsigned int i = 0;
+  for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return false;
+
+  if (i == NUM_RANGE_USERS)
+    {
+      /* Since we've processed USER recently, assume that it's more
+	 interesting to record explicitly than the last user in the
+	 current list.  Evict that last user and describe it in the
+	 overflow "extra use" range instead.  */
+      insn_info_rec *ousted_user = range->users[--i];
+      if (range->first_extra_use < ousted_user->point)
+	range->first_extra_use = ousted_user->point;
+      if (!range->last_extra_use || range->last_extra_use > ousted_user->point)
+	range->last_extra_use = ousted_user->point;
+    }
+
+  /* Insert USER while keeping the list sorted.  */
+  for (; i > 0 && range->users[i - 1]->point < user->point; --i)
+    range->users[i] = range->users[i - 1];
+  range->users[i] = user;
+  return true;
+}
+
+/* Remove USER from the uses recorded for RANGE, if we can.
+   There's nothing we can do if USER was described in the
+   overflow "extra use" range.  */
+
+void
+combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
+{
+  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+    if (range->users[i] == user)
+      {
+	for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
+	  range->users[j] = range->users[j + 1];
+	range->users[NUM_RANGE_USERS - 1] = NULL;
+	break;
+      }
+}
+
+/* Return true if RANGE has a single known user.  */
+
+bool
+combine2::has_single_use_p (live_range_rec *range)
+{
+  return range->users[0] && !range->users[1] && !range->first_extra_use;
+}
+
+/* Return true if we know that USER is the last user of RANGE.  */
+
+bool
+combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
+{
+  if (range->first_extra_use && range->last_extra_use <= user->point)
+    return false;
+
+  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
+    if (range->users[i] == user)
+      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
+    else if (range->users[i]->point == user->point)
+      return false;
+
+  gcc_unreachable ();
+}
+
+/* Find the earliest point that we could move I2 up in order to combine
+   it with I1.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
+{
+  if (!movable_insn_p (i2->insn))
+    return i2->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way up to I1.  */
+  unsigned int point = i1->point;
+
+  /* Make sure that the new position preserves all necessary true dependencies
+     on earlier instructions.  */
+  for (live_range_rec **use = i2->uses; *use; ++use)
+    {
+      live_range_rec *range = *use;
+      if (range->producer
+	  && range->producer != i1
+	  && point >= range->producer->point)
+	point = range->producer->point - 1;
+    }
+
+  /* Make sure that the new position preserves all necessary output and
+     anti dependencies on earlier instructions.  */
+  for (live_range_rec **def = i2->defs; *def; ++def)
+    if (live_range_rec *range = (*def)->prev_range)
+      {
+	if (range->producer
+	    && range->producer != i1
+	    && point >= range->producer->point)
+	  point = range->producer->point - 1;
+
+	for (unsigned int i = NUM_RANGE_USERS; i-- > 0;)
+	  if (range->users[i] && range->users[i] != i1)
+	    {
+	      if (point >= range->users[i]->point)
+		point = range->users[i]->point - 1;
+	      break;
+	    }
+
+	if (range->last_extra_use && point >= range->last_extra_use)
+	  point = range->last_extra_use - 1;
+      }
+
+  return point;
+}
+
+/* Find the latest point that we could move I1 down in order to combine
+   it with I2.  Ignore any dependencies between I1 and I2; leave the
+   caller to deal with those instead.  */
+
+unsigned int
+combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
+{
+  if (!movable_insn_p (i1->insn))
+    return i1->point;
+
+  /* Start by optimistically assuming that we can move the instruction
+     all the way down to I2.  */
+  unsigned int point = i2->point;
+
+  /* Make sure that the new position preserves all necessary anti dependencies
+     on later instructions.  */
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    if (live_range_rec *range = (*use)->next_range)
+      if (range->producer != i2 && point <= range->producer->point)
+	point = range->producer->point + 1;
+
+  /* Make sure that the new position preserves all necessary output and
+     true dependencies on later instructions.  */
+  for (live_range_rec **def = i1->defs; *def; ++def)
+    {
+      live_range_rec *range = *def;
+
+      for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	if (range->users[i] != i2)
+	  {
+	    if (range->users[i] && point <= range->users[i]->point)
+	      point = range->users[i]->point + 1;
+	    break;
+	  }
+
+      if (range->first_extra_use && point <= range->first_extra_use)
+	point = range->first_extra_use + 1;
+
+      live_range_rec *next_range = range->next_range;
+      if (next_range
+	  && next_range->producer != i2
+	  && point <= next_range->producer->point)
+	point = next_range->producer->point + 1;
+    }
+
+  return point;
+}
+
+/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
+   where I1 is the instruction that we're currently trying to optimize.
+   If DEF_USE_RANGE is nonnull, I1 defines the value described by
+   DEF_USE_RANGE and I2 uses it.  */
+
+bool
+combine2::start_combination (combination_attempt_rec &attempt,
+			     insn_info_rec *i1, insn_info_rec *i2,
+			     live_range_rec *def_use_range)
+{
+  attempt.new_home = i1;
+  attempt.sequence[0] = i1;
+  attempt.sequence[1] = i2;
+  if (attempt.sequence[0]->point < attempt.sequence[1]->point)
+    std::swap (attempt.sequence[0], attempt.sequence[1]);
+  attempt.def_use_range = def_use_range;
+
+  /* Check that the instructions have no true dependencies other than
+     DEF_USE_RANGE.  */
+  bitmap_clear (m_true_deps);
+  for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
+    if (*def != def_use_range)
+      bitmap_set_bit (m_true_deps, (*def)->regno);
+  for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
+    if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
+      return false;
+
+  /* Calculate the range of points at which the combined instruction
+     could live.  */
+  attempt.earliest_point = find_earliest_point (attempt.sequence[1],
+						attempt.sequence[0]);
+  attempt.latest_point = find_latest_point (attempt.sequence[0],
+					    attempt.sequence[1]);
+  if (attempt.earliest_point < attempt.latest_point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "cannot combine %d and %d: no suitable"
+		 " location for combined insn\n",
+		 INSN_UID (attempt.sequence[0]->insn),
+		 INSN_UID (attempt.sequence[1]->insn));
+      return false;
+    }
+
+  /* Make sure we have valid costs for the original instructions before
+     we start changing their patterns.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i]->cost == UNKNOWN_COST)
+      attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
+					     m_optimize_for_speed_p);
+  return true;
+}
+
+/* Check whether the combination attempt described by ATTEMPT matches
+   an .md instruction (or matches its constraints, in the case of an
+   asm statement).  If so, calculate the cost of the new instruction
+   and check whether it's cheap enough.  */
+
+bool
+combine2::verify_combination (combination_attempt_rec &attempt)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  bool ok_p = verify_changes (0);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      if (!ok_p)
+	fprintf (dump_file, "failed to match this instruction:\n");
+      else if (const char *name = get_insn_name (INSN_CODE (insn)))
+	fprintf (dump_file, "successfully matched this instruction to %s:\n",
+		 name);
+      else
+	fprintf (dump_file, "successfully matched this instruction:\n");
+      print_rtl_single (dump_file, PATTERN (insn));
+    }
+  if (!ok_p)
+    return false;
+
+  unsigned int cost1 = attempt.sequence[0]->cost;
+  unsigned int cost2 = attempt.sequence[1]->cost;
+  attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
+  ok_p = (attempt.new_cost <= cost1 + cost2);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n",
+	     cost1, cost2, attempt.new_cost,
+	     ok_p ? "keeping replacement" : "rejecting replacement");
+  if (!ok_p)
+    return false;
+
+  confirm_change_group ();
+  return true;
+}
+
+/* Return true if we should consider register REGNO when calculating
+   register pressure estimates.  */
+
+static bool
+count_reg_pressure_p (unsigned int regno)
+{
+  if (regno == INVALID_REGNUM)
+    return false;
+
+  /* Unallocatable registers aren't interesting.  */
+  if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
+    return false;
+
+  return true;
+}
+
+/* Try to estimate the effect that the original form of INSN_INFO
+   had on register pressure, in the form "born - dying".  */
+
+int
+combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
+{
+  int delta = 0;
+
+  for (live_range_rec **def = insn_info->defs; *def; ++def)
+    if (count_reg_pressure_p ((*def)->regno))
+      delta += 1;
+
+  for (live_range_rec **use = insn_info->uses; *use; ++use)
+    if (count_reg_pressure_p ((*use)->regno)
+	&& known_last_use_p (*use, insn_info))
+      delta -= 1;
+
+  return delta;
+}
+
+/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
+   FROM_INSN.  Copy any useful information to TO_INSN before doing that.  */
+
+static void
+transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
+{
+  INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
+  INSN_CODE (to_insn) = INSN_CODE (from_insn);
+  REG_NOTES (to_insn) = REG_NOTES (from_insn);
+}
+
+/* The combination attempt in ATTEMPT has succeeded and is currently
+   part of an open validate_change group.  Commit to making the change
+   and decide where the new instruction should go.
+
+   KEPT_DEF_P is true if the new instruction continues to perform
+   the definition described by ATTEMPT.def_use_range.  */
+
+void
+combine2::commit_combination (combination_attempt_rec &attempt,
+			      bool kept_def_p)
+{
+  insn_info_rec *new_home = attempt.new_home;
+  rtx_insn *old_insn = attempt.sequence[0]->insn;
+  rtx_insn *new_insn = attempt.sequence[1]->insn;
+
+  /* Remove any notes that are no longer relevant.  */
+  bool single_set_p = single_set (new_insn);
+  for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	case REG_NOALIAS:
+	  keep_p = single_set_p;
+	  break;
+
+	case REG_UNUSED:
+	  keep_p = false;
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+
+  /* Complete the open validate_change group.  */
+  confirm_change_group ();
+
+  /* Decide where the new instruction should go.  */
+  unsigned int new_point = attempt.latest_point;
+  if (new_point != attempt.earliest_point
+      && prev_real_insn (new_insn) != old_insn)
+    {
+      /* Prefer the earlier point if the combined instruction reduces
+	 register pressure and the latest point if it increases register
+	 pressure.
+
+	 The choice isn't obvious in the event of a tie, but picking
+	 the earliest point should reduce the number of times that
+	 we need to invalidate debug insns.  */
+      int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
+      int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
+      bool move_up_p = (delta1 + delta2 <= 0);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "register pressure delta = %d + %d; using %s position\n",
+		 delta1, delta2, move_up_p ? "earliest" : "latest");
+      if (move_up_p)
+	new_point = attempt.earliest_point;
+    }
+
+  /* Translate inserting at NEW_POINT into inserting before or after
+     a particular insn.  */
+  rtx_insn *anchor = NULL;
+  bool before_p = (new_point & 1);
+  if (new_point != attempt.sequence[1]->point
+      && new_point != attempt.sequence[0]->point)
+    {
+      anchor = m_points[(new_point - m_end_of_sequence) / 2];
+      rtx_insn *other_side = (before_p
+			      ? prev_real_insn (anchor)
+			      : next_real_insn (anchor));
+      /* Inserting next to an insn X and then deleting X is just a
+	 roundabout way of using X as the insertion point.  */
+      if (anchor == new_insn || other_side == new_insn)
+	new_point = attempt.sequence[1]->point;
+      else if (anchor == old_insn || other_side == old_insn)
+	new_point = attempt.sequence[0]->point;
+    }
+
+  /* Actually perform the move.  */
+  if (new_point == attempt.sequence[1]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (new_insn));
+      set_insn_deleted (old_insn);
+    }
+  else if (new_point == attempt.sequence[0]->point)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+		 INSN_UID (old_insn));
+      PATTERN (old_insn) = PATTERN (new_insn);
+      transfer_insn (old_insn, new_insn);
+      std::swap (old_insn, new_insn);
+      set_insn_deleted (old_insn);
+    }
+  else
+    {
+      /* We need to insert a new instruction.  We can't simply move
+	 NEW_INSN because it acts as an insertion anchor in m_points.  */
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "inserting combined insn %s insn %d\n",
+		 before_p ? "before" : "after", INSN_UID (anchor));
+
+      rtx_insn *added_insn = (before_p
+			      ? emit_insn_before (PATTERN (new_insn), anchor)
+			      : emit_insn_after (PATTERN (new_insn), anchor));
+      transfer_insn (added_insn, new_insn);
+      set_insn_deleted (old_insn);
+      set_insn_deleted (new_insn);
+      new_insn = added_insn;
+    }
+  df_insn_rescan (new_insn);
+
+  /* Unlink the old uses.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      remove_range_use (*use, attempt.sequence[i]);
+
+  /* Work out which registers the new pattern uses.  */
+  bitmap_clear (m_true_deps);
+  df_ref use;
+  FOR_EACH_INSN_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+  FOR_EACH_INSN_EQ_USE (use, new_insn)
+    {
+      rtx reg = DF_REF_REAL_REG (use);
+      bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+    }
+
+  /* Describe the combined instruction in NEW_HOME.  */
+  new_home->insn = new_insn;
+  new_home->point = new_point;
+  new_home->cost = attempt.new_cost;
+
+  /* Build up a list of definitions for the combined instructions
+     and update all the ranges accordingly.  It shouldn't matter
+     which order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
+      if (kept_def_p || *def != attempt.def_use_range)
+	{
+	  obstack_ptr_grow (&m_insn_obstack, *def);
+	  (*def)->producer = new_home;
+	}
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* Build up a list of uses for the combined instructions and update
+     all the ranges accordingly.  Again, it shouldn't matter which
+     order we do this in.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+      if (*use != attempt.def_use_range
+	  && add_range_use (*use, new_home))
+	obstack_ptr_grow (&m_insn_obstack, *use);
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+  /* There shouldn't be any remaining references to other instructions
+     in the combination.  Invalidate their contents to make lingering
+     references a noisy failure.  */
+  for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+    if (attempt.sequence[i] != new_home)
+      {
+	attempt.sequence[i]->insn = NULL;
+	attempt.sequence[i]->point = ~0U;
+      }
+
+  /* Unlink the def-use range.  */
+  if (!kept_def_p && attempt.def_use_range)
+    {
+      live_range_rec *range = attempt.def_use_range;
+      if (range->prev_range)
+	range->prev_range->next_range = range->next_range;
+      else
+	m_reg_info[range->regno].range = range->next_range;
+      if (range->next_range)
+	range->next_range->prev_range = range->prev_range;
+    }
+
+  /* Record instructions whose new form alters the cfg.  */
+  rtx pattern = PATTERN (new_insn);
+  if ((returnjump_p (new_insn)
+       || any_uncondjump_p (new_insn)
+       || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
+      && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
+    m_cfg_altering_insns.safe_push (new_insn);
+}
+
+/* Return true if X1 and X2 are memories and if X1 does not have
+   a higher alignment than X2.  */
+
+static bool
+dubious_mem_pair_p (rtx x1, rtx x2)
+{
+  return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
+}
+
+/* Try to implement ATTEMPT using (parallel [SET1 SET2]).  */
+
+bool
+combine2::try_parallel_sets (combination_attempt_rec &attempt,
+			     rtx set1, rtx set2)
+{
+  rtx_insn *insn = attempt.sequence[1]->insn;
+
+  /* Combining two loads or two stores can be useful on targets that
+     allow them to be treated as a single access.  However, we use a
+     very peephole-like approach to picking the pairs, so we need to be
+     relatively confident that we're making a good choice.
+
+     For now just aim for cases in which the memory references are
+     consecutive and the first reference has a higher alignment.
+     We can leave the target to test the consecutive part; whatever test
+     we added here might be different from the target's, and in any case
+     it's fine if the target accepts other well-aligned cases too.  */
+  if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
+      || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
+    return false;
+
+  /* Cache the PARALLEL rtx between attempts so that we don't generate
+     too much garbage rtl.  */
+  if (!m_spare_parallel)
+    {
+      rtvec vec = gen_rtvec (2, set1, set2);
+      m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
+    }
+  else
+    {
+      XVECEXP (m_spare_parallel, 0, 0) = set1;
+      XVECEXP (m_spare_parallel, 0, 1) = set2;
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
+  if (verify_combination (attempt))
+    {
+      m_spare_parallel = NULL_RTX;
+      return true;
+    }
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* Try to parallelize the two instructions in ATTEMPT.  */
+
+bool
+combine2::try_parallelize_insns (combination_attempt_rec &attempt)
+{
+  rtx_insn *i1_insn = attempt.sequence[0]->insn;
+  rtx_insn *i2_insn = attempt.sequence[1]->insn;
+
+  /* Can't parallelize asm statements.  */
+  if (asm_noperands (PATTERN (i1_insn)) >= 0
+      || asm_noperands (PATTERN (i2_insn)) >= 0)
+    return false;
+
+  /* For now, just handle the case in which both instructions are
+     single sets.  We could handle more than 2 sets as well, but few
+     targets support that anyway.  */
+  rtx set1 = single_set (i1_insn);
+  if (!set1)
+    return false;
+  rtx set2 = single_set (i2_insn);
+  if (!set2)
+    return false;
+
+  /* Make sure that we have structural proof that the destinations
+     are independent.  Things like alias analysis rely on semantic
+     information and assume no undefined behavior, which is rarely a
+     good enough guarantee to allow a useful instruction combination.  */
+  rtx dest1 = SET_DEST (set1);
+  rtx dest2 = SET_DEST (set2);
+  if (MEM_P (dest1)
+      ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false)
+      : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
+    return false;
+
+  /* Try the sets in both orders.  */
+  if (try_parallel_sets (attempt, set1, set2)
+      || try_parallel_sets (attempt, set2, set1))
+    {
+      commit_combination (attempt, true);
+      if (MAY_HAVE_DEBUG_BIND_INSNS
+	  && attempt.new_home->insn != i1_insn)
+	propagate_for_debug (i1_insn, attempt.new_home->insn,
+			     SET_DEST (set1), SET_SRC (set1), m_bb);
+      return true;
+    }
+  return false;
+}
+
+/* Replace DEST with SRC in the register notes for INSN.  */
+
+static void
+substitute_into_note (rtx_insn *insn, rtx dest, rtx src)
+{
+  for (rtx *note_ptr = &REG_NOTES (insn); *note_ptr; )
+    {
+      rtx note = *note_ptr;
+      bool keep_p = true;
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_EQUAL:
+	case REG_EQUIV:
+	  keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0),
+						  dest, src);
+	  break;
+
+	default:
+	  break;
+	}
+      if (keep_p)
+	note_ptr = &XEXP (*note_ptr, 1);
+      else
+	{
+	  *note_ptr = XEXP (*note_ptr, 1);
+	  free_EXPR_LIST_node (note);
+	}
+    }
+}
+
+/* A subroutine of try_combine_def_use.  Try replacing DEST with SRC
+   in ATTEMPT.  SRC might be either the original SET_SRC passed to the
+   parent routine or a value pulled from a note; SRC_IS_NOTE_P is true
+   in the latter case.  */
+
+bool
+combine2::try_combine_def_use_1 (combination_attempt_rec &attempt,
+				 rtx dest, rtx src, bool src_is_note_p)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+
+  /* Mimic combine's behavior by not combining moves from allocatable hard
+     registers (e.g. when copying parameters or function return values).  */
+  if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
+    return false;
+
+  /* Don't mess with volatile references.  For one thing, we don't yet
+     know how many copies of SRC we'll need.  */
+  if (volatile_refs_p (src))
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to combine %d and %d%s:\n",
+	       INSN_UID (def_insn), INSN_UID (use_insn),
+	       src_is_note_p ? " using equal/equiv note" : "");
+      dump_insn_slim (dump_file, def_insn);
+      dump_insn_slim (dump_file, use_insn);
+    }
+
+  unsigned int num_changes = num_validated_changes ();
+  if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn),
+				      dest, src))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "combination failed -- unable to substitute"
+		 " all uses\n");
+      return false;
+    }
+
+  /* Try matching the instruction on its own if DEST isn't used elsewhere.  */
+  if (has_single_use_p (attempt.def_use_range)
+      && verify_combination (attempt))
+    {
+      live_range_rec *next_range = attempt.def_use_range->next_range;
+      substitute_into_note (use_insn, dest, src);
+      commit_combination (attempt, false);
+      if (MAY_HAVE_DEBUG_BIND_INSNS)
+	{
+	  rtx_insn *end_of_range = (next_range
+				    ? next_range->producer->insn
+				    : BB_END (m_bb));
+	  propagate_for_debug (def_insn, end_of_range, dest, src, m_bb);
+	}
+      return true;
+    }
+
+  /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN
+     pattern.  */
+  if (try_parallelize_insns (attempt))
+    return true;
+
+  cancel_changes (num_changes);
+  return false;
+}
+
+/* ATTEMPT describes an attempt to substitute the result of the first
+   instruction into the second instruction.  Try to implement it,
+   given that the first instruction sets DEST to SRC.  */
+
+bool
+combine2::try_combine_def_use (combination_attempt_rec &attempt,
+			       rtx dest, rtx src)
+{
+  rtx_insn *def_insn = attempt.sequence[0]->insn;
+  rtx_insn *use_insn = attempt.sequence[1]->insn;
+  rtx def_note = find_reg_equal_equiv_note (def_insn);
+
+  /* First try combining the instructions in their original form.  */
+  if (try_combine_def_use_1 (attempt, dest, src, false))
+    return true;
+
+  /* Try to replace DEST with a REG_EQUAL/EQUIV value instead.  */
+  if (def_note
+      && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+    return true;
+
+  /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try
+     using that instead of the main pattern.  */
+  for (rtx *link_ptr = &REG_NOTES (use_insn); *link_ptr;
+       link_ptr = &XEXP (*link_ptr, 1))
+    {
+      rtx use_note = *link_ptr;
+      if (REG_NOTE_KIND (use_note) != REG_EQUAL
+	  && REG_NOTE_KIND (use_note) != REG_EQUIV)
+	continue;
+
+      rtx use_set = single_set (use_insn);
+      if (!use_set)
+	break;
+
+      if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0)))
+	continue;
+
+      /* Try snipping out the note and putting it in the SET instead.  */
+      validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1);
+      validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1);
+
+      if (try_combine_def_use_1 (attempt, dest, src, false))
+	return true;
+
+      if (def_note
+	  && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+	return true;
+
+      cancel_changes (0);
+    }
+
+  return false;
+}
+
+/* ATTEMPT describes an attempt to combine two instructions that use
+   the same resource.  Try to implement it, returning true on success.  */
+
+bool
+combine2::try_combine_two_uses (combination_attempt_rec &attempt)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "trying to parallelize %d and %d:\n",
+	       INSN_UID (attempt.sequence[0]->insn),
+	       INSN_UID (attempt.sequence[1]->insn));
+      dump_insn_slim (dump_file, attempt.sequence[0]->insn);
+      dump_insn_slim (dump_file, attempt.sequence[1]->insn);
+    }
+
+  return try_parallelize_insns (attempt);
+}
+
+/* Try to optimize instruction I1.  Return true on success.  */
+
+bool
+combine2::optimize_insn (insn_info_rec *i1)
+{
+  combination_attempt_rec attempt;
+
+  if (!combinable_insn_p (i1->insn, false))
+    return false;
+
+  rtx set = single_set (i1->insn);
+  if (!set)
+    return false;
+
+  /* First try combining I1 with a user of its result.  */
+  rtx dest = SET_DEST (set);
+  rtx src = SET_SRC (set);
+  if (REG_P (dest) && REG_NREGS (dest) == 1)
+    for (live_range_rec **def = i1->defs; *def; ++def)
+      if ((*def)->regno == REGNO (dest))
+	{
+	  for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+	    {
+	      insn_info_rec *use = (*def)->users[i];
+	      if (use
+		  && combinable_insn_p (use->insn, has_single_use_p (*def))
+		  && start_combination (attempt, i1, use, *def)
+		  && try_combine_def_use (attempt, dest, src))
+		return true;
+	    }
+	  break;
+	}
+
+  /* Try parallelizing I1 and another instruction that uses the same
+     resource.  */
+  bitmap_clear (m_tried_insns);
+  for (live_range_rec **use = i1->uses; *use; ++use)
+    for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+      {
+	insn_info_rec *i2 = (*use)->users[i];
+	if (i2
+	    && i2 != i1
+	    && combinable_insn_p (i2->insn, false)
+	    && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn))
+	    && start_combination (attempt, i1, i2)
+	    && try_combine_two_uses (attempt))
+	  return true;
+      }
+
+  return false;
+}
+
+/* A note_stores callback.  Set the bool at DATA to true if DEST is in
+   memory.  */
+
+static void
+find_mem_def (rtx dest, const_rtx, void *data)
+{
+  /* note_stores has stripped things like subregs and zero_extracts,
+     so we don't need to worry about them here.  */
+  if (MEM_P (dest))
+    *(bool *) data = true;
+}
+
+/* Record all register and memory definitions in INSN_INFO and fill in its
+   "defs" list.  */
+
+void
+combine2::record_defs (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register definitions.  */
+  df_ref def;
+  FOR_EACH_INSN_DEF (def, insn)
+    {
+      rtx reg = DF_REF_REAL_REG (def);
+      unsigned int end_regno = END_REGNO (reg);
+      for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+	{
+	  live_range_rec *range = reg_live_range (regno);
+	  range->producer = insn_info;
+	  m_reg_info[regno].live_p = false;
+	  obstack_ptr_grow (&m_insn_obstack, range);
+	}
+    }
+
+  /* If the instruction writes to memory, record that too.  */
+  bool saw_mem_p = false;
+  note_stores (insn, find_mem_def, &saw_mem_p);
+  if (saw_mem_p)
+    {
+      live_range_rec *range = mem_live_range ();
+      range->producer = insn_info;
+      obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of definitions.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Record that INSN_INFO contains register use USE.  If this requires
+   new entries to be added to INSN_INFO->uses, add those entries to the
+   list we're building in m_insn_obstack.  */
+
+void
+combine2::record_reg_use (insn_info_rec *insn_info, df_ref use)
+{
+  rtx reg = DF_REF_REAL_REG (use);
+  unsigned int end_regno = END_REGNO (reg);
+  for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+    {
+      live_range_rec *range = reg_live_range (regno);
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+      m_reg_info[regno].live_p = true;
+    }
+}
+
+/* A note_uses callback.  Set the bool at DATA to true if *LOC reads
+   from variable memory.  */
+
+static void
+find_mem_use (rtx *loc, void *data)
+{
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, *loc, NONCONST)
+    if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+      {
+	*(bool *) data = true;
+	break;
+      }
+}
+
+/* Record all register and memory uses in INSN_INFO and fill in its
+   "uses" list.  */
+
+void
+combine2::record_uses (insn_info_rec *insn_info)
+{
+  rtx_insn *insn = insn_info->insn;
+
+  /* Record register uses in the main pattern.  */
+  df_ref use;
+  FOR_EACH_INSN_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Treat REG_EQUAL uses as first-class uses.  We don't lose much
+     by doing that, since it's rare for a REG_EQUAL note to mention
+     registers that the main pattern doesn't.  It also gives us the
+     maximum freedom to use REG_EQUAL notes in place of the main pattern.  */
+  FOR_EACH_INSN_EQ_USE (use, insn)
+    record_reg_use (insn_info, use);
+
+  /* Record a memory use if either the pattern or the notes read from
+     memory.  */
+  bool saw_mem_p = false;
+  note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p);
+  for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p);
+  if (saw_mem_p)
+    {
+      live_range_rec *range = mem_live_range ();
+      if (add_range_use (range, insn_info))
+	obstack_ptr_grow (&m_insn_obstack, range);
+    }
+
+  /* Complete the list of uses.  */
+  obstack_ptr_grow (&m_insn_obstack, NULL);
+  insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Start a new instruction sequence, discarding all information about
+   the previous one.  */
+
+void
+combine2::start_sequence (void)
+{
+  m_end_of_sequence = m_point;
+  m_mem_range = NULL;
+  m_points.truncate (0);
+  obstack_free (&m_insn_obstack, m_insn_obstack_start);
+  obstack_free (&m_range_obstack, m_range_obstack_start);
+}
+
+/* Run the pass on the current function.  */
+
+void
+combine2::execute (void)
+{
+  df_analyze ();
+  FOR_EACH_BB_FN (m_bb, cfun)
+    {
+      m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb);
+      m_end_of_bb = m_point;
+      start_sequence ();
+
+      rtx_insn *insn, *prev;
+      FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev)
+	{
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  /* The current m_point represents the end of the sequence if
+	     INSN is the last instruction in the sequence, otherwise it
+	     represents the gap between INSN and the next instruction.
+	     m_point + 1 represents INSN itself.
+
+	     Instructions can be added to m_point by inserting them
+	     after INSN.  They can be added to m_point + 1 by inserting
+	     them before INSN.  */
+	  m_points.safe_push (insn);
+	  m_point += 1;
+
+	  insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec);
+	  insn_info->insn = insn;
+	  insn_info->point = m_point;
+	  insn_info->cost = UNKNOWN_COST;
+
+	  record_defs (insn_info);
+	  record_uses (insn_info);
+
+	  /* Set up m_point for the next instruction.  */
+	  m_point += 1;
+
+	  if (CALL_P (insn))
+	    start_sequence ();
+	  else
+	    while (optimize_insn (insn_info))
+	      gcc_assert (insn_info->insn);
+	}
+    }
+
+  /* If an instruction changes the cfg, update the containing block
+     accordingly.  */
+  rtx_insn *insn;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn)
+    if (JUMP_P (insn))
+      {
+	mark_jump_label (PATTERN (insn), insn, 0);
+	update_cfg_for_uncondjump (insn);
+      }
+    else
+      {
+	remove_edge (split_block (BLOCK_FOR_INSN (insn), insn));
+	emit_barrier_after_bb (BLOCK_FOR_INSN (insn));
+      }
+
+  /* Propagate the above block-local cfg changes to the rest of the cfg.  */
+  if (!m_cfg_altering_insns.is_empty ())
+    {
+      if (dom_info_available_p (CDI_DOMINATORS))
+	free_dominance_info (CDI_DOMINATORS);
+      timevar_push (TV_JUMP);
+      rebuild_jump_labels (get_insns ());
+      cleanup_cfg (0);
+      timevar_pop (TV_JUMP);
+    }
+}
+
+const pass_data pass_data_combine2 =
+{
+  RTL_PASS, /* type */
+  "combine2", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_COMBINE2, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_combine2 : public rtl_opt_pass
+{
+public:
+  pass_combine2 (gcc::context *ctxt, int flag)
+    : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag)
+  {}
+
+  bool
+  gate (function *) OVERRIDE
+  {
+    return optimize && (param_run_combine & m_flag) != 0;
+  }
+
+  unsigned int
+  execute (function *f) OVERRIDE
+  {
+    combine2 (f).execute ();
+    return 0;
+  }
+
+private:
+  unsigned int m_flag;
+}; // class pass_combine2
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_combine2_before (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 1);
+}
+
+rtl_opt_pass *
+make_pass_combine2_after (gcc::context *ctxt)
+{
+  return new pass_combine2 (ctxt, 4);
+}