Message ID | mpteey6xeip.fsf@arm.com |
---|---|
State | New |
Headers | show |
Series | Add a new combine pass | expand |
On Sun, Nov 17, 2019 at 3:35 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > (It's 23:35 local time, so it's still just about stage 1. :-)) > > While working on SVE, I've noticed several cases in which we fail > to combine instructions because the combined form would need to be > placed earlier in the instruction stream than the last of the > instructions being combined. This includes one very important > case in the handling of the first fault register (FFR). > > Combine currently requires the combined instruction to live at the same > location as i3. I thought about trying to relax that restriction, but it > would be difficult to do with the current pass structure while keeping > everything linear-ish time. > > So this patch instead goes for an option that has been talked about > several times over the years: writing a new combine pass that just > does instruction combination, and not all the other optimisations > that have been bolted onto combine over time. E.g. it deliberately > doesn't do things like nonzero-bits tracking, since that really ought > to be a separate, more global, optimisation. > > This is still far from being a realistic replacement for the even > the combine parts of the current combine pass. E.g.: > > - it only handles combinations that can be built up from individual > two-instruction combinations. > > - it doesn't allow new hard register clobbers to be added. > > - it doesn't have the special treatment of CC operations. > > - etc. > > But we have to start somewhere. > > On a more positive note, the pass handles things that the current > combine pass doesn't: > > - the main motivating feature mentioned above: it works out where > the combined instruction could validly live and moves it there > if necessary. If there are a range of valid places, it tries > to pick the best one based on register pressure (although only > with a simple heuristic for now). > > - once it has combined two instructions, it can try combining the > result with both later and earlier code, i.e. it can combine > in both directions. > > - it tries using REG_EQUAL notes for the final instruction. > > - it can parallelise two independent instructions that both read from > the same register or both read from memory. > > This last feature is useful for generating more load-pair combinations > on AArch64. In some cases it can also produce more store-pair combinations, > but only for consecutive stores. However, since the pass currently does > this in a very greedy, peephole way, it only allows load/store-pair > combinations if the first memory access has a higher alignment than > the second, i.e. if we can be sure that the combined access is naturally > aligned. This should help it to make better decisions than the post-RA > peephole pass in some cases while not being too aggressive. > > The pass is supposed to be linear time without debug insns. > It only tries a constant number C of combinations per instruction > and its bookkeeping updates are constant-time. Once it has combined two > instructions, it'll try up to C combinations on the result, but this can > be counted against the instruction that was deleted by the combination > and so effectively just doubles the constant. (Note that C depends > on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.) > > Unfortunately, debug updates via propagate_for_debug are more expensive. > This could probably be fixed if the pass did more to track debug insns > itself, but using propagate_for_debug matches combine's behaviour. > > The patch adds two instances of the new pass: one before combine and > one after it. By default both are disabled, but this can be changed > using the new 3-bit run-combine param, where: > > - bit 0 selects the new pre-combine pass > - bit 1 selects the main combine pass > - bit 2 selects the new post-combine pass > > The idea is that run-combine=3 can be used to see which combinations > are missed by the new pass, while run-combine=6 (which I hope to be > the production setting for AArch64 at -O2+) just uses the new pass > to mop up cases that normal combine misses. Maybe in some distant > future, the pass will be good enough for run-combine=[14] to be a > realistic option. > > I ended up having to add yet another validate_simplify_* routine, > this time to do the equivalent of: > > newx = simplify_replace_rtx (*loc, old_rtx, new_rtx); > validate_change (insn, loc, newx, 1); > > but in a more memory-efficient way. validate_replace_rtx isn't suitable > because it deliberately only tries simplifications in limited cases: > > /* Do changes needed to keep rtx consistent. Don't do any other > simplifications, as it is not our job. */ > > And validate_simplify_insn isn't useful for this case because it works > on patterns that have already had changes made to them and expects > those patterns to be valid rtxes. simplify-replace operations instead > need to simplify as they go, when the original modes are still to hand. > > As far as compile-time goes, I tried compiling optabs.ii at -O2 > with an --enable-checking=release compiler: > > run-combine=2 (normal combine): 100.0% (baseline) > run-combine=4 (new pass only) 98.0% > run-combine=6 (both passes) 100.3% > > where the results are easily outside the noise. So the pass on > its own is quicker than combine, but that's not a fair comparison > when it doesn't do everything combine does. Running both passes > only has a slight overhead. > > To get a feel for the effect on multiple targets, I did my usual > bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > and g++.dg, this time comparing run-combine=2 and run-combine=6 > using -O2 -ftree-vectorize: > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 3974 -39393 -2275 90 -2 > aarch64_be-linux-gnu 3389 -36683 -2275 165 -2 > alpha-linux-gnu 4154 -62860 -2132 335 -2 > amdgcn-amdhsa 4818 9079 -7987 51850 -2 > arc-elf 2868 -63710 -18998 286 -1 > arm-linux-gnueabi 4053 -80404 -10019 605 -2 > arm-linux-gnueabihf 4053 -80404 -10019 605 -2 > avr-elf 3620 38513 -2386 23364 2 > bfin-elf 2691 -32973 -1483 1127 -2 > bpf-elf 5581 -78105 -11064 113 -3 > c6x-elf 3915 -31710 -2441 1560 -2 > cr16-elf 6030 192102 -1757 60009 12 > cris-elf 2217 -30794 -1716 294 -2 > csky-elf 2003 -24989 -9999 1468 -2 > epiphany-elf 3345 -19416 -1803 4594 -2 > fr30-elf 3562 -15077 -1921 2334 -1 > frv-linux-gnu 2423 -16589 -1736 999 -1 > ft32-elf 2246 -46337 -15988 433 -2 > h8300-elf 2581 -33553 -1403 168 -2 > hppa64-hp-hpux11.23 3926 -120876 -50134 1056 -2 > i686-apple-darwin 3562 -46851 -1764 310 -2 > i686-pc-linux-gnu 2902 -3639 -4809 6848 -2 > ia64-linux-gnu 2900 -158870 -14006 428 -7 > iq2000-elf 2929 -54690 -2904 2576 -3 > lm32-elf 5265 162519 -1918 8004 5 > m32r-elf 1861 -25296 -2713 1004 -2 > m68k-linux-gnu 2520 -241573 -21879 200 -3 > mcore-elf 2378 -28532 -1810 1635 -2 > microblaze-elf 2782 -137363 -9516 1986 -2 > mipsel-linux-gnu 2443 -38422 -8331 458 -1 > mipsisa64-linux-gnu 2287 -60294 -12214 432 -2 > mmix 4910 -136549 -13616 599 -2 > mn10300-elf 2944 -29151 -2488 132 -1 > moxie-rtems 1935 -12364 -1002 125 -1 > msp430-elf 2379 -37007 -2163 176 -2 > nds32le-elf 2356 -27551 -2126 163 -1 > nios2-linux-gnu 1572 -44828 -23613 92 -2 > nvptx-none 1014 -17337 -1590 16 -3 > or1k-elf 2724 -92816 -14144 56 -3 > pdp11 1897 -27296 -1370 534 -2 > powerpc-ibm-aix7.0 2909 -58829 -10026 2001 -2 > powerpc64-linux-gnu 3685 -60551 -12158 2001 -1 > powerpc64le-linux-gnu 3501 -61846 -10024 765 -2 > pru-elf 1574 -29734 -19998 1718 -1 > riscv32-elf 2357 -22506 -10002 10175 -1 > riscv64-elf 3320 -56777 -10002 226 -2 > rl78-elf 2113 -232328 -18607 4065 -3 > rx-elf 2800 -38515 -896 491 -2 > s390-linux-gnu 3582 -75626 -12098 3999 -2 > s390x-linux-gnu 3761 -73473 -13748 3999 -2 > sh-linux-gnu 2350 -26401 -1003 522 -2 > sparc-linux-gnu 3279 -49518 -2175 2223 -2 > sparc64-linux-gnu 3849 -123084 -30200 2141 -2 > tilepro-linux-gnu 2737 -35562 -3458 2848 -2 > v850-elf 9002 -169126 -49996 76 -4 > vax-netbsdelf 3325 -57734 -10000 1989 -2 > visium-elf 1860 -17006 -1006 1066 -2 > x86_64-darwin 3278 -48933 -9999 1408 -2 > x86_64-linux-gnu 3008 -43887 -9999 3248 -2 > xstormy16-elf 2497 -26569 -2051 89 -2 > xtensa-elf 2161 -31231 -6910 138 -2 > > So running both passes does seem to have a significant benefit > on most targets, but there are some nasty-looking outliers. > The usual caveat applies: number of lines is a very poor measurement, > it's just to get a feel. > > Bootstrapped & regression-tested on aarch64-linux-gnu and > x86_64-linux-gnu with both run-combine=3 as the default (so that the new > pass runs first) and with run-combine=6 as the default (so that the new > pass runs second). There were no new execution failures. A couple of > guality.exp tests that already failed for most options started failing > for a couple more. Enabling the pass fixes the XFAILs in: > > gcc.target/aarch64/sve/acle/general/ptrue_pat_[234].c > > Inevitably there was some scan-assembler fallout for other tests. > E.g. in gcc.target/aarch64/vmov_n_1.c: > > #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory") > ... > INHIB_OPTIMIZATION; \ > (a) = TEST (test, data_len); \ > INHIB_OPTIMIZATION; \ > (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a)); \ > > is no longer effective for preventing move (a) from being merged > into (b), because the pass can merge at the point of (a). I think > this is a valid thing to do -- the asm semantics are still satisfied, > and asm volatile ("" : : : "memory") never acted as a register barrier. > But perhaps we should deal with this as a special case? Not really. I think the testcase should be changed to: INHIB_OPT_VAR(a) instead. Where INHIB_OPT_VAR should be: #define INHIB_OPT_VAR(a) asm("":"+X"(a)); Since it is obviously not doing the correct testing in the first place. Even then, this testcase is huge and really should be broken up into different testcases. Thanks, Andrew > > Richard > > > 2019-11-17 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * Makefile.in (OBJS): Add combine2.o > * params.opt (--param=run-combine): New option. > * doc/invoke.texi: Document it. > * tree-pass.h (make_pass_combine2_before): Declare. > (make_pass_combine2_after): Likewise. > * passes.def: Add them. > * timevar.def (TV_COMBINE2): New timevar. > * cfgrtl.h (update_cfg_for_uncondjump): Declare. > * combine.c (update_cfg_for_uncondjump): Move to... > * cfgrtl.c (update_cfg_for_uncondjump): ...here. > * simplify-rtx.c (simplify_truncation): Handle comparisons. > * recog.h (validate_simplify_replace_rtx): Declare. > * recog.c (validate_simplify_replace_rtx_1): New function. > (validate_simplify_replace_rtx_uses): Likewise. > (validate_simplify_replace_rtx): Likewise. > * combine2.c: New file. > > Index: gcc/Makefile.in > =================================================================== > --- gcc/Makefile.in 2019-11-14 14:34:27.599783740 +0000 > +++ gcc/Makefile.in 2019-11-17 23:15:31.188500613 +0000 > @@ -1261,6 +1261,7 @@ OBJS = \ > cgraphunit.o \ > cgraphclones.o \ > combine.o \ > + combine2.o \ > combine-stack-adj.o \ > compare-elim.o \ > context.o \ > Index: gcc/params.opt > =================================================================== > --- gcc/params.opt 2019-11-14 14:34:26.339792215 +0000 > +++ gcc/params.opt 2019-11-17 23:15:31.200500531 +0000 > @@ -768,6 +768,10 @@ Use internal function id in profile look > Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param > Maximum depth of a loop nest to fully value-number optimistically. > > +-param=run-combine= > +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param > +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass. > + > -param=sccvn-max-alias-queries-per-access= > Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param > Maximum number of disambiguations to perform per memory access. > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi 2019-11-16 10:43:45.597105823 +0000 > +++ gcc/doc/invoke.texi 2019-11-17 23:15:31.200500531 +0000 > @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las > @item max-combine-insns > The maximum number of instructions the RTL combiner tries to combine. > > +@item run-combine > +Choose which of the 3 available combine passes to run: bit 1 for the main > +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 > +for a later variant of the combine pass. > + > @item integer-share-limit > Small integer constants can use a shared data structure, reducing the > compiler's memory usage and increasing its speed. This sets the maximum > Index: gcc/tree-pass.h > =================================================================== > --- gcc/tree-pass.h 2019-10-29 08:29:03.096444049 +0000 > +++ gcc/tree-pass.h 2019-11-17 23:15:31.204500501 +0000 > @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i > extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt); > Index: gcc/passes.def > =================================================================== > --- gcc/passes.def 2019-10-29 08:29:03.224443133 +0000 > +++ gcc/passes.def 2019-11-17 23:15:31.200500531 +0000 > @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. > NEXT_PASS (pass_inc_dec); > NEXT_PASS (pass_initialize_regs); > NEXT_PASS (pass_ud_rtl_dce); > + NEXT_PASS (pass_combine2_before); > NEXT_PASS (pass_combine); > + NEXT_PASS (pass_combine2_after); > NEXT_PASS (pass_if_after_combine); > NEXT_PASS (pass_jump_after_combine); > NEXT_PASS (pass_partition_blocks); > Index: gcc/timevar.def > =================================================================== > --- gcc/timevar.def 2019-10-11 15:43:53.403498517 +0100 > +++ gcc/timevar.def 2019-11-17 23:15:31.204500501 +0000 > @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , " > DEFTIMEVAR (TV_CSE2 , "CSE 2") > DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction") > DEFTIMEVAR (TV_COMBINE , "combiner") > +DEFTIMEVAR (TV_COMBINE2 , "second combiner") > DEFTIMEVAR (TV_IFCVT , "if-conversion") > DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") > DEFTIMEVAR (TV_SMS , "sms modulo scheduling") > Index: gcc/cfgrtl.h > =================================================================== > --- gcc/cfgrtl.h 2019-03-08 18:15:39.320730391 +0000 > +++ gcc/cfgrtl.h 2019-11-17 23:15:31.192500584 +0000 > @@ -47,6 +47,7 @@ extern void fixup_partitions (void); > extern bool purge_dead_edges (basic_block); > extern bool purge_all_dead_edges (void); > extern bool fixup_abnormal_edges (void); > +extern void update_cfg_for_uncondjump (rtx_insn *); > extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *); > extern void relink_block_chain (bool); > extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *); > Index: gcc/combine.c > =================================================================== > --- gcc/combine.c 2019-11-13 08:42:45.537368745 +0000 > +++ gcc/combine.c 2019-11-17 23:15:31.192500584 +0000 > @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg) > && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT; > } > > -/* Delete the unconditional jump INSN and adjust the CFG correspondingly. > - Note that the INSN should be deleted *after* removing dead edges, so > - that the kept edge is the fallthrough edge for a (set (pc) (pc)) > - but not for a (set (pc) (label_ref FOO)). */ > - > -static void > -update_cfg_for_uncondjump (rtx_insn *insn) > -{ > - basic_block bb = BLOCK_FOR_INSN (insn); > - gcc_assert (BB_END (bb) == insn); > - > - purge_dead_edges (bb); > - > - delete_insn (insn); > - if (EDGE_COUNT (bb->succs) == 1) > - { > - rtx_insn *insn; > - > - single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > - > - /* Remove barriers from the footer if there are any. */ > - for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > - if (BARRIER_P (insn)) > - { > - if (PREV_INSN (insn)) > - SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > - else > - BB_FOOTER (bb) = NEXT_INSN (insn); > - if (NEXT_INSN (insn)) > - SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > - } > - else if (LABEL_P (insn)) > - break; > - } > -} > - > /* Return whether PAT is a PARALLEL of exactly N register SETs followed > by an arbitrary number of CLOBBERs. */ > static bool > @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine = > {} > > /* opt_pass methods: */ > - virtual bool gate (function *) { return (optimize > 0); } > + virtual bool gate (function *) > + { > + return optimize > 0 && (param_run_combine & 2) != 0; > + } > virtual unsigned int execute (function *) > { > return rest_of_handle_combine (); > Index: gcc/cfgrtl.c > =================================================================== > --- gcc/cfgrtl.c 2019-10-17 14:22:55.523309009 +0100 > +++ gcc/cfgrtl.c 2019-11-17 23:15:31.188500613 +0000 > @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void) > return inserted; > } > > +/* Delete the unconditional jump INSN and adjust the CFG correspondingly. > + Note that the INSN should be deleted *after* removing dead edges, so > + that the kept edge is the fallthrough edge for a (set (pc) (pc)) > + but not for a (set (pc) (label_ref FOO)). */ > + > +void > +update_cfg_for_uncondjump (rtx_insn *insn) > +{ > + basic_block bb = BLOCK_FOR_INSN (insn); > + gcc_assert (BB_END (bb) == insn); > + > + purge_dead_edges (bb); > + > + delete_insn (insn); > + if (EDGE_COUNT (bb->succs) == 1) > + { > + rtx_insn *insn; > + > + single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > + > + /* Remove barriers from the footer if there are any. */ > + for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > + if (BARRIER_P (insn)) > + { > + if (PREV_INSN (insn)) > + SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > + else > + BB_FOOTER (bb) = NEXT_INSN (insn); > + if (NEXT_INSN (insn)) > + SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > + } > + else if (LABEL_P (insn)) > + break; > + } > +} > + > /* Cut the insns from FIRST to LAST out of the insns stream. */ > > rtx_insn * > Index: gcc/simplify-rtx.c > =================================================================== > --- gcc/simplify-rtx.c 2019-11-16 15:33:36.642840131 +0000 > +++ gcc/simplify-rtx.c 2019-11-17 23:15:31.204500501 +0000 > @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode, > && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1) > return constm1_rtx; > > + /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result > + in a narrower mode if useful. */ > + if (COMPARISON_P (op)) > + return simplify_gen_relational (GET_CODE (op), mode, VOIDmode, > + XEXP (op, 0), XEXP (op, 1)); > + > return NULL_RTX; > } > > Index: gcc/recog.h > =================================================================== > --- gcc/recog.h 2019-09-09 18:58:28.860430363 +0100 > +++ gcc/recog.h 2019-11-17 23:15:31.204500501 +0000 > @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos > extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *); > extern void validate_replace_src_group (rtx, rtx, rtx_insn *); > extern bool validate_simplify_insn (rtx_insn *insn); > +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx); > extern int num_changes_pending (void); > extern int next_insn_tests_no_inequality (rtx_insn *); > extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode); > Index: gcc/recog.c > =================================================================== > --- gcc/recog.c 2019-10-01 09:55:35.150088599 +0100 > +++ gcc/recog.c 2019-11-17 23:15:31.204500501 +0000 > @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn) > } > return ((num_changes_pending () > 0) && (apply_change_group () > 0)); > } > + > +/* A subroutine of validate_simplify_replace_rtx. Apply the replacement > + described by R to LOC. Return true on success; leave the caller > + to clean up on failure. */ > + > +static bool > +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc) > +{ > + rtx x = *loc; > + enum rtx_code code = GET_CODE (x); > + machine_mode mode = GET_MODE (x); > + > + if (rtx_equal_p (x, r.from)) > + { > + validate_unshare_change (r.insn, loc, r.to, 1); > + return true; > + } > + > + /* Recursively apply the substitution and see if we can simplify > + the result. This specifically shouldn't use simplify_gen_*, > + since we want to avoid generating new expressions where possible. */ > + int old_num_changes = num_validated_changes (); > + rtx newx = NULL_RTX; > + bool recurse_p = false; > + switch (GET_RTX_CLASS (code)) > + { > + case RTX_UNARY: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))) > + return false; > + > + newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode); > + break; > + } > + > + case RTX_BIN_ARITH: > + case RTX_COMM_ARITH: > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_binary_operation (code, mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_COMPARE: > + case RTX_COMM_COMPARE: > + { > + machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode > + ? GET_MODE (XEXP (x, 0)) > + : GET_MODE (XEXP (x, 1))); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_relational_operation (code, mode, op_mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_TERNARY: > + case RTX_BITFIELD_OPS: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2))) > + return false; > + > + newx = simplify_ternary_operation (code, mode, op0_mode, > + XEXP (x, 0), XEXP (x, 1), > + XEXP (x, 2)); > + break; > + } > + > + case RTX_EXTRA: > + if (code == SUBREG) > + { > + machine_mode inner_mode = GET_MODE (SUBREG_REG (x)); > + if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x))) > + return false; > + > + rtx inner = SUBREG_REG (x); > + newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x)); > + /* Reject the same cases that simplify_gen_subreg would. */ > + if (!newx > + && (GET_CODE (inner) == SUBREG > + || GET_CODE (inner) == CONCAT > + || GET_MODE (inner) == VOIDmode > + || !validate_subreg (mode, inner_mode, > + inner, SUBREG_BYTE (x)))) > + return false; > + break; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_OBJ: > + if (code == LO_SUM) > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + /* (lo_sum (high x) y) -> y where x and y have the same base. */ > + rtx op0 = XEXP (x, 0); > + rtx op1 = XEXP (x, 1); > + if (GET_CODE (op0) == HIGH) > + { > + rtx base0, base1, offset0, offset1; > + split_const (XEXP (op0, 0), &base0, &offset0); > + split_const (op1, &base1, &offset1); > + if (rtx_equal_p (base0, base1)) > + newx = op1; > + } > + } > + else if (code == REG) > + { > + if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from)) > + return false; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_CONST_OBJ: > + break; > + > + case RTX_AUTOINC: > + if (reg_overlap_mentioned_p (XEXP (x, 0), r.from)) > + return false; > + recurse_p = true; > + break; > + > + case RTX_MATCH: > + case RTX_INSN: > + gcc_unreachable (); > + } > + > + if (recurse_p) > + { > + const char *fmt = GET_RTX_FORMAT (code); > + for (int i = 0; fmt[i]; i++) > + switch (fmt[i]) > + { > + case 'E': > + for (int j = 0; j < XVECLEN (x, i); j++) > + if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j))) > + return false; > + break; > + > + case 'e': > + if (XEXP (x, i) > + && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i))) > + return false; > + break; > + } > + } > + > + if (newx && !rtx_equal_p (x, newx)) > + { > + /* There's no longer any point unsharing the substitutions made > + for subexpressions, since we'll just copy this one instead. */ > + for (int i = old_num_changes; i < num_changes; ++i) > + changes[i].unshare = false; > + validate_unshare_change (r.insn, loc, newx, 1); > + } > + > + return true; > +} > + > +/* A note_uses callback for validate_simplify_replace_rtx. > + DATA points to a validate_replace_src_data object. */ > + > +static void > +validate_simplify_replace_rtx_uses (rtx *loc, void *data) > +{ > + validate_replace_src_data &r = *(validate_replace_src_data *) data; > + if (r.insn && !validate_simplify_replace_rtx_1 (r, loc)) > + r.insn = NULL; > +} > + > +/* Try to perform the equivalent of: > + > + newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX); > + validate_change (INSN, LOC, newx, 1); > + > + but without generating as much garbage rtl when the resulting > + pattern doesn't match. > + > + Return true if we were able to replace all uses of OLD_RTX in *LOC > + and if the result conforms to general rtx rules (e.g. for whether > + subregs are meaningful). > + > + When returning true, add all replacements to the current validation group, > + leaving the caller to test it in the normal way. Leave both *LOC and the > + validation group unchanged on failure. */ > + > +bool > +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc, > + rtx old_rtx, rtx new_rtx) > +{ > + validate_replace_src_data r; > + r.from = old_rtx; > + r.to = new_rtx; > + r.insn = insn; > + > + unsigned int num_changes = num_validated_changes (); > + note_uses (loc, validate_simplify_replace_rtx_uses, &r); > + if (!r.insn) > + { > + cancel_changes (num_changes); > + return false; > + } > + return true; > +} > > /* Return 1 if the insn using CC0 set by INSN does not contain > any ordered tests applied to the condition codes. > Index: gcc/combine2.c > =================================================================== > --- /dev/null 2019-09-17 11:41:18.176664108 +0100 > +++ gcc/combine2.c 2019-11-17 23:15:31.196500559 +0000 > @@ -0,0 +1,1576 @@ > +/* Combine instructions > + Copyright (C) 2019 Free Software Foundation, Inc. > + > +This file is part of GCC. > + > +GCC is free software; you can redistribute it and/or modify it under > +the terms of the GNU General Public License as published by the Free > +Software Foundation; either version 3, or (at your option) any later > +version. > + > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY > +WARRANTY; without even the implied warranty of MERCHANTABILITY or > +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "backend.h" > +#include "rtl.h" > +#include "df.h" > +#include "tree-pass.h" > +#include "memmodel.h" > +#include "emit-rtl.h" > +#include "insn-config.h" > +#include "recog.h" > +#include "print-rtl.h" > +#include "rtl-iter.h" > +#include "predict.h" > +#include "cfgcleanup.h" > +#include "cfghooks.h" > +#include "cfgrtl.h" > +#include "alias.h" > +#include "valtrack.h" > + > +/* This pass tries to combine instructions in the following ways: > + > + (1) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + and I2 is the only user of DEST1, the pass tries to combine them into: > + > + I2: (...SRC1...) > + > + (2) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + the pass tries to combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (...SRC1...)]) > + > + or: > + > + I2: (parallel [(...SRC1...) (set DEST1 SRC1)]) > + > + (3) If we have two independent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (set DEST2 SRC2) > + > + that read from memory or from the same register, the pass tries to > + combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)]) > + > + or: > + > + I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)]) > + > + If the combined form is a valid instruction, the pass tries to find a > + place between I1 and I2 inclusive for the new instruction. If there > + are multiple valid locations, it tries to pick the best one by taking > + the effect on register pressure into account. > + > + If a combination succeeds and produces a single set, the pass tries to > + combine the new form with earlier or later instructions. > + > + The pass currently optimizes each basic block separately. It walks > + the instructions in reverse order, building up live ranges for registers > + and memory. It then uses these live ranges to look for possible > + combination opportunities and to decide where the combined instructions > + could be placed. > + > + The pass represents positions in the block using point numbers, > + with higher numbers indicating earlier instructions. The numbering > + scheme is that: > + > + - the end of the current instruction sequence has an even base point B. > + > + - instructions initially have odd-numbered points B + 1, B + 3, etc. > + with B + 1 being the final instruction in the sequence. > + > + - even points after B represent gaps between instructions where combined > + instructions could be placed. > + > + Thus even points initially represent no instructions and odd points > + initially represent single instructions. However, when picking a > + place for a combined instruction, the pass may choose somewhere > + inbetween the original two instructions, so that over time a point > + may come to represent several instructions. When this happens, > + the pass maintains the invariant that all instructions with the same > + point number are independent of each other and thus can be treated as > + acting in parallel (or as acting in any arbitrary sequence). > + > + TODOs: > + > + - Handle 3-instruction combinations, and possibly more. > + > + - Handle existing clobbers more efficiently. At the moment we can't > + move an instruction that clobbers R across another instruction that > + clobbers R. > + > + - Allow hard register clobbers to be added, like combine does. > + > + - Perhaps work on EBBs, or SESE regions. */ > + > +namespace { > + > +/* The number of explicit uses to record in a live range. */ > +const unsigned int NUM_RANGE_USERS = 4; > + > +/* The maximum number of instructions that we can combine at once. */ > +const unsigned int MAX_COMBINE_INSNS = 2; > + > +/* A fake cost for instructions that we haven't costed yet. */ > +const unsigned int UNKNOWN_COST = ~0U; > + > +class combine2 > +{ > +public: > + combine2 (function *); > + ~combine2 (); > + > + void execute (); > + > +private: > + struct insn_info_rec; > + > + /* Describes the live range of a register or of memory. For simplicity, > + we treat memory as a single entity. > + > + If we had a fully-accurate live range, updating it to account for a > + moved instruction would be a linear-time operation. Doing this for > + each combination would then make the pass quadratic. We therefore > + just maintain a list of NUM_RANGE_USERS use insns and use simple, > + conservatively-correct behavior for the rest. */ > + struct live_range_rec > + { > + /* Which instruction provides the dominating definition, or null if > + we don't know yet. */ > + insn_info_rec *producer; > + > + /* A selection of instructions that use the resource, in program order. */ > + insn_info_rec *users[NUM_RANGE_USERS]; > + > + /* An inclusive range of points that covers instructions not mentioned > + in USERS. Both values are zero if there are no such instructions. > + > + Once we've included a use U at point P in this range, we continue > + to assume that some kind of use exists at P whatever happens to U > + afterwards. */ > + unsigned int first_extra_use; > + unsigned int last_extra_use; > + > + /* The register number this range describes, or INVALID_REGNUM > + for memory. */ > + unsigned int regno; > + > + /* Forms a linked list of ranges for the same resource, in program > + order. */ > + live_range_rec *prev_range; > + live_range_rec *next_range; > + }; > + > + /* Pass-specific information about an instruction. */ > + struct insn_info_rec > + { > + /* The instruction itself. */ > + rtx_insn *insn; > + > + /* A null-terminated list of live ranges for the things that this > + instruction defines. */ > + live_range_rec **defs; > + > + /* A null-terminated list of live ranges for the things that this > + instruction uses. */ > + live_range_rec **uses; > + > + /* The point at which the instruction appears. */ > + unsigned int point; > + > + /* The cost of the instruction, or UNKNOWN_COST if we haven't > + measured it yet. */ > + unsigned int cost; > + }; > + > + /* Describes one attempt to combine instructions. */ > + struct combination_attempt_rec > + { > + /* The instruction that we're currently trying to optimize. > + If the combination succeeds, we'll use this insn_info_rec > + to describe the new instruction. */ > + insn_info_rec *new_home; > + > + /* The instructions we're combining, in program order. */ > + insn_info_rec *sequence[MAX_COMBINE_INSNS]; > + > + /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the > + live range that describes the substituted register. */ > + live_range_rec *def_use_range; > + > + /* The earliest and latest points at which we could insert the > + combined instruction. */ > + unsigned int earliest_point; > + unsigned int latest_point; > + > + /* The cost of the new instruction, once we have a successful match. */ > + unsigned int new_cost; > + }; > + > + /* Pass-specific information about a register. */ > + struct reg_info_rec > + { > + /* The live range associated with the last reference to the register. */ > + live_range_rec *range; > + > + /* The point at which the last reference occurred. */ > + unsigned int next_ref; > + > + /* True if the register is currently live. We record this here rather > + than in a separate bitmap because (a) there's a natural hole for > + it on LP64 hosts and (b) we only refer to it when updating the > + other fields, and so recording it here should give better locality. */ > + unsigned int live_p : 1; > + }; > + > + live_range_rec *new_live_range (unsigned int, live_range_rec *); > + live_range_rec *reg_live_range (unsigned int); > + live_range_rec *mem_live_range (); > + bool add_range_use (live_range_rec *, insn_info_rec *); > + void remove_range_use (live_range_rec *, insn_info_rec *); > + bool has_single_use_p (live_range_rec *); > + bool known_last_use_p (live_range_rec *, insn_info_rec *); > + unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *); > + unsigned int find_latest_point (insn_info_rec *, insn_info_rec *); > + bool start_combination (combination_attempt_rec &, insn_info_rec *, > + insn_info_rec *, live_range_rec * = NULL); > + bool verify_combination (combination_attempt_rec &); > + int estimate_reg_pressure_delta (insn_info_rec *); > + void commit_combination (combination_attempt_rec &, bool); > + bool try_parallel_sets (combination_attempt_rec &, rtx, rtx); > + bool try_parallelize_insns (combination_attempt_rec &); > + bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool); > + bool try_combine_def_use (combination_attempt_rec &, rtx, rtx); > + bool try_combine_two_uses (combination_attempt_rec &); > + bool try_combine (insn_info_rec *, rtx, unsigned int); > + bool optimize_insn (insn_info_rec *); > + void record_defs (insn_info_rec *); > + void record_reg_use (insn_info_rec *, df_ref); > + void record_uses (insn_info_rec *); > + void process_insn (insn_info_rec *); > + void start_sequence (); > + > + /* The function we're optimizing. */ > + function *m_fn; > + > + /* The highest pseudo register number plus one. */ > + unsigned int m_num_regs; > + > + /* The current basic block. */ > + basic_block m_bb; > + > + /* True if we should optimize the current basic block for speed. */ > + bool m_optimize_for_speed_p; > + > + /* The point number to allocate to the next instruction we visit > + in the backward traversal. */ > + unsigned int m_point; > + > + /* The point number corresponding to the end of the current > + instruction sequence, i.e. the lowest point number about which > + we still have valid information. */ > + unsigned int m_end_of_sequence; > + > + /* The point number corresponding to the end of the current basic block. > + This is the same as M_END_OF_SEQUENCE when processing the last > + instruction sequence in a basic block. */ > + unsigned int m_end_of_bb; > + > + /* The memory live range, or null if we haven't yet found a memory > + reference in the current instruction sequence. */ > + live_range_rec *m_mem_range; > + > + /* Gives information about each register. We track both hard and > + pseudo registers. */ > + auto_vec<reg_info_rec> m_reg_info; > + > + /* A bitmap of registers whose entry in m_reg_info is valid. */ > + auto_sbitmap m_valid_regs; > + > + /* If nonnuull, an unused 2-element PARALLEL that we can use to test > + instruction combinations. */ > + rtx m_spare_parallel; > + > + /* A bitmap of instructions that we've already tried to combine with. */ > + auto_bitmap m_tried_insns; > + > + /* A temporary bitmap used to hold register numbers. */ > + auto_bitmap m_true_deps; > + > + /* An obstack used for allocating insn_info_recs and for building > + up their lists of definitions and uses. */ > + obstack m_insn_obstack; > + > + /* An obstack used for allocating live_range_recs. */ > + obstack m_range_obstack; > + > + /* Start-of-object pointers for the two obstacks. */ > + char *m_insn_obstack_start; > + char *m_range_obstack_start; > + > + /* A list of instructions that we've optimized and whose new forms > + change the cfg. */ > + auto_vec<rtx_insn *> m_cfg_altering_insns; > + > + /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */ > + auto_bitmap m_cfg_altering_insn_ids; > + > + /* We can insert new instructions at point P * 2 by inserting them > + after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new > + instructions at point P * 2 + 1 by inserting them before > + M_POINTS[P - M_END_OF_SEQUENCE / 2]. */ > + auto_vec<rtx_insn *, 256> m_points; > +}; > + > +combine2::combine2 (function *fn) > + : m_fn (fn), > + m_num_regs (max_reg_num ()), > + m_bb (NULL), > + m_optimize_for_speed_p (false), > + m_point (2), > + m_end_of_sequence (m_point), > + m_end_of_bb (m_point), > + m_mem_range (NULL), > + m_reg_info (m_num_regs), > + m_valid_regs (m_num_regs), > + m_spare_parallel (NULL_RTX) > +{ > + gcc_obstack_init (&m_insn_obstack); > + gcc_obstack_init (&m_range_obstack); > + m_reg_info.quick_grow (m_num_regs); > + bitmap_clear (m_valid_regs); > + m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0); > + m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0); > +} > + > +combine2::~combine2 () > +{ > + obstack_free (&m_insn_obstack, NULL); > + obstack_free (&m_range_obstack, NULL); > +} > + > +/* Return true if it's possible in principle to combine INSN with > + other instructions. ALLOW_ASMS_P is true if the caller can cope > + with asm statements. */ > + > +static bool > +combinable_insn_p (rtx_insn *insn, bool allow_asms_p) > +{ > + rtx pattern = PATTERN (insn); > + > + if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER) > + return false; > + > + if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX)) > + return false; > + > + if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0) > + return false; > + > + return true; > +} > + > +/* Return true if it's possible in principle to move INSN somewhere else, > + as long as all dependencies are satisfied. */ > + > +static bool > +movable_insn_p (rtx_insn *insn) > +{ > + if (JUMP_P (insn)) > + return false; > + > + if (volatile_refs_p (PATTERN (insn))) > + return false; > + > + return true; > +} > + > +/* Create and return a new live range for REGNO. NEXT is the next range > + in program order, or null if this is the first live range in the > + sequence. */ > + > +combine2::live_range_rec * > +combine2::new_live_range (unsigned int regno, live_range_rec *next) > +{ > + live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec); > + memset (range, 0, sizeof (*range)); > + > + range->regno = regno; > + range->next_range = next; > + if (next) > + next->prev_range = range; > + return range; > +} > + > +/* Return the current live range for register REGNO, creating a new > + one if necessary. */ > + > +combine2::live_range_rec * > +combine2::reg_live_range (unsigned int regno) > +{ > + /* Initialize the liveness flag, if it isn't already valid for this BB. */ > + bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno); > + if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb) > + m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno); > + > + /* See if we already have a live range associated with the current > + instruction sequence. */ > + live_range_rec *range = NULL; > + if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence) > + range = m_reg_info[regno].range; > + > + /* Create a new range if this is the first reference to REGNO in the > + current instruction sequence or if the current range has been closed > + off by a definition. */ > + if (!range || range->producer) > + { > + range = new_live_range (regno, range); > + > + /* If the register is live after the current sequence, treat that > + as a fake use at the end of the sequence. */ > + if (!range->next_range && m_reg_info[regno].live_p) > + range->first_extra_use = range->last_extra_use = m_end_of_sequence; > + > + /* Record that this is now the current range for REGNO. */ > + if (first_ref_p) > + bitmap_set_bit (m_valid_regs, regno); > + m_reg_info[regno].range = range; > + m_reg_info[regno].next_ref = m_point; > + } > + return range; > +} > + > +/* Return the current live range for memory, treating memory as a single > + entity. Create a new live range if necessary. */ > + > +combine2::live_range_rec * > +combine2::mem_live_range () > +{ > + if (!m_mem_range || m_mem_range->producer) > + m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range); > + return m_mem_range; > +} > + > +/* Record that instruction USER uses the resource described by RANGE. > + Return true if this is new information. */ > + > +bool > +combine2::add_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + /* See if we've already recorded the instruction, or if there's a > + spare use slot we can use. */ > + unsigned int i = 0; > + for (; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return false; > + > + if (i == NUM_RANGE_USERS) > + { > + /* Since we've processed USER recently, assume that it's more > + interesting to record explicitly than the last user in the > + current list. Evict that last user and describe it in the > + overflow "extra use" range instead. */ > + insn_info_rec *ousted_user = range->users[--i]; > + if (range->first_extra_use < ousted_user->point) > + range->first_extra_use = ousted_user->point; > + if (range->last_extra_use > ousted_user->point) > + range->last_extra_use = ousted_user->point; > + } > + > + /* Insert USER while keeping the list sorted. */ > + for (; i > 0 && range->users[i - 1]->point < user->point; --i) > + range->users[i] = range->users[i - 1]; > + range->users[i] = user; > + return true; > +} > + > +/* Remove USER from the uses recorded for RANGE, if we can. > + There's nothing we can do if USER was described in the > + overflow "extra use" range. */ > + > +void > +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] == user) > + { > + for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j) > + range->users[j] = range->users[j + 1]; > + range->users[NUM_RANGE_USERS - 1] = NULL; > + break; > + } > +} > + > +/* Return true if RANGE has a single known user. */ > + > +bool > +combine2::has_single_use_p (live_range_rec *range) > +{ > + return range->users[0] && !range->users[1] && !range->first_extra_use; > +} > + > +/* Return true if we know that USER is the last user of RANGE. */ > + > +bool > +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) > +{ > + if (range->last_extra_use <= user->point) > + return false; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; > + else if (range->users[i]->point == user->point) > + return false; > + > + gcc_unreachable (); > +} > + > +/* Find the earliest point that we could move I2 up in order to combine > + it with I1. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. */ > + > +unsigned int > +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1) > +{ > + if (!movable_insn_p (i2->insn)) > + return i2->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way up to I1. */ > + unsigned int point = i1->point; > + > + /* Make sure that the new position preserves all necessary true dependencies > + on earlier instructions. */ > + for (live_range_rec **use = i2->uses; *use; ++use) > + { > + live_range_rec *range = *use; > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + } > + > + /* Make sure that the new position preserves all necessary output and > + anti dependencies on earlier instructions. */ > + for (live_range_rec **def = i2->defs; *def; ++def) > + if (live_range_rec *range = (*def)->prev_range) > + { > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + > + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) > + if (range->users[i] && range->users[i] != i1) > + { > + if (point >= range->users[i]->point) > + point = range->users[i]->point - 1; > + break; > + } > + > + if (range->last_extra_use && point >= range->last_extra_use) > + point = range->last_extra_use - 1; > + } > + > + return point; > +} > + > +/* Find the latest point that we could move I1 down in order to combine > + it with I2. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. */ > + > +unsigned int > +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2) > +{ > + if (!movable_insn_p (i1->insn)) > + return i1->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way down to I2. */ > + unsigned int point = i2->point; > + > + /* Make sure that the new position preserves all necessary anti dependencies > + on later instructions. */ > + for (live_range_rec **use = i1->uses; *use; ++use) > + if (live_range_rec *range = (*use)->next_range) > + if (range->producer != i2 && point <= range->producer->point) > + point = range->producer->point + 1; > + > + /* Make sure that the new position preserves all necessary output and > + true dependencies on later instructions. */ > + for (live_range_rec **def = i1->defs; *def; ++def) > + { > + live_range_rec *range = *def; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] != i2) > + { > + if (range->users[i] && point <= range->users[i]->point) > + point = range->users[i]->point + 1; > + break; > + } > + > + if (range->first_extra_use && point <= range->first_extra_use) > + point = range->first_extra_use + 1; > + > + live_range_rec *next_range = range->next_range; > + if (next_range > + && next_range->producer != i2 > + && point <= next_range->producer->point) > + point = next_range->producer->point + 1; > + } > + > + return point; > +} > + > +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2, > + where I1 is the instruction that we're currently trying to optimize. > + If DEF_USE_RANGE is nonnull, I1 defines the value described by > + DEF_USE_RANGE and I2 uses it. */ > + > +bool > +combine2::start_combination (combination_attempt_rec &attempt, > + insn_info_rec *i1, insn_info_rec *i2, > + live_range_rec *def_use_range) > +{ > + attempt.new_home = i1; > + attempt.sequence[0] = i1; > + attempt.sequence[1] = i2; > + if (attempt.sequence[0]->point < attempt.sequence[1]->point) > + std::swap (attempt.sequence[0], attempt.sequence[1]); > + attempt.def_use_range = def_use_range; > + > + /* Check that the instructions have no true dependencies other than > + DEF_USE_RANGE. */ > + bitmap_clear (m_true_deps); > + for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def) > + if (*def != def_use_range) > + bitmap_set_bit (m_true_deps, (*def)->regno); > + for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use) > + if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno)) > + return false; > + > + /* Calculate the range of points at which the combined instruction > + could live. */ > + attempt.earliest_point = find_earliest_point (attempt.sequence[1], > + attempt.sequence[0]); > + attempt.latest_point = find_latest_point (attempt.sequence[0], > + attempt.sequence[1]); > + if (attempt.earliest_point < attempt.latest_point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "cannot combine %d and %d: no suitable" > + " location for combined insn\n", > + INSN_UID (attempt.sequence[0]->insn), > + INSN_UID (attempt.sequence[1]->insn)); > + return false; > + } > + > + /* Make sure we have valid costs for the original instructions before > + we start changing their patterns. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i]->cost == UNKNOWN_COST) > + attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn, > + m_optimize_for_speed_p); > + return true; > +} > + > +/* Check whether the combination attempt described by ATTEMPT matches > + an .md instruction (or matches its constraints, in the case of an > + asm statement). If so, calculate the cost of the new instruction > + and check whether it's cheap enough. */ > + > +bool > +combine2::verify_combination (combination_attempt_rec &attempt) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + bool ok_p = verify_changes (0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + if (!ok_p) > + fprintf (dump_file, "failed to match this instruction:\n"); > + else if (const char *name = get_insn_name (INSN_CODE (insn))) > + fprintf (dump_file, "successfully matched this instruction to %s:\n", > + name); > + else > + fprintf (dump_file, "successfully matched this instruction:\n"); > + print_rtl_single (dump_file, PATTERN (insn)); > + } > + if (!ok_p) > + return false; > + > + unsigned int cost1 = attempt.sequence[0]->cost; > + unsigned int cost2 = attempt.sequence[1]->cost; > + attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p); > + ok_p = (attempt.new_cost <= cost1 + cost2); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n", > + cost1, cost2, attempt.new_cost, > + ok_p ? "keeping replacement" : "rejecting replacement"); > + if (!ok_p) > + return false; > + > + confirm_change_group (); > + return true; > +} > + > +/* Return true if we should consider register REGNO when calculating > + register pressure estimates. */ > + > +static bool > +count_reg_pressure_p (unsigned int regno) > +{ > + if (regno == INVALID_REGNUM) > + return false; > + > + /* Unallocatable registers aren't interesting. */ > + if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno]) > + return false; > + > + return true; > +} > + > +/* Try to estimate the effect that the original form of INSN_INFO > + had on register pressure, in the form "born - dying". */ > + > +int > +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info) > +{ > + int delta = 0; > + > + for (live_range_rec **def = insn_info->defs; *def; ++def) > + if (count_reg_pressure_p ((*def)->regno)) > + delta += 1; > + > + for (live_range_rec **use = insn_info->uses; *use; ++use) > + if (count_reg_pressure_p ((*use)->regno) > + && known_last_use_p (*use, insn_info)) > + delta -= 1; > + > + return delta; > +} > + > +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete > + FROM_INSN. Copy any useful information to TO_INSN before doing that. */ > + > +static void > +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn) > +{ > + INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn); > + INSN_CODE (to_insn) = INSN_CODE (from_insn); > + REG_NOTES (to_insn) = REG_NOTES (from_insn); > +} > + > +/* The combination attempt in ATTEMPT has succeeded and is currently > + part of an open validate_change group. Commit to making the change > + and decide where the new instruction should go. > + > + KEPT_DEF_P is true if the new instruction continues to perform > + the definition described by ATTEMPT.def_use_range. */ > + > +void > +combine2::commit_combination (combination_attempt_rec &attempt, > + bool kept_def_p) > +{ > + insn_info_rec *new_home = attempt.new_home; > + rtx_insn *old_insn = attempt.sequence[0]->insn; > + rtx_insn *new_insn = attempt.sequence[1]->insn; > + > + /* Remove any notes that are no longer relevant. */ > + bool single_set_p = single_set (new_insn); > + for (rtx *note_ptr = ®_NOTES (new_insn); *note_ptr; ) > + { > + rtx note = *note_ptr; > + bool keep_p = true; > + switch (REG_NOTE_KIND (note)) > + { > + case REG_EQUAL: > + case REG_EQUIV: > + case REG_NOALIAS: > + keep_p = single_set_p; > + break; > + > + case REG_UNUSED: > + keep_p = false; > + break; > + > + default: > + break; > + } > + if (keep_p) > + note_ptr = &XEXP (*note_ptr, 1); > + else > + { > + *note_ptr = XEXP (*note_ptr, 1); > + free_EXPR_LIST_node (note); > + } > + } > + > + /* Complete the open validate_change group. */ > + confirm_change_group (); > + > + /* Decide where the new instruction should go. */ > + unsigned int new_point = attempt.latest_point; > + if (new_point != attempt.earliest_point > + && prev_real_insn (new_insn) != old_insn) > + { > + /* Prefer the earlier point if the combined instruction reduces > + register pressure and the latest point if it increases register > + pressure. > + > + The choice isn't obvious in the event of a tie, but picking > + the earliest point should reduce the number of times that > + we need to invalidate debug insns. */ > + int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]); > + int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]); > + bool move_up_p = (delta1 + delta2 <= 0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "register pressure delta = %d + %d; using %s position\n", > + delta1, delta2, move_up_p ? "earliest" : "latest"); > + if (move_up_p) > + new_point = attempt.earliest_point; > + } > + > + /* Translate inserting at NEW_POINT into inserting before or after > + a particular insn. */ > + rtx_insn *anchor = NULL; > + bool before_p = (new_point & 1); > + if (new_point != attempt.sequence[1]->point > + && new_point != attempt.sequence[0]->point) > + { > + anchor = m_points[(new_point - m_end_of_sequence) / 2]; > + rtx_insn *other_side = (before_p > + ? prev_real_insn (anchor) > + : next_real_insn (anchor)); > + /* Inserting next to an insn X and then deleting X is just a > + roundabout way of using X as the insertion point. */ > + if (anchor == new_insn || other_side == new_insn) > + new_point = attempt.sequence[1]->point; > + else if (anchor == old_insn || other_side == old_insn) > + new_point = attempt.sequence[0]->point; > + } > + > + /* Actually perform the move. */ > + if (new_point == attempt.sequence[1]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (new_insn)); > + set_insn_deleted (old_insn); > + } > + else if (new_point == attempt.sequence[0]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (old_insn)); > + PATTERN (old_insn) = PATTERN (new_insn); > + transfer_insn (old_insn, new_insn); > + std::swap (old_insn, new_insn); > + set_insn_deleted (old_insn); > + } > + else > + { > + /* We need to insert a new instruction. We can't simply move > + NEW_INSN because it acts as an insertion anchor in m_points. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "inserting combined insn %s insn %d\n", > + before_p ? "before" : "after", INSN_UID (anchor)); > + > + rtx_insn *added_insn = (before_p > + ? emit_insn_before (PATTERN (new_insn), anchor) > + : emit_insn_after (PATTERN (new_insn), anchor)); > + transfer_insn (added_insn, new_insn); > + set_insn_deleted (old_insn); > + set_insn_deleted (new_insn); > + new_insn = added_insn; > + } > + df_insn_rescan (new_insn); > + > + /* Unlink the old uses. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + remove_range_use (*use, attempt.sequence[i]); > + > + /* Work out which registers the new pattern uses. */ > + bitmap_clear (m_true_deps); > + df_ref use; > + FOR_EACH_INSN_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + FOR_EACH_INSN_EQ_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + > + /* Describe the combined instruction in NEW_HOME. */ > + new_home->insn = new_insn; > + new_home->point = new_point; > + new_home->cost = attempt.new_cost; > + > + /* Build up a list of definitions for the combined instructions > + and update all the ranges accordingly. It shouldn't matter > + which order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def) > + if (kept_def_p || *def != attempt.def_use_range) > + { > + obstack_ptr_grow (&m_insn_obstack, *def); > + (*def)->producer = new_home; > + } > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* Build up a list of uses for the combined instructions and update > + all the ranges accordingly. Again, it shouldn't matter which > + order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + if (*use != attempt.def_use_range > + && add_range_use (*use, new_home)) > + obstack_ptr_grow (&m_insn_obstack, *use); > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* There shouldn't be any remaining references to other instructions > + in the combination. Invalidate their contents to make lingering > + references a noisy failure. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i] != new_home) > + { > + attempt.sequence[i]->insn = NULL; > + attempt.sequence[i]->point = ~0U; > + } > + > + /* Unlink the def-use range. */ > + if (!kept_def_p && attempt.def_use_range) > + { > + live_range_rec *range = attempt.def_use_range; > + if (range->prev_range) > + range->prev_range->next_range = range->next_range; > + else > + m_reg_info[range->regno].range = range->next_range; > + if (range->next_range) > + range->next_range->prev_range = range->prev_range; > + } > + > + /* Record instructions whose new form alters the cfg. */ > + rtx pattern = PATTERN (new_insn); > + if ((returnjump_p (new_insn) > + || any_uncondjump_p (new_insn) > + || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx)) > + && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn))) > + m_cfg_altering_insns.safe_push (new_insn); > +} > + > +/* Return true if X1 and X2 are memories and if X1 does not have > + a higher alignment than X2. */ > + > +static bool > +dubious_mem_pair_p (rtx x1, rtx x2) > +{ > + return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2); > +} > + > +/* Try implement ATTEMPT using (parallel [SET1 SET2]). */ > + > +bool > +combine2::try_parallel_sets (combination_attempt_rec &attempt, > + rtx set1, rtx set2) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + /* Combining two loads or two stores can be useful on targets that > + allow them to be treated as a single access. However, we use a > + very peephole approach to picking the pairs, so we need to be > + relatively confident that we're making a good choice. > + > + For now just aim for cases in which the memory references are > + consecutive and the first reference has a higher alignment. > + We can leave the target to test the consecutive part; whatever test > + we added here might be different from the target's, and in any case > + it's fine if the target accepts other well-aligned cases too. */ > + if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2)) > + || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2))) > + return false; > + > + /* Cache the PARALLEL rtx between attempts so that we don't generate > + too much garbage rtl. */ > + if (!m_spare_parallel) > + { > + rtvec vec = gen_rtvec (2, set1, set2); > + m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec); > + } > + else > + { > + XVECEXP (m_spare_parallel, 0, 0) = set1; > + XVECEXP (m_spare_parallel, 0, 1) = set2; > + } > + > + unsigned int num_changes = num_validated_changes (); > + validate_change (insn, &PATTERN (insn), m_spare_parallel, true); > + if (verify_combination (attempt)) > + { > + m_spare_parallel = NULL_RTX; > + return true; > + } > + cancel_changes (num_changes); > + return false; > +} > + > +/* Try to parallelize the two instructions in ATTEMPT. */ > + > +bool > +combine2::try_parallelize_insns (combination_attempt_rec &attempt) > +{ > + rtx_insn *i1_insn = attempt.sequence[0]->insn; > + rtx_insn *i2_insn = attempt.sequence[1]->insn; > + > + /* Can't parallelize asm statements. */ > + if (asm_noperands (PATTERN (i1_insn)) >= 0 > + || asm_noperands (PATTERN (i2_insn)) >= 0) > + return false; > + > + /* For now, just handle the case in which both instructions are > + single sets. We could handle more than 2 sets as well, but few > + targets support that anyway. */ > + rtx set1 = single_set (i1_insn); > + if (!set1) > + return false; > + rtx set2 = single_set (i2_insn); > + if (!set2) > + return false; > + > + /* Make sure that we have structural proof that the destinations > + are independent. Things like alias analysis rely on semantic > + information and assume no undefined behavior, which is rarely a > + good enough guarantee to allow a useful instruction combination. */ > + rtx dest1 = SET_DEST (set1); > + rtx dest2 = SET_DEST (set2); > + if (MEM_P (dest1) > + ? MEM_P (dest2) && nonoverlapping_memrefs_p (dest1, dest2, false) > + : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2)) > + return false; > + > + /* Try the sets in both orders. */ > + if (try_parallel_sets (attempt, set1, set2) > + || try_parallel_sets (attempt, set2, set1)) > + { > + commit_combination (attempt, true); > + if (MAY_HAVE_DEBUG_BIND_INSNS > + && attempt.new_home->insn != i1_insn) > + propagate_for_debug (i1_insn, attempt.new_home->insn, > +
Richard Sandiford <richard.sandiford@arm.com> writes:
> (It's 23:35 local time, so it's still just about stage 1. :-))
Or actually, just under 1 day after end of stage 1. Oops.
Could have sworn stage 1 ended on the 17th :-( Only realised
I'd got it wrong when catching up on Saturday's email traffic.
And inevitably, I introduced a couple of stupid mistakes while
trying to clean the patch up for submission by that (non-)deadline.
Here's a version that fixes an inverted overlapping memref check
and that correctly prunes the use list for combined instructions.
(This last one is just a compile-time saving -- the old code was
correct, just suboptimal.)
And those comparisons that looked too good to be true were:
I'd bodged the choice of run-combine parameters when setting
up the tests. All in all, not a great a day.
Here are the (much less impressive) real values:
Target Tests Delta Best Worst Median
====== ===== ===== ==== ===== ======
aarch64-linux-gnu 412 -786 -270 520 -1
aarch64_be-linux-gnu 288 -3314 -270 33 -1
alpha-linux-gnu 399 -2721 -370 22 -2
amdgcn-amdhsa 201 1938 -484 1259 -1
arc-elf 530 -5901 -1529 356 -1
arm-linux-gnueabi 193 -1167 -612 680 -1
arm-linux-gnueabihf 193 -1167 -612 680 -1
avr-elf 1331 -111093 -13824 680 -9
bfin-elf 1347 -18928 -8461 465 -2
bpf-elf 63 -475 -60 6 -2
c6x-elf 183 -10508 -10084 41 -2
cr16-elf 1610 -51360 -10657 42 -13
cris-elf 143 -1534 -702 4 -2
csky-elf 136 -3371 -474 6 -2
epiphany-elf 178 -389 -149 84 -1
fr30-elf 161 -1756 -756 289 -2
frv-linux-gnu 807 -13324 -2074 67 -1
ft32-elf 282 -1666 -111 5 -2
h8300-elf 522 -11451 -1747 68 -3
hppa64-hp-hpux11.23 186 -848 -142 34 -1
i686-apple-darwin 344 -1298 -56 44 -1
i686-pc-linux-gnu 242 -1953 -556 33 -1
ia64-linux-gnu 150 -4834 -1134 40 -4
iq2000-elf 177 -1333 -61 3 -2
lm32-elf 193 -1792 -316 47 -2
m32r-elf 73 -595 -98 11 -2
m68k-linux-gnu 210 -2351 -332 148 -2
mcore-elf 133 -1213 -146 7 -1
microblaze-elf 445 -4493 -2094 32 -2
mipsel-linux-gnu 134 -2038 -222 60 -2
mmix 108 -233 -26 4 -1
mn10300-elf 224 -1024 -234 80 -1
moxie-rtems 154 -743 -79 4 -2
msp430-elf 182 -586 -63 19 -1
nds32le-elf 267 -485 -37 136 -1
nios2-linux-gnu 83 -323 -66 5 -1
nvptx-none 568 -1124 -208 16 1
or1k-elf 61 -281 -25 4 -1
pdp11 248 -1292 -182 83 -1
powerpc-ibm-aix7.0 1288 -3031 -370 2046 -1
powerpc64-linux-gnu 1118 692 -274 2934 -2
powerpc64le-linux-gnu 1044 -4719 -688 156 -1
pru-elf 48 -7014 -6921 6 -1
riscv32-elf 63 -1364 -139 7 -2
riscv64-elf 91 -1557 -264 7 -1
rl78-elf 354 -16805 -1665 42 -6
rx-elf 95 -186 -53 8 -1
s390-linux-gnu 184 -2282 -1485 63 -1
s390x-linux-gnu 257 -363 -159 522 -1
sh-linux-gnu 225 -405 -108 68 -1
sparc-linux-gnu 164 -859 -99 18 -1
sparc64-linux-gnu 169 -791 -102 15 -1
tilepro-linux-gnu 1037 -4896 -315 332 -2
v850-elf 54 -408 -53 3 -2
vax-netbsdelf 251 -3315 -400 2 -2
visium-elf 101 -693 -138 16 -1
x86_64-darwin 350 -2145 -490 72 -1
x86_64-linux-gnu 311 -853 -288 210 -1
xstormy16-elf 219 -770 -156 59 -1
xtensa-elf 201 -1418 -322 36 1
Also, the number of LDPs on aarch64-linux-gnu went up from
3543 to 5235. The number of STPs went up from 10494 to 12151.
All the new pairs should be aligned ones.
Retested on aarch64-linux-gnu and x86_64-linux-gnu. It missed the
deadline, but I thought I'd post it anyway to put the record straight.
Thanks,
Richard
2019-11-18 Richard Sandiford <richard.sandiford@arm.com>
gcc/
* Makefile.in (OBJS): Add combine2.o
* params.opt (--param=run-combine): New option.
* doc/invoke.texi: Document it.
* tree-pass.h (make_pass_combine2_before): Declare.
(make_pass_combine2_after): Likewise.
* passes.def: Add them.
* timevar.def (TV_COMBINE2): New timevar.
* cfgrtl.h (update_cfg_for_uncondjump): Declare.
* combine.c (update_cfg_for_uncondjump): Move to...
* cfgrtl.c (update_cfg_for_uncondjump): ...here.
* simplify-rtx.c (simplify_truncation): Handle comparisons.
* recog.h (validate_simplify_replace_rtx): Declare.
* recog.c (validate_simplify_replace_rtx_1): New function.
(validate_simplify_replace_rtx_uses): Likewise.
(validate_simplify_replace_rtx): Likewise.
* combine2.c: New file.
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in 2019-11-18 15:12:34.000000000 +0000
+++ gcc/Makefile.in 2019-11-18 17:43:14.245303327 +0000
@@ -1261,6 +1261,7 @@ OBJS = \
cgraphunit.o \
cgraphclones.o \
combine.o \
+ combine2.o \
combine-stack-adj.o \
compare-elim.o \
context.o \
Index: gcc/params.opt
===================================================================
--- gcc/params.opt 2019-11-18 15:12:34.000000000 +0000
+++ gcc/params.opt 2019-11-18 17:43:14.257303244 +0000
@@ -768,6 +768,10 @@ Use internal function id in profile look
Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
Maximum depth of a loop nest to fully value-number optimistically.
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
-param=sccvn-max-alias-queries-per-access=
Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi 2019-11-18 15:12:34.000000000 +0000
+++ gcc/doc/invoke.texi 2019-11-18 17:43:14.257303244 +0000
@@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las
@item max-combine-insns
The maximum number of instructions the RTL combiner tries to combine.
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
@item integer-share-limit
Small integer constants can use a shared data structure, reducing the
compiler's memory usage and increasing its speed. This sets the maximum
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h 2019-11-18 15:12:34.000000000 +0000
+++ gcc/tree-pass.h 2019-11-18 17:43:14.257303244 +0000
@@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===================================================================
--- gcc/passes.def 2019-11-18 15:12:34.000000000 +0000
+++ gcc/passes.def 2019-11-18 17:43:14.257303244 +0000
@@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
NEXT_PASS (pass_inc_dec);
NEXT_PASS (pass_initialize_regs);
NEXT_PASS (pass_ud_rtl_dce);
+ NEXT_PASS (pass_combine2_before);
NEXT_PASS (pass_combine);
+ NEXT_PASS (pass_combine2_after);
NEXT_PASS (pass_if_after_combine);
NEXT_PASS (pass_jump_after_combine);
NEXT_PASS (pass_partition_blocks);
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def 2019-11-18 15:12:34.000000000 +0000
+++ gcc/timevar.def 2019-11-18 17:43:14.257303244 +0000
@@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , "
DEFTIMEVAR (TV_CSE2 , "CSE 2")
DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction")
DEFTIMEVAR (TV_COMBINE , "combiner")
+DEFTIMEVAR (TV_COMBINE2 , "second combiner")
DEFTIMEVAR (TV_IFCVT , "if-conversion")
DEFTIMEVAR (TV_MODE_SWITCH , "mode switching")
DEFTIMEVAR (TV_SMS , "sms modulo scheduling")
Index: gcc/cfgrtl.h
===================================================================
--- gcc/cfgrtl.h 2019-11-18 15:12:34.000000000 +0000
+++ gcc/cfgrtl.h 2019-11-18 17:43:14.245303327 +0000
@@ -47,6 +47,7 @@ extern void fixup_partitions (void);
extern bool purge_dead_edges (basic_block);
extern bool purge_all_dead_edges (void);
extern bool fixup_abnormal_edges (void);
+extern void update_cfg_for_uncondjump (rtx_insn *);
extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *);
extern void relink_block_chain (bool);
extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *);
Index: gcc/combine.c
===================================================================
--- gcc/combine.c 2019-11-18 15:12:34.000000000 +0000
+++ gcc/combine.c 2019-11-18 17:43:14.249303299 +0000
@@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg)
&& GET_MODE_CLASS (GET_MODE (x)) == MODE_INT;
}
-/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
- Note that the INSN should be deleted *after* removing dead edges, so
- that the kept edge is the fallthrough edge for a (set (pc) (pc))
- but not for a (set (pc) (label_ref FOO)). */
-
-static void
-update_cfg_for_uncondjump (rtx_insn *insn)
-{
- basic_block bb = BLOCK_FOR_INSN (insn);
- gcc_assert (BB_END (bb) == insn);
-
- purge_dead_edges (bb);
-
- delete_insn (insn);
- if (EDGE_COUNT (bb->succs) == 1)
- {
- rtx_insn *insn;
-
- single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
-
- /* Remove barriers from the footer if there are any. */
- for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
- if (BARRIER_P (insn))
- {
- if (PREV_INSN (insn))
- SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
- else
- BB_FOOTER (bb) = NEXT_INSN (insn);
- if (NEXT_INSN (insn))
- SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
- }
- else if (LABEL_P (insn))
- break;
- }
-}
-
/* Return whether PAT is a PARALLEL of exactly N register SETs followed
by an arbitrary number of CLOBBERs. */
static bool
@@ -15096,7 +15060,10 @@ const pass_data pass_data_combine =
{}
/* opt_pass methods: */
- virtual bool gate (function *) { return (optimize > 0); }
+ virtual bool gate (function *)
+ {
+ return optimize > 0 && (param_run_combine & 2) != 0;
+ }
virtual unsigned int execute (function *)
{
return rest_of_handle_combine ();
Index: gcc/cfgrtl.c
===================================================================
--- gcc/cfgrtl.c 2019-11-18 15:12:34.000000000 +0000
+++ gcc/cfgrtl.c 2019-11-18 17:43:14.245303327 +0000
@@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void)
return inserted;
}
+/* Delete the unconditional jump INSN and adjust the CFG correspondingly.
+ Note that the INSN should be deleted *after* removing dead edges, so
+ that the kept edge is the fallthrough edge for a (set (pc) (pc))
+ but not for a (set (pc) (label_ref FOO)). */
+
+void
+update_cfg_for_uncondjump (rtx_insn *insn)
+{
+ basic_block bb = BLOCK_FOR_INSN (insn);
+ gcc_assert (BB_END (bb) == insn);
+
+ purge_dead_edges (bb);
+
+ delete_insn (insn);
+ if (EDGE_COUNT (bb->succs) == 1)
+ {
+ rtx_insn *insn;
+
+ single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
+
+ /* Remove barriers from the footer if there are any. */
+ for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn))
+ if (BARRIER_P (insn))
+ {
+ if (PREV_INSN (insn))
+ SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn);
+ else
+ BB_FOOTER (bb) = NEXT_INSN (insn);
+ if (NEXT_INSN (insn))
+ SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn);
+ }
+ else if (LABEL_P (insn))
+ break;
+ }
+}
+
/* Cut the insns from FIRST to LAST out of the insns stream. */
rtx_insn *
Index: gcc/simplify-rtx.c
===================================================================
--- gcc/simplify-rtx.c 2019-11-18 15:28:59.916793401 +0000
+++ gcc/simplify-rtx.c 2019-11-18 17:43:14.257303244 +0000
@@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode,
&& trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1)
return constm1_rtx;
+ /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result
+ in a narrower mode if useful. */
+ if (COMPARISON_P (op))
+ return simplify_gen_relational (GET_CODE (op), mode, VOIDmode,
+ XEXP (op, 0), XEXP (op, 1));
+
return NULL_RTX;
}
Index: gcc/recog.h
===================================================================
--- gcc/recog.h 2019-11-18 15:12:34.000000000 +0000
+++ gcc/recog.h 2019-11-18 17:43:14.257303244 +0000
@@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos
extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *);
extern void validate_replace_src_group (rtx, rtx, rtx_insn *);
extern bool validate_simplify_insn (rtx_insn *insn);
+extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx);
extern int num_changes_pending (void);
extern int next_insn_tests_no_inequality (rtx_insn *);
extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode);
Index: gcc/recog.c
===================================================================
--- gcc/recog.c 2019-11-18 15:12:34.000000000 +0000
+++ gcc/recog.c 2019-11-18 17:43:14.257303244 +0000
@@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn)
}
return ((num_changes_pending () > 0) && (apply_change_group () > 0));
}
+
+/* A subroutine of validate_simplify_replace_rtx. Apply the replacement
+ described by R to LOC. Return true on success; leave the caller
+ to clean up on failure. */
+
+static bool
+validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc)
+{
+ rtx x = *loc;
+ enum rtx_code code = GET_CODE (x);
+ machine_mode mode = GET_MODE (x);
+
+ if (rtx_equal_p (x, r.from))
+ {
+ validate_unshare_change (r.insn, loc, r.to, 1);
+ return true;
+ }
+
+ /* Recursively apply the substitution and see if we can simplify
+ the result. This specifically shouldn't use simplify_gen_*,
+ since we want to avoid generating new expressions where possible. */
+ int old_num_changes = num_validated_changes ();
+ rtx newx = NULL_RTX;
+ bool recurse_p = false;
+ switch (GET_RTX_CLASS (code))
+ {
+ case RTX_UNARY:
+ {
+ machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+ if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)))
+ return false;
+
+ newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode);
+ break;
+ }
+
+ case RTX_BIN_ARITH:
+ case RTX_COMM_ARITH:
+ {
+ if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+ || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+ return false;
+
+ newx = simplify_binary_operation (code, mode,
+ XEXP (x, 0), XEXP (x, 1));
+ break;
+ }
+
+ case RTX_COMPARE:
+ case RTX_COMM_COMPARE:
+ {
+ machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode
+ ? GET_MODE (XEXP (x, 0))
+ : GET_MODE (XEXP (x, 1)));
+ if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+ || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+ return false;
+
+ newx = simplify_relational_operation (code, mode, op_mode,
+ XEXP (x, 0), XEXP (x, 1));
+ break;
+ }
+
+ case RTX_TERNARY:
+ case RTX_BITFIELD_OPS:
+ {
+ machine_mode op0_mode = GET_MODE (XEXP (x, 0));
+ if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+ || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))
+ || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2)))
+ return false;
+
+ newx = simplify_ternary_operation (code, mode, op0_mode,
+ XEXP (x, 0), XEXP (x, 1),
+ XEXP (x, 2));
+ break;
+ }
+
+ case RTX_EXTRA:
+ if (code == SUBREG)
+ {
+ machine_mode inner_mode = GET_MODE (SUBREG_REG (x));
+ if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x)))
+ return false;
+
+ rtx inner = SUBREG_REG (x);
+ newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x));
+ /* Reject the same cases that simplify_gen_subreg would. */
+ if (!newx
+ && (GET_CODE (inner) == SUBREG
+ || GET_CODE (inner) == CONCAT
+ || GET_MODE (inner) == VOIDmode
+ || !validate_subreg (mode, inner_mode,
+ inner, SUBREG_BYTE (x))))
+ return false;
+ break;
+ }
+ else
+ recurse_p = true;
+ break;
+
+ case RTX_OBJ:
+ if (code == LO_SUM)
+ {
+ if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))
+ || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)))
+ return false;
+
+ /* (lo_sum (high x) y) -> y where x and y have the same base. */
+ rtx op0 = XEXP (x, 0);
+ rtx op1 = XEXP (x, 1);
+ if (GET_CODE (op0) == HIGH)
+ {
+ rtx base0, base1, offset0, offset1;
+ split_const (XEXP (op0, 0), &base0, &offset0);
+ split_const (op1, &base1, &offset1);
+ if (rtx_equal_p (base0, base1))
+ newx = op1;
+ }
+ }
+ else if (code == REG)
+ {
+ if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from))
+ return false;
+ }
+ else
+ recurse_p = true;
+ break;
+
+ case RTX_CONST_OBJ:
+ break;
+
+ case RTX_AUTOINC:
+ if (reg_overlap_mentioned_p (XEXP (x, 0), r.from))
+ return false;
+ recurse_p = true;
+ break;
+
+ case RTX_MATCH:
+ case RTX_INSN:
+ gcc_unreachable ();
+ }
+
+ if (recurse_p)
+ {
+ const char *fmt = GET_RTX_FORMAT (code);
+ for (int i = 0; fmt[i]; i++)
+ switch (fmt[i])
+ {
+ case 'E':
+ for (int j = 0; j < XVECLEN (x, i); j++)
+ if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j)))
+ return false;
+ break;
+
+ case 'e':
+ if (XEXP (x, i)
+ && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i)))
+ return false;
+ break;
+ }
+ }
+
+ if (newx && !rtx_equal_p (x, newx))
+ {
+ /* There's no longer any point unsharing the substitutions made
+ for subexpressions, since we'll just copy this one instead. */
+ for (int i = old_num_changes; i < num_changes; ++i)
+ changes[i].unshare = false;
+ validate_unshare_change (r.insn, loc, newx, 1);
+ }
+
+ return true;
+}
+
+/* A note_uses callback for validate_simplify_replace_rtx.
+ DATA points to a validate_replace_src_data object. */
+
+static void
+validate_simplify_replace_rtx_uses (rtx *loc, void *data)
+{
+ validate_replace_src_data &r = *(validate_replace_src_data *) data;
+ if (r.insn && !validate_simplify_replace_rtx_1 (r, loc))
+ r.insn = NULL;
+}
+
+/* Try to perform the equivalent of:
+
+ newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX);
+ validate_change (INSN, LOC, newx, 1);
+
+ but without generating as much garbage rtl when the resulting
+ pattern doesn't match.
+
+ Return true if we were able to replace all uses of OLD_RTX in *LOC
+ and if the result conforms to general rtx rules (e.g. for whether
+ subregs are meaningful).
+
+ When returning true, add all replacements to the current validation group,
+ leaving the caller to test it in the normal way. Leave both *LOC and the
+ validation group unchanged on failure. */
+
+bool
+validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc,
+ rtx old_rtx, rtx new_rtx)
+{
+ validate_replace_src_data r;
+ r.from = old_rtx;
+ r.to = new_rtx;
+ r.insn = insn;
+
+ unsigned int num_changes = num_validated_changes ();
+ note_uses (loc, validate_simplify_replace_rtx_uses, &r);
+ if (!r.insn)
+ {
+ cancel_changes (num_changes);
+ return false;
+ }
+ return true;
+}
/* Return 1 if the insn using CC0 set by INSN does not contain
any ordered tests applied to the condition codes.
Index: gcc/combine2.c
===================================================================
--- /dev/null 2019-09-17 11:41:18.176664108 +0100
+++ gcc/combine2.c 2019-11-18 17:43:14.249303299 +0000
@@ -0,0 +1,1598 @@
+/* Combine instructions
+ Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3. If not see
+<http://www.gnu.org/licenses/>. */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "print-rtl.h"
+#include "rtl-iter.h"
+#include "predict.h"
+#include "cfgcleanup.h"
+#include "cfghooks.h"
+#include "cfgrtl.h"
+#include "alias.h"
+#include "valtrack.h"
+
+/* This pass tries to combine instructions in the following ways:
+
+ (1) If we have two dependent instructions:
+
+ I1: (set DEST1 SRC1)
+ I2: (...DEST1...)
+
+ and I2 is the only user of DEST1, the pass tries to combine them into:
+
+ I2: (...SRC1...)
+
+ (2) If we have two dependent instructions:
+
+ I1: (set DEST1 SRC1)
+ I2: (...DEST1...)
+
+ the pass tries to combine them into:
+
+ I2: (parallel [(set DEST1 SRC1) (...SRC1...)])
+
+ or:
+
+ I2: (parallel [(...SRC1...) (set DEST1 SRC1)])
+
+ (3) If we have two independent instructions:
+
+ I1: (set DEST1 SRC1)
+ I2: (set DEST2 SRC2)
+
+ that read from memory or from the same register, the pass tries to
+ combine them into:
+
+ I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)])
+
+ or:
+
+ I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)])
+
+ If the combined form is a valid instruction, the pass tries to find a
+ place between I1 and I2 inclusive for the new instruction. If there
+ are multiple valid locations, it tries to pick the best one by taking
+ the effect on register pressure into account.
+
+ If a combination succeeds and produces a single set, the pass tries to
+ combine the new form with earlier or later instructions.
+
+ The pass currently optimizes each basic block separately. It walks
+ the instructions in reverse order, building up live ranges for registers
+ and memory. It then uses these live ranges to look for possible
+ combination opportunities and to decide where the combined instructions
+ could be placed.
+
+ The pass represents positions in the block using point numbers,
+ with higher numbers indicating earlier instructions. The numbering
+ scheme is that:
+
+ - the end of the current instruction sequence has an even base point B.
+
+ - instructions initially have odd-numbered points B + 1, B + 3, etc.
+ with B + 1 being the final instruction in the sequence.
+
+ - even points after B represent gaps between instructions where combined
+ instructions could be placed.
+
+ Thus even points initially represent no instructions and odd points
+ initially represent single instructions. However, when picking a
+ place for a combined instruction, the pass may choose somewhere
+ inbetween the original two instructions, so that over time a point
+ may come to represent several instructions. When this happens,
+ the pass maintains the invariant that all instructions with the same
+ point number are independent of each other and thus can be treated as
+ acting in parallel (or as acting in any arbitrary sequence).
+
+ TODOs:
+
+ - Handle 3-instruction combinations, and possibly more.
+
+ - Handle existing clobbers more efficiently. At the moment we can't
+ move an instruction that clobbers R across another instruction that
+ clobbers R.
+
+ - Allow hard register clobbers to be added, like combine does.
+
+ - Perhaps work on EBBs, or SESE regions. */
+
+namespace {
+
+/* The number of explicit uses to record in a live range. */
+const unsigned int NUM_RANGE_USERS = 4;
+
+/* The maximum number of instructions that we can combine at once. */
+const unsigned int MAX_COMBINE_INSNS = 2;
+
+/* A fake cost for instructions that we haven't costed yet. */
+const unsigned int UNKNOWN_COST = ~0U;
+
+class combine2
+{
+public:
+ combine2 (function *);
+ ~combine2 ();
+
+ void execute ();
+
+private:
+ struct insn_info_rec;
+
+ /* Describes the live range of a register or of memory. For simplicity,
+ we treat memory as a single entity.
+
+ If we had a fully-accurate live range, updating it to account for a
+ moved instruction would be a linear-time operation. Doing this for
+ each combination would then make the pass quadratic. We therefore
+ just maintain a list of NUM_RANGE_USERS use insns and use simple,
+ conservatively-correct behavior for the rest. */
+ struct live_range_rec
+ {
+ /* Which instruction provides the dominating definition, or null if
+ we don't know yet. */
+ insn_info_rec *producer;
+
+ /* A selection of instructions that use the resource, in program order. */
+ insn_info_rec *users[NUM_RANGE_USERS];
+
+ /* An inclusive range of points that covers instructions not mentioned
+ in USERS. Both values are zero if there are no such instructions.
+
+ Once we've included a use U at point P in this range, we continue
+ to assume that some kind of use exists at P whatever happens to U
+ afterwards. */
+ unsigned int first_extra_use;
+ unsigned int last_extra_use;
+
+ /* The register number this range describes, or INVALID_REGNUM
+ for memory. */
+ unsigned int regno;
+
+ /* Forms a linked list of ranges for the same resource, in program
+ order. */
+ live_range_rec *prev_range;
+ live_range_rec *next_range;
+ };
+
+ /* Pass-specific information about an instruction. */
+ struct insn_info_rec
+ {
+ /* The instruction itself. */
+ rtx_insn *insn;
+
+ /* A null-terminated list of live ranges for the things that this
+ instruction defines. */
+ live_range_rec **defs;
+
+ /* A null-terminated list of live ranges for the things that this
+ instruction uses. */
+ live_range_rec **uses;
+
+ /* The point at which the instruction appears. */
+ unsigned int point;
+
+ /* The cost of the instruction, or UNKNOWN_COST if we haven't
+ measured it yet. */
+ unsigned int cost;
+ };
+
+ /* Describes one attempt to combine instructions. */
+ struct combination_attempt_rec
+ {
+ /* The instruction that we're currently trying to optimize.
+ If the combination succeeds, we'll use this insn_info_rec
+ to describe the new instruction. */
+ insn_info_rec *new_home;
+
+ /* The instructions we're combining, in program order. */
+ insn_info_rec *sequence[MAX_COMBINE_INSNS];
+
+ /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the
+ live range that describes the substituted register. */
+ live_range_rec *def_use_range;
+
+ /* The earliest and latest points at which we could insert the
+ combined instruction. */
+ unsigned int earliest_point;
+ unsigned int latest_point;
+
+ /* The cost of the new instruction, once we have a successful match. */
+ unsigned int new_cost;
+ };
+
+ /* Pass-specific information about a register. */
+ struct reg_info_rec
+ {
+ /* The live range associated with the last reference to the register. */
+ live_range_rec *range;
+
+ /* The point at which the last reference occurred. */
+ unsigned int next_ref;
+
+ /* True if the register is currently live. We record this here rather
+ than in a separate bitmap because (a) there's a natural hole for
+ it on LP64 hosts and (b) we only refer to it when updating the
+ other fields, and so recording it here should give better locality. */
+ unsigned int live_p : 1;
+ };
+
+ live_range_rec *new_live_range (unsigned int, live_range_rec *);
+ live_range_rec *reg_live_range (unsigned int);
+ live_range_rec *mem_live_range ();
+ bool add_range_use (live_range_rec *, insn_info_rec *);
+ void remove_range_use (live_range_rec *, insn_info_rec *);
+ bool has_single_use_p (live_range_rec *);
+ bool known_last_use_p (live_range_rec *, insn_info_rec *);
+ unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *);
+ unsigned int find_latest_point (insn_info_rec *, insn_info_rec *);
+ bool start_combination (combination_attempt_rec &, insn_info_rec *,
+ insn_info_rec *, live_range_rec * = NULL);
+ bool verify_combination (combination_attempt_rec &);
+ int estimate_reg_pressure_delta (insn_info_rec *);
+ void commit_combination (combination_attempt_rec &, bool);
+ bool try_parallel_sets (combination_attempt_rec &, rtx, rtx);
+ bool try_parallelize_insns (combination_attempt_rec &);
+ bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool);
+ bool try_combine_def_use (combination_attempt_rec &, rtx, rtx);
+ bool try_combine_two_uses (combination_attempt_rec &);
+ bool try_combine (insn_info_rec *, rtx, unsigned int);
+ bool optimize_insn (insn_info_rec *);
+ void record_defs (insn_info_rec *);
+ void record_reg_use (insn_info_rec *, df_ref);
+ void record_uses (insn_info_rec *);
+ void process_insn (insn_info_rec *);
+ void start_sequence ();
+
+ /* The function we're optimizing. */
+ function *m_fn;
+
+ /* The highest pseudo register number plus one. */
+ unsigned int m_num_regs;
+
+ /* The current basic block. */
+ basic_block m_bb;
+
+ /* True if we should optimize the current basic block for speed. */
+ bool m_optimize_for_speed_p;
+
+ /* The point number to allocate to the next instruction we visit
+ in the backward traversal. */
+ unsigned int m_point;
+
+ /* The point number corresponding to the end of the current
+ instruction sequence, i.e. the lowest point number about which
+ we still have valid information. */
+ unsigned int m_end_of_sequence;
+
+ /* The point number corresponding to the end of the current basic block.
+ This is the same as M_END_OF_SEQUENCE when processing the last
+ instruction sequence in a basic block. */
+ unsigned int m_end_of_bb;
+
+ /* The memory live range, or null if we haven't yet found a memory
+ reference in the current instruction sequence. */
+ live_range_rec *m_mem_range;
+
+ /* Gives information about each register. We track both hard and
+ pseudo registers. */
+ auto_vec<reg_info_rec> m_reg_info;
+
+ /* A bitmap of registers whose entry in m_reg_info is valid. */
+ auto_sbitmap m_valid_regs;
+
+ /* If nonnuull, an unused 2-element PARALLEL that we can use to test
+ instruction combinations. */
+ rtx m_spare_parallel;
+
+ /* A bitmap of instructions that we've already tried to combine with. */
+ auto_bitmap m_tried_insns;
+
+ /* A temporary bitmap used to hold register numbers. */
+ auto_bitmap m_true_deps;
+
+ /* An obstack used for allocating insn_info_recs and for building
+ up their lists of definitions and uses. */
+ obstack m_insn_obstack;
+
+ /* An obstack used for allocating live_range_recs. */
+ obstack m_range_obstack;
+
+ /* Start-of-object pointers for the two obstacks. */
+ char *m_insn_obstack_start;
+ char *m_range_obstack_start;
+
+ /* A list of instructions that we've optimized and whose new forms
+ change the cfg. */
+ auto_vec<rtx_insn *> m_cfg_altering_insns;
+
+ /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */
+ auto_bitmap m_cfg_altering_insn_ids;
+
+ /* We can insert new instructions at point P * 2 by inserting them
+ after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new
+ instructions at point P * 2 + 1 by inserting them before
+ M_POINTS[P - M_END_OF_SEQUENCE / 2]. */
+ auto_vec<rtx_insn *, 256> m_points;
+};
+
+combine2::combine2 (function *fn)
+ : m_fn (fn),
+ m_num_regs (max_reg_num ()),
+ m_bb (NULL),
+ m_optimize_for_speed_p (false),
+ m_point (2),
+ m_end_of_sequence (m_point),
+ m_end_of_bb (m_point),
+ m_mem_range (NULL),
+ m_reg_info (m_num_regs),
+ m_valid_regs (m_num_regs),
+ m_spare_parallel (NULL_RTX)
+{
+ gcc_obstack_init (&m_insn_obstack);
+ gcc_obstack_init (&m_range_obstack);
+ m_reg_info.quick_grow (m_num_regs);
+ bitmap_clear (m_valid_regs);
+ m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0);
+ m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0);
+}
+
+combine2::~combine2 ()
+{
+ obstack_free (&m_insn_obstack, NULL);
+ obstack_free (&m_range_obstack, NULL);
+}
+
+/* Return true if it's possible in principle to combine INSN with
+ other instructions. ALLOW_ASMS_P is true if the caller can cope
+ with asm statements. */
+
+static bool
+combinable_insn_p (rtx_insn *insn, bool allow_asms_p)
+{
+ rtx pattern = PATTERN (insn);
+
+ if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER)
+ return false;
+
+ if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX))
+ return false;
+
+ if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0)
+ return false;
+
+ return true;
+}
+
+/* Return true if it's possible in principle to move INSN somewhere else,
+ as long as all dependencies are satisfied. */
+
+static bool
+movable_insn_p (rtx_insn *insn)
+{
+ if (JUMP_P (insn))
+ return false;
+
+ if (volatile_refs_p (PATTERN (insn)))
+ return false;
+
+ return true;
+}
+
+/* A note_stores callback. Set the bool at *DATA to true if DEST is in
+ memory. */
+
+static void
+find_mem_def (rtx dest, const_rtx, void *data)
+{
+ /* note_stores has stripped things like subregs and zero_extracts,
+ so we don't need to worry about them here. */
+ if (MEM_P (dest))
+ *(bool *) data = true;
+}
+
+/* Return true if instruction INSN writes to memory. */
+
+static bool
+insn_writes_mem_p (rtx_insn *insn)
+{
+ bool saw_mem_p = false;
+ note_stores (insn, find_mem_def, &saw_mem_p);
+ return saw_mem_p;
+}
+
+/* A note_uses callback. Set the bool at DATA to true if *LOC reads
+ from variable memory. */
+
+static void
+find_mem_use (rtx *loc, void *data)
+{
+ subrtx_iterator::array_type array;
+ FOR_EACH_SUBRTX (iter, array, *loc, NONCONST)
+ if (MEM_P (*iter) && !MEM_READONLY_P (*iter))
+ {
+ *(bool *) data = true;
+ break;
+ }
+}
+
+/* Return true if instruction INSN reads memory, including via notes. */
+
+static bool
+insn_reads_mem_p (rtx_insn *insn)
+{
+ bool saw_mem_p = false;
+ note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p);
+ for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1))
+ if (REG_NOTE_KIND (note) == REG_EQUAL
+ || REG_NOTE_KIND (note) == REG_EQUIV)
+ note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p);
+ return saw_mem_p;
+}
+
+/* Create and return a new live range for REGNO. NEXT is the next range
+ in program order, or null if this is the first live range in the
+ sequence. */
+
+combine2::live_range_rec *
+combine2::new_live_range (unsigned int regno, live_range_rec *next)
+{
+ live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec);
+ memset (range, 0, sizeof (*range));
+
+ range->regno = regno;
+ range->next_range = next;
+ if (next)
+ next->prev_range = range;
+ return range;
+}
+
+/* Return the current live range for register REGNO, creating a new
+ one if necessary. */
+
+combine2::live_range_rec *
+combine2::reg_live_range (unsigned int regno)
+{
+ /* Initialize the liveness flag, if it isn't already valid for this BB. */
+ bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno);
+ if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb)
+ m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno);
+
+ /* See if we already have a live range associated with the current
+ instruction sequence. */
+ live_range_rec *range = NULL;
+ if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence)
+ range = m_reg_info[regno].range;
+
+ /* Create a new range if this is the first reference to REGNO in the
+ current instruction sequence or if the current range has been closed
+ off by a definition. */
+ if (!range || range->producer)
+ {
+ range = new_live_range (regno, range);
+
+ /* If the register is live after the current sequence, treat that
+ as a fake use at the end of the sequence. */
+ if (!range->next_range && m_reg_info[regno].live_p)
+ range->first_extra_use = range->last_extra_use = m_end_of_sequence;
+
+ /* Record that this is now the current range for REGNO. */
+ if (first_ref_p)
+ bitmap_set_bit (m_valid_regs, regno);
+ m_reg_info[regno].range = range;
+ m_reg_info[regno].next_ref = m_point;
+ }
+ return range;
+}
+
+/* Return the current live range for memory, treating memory as a single
+ entity. Create a new live range if necessary. */
+
+combine2::live_range_rec *
+combine2::mem_live_range ()
+{
+ if (!m_mem_range || m_mem_range->producer)
+ m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range);
+ return m_mem_range;
+}
+
+/* Record that instruction USER uses the resource described by RANGE.
+ Return true if this is new information. */
+
+bool
+combine2::add_range_use (live_range_rec *range, insn_info_rec *user)
+{
+ /* See if we've already recorded the instruction, or if there's a
+ spare use slot we can use. */
+ unsigned int i = 0;
+ for (; i < NUM_RANGE_USERS && range->users[i]; ++i)
+ if (range->users[i] == user)
+ return false;
+
+ if (i == NUM_RANGE_USERS)
+ {
+ /* Since we've processed USER recently, assume that it's more
+ interesting to record explicitly than the last user in the
+ current list. Evict that last user and describe it in the
+ overflow "extra use" range instead. */
+ insn_info_rec *ousted_user = range->users[--i];
+ if (range->first_extra_use < ousted_user->point)
+ range->first_extra_use = ousted_user->point;
+ if (range->last_extra_use > ousted_user->point)
+ range->last_extra_use = ousted_user->point;
+ }
+
+ /* Insert USER while keeping the list sorted. */
+ for (; i > 0 && range->users[i - 1]->point < user->point; --i)
+ range->users[i] = range->users[i - 1];
+ range->users[i] = user;
+ return true;
+}
+
+/* Remove USER from the uses recorded for RANGE, if we can.
+ There's nothing we can do if USER was described in the
+ overflow "extra use" range. */
+
+void
+combine2::remove_range_use (live_range_rec *range, insn_info_rec *user)
+{
+ for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+ if (range->users[i] == user)
+ {
+ for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j)
+ range->users[j] = range->users[j + 1];
+ range->users[NUM_RANGE_USERS - 1] = NULL;
+ break;
+ }
+}
+
+/* Return true if RANGE has a single known user. */
+
+bool
+combine2::has_single_use_p (live_range_rec *range)
+{
+ return range->users[0] && !range->users[1] && !range->first_extra_use;
+}
+
+/* Return true if we know that USER is the last user of RANGE. */
+
+bool
+combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
+{
+ if (range->last_extra_use <= user->point)
+ return false;
+
+ for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
+ if (range->users[i] == user)
+ return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
+ else if (range->users[i]->point == user->point)
+ return false;
+
+ gcc_unreachable ();
+}
+
+/* Find the earliest point that we could move I2 up in order to combine
+ it with I1. Ignore any dependencies between I1 and I2; leave the
+ caller to deal with those instead. */
+
+unsigned int
+combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1)
+{
+ if (!movable_insn_p (i2->insn))
+ return i2->point;
+
+ /* Start by optimistically assuming that we can move the instruction
+ all the way up to I1. */
+ unsigned int point = i1->point;
+
+ /* Make sure that the new position preserves all necessary true dependencies
+ on earlier instructions. */
+ for (live_range_rec **use = i2->uses; *use; ++use)
+ {
+ live_range_rec *range = *use;
+ if (range->producer
+ && range->producer != i1
+ && point >= range->producer->point)
+ point = range->producer->point - 1;
+ }
+
+ /* Make sure that the new position preserves all necessary output and
+ anti dependencies on earlier instructions. */
+ for (live_range_rec **def = i2->defs; *def; ++def)
+ if (live_range_rec *range = (*def)->prev_range)
+ {
+ if (range->producer
+ && range->producer != i1
+ && point >= range->producer->point)
+ point = range->producer->point - 1;
+
+ for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;)
+ if (range->users[i] && range->users[i] != i1)
+ {
+ if (point >= range->users[i]->point)
+ point = range->users[i]->point - 1;
+ break;
+ }
+
+ if (range->last_extra_use && point >= range->last_extra_use)
+ point = range->last_extra_use - 1;
+ }
+
+ return point;
+}
+
+/* Find the latest point that we could move I1 down in order to combine
+ it with I2. Ignore any dependencies between I1 and I2; leave the
+ caller to deal with those instead. */
+
+unsigned int
+combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2)
+{
+ if (!movable_insn_p (i1->insn))
+ return i1->point;
+
+ /* Start by optimistically assuming that we can move the instruction
+ all the way down to I2. */
+ unsigned int point = i2->point;
+
+ /* Make sure that the new position preserves all necessary anti dependencies
+ on later instructions. */
+ for (live_range_rec **use = i1->uses; *use; ++use)
+ if (live_range_rec *range = (*use)->next_range)
+ if (range->producer != i2 && point <= range->producer->point)
+ point = range->producer->point + 1;
+
+ /* Make sure that the new position preserves all necessary output and
+ true dependencies on later instructions. */
+ for (live_range_rec **def = i1->defs; *def; ++def)
+ {
+ live_range_rec *range = *def;
+
+ for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+ if (range->users[i] != i2)
+ {
+ if (range->users[i] && point <= range->users[i]->point)
+ point = range->users[i]->point + 1;
+ break;
+ }
+
+ if (range->first_extra_use && point <= range->first_extra_use)
+ point = range->first_extra_use + 1;
+
+ live_range_rec *next_range = range->next_range;
+ if (next_range
+ && next_range->producer != i2
+ && point <= next_range->producer->point)
+ point = next_range->producer->point + 1;
+ }
+
+ return point;
+}
+
+/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2,
+ where I1 is the instruction that we're currently trying to optimize.
+ If DEF_USE_RANGE is nonnull, I1 defines the value described by
+ DEF_USE_RANGE and I2 uses it. */
+
+bool
+combine2::start_combination (combination_attempt_rec &attempt,
+ insn_info_rec *i1, insn_info_rec *i2,
+ live_range_rec *def_use_range)
+{
+ attempt.new_home = i1;
+ attempt.sequence[0] = i1;
+ attempt.sequence[1] = i2;
+ if (attempt.sequence[0]->point < attempt.sequence[1]->point)
+ std::swap (attempt.sequence[0], attempt.sequence[1]);
+ attempt.def_use_range = def_use_range;
+
+ /* Check that the instructions have no true dependencies other than
+ DEF_USE_RANGE. */
+ bitmap_clear (m_true_deps);
+ for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def)
+ if (*def != def_use_range)
+ bitmap_set_bit (m_true_deps, (*def)->regno);
+ for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use)
+ if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno))
+ return false;
+
+ /* Calculate the range of points at which the combined instruction
+ could live. */
+ attempt.earliest_point = find_earliest_point (attempt.sequence[1],
+ attempt.sequence[0]);
+ attempt.latest_point = find_latest_point (attempt.sequence[0],
+ attempt.sequence[1]);
+ if (attempt.earliest_point < attempt.latest_point)
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "cannot combine %d and %d: no suitable"
+ " location for combined insn\n",
+ INSN_UID (attempt.sequence[0]->insn),
+ INSN_UID (attempt.sequence[1]->insn));
+ return false;
+ }
+
+ /* Make sure we have valid costs for the original instructions before
+ we start changing their patterns. */
+ for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+ if (attempt.sequence[i]->cost == UNKNOWN_COST)
+ attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn,
+ m_optimize_for_speed_p);
+ return true;
+}
+
+/* Check whether the combination attempt described by ATTEMPT matches
+ an .md instruction (or matches its constraints, in the case of an
+ asm statement). If so, calculate the cost of the new instruction
+ and check whether it's cheap enough. */
+
+bool
+combine2::verify_combination (combination_attempt_rec &attempt)
+{
+ rtx_insn *insn = attempt.sequence[1]->insn;
+
+ bool ok_p = verify_changes (0);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ if (!ok_p)
+ fprintf (dump_file, "failed to match this instruction:\n");
+ else if (const char *name = get_insn_name (INSN_CODE (insn)))
+ fprintf (dump_file, "successfully matched this instruction to %s:\n",
+ name);
+ else
+ fprintf (dump_file, "successfully matched this instruction:\n");
+ print_rtl_single (dump_file, PATTERN (insn));
+ }
+ if (!ok_p)
+ return false;
+
+ unsigned int cost1 = attempt.sequence[0]->cost;
+ unsigned int cost2 = attempt.sequence[1]->cost;
+ attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p);
+ ok_p = (attempt.new_cost <= cost1 + cost2);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n",
+ cost1, cost2, attempt.new_cost,
+ ok_p ? "keeping replacement" : "rejecting replacement");
+ if (!ok_p)
+ return false;
+
+ confirm_change_group ();
+ return true;
+}
+
+/* Return true if we should consider register REGNO when calculating
+ register pressure estimates. */
+
+static bool
+count_reg_pressure_p (unsigned int regno)
+{
+ if (regno == INVALID_REGNUM)
+ return false;
+
+ /* Unallocatable registers aren't interesting. */
+ if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno])
+ return false;
+
+ return true;
+}
+
+/* Try to estimate the effect that the original form of INSN_INFO
+ had on register pressure, in the form "born - dying". */
+
+int
+combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info)
+{
+ int delta = 0;
+
+ for (live_range_rec **def = insn_info->defs; *def; ++def)
+ if (count_reg_pressure_p ((*def)->regno))
+ delta += 1;
+
+ for (live_range_rec **use = insn_info->uses; *use; ++use)
+ if (count_reg_pressure_p ((*use)->regno)
+ && known_last_use_p (*use, insn_info))
+ delta -= 1;
+
+ return delta;
+}
+
+/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete
+ FROM_INSN. Copy any useful information to TO_INSN before doing that. */
+
+static void
+transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn)
+{
+ INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn);
+ INSN_CODE (to_insn) = INSN_CODE (from_insn);
+ REG_NOTES (to_insn) = REG_NOTES (from_insn);
+}
+
+/* The combination attempt in ATTEMPT has succeeded and is currently
+ part of an open validate_change group. Commit to making the change
+ and decide where the new instruction should go.
+
+ KEPT_DEF_P is true if the new instruction continues to perform
+ the definition described by ATTEMPT.def_use_range. */
+
+void
+combine2::commit_combination (combination_attempt_rec &attempt,
+ bool kept_def_p)
+{
+ insn_info_rec *new_home = attempt.new_home;
+ rtx_insn *old_insn = attempt.sequence[0]->insn;
+ rtx_insn *new_insn = attempt.sequence[1]->insn;
+
+ /* Remove any notes that are no longer relevant. */
+ bool single_set_p = single_set (new_insn);
+ for (rtx *note_ptr = ®_NOTES (new_insn); *note_ptr; )
+ {
+ rtx note = *note_ptr;
+ bool keep_p = true;
+ switch (REG_NOTE_KIND (note))
+ {
+ case REG_EQUAL:
+ case REG_EQUIV:
+ case REG_NOALIAS:
+ keep_p = single_set_p;
+ break;
+
+ case REG_UNUSED:
+ keep_p = false;
+ break;
+
+ default:
+ break;
+ }
+ if (keep_p)
+ note_ptr = &XEXP (*note_ptr, 1);
+ else
+ {
+ *note_ptr = XEXP (*note_ptr, 1);
+ free_EXPR_LIST_node (note);
+ }
+ }
+
+ /* Complete the open validate_change group. */
+ confirm_change_group ();
+
+ /* Decide where the new instruction should go. */
+ unsigned int new_point = attempt.latest_point;
+ if (new_point != attempt.earliest_point
+ && prev_real_insn (new_insn) != old_insn)
+ {
+ /* Prefer the earlier point if the combined instruction reduces
+ register pressure and the latest point if it increases register
+ pressure.
+
+ The choice isn't obvious in the event of a tie, but picking
+ the earliest point should reduce the number of times that
+ we need to invalidate debug insns. */
+ int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]);
+ int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]);
+ bool move_up_p = (delta1 + delta2 <= 0);
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "register pressure delta = %d + %d; using %s position\n",
+ delta1, delta2, move_up_p ? "earliest" : "latest");
+ if (move_up_p)
+ new_point = attempt.earliest_point;
+ }
+
+ /* Translate inserting at NEW_POINT into inserting before or after
+ a particular insn. */
+ rtx_insn *anchor = NULL;
+ bool before_p = (new_point & 1);
+ if (new_point != attempt.sequence[1]->point
+ && new_point != attempt.sequence[0]->point)
+ {
+ anchor = m_points[(new_point - m_end_of_sequence) / 2];
+ rtx_insn *other_side = (before_p
+ ? prev_real_insn (anchor)
+ : next_real_insn (anchor));
+ /* Inserting next to an insn X and then deleting X is just a
+ roundabout way of using X as the insertion point. */
+ if (anchor == new_insn || other_side == new_insn)
+ new_point = attempt.sequence[1]->point;
+ else if (anchor == old_insn || other_side == old_insn)
+ new_point = attempt.sequence[0]->point;
+ }
+
+ /* Actually perform the move. */
+ if (new_point == attempt.sequence[1]->point)
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+ INSN_UID (new_insn));
+ set_insn_deleted (old_insn);
+ }
+ else if (new_point == attempt.sequence[0]->point)
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "using insn %d to hold the combined pattern\n",
+ INSN_UID (old_insn));
+ PATTERN (old_insn) = PATTERN (new_insn);
+ transfer_insn (old_insn, new_insn);
+ std::swap (old_insn, new_insn);
+ set_insn_deleted (old_insn);
+ }
+ else
+ {
+ /* We need to insert a new instruction. We can't simply move
+ NEW_INSN because it acts as an insertion anchor in m_points. */
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "inserting combined insn %s insn %d\n",
+ before_p ? "before" : "after", INSN_UID (anchor));
+
+ rtx_insn *added_insn = (before_p
+ ? emit_insn_before (PATTERN (new_insn), anchor)
+ : emit_insn_after (PATTERN (new_insn), anchor));
+ transfer_insn (added_insn, new_insn);
+ set_insn_deleted (old_insn);
+ set_insn_deleted (new_insn);
+ new_insn = added_insn;
+ }
+ df_insn_rescan (new_insn);
+
+ /* Unlink the old uses. */
+ for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+ for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+ remove_range_use (*use, attempt.sequence[i]);
+
+ /* Work out which registers the new pattern uses. */
+ bitmap_clear (m_true_deps);
+ df_ref use;
+ FOR_EACH_INSN_USE (use, new_insn)
+ {
+ rtx reg = DF_REF_REAL_REG (use);
+ bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+ }
+ FOR_EACH_INSN_EQ_USE (use, new_insn)
+ {
+ rtx reg = DF_REF_REAL_REG (use);
+ bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg));
+ }
+
+ /* Describe the combined instruction in NEW_HOME. */
+ new_home->insn = new_insn;
+ new_home->point = new_point;
+ new_home->cost = attempt.new_cost;
+
+ /* Build up a list of definitions for the combined instructions
+ and update all the ranges accordingly. It shouldn't matter
+ which order we do this in. */
+ for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+ for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def)
+ if (kept_def_p || *def != attempt.def_use_range)
+ {
+ obstack_ptr_grow (&m_insn_obstack, *def);
+ (*def)->producer = new_home;
+ }
+ obstack_ptr_grow (&m_insn_obstack, NULL);
+ new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+ /* Build up a list of uses for the combined instructions and update
+ all the ranges accordingly. Again, it shouldn't matter which
+ order we do this in. */
+ for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+ for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use)
+ {
+ live_range_rec *range = *use;
+ if (range != attempt.def_use_range
+ && (range->regno == INVALID_REGNUM
+ ? insn_reads_mem_p (new_insn)
+ : bitmap_bit_p (m_true_deps, range->regno))
+ && add_range_use (range, new_home))
+ obstack_ptr_grow (&m_insn_obstack, range);
+ }
+ obstack_ptr_grow (&m_insn_obstack, NULL);
+ new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+
+ /* There shouldn't be any remaining references to other instructions
+ in the combination. Invalidate their contents to make lingering
+ references a noisy failure. */
+ for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i)
+ if (attempt.sequence[i] != new_home)
+ {
+ attempt.sequence[i]->insn = NULL;
+ attempt.sequence[i]->point = ~0U;
+ }
+
+ /* Unlink the def-use range. */
+ if (!kept_def_p && attempt.def_use_range)
+ {
+ live_range_rec *range = attempt.def_use_range;
+ if (range->prev_range)
+ range->prev_range->next_range = range->next_range;
+ else
+ m_reg_info[range->regno].range = range->next_range;
+ if (range->next_range)
+ range->next_range->prev_range = range->prev_range;
+ }
+
+ /* Record instructions whose new form alters the cfg. */
+ rtx pattern = PATTERN (new_insn);
+ if ((returnjump_p (new_insn)
+ || any_uncondjump_p (new_insn)
+ || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx))
+ && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn)))
+ m_cfg_altering_insns.safe_push (new_insn);
+}
+
+/* Return true if X1 and X2 are memories and if X1 does not have
+ a higher alignment than X2. */
+
+static bool
+dubious_mem_pair_p (rtx x1, rtx x2)
+{
+ return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2);
+}
+
+/* Try implement ATTEMPT using (parallel [SET1 SET2]). */
+
+bool
+combine2::try_parallel_sets (combination_attempt_rec &attempt,
+ rtx set1, rtx set2)
+{
+ rtx_insn *insn = attempt.sequence[1]->insn;
+
+ /* Combining two loads or two stores can be useful on targets that
+ allow them to be treated as a single access. However, we use a
+ very peephole approach to picking the pairs, so we need to be
+ relatively confident that we're making a good choice.
+
+ For now just aim for cases in which the memory references are
+ consecutive and the first reference has a higher alignment.
+ We can leave the target to test the consecutive part; whatever test
+ we added here might be different from the target's, and in any case
+ it's fine if the target accepts other well-aligned cases too. */
+ if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2))
+ || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2)))
+ return false;
+
+ /* Cache the PARALLEL rtx between attempts so that we don't generate
+ too much garbage rtl. */
+ if (!m_spare_parallel)
+ {
+ rtvec vec = gen_rtvec (2, set1, set2);
+ m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec);
+ }
+ else
+ {
+ XVECEXP (m_spare_parallel, 0, 0) = set1;
+ XVECEXP (m_spare_parallel, 0, 1) = set2;
+ }
+
+ unsigned int num_changes = num_validated_changes ();
+ validate_change (insn, &PATTERN (insn), m_spare_parallel, true);
+ if (verify_combination (attempt))
+ {
+ m_spare_parallel = NULL_RTX;
+ return true;
+ }
+ cancel_changes (num_changes);
+ return false;
+}
+
+/* Try to parallelize the two instructions in ATTEMPT. */
+
+bool
+combine2::try_parallelize_insns (combination_attempt_rec &attempt)
+{
+ rtx_insn *i1_insn = attempt.sequence[0]->insn;
+ rtx_insn *i2_insn = attempt.sequence[1]->insn;
+
+ /* Can't parallelize asm statements. */
+ if (asm_noperands (PATTERN (i1_insn)) >= 0
+ || asm_noperands (PATTERN (i2_insn)) >= 0)
+ return false;
+
+ /* For now, just handle the case in which both instructions are
+ single sets. We could handle more than 2 sets as well, but few
+ targets support that anyway. */
+ rtx set1 = single_set (i1_insn);
+ if (!set1)
+ return false;
+ rtx set2 = single_set (i2_insn);
+ if (!set2)
+ return false;
+
+ /* Make sure that we have structural proof that the destinations
+ are independent. Things like alias analysis rely on semantic
+ information and assume no undefined behavior, which is rarely a
+ good enough guarantee to allow a useful instruction combination. */
+ rtx dest1 = SET_DEST (set1);
+ rtx dest2 = SET_DEST (set2);
+ if (MEM_P (dest1)
+ ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false)
+ : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2))
+ return false;
+
+ /* Try the sets in both orders. */
+ if (try_parallel_sets (attempt, set1, set2)
+ || try_parallel_sets (attempt, set2, set1))
+ {
+ commit_combination (attempt, true);
+ if (MAY_HAVE_DEBUG_BIND_INSNS
+ && attempt.new_home->insn != i1_insn)
+ propagate_for_debug (i1_insn, attempt.new_home->insn,
+ SET_DEST (set1), SET_SRC (set1), m_bb);
+ return true;
+ }
+ return false;
+}
+
+/* Replace DEST with SRC in the register notes for INSN. */
+
+static void
+substitute_into_note (rtx_insn *insn, rtx dest, rtx src)
+{
+ for (rtx *note_ptr = ®_NOTES (insn); *note_ptr; )
+ {
+ rtx note = *note_ptr;
+ bool keep_p = true;
+ switch (REG_NOTE_KIND (note))
+ {
+ case REG_EQUAL:
+ case REG_EQUIV:
+ keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0),
+ dest, src);
+ break;
+
+ default:
+ break;
+ }
+ if (keep_p)
+ note_ptr = &XEXP (*note_ptr, 1);
+ else
+ {
+ *note_ptr = XEXP (*note_ptr, 1);
+ free_EXPR_LIST_node (note);
+ }
+ }
+}
+
+/* A subroutine of try_combine_def_use. Try replacing DEST with SRC
+ in ATTEMPT. SRC might be either the original SET_SRC passed to the
+ parent routine or a value pulled from a note; SRC_IS_NOTE_P is true
+ in the latter case. */
+
+bool
+combine2::try_combine_def_use_1 (combination_attempt_rec &attempt,
+ rtx dest, rtx src, bool src_is_note_p)
+{
+ rtx_insn *def_insn = attempt.sequence[0]->insn;
+ rtx_insn *use_insn = attempt.sequence[1]->insn;
+
+ /* Mimic combine's behavior by not combining moves from allocatable hard
+ registers (e.g. when copying parameters or function return values). */
+ if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
+ return false;
+
+ /* Don't mess with volatile references. For one thing, we don't yet
+ know how many copies of SRC we'll need. */
+ if (volatile_refs_p (src))
+ return false;
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "trying to combine %d and %d%s:\n",
+ INSN_UID (def_insn), INSN_UID (use_insn),
+ src_is_note_p ? " using equal/equiv note" : "");
+ dump_insn_slim (dump_file, def_insn);
+ dump_insn_slim (dump_file, use_insn);
+ }
+
+ unsigned int num_changes = num_validated_changes ();
+ if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn),
+ dest, src))
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "combination failed -- unable to substitute"
+ " all uses\n");
+ return false;
+ }
+
+ /* Try matching the instruction on its own if DEST isn't used elsewhere. */
+ if (has_single_use_p (attempt.def_use_range)
+ && verify_combination (attempt))
+ {
+ live_range_rec *next_range = attempt.def_use_range->next_range;
+ substitute_into_note (use_insn, dest, src);
+ commit_combination (attempt, false);
+ if (MAY_HAVE_DEBUG_BIND_INSNS)
+ {
+ rtx_insn *end_of_range = (next_range
+ ? next_range->producer->insn
+ : BB_END (m_bb));
+ propagate_for_debug (def_insn, end_of_range, dest, src, m_bb);
+ }
+ return true;
+ }
+
+ /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN
+ pattern. */
+ if (try_parallelize_insns (attempt))
+ return true;
+
+ cancel_changes (num_changes);
+ return false;
+}
+
+/* ATTEMPT describes an attempt to substitute the result of the first
+ instruction into the second instruction. Try to implement it,
+ given that the first instruction sets DEST to SRC. */
+
+bool
+combine2::try_combine_def_use (combination_attempt_rec &attempt,
+ rtx dest, rtx src)
+{
+ rtx_insn *def_insn = attempt.sequence[0]->insn;
+ rtx_insn *use_insn = attempt.sequence[1]->insn;
+ rtx def_note = find_reg_equal_equiv_note (def_insn);
+
+ /* First try combining the instructions in their original form. */
+ if (try_combine_def_use_1 (attempt, dest, src, false))
+ return true;
+
+ /* Try to replace DEST with a REG_EQUAL/EQUIV value instead. */
+ if (def_note
+ && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+ return true;
+
+ /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try
+ using that instead of the main pattern. */
+ for (rtx *link_ptr = ®_NOTES (use_insn); *link_ptr;
+ link_ptr = &XEXP (*link_ptr, 1))
+ {
+ rtx use_note = *link_ptr;
+ if (REG_NOTE_KIND (use_note) != REG_EQUAL
+ && REG_NOTE_KIND (use_note) != REG_EQUIV)
+ continue;
+
+ rtx use_set = single_set (use_insn);
+ if (!use_set)
+ break;
+
+ if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0)))
+ continue;
+
+ /* Try snipping out the note and putting it in the SET instead. */
+ validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1);
+ validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1);
+
+ if (try_combine_def_use_1 (attempt, dest, src, false))
+ return true;
+
+ if (def_note
+ && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true))
+ return true;
+
+ cancel_changes (0);
+ }
+
+ return false;
+}
+
+/* ATTEMPT describes an attempt to combine two instructions that use
+ the same resource. Try to implement it, returning true on success. */
+
+bool
+combine2::try_combine_two_uses (combination_attempt_rec &attempt)
+{
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "trying to parallelize %d and %d:\n",
+ INSN_UID (attempt.sequence[0]->insn),
+ INSN_UID (attempt.sequence[1]->insn));
+ dump_insn_slim (dump_file, attempt.sequence[0]->insn);
+ dump_insn_slim (dump_file, attempt.sequence[1]->insn);
+ }
+
+ return try_parallelize_insns (attempt);
+}
+
+/* Try to optimize instruction INSN_INFO. Return true on success. */
+
+bool
+combine2::optimize_insn (insn_info_rec *i1)
+{
+ combination_attempt_rec attempt;
+
+ if (!combinable_insn_p (i1->insn, false))
+ return false;
+
+ rtx set = single_set (i1->insn);
+ if (!set)
+ return false;
+
+ /* First try combining INSN with a user of its result. */
+ rtx dest = SET_DEST (set);
+ rtx src = SET_SRC (set);
+ if (REG_P (dest) && REG_NREGS (dest) == 1)
+ for (live_range_rec **def = i1->defs; *def; ++def)
+ if ((*def)->regno == REGNO (dest))
+ {
+ for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+ {
+ insn_info_rec *use = (*def)->users[i];
+ if (use
+ && combinable_insn_p (use->insn, has_single_use_p (*def))
+ && start_combination (attempt, i1, use, *def)
+ && try_combine_def_use (attempt, dest, src))
+ return true;
+ }
+ break;
+ }
+
+ /* Try parallelizing INSN and another instruction that uses the same
+ resource. */
+ bitmap_clear (m_tried_insns);
+ for (live_range_rec **use = i1->uses; *use; ++use)
+ for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i)
+ {
+ insn_info_rec *i2 = (*use)->users[i];
+ if (i2
+ && i2 != i1
+ && combinable_insn_p (i2->insn, false)
+ && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn))
+ && start_combination (attempt, i1, i2)
+ && try_combine_two_uses (attempt))
+ return true;
+ }
+
+ return false;
+}
+
+/* Record all register and memory definitions in INSN_INFO and fill in its
+ "defs" list. */
+
+void
+combine2::record_defs (insn_info_rec *insn_info)
+{
+ rtx_insn *insn = insn_info->insn;
+
+ /* Record register definitions. */
+ df_ref def;
+ FOR_EACH_INSN_DEF (def, insn)
+ {
+ rtx reg = DF_REF_REAL_REG (def);
+ unsigned int end_regno = END_REGNO (reg);
+ for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+ {
+ live_range_rec *range = reg_live_range (regno);
+ range->producer = insn_info;
+ m_reg_info[regno].live_p = false;
+ obstack_ptr_grow (&m_insn_obstack, range);
+ }
+ }
+
+ /* If the instruction writes to memory, record that too. */
+ if (insn_writes_mem_p (insn))
+ {
+ live_range_rec *range = mem_live_range ();
+ range->producer = insn_info;
+ obstack_ptr_grow (&m_insn_obstack, range);
+ }
+
+ /* Complete the list of definitions. */
+ obstack_ptr_grow (&m_insn_obstack, NULL);
+ insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Record that INSN_INFO contains register use USE. If this requires
+ new entries to be added to INSN_INFO->uses, add those entries to the
+ list we're building in m_insn_obstack. */
+
+void
+combine2::record_reg_use (insn_info_rec *insn_info, df_ref use)
+{
+ rtx reg = DF_REF_REAL_REG (use);
+ unsigned int end_regno = END_REGNO (reg);
+ for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno)
+ {
+ live_range_rec *range = reg_live_range (regno);
+ if (add_range_use (range, insn_info))
+ obstack_ptr_grow (&m_insn_obstack, range);
+ m_reg_info[regno].live_p = true;
+ }
+}
+
+/* Record all register and memory uses in INSN_INFO and fill in its
+ "uses" list. */
+
+void
+combine2::record_uses (insn_info_rec *insn_info)
+{
+ rtx_insn *insn = insn_info->insn;
+
+ /* Record register uses in the main pattern. */
+ df_ref use;
+ FOR_EACH_INSN_USE (use, insn)
+ record_reg_use (insn_info, use);
+
+ /* Treat REG_EQUAL uses as first-class uses. We don't lose much
+ by doing that, since it's rare for a REG_EQUAL note to mention
+ registers that the main pattern doesn't. It also gives us the
+ maximum freedom to use REG_EQUAL notes in place of the main pattern. */
+ FOR_EACH_INSN_EQ_USE (use, insn)
+ record_reg_use (insn_info, use);
+
+ /* Record a memory use if either the pattern or the notes read from
+ memory. */
+ if (insn_reads_mem_p (insn))
+ {
+ live_range_rec *range = mem_live_range ();
+ if (add_range_use (range, insn_info))
+ obstack_ptr_grow (&m_insn_obstack, range);
+ }
+
+ /* Complete the list of uses. */
+ obstack_ptr_grow (&m_insn_obstack, NULL);
+ insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack);
+}
+
+/* Start a new instruction sequence, discarding all information about
+ the previous one. */
+
+void
+combine2::start_sequence (void)
+{
+ m_end_of_sequence = m_point;
+ m_mem_range = NULL;
+ m_points.truncate (0);
+ obstack_free (&m_insn_obstack, m_insn_obstack_start);
+ obstack_free (&m_range_obstack, m_range_obstack_start);
+}
+
+/* Run the pass on the current function. */
+
+void
+combine2::execute (void)
+{
+ df_analyze ();
+ FOR_EACH_BB_FN (m_bb, cfun)
+ {
+ m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb);
+ m_end_of_bb = m_point;
+ start_sequence ();
+
+ rtx_insn *insn, *prev;
+ FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev)
+ {
+ if (!NONDEBUG_INSN_P (insn))
+ continue;
+
+ /* The current m_point represents the end of the sequence if
+ INSN is the last instruction in the sequence, otherwise it
+ represents the gap between INSN and the next instruction.
+ m_point + 1 represents INSN itself.
+
+ Instructions can be added to m_point by inserting them
+ after INSN. They can be added to m_point + 1 by inserting
+ them before INSN. */
+ m_points.safe_push (insn);
+ m_point += 1;
+
+ insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec);
+ insn_info->insn = insn;
+ insn_info->point = m_point;
+ insn_info->cost = UNKNOWN_COST;
+
+ record_defs (insn_info);
+ record_uses (insn_info);
+
+ /* Set up m_point for the next instruction. */
+ m_point += 1;
+
+ if (CALL_P (insn))
+ start_sequence ();
+ else
+ while (optimize_insn (insn_info))
+ gcc_assert (insn_info->insn);
+ }
+ }
+
+ /* If an instruction changes the cfg, update the containing block
+ accordingly. */
+ rtx_insn *insn;
+ unsigned int i;
+ FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn)
+ if (JUMP_P (insn))
+ {
+ mark_jump_label (PATTERN (insn), insn, 0);
+ update_cfg_for_uncondjump (insn);
+ }
+ else
+ {
+ remove_edge (split_block (BLOCK_FOR_INSN (insn), insn));
+ emit_barrier_after_bb (BLOCK_FOR_INSN (insn));
+ }
+
+ /* Propagate the above block-local cfg changes to the rest of the cfg. */
+ if (!m_cfg_altering_insns.is_empty ())
+ {
+ if (dom_info_available_p (CDI_DOMINATORS))
+ free_dominance_info (CDI_DOMINATORS);
+ timevar_push (TV_JUMP);
+ rebuild_jump_labels (get_insns ());
+ cleanup_cfg (0);
+ timevar_pop (TV_JUMP);
+ }
+}
+
+const pass_data pass_data_combine2 =
+{
+ RTL_PASS, /* type */
+ "combine2", /* name */
+ OPTGROUP_NONE, /* optinfo_flags */
+ TV_COMBINE2, /* tv_id */
+ 0, /* properties_required */
+ 0, /* properties_provided */
+ 0, /* properties_destroyed */
+ 0, /* todo_flags_start */
+ TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_combine2 : public rtl_opt_pass
+{
+public:
+ pass_combine2 (gcc::context *ctxt, int flag)
+ : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag)
+ {}
+
+ bool
+ gate (function *) OVERRIDE
+ {
+ return optimize && (param_run_combine & m_flag) != 0;
+ }
+
+ unsigned int
+ execute (function *f) OVERRIDE
+ {
+ combine2 (f).execute ();
+ return 0;
+ }
+
+private:
+ unsigned int m_flag;
+}; // class pass_combine2
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_combine2_before (gcc::context *ctxt)
+{
+ return new pass_combine2 (ctxt, 1);
+}
+
+rtl_opt_pass *
+make_pass_combine2_after (gcc::context *ctxt)
+{
+ return new pass_combine2 (ctxt, 4);
+}
Hi! On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote: > While working on SVE, I've noticed several cases in which we fail > to combine instructions because the combined form would need to be > placed earlier in the instruction stream than the last of the > instructions being combined. This includes one very important > case in the handling of the first fault register (FFR). Do you have an example of that? > Combine currently requires the combined instruction to live at the same > location as i3. Or i2 and i3. > I thought about trying to relax that restriction, but it > would be difficult to do with the current pass structure while keeping > everything linear-ish time. s/difficult/impossible/, yes. A long time ago we had to only move insns forward for correctness even, but that should no longer be required, combine always is finite by other means now. > So this patch instead goes for an option that has been talked about > several times over the years: writing a new combine pass that just > does instruction combination, and not all the other optimisations > that have been bolted onto combine over time. E.g. it deliberately > doesn't do things like nonzero-bits tracking, since that really ought > to be a separate, more global, optimisation. In my dreams tracking nonzero bits would be a dataflow problem. > This is still far from being a realistic replacement for the even > the combine parts of the current combine pass. E.g.: > > - it only handles combinations that can be built up from individual > two-instruction combinations. And combine does any of {2,3,4}->{1,2} combinations, and it also can modify a third insn ("other_insn"). For the bigger ->1 combos, if it *can* be decomposed in a bunch of 2->1, then those result in insns that are greater cost than those we started with (or else those combinations *would* be done). For the ->2 combinations, there are many ways those two insns can be formed: it can be the two arms of a parallel, or combine can break a non-matching insn into two at what looks like a good spot for that, or it can use a define_split for it. All those things lead to many more successful combinations :-) > On a more positive note, the pass handles things that the current > combine pass doesn't: > > - the main motivating feature mentioned above: it works out where > the combined instruction could validly live and moves it there > if necessary. If there are a range of valid places, it tries > to pick the best one based on register pressure (although only > with a simple heuristic for now). How are dependencies represented in your new pass? If it just does walks over the insn stream for everything, you get quadratic complexity if you move insns backwards. We have that in combine already, mostly from modified_between_p, but that is limited because of how LOG_LINKS work, and we have been doing this for so long and there are no problems found with it, so it must work in practice. But I am worried about it when moving insns back an unlimited distance. If combine results in two insns it puts them at i2 and i3, and it can actually move a SET to i2 that was at i3 before the combination. > - once it has combined two instructions, it can try combining the > result with both later and earlier code, i.e. it can combine > in both directions. That is what combine does, too. > - it tries using REG_EQUAL notes for the final instruction. And that. > - it can parallelise two independent instructions that both read from > the same register or both read from memory. That only if somehow there is a link between the two (so essentially never). The only combinations tried by combine are those via LOG_LINKs, which are between a SET and the first corresponding use. This is a key factor that makes it kind of linear (instead of exponential) complexity. > The pass is supposed to be linear time without debug insns. > It only tries a constant number C of combinations per instruction > and its bookkeeping updates are constant-time. But how many other insns does it look at, say by modified_between_p or the like? > The patch adds two instances of the new pass: one before combine and > one after it. One thing I want to do is some mini-combine after every split, probably only with the insns new from the split. But we have no cfglayout mode anymore then, and only hard regs (except in the first split pass, which is just a little later than your new pass). > As far as compile-time goes, I tried compiling optabs.ii at -O2 > with an --enable-checking=release compiler: > > run-combine=2 (normal combine): 100.0% (baseline) > run-combine=4 (new pass only) 98.0% > run-combine=6 (both passes) 100.3% > > where the results are easily outside the noise. So the pass on > its own is quicker than combine, but that's not a fair comparison > when it doesn't do everything combine does. Running both passes > only has a slight overhead. And amount of garbage produced? > To get a feel for the effect on multiple targets, I did my usual > bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > and g++.dg, this time comparing run-combine=2 and run-combine=6 > using -O2 -ftree-vectorize: One problem with this is that these are very short functions on average. What is the kind of changes you see for other targets? Wait, does this combine sets with a hard reg source as well? It shouldn't do that, that is RA's job; doing this in a greedy way is a bad idea. (I haven't yet verified if you do this, fwiw). > Inevitably there was some scan-assembler fallout for other tests. > E.g. in gcc.target/aarch64/vmov_n_1.c: > > #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory") > ... > INHIB_OPTIMIZATION; \ > (a) = TEST (test, data_len); \ > INHIB_OPTIMIZATION; \ > (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a)); \ > > is no longer effective for preventing move (a) from being merged > into (b), because the pass can merge at the point of (a). It never was effective for that. Unless (b) lives in memory, in which case your new pass has a bug here. > I think > this is a valid thing to do -- the asm semantics are still satisfied, > and asm volatile ("" : : : "memory") never acted as a register barrier. > But perhaps we should deal with this as a special case? I don't think we should, no. What does "register barrier" even mean, *exactly*? (I'll look at the new version of your patch, but not today). Segher
Segher Boessenkool <segher@kernel.crashing.org> writes: > On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote: >> While working on SVE, I've noticed several cases in which we fail >> to combine instructions because the combined form would need to be >> placed earlier in the instruction stream than the last of the >> instructions being combined. This includes one very important >> case in the handling of the first fault register (FFR). > > Do you have an example of that? It's difficult to share realistic examples at this stage since this isn't really the right forum for making them public for the first time. But in rtl terms we have: (set (reg/v:VNx16BI 102 [ ok ]) (reg:VNx16BI 85 ffrt)) (set (reg:VNx16BI 85 ffrt) (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT)) (set (reg:CC_NZC 66 cc) (unspec:CC_NZC [(reg:VNx16BI 106) repeated x2 (const_int 1 [0x1]) (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST)) and want to combine the first and third instruction at the site of the first instruction. Current combine gives: Trying 18 -> 24: 18: r102:VNx16BI=ffrt:VNx16BI 24: cc:CC_NZC=unspec[r106:VNx16BI,r106:VNx16BI,0x1,r102:VNx16BI] 104 Can't combine i2 into i3 because of: /* Make sure that the value that is to be substituted for the register does not use any registers whose values alter in between. However, If the insns are adjacent, a use can't cross a set even though we think it might (this can happen for a sequence of insns each setting the same destination; last_set of that register might point to a NOTE). If INSN has a REG_EQUIV note, the register is always equivalent to the memory so the substitution is valid even if there are intervening stores. Also, don't move a volatile asm or UNSPEC_VOLATILE across any other insns. */ || (! all_adjacent && (((!MEM_P (src) || ! find_reg_note (insn, REG_EQUIV, src)) && modified_between_p (src, insn, i3)) || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src)) || GET_CODE (src) == UNSPEC_VOLATILE)) >> Combine currently requires the combined instruction to live at the same >> location as i3. > > Or i2 and i3. > >> I thought about trying to relax that restriction, but it >> would be difficult to do with the current pass structure while keeping >> everything linear-ish time. > > s/difficult/impossible/, yes. > > A long time ago we had to only move insns forward for correctness even, > but that should no longer be required, combine always is finite by other > means now. > >> So this patch instead goes for an option that has been talked about >> several times over the years: writing a new combine pass that just >> does instruction combination, and not all the other optimisations >> that have been bolted onto combine over time. E.g. it deliberately >> doesn't do things like nonzero-bits tracking, since that really ought >> to be a separate, more global, optimisation. > > In my dreams tracking nonzero bits would be a dataflow problem. > >> This is still far from being a realistic replacement for the even >> the combine parts of the current combine pass. E.g.: >> >> - it only handles combinations that can be built up from individual >> two-instruction combinations. > > And combine does any of {2,3,4}->{1,2} combinations, and it also can > modify a third insn ("other_insn"). For the bigger ->1 combos, if it > *can* be decomposed in a bunch of 2->1, then those result in insns that > are greater cost than those we started with (or else those combinations > *would* be done). For the ->2 combinations, there are many ways those > two insns can be formed: it can be the two arms of a parallel, or > combine can break a non-matching insn into two at what looks like a good > spot for that, or it can use a define_split for it. > > All those things lead to many more successful combinations :-) Right. I definitely want to support multi-insn combos too. It's one of the TODOs in the head comment, along with the other points in this list. Like I say, it's not yet a realistic replacement for even the combine parts of the current pass. >> On a more positive note, the pass handles things that the current >> combine pass doesn't: >> >> - the main motivating feature mentioned above: it works out where >> the combined instruction could validly live and moves it there >> if necessary. If there are a range of valid places, it tries >> to pick the best one based on register pressure (although only >> with a simple heuristic for now). > > How are dependencies represented in your new pass? If it just does > walks over the insn stream for everything, you get quadratic complexity > if you move insns backwards. We have that in combine already, mostly > from modified_between_p, but that is limited because of how LOG_LINKS > work, and we have been doing this for so long and there are no problems > found with it, so it must work in practice. But I am worried about it > when moving insns back an unlimited distance. It builds def-use chains, but using a constant limit on the number of explicitly-recorded uses. All other uses go in a numerical live range from which they (conservatively) never escape. The def-use chains represent memory as a single entity, a bit like in gimple. I avoided the rtlanal.c dependency routines for exactly this reason. :-) > If combine results in two insns it puts them at i2 and i3, and it can > actually move a SET to i2 that was at i3 before the combination. > >> - once it has combined two instructions, it can try combining the >> result with both later and earlier code, i.e. it can combine >> in both directions. > > That is what combine does, too. Yeah, that part was bogus, sorry. >> - it tries using REG_EQUAL notes for the final instruction. > > And that. I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3 with i3's REG_EQUAL note and combining into that. Does combine do that? I couldn't see it, and in: https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html you seemed to reject the idea of allowing it. >> - it can parallelise two independent instructions that both read from >> the same register or both read from memory. > > That only if somehow there is a link between the two (so essentially > never). The only combinations tried by combine are those via LOG_LINKs, > which are between a SET and the first corresponding use. This is a key > factor that makes it kind of linear (instead of exponential) complexity. Tracking limited def-use chains is what makes this last bit easy. We can just try parallelising two instructions from the (bounded) list of uses. And for this case there's not any garbage rtl involved, since we reuse the same PARALLEL rtx between attempts. The cost is basically all in the recog call (which would obviously mount up if we went overboard). The new pass also tries combining definitions with uses later than the first, but of course in that case we need to keep the original set in parallel. >> The pass is supposed to be linear time without debug insns. >> It only tries a constant number C of combinations per instruction >> and its bookkeeping updates are constant-time. > > But how many other insns does it look at, say by modified_between_p or > the like? Hope the above answers this. >> The patch adds two instances of the new pass: one before combine and >> one after it. > > One thing I want to do is some mini-combine after every split, probably > only with the insns new from the split. But we have no cfglayout mode > anymore then, and only hard regs (except in the first split pass, which > is just a little later than your new pass). Yeah, sounds like it could be useful. I guess there'd need to be an extra condition on the combination that the new insn can't be immediately split. >> As far as compile-time goes, I tried compiling optabs.ii at -O2 >> with an --enable-checking=release compiler: >> >> run-combine=2 (normal combine): 100.0% (baseline) >> run-combine=4 (new pass only) 98.0% >> run-combine=6 (both passes) 100.3% >> >> where the results are easily outside the noise. So the pass on >> its own is quicker than combine, but that's not a fair comparison >> when it doesn't do everything combine does. Running both passes >> only has a slight overhead. > > And amount of garbage produced? If -ftime-report stats are accurate, then the total amount of memory allocated is: run-combine=2 (normal combine): 1793 kB run-combine=4 (new pass only): 98 kB run-combine=6 (both passes): 1871 kB (new pass accounts for 78 kB) But again that's not a fair comparison when the main combine pass does more. I did try hard to keep the amount of garbage rtl down though. This is why I added validate_simplify_replace_rtx rather than trying to make do with existing routines. It should only create new rtl if the simplification routines did something useful. (Of course, that's mostly true of combine as well, but things like the make_compound_operation/ expand_compound_operation wrangler can create expressions that are never actually useful.) >> To get a feel for the effect on multiple targets, I did my usual >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg >> and g++.dg, this time comparing run-combine=2 and run-combine=6 >> using -O2 -ftree-vectorize: > > One problem with this is that these are very short functions on average. There are some long ones too :-) > What is the kind of changes you see for other targets? On powerpc64le-linux-gnu it mostly comes from eliminating comparisons in favour of other flag-setting instructions and making more use of post-increments. Not sure the last one is actually a win, but the target costs say it's OK :-). E.g. from gcc.c-torture/execute/pr78675.c: @@ -48,9 +48,8 @@ blr .align 4 .L19: - cmpdi 0,10,0 + mr. 9,10 mr 3,8 - mr 9,10 bne 0,.L9 b .L3 .align 4 and a slightly more interesting example in gcc.c-torture/execute/loop-6.c: @@ -16,24 +16,22 @@ mflr 0 li 9,50 mtctr 9 - li 8,1 + li 10,1 li 7,1 std 0,16(1) stdu 1,-32(1) .LCFI0: .L2: - addi 10,8,1 - extsw 8,10 + addi 9,10,1 + extsw 10,9 bdz .L11 - slw 9,7,10 - rlwinm 9,9,0,0xff - cmpwi 0,9,0 + slw 8,7,9 + andi. 9,8,0xff beq 0,.L3 - addi 10,8,1 - slw 9,7,10 - extsw 8,10 - rlwinm 9,9,0,0xff - cmpwi 0,9,0 + addi 9,10,1 + slw 8,7,9 + extsw 10,9 + andi. 9,8,0xff bne 0,.L2 .L3: li 3,0 gcc.c-torture/execute/20081218-1.c is an example where we make more use of post-increment: .L9: - lbz 10,1(9) - addi 9,9,1 + lbzu 10,1(9) cmpwi 0,10,38 bne 0,.L8 - lbz 10,1(9) - addi 9,9,1 + lbzu 10,1(9) cmpwi 0,10,38 bne 0,.L8 bdnz .L9 The changes for s390x-linux-gnu are also often flag-related. E.g. gcc.c-torture/execute/pr68624.c: @@ -27,9 +27,8 @@ .L9: larl %r2,d larl %r3,.LANCHOR0 - l %r2,0(%r2) + icm %r2,15,0(%r2) st %r2,0(%r3) - ltr %r2,%r2 jne .L11 lhi %r2,-4 st %r2,0(%r1) where we move the flag-setting up and gcc.c-torture/execute/20050826-2.c: @@ -62,8 +62,7 @@ lgr %r3,%r9 lghi %r2,0 brasl %r14,inet_check_attr - ltr %r2,%r2 - lr %r12,%r2 + ltr %r12,%r2 jne .L16 lgr %r1,%r9 lhi %r3,-7 where we eliminate a separate move, like in the first powerpc64le example above. > Wait, does this combine sets with a hard reg source as well? It > shouldn't do that, that is RA's job; doing this in a greedy way is a > bad idea. (I haven't yet verified if you do this, fwiw). No: /* Mimic combine's behavior by not combining moves from allocatable hard registers (e.g. when copying parameters or function return values). */ if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) return false; Although if that could have accounted for the difference, it sounds like we're leaving a lot on the table by doing this :-) >> Inevitably there was some scan-assembler fallout for other tests. >> E.g. in gcc.target/aarch64/vmov_n_1.c: >> >> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory") >> ... >> INHIB_OPTIMIZATION; \ >> (a) = TEST (test, data_len); \ >> INHIB_OPTIMIZATION; \ >> (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a)); \ >> >> is no longer effective for preventing move (a) from being merged >> into (b), because the pass can merge at the point of (a). > > It never was effective for that. Unless (b) lives in memory, in which > case your new pass has a bug here. The target of the vmov is a register. >> I think >> this is a valid thing to do -- the asm semantics are still satisfied, >> and asm volatile ("" : : : "memory") never acted as a register barrier. >> But perhaps we should deal with this as a special case? > > I don't think we should, no. What does "register barrier" even mean, > *exactly*? Yeah, agree with you and Andrew that we shouldn't, was just checking that there was agreement. Thanks, Richard
On Tue, Nov 19, 2019 at 11:33:13AM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote: > >> While working on SVE, I've noticed several cases in which we fail > >> to combine instructions because the combined form would need to be > >> placed earlier in the instruction stream than the last of the > >> instructions being combined. This includes one very important > >> case in the handling of the first fault register (FFR). > > > > Do you have an example of that? > > It's difficult to share realistic examples at this stage since this > isn't really the right forum for making them public for the first time. Oh I'm very sorry. In the future, just say "Future" and I know what you mean :-) > /* Make sure that the value that is to be substituted for the register > does not use any registers whose values alter in between. However, > If the insns are adjacent, a use can't cross a set even though we > think it might (this can happen for a sequence of insns each setting > the same destination; last_set of that register might point to > a NOTE). If INSN has a REG_EQUIV note, the register is always > equivalent to the memory so the substitution is valid even if there > are intervening stores. Also, don't move a volatile asm or > UNSPEC_VOLATILE across any other insns. */ > || (! all_adjacent > && (((!MEM_P (src) > || ! find_reg_note (insn, REG_EQUIV, src)) > && modified_between_p (src, insn, i3)) > || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src)) > || GET_CODE (src) == UNSPEC_VOLATILE)) So this would work if you had pseudos here, instead of the hard reg? Because it is a hard reg it is the same number in both places, making it hard to move. > > How are dependencies represented in your new pass? If it just does > > walks over the insn stream for everything, you get quadratic complexity > > if you move insns backwards. We have that in combine already, mostly > > from modified_between_p, but that is limited because of how LOG_LINKS > > work, and we have been doing this for so long and there are no problems > > found with it, so it must work in practice. But I am worried about it > > when moving insns back an unlimited distance. > > It builds def-use chains, but using a constant limit on the number of > explicitly-recorded uses. All other uses go in a numerical live range > from which they (conservatively) never escape. The def-use chains > represent memory as a single entity, a bit like in gimple. Ah. So that range thing ensures correctness. Why don't you use DF for the DU chains? > >> - it tries using REG_EQUAL notes for the final instruction. > > > > And that. > > I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3 > with i3's REG_EQUAL note and combining into that. Does combine do that? > I couldn't see it, and in: > > https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html > > you seemed to reject the idea of allowing it. Yes, I still do. Do you have an example where it helps? > >> - it can parallelise two independent instructions that both read from > >> the same register or both read from memory. > > > > That only if somehow there is a link between the two (so essentially > > never). The only combinations tried by combine are those via LOG_LINKs, > > which are between a SET and the first corresponding use. This is a key > > factor that makes it kind of linear (instead of exponential) complexity. > > Tracking limited def-use chains is what makes this last bit easy. > We can just try parallelising two instructions from the (bounded) list > of uses. And for this case there's not any garbage rtl involved, since > we reuse the same PARALLEL rtx between attempts. The cost is basically > all in the recog call (which would obviously mount up if we went > overboard). *All* examples above and below are just this. If you disable everything else, what do the statistics look like then? > > One thing I want to do is some mini-combine after every split, probably > > only with the insns new from the split. But we have no cfglayout mode > > anymore then, and only hard regs (except in the first split pass, which > > is just a little later than your new pass). > > Yeah, sounds like it could be useful. I guess there'd need to be > an extra condition on the combination that the new insn can't be > immediately split. It would run *after* split. Not interleaved with it. > > And amount of garbage produced? > > If -ftime-report stats are accurate, then the total amount of > memory allocated is: > > run-combine=2 (normal combine): 1793 kB > run-combine=4 (new pass only): 98 kB > run-combine=6 (both passes): 1871 kB (new pass accounts for 78 kB) > > But again that's not a fair comparison when the main combine pass does more. The way combine does SUBST is pretty fundamental to how it works (it can be ripped out, and probably we'll have to at some point, but that will be very invasive). Originally all this temporary RTL was on obstacks and reaping it was cheap, but everything is GCed now (fixing the bugs was not cheap :-) ) If you look at even really bad cases, combine is still only a few percent of total, so it isn't too bad. > I did try hard to keep the amount of garbage rtl down though. This is > why I added validate_simplify_replace_rtx rather than trying to make > do with existing routines. It should only create new rtl if the > simplification routines did something useful. (Of course, that's mostly > true of combine as well, but things like the make_compound_operation/ > expand_compound_operation wrangler can create expressions that are never > actually useful.) Don't mention those, thanks :-) > >> To get a feel for the effect on multiple targets, I did my usual > >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > >> and g++.dg, this time comparing run-combine=2 and run-combine=6 > >> using -O2 -ftree-vectorize: > > > > One problem with this is that these are very short functions on average. > > There are some long ones too :-) Yes, but this isn't a good stand-in for representative programs. > > What is the kind of changes you see for other targets? > > On powerpc64le-linux-gnu it mostly comes from eliminating comparisons > in favour of other flag-setting instructions and making more use of > post-increments. Not sure the last one is actually a win, but the > target costs say it's OK :-). E.g. from gcc.c-torture/execute/pr78675.c: > > @@ -48,9 +48,8 @@ > blr > .align 4 > .L19: > - cmpdi 0,10,0 > + mr. 9,10 > mr 3,8 > - mr 9,10 > bne 0,.L9 > b .L3 > .align 4 Okay, so this combining two uses of r10 into one insn. This isn't necessarily a good idea: the combined insn cannot be moved as much as one of its components could, which can also immediately prevent further combinations. But doing this after combine, as you do, is probably beneficial. > and a slightly more interesting example in gcc.c-torture/execute/loop-6.c: This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff because this is cheaper on p7 and p8). > gcc.c-torture/execute/20081218-1.c is an example where we make more use > of post-increment: > > .L9: > - lbz 10,1(9) > - addi 9,9,1 > + lbzu 10,1(9) > cmpwi 0,10,38 > bne 0,.L8 > - lbz 10,1(9) > - addi 9,9,1 > + lbzu 10,1(9) > cmpwi 0,10,38 > bne 0,.L8 > bdnz .L9 Pre-increment (we only *have* pre-modify memory accesses). > /* Mimic combine's behavior by not combining moves from allocatable hard > registers (e.g. when copying parameters or function return values). */ > if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) > return false; > > Although if that could have accounted for the difference, it sounds like > we're leaving a lot on the table by doing this :-) It actually helps (and quite a bit). But if your test cases are mainly tiny functions, anything can happen. But since you see this across all targets, it must be doing something good :-) So I'd love to see statistics for *only* combining two uses of the same thing, this is something combine cannot do, and arguably *shouldn't* do! Segher
Segher Boessenkool <segher@kernel.crashing.org> writes: >> /* Make sure that the value that is to be substituted for the register >> does not use any registers whose values alter in between. However, >> If the insns are adjacent, a use can't cross a set even though we >> think it might (this can happen for a sequence of insns each setting >> the same destination; last_set of that register might point to >> a NOTE). If INSN has a REG_EQUIV note, the register is always >> equivalent to the memory so the substitution is valid even if there >> are intervening stores. Also, don't move a volatile asm or >> UNSPEC_VOLATILE across any other insns. */ >> || (! all_adjacent >> && (((!MEM_P (src) >> || ! find_reg_note (insn, REG_EQUIV, src)) >> && modified_between_p (src, insn, i3)) >> || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src)) >> || GET_CODE (src) == UNSPEC_VOLATILE)) > > So this would work if you had pseudos here, instead of the hard reg? > Because it is a hard reg it is the same number in both places, making it > hard to move. Yeah, probably. But the hard reg is a critical part of this. Going back to the example: (set (reg/v:VNx16BI 102 [ ok ]) (reg:VNx16BI 85 ffrt)) (set (reg:VNx16BI 85 ffrt) (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT)) (set (reg:CC_NZC 66 cc) (unspec:CC_NZC [(reg:VNx16BI 106) repeated x2 (const_int 1 [0x1]) (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST)) FFR is the real first fault register. FFRT is actually a fake register whose only purpose is to describe the dependencies (in rtl) between writes to the FFR, reads from the FFR and first-faulting loads. The whole scheme depends on having only one fixed FFRT register. >> > How are dependencies represented in your new pass? If it just does >> > walks over the insn stream for everything, you get quadratic complexity >> > if you move insns backwards. We have that in combine already, mostly >> > from modified_between_p, but that is limited because of how LOG_LINKS >> > work, and we have been doing this for so long and there are no problems >> > found with it, so it must work in practice. But I am worried about it >> > when moving insns back an unlimited distance. >> >> It builds def-use chains, but using a constant limit on the number of >> explicitly-recorded uses. All other uses go in a numerical live range >> from which they (conservatively) never escape. The def-use chains >> represent memory as a single entity, a bit like in gimple. > > Ah. So that range thing ensures correctness. Yeah. > Why don't you use DF for the DU chains? The problem with DF_DU_CHAIN is that it's quadratic in the worst case. fwprop.c gets around that by using the MD problem and having its own dominator walker to calculate limited def-use chains: /* We use the multiple definitions problem to compute our restricted use-def chains. */ So taking that approach here would still require some amount of roll-your-own. Other reasons are: * Even what fwprop does is more elaborate than we need for now. * We need to handle memory too, and it's nice to be able to handle it in the same way as registers. * Updating a full, ordered def-use chain after a move is a linear-time operation, so whatever happens, we'd need to apply some kind of limit on the number of uses we maintain, with something like that integer point range for the rest. * Once we've analysed the insn and built its def-use chains, we don't look at the df_refs again until we update the chains after a successful combination. So it should be more efficient to maintain a small array of insn_info_rec pointers alongside the numerical range, rather than walk and pollute chains of df_refs and then link back the insn uids to the pass-local info. >> >> - it tries using REG_EQUAL notes for the final instruction. >> > >> > And that. >> >> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3 >> with i3's REG_EQUAL note and combining into that. Does combine do that? >> I couldn't see it, and in: >> >> https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html >> >> you seemed to reject the idea of allowing it. > > Yes, I still do. Do you have an example where it helps? I'll run another set of tests for that. >> >> - it can parallelise two independent instructions that both read from >> >> the same register or both read from memory. >> > >> > That only if somehow there is a link between the two (so essentially >> > never). The only combinations tried by combine are those via LOG_LINKs, >> > which are between a SET and the first corresponding use. This is a key >> > factor that makes it kind of linear (instead of exponential) complexity. >> >> Tracking limited def-use chains is what makes this last bit easy. >> We can just try parallelising two instructions from the (bounded) list >> of uses. And for this case there's not any garbage rtl involved, since >> we reuse the same PARALLEL rtx between attempts. The cost is basically >> all in the recog call (which would obviously mount up if we went >> overboard). > > *All* examples above and below are just this. Yeah, the powerpc and s390x examples were. The motivating FFR example above isn't though: it's a def-use combination in parallel with the existing definition. > If you disable everything else, what do the statistics look like then? Had no idea how this would turn out -- which is a good sign it was worth doing -- but: results below. >> > One thing I want to do is some mini-combine after every split, probably >> > only with the insns new from the split. But we have no cfglayout mode >> > anymore then, and only hard regs (except in the first split pass, which >> > is just a little later than your new pass). >> >> Yeah, sounds like it could be useful. I guess there'd need to be >> an extra condition on the combination that the new insn can't be >> immediately split. > > It would run *after* split. Not interleaved with it. Yeah. But what I meant was: a lot of insns that are split after reload are combined for RA purposes and the split form is really the preferred form (especially for scheduling). So if we have a combine pass *after* split, I think it should avoid using any combination that matches a split. >> > And amount of garbage produced? >> >> If -ftime-report stats are accurate, then the total amount of >> memory allocated is: >> >> run-combine=2 (normal combine): 1793 kB >> run-combine=4 (new pass only): 98 kB >> run-combine=6 (both passes): 1871 kB (new pass accounts for 78 kB) >> >> But again that's not a fair comparison when the main combine pass does more. > > The way combine does SUBST is pretty fundamental to how it works (it can > be ripped out, and probably we'll have to at some point, but that will be > very invasive). Originally all this temporary RTL was on obstacks and > reaping it was cheap, but everything is GCed now (fixing the bugs was not > cheap :-) ) Yeah, I remember :-) > If you look at even really bad cases, combine is still only a few > percent of total, so it isn't too bad. > >> I did try hard to keep the amount of garbage rtl down though. This is >> why I added validate_simplify_replace_rtx rather than trying to make >> do with existing routines. It should only create new rtl if the >> simplification routines did something useful. (Of course, that's mostly >> true of combine as well, but things like the make_compound_operation/ >> expand_compound_operation wrangler can create expressions that are never >> actually useful.) > > Don't mention those, thanks :-) > >> >> To get a feel for the effect on multiple targets, I did my usual >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6 >> >> using -O2 -ftree-vectorize: >> > >> > One problem with this is that these are very short functions on average. >> >> There are some long ones too :-) > > Yes, but this isn't a good stand-in for representative programs. Right. And number of lines of asm isn't a good stand-in for anything much. Like I say, the whole thing is just to get a feel, on tests that are readily to hand and are easy to compile without a full toolchain. >> > What is the kind of changes you see for other targets? >> >> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons >> in favour of other flag-setting instructions and making more use of >> post-increments. Not sure the last one is actually a win, but the >> target costs say it's OK :-). E.g. from gcc.c-torture/execute/pr78675.c: >> >> @@ -48,9 +48,8 @@ >> blr >> .align 4 >> .L19: >> - cmpdi 0,10,0 >> + mr. 9,10 >> mr 3,8 >> - mr 9,10 >> bne 0,.L9 >> b .L3 >> .align 4 > > Okay, so this combining two uses of r10 into one insn. > > This isn't necessarily a good idea: the combined insn cannot be moved as > much as one of its components could, which can also immediately prevent > further combinations. > > But doing this after combine, as you do, is probably beneficial. > >> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c: > > This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff > because this is cheaper on p7 and p8). > >> gcc.c-torture/execute/20081218-1.c is an example where we make more use >> of post-increment: >> >> .L9: >> - lbz 10,1(9) >> - addi 9,9,1 >> + lbzu 10,1(9) >> cmpwi 0,10,38 >> bne 0,.L8 >> - lbz 10,1(9) >> - addi 9,9,1 >> + lbzu 10,1(9) >> cmpwi 0,10,38 >> bne 0,.L8 >> bdnz .L9 > > Pre-increment (we only *have* pre-modify memory accesses). Oops, yes. >> /* Mimic combine's behavior by not combining moves from allocatable hard >> registers (e.g. when copying parameters or function return values). */ >> if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) >> return false; >> >> Although if that could have accounted for the difference, it sounds like >> we're leaving a lot on the table by doing this :-) > > It actually helps (and quite a bit). But if your test cases are mainly > tiny functions, anything can happen. But since you see this across all > targets, it must be doing something good :-) > > > So I'd love to see statistics for *only* combining two uses of the same > thing, this is something combine cannot do, and arguably *shouldn't* do! OK, here are two sets of results. The first is for: (A) --param run-combine=2 (current combine only) (B) --param run-combine=6 (both passes), use-use combinations only Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== aarch64-linux-gnu 158 3060 -72 520 -1 aarch64_be-linux-gnu 111 24 -57 324 -1 alpha-linux-gnu 3 3 1 1 1 amdgcn-amdhsa 18 71 -17 26 1 arc-elf 310 -4414 -1516 356 1 arm-linux-gnueabi 28 -50 -13 3 -1 arm-linux-gnueabihf 28 -50 -13 3 -1 avr-elf 26 308 -1 36 12 bfin-elf 6 8 -1 3 1 bpf-elf 10 21 -1 6 1 c6x-elf 7 9 -6 6 1 cr16-elf 13 102 1 27 2 cris-elf 35 -1001 -700 3 -2 csky-elf 9 28 1 6 2 epiphany-elf 29 -29 -2 1 -1 fr30-elf 12 17 -1 5 1 frv-linux-gnu 1 -2 -2 -2 -2 ft32-elf 10 22 -1 5 2 h8300-elf 29 56 -22 14 2 hppa64-hp-hpux11.23 9 17 -1 4 2 i686-apple-darwin 10 -33 -20 12 -2 i686-pc-linux-gnu 41 243 -12 33 3 ia64-linux-gnu 28 -32 -29 39 -4 iq2000-elf 6 8 1 2 1 lm32-elf 10 12 -3 5 1 m32r-elf 3 2 -2 2 2 m68k-linux-gnu 19 27 -2 5 2 mcore-elf 14 23 -10 6 2 microblaze-elf 5 5 1 1 1 mipsel-linux-gnu 9 12 -5 6 2 mipsisa64-linux-gnu 7 1 -3 1 1 mmix 6 6 -2 4 1 mn10300-elf 20 15 -4 5 1 moxie-rtems 8 11 -2 3 1 msp430-elf 8 24 1 6 2 nds32le-elf 91 -188 -24 136 -1 nios2-linux-gnu 2 6 1 5 1 nvptx-none 396 756 1 16 1 or1k-elf 8 20 1 4 2 pdp11 65 149 -10 45 2 powerpc-ibm-aix7.0 1039 1114 -366 2124 -1 powerpc64-linux-gnu 854 2753 -274 3094 -2 powerpc64le-linux-gnu 648 -551 -340 208 -1 pru-elf 5 5 -2 3 1 riscv32-elf 7 6 -2 5 1 riscv64-elf 2 5 2 3 2 rl78-elf 80 -648 -98 13 -4 rx-elf 16 2 -4 5 -1 s390-linux-gnu 60 -174 -39 14 -1 s390x-linux-gnu 152 -781 -159 14 -1 sh-linux-gnu 13 5 -15 7 1 sparc-linux-gnu 29 7 -3 11 -1 sparc64-linux-gnu 51 1 -8 15 -1 tilepro-linux-gnu 119 -567 -164 15 -2 v850-elf 4 4 -1 3 1 vax-netbsdelf 10 13 -4 5 1 visium-elf 4 0 -5 3 1 x86_64-darwin 7 -12 -9 4 -2 x86_64-linux-gnu 6 -11 -6 4 -2 xstormy16-elf 10 13 1 2 1 xtensa-elf 6 8 -1 2 2 which definitely shows up some outliers I need to look at. The second set is for: (B) --param run-combine=6 (both passes), use-use combinations only (C) --param run-combine=6 (both passes), no restrictions Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== aarch64-linux-gnu 272 -3844 -585 18 -1 aarch64_be-linux-gnu 190 -3336 -370 18 -1 alpha-linux-gnu 401 -2735 -370 22 -2 amdgcn-amdhsa 188 1867 -484 1259 -1 arc-elf 257 -1498 -650 54 -1 arm-linux-gnueabi 168 -1117 -612 680 -1 arm-linux-gnueabihf 168 -1117 -612 680 -1 avr-elf 1341 -111401 -13824 680 -10 bfin-elf 1346 -18950 -8461 465 -2 bpf-elf 63 -496 -60 3 -2 c6x-elf 179 -10527 -10084 41 -2 cr16-elf 1616 -51479 -10657 42 -13 cris-elf 113 -533 -84 4 -2 csky-elf 129 -3399 -474 1 -2 epiphany-elf 151 -375 -149 84 -1 fr30-elf 155 -1773 -756 289 -2 frv-linux-gnu 808 -13332 -2074 67 -1 ft32-elf 276 -1688 -111 -1 -2 h8300-elf 527 -11522 -1747 68 -3 hppa64-hp-hpux11.23 179 -865 -142 34 -1 i686-apple-darwin 335 -1266 -56 44 -1 i686-pc-linux-gnu 222 -2216 -556 32 -1 ia64-linux-gnu 122 -4793 -1134 40 -5 iq2000-elf 171 -1341 -61 3 -2 lm32-elf 187 -1814 -316 47 -2 m32r-elf 70 -597 -98 11 -2 m68k-linux-gnu 197 -2375 -332 148 -2 mcore-elf 125 -1236 -146 7 -1 microblaze-elf 442 -4498 -2094 32 -2 mipsel-linux-gnu 125 -2050 -222 60 -2 mipsisa64-linux-gnu 107 -2015 -130 14 -2 mmix 103 -239 -26 4 -1 mn10300-elf 215 -1039 -234 80 -1 moxie-rtems 149 -754 -79 4 -2 msp430-elf 180 -600 -63 19 -1 nds32le-elf 183 -287 -37 32 -1 nios2-linux-gnu 81 -329 -66 4 -1 nvptx-none 200 -1882 -208 -2 -2 or1k-elf 57 -317 -25 2 -1 pdp11 207 -1441 -182 83 -2 powerpc-ibm-aix7.0 400 -4145 -271 14 -2 powerpc64-linux-gnu 375 -2062 -160 117 -2 powerpc64le-linux-gnu 491 -4169 -700 156 -2 pru-elf 47 -7020 -6921 6 -1 riscv32-elf 59 -1379 -139 7 -2 riscv64-elf 89 -1562 -264 7 -1 rl78-elf 289 -16157 -1665 42 -6 rx-elf 82 -195 -53 8 -1 s390-linux-gnu 128 -2108 -1485 63 -1 s390x-linux-gnu 112 418 -32 522 -1 sh-linux-gnu 218 -410 -108 68 -1 sparc-linux-gnu 141 -866 -99 18 -1 sparc64-linux-gnu 129 -792 -102 3 -2 tilepro-linux-gnu 953 -4331 -297 332 -2 v850-elf 50 -412 -53 2 -3 vax-netbsdelf 254 -3328 -400 4 -2 visium-elf 100 -693 -138 16 -1 x86_64-darwin 345 -2134 -490 72 -1 x86_64-linux-gnu 307 -843 -288 210 -1 xstormy16-elf 218 -788 -156 59 -1 xtensa-elf 195 -1426 -322 36 1 So the main benefit does seem to come from the def-use part. Here are some powerpc64le-linux-gnu examples of (B)->(C): gcc.c-torture/execute/20171008-1.c: @@ -79,8 +79,7 @@ stdu 1,-32(1) .LCFI5: bl foo - rlwinm 3,3,0,0xff - cmpwi 0,3,0 + andi. 9,3,0xff bne 0,.L13 addi 1,1,32 gcc.c-torture/execute/pr28982a.c: @@ -427,15 +427,13 @@ stxvd2x 0,7,6 .align 4 .L9: - xxlor 12,32,32 + xvcvsxwsp 0,32 vadduwm 0,0,1 addi 10,9,16 - xvcvsxwsp 0,12 - xxlor 12,32,32 + stxvd2x 0,0,9 + xvcvsxwsp 0,32 + addi 9,9,32 vadduwm 0,0,1 - stxvd2x 0,0,9 - xvcvsxwsp 0,12 - addi 9,9,32 stxvd2x 0,0,10 bdnz .L9 li 3,4 (Disclaimer: I have no idea if that's correct.) gcc.c-torture/execute/pr65215-3.c: @@ -56,11 +56,10 @@ srdi 10,3,32 srdi 9,3,56 slwi 6,10,24 - srwi 7,10,8 + rlwinm 7,10,24,16,23 or 9,9,6 - rlwinm 7,7,0,16,23 + rlwinm 10,10,8,8,15 or 9,9,7 - rlwinm 10,10,8,8,15 or 9,9,10 cmpw 0,9,8 bne 0,.L4 Just to emphasise though: I'm not proposing that we switch this on for all targets yet. It would be opt-in until the pass is more mature. But that FFR case is really important for the situation it handles. Thanks, Richard
On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > So this would work if you had pseudos here, instead of the hard reg? > > Because it is a hard reg it is the same number in both places, making it > > hard to move. > > Yeah, probably. But the hard reg is a critical part of this. > Going back to the example: > > (set (reg/v:VNx16BI 102 [ ok ]) > (reg:VNx16BI 85 ffrt)) > (set (reg:VNx16BI 85 ffrt) > (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT)) > (set (reg:CC_NZC 66 cc) > (unspec:CC_NZC > [(reg:VNx16BI 106) repeated x2 > (const_int 1 [0x1]) > (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST)) > > FFR is the real first fault register. FFRT is actually a fake register > whose only purpose is to describe the dependencies (in rtl) between writes > to the FFR, reads from the FFR and first-faulting loads. The whole scheme > depends on having only one fixed FFRT register. Right. The reason this cannot work in combine is that combine always combines to just *one* insn, at i3; later, if it turns out that it needs to split it, it can put something at i2. But that doesn't even happen here, only the first and the last of those three insns are what is combined. It is important combine only moves things forward in the insn stream, to make sure this whole process is finite. Or this was true years ago, at least :-) > > Why don't you use DF for the DU chains? > > The problem with DF_DU_CHAIN is that it's quadratic in the worst case. Oh, wow. > fwprop.c gets around that by using the MD problem and having its own > dominator walker to calculate limited def-use chains: > > /* We use the multiple definitions problem to compute our restricted > use-def chains. */ It's not great if every pass invents its own version of some common infrastructure thing because that common one is not suitable. I.e., can this be fixed somehow? Maybe just by having a restricted DU chains df problem? > So taking that approach here would still require some amount of > roll-your-own. Other reasons are: > > * Even what fwprop does is more elaborate than we need for now. > > * We need to handle memory too, and it's nice to be able to handle > it in the same way as registers. > > * Updating a full, ordered def-use chain after a move is a linear-time > operation, so whatever happens, we'd need to apply some kind of limit > on the number of uses we maintain, with something like that integer > point range for the rest. > > * Once we've analysed the insn and built its def-use chains, we don't > look at the df_refs again until we update the chains after a successful > combination. So it should be more efficient to maintain a small array > of insn_info_rec pointers alongside the numerical range, rather than > walk and pollute chains of df_refs and then link back the insn uids > to the pass-local info. So you need something like combine's LOG_LINKS? Not that handling those is not quadratic in the worst case, but in practice it works well. And it *could* be made linear. > >> Tracking limited def-use chains is what makes this last bit easy. > >> We can just try parallelising two instructions from the (bounded) list > >> of uses. And for this case there's not any garbage rtl involved, since > >> we reuse the same PARALLEL rtx between attempts. The cost is basically > >> all in the recog call (which would obviously mount up if we went > >> overboard). > > > > *All* examples above and below are just this. > > Yeah, the powerpc and s390x examples were. The motivating FFR example > above isn't though: it's a def-use combination in parallel with the > existing definition. Right, good point :-) > >> >> To get a feel for the effect on multiple targets, I did my usual > >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6 > >> >> using -O2 -ftree-vectorize: > >> > > >> > One problem with this is that these are very short functions on average. > >> > >> There are some long ones too :-) > > > > Yes, but this isn't a good stand-in for representative programs. > > Right. And number of lines of asm isn't a good stand-in for anything much. For combine, number of insns generated is a surprisingly good measure of how it performed. Sometimes not, when it goes over a border of an inlining decision, say, or bb-reorder decides to duplicate more because it is cheaper now. > Like I say, the whole thing is just to get a feel, on tests that are readily > to hand and are easy to compile without a full toolchain. Absolutely. But I have no experience with using your test set, so the numbers do not necessarily mean so much to me :-) > > So I'd love to see statistics for *only* combining two uses of the same > > thing, this is something combine cannot do, and arguably *shouldn't* do! > > OK, here are two sets of results. The first is for: > > (A) --param run-combine=2 (current combine only) > (B) --param run-combine=6 (both passes), use-use combinations only > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 158 3060 -72 520 -1 > aarch64_be-linux-gnu 111 24 -57 324 -1 > alpha-linux-gnu 3 3 1 1 1 > amdgcn-amdhsa 18 71 -17 26 1 > arc-elf 310 -4414 -1516 356 1 > arm-linux-gnueabi 28 -50 -13 3 -1 > arm-linux-gnueabihf 28 -50 -13 3 -1 > avr-elf 26 308 -1 36 12 > bfin-elf 6 8 -1 3 1 > bpf-elf 10 21 -1 6 1 > c6x-elf 7 9 -6 6 1 > cr16-elf 13 102 1 27 2 > cris-elf 35 -1001 -700 3 -2 > csky-elf 9 28 1 6 2 > epiphany-elf 29 -29 -2 1 -1 > fr30-elf 12 17 -1 5 1 > frv-linux-gnu 1 -2 -2 -2 -2 > ft32-elf 10 22 -1 5 2 > h8300-elf 29 56 -22 14 2 > hppa64-hp-hpux11.23 9 17 -1 4 2 > i686-apple-darwin 10 -33 -20 12 -2 > i686-pc-linux-gnu 41 243 -12 33 3 > ia64-linux-gnu 28 -32 -29 39 -4 > iq2000-elf 6 8 1 2 1 > lm32-elf 10 12 -3 5 1 > m32r-elf 3 2 -2 2 2 > m68k-linux-gnu 19 27 -2 5 2 > mcore-elf 14 23 -10 6 2 > microblaze-elf 5 5 1 1 1 > mipsel-linux-gnu 9 12 -5 6 2 > mipsisa64-linux-gnu 7 1 -3 1 1 > mmix 6 6 -2 4 1 > mn10300-elf 20 15 -4 5 1 > moxie-rtems 8 11 -2 3 1 > msp430-elf 8 24 1 6 2 > nds32le-elf 91 -188 -24 136 -1 > nios2-linux-gnu 2 6 1 5 1 > nvptx-none 396 756 1 16 1 > or1k-elf 8 20 1 4 2 > pdp11 65 149 -10 45 2 > powerpc-ibm-aix7.0 1039 1114 -366 2124 -1 > powerpc64-linux-gnu 854 2753 -274 3094 -2 > powerpc64le-linux-gnu 648 -551 -340 208 -1 > pru-elf 5 5 -2 3 1 > riscv32-elf 7 6 -2 5 1 > riscv64-elf 2 5 2 3 2 > rl78-elf 80 -648 -98 13 -4 > rx-elf 16 2 -4 5 -1 > s390-linux-gnu 60 -174 -39 14 -1 > s390x-linux-gnu 152 -781 -159 14 -1 > sh-linux-gnu 13 5 -15 7 1 > sparc-linux-gnu 29 7 -3 11 -1 > sparc64-linux-gnu 51 1 -8 15 -1 > tilepro-linux-gnu 119 -567 -164 15 -2 > v850-elf 4 4 -1 3 1 > vax-netbsdelf 10 13 -4 5 1 > visium-elf 4 0 -5 3 1 > x86_64-darwin 7 -12 -9 4 -2 > x86_64-linux-gnu 6 -11 -6 4 -2 > xstormy16-elf 10 13 1 2 1 > xtensa-elf 6 8 -1 2 2 > > which definitely shows up some outliers I need to look at. Yeah, huh, it's all over the map. > The second set is for: > > (B) --param run-combine=6 (both passes), use-use combinations only > (C) --param run-combine=6 (both passes), no restrictions > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 272 -3844 -585 18 -1 > aarch64_be-linux-gnu 190 -3336 -370 18 -1 > alpha-linux-gnu 401 -2735 -370 22 -2 > amdgcn-amdhsa 188 1867 -484 1259 -1 > arc-elf 257 -1498 -650 54 -1 > arm-linux-gnueabi 168 -1117 -612 680 -1 > arm-linux-gnueabihf 168 -1117 -612 680 -1 > avr-elf 1341 -111401 -13824 680 -10 Things like this are kind of suspicious :-) > bfin-elf 1346 -18950 -8461 465 -2 > bpf-elf 63 -496 -60 3 -2 > c6x-elf 179 -10527 -10084 41 -2 > cr16-elf 1616 -51479 -10657 42 -13 > cris-elf 113 -533 -84 4 -2 > csky-elf 129 -3399 -474 1 -2 > epiphany-elf 151 -375 -149 84 -1 > fr30-elf 155 -1773 -756 289 -2 > frv-linux-gnu 808 -13332 -2074 67 -1 > ft32-elf 276 -1688 -111 -1 -2 > h8300-elf 527 -11522 -1747 68 -3 > hppa64-hp-hpux11.23 179 -865 -142 34 -1 > i686-apple-darwin 335 -1266 -56 44 -1 > i686-pc-linux-gnu 222 -2216 -556 32 -1 > ia64-linux-gnu 122 -4793 -1134 40 -5 > iq2000-elf 171 -1341 -61 3 -2 > lm32-elf 187 -1814 -316 47 -2 > m32r-elf 70 -597 -98 11 -2 > m68k-linux-gnu 197 -2375 -332 148 -2 > mcore-elf 125 -1236 -146 7 -1 > microblaze-elf 442 -4498 -2094 32 -2 > mipsel-linux-gnu 125 -2050 -222 60 -2 > mipsisa64-linux-gnu 107 -2015 -130 14 -2 > mmix 103 -239 -26 4 -1 > mn10300-elf 215 -1039 -234 80 -1 > moxie-rtems 149 -754 -79 4 -2 > msp430-elf 180 -600 -63 19 -1 > nds32le-elf 183 -287 -37 32 -1 > nios2-linux-gnu 81 -329 -66 4 -1 > nvptx-none 200 -1882 -208 -2 -2 > or1k-elf 57 -317 -25 2 -1 > pdp11 207 -1441 -182 83 -2 > powerpc-ibm-aix7.0 400 -4145 -271 14 -2 > powerpc64-linux-gnu 375 -2062 -160 117 -2 > powerpc64le-linux-gnu 491 -4169 -700 156 -2 > pru-elf 47 -7020 -6921 6 -1 > riscv32-elf 59 -1379 -139 7 -2 > riscv64-elf 89 -1562 -264 7 -1 > rl78-elf 289 -16157 -1665 42 -6 > rx-elf 82 -195 -53 8 -1 > s390-linux-gnu 128 -2108 -1485 63 -1 > s390x-linux-gnu 112 418 -32 522 -1 > sh-linux-gnu 218 -410 -108 68 -1 > sparc-linux-gnu 141 -866 -99 18 -1 > sparc64-linux-gnu 129 -792 -102 3 -2 > tilepro-linux-gnu 953 -4331 -297 332 -2 > v850-elf 50 -412 -53 2 -3 > vax-netbsdelf 254 -3328 -400 4 -2 > visium-elf 100 -693 -138 16 -1 > x86_64-darwin 345 -2134 -490 72 -1 > x86_64-linux-gnu 307 -843 -288 210 -1 > xstormy16-elf 218 -788 -156 59 -1 > xtensa-elf 195 -1426 -322 36 1 > > So the main benefit does seem to come from the def-use part. > > Here are some powerpc64le-linux-gnu examples of (B)->(C): > > gcc.c-torture/execute/20171008-1.c: > > @@ -79,8 +79,7 @@ > stdu 1,-32(1) > .LCFI5: > bl foo > - rlwinm 3,3,0,0xff > - cmpwi 0,3,0 > + andi. 9,3,0xff > bne 0,.L13 > addi 1,1,32 Soo this starts as insn_cost 4 for 6: r118:SI=r124:SI REG_DEAD r124:SI insn_cost 4 for 8: r121:SI=zero_extend(r118:SI#0) REG_DEAD r118:SI insn_cost 4 for 9: r122:CC=cmp(r121:SI,0) REG_DEAD r121:SI and then it combines 6->8 of course, but then Trying 8 -> 9: 8: r121:SI=zero_extend(r124:SI#0) REG_DEAD r124:SI 9: r122:CC=cmp(r121:SI,0) REG_DEAD r121:SI Failed to match this instruction: (set (reg:CC 122) (compare:CC (subreg:QI (reg:SI 124) 0) (const_int 0 [0]))) Hrm, that is a bad idea in general, why do we do that. > gcc.c-torture/execute/pr28982a.c: > > @@ -427,15 +427,13 @@ > stxvd2x 0,7,6 > .align 4 > .L9: > - xxlor 12,32,32 > + xvcvsxwsp 0,32 > vadduwm 0,0,1 > addi 10,9,16 > - xvcvsxwsp 0,12 > - xxlor 12,32,32 > + stxvd2x 0,0,9 > + xvcvsxwsp 0,32 > + addi 9,9,32 > vadduwm 0,0,1 > - stxvd2x 0,0,9 > - xvcvsxwsp 0,12 > - addi 9,9,32 > stxvd2x 0,0,10 > bdnz .L9 > li 3,4 > > (Disclaimer: I have no idea if that's correct.) This seems to be -O3 -mcpu=power8. It look to be correct just fine. It saves the two xxlor insns (which are just register move instructions). RA couldn't fix this up because there are two uses of the register (the xvcvsxwsp -- convert V4SI to V4SF; and the vadduwm -- V4SI addition), and RA doesn't reorder code. I would say your new code is better. > gcc.c-torture/execute/pr65215-3.c: > > @@ -56,11 +56,10 @@ > srdi 10,3,32 > srdi 9,3,56 > slwi 6,10,24 > - srwi 7,10,8 > + rlwinm 7,10,24,16,23 > or 9,9,6 > - rlwinm 7,7,0,16,23 > + rlwinm 10,10,8,8,15 > or 9,9,7 > - rlwinm 10,10,8,8,15 > or 9,9,10 > cmpw 0,9,8 > bne 0,.L4 insn_cost 4 for 15: r139:SI=r118:DI#0 0>>0x8 insn_cost 4 for 16: r140:SI=r139:SI&0xff00 REG_DEAD r139:SI (that's both of those insns setting r7). r118 does not die here yet, that's only in the 10,10 insn. Trying 15 -> 16: 15: r139:SI=r118:DI#0 0>>0x8 16: r140:SI=r139:SI&0xff00 REG_DEAD r139:SI Failed to match this instruction: (set (reg:SI 140) (and:SI (subreg:SI (zero_extract:DI (reg:DI 118 [ _2 ]) (const_int 32 [0x20]) (const_int 24 [0x18])) 0) (const_int 65280 [0xff00]))) Failed to match this instruction: (set (reg:SI 140) (and:SI (subreg:SI (and:DI (lshiftrt:DI (reg:DI 118 [ _2 ]) (const_int 8 [0x8])) (const_int 4294967295 [0xffffffff])) 0) (const_int 65280 [0xff00]))) Yeah, it's one of those, make_compound_insn :-/ (*^%$(*^$(*@^ > Just to emphasise though: I'm not proposing that we switch this on for > all targets yet. It would be opt-in until the pass is more mature. I do have to wonder if it is a bit late for stage 1. But opt-in as in, the user has to use some flag, that should be fine I guess? But default for some targets might not be so great, esp. primary targets. > But that FFR case is really important for the situation it handles. Yeah. I hope to have some time to review your actual patch soon. Should be less depressing than some of the combine failures :-) Segher
On 11/17/19 6:35 PM, Richard Sandiford wrote: > (It's 23:35 local time, so it's still just about stage 1. :-)) > > While working on SVE, I've noticed several cases in which we fail > to combine instructions because the combined form would need to be > placed earlier in the instruction stream than the last of the > instructions being combined. This includes one very important > case in the handling of the first fault register (FFR). > > Combine currently requires the combined instruction to live at the same > location as i3. I thought about trying to relax that restriction, but it > would be difficult to do with the current pass structure while keeping > everything linear-ish time. > > So this patch instead goes for an option that has been talked about > several times over the years: writing a new combine pass that just > does instruction combination, and not all the other optimisations > that have been bolted onto combine over time. E.g. it deliberately > doesn't do things like nonzero-bits tracking, since that really ought > to be a separate, more global, optimisation. > > This is still far from being a realistic replacement for the even > the combine parts of the current combine pass. E.g.: > > - it only handles combinations that can be built up from individual > two-instruction combinations. > > - it doesn't allow new hard register clobbers to be added. > > - it doesn't have the special treatment of CC operations. > > - etc. > > But we have to start somewhere. > > On a more positive note, the pass handles things that the current > combine pass doesn't: > > - the main motivating feature mentioned above: it works out where > the combined instruction could validly live and moves it there > if necessary. If there are a range of valid places, it tries > to pick the best one based on register pressure (although only > with a simple heuristic for now). > > - once it has combined two instructions, it can try combining the > result with both later and earlier code, i.e. it can combine > in both directions. > > - it tries using REG_EQUAL notes for the final instruction. > > - it can parallelise two independent instructions that both read from > the same register or both read from memory. > > This last feature is useful for generating more load-pair combinations > on AArch64. In some cases it can also produce more store-pair combinations, > but only for consecutive stores. However, since the pass currently does > this in a very greedy, peephole way, it only allows load/store-pair > combinations if the first memory access has a higher alignment than > the second, i.e. if we can be sure that the combined access is naturally > aligned. This should help it to make better decisions than the post-RA > peephole pass in some cases while not being too aggressive. > > The pass is supposed to be linear time without debug insns. > It only tries a constant number C of combinations per instruction > and its bookkeeping updates are constant-time. Once it has combined two > instructions, it'll try up to C combinations on the result, but this can > be counted against the instruction that was deleted by the combination > and so effectively just doubles the constant. (Note that C depends > on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.) > > Unfortunately, debug updates via propagate_for_debug are more expensive. > This could probably be fixed if the pass did more to track debug insns > itself, but using propagate_for_debug matches combine's behaviour. > > The patch adds two instances of the new pass: one before combine and > one after it. By default both are disabled, but this can be changed > using the new 3-bit run-combine param, where: > > - bit 0 selects the new pre-combine pass > - bit 1 selects the main combine pass > - bit 2 selects the new post-combine pass > > The idea is that run-combine=3 can be used to see which combinations > are missed by the new pass, while run-combine=6 (which I hope to be > the production setting for AArch64 at -O2+) just uses the new pass > to mop up cases that normal combine misses. Maybe in some distant > future, the pass will be good enough for run-combine=[14] to be a > realistic option. > > I ended up having to add yet another validate_simplify_* routine, > this time to do the equivalent of: > > newx = simplify_replace_rtx (*loc, old_rtx, new_rtx); > validate_change (insn, loc, newx, 1); > > but in a more memory-efficient way. validate_replace_rtx isn't suitable > because it deliberately only tries simplifications in limited cases: > > /* Do changes needed to keep rtx consistent. Don't do any other > simplifications, as it is not our job. */ > > And validate_simplify_insn isn't useful for this case because it works > on patterns that have already had changes made to them and expects > those patterns to be valid rtxes. simplify-replace operations instead > need to simplify as they go, when the original modes are still to hand. > > As far as compile-time goes, I tried compiling optabs.ii at -O2 > with an --enable-checking=release compiler: > > run-combine=2 (normal combine): 100.0% (baseline) > run-combine=4 (new pass only) 98.0% > run-combine=6 (both passes) 100.3% > > where the results are easily outside the noise. So the pass on > its own is quicker than combine, but that's not a fair comparison > when it doesn't do everything combine does. Running both passes > only has a slight overhead. > > To get a feel for the effect on multiple targets, I did my usual > bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > and g++.dg, this time comparing run-combine=2 and run-combine=6 > using -O2 -ftree-vectorize: > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 3974 -39393 -2275 90 -2 > aarch64_be-linux-gnu 3389 -36683 -2275 165 -2 > alpha-linux-gnu 4154 -62860 -2132 335 -2 > amdgcn-amdhsa 4818 9079 -7987 51850 -2 > arc-elf 2868 -63710 -18998 286 -1 > arm-linux-gnueabi 4053 -80404 -10019 605 -2 > arm-linux-gnueabihf 4053 -80404 -10019 605 -2 > avr-elf 3620 38513 -2386 23364 2 > bfin-elf 2691 -32973 -1483 1127 -2 > bpf-elf 5581 -78105 -11064 113 -3 > c6x-elf 3915 -31710 -2441 1560 -2 > cr16-elf 6030 192102 -1757 60009 12 > cris-elf 2217 -30794 -1716 294 -2 > csky-elf 2003 -24989 -9999 1468 -2 > epiphany-elf 3345 -19416 -1803 4594 -2 > fr30-elf 3562 -15077 -1921 2334 -1 > frv-linux-gnu 2423 -16589 -1736 999 -1 > ft32-elf 2246 -46337 -15988 433 -2 > h8300-elf 2581 -33553 -1403 168 -2 > hppa64-hp-hpux11.23 3926 -120876 -50134 1056 -2 > i686-apple-darwin 3562 -46851 -1764 310 -2 > i686-pc-linux-gnu 2902 -3639 -4809 6848 -2 > ia64-linux-gnu 2900 -158870 -14006 428 -7 > iq2000-elf 2929 -54690 -2904 2576 -3 > lm32-elf 5265 162519 -1918 8004 5 > m32r-elf 1861 -25296 -2713 1004 -2 > m68k-linux-gnu 2520 -241573 -21879 200 -3 > mcore-elf 2378 -28532 -1810 1635 -2 > microblaze-elf 2782 -137363 -9516 1986 -2 > mipsel-linux-gnu 2443 -38422 -8331 458 -1 > mipsisa64-linux-gnu 2287 -60294 -12214 432 -2 > mmix 4910 -136549 -13616 599 -2 > mn10300-elf 2944 -29151 -2488 132 -1 > moxie-rtems 1935 -12364 -1002 125 -1 > msp430-elf 2379 -37007 -2163 176 -2 > nds32le-elf 2356 -27551 -2126 163 -1 > nios2-linux-gnu 1572 -44828 -23613 92 -2 > nvptx-none 1014 -17337 -1590 16 -3 > or1k-elf 2724 -92816 -14144 56 -3 > pdp11 1897 -27296 -1370 534 -2 > powerpc-ibm-aix7.0 2909 -58829 -10026 2001 -2 > powerpc64-linux-gnu 3685 -60551 -12158 2001 -1 > powerpc64le-linux-gnu 3501 -61846 -10024 765 -2 > pru-elf 1574 -29734 -19998 1718 -1 > riscv32-elf 2357 -22506 -10002 10175 -1 > riscv64-elf 3320 -56777 -10002 226 -2 > rl78-elf 2113 -232328 -18607 4065 -3 > rx-elf 2800 -38515 -896 491 -2 > s390-linux-gnu 3582 -75626 -12098 3999 -2 > s390x-linux-gnu 3761 -73473 -13748 3999 -2 > sh-linux-gnu 2350 -26401 -1003 522 -2 > sparc-linux-gnu 3279 -49518 -2175 2223 -2 > sparc64-linux-gnu 3849 -123084 -30200 2141 -2 > tilepro-linux-gnu 2737 -35562 -3458 2848 -2 > v850-elf 9002 -169126 -49996 76 -4 > vax-netbsdelf 3325 -57734 -10000 1989 -2 > visium-elf 1860 -17006 -1006 1066 -2 > x86_64-darwin 3278 -48933 -9999 1408 -2 > x86_64-linux-gnu 3008 -43887 -9999 3248 -2 > xstormy16-elf 2497 -26569 -2051 89 -2 > xtensa-elf 2161 -31231 -6910 138 -2 > > So running both passes does seem to have a significant benefit > on most targets, but there are some nasty-looking outliers. > The usual caveat applies: number of lines is a very poor measurement, > it's just to get a feel. > > Bootstrapped & regression-tested on aarch64-linux-gnu and > x86_64-linux-gnu with both run-combine=3 as the default (so that the new > pass runs first) and with run-combine=6 as the default (so that the new > pass runs second). There were no new execution failures. A couple of > guality.exp tests that already failed for most options started failing > for a couple more. Enabling the pass fixes the XFAILs in: > > gcc.target/aarch64/sve/acle/general/ptrue_pat_[234].c > > Inevitably there was some scan-assembler fallout for other tests. > E.g. in gcc.target/aarch64/vmov_n_1.c: > > #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory") > ... > INHIB_OPTIMIZATION; \ > (a) = TEST (test, data_len); \ > INHIB_OPTIMIZATION; \ > (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a)); \ > > is no longer effective for preventing move (a) from being merged > into (b), because the pass can merge at the point of (a). I think > this is a valid thing to do -- the asm semantics are still satisfied, > and asm volatile ("" : : : "memory") never acted as a register barrier. > But perhaps we should deal with this as a special case? > > Richard I'm reviewed the patch but I'm not a expert on combine so I'm only a few small comments e.t.c. Segher probably has more comments than I have anyhow. Nick > > > 2019-11-17 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * Makefile.in (OBJS): Add combine2.o > * params.opt (--param=run-combine): New option. > * doc/invoke.texi: Document it. > * tree-pass.h (make_pass_combine2_before): Declare. > (make_pass_combine2_after): Likewise. > * passes.def: Add them. > * timevar.def (TV_COMBINE2): New timevar. > * cfgrtl.h (update_cfg_for_uncondjump): Declare. > * combine.c (update_cfg_for_uncondjump): Move to... > * cfgrtl.c (update_cfg_for_uncondjump): ...here. > * simplify-rtx.c (simplify_truncation): Handle comparisons. > * recog.h (validate_simplify_replace_rtx): Declare. > * recog.c (validate_simplify_replace_rtx_1): New function. > (validate_simplify_replace_rtx_uses): Likewise. > (validate_simplify_replace_rtx): Likewise. > * combine2.c: New file. > > Index: gcc/Makefile.in > =================================================================== > --- gcc/Makefile.in 2019-11-14 14:34:27.599783740 +0000 > +++ gcc/Makefile.in 2019-11-17 23:15:31.188500613 +0000 > @@ -1261,6 +1261,7 @@ OBJS = \ > cgraphunit.o \ > cgraphclones.o \ > combine.o \ > + combine2.o \ > combine-stack-adj.o \ > compare-elim.o \ > context.o \ > Index: gcc/params.opt > =================================================================== > --- gcc/params.opt 2019-11-14 14:34:26.339792215 +0000 > +++ gcc/params.opt 2019-11-17 23:15:31.200500531 +0000 > @@ -768,6 +768,10 @@ Use internal function id in profile look > Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param > Maximum depth of a loop nest to fully value-number optimistically. > > +-param=run-combine= > +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param > +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass. > + > -param=sccvn-max-alias-queries-per-access= > Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param > Maximum number of disambiguations to perform per memory access. > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi 2019-11-16 10:43:45.597105823 +0000 > +++ gcc/doc/invoke.texi 2019-11-17 23:15:31.200500531 +0000 > @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las > @item max-combine-insns > The maximum number of instructions the RTL combiner tries to combine. > > +@item run-combine > +Choose which of the 3 available combine passes to run: bit 1 for the main > +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 > +for a later variant of the combine pass. > + > @item integer-share-limit > Small integer constants can use a shared data structure, reducing the > compiler's memory usage and increasing its speed. This sets the maximum > Index: gcc/tree-pass.h > =================================================================== > --- gcc/tree-pass.h 2019-10-29 08:29:03.096444049 +0000 > +++ gcc/tree-pass.h 2019-11-17 23:15:31.204500501 +0000 > @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i > extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt); > Index: gcc/passes.def > =================================================================== > --- gcc/passes.def 2019-10-29 08:29:03.224443133 +0000 > +++ gcc/passes.def 2019-11-17 23:15:31.200500531 +0000 > @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. > NEXT_PASS (pass_inc_dec); > NEXT_PASS (pass_initialize_regs); > NEXT_PASS (pass_ud_rtl_dce); > + NEXT_PASS (pass_combine2_before); > NEXT_PASS (pass_combine); > + NEXT_PASS (pass_combine2_after); > NEXT_PASS (pass_if_after_combine); > NEXT_PASS (pass_jump_after_combine); > NEXT_PASS (pass_partition_blocks); > Index: gcc/timevar.def This is really two passes it seems or at least functions. Just a nit but you may want to state that as I don't recall reading that. > =================================================================== > --- gcc/timevar.def 2019-10-11 15:43:53.403498517 +0100 > +++ gcc/timevar.def 2019-11-17 23:15:31.204500501 +0000 > @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , " > DEFTIMEVAR (TV_CSE2 , "CSE 2") > DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction") > DEFTIMEVAR (TV_COMBINE , "combiner") > +DEFTIMEVAR (TV_COMBINE2 , "second combiner") > DEFTIMEVAR (TV_IFCVT , "if-conversion") > DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") > DEFTIMEVAR (TV_SMS , "sms modulo scheduling") > Index: gcc/cfgrtl.h > =================================================================== > --- gcc/cfgrtl.h 2019-03-08 18:15:39.320730391 +0000 > +++ gcc/cfgrtl.h 2019-11-17 23:15:31.192500584 +0000 > @@ -47,6 +47,7 @@ extern void fixup_partitions (void); > extern bool purge_dead_edges (basic_block); > extern bool purge_all_dead_edges (void); > extern bool fixup_abnormal_edges (void); > +extern void update_cfg_for_uncondjump (rtx_insn *); > extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *); > extern void relink_block_chain (bool); > extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *); > Index: gcc/combine.c > =================================================================== > --- gcc/combine.c 2019-11-13 08:42:45.537368745 +0000 > +++ gcc/combine.c 2019-11-17 23:15:31.192500584 +0000 > @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg) > && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT; > } > > -/* Delete the unconditional jump INSN and adjust the CFG correspondingly. > - Note that the INSN should be deleted *after* removing dead edges, so > - that the kept edge is the fallthrough edge for a (set (pc) (pc)) > - but not for a (set (pc) (label_ref FOO)). */ > - > -static void > -update_cfg_for_uncondjump (rtx_insn *insn) > -{ > - basic_block bb = BLOCK_FOR_INSN (insn); > - gcc_assert (BB_END (bb) == insn); > - > - purge_dead_edges (bb); > - > - delete_insn (insn); > - if (EDGE_COUNT (bb->succs) == 1) > - { > - rtx_insn *insn; > - > - single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > - > - /* Remove barriers from the footer if there are any. */ > - for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > - if (BARRIER_P (insn)) > - { > - if (PREV_INSN (insn)) > - SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > - else > - BB_FOOTER (bb) = NEXT_INSN (insn); > - if (NEXT_INSN (insn)) > - SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > - } > - else if (LABEL_P (insn)) > - break; > - } > -} > - > /* Return whether PAT is a PARALLEL of exactly N register SETs followed > by an arbitrary number of CLOBBERs. */ > static bool > @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine = > {} > > /* opt_pass methods: */ > - virtual bool gate (function *) { return (optimize > 0); } > + virtual bool gate (function *) > + { > + return optimize > 0 && (param_run_combine & 2) != 0; > + } > virtual unsigned int execute (function *) > { > return rest_of_handle_combine (); > Index: gcc/cfgrtl.c > =================================================================== > --- gcc/cfgrtl.c 2019-10-17 14:22:55.523309009 +0100 > +++ gcc/cfgrtl.c 2019-11-17 23:15:31.188500613 +0000 > @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void) > return inserted; > } > > +/* Delete the unconditional jump INSN and adjust the CFG correspondingly. > + Note that the INSN should be deleted *after* removing dead edges, so > + that the kept edge is the fallthrough edge for a (set (pc) (pc)) > + but not for a (set (pc) (label_ref FOO)). */ > + > +void > +update_cfg_for_uncondjump (rtx_insn *insn) > +{ > + basic_block bb = BLOCK_FOR_INSN (insn); > + gcc_assert (BB_END (bb) == insn); > + > + purge_dead_edges (bb); > + > + delete_insn (insn); > + if (EDGE_COUNT (bb->succs) == 1) > + { > + rtx_insn *insn; > + > + single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > + > + /* Remove barriers from the footer if there are any. */ > + for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > + if (BARRIER_P (insn)) > + { > + if (PREV_INSN (insn)) > + SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > + else > + BB_FOOTER (bb) = NEXT_INSN (insn); > + if (NEXT_INSN (insn)) > + SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > + } > + else if (LABEL_P (insn)) > + break; > + } > +} > + > /* Cut the insns from FIRST to LAST out of the insns stream. */ > > rtx_insn * > Index: gcc/simplify-rtx.c > =================================================================== > --- gcc/simplify-rtx.c 2019-11-16 15:33:36.642840131 +0000 > +++ gcc/simplify-rtx.c 2019-11-17 23:15:31.204500501 +0000 > @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode, > && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1) > return constm1_rtx; > > + /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result > + in a narrower mode if useful. */ > + if (COMPARISON_P (op)) > + return simplify_gen_relational (GET_CODE (op), mode, VOIDmode, > + XEXP (op, 0), XEXP (op, 1)); > + > return NULL_RTX; > } > > Index: gcc/recog.h > =================================================================== > --- gcc/recog.h 2019-09-09 18:58:28.860430363 +0100 > +++ gcc/recog.h 2019-11-17 23:15:31.204500501 +0000 > @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos > extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *); > extern void validate_replace_src_group (rtx, rtx, rtx_insn *); > extern bool validate_simplify_insn (rtx_insn *insn); > +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx); > extern int num_changes_pending (void); > extern int next_insn_tests_no_inequality (rtx_insn *); > extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode); > Index: gcc/recog.c > =================================================================== > --- gcc/recog.c 2019-10-01 09:55:35.150088599 +0100 > +++ gcc/recog.c 2019-11-17 23:15:31.204500501 +0000 > @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn) > } > return ((num_changes_pending () > 0) && (apply_change_group () > 0)); > } > + > +/* A subroutine of validate_simplify_replace_rtx. Apply the replacement > + described by R to LOC. Return true on success; leave the caller > + to clean up on failure. */ > + > +static bool > +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc) > +{ > + rtx x = *loc; > + enum rtx_code code = GET_CODE (x); > + machine_mode mode = GET_MODE (x); > + > + if (rtx_equal_p (x, r.from)) > + { > + validate_unshare_change (r.insn, loc, r.to, 1); > + return true; > + } > + > + /* Recursively apply the substitution and see if we can simplify > + the result. This specifically shouldn't use simplify_gen_*, > + since we want to avoid generating new expressions where possible. */ > + int old_num_changes = num_validated_changes (); > + rtx newx = NULL_RTX; > + bool recurse_p = false; > + switch (GET_RTX_CLASS (code)) > + { > + case RTX_UNARY: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))) > + return false; > + > + newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode); > + break; > + } > + > + case RTX_BIN_ARITH: > + case RTX_COMM_ARITH: > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_binary_operation (code, mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_COMPARE: > + case RTX_COMM_COMPARE: > + { > + machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode > + ? GET_MODE (XEXP (x, 0)) > + : GET_MODE (XEXP (x, 1))); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_relational_operation (code, mode, op_mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_TERNARY: > + case RTX_BITFIELD_OPS: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2))) > + return false; > + > + newx = simplify_ternary_operation (code, mode, op0_mode, > + XEXP (x, 0), XEXP (x, 1), > + XEXP (x, 2)); > + break; > + } > + > + case RTX_EXTRA: > + if (code == SUBREG) > + { > + machine_mode inner_mode = GET_MODE (SUBREG_REG (x)); > + if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x))) > + return false; > + > + rtx inner = SUBREG_REG (x); > + newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x)); > + /* Reject the same cases that simplify_gen_subreg would. */ > + if (!newx > + && (GET_CODE (inner) == SUBREG > + || GET_CODE (inner) == CONCAT > + || GET_MODE (inner) == VOIDmode > + || !validate_subreg (mode, inner_mode, > + inner, SUBREG_BYTE (x)))) > + return false; > + break; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_OBJ: > + if (code == LO_SUM) > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + /* (lo_sum (high x) y) -> y where x and y have the same base. */ > + rtx op0 = XEXP (x, 0); > + rtx op1 = XEXP (x, 1); > + if (GET_CODE (op0) == HIGH) > + { > + rtx base0, base1, offset0, offset1; > + split_const (XEXP (op0, 0), &base0, &offset0); > + split_const (op1, &base1, &offset1); > + if (rtx_equal_p (base0, base1)) > + newx = op1; > + } > + } > + else if (code == REG) > + { > + if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from)) > + return false; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_CONST_OBJ: > + break; > + > + case RTX_AUTOINC: > + if (reg_overlap_mentioned_p (XEXP (x, 0), r.from)) > + return false; > + recurse_p = true; > + break; > + > + case RTX_MATCH: > + case RTX_INSN: > + gcc_unreachable (); > + } > + > + if (recurse_p) > + { > + const char *fmt = GET_RTX_FORMAT (code); > + for (int i = 0; fmt[i]; i++) > + switch (fmt[i]) > + { > + case 'E': > + for (int j = 0; j < XVECLEN (x, i); j++) > + if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j))) > + return false; > + break; > + > + case 'e': > + if (XEXP (x, i) > + && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i))) > + return false; > + break; > + } > + } > + > + if (newx && !rtx_equal_p (x, newx)) > + { > + /* There's no longer any point unsharing the substitutions made > + for subexpressions, since we'll just copy this one instead. */ > + for (int i = old_num_changes; i < num_changes; ++i) > + changes[i].unshare = false; > + validate_unshare_change (r.insn, loc, newx, 1); > + } > + > + return true; > +} > + > +/* A note_uses callback for validate_simplify_replace_rtx. > + DATA points to a validate_replace_src_data object. */ > + > +static void > +validate_simplify_replace_rtx_uses (rtx *loc, void *data) > +{ > + validate_replace_src_data &r = *(validate_replace_src_data *) data; > + if (r.insn && !validate_simplify_replace_rtx_1 (r, loc)) > + r.insn = NULL; > +} > + > +/* Try to perform the equivalent of: > + > + newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX); > + validate_change (INSN, LOC, newx, 1); > + > + but without generating as much garbage rtl when the resulting > + pattern doesn't match. > + > + Return true if we were able to replace all uses of OLD_RTX in *LOC > + and if the result conforms to general rtx rules (e.g. for whether > + subregs are meaningful). > + > + When returning true, add all replacements to the current validation group, > + leaving the caller to test it in the normal way. Leave both *LOC and the > + validation group unchanged on failure. */ > + > +bool > +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc, > + rtx old_rtx, rtx new_rtx) > +{ > + validate_replace_src_data r; > + r.from = old_rtx; > + r.to = new_rtx; > + r.insn = insn; > + > + unsigned int num_changes = num_validated_changes (); > + note_uses (loc, validate_simplify_replace_rtx_uses, &r); > + if (!r.insn) > + { > + cancel_changes (num_changes); > + return false; > + } > + return true; > +} > > /* Return 1 if the insn using CC0 set by INSN does not contain > any ordered tests applied to the condition codes. > Index: gcc/combine2.c > =================================================================== > --- /dev/null 2019-09-17 11:41:18.176664108 +0100 > +++ gcc/combine2.c 2019-11-17 23:15:31.196500559 +0000 > @@ -0,0 +1,1576 @@ > +/* Combine instructions > + Copyright (C) 2019 Free Software Foundation, Inc. > + > +This file is part of GCC. > + > +GCC is free software; you can redistribute it and/or modify it under > +the terms of the GNU General Public License as published by the Free > +Software Foundation; either version 3, or (at your option) any later > +version. > + > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY > +WARRANTY; without even the implied warranty of MERCHANTABILITY or > +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "backend.h" > +#include "rtl.h" > +#include "df.h" > +#include "tree-pass.h" > +#include "memmodel.h" > +#include "emit-rtl.h" > +#include "insn-config.h" > +#include "recog.h" > +#include "print-rtl.h" > +#include "rtl-iter.h" > +#include "predict.h" > +#include "cfgcleanup.h" > +#include "cfghooks.h" > +#include "cfgrtl.h" > +#include "alias.h" > +#include "valtrack.h" > + > +/* This pass tries to combine instructions in the following ways: > + > + (1) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + and I2 is the only user of DEST1, the pass tries to combine them into: > + > + I2: (...SRC1...) > + > + (2) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + the pass tries to combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (...SRC1...)]) > + > + or: > + > + I2: (parallel [(...SRC1...) (set DEST1 SRC1)]) > + > + (3) If we have two independent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (set DEST2 SRC2) > + > + that read from memory or from the same register, the pass tries to > + combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)]) > + > + or: > + > + I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)]) > + > + If the combined form is a valid instruction, the pass tries to find a > + place between I1 and I2 inclusive for the new instruction. If there > + are multiple valid locations, it tries to pick the best one by taking > + the effect on register pressure into account. > + > + If a combination succeeds and produces a single set, the pass tries to > + combine the new form with earlier or later instructions. > + > + The pass currently optimizes each basic block separately. It walks > + the instructions in reverse order, building up live ranges for registers > + and memory. It then uses these live ranges to look for possible > + combination opportunities and to decide where the combined instructions > + could be placed. > + > + The pass represents positions in the block using point numbers, > + with higher numbers indicating earlier instructions. The numbering > + scheme is that: > + > + - the end of the current instruction sequence has an even base point B. > + > + - instructions initially have odd-numbered points B + 1, B + 3, etc. > + with B + 1 being the final instruction in the sequence. > + > + - even points after B represent gaps between instructions where combined > + instructions could be placed. > + > + Thus even points initially represent no instructions and odd points > + initially represent single instructions. However, when picking a > + place for a combined instruction, the pass may choose somewhere > + inbetween the original two instructions, so that over time a point > + may come to represent several instructions. When this happens, > + the pass maintains the invariant that all instructions with the same > + point number are independent of each other and thus can be treated as > + acting in parallel (or as acting in any arbitrary sequence). > + > + TODOs: > + > + - Handle 3-instruction combinations, and possibly more. > + > + - Handle existing clobbers more efficiently. At the moment we can't > + move an instruction that clobbers R across another instruction that > + clobbers R. > + > + - Allow hard register clobbers to be added, like combine does. > + > + - Perhaps work on EBBs, or SESE regions. */ > + > +namespace { > + > +/* The number of explicit uses to record in a live range. */ > +const unsigned int NUM_RANGE_USERS = 4; > + > +/* The maximum number of instructions that we can combine at once. */ > +const unsigned int MAX_COMBINE_INSNS = 2; > + > +/* A fake cost for instructions that we haven't costed yet. */ > +const unsigned int UNKNOWN_COST = ~0U; > + > +class combine2 > +{ > +public: > + combine2 (function *); > + ~combine2 (); > + > + void execute (); > + > +private: > + struct insn_info_rec; > + > + /* Describes the live range of a register or of memory. For simplicity, > + we treat memory as a single entity. > + > + If we had a fully-accurate live range, updating it to account for a > + moved instruction would be a linear-time operation. Doing this for > + each combination would then make the pass quadratic. We therefore > + just maintain a list of NUM_RANGE_USERS use insns and use simple, > + conservatively-correct behavior for the rest. */ > + struct live_range_rec > + { > + /* Which instruction provides the dominating definition, or null if > + we don't know yet. */ > + insn_info_rec *producer; > + > + /* A selection of instructions that use the resource, in program order. */ > + insn_info_rec *users[NUM_RANGE_USERS]; > + > + /* An inclusive range of points that covers instructions not mentioned > + in USERS. Both values are zero if there are no such instructions. > + > + Once we've included a use U at point P in this range, we continue > + to assume that some kind of use exists at P whatever happens to U > + afterwards. */ > + unsigned int first_extra_use; > + unsigned int last_extra_use; > + > + /* The register number this range describes, or INVALID_REGNUM > + for memory. */ > + unsigned int regno; > + > + /* Forms a linked list of ranges for the same resource, in program > + order. */ > + live_range_rec *prev_range; > + live_range_rec *next_range; > + }; > + > + /* Pass-specific information about an instruction. */ > + struct insn_info_rec > + { > + /* The instruction itself. */ > + rtx_insn *insn; > + > + /* A null-terminated list of live ranges for the things that this > + instruction defines. */ > + live_range_rec **defs; > + > + /* A null-terminated list of live ranges for the things that this > + instruction uses. */ > + live_range_rec **uses; > + > + /* The point at which the instruction appears. */ > + unsigned int point; > + > + /* The cost of the instruction, or UNKNOWN_COST if we haven't > + measured it yet. */ > + unsigned int cost; > + }; > + > + /* Describes one attempt to combine instructions. */ > + struct combination_attempt_rec > + { > + /* The instruction that we're currently trying to optimize. > + If the combination succeeds, we'll use this insn_info_rec > + to describe the new instruction. */ > + insn_info_rec *new_home; > + > + /* The instructions we're combining, in program order. */ > + insn_info_rec *sequence[MAX_COMBINE_INSNS]; Can't we can this a vec in order to grow to lengths and just loop through merging on instructions in the vec as required? > + > + /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the > + live range that describes the substituted register. */ > + live_range_rec *def_use_range; > + > + /* The earliest and latest points at which we could insert the > + combined instruction. */ > + unsigned int earliest_point; > + unsigned int latest_point; > + > + /* The cost of the new instruction, once we have a successful match. */ > + unsigned int new_cost; > + }; > + > + /* Pass-specific information about a register. */ > + struct reg_info_rec > + { > + /* The live range associated with the last reference to the register. */ > + live_range_rec *range; > + > + /* The point at which the last reference occurred. */ > + unsigned int next_ref; > + > + /* True if the register is currently live. We record this here rather > + than in a separate bitmap because (a) there's a natural hole for > + it on LP64 hosts and (b) we only refer to it when updating the > + other fields, and so recording it here should give better locality. */ > + unsigned int live_p : 1; > + }; > + > + live_range_rec *new_live_range (unsigned int, live_range_rec *); > + live_range_rec *reg_live_range (unsigned int); > + live_range_rec *mem_live_range (); > + bool add_range_use (live_range_rec *, insn_info_rec *); > + void remove_range_use (live_range_rec *, insn_info_rec *); > + bool has_single_use_p (live_range_rec *); > + bool known_last_use_p (live_range_rec *, insn_info_rec *); > + unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *); > + unsigned int find_latest_point (insn_info_rec *, insn_info_rec *); > + bool start_combination (combination_attempt_rec &, insn_info_rec *, > + insn_info_rec *, live_range_rec * = NULL); > + bool verify_combination (combination_attempt_rec &); > + int estimate_reg_pressure_delta (insn_info_rec *); > + void commit_combination (combination_attempt_rec &, bool); > + bool try_parallel_sets (combination_attempt_rec &, rtx, rtx); > + bool try_parallelize_insns (combination_attempt_rec &); > + bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool); > + bool try_combine_def_use (combination_attempt_rec &, rtx, rtx); > + bool try_combine_two_uses (combination_attempt_rec &); > + bool try_combine (insn_info_rec *, rtx, unsigned int); > + bool optimize_insn (insn_info_rec *); > + void record_defs (insn_info_rec *); > + void record_reg_use (insn_info_rec *, df_ref); > + void record_uses (insn_info_rec *); > + void process_insn (insn_info_rec *); > + void start_sequence (); > + > + /* The function we're optimizing. */ > + function *m_fn; > + > + /* The highest pseudo register number plus one. */ > + unsigned int m_num_regs; > + > + /* The current basic block. */ > + basic_block m_bb; > + > + /* True if we should optimize the current basic block for speed. */ > + bool m_optimize_for_speed_p; > + > + /* The point number to allocate to the next instruction we visit > + in the backward traversal. */ > + unsigned int m_point; > + > + /* The point number corresponding to the end of the current > + instruction sequence, i.e. the lowest point number about which > + we still have valid information. */ > + unsigned int m_end_of_sequence; > + > + /* The point number corresponding to the end of the current basic block. > + This is the same as M_END_OF_SEQUENCE when processing the last > + instruction sequence in a basic block. */ > + unsigned int m_end_of_bb; > + > + /* The memory live range, or null if we haven't yet found a memory > + reference in the current instruction sequence. */ > + live_range_rec *m_mem_range; > + > + /* Gives information about each register. We track both hard and > + pseudo registers. */ > + auto_vec<reg_info_rec> m_reg_info; > + > + /* A bitmap of registers whose entry in m_reg_info is valid. */ > + auto_sbitmap m_valid_regs; > + > + /* If nonnuull, an unused 2-element PARALLEL that we can use to test > + instruction combinations. */ > + rtx m_spare_parallel; > + > + /* A bitmap of instructions that we've already tried to combine with. */ > + auto_bitmap m_tried_insns; > + > + /* A temporary bitmap used to hold register numbers. */ > + auto_bitmap m_true_deps; > + > + /* An obstack used for allocating insn_info_recs and for building > + up their lists of definitions and uses. */ > + obstack m_insn_obstack; > + > + /* An obstack used for allocating live_range_recs. */ > + obstack m_range_obstack; > + > + /* Start-of-object pointers for the two obstacks. */ > + char *m_insn_obstack_start; > + char *m_range_obstack_start; > + > + /* A list of instructions that we've optimized and whose new forms > + change the cfg. */ > + auto_vec<rtx_insn *> m_cfg_altering_insns; > + > + /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */ > + auto_bitmap m_cfg_altering_insn_ids; > + > + /* We can insert new instructions at point P * 2 by inserting them > + after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new > + instructions at point P * 2 + 1 by inserting them before > + M_POINTS[P - M_END_OF_SEQUENCE / 2]. */ > + auto_vec<rtx_insn *, 256> m_points; > +}; > + > +combine2::combine2 (function *fn) > + : m_fn (fn), > + m_num_regs (max_reg_num ()), > + m_bb (NULL), > + m_optimize_for_speed_p (false), > + m_point (2), > + m_end_of_sequence (m_point), > + m_end_of_bb (m_point), > + m_mem_range (NULL), > + m_reg_info (m_num_regs), > + m_valid_regs (m_num_regs), > + m_spare_parallel (NULL_RTX) > +{ > + gcc_obstack_init (&m_insn_obstack); > + gcc_obstack_init (&m_range_obstack); > + m_reg_info.quick_grow (m_num_regs); > + bitmap_clear (m_valid_regs); > + m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0); > + m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0); > +} > + > +combine2::~combine2 () > +{ > + obstack_free (&m_insn_obstack, NULL); > + obstack_free (&m_range_obstack, NULL); > +} > + > +/* Return true if it's possible in principle to combine INSN with > + other instructions. ALLOW_ASMS_P is true if the caller can cope > + with asm statements. */ > + > +static bool > +combinable_insn_p (rtx_insn *insn, bool allow_asms_p) > +{ > + rtx pattern = PATTERN (insn); > + > + if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER) > + return false; > + > + if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX)) > + return false; > + > + if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0) > + return false; > + > + return true; > +} > + > +/* Return true if it's possible in principle to move INSN somewhere else, > + as long as all dependencies are satisfied. */ > + > +static bool > +movable_insn_p (rtx_insn *insn) > +{ > + if (JUMP_P (insn)) > + return false; > + > + if (volatile_refs_p (PATTERN (insn))) > + return false; > + > + return true; > +} > + > +/* Create and return a new live range for REGNO. NEXT is the next range > + in program order, or null if this is the first live range in the > + sequence. */ > + > +combine2::live_range_rec * > +combine2::new_live_range (unsigned int regno, live_range_rec *next) > +{ > + live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec); > + memset (range, 0, sizeof (*range)); > + > + range->regno = regno; > + range->next_range = next; > + if (next) > + next->prev_range = range; > + return range; > +} > + > +/* Return the current live range for register REGNO, creating a new > + one if necessary. */ > + > +combine2::live_range_rec * > +combine2::reg_live_range (unsigned int regno) > +{ > + /* Initialize the liveness flag, if it isn't already valid for this BB. */ > + bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno); > + if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb) > + m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno); > + > + /* See if we already have a live range associated with the current > + instruction sequence. */ > + live_range_rec *range = NULL; > + if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence) > + range = m_reg_info[regno].range; > + > + /* Create a new range if this is the first reference to REGNO in the > + current instruction sequence or if the current range has been closed > + off by a definition. */ > + if (!range || range->producer) > + { > + range = new_live_range (regno, range); > + > + /* If the register is live after the current sequence, treat that > + as a fake use at the end of the sequence. */ > + if (!range->next_range && m_reg_info[regno].live_p) > + range->first_extra_use = range->last_extra_use = m_end_of_sequence; > + > + /* Record that this is now the current range for REGNO. */ > + if (first_ref_p) > + bitmap_set_bit (m_valid_regs, regno); > + m_reg_info[regno].range = range; > + m_reg_info[regno].next_ref = m_point; > + } > + return range; > +} > + > +/* Return the current live range for memory, treating memory as a single > + entity. Create a new live range if necessary. */ > + > +combine2::live_range_rec * > +combine2::mem_live_range () > +{ > + if (!m_mem_range || m_mem_range->producer) > + m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range); > + return m_mem_range; > +} > + > +/* Record that instruction USER uses the resource described by RANGE. > + Return true if this is new information. */ > + > +bool > +combine2::add_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + /* See if we've already recorded the instruction, or if there's a > + spare use slot we can use. */ > + unsigned int i = 0; > + for (; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return false; > + > + if (i == NUM_RANGE_USERS) > + { > + /* Since we've processed USER recently, assume that it's more > + interesting to record explicitly than the last user in the > + current list. Evict that last user and describe it in the > + overflow "extra use" range instead. */ > + insn_info_rec *ousted_user = range->users[--i]; > + if (range->first_extra_use < ousted_user->point) > + range->first_extra_use = ousted_user->point; > + if (range->last_extra_use > ousted_user->point) > + range->last_extra_use = ousted_user->point; > + } > + > + /* Insert USER while keeping the list sorted. */ > + for (; i > 0 && range->users[i - 1]->point < user->point; --i) > + range->users[i] = range->users[i - 1]; > + range->users[i] = user; > + return true; > +} > + > +/* Remove USER from the uses recorded for RANGE, if we can. > + There's nothing we can do if USER was described in the > + overflow "extra use" range. */ > + > +void > +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] == user) > + { > + for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j) > + range->users[j] = range->users[j + 1]; > + range->users[NUM_RANGE_USERS - 1] = NULL; > + break; > + } > +} > + > +/* Return true if RANGE has a single known user. */ > + > +bool > +combine2::has_single_use_p (live_range_rec *range) > +{ > + return range->users[0] && !range->users[1] && !range->first_extra_use; > +} > + > +/* Return true if we know that USER is the last user of RANGE. */ > + > +bool > +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) > +{ > + if (range->last_extra_use <= user->point) > + return false; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; Small nit and I could be wrong but do: return !range->users[i + 1] || i == NUM_RANGE_USERS - 1; Based on your code it seems that the getting to NUM_RANGE_USERS is far less likely. > + else if (range->users[i]->point == user->point) > + return false; > + > + gcc_unreachable (); > +} > + > +/* Find the earliest point that we could move I2 up in order to combine > + it with I1. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. */ > + > +unsigned int > +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1) > +{ > + if (!movable_insn_p (i2->insn)) > + return i2->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way up to I1. */ > + unsigned int point = i1->point; > + > + /* Make sure that the new position preserves all necessary true dependencies > + on earlier instructions. */ > + for (live_range_rec **use = i2->uses; *use; ++use) > + { > + live_range_rec *range = *use; > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + } > + > + /* Make sure that the new position preserves all necessary output and > + anti dependencies on earlier instructions. */ > + for (live_range_rec **def = i2->defs; *def; ++def) > + if (live_range_rec *range = (*def)->prev_range) > + { > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + > + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) > + if (range->users[i] && range->users[i] != i1) > + { > + if (point >= range->users[i]->point) > + point = range->users[i]->point - 1; > + break; > + } > + > + if (range->last_extra_use && point >= range->last_extra_use) > + point = range->last_extra_use - 1; > + } > + > + return point; > +} > + > +/* Find the latest point that we could move I1 down in order to combine > + it with I2. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. */ > + > +unsigned int > +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2) > +{ > + if (!movable_insn_p (i1->insn)) > + return i1->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way down to I2. */ > + unsigned int point = i2->point; > + > + /* Make sure that the new position preserves all necessary anti dependencies > + on later instructions. */ > + for (live_range_rec **use = i1->uses; *use; ++use) > + if (live_range_rec *range = (*use)->next_range) > + if (range->producer != i2 && point <= range->producer->point) > + point = range->producer->point + 1; > + > + /* Make sure that the new position preserves all necessary output and > + true dependencies on later instructions. */ > + for (live_range_rec **def = i1->defs; *def; ++def) > + { > + live_range_rec *range = *def; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] != i2) > + { > + if (range->users[i] && point <= range->users[i]->point) > + point = range->users[i]->point + 1; > + break; > + } > + > + if (range->first_extra_use && point <= range->first_extra_use) > + point = range->first_extra_use + 1; > + > + live_range_rec *next_range = range->next_range; > + if (next_range > + && next_range->producer != i2 > + && point <= next_range->producer->point) > + point = next_range->producer->point + 1; > + } > + > + return point; > +} > + > +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2, > + where I1 is the instruction that we're currently trying to optimize. > + If DEF_USE_RANGE is nonnull, I1 defines the value described by > + DEF_USE_RANGE and I2 uses it. */ > + > +bool > +combine2::start_combination (combination_attempt_rec &attempt, > + insn_info_rec *i1, insn_info_rec *i2, > + live_range_rec *def_use_range) > +{ > + attempt.new_home = i1; > + attempt.sequence[0] = i1; > + attempt.sequence[1] = i2; > + if (attempt.sequence[0]->point < attempt.sequence[1]->point) > + std::swap (attempt.sequence[0], attempt.sequence[1]); > + attempt.def_use_range = def_use_range; > + > + /* Check that the instructions have no true dependencies other than > + DEF_USE_RANGE. */ > + bitmap_clear (m_true_deps); > + for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def) > + if (*def != def_use_range) > + bitmap_set_bit (m_true_deps, (*def)->regno); > + for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use) > + if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno)) > + return false; > + > + /* Calculate the range of points at which the combined instruction > + could live. */ > + attempt.earliest_point = find_earliest_point (attempt.sequence[1], > + attempt.sequence[0]); > + attempt.latest_point = find_latest_point (attempt.sequence[0], > + attempt.sequence[1]); > + if (attempt.earliest_point < attempt.latest_point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "cannot combine %d and %d: no suitable" > + " location for combined insn\n", > + INSN_UID (attempt.sequence[0]->insn), > + INSN_UID (attempt.sequence[1]->insn)); > + return false; > + } > + > + /* Make sure we have valid costs for the original instructions before > + we start changing their patterns. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i]->cost == UNKNOWN_COST) > + attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn, > + m_optimize_for_speed_p); > + return true; > +} > + > +/* Check whether the combination attempt described by ATTEMPT matches > + an .md instruction (or matches its constraints, in the case of an > + asm statement). If so, calculate the cost of the new instruction > + and check whether it's cheap enough. */ > + > +bool > +combine2::verify_combination (combination_attempt_rec &attempt) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + bool ok_p = verify_changes (0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + if (!ok_p) > + fprintf (dump_file, "failed to match this instruction:\n"); > + else if (const char *name = get_insn_name (INSN_CODE (insn))) > + fprintf (dump_file, "successfully matched this instruction to %s:\n", > + name); > + else > + fprintf (dump_file, "successfully matched this instruction:\n"); > + print_rtl_single (dump_file, PATTERN (insn)); > + } > + if (!ok_p) > + return false; > + > + unsigned int cost1 = attempt.sequence[0]->cost; > + unsigned int cost2 = attempt.sequence[1]->cost; > + attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p); > + ok_p = (attempt.new_cost <= cost1 + cost2); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n", > + cost1, cost2, attempt.new_cost, > + ok_p ? "keeping replacement" : "rejecting replacement"); > + if (!ok_p) > + return false; > + > + confirm_change_group (); > + return true; > +} > + > +/* Return true if we should consider register REGNO when calculating > + register pressure estimates. */ > + > +static bool > +count_reg_pressure_p (unsigned int regno) > +{ > + if (regno == INVALID_REGNUM) > + return false; > + > + /* Unallocatable registers aren't interesting. */ > + if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno]) > + return false; > + > + return true; > +} > + > +/* Try to estimate the effect that the original form of INSN_INFO > + had on register pressure, in the form "born - dying". */ > + > +int > +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info) > +{ > + int delta = 0; > + > + for (live_range_rec **def = insn_info->defs; *def; ++def) > + if (count_reg_pressure_p ((*def)->regno)) > + delta += 1; > + > + for (live_range_rec **use = insn_info->uses; *use; ++use) > + if (count_reg_pressure_p ((*use)->regno) > + && known_last_use_p (*use, insn_info)) > + delta -= 1; > + > + return delta; > +} > + > +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete > + FROM_INSN. Copy any useful information to TO_INSN before doing that. */ > + > +static void > +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn) > +{ > + INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn); > + INSN_CODE (to_insn) = INSN_CODE (from_insn); > + REG_NOTES (to_insn) = REG_NOTES (from_insn); > +} > + > +/* The combination attempt in ATTEMPT has succeeded and is currently > + part of an open validate_change group. Commit to making the change > + and decide where the new instruction should go. > + > + KEPT_DEF_P is true if the new instruction continues to perform > + the definition described by ATTEMPT.def_use_range. */ > + > +void > +combine2::commit_combination (combination_attempt_rec &attempt, > + bool kept_def_p) > +{ > + insn_info_rec *new_home = attempt.new_home; > + rtx_insn *old_insn = attempt.sequence[0]->insn; > + rtx_insn *new_insn = attempt.sequence[1]->insn; > + > + /* Remove any notes that are no longer relevant. */ > + bool single_set_p = single_set (new_insn); > + for (rtx *note_ptr = ®_NOTES (new_insn); *note_ptr; ) > + { > + rtx note = *note_ptr; > + bool keep_p = true; > + switch (REG_NOTE_KIND (note)) > + { > + case REG_EQUAL: > + case REG_EQUIV: > + case REG_NOALIAS: > + keep_p = single_set_p; > + break; > + > + case REG_UNUSED: > + keep_p = false; > + break; > + > + default: > + break; > + } > + if (keep_p) > + note_ptr = &XEXP (*note_ptr, 1); > + else > + { > + *note_ptr = XEXP (*note_ptr, 1); > + free_EXPR_LIST_node (note); > + } > + } > + > + /* Complete the open validate_change group. */ > + confirm_change_group (); > + > + /* Decide where the new instruction should go. */ > + unsigned int new_point = attempt.latest_point; > + if (new_point != attempt.earliest_point > + && prev_real_insn (new_insn) != old_insn) > + { > + /* Prefer the earlier point if the combined instruction reduces > + register pressure and the latest point if it increases register > + pressure. > + > + The choice isn't obvious in the event of a tie, but picking > + the earliest point should reduce the number of times that > + we need to invalidate debug insns. */ > + int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]); > + int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]); > + bool move_up_p = (delta1 + delta2 <= 0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "register pressure delta = %d + %d; using %s position\n", > + delta1, delta2, move_up_p ? "earliest" : "latest"); > + if (move_up_p) > + new_point = attempt.earliest_point; > + } > + > + /* Translate inserting at NEW_POINT into inserting before or after > + a particular insn. */ > + rtx_insn *anchor = NULL; > + bool before_p = (new_point & 1); > + if (new_point != attempt.sequence[1]->point > + && new_point != attempt.sequence[0]->point) > + { > + anchor = m_points[(new_point - m_end_of_sequence) / 2]; > + rtx_insn *other_side = (before_p > + ? prev_real_insn (anchor) > + : next_real_insn (anchor)); > + /* Inserting next to an insn X and then deleting X is just a > + roundabout way of using X as the insertion point. */ > + if (anchor == new_insn || other_side == new_insn) > + new_point = attempt.sequence[1]->point; > + else if (anchor == old_insn || other_side == old_insn) > + new_point = attempt.sequence[0]->point; > + } > + > + /* Actually perform the move. */ > + if (new_point == attempt.sequence[1]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (new_insn)); > + set_insn_deleted (old_insn); > + } > + else if (new_point == attempt.sequence[0]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (old_insn)); > + PATTERN (old_insn) = PATTERN (new_insn); > + transfer_insn (old_insn, new_insn); > + std::swap (old_insn, new_insn); > + set_insn_deleted (old_insn); > + } > + else > + { > + /* We need to insert a new instruction. We can't simply move > + NEW_INSN because it acts as an insertion anchor in m_points. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "inserting combined insn %s insn %d\n", > + before_p ? "before" : "after", INSN_UID (anchor)); > + > + rtx_insn *added_insn = (before_p > + ? emit_insn_before (PATTERN (new_insn), anchor) > + : emit_insn_after (PATTERN (new_insn), anchor)); > + transfer_insn (added_insn, new_insn); > + set_insn_deleted (old_insn); > + set_insn_deleted (new_insn); > + new_insn = added_insn; > + } > + df_insn_rescan (new_insn); > + > + /* Unlink the old uses. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + remove_range_use (*use, attempt.sequence[i]); > + > + /* Work out which registers the new pattern uses. */ > + bitmap_clear (m_true_deps); > + df_ref use; > + FOR_EACH_INSN_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + FOR_EACH_INSN_EQ_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + > + /* Describe the combined instruction in NEW_HOME. */ > + new_home->insn = new_insn; > + new_home->point = new_point; > + new_home->cost = attempt.new_cost; > + > + /* Build up a list of definitions for the combined instructions > + and update all the ranges accordingly. It shouldn't matter > + which order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def) > + if (kept_def_p || *def != attempt.def_use_range) > + { > + obstack_ptr_grow (&m_insn_obstack, *def); > + (*def)->producer = new_home; > + } > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* Build up a list of uses for the combined instructions and update > + all the ranges accordingly. Again, it shouldn't matter which > + order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + if (*use != attempt.def_use_range > + && add_range_use (*use, new_home)) > + obstack_ptr_grow (&m_insn_obstack, *use); > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* There shouldn't be any remaining references to other instructions > + in the combination. Invalidate their contents to make lingering > + references a noisy failure. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i] != new_home) > + { > + attempt.sequence[i]->insn = NULL; > + attempt.sequence[i]->point = ~0U; > + } > + > + /* Unlink the def-use range. */ > + if (!kept_def_p && attempt.def_use_range) > + { > + live_range_rec *range = attempt.def_use_range; > + if (range->prev_range) > + range->prev_range->next_range = range->next_range; > + else > + m_reg_info[range->regno].range = range->next_range; > + if (range->next_range) > + range->next_range->prev_range = range->prev_range; > + } > + > + /* Record instructions whose new form alters the cfg. */ > + rtx pattern = PATTERN (new_insn); > + if ((returnjump_p (new_insn) > + || any_uncondjump_p (new_insn) > + || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx)) > + && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn))) > + m_cfg_altering_insns.safe_push (new_insn); > +} > + > +/* Return true if X1 and X2 are memories and if X1 does not have > + a higher alignment than X2. */ > + > +static bool > +dubious_mem_pair_p (rtx x1, rtx x2) > +{ > + return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2); > +} > + > +/* Try implement ATTEMPT using (parallel [SET1 SET2]). */ > + > +bool > +combine2::try_parallel_sets (combination_attempt_rec &attempt, > + rtx set1, rtx set2) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + /* Combining two loads or two stores can be useful on targets that > + allow them to be treated as a single access. However, we use a > + very peephole approach to picking the pairs, so we need to be > + relatively confident that we're making a good choice. > + > + For now just aim for cases in which the memory references are > + consecutive and the first reference has a higher alignment. > + We can leave the target to test the consecutive part; whatever test > + we added here might be different from the target's, and in any case > + it's fine if the target accepts other well-aligned cases too. */ > + if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2)) > + || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2))) > + return false; > + > + /* Cache the PARALLEL rtx between attempts so that we don't generate > + too much garbage rtl. */ > + if (!m_spare_parallel) > + { > + rtvec vec = gen_rtvec (2, set1, set2); > + m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec); > + } > + else > + { > + XVECEXP (m_spare_parallel, 0, 0) = set1; > + XVECEXP (m_spare_parallel, 0, 1) = set2; > + } > + > + unsigned int num_changes = num_validated_changes (); > + validate_change (insn, &PATTERN (insn), m_spare_parallel, true); > + if (verify_combination (attempt)) > + { > + m_spare_parallel = NULL_RTX; > + return true; > + } > + cancel_changes (num_changes); > + return false; > +} > + > +/* Try to parallelize the two instructions in ATTEMPT. */ > + > +bool > +combine2::try_parallelize_insns (combination_attempt_rec &attempt) > +{ > + rtx_insn *i1_insn = attempt.sequence[0]->insn; > + rtx_insn *i2_insn = attempt.sequence[1]->insn; > + > + /* Can't parallelize asm statements. */ > + if (asm_noperands (PATTERN (i1_insn)) >= 0 > + || asm_noperands (PATTERN (i2_insn)) >= 0) > + return false; > + > + /* For now, just handle the case in which both instructions are > + single sets. We could handle more than 2 sets as well, but few > + targets support that anyway. */ > + rtx set1 = single_set (i1_insn); > + if (!set1) > + return false; > + rtx set2 = single_set (i2_insn); > + if (!set2) > + return false; > + > + /* Make sure that we have structural proof that the destinations > + are independent. Things like alias analysis rely on semantic > + information and assume no undefined behavior, which is rarely a > + good enough guarantee to allow a useful instruction combination. */ > + rtx dest1 = SET_DEST (set1); > + rtx dest2 = SET_DEST (set2); > + if (MEM_P (dest1) > + ? MEM_P (dest2) && nonoverlapping_memrefs_p (dest1, dest2, false) > + : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2)) > + return false; > + > + /* Try the sets in both orders. */ > + if (try_parallel_sets (attempt, set1, set2) > + || try_parallel_sets (attempt, set2, set1)) > + { > + commit_combination (attempt, true); > + if (MAY_HAVE_DEBUG_BIND_INSNS > + && attempt.new_home->insn != i1_insn) > + propagate_for_debug (i1_insn, attempt.new_home->insn, > + SET_DEST (set1), SET_SRC (set1), m_bb); > + return true; > + } > + return false; > +} > + > +/* Replace DEST with SRC in the register notes for INSN. */ > + > +static void > +substitute_into_note (rtx_insn *insn, rtx dest, rtx src) > +{ > + for (rtx *note_ptr = ®_NOTES (insn); *note_ptr; ) > + { > + rtx note = *note_ptr; > + bool keep_p = true; > + switch (REG_NOTE_KIND (note)) > + { > + case REG_EQUAL: > + case REG_EQUIV: > + keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0), > + dest, src); > + break; > + > + default: > + break; > + } > + if (keep_p) > + note_ptr = &XEXP (*note_ptr, 1); > + else > + { > + *note_ptr = XEXP (*note_ptr, 1); > + free_EXPR_LIST_node (note); > + } > + } > +} > + > +/* A subroutine of try_combine_def_use. Try replacing DEST with SRC > + in ATTEMPT. SRC might be either the original SET_SRC passed to the > + parent routine or a value pulled from a note; SRC_IS_NOTE_P is true > + in the latter case. */ > + > +bool > +combine2::try_combine_def_use_1 (combination_attempt_rec &attempt, > + rtx dest, rtx src, bool src_is_note_p) > +{ > + rtx_insn *def_insn = attempt.sequence[0]->insn; > + rtx_insn *use_insn = attempt.sequence[1]->insn; > + > + /* Mimic combine's behavior by not combining moves from allocatable hard > + registers (e.g. when copying parameters or function return values). */ > + if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) > + return false; > + > + /* Don't mess with volatile references. For one thing, we don't yet > + know how many copies of SRC we'll need. */ > + if (volatile_refs_p (src)) > + return false; > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "trying to combine %d and %d%s:\n", > + INSN_UID (def_insn), INSN_UID (use_insn), > + src_is_note_p ? " using equal/equiv note" : ""); > + dump_insn_slim (dump_file, def_insn); > + dump_insn_slim (dump_file, use_insn); > + } > + > + unsigned int num_changes = num_validated_changes (); > + if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn), > + dest, src)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "combination failed -- unable to substitute" > + " all uses\n"); > + return false; > + } > + > + /* Try matching the instruction on its own if DEST isn't used elsewhere. */ > + if (has_single_use_p (attempt.def_use_range) > + && verify_combination (attempt)) > + { > + live_range_rec *next_range = attempt.def_use_range->next_range; > + substitute_into_note (use_insn, dest, src); > + commit_combination (attempt, false); > + if (MAY_HAVE_DEBUG_BIND_INSNS) > + { > + rtx_insn *end_of_range = (next_range > + ? next_range->producer->insn > + : BB_END (m_bb)); > + propagate_for_debug (def_insn, end_of_range, dest, src, m_bb); > + } > + return true; > + } > + > + /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN > + pattern. */ > + if (try_parallelize_insns (attempt)) > + return true; > + > + cancel_changes (num_changes); > + return false; > +} > + > +/* ATTEMPT describes an attempt to substitute the result of the first > + instruction into the second instruction. Try to implement it, > + given that the first instruction sets DEST to SRC. */ > + > +bool > +combine2::try_combine_def_use (combination_attempt_rec &attempt, > + rtx dest, rtx src) > +{ > + rtx_insn *def_insn = attempt.sequence[0]->insn; > + rtx_insn *use_insn = attempt.sequence[1]->insn; > + rtx def_note = find_reg_equal_equiv_note (def_insn); > + > + /* First try combining the instructions in their original form. */ > + if (try_combine_def_use_1 (attempt, dest, src, false)) > + return true; > + > + /* Try to replace DEST with a REG_EQUAL/EQUIV value instead. */ > + if (def_note > + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) > + return true; > + > + /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try > + using that instead of the main pattern. */ > + for (rtx *link_ptr = ®_NOTES (use_insn); *link_ptr; > + link_ptr = &XEXP (*link_ptr, 1)) > + { > + rtx use_note = *link_ptr; > + if (REG_NOTE_KIND (use_note) != REG_EQUAL > + && REG_NOTE_KIND (use_note) != REG_EQUIV) > + continue; > + > + rtx use_set = single_set (use_insn); > + if (!use_set) > + break; > + > + if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0))) > + continue; > + > + /* Try snipping out the note and putting it in the SET instead. */ > + validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1); > + validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1); > + > + if (try_combine_def_use_1 (attempt, dest, src, false)) > + return true; > + > + if (def_note > + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) > + return true; > + > + cancel_changes (0); > + } > + > + return false; > +} > + > +/* ATTEMPT describes an attempt to combine two instructions that use > + the same resource. Try to implement it, returning true on success. */ > + > +bool > +combine2::try_combine_two_uses (combination_attempt_rec &attempt) > +{ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "trying to parallelize %d and %d:\n", > + INSN_UID (attempt.sequence[0]->insn), > + INSN_UID (attempt.sequence[1]->insn)); > + dump_insn_slim (dump_file, attempt.sequence[0]->insn); > + dump_insn_slim (dump_file, attempt.sequence[1]->insn); > + } > + > + return try_parallelize_insns (attempt); > +} > + > +/* Try to optimize instruction INSN_INFO. Return true on success. */ > + > +bool > +combine2::optimize_insn (insn_info_rec *i1) > +{ > + combination_attempt_rec attempt; > + > + if (!combinable_insn_p (i1->insn, false)) > + return false; > + > + rtx set = single_set (i1->insn); > + if (!set) > + return false; > + > + /* First try combining INSN with a user of its result. */ > + rtx dest = SET_DEST (set); > + rtx src = SET_SRC (set); > + if (REG_P (dest) && REG_NREGS (dest) == 1) > + for (live_range_rec **def = i1->defs; *def; ++def) > + if ((*def)->regno == REGNO (dest)) > + { > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + { > + insn_info_rec *use = (*def)->users[i]; > + if (use > + && combinable_insn_p (use->insn, has_single_use_p (*def)) > + && start_combination (attempt, i1, use, *def) > + && try_combine_def_use (attempt, dest, src)) > + return true; > + } > + break; > + } > + > + /* Try parallelizing INSN and another instruction that uses the same > + resource. */ > + bitmap_clear (m_tried_insns); > + for (live_range_rec **use = i1->uses; *use; ++use) > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + { > + insn_info_rec *i2 = (*use)->users[i]; > + if (i2 > + && i2 != i1 > + && combinable_insn_p (i2->insn, false) > + && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn)) > + && start_combination (attempt, i1, i2) > + && try_combine_two_uses (attempt)) > + return true; > + } > + > + return false; > +} > + > +/* A note_stores callback. Set the bool at *DATA to true if DEST is in > + memory. */ > + > +static void > +find_mem_def (rtx dest, const_rtx, void *data) > +{ > + /* note_stores has stripped things like subregs and zero_extracts, > + so we don't need to worry about them here. */ > + if (MEM_P (dest)) > + *(bool *) data = true; > +} > + > +/* Record all register and memory definitions in INSN_INFO and fill in its > + "defs" list. */ > + > +void > +combine2::record_defs (insn_info_rec *insn_info) > +{ > + rtx_insn *insn = insn_info->insn; > + > + /* Record register definitions. */ > + df_ref def; > + FOR_EACH_INSN_DEF (def, insn) > + { > + rtx reg = DF_REF_REAL_REG (def); > + unsigned int end_regno = END_REGNO (reg); > + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) > + { > + live_range_rec *range = reg_live_range (regno); > + range->producer = insn_info; > + m_reg_info[regno].live_p = false; > + obstack_ptr_grow (&m_insn_obstack, range); > + } > + } > + > + /* If the instruction writes to memory, record that too. */ > + bool saw_mem_p = false; > + note_stores (insn, find_mem_def, &saw_mem_p); > + if (saw_mem_p) > + { > + live_range_rec *range = mem_live_range (); > + range->producer = insn_info; > + obstack_ptr_grow (&m_insn_obstack, range); > + } > + > + /* Complete the list of definitions. */ > + obstack_ptr_grow (&m_insn_obstack, NULL); > + insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); > +} > + > +/* Record that INSN_INFO contains register use USE. If this requires > + new entries to be added to INSN_INFO->uses, add those entries to the > + list we're building in m_insn_obstack. */ > + > +void > +combine2::record_reg_use (insn_info_rec *insn_info, df_ref use) > +{ > + rtx reg = DF_REF_REAL_REG (use); > + unsigned int end_regno = END_REGNO (reg); > + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) > + { > + live_range_rec *range = reg_live_range (regno); > + if (add_range_use (range, insn_info)) > + obstack_ptr_grow (&m_insn_obstack, range); > + m_reg_info[regno].live_p = true; > + } > +} > + > +/* A note_uses callback. Set the bool at DATA to true if *LOC reads > + from variable memory. */ > + > +static void > +find_mem_use (rtx *loc, void *data) > +{ > + subrtx_iterator::array_type array; > + FOR_EACH_SUBRTX (iter, array, *loc, NONCONST) > + if (MEM_P (*iter) && !MEM_READONLY_P (*iter)) > + { > + *(bool *) data = true; > + break; > + } > +} > + > +/* Record all register and memory uses in INSN_INFO and fill in its > + "uses" list. */ > + > +void > +combine2::record_uses (insn_info_rec *insn_info) > +{ > + rtx_insn *insn = insn_info->insn; > + > + /* Record register uses in the main pattern. */ > + df_ref use; > + FOR_EACH_INSN_USE (use, insn) > + record_reg_use (insn_info, use); > + > + /* Treat REG_EQUAL uses as first-class uses. We don't lose much > + by doing that, since it's rare for a REG_EQUAL note to mention > + registers that the main pattern doesn't. It also gives us the > + maximum freedom to use REG_EQUAL notes in place of the main pattern. */ > + FOR_EACH_INSN_EQ_USE (use, insn) > + record_reg_use (insn_info, use); > + > + /* Record a memory use if either the pattern or the notes read from > + memory. */ > + bool saw_mem_p = false; > + note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p); > + for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1)) > + if (REG_NOTE_KIND (note) == REG_EQUAL > + || REG_NOTE_KIND (note) == REG_EQUIV) > + note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p); > + if (saw_mem_p) > + { > + live_range_rec *range = mem_live_range (); > + if (add_range_use (range, insn_info)) > + obstack_ptr_grow (&m_insn_obstack, range); > + } > + > + /* Complete the list of uses. */ > + obstack_ptr_grow (&m_insn_obstack, NULL); > + insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); > +} > + > +/* Start a new instruction sequence, discarding all information about > + the previous one. */ > + > +void > +combine2::start_sequence (void) > +{ > + m_end_of_sequence = m_point; > + m_mem_range = NULL; > + m_points.truncate (0); > + obstack_free (&m_insn_obstack, m_insn_obstack_start); > + obstack_free (&m_range_obstack, m_range_obstack_start); > +} > + > +/* Run the pass on the current function. */ > + > +void > +combine2::execute (void) > +{ > + df_analyze (); > + FOR_EACH_BB_FN (m_bb, cfun) > + { > + m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb); > + m_end_of_bb = m_point; > + start_sequence (); > + > + rtx_insn *insn, *prev; > + FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev) > + { > + if (!NONDEBUG_INSN_P (insn)) > + continue; > + > + /* The current m_point represents the end of the sequence if > + INSN is the last instruction in the sequence, otherwise it > + represents the gap between INSN and the next instruction. > + m_point + 1 represents INSN itself. > + > + Instructions can be added to m_point by inserting them > + after INSN. They can be added to m_point + 1 by inserting > + them before INSN. */ > + m_points.safe_push (insn); > + m_point += 1; > + > + insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec); > + insn_info->insn = insn; > + insn_info->point = m_point; > + insn_info->cost = UNKNOWN_COST; > + > + record_defs (insn_info); > + record_uses (insn_info); > + > + /* Set up m_point for the next instruction. */ > + m_point += 1; > + > + if (CALL_P (insn)) > + start_sequence (); > + else > + while (optimize_insn (insn_info)) > + gcc_assert (insn_info->insn); > + } > + } > + > + /* If an instruction changes the cfg, update the containing block > + accordingly. */ > + rtx_insn *insn; > + unsigned int i; > + FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn) > + if (JUMP_P (insn)) > + { > + mark_jump_label (PATTERN (insn), insn, 0); > + update_cfg_for_uncondjump (insn); > + } > + else > + { > + remove_edge (split_block (BLOCK_FOR_INSN (insn), insn)); > + emit_barrier_after_bb (BLOCK_FOR_INSN (insn)); > + } > + > + /* Propagate the above block-local cfg changes to the rest of the cfg. */ > + if (!m_cfg_altering_insns.is_empty ()) > + { > + if (dom_info_available_p (CDI_DOMINATORS)) > + free_dominance_info (CDI_DOMINATORS); > + timevar_push (TV_JUMP); > + rebuild_jump_labels (get_insns ()); > + cleanup_cfg (0); > + timevar_pop (TV_JUMP); > + } > +} > + > +const pass_data pass_data_combine2 = > +{ > + RTL_PASS, /* type */ > + "combine2", /* name */ > + OPTGROUP_NONE, /* optinfo_flags */ > + TV_COMBINE2, /* tv_id */ > + 0, /* properties_required */ > + 0, /* properties_provided */ > + 0, /* properties_destroyed */ > + 0, /* todo_flags_start */ > + TODO_df_finish, /* todo_flags_finish */ > +}; > + > +class pass_combine2 : public rtl_opt_pass > +{ > +public: > + pass_combine2 (gcc::context *ctxt, int flag) > + : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag) > + {} > + > + bool > + gate (function *) OVERRIDE > + { > + return optimize && (param_run_combine & m_flag) != 0; > + } > + > + unsigned int > + execute (function *f) OVERRIDE > + { > + combine2 (f).execute (); > + return 0; > + } > + > +private: > + unsigned int m_flag; > +}; // class pass_combine2 > + > +} // anon namespace > + > +rtl_opt_pass * > +make_pass_combine2_before (gcc::context *ctxt) > +{ > + return new pass_combine2 (ctxt, 1); > +} > + > +rtl_opt_pass * > +make_pass_combine2_after (gcc::context *ctxt) > +{ > + return new pass_combine2 (ctxt, 4); > +}
Hi Nick, Thanks for the comments. Nicholas Krause <xerofoify@gmail.com> writes: >> Index: gcc/passes.def >> =================================================================== >> --- gcc/passes.def 2019-10-29 08:29:03.224443133 +0000 >> +++ gcc/passes.def 2019-11-17 23:15:31.200500531 +0000 >> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. >> NEXT_PASS (pass_inc_dec); >> NEXT_PASS (pass_initialize_regs); >> NEXT_PASS (pass_ud_rtl_dce); >> + NEXT_PASS (pass_combine2_before); >> NEXT_PASS (pass_combine); >> + NEXT_PASS (pass_combine2_after); >> NEXT_PASS (pass_if_after_combine); >> NEXT_PASS (pass_jump_after_combine); >> NEXT_PASS (pass_partition_blocks); >> Index: gcc/timevar.def > This is really two passes it seems or at least functions. Just a nit but you > may want to state that as I don't recall reading that. It's really two instances of the same pass, but yeah, each instance goes under a different name. This is because each instance needs to know which bit of the run-combine value it should be testing: >> The patch adds two instances of the new pass: one before combine and >> one after it. By default both are disabled, but this can be changed >> using the new 3-bit run-combine param, where: >> >> - bit 0 selects the new pre-combine pass >> - bit 1 selects the main combine pass >> - bit 2 selects the new post-combine pass So bit 0 is pass_combine2_before, bit 1 is pass_combine and bit 2 is pass_combine2_after. But the passes are identical apart from the choice of bit they test. >> + /* Describes one attempt to combine instructions. */ >> + struct combination_attempt_rec >> + { >> + /* The instruction that we're currently trying to optimize. >> + If the combination succeeds, we'll use this insn_info_rec >> + to describe the new instruction. */ >> + insn_info_rec *new_home; >> + >> + /* The instructions we're combining, in program order. */ >> + insn_info_rec *sequence[MAX_COMBINE_INSNS]; > Can't we can this a vec in order to grow to lengths and just loop through > merging on instructions in the vec as required? Yeah, extending this to combining more than 2 instructions would be future work. When that happens, this would likely end up becoming an auto_vec<insn_info_rec *, MAX_COMBINE_INSNS>. I imagine there would still be a fairly low compile-time limit on the number of combinations though. E.g. current combine has a limit of 4, with even 4 being restricted to certain high-value cases. I don't think I've ever seen a case where 5 or more would help. >> +/* Return true if we know that USER is the last user of RANGE. */ >> + >> +bool >> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) >> +{ >> + if (range->last_extra_use <= user->point) >> + return false; >> + >> + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) >> + if (range->users[i] == user) >> + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; > Small nit and I could be wrong but do: > > return !range->users[i + 1] || i == NUM_RANGE_USERS - 1; > > Based on your code it seems that the getting to NUM_RANGE_USERS is far > less likely. The problem is that we'll then be accessing outside the users[] array when i == NUM_RANGE_USERS - 1, so we have to check the limit first. Thanks, Richard
Segher Boessenkool <segher@kernel.crashing.org> writes: > On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote: >> > Why don't you use DF for the DU chains? >> >> The problem with DF_DU_CHAIN is that it's quadratic in the worst case. > > Oh, wow. > >> fwprop.c gets around that by using the MD problem and having its own >> dominator walker to calculate limited def-use chains: >> >> /* We use the multiple definitions problem to compute our restricted >> use-def chains. */ > > It's not great if every pass invents its own version of some common > infrastructure thing because that common one is not suitable. > > I.e., can this be fixed somehow? Maybe just by having a restricted DU > chains df problem? Well, it'd probably make sense to make fwprop.c's approach available as a "proper" df interface at some point. Hopefully if anyone wants the same thing as fwprop.c, they'd do that rather than copy the code. :-) >> So taking that approach here would still require some amount of >> roll-your-own. Other reasons are: >> >> * Even what fwprop does is more elaborate than we need for now. >> >> * We need to handle memory too, and it's nice to be able to handle >> it in the same way as registers. >> >> * Updating a full, ordered def-use chain after a move is a linear-time >> operation, so whatever happens, we'd need to apply some kind of limit >> on the number of uses we maintain, with something like that integer >> point range for the rest. >> >> * Once we've analysed the insn and built its def-use chains, we don't >> look at the df_refs again until we update the chains after a successful >> combination. So it should be more efficient to maintain a small array >> of insn_info_rec pointers alongside the numerical range, rather than >> walk and pollute chains of df_refs and then link back the insn uids >> to the pass-local info. > > So you need something like combine's LOG_LINKS? Not that handling those > is not quadratic in the worst case, but in practice it works well. And > it *could* be made linear. Not sure why what I've used isn't what I need though :-) If it's an array vs. linked-list thing, then for the multi-use case, we need two sets of link pointers, one for "next use of the same resource" and one for "next use in this instruction". Then we need the payload of the list node itself. For the small number of entries we're talking about, using null-terminated arrays of "things that this instruction uses" and "instructions that use this resource" should be more efficient than pointer-chasing, and occupies the same space as the link pointers (i.e. saves the extra payload). We also need to be able to walk in both directions, to answer the questions: - which insns can I combine with this definition? - where is this value of a resource defined? - where are the uses of this resource? - where was the previous definition of this resource, and where was its last use? So if we're comparing it to existing linked-list GCC structures, it's more similar to df_ref (see above for why that seemed like a bad idea) or -- more light-weight -- dep_link_t in the scheduler. And both the array and linked-list approaches still need to fall back to the simple live range once a certain threshold is hit. >> The second set is for: >> >> (B) --param run-combine=6 (both passes), use-use combinations only >> (C) --param run-combine=6 (both passes), no restrictions >> >> Target Tests Delta Best Worst Median >> ====== ===== ===== ==== ===== ====== >> aarch64-linux-gnu 272 -3844 -585 18 -1 >> aarch64_be-linux-gnu 190 -3336 -370 18 -1 >> alpha-linux-gnu 401 -2735 -370 22 -2 >> amdgcn-amdhsa 188 1867 -484 1259 -1 >> arc-elf 257 -1498 -650 54 -1 >> arm-linux-gnueabi 168 -1117 -612 680 -1 >> arm-linux-gnueabihf 168 -1117 -612 680 -1 >> avr-elf 1341 -111401 -13824 680 -10 > > Things like this are kind of suspicious :-) Yeah. This mostly seems to come from mopping up the extra moves created by make_more_copies. So we have combinations like: 58: r70:SF=r94:SF REG_DEAD r94:SF 60: r22:SF=r70:SF REG_DEAD r70:SF (r22 is a hard reg, the others are pseudos) which produces: std Y+1,r22 std Y+2,r23 std Y+3,r24 std Y+4,r25 - ldd r22,Y+1 - ldd r23,Y+2 - ldd r24,Y+3 - ldd r25,Y+4 On the REG_EQUAL thing: you're right that it doesn't make much difference for run-combine=6: Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== arc-elf 1 -1 -1 -1 -1 avr-elf 1 -1 -1 -1 -1 bfin-elf 1 -1 -1 -1 -1 bpf-elf 2 -6 -5 -1 -5 c6x-elf 1 -2 -2 -2 -2 cr16-elf 1 7 7 7 7 epiphany-elf 5 -15 -4 -1 -4 fr30-elf 2 -16 -11 -5 -11 frv-linux-gnu 2 -20 -16 -4 -16 h8300-elf 2 -2 -1 -1 -1 i686-apple-darwin 1 -3 -3 -3 -3 ia64-linux-gnu 3 -39 -26 -6 -7 m32r-elf 3 -17 -10 -2 -5 mcore-elf 4 -7 -3 -1 -2 mn10300-elf 1 -2 -2 -2 -2 moxie-rtems 4 -15 -5 -2 -4 nds32le-elf 1 -1 -1 -1 -1 nios2-linux-gnu 1 -1 -1 -1 -1 or1k-elf 3 -18 -12 -2 -4 s390-linux-gnu 6 -28 -9 -1 -7 s390x-linux-gnu 1 -1 -1 -1 -1 sh-linux-gnu 1 -1 -1 -1 -1 sparc-linux-gnu 4 -24 -14 -2 -5 xstormy16-elf 9 -27 -10 -1 -2 So there's only one case in which it isn't a win, but the number of tests is tiny. So I agree there's no justification for trying this in combine proper as things stand (and I wasn't arguing otherwise FWIW). I'd still like to keep it in the new pass because it does help *sometimes* and there's no sign yet that it has a noticeable compile-time cost. It might also be interesting to see how much difference it makes for run-combine=4 (e.g. to see how much it makes up for the current 2-insn limit)... Thanks, Richard
On Thu, Nov 21, 2019 at 07:41:56PM +0000, Richard Sandiford wrote: > Nicholas Krause <xerofoify@gmail.com> writes: > >> + /* The instructions we're combining, in program order. */ > >> + insn_info_rec *sequence[MAX_COMBINE_INSNS]; > > Can't we can this a vec in order to grow to lengths and just loop through > > merging on instructions in the vec as required? > > Yeah, extending this to combining more than 2 instructions would be > future work. When that happens, this would likely end up becoming an > auto_vec<insn_info_rec *, MAX_COMBINE_INSNS>. I imagine there would > still be a fairly low compile-time limit on the number of combinations > though. E.g. current combine has a limit of 4, with even 4 being > restricted to certain high-value cases. I don't think I've ever > seen a case where 5 or more would help. And sometimes it looks like 4 would help, but often this is because of a limitation elsewhere (like, it should have done a 2->2 before, for example). 4 _does_ help quite a bit with irregular instruction sets. It could sometimes help with RMW insns, too, but there are other problems with that. What you see a lot where 4 "helps" is where it really should combine with just 3 of them, but something prevents that, often cost, while throwing in a 4th insn tilts the balance just enough. We used to have a lot of that with 3-insn combinations as well, and probably still have some. Segher
On Thu, Nov 21, 2019 at 08:32:14PM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > It's not great if every pass invents its own version of some common > > infrastructure thing because that common one is not suitable. > > > > I.e., can this be fixed somehow? Maybe just by having a restricted DU > > chains df problem? > > Well, it'd probably make sense to make fwprop.c's approach available > as a "proper" df interface at some point. Hopefully if anyone wants the > same thing as fwprop.c, they'd do that rather than copy the code. :-) > >> * Updating a full, ordered def-use chain after a move is a linear-time > >> operation, so whatever happens, we'd need to apply some kind of limit > >> on the number of uses we maintain, with something like that integer > >> point range for the rest. Yeah. > >> * Once we've analysed the insn and built its def-use chains, we don't > >> look at the df_refs again until we update the chains after a successful > >> combination. So it should be more efficient to maintain a small array > >> of insn_info_rec pointers alongside the numerical range, rather than > >> walk and pollute chains of df_refs and then link back the insn uids > >> to the pass-local info. > > > > So you need something like combine's LOG_LINKS? Not that handling those > > is not quadratic in the worst case, but in practice it works well. And > > it *could* be made linear. > > Not sure why what I've used isn't what I need though :-) I am wondering the other way around :-) Is what you do for combine2 something that would be more generally applicable/useful? That's what I'm trying to find out :-) What combine does could use some improvement, if you want to hear a more direct motivations. LOG_LINKS just skip references we cannot handle (and some more), so we always have to do modified_between etc., which hurts. > >> Target Tests Delta Best Worst Median > >> avr-elf 1341 -111401 -13824 680 -10 > > > > Things like this are kind of suspicious :-) > > Yeah. This mostly seems to come from mopping up the extra moves created > by make_more_copies. So we have combinations like: > > 58: r70:SF=r94:SF > REG_DEAD r94:SF > 60: r22:SF=r70:SF > REG_DEAD r70:SF Why didn't combine do this? A target problem? > So there's only one case in which it isn't a win, but the number of > tests is tiny. So I agree there's no justification for trying this in > combine proper as things stand (and I wasn't arguing otherwise FWIW). > I'd still like to keep it in the new pass because it does help > *sometimes* and there's no sign yet that it has a noticeable > compile-time cost. So when does it help? I can only think of cases where there are problems elsewhere. > It might also be interesting to see how much difference it makes for > run-combine=4 (e.g. to see how much it makes up for the current 2-insn > limit)... Numbers are good :-) Segher
Hi! On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote: > Richard Sandiford <richard.sandiford@arm.com> writes: > > (It's 23:35 local time, so it's still just about stage 1. :-)) > > Or actually, just under 1 day after end of stage 1. Oops. > Could have sworn stage 1 ended on the 17th :-( Only realised > I'd got it wrong when catching up on Saturday's email traffic. > > And inevitably, I introduced a couple of stupid mistakes while > trying to clean the patch up for submission by that (non-)deadline. > Here's a version that fixes an inverted overlapping memref check > and that correctly prunes the use list for combined instructions. > (This last one is just a compile-time saving -- the old code was > correct, just suboptimal.) I've build the Linux kernel with the previous version, as well as this one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: (I've forced --param=run-combine=6 for R1 and R2): (Percentages are relative to R0): R0 R1 R2 R1 R2 alpha 6107088 6101088 6101088 99.902% 99.902% arc 4008224 4006568 4006568 99.959% 99.959% arm 9206728 9200936 9201000 99.937% 99.938% arm64 13056174 13018174 13018194 99.709% 99.709% armhf 0 0 0 0 0 c6x 2337237 2337077 2337077 99.993% 99.993% csky 3356602 0 0 0 0 h8300 1166996 1166776 1166776 99.981% 99.981% i386 11352159 0 0 0 0 ia64 18230640 18167000 18167000 99.651% 99.651% m68k 3714271 0 0 0 0 microblaze 4982749 4979945 4979945 99.944% 99.944% mips 8499309 8495205 8495205 99.952% 99.952% mips64 7042036 7039816 7039816 99.968% 99.968% nds32 4486663 0 0 0 0 nios2 3680001 3679417 3679417 99.984% 99.984% openrisc 4226076 4225868 4225868 99.995% 99.995% parisc 7681895 7680063 7680063 99.976% 99.976% parisc64 8677077 8676581 8676581 99.994% 99.994% powerpc 10687611 10682199 10682199 99.949% 99.949% powerpc64 17671082 17658570 17658570 99.929% 99.929% powerpc64le 17671082 17658570 17658570 99.929% 99.929% riscv32 1554938 1554758 1554758 99.988% 99.988% riscv64 6634342 6632788 6632788 99.977% 99.977% s390 13049643 13014939 13014939 99.734% 99.734% sh 3254743 0 0 0 0 shnommu 1632364 1632124 1632124 99.985% 99.985% sparc 4404993 4399593 4399593 99.877% 99.877% sparc64 6796711 6797491 6797491 100.011% 100.011% x86_64 19713174 19712817 19712817 99.998% 99.998% xtensa 0 0 0 0 0 0 means it didn't build. armhf is probably my own problem, not sure yet. xtensa starts with /tmp/ccmJoY7l.s: Assembler messages: /tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object file and it doesn't get better. My powerpc64 config actually built the powerpc64le config, since the kernel since a while looks what the host system is, for its defconfig. Oh well, fixed now. There are fivew new failures, with either of the combine2 patches. And all five are actually different (different symptoms, at least): - csky fails on libgcc build: /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi': /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for: 1405 | } | ^ (insn 199 86 87 8 (parallel [ (set (reg:SI 101) (plus:SI (reg:SI 98) (const_int -32 [0xffffffffffffffe0]))) (set (reg:CC 33 c) (lt:CC (plus:SI (reg:SI 98) (const_int -32 [0xffffffffffffffe0])) (const_int 0 [0]))) ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt} (nil)) during RTL pass: reload Target problem? - i386 goes into an infinite loop compiling, or at least an hour or so... Erm I forgot too record what it was compiling. I did attach a GDB... It is something from lra_create_live_ranges. - m68k: /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048 590 | } | ^ 0x10408307 final_scan_insn_1 /home/segher/src/gcc/gcc/final.c:3048 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) /home/segher/src/gcc/gcc/final.c:3152 0x10408797 final_1 /home/segher/src/gcc/gcc/final.c:2020 0x104091f7 rest_of_handle_final /home/segher/src/gcc/gcc/final.c:4658 0x104091f7 execute /home/segher/src/gcc/gcc/final.c:4736 and that line is gcc_assert (prev_nonnote_insn (insn) == last_ignored_compare); - nds32: /tmp/ccC8Czca.s: Assembler messages: /tmp/ccC8Czca.s:3144: Error: Unrecognized operand/register, lmw.bi [$fp+(-60)],[$fp],$r11,0x0. /tmp/ccl8o20c.s: Assembler messages: /tmp/ccl8o20c.s:2449: Error: Unrecognized operand/register, lmw.bi $r9,[$fp],[$fp+(-132)],0x0. /tmp/ccZxjwHd.s: Assembler messages: /tmp/ccZxjwHd.s:4776: Error: Unrecognized operand/register, lmw.bi [$fp+(-52)],[$fp],[$fp+(-56)],0x0. /tmp/cczjOS3d.s: Assembler messages: /tmp/cczjOS3d.s:2336: Error: Unrecognized operand/register, lmw.bi $r16,[$fp],$r7,0x0. and more. All lmw.bi... target issue? - sh (that's sh4-linux): /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' 1638 | } | ^ /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: (insn 18 17 19 2 (set (reg:SI 0 r0) (mem:SI (plus:SI (reg:SI 4 r4 [178]) (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} (expr_list:REG_DEAD (reg:SI 4 r4 [178]) (expr_list:REG_DEAD (reg:SI 6 r6 [171]) (nil)))) /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out Looking at just binary size, which is a good stand-in for how many insns it combined: R2 arm64 99.709% ia64 99.651% s390 99.734% sparc 99.877% sparc64 100.011% (These are those that are not between 99.9% and 100.0%). So only sparc64 regressed, and just a tiny bit (I can look at what that is, if there is interest). But 32-bit sparc improved, and s390, arm64, and ia64 got actual benefit. Again this is just code size, not analysing the actually changed code. I did look at the powerpc64le changes. It is almost completely load- with-update (and store-with-update) insns that make the difference, but there are also some dot insns. The extra mr. are usually not a good idea, but the extsw. are. Sometimes this causes *more* insns in the end (register move insns), but that is the exception. This mr. problem is there with combine already, btw. In the end it is caused by this just not being something good to do on pseudos, it would be better to do this after RA, in a peephole or similar. OTOH it isn't actually really important for performance either way. Btw, does the new pass use TARGET_LEGITIMATE_COMBINED_INSN? It probably should. (That would be the hook where we would probably want to prevent generating mr. insns). Segher
On 11/23/19 5:34 PM, Segher Boessenkool wrote: > Hi! > > On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote: >> Richard Sandiford <richard.sandiford@arm.com> writes: >>> (It's 23:35 local time, so it's still just about stage 1. :-)) >> Or actually, just under 1 day after end of stage 1. Oops. >> Could have sworn stage 1 ended on the 17th :-( Only realised >> I'd got it wrong when catching up on Saturday's email traffic. >> >> And inevitably, I introduced a couple of stupid mistakes while >> trying to clean the patch up for submission by that (non-)deadline. >> Here's a version that fixes an inverted overlapping memref check >> and that correctly prunes the use list for combined instructions. >> (This last one is just a compile-time saving -- the old code was >> correct, just suboptimal.) > I've build the Linux kernel with the previous version, as well as this > one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: > > (I've forced --param=run-combine=6 for R1 and R2): > (Percentages are relative to R0): > > R0 R1 R2 R1 R2 > alpha 6107088 6101088 6101088 99.902% 99.902% > arc 4008224 4006568 4006568 99.959% 99.959% > arm 9206728 9200936 9201000 99.937% 99.938% > arm64 13056174 13018174 13018194 99.709% 99.709% > armhf 0 0 0 0 0 > c6x 2337237 2337077 2337077 99.993% 99.993% > csky 3356602 0 0 0 0 > h8300 1166996 1166776 1166776 99.981% 99.981% > i386 11352159 0 0 0 0 > ia64 18230640 18167000 18167000 99.651% 99.651% > m68k 3714271 0 0 0 0 > microblaze 4982749 4979945 4979945 99.944% 99.944% > mips 8499309 8495205 8495205 99.952% 99.952% > mips64 7042036 7039816 7039816 99.968% 99.968% > nds32 4486663 0 0 0 0 > nios2 3680001 3679417 3679417 99.984% 99.984% > openrisc 4226076 4225868 4225868 99.995% 99.995% > parisc 7681895 7680063 7680063 99.976% 99.976% > parisc64 8677077 8676581 8676581 99.994% 99.994% > powerpc 10687611 10682199 10682199 99.949% 99.949% > powerpc64 17671082 17658570 17658570 99.929% 99.929% > powerpc64le 17671082 17658570 17658570 99.929% 99.929% > riscv32 1554938 1554758 1554758 99.988% 99.988% > riscv64 6634342 6632788 6632788 99.977% 99.977% > s390 13049643 13014939 13014939 99.734% 99.734% > sh 3254743 0 0 0 0 > shnommu 1632364 1632124 1632124 99.985% 99.985% > sparc 4404993 4399593 4399593 99.877% 99.877% > sparc64 6796711 6797491 6797491 100.011% 100.011% > x86_64 19713174 19712817 19712817 99.998% 99.998% > xtensa 0 0 0 0 0 > > 0 means it didn't build. > > armhf is probably my own problem, not sure yet. > > xtensa starts with > /tmp/ccmJoY7l.s: Assembler messages: > /tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object file > and it doesn't get better. > > My powerpc64 config actually built the powerpc64le config, since the > kernel since a while looks what the host system is, for its defconfig. > Oh well, fixed now. > > There are fivew new failures, with either of the combine2 patches. And > all five are actually different (different symptoms, at least): > > - csky fails on libgcc build: > > /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi': > /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for: > 1405 | } > | ^ > (insn 199 86 87 8 (parallel [ > (set (reg:SI 101) > (plus:SI (reg:SI 98) > (const_int -32 [0xffffffffffffffe0]))) > (set (reg:CC 33 c) > (lt:CC (plus:SI (reg:SI 98) > (const_int -32 [0xffffffffffffffe0])) > (const_int 0 [0]))) > ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt} > (nil)) > during RTL pass: reload > > Target problem? > > - i386 goes into an infinite loop compiling, or at least an hour or so... > Erm I forgot too record what it was compiling. I did attach a GDB... It > is something from lra_create_live_ranges. > > - m68k: > > /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': > /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048 > 590 | } > | ^ > 0x10408307 final_scan_insn_1 > /home/segher/src/gcc/gcc/final.c:3048 > 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) > /home/segher/src/gcc/gcc/final.c:3152 > 0x10408797 final_1 > /home/segher/src/gcc/gcc/final.c:2020 > 0x104091f7 rest_of_handle_final > /home/segher/src/gcc/gcc/final.c:4658 > 0x104091f7 execute > /home/segher/src/gcc/gcc/final.c:4736 > > and that line is > gcc_assert (prev_nonnote_insn (insn) == last_ignored_compare); > > - nds32: > > /tmp/ccC8Czca.s: Assembler messages: > /tmp/ccC8Czca.s:3144: Error: Unrecognized operand/register, lmw.bi [$fp+(-60)],[$fp],$r11,0x0. > > /tmp/ccl8o20c.s: Assembler messages: > /tmp/ccl8o20c.s:2449: Error: Unrecognized operand/register, lmw.bi $r9,[$fp],[$fp+(-132)],0x0. > > /tmp/ccZxjwHd.s: Assembler messages: > /tmp/ccZxjwHd.s:4776: Error: Unrecognized operand/register, lmw.bi [$fp+(-52)],[$fp],[$fp+(-56)],0x0. > > /tmp/cczjOS3d.s: Assembler messages: > /tmp/cczjOS3d.s:2336: Error: Unrecognized operand/register, lmw.bi $r16,[$fp],$r7,0x0. > > and more. All lmw.bi... target issue? > > - sh (that's sh4-linux): > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > 1638 | } > | ^ > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > (insn 18 17 19 2 (set (reg:SI 0 r0) > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > (nil)))) > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out > > > Looking at just binary size, which is a good stand-in for how many insns > it combined: > > R2 > arm64 99.709% > ia64 99.651% > s390 99.734% > sparc 99.877% > sparc64 100.011% > > (These are those that are not between 99.9% and 100.0%). > > So only sparc64 regressed, and just a tiny bit (I can look at what that > is, if there is interest). But 32-bit sparc improved, and s390, arm64, > and ia64 got actual benefit. > > Again this is just code size, not analysing the actually changed code. > > > I did look at the powerpc64le changes. It is almost completely load- > with-update (and store-with-update) insns that make the difference, but > there are also some dot insns. The extra mr. are usually not a good > idea, but the extsw. are. Sometimes this causes *more* insns in the end > (register move insns), but that is the exception. > > This mr. problem is there with combine already, btw. In the end it is > caused by this just not being something good to do on pseudos, it would > be better to do this after RA, in a peephole or similar. OTOH it isn't > actually really important for performance either way. > > Btw, does the new pass use TARGET_LEGITIMATE_COMBINED_INSN? It probably > should. (That would be the hook where we would probably want to prevent > generating mr. insns). > > > Segher Segher, Please just CC to this conversation as I keep getting removed. Thanks, Nick
On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote:
> Please just CC to this conversation as I keep getting removed.
Everyone who was on Cc: for this thread still is. This is how email
works. If you want to see everything on the list, subscribe to the
mailing list?
Segher
On 11/23/19 6:09 PM, Segher Boessenkool wrote: > On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote: >> Please just CC to this conversation as I keep getting removed. > Everyone who was on Cc: for this thread still is. This is how email > works. If you want to see everything on the list, subscribe to the > mailing list? > > > Segher I was part of the original CCs on my comments but seems that there were two or seemed to be two splitting versions of the thread. I would like to just keep all comments merged in one thread is all. Sorry for the confusion Segher, Nick
Segher Boessenkool <segher@kernel.crashing.org> writes: > On Thu, Nov 21, 2019 at 08:32:14PM +0000, Richard Sandiford wrote: >> Segher Boessenkool <segher@kernel.crashing.org> writes: >> > It's not great if every pass invents its own version of some common >> > infrastructure thing because that common one is not suitable. >> > >> > I.e., can this be fixed somehow? Maybe just by having a restricted DU >> > chains df problem? >> >> Well, it'd probably make sense to make fwprop.c's approach available >> as a "proper" df interface at some point. Hopefully if anyone wants the >> same thing as fwprop.c, they'd do that rather than copy the code. :-) > >> >> * Updating a full, ordered def-use chain after a move is a linear-time >> >> operation, so whatever happens, we'd need to apply some kind of limit >> >> on the number of uses we maintain, with something like that integer >> >> point range for the rest. > > Yeah. > >> >> * Once we've analysed the insn and built its def-use chains, we don't >> >> look at the df_refs again until we update the chains after a successful >> >> combination. So it should be more efficient to maintain a small array >> >> of insn_info_rec pointers alongside the numerical range, rather than >> >> walk and pollute chains of df_refs and then link back the insn uids >> >> to the pass-local info. >> > >> > So you need something like combine's LOG_LINKS? Not that handling those >> > is not quadratic in the worst case, but in practice it works well. And >> > it *could* be made linear. >> >> Not sure why what I've used isn't what I need though :-) > > I am wondering the other way around :-) Is what you do for combine2 > something that would be more generally applicable/useful? That's what > I'm trying to find out :-) > > What combine does could use some improvement, if you want to hear a > more direct motivations. LOG_LINKS just skip references we cannot > handle (and some more), so we always have to do modified_between etc., > which hurts. The trade-offs behind the choice of representation are very specific to the pass. You'd only pick this if you wanted both to propagate definitions into uses and to move insns around. You'd also only pick it if you were happy with tracking a small number of named uses per definition. I can't think of any other passes that would prefer this over what they already use. (Combine itself is an exception, since the new pass started out as a deliberate attempt to start from scratch.) >> >> Target Tests Delta Best Worst Median >> >> avr-elf 1341 -111401 -13824 680 -10 >> > >> > Things like this are kind of suspicious :-) >> >> Yeah. This mostly seems to come from mopping up the extra moves created >> by make_more_copies. So we have combinations like: >> >> 58: r70:SF=r94:SF >> REG_DEAD r94:SF >> 60: r22:SF=r70:SF >> REG_DEAD r70:SF > > Why didn't combine do this? A target problem? Seems to be because combine rejects hard-reg destinations whose classes are likely spilled (cant_combine_insn_p). This SF argument register happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject the combination based on POINTER_X_REGS being likely spilled. I think the same thing could happen on other targets, e.g. for TAILCALL_ADDR_REGS on aarch64. >> So there's only one case in which it isn't a win, but the number of >> tests is tiny. So I agree there's no justification for trying this in >> combine proper as things stand (and I wasn't arguing otherwise FWIW). >> I'd still like to keep it in the new pass because it does help >> *sometimes* and there's no sign yet that it has a noticeable >> compile-time cost. > > So when does it help? I can only think of cases where there are > problems elsewhere. The full list of affected tests (all at -O2 -ftree-vectorize) are: arc-elf gcc.c-torture/compile/pr67506.c avr-elf gcc.dg/torture/pr77916.c bpf-elf gcc.dg/torture/vshuf-v8hi.c bpf-elf gcc.dg/torture/vshuf-v4si.c bfin-elf gcc.dg/torture/vshuf-v8qi.c c6x-elf gcc.c-torture/execute/991118-1.c cr16-elf gcc.c-torture/compile/pr82052.c epiphany-elf gcc.c-torture/execute/991118-1.c epiphany-elf gcc.dg/pr77664.c epiphany-elf gcc.dg/vect/vect-mult-pattern-2.c epiphany-elf gcc.dg/torture/vshuf-v8hi.c epiphany-elf gcc.dg/tree-ssa/pr77664.c epiphany-elf gcc.dg/tree-ssa/negneg-3.c fr30-elf gcc.dg/torture/vshuf-v4hi.c fr30-elf gcc.dg/torture/vshuf-v8hi.c frv-linux-gnu gcc.dg/torture/vshuf-v4hi.c frv-linux-gnu gcc.dg/torture/vshuf-v8hi.c h8300-elf gcc.c-torture/execute/20000422-1.c h8300-elf gcc.dg/torture/pr77916.c ia64-linux-gnu gcc.c-torture/execute/ieee/pr30704.c ia64-linux-gnu gcc.dg/vect/pr49478.c ia64-linux-gnu gcc.dg/tree-ssa/ldist-16.c i686-apple-darwin gcc.dg/vect/vect-mult-pattern-2.c m32r-elf gcc.dg/store_merging_8.c m32r-elf gcc.dg/torture/vshuf-v4hi.c m32r-elf gcc.dg/torture/vshuf-v8hi.c m32r-elf gcc.dg/tree-ssa/vrp61.c mcore-elf gcc.c-torture/execute/991118-1.c mcore-elf gcc.dg/torture/vshuf-v4hi.c mcore-elf gcc.dg/torture/vshuf-v8hi.c mcore-elf gcc.dg/torture/vshuf-v8qi.c mmix gcc.dg/torture/20181024-1.c mn10300-elf g++.dg/warn/Warray-bounds-6.C moxie-rtems gcc.c-torture/execute/930718-1.c moxie-rtems gcc.c-torture/compile/pr70263-1.c moxie-rtems gcc.dg/graphite/scop-5.c moxie-rtems g++.dg/pr80707.C nds32le-elf gcc.dg/torture/vshuf-v16qi.c nios2-linux-gnu gcc.dg/torture/vshuf-v8qi.c or1k-elf gcc.dg/torture/vshuf-v4hi.c or1k-elf gcc.dg/torture/vshuf-v8hi.c or1k-elf gcc.dg/tree-ssa/vrp61.c powerpc-ibm-aix7.0 g++.dg/warn/Wunused-3.C powerpc-ibm-aix7.0 g++.dg/lto/pr88049_0.C powerpc-ibm-aix7.0 g++.dg/other/cxa-atexit1.C s390-linux-gnu gcc.c-torture/compile/20020304-1.c s390-linux-gnu gcc.dg/atomic-op-1.c s390-linux-gnu gcc.dg/atomic/stdatomic-op-1.c s390-linux-gnu gcc.dg/atomic/c11-atomic-exec-2.c s390-linux-gnu gcc.dg/atomic/c11-atomic-exec-3.c s390-linux-gnu gcc.dg/ubsan/float-cast-overflow-atomic.c s390x-linux-gnu gcc.c-torture/compile/20020304-1.c sh-linux-gnu gcc.c-torture/execute/991118-1.c sh-linux-gnu gcc.dg/torture/vshuf-v8qi.c sparc-linux-gnu gcc.dg/pr56890-2.c sparc-linux-gnu gcc.dg/torture/vshuf-v4hi.c sparc-linux-gnu gcc.dg/torture/vshuf-v8hi.c sparc-linux-gnu gcc.dg/torture/20181024-1.c sparc64-linux-gnu gcc.dg/torture/20181024-1.c xstormy16-elf gcc.c-torture/execute/strlen-5.c xstormy16-elf gcc.c-torture/execute/20080424-1.c xstormy16-elf gcc.c-torture/compile/pr60655-1.c xstormy16-elf gcc.c-torture/compile/pr60655-2.c xstormy16-elf gcc.dg/Wrestrict-9.c xstormy16-elf gcc.dg/graphite/scop-15.c xstormy16-elf gcc.dg/guality/pr43051-1.c xstormy16-elf gcc.dg/torture/pr68955.c xstormy16-elf gcc.dg/torture/pr58955-2.c xstormy16-elf gcc.dg/tree-ssa/builtin-sprintf-warn-23.c The s390x-linux-gnu test is one in which we have: 116: {r167:DI=r86:DI-0x1000;clobber %cc:CC;} REG_DEAD r86:DI REG_UNUSED %cc:CC 118: %r2:DI=[r167:DI+r155:DI+0x5] REG_DEAD r167:DI REG_DEAD r155:DI REG_EQUAL [r167:DI+0x1005] and so the 0x1000s cancel each other out. And yeah, you could definitely argue that it's a problem elsewhere. :-) Expand has: ;; _32 = BGl_equalzf3zf3zz__r4_equivalence_6_2z00 (_31, 2B); (insn 113 112 114 (set (reg:DI 164) (const_int -4096 [0xfffffffffffff000])) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) (insn 114 113 115 (set (reg:DI 165) (reg:DI 164)) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) (insn 115 114 116 (set (reg:DI 166) (const_int 4096 [0x1000])) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) (insn 116 115 117 (parallel [ (set (reg:DI 167) (plus:DI (reg:DI 86 [ BgL_cdrzd21994zd2_959.10_27 ]) (reg:DI 165))) (clobber (reg:CC 33 %cc)) ]) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) (insn 117 116 118 (set (reg:DI 3 %r3) (const_int 2 [0x2])) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) (insn 118 117 119 (set (reg:DI 2 %r2) (mem/f/j:DI (plus:DI (plus:DI (reg:DI 167) (reg:DI 166)) (const_int 5 [0x5])) [2 _30->pair_t.cdr+0 S8 A64])) "gcc.c-torture/compile/20020304-1.c":161:9 -1 (nil)) >> It might also be interesting to see how much difference it makes for >> run-combine=4 (e.g. to see how much it makes up for the current 2-insn >> limit)... > > Numbers are good :-) FWIW, it does make more of a difference there, but not massively: Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== aarch64-linux-gnu 5 -15 -5 -1 -3 aarch64_be-linux-gnu 4 -14 -5 -2 -4 arc-elf 1 -4 -4 -4 -4 arm-linux-gnueabi 4 -22 -10 -2 -8 arm-linux-gnueabihf 4 -22 -10 -2 -8 avr-elf 1 -1 -1 -1 -1 bfin-elf 25 -592 -223 3 -5 bpf-elf 47 -508 -95 -1 -3 c6x-elf 26 -388 -74 1 -4 cr16-elf 18 -142 -82 -1 -2 csky-elf 5 -10 -4 -1 -2 epiphany-elf 30 -514 -155 -1 -4 fr30-elf 28 -416 -140 -1 -3 frv-linux-gnu 45 -1274 -209 -1 -4 ft32-elf 7 -17 -6 -1 -2 h8300-elf 3 -7 -5 -1 -1 hppa64-hp-hpux11.23 1 -1 -1 -1 -1 i686-apple-darwin 1 -3 -3 -3 -3 ia64-linux-gnu 8 -86 -26 -5 -10 iq2000-elf 1 -2 -2 -2 -2 m32r-elf 78 -1692 -308 -2 -4 mcore-elf 58 -1117 -174 3 -5 mipsel-linux-gnu 7 -26 -8 -2 -3 mipsisa64-linux-gnu 30 -136 -18 -2 -3 mmix 5 -7 -2 -1 -1 mn10300-elf 1 -2 -2 -2 -2 moxie-rtems 11 -35 -5 -2 -3 msp430-elf 1 -1 -1 -1 -1 nds32le-elf 15 -142 -88 -1 -2 nios2-linux-gnu 22 -259 -110 -1 -4 nvptx-none 2 -8 -4 -4 -4 or1k-elf 34 -592 -160 -1 -3 powerpc64le-linux-gnu 1 -8 -8 -8 -8 riscv32-elf 4 -11 -6 -1 -2 riscv64-elf 2 -7 -6 -1 -6 rl78-elf 1 -7 -7 -7 -7 rx-elf 1 -2 -2 -2 -2 s390-linux-gnu 35 708 -12 292 -1 s390x-linux-gnu 15 -53 -6 -2 -3 sh-linux-gnu 38 -741 -141 2 -6 sparc-linux-gnu 26 -478 -156 -1 -7 sparc64-linux-gnu 10 -86 -28 -2 -4 vax-netbsdelf 1 -4 -4 -4 -4 visium-elf 30 -467 -159 -1 -4 x86_64-darwin 7 -24 -10 -1 -2 x86_64-linux-gnu 7 -26 -12 -1 -2 xstormy16-elf 15 -70 -45 2 -2 xtensa-elf 26 -682 -226 -2 -4 Thanks, Richard
Segher Boessenkool <segher@kernel.crashing.org> writes: > Hi! > > On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote: >> Richard Sandiford <richard.sandiford@arm.com> writes: >> > (It's 23:35 local time, so it's still just about stage 1. :-)) >> >> Or actually, just under 1 day after end of stage 1. Oops. >> Could have sworn stage 1 ended on the 17th :-( Only realised >> I'd got it wrong when catching up on Saturday's email traffic. >> >> And inevitably, I introduced a couple of stupid mistakes while >> trying to clean the patch up for submission by that (non-)deadline. >> Here's a version that fixes an inverted overlapping memref check >> and that correctly prunes the use list for combined instructions. >> (This last one is just a compile-time saving -- the old code was >> correct, just suboptimal.) > > I've build the Linux kernel with the previous version, as well as this > one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: > > (I've forced --param=run-combine=6 for R1 and R2): > (Percentages are relative to R0): > > R0 R1 R2 R1 R2 > alpha 6107088 6101088 6101088 99.902% 99.902% > arc 4008224 4006568 4006568 99.959% 99.959% > arm 9206728 9200936 9201000 99.937% 99.938% > arm64 13056174 13018174 13018194 99.709% 99.709% > armhf 0 0 0 0 0 > c6x 2337237 2337077 2337077 99.993% 99.993% > csky 3356602 0 0 0 0 > h8300 1166996 1166776 1166776 99.981% 99.981% > i386 11352159 0 0 0 0 > ia64 18230640 18167000 18167000 99.651% 99.651% > m68k 3714271 0 0 0 0 > microblaze 4982749 4979945 4979945 99.944% 99.944% > mips 8499309 8495205 8495205 99.952% 99.952% > mips64 7042036 7039816 7039816 99.968% 99.968% > nds32 4486663 0 0 0 0 > nios2 3680001 3679417 3679417 99.984% 99.984% > openrisc 4226076 4225868 4225868 99.995% 99.995% > parisc 7681895 7680063 7680063 99.976% 99.976% > parisc64 8677077 8676581 8676581 99.994% 99.994% > powerpc 10687611 10682199 10682199 99.949% 99.949% > powerpc64 17671082 17658570 17658570 99.929% 99.929% > powerpc64le 17671082 17658570 17658570 99.929% 99.929% > riscv32 1554938 1554758 1554758 99.988% 99.988% > riscv64 6634342 6632788 6632788 99.977% 99.977% > s390 13049643 13014939 13014939 99.734% 99.734% > sh 3254743 0 0 0 0 > shnommu 1632364 1632124 1632124 99.985% 99.985% > sparc 4404993 4399593 4399593 99.877% 99.877% > sparc64 6796711 6797491 6797491 100.011% 100.011% > x86_64 19713174 19712817 19712817 99.998% 99.998% > xtensa 0 0 0 0 0 Thanks for running these. > There are fivew new failures, with either of the combine2 patches. And > all five are actually different (different symptoms, at least): > > - csky fails on libgcc build: > > /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi': > /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for: > 1405 | } > | ^ > (insn 199 86 87 8 (parallel [ > (set (reg:SI 101) > (plus:SI (reg:SI 98) > (const_int -32 [0xffffffffffffffe0]))) > (set (reg:CC 33 c) > (lt:CC (plus:SI (reg:SI 98) > (const_int -32 [0xffffffffffffffe0])) > (const_int 0 [0]))) > ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt} > (nil)) > during RTL pass: reload > > Target problem? Yeah, looks like it. The pattern is: (define_insn "*cskyv2_declt" [(set (match_operand:SI 0 "register_operand" "=r") (plus:SI (match_operand:SI 1 "register_operand" "r") (match_operand:SI 2 "const_int_operand" "Uh"))) (set (reg:CC CSKY_CC_REGNUM) (lt:CC (plus:SI (match_dup 1) (match_dup 2)) (const_int 0)))] "CSKY_ISA_FEATURE (2E3)" "declt\t%0, %1, %M2" ) So the predicate accepts all const_ints but the constraint doesn't. > - i386 goes into an infinite loop compiling, or at least an hour or so... > Erm I forgot too record what it was compiling. I did attach a GDB... It > is something from lra_create_live_ranges. Hmm. > - m68k: > > /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': > /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048 > 590 | } > | ^ > 0x10408307 final_scan_insn_1 > /home/segher/src/gcc/gcc/final.c:3048 > 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) > /home/segher/src/gcc/gcc/final.c:3152 > 0x10408797 final_1 > /home/segher/src/gcc/gcc/final.c:2020 > 0x104091f7 rest_of_handle_final > /home/segher/src/gcc/gcc/final.c:4658 > 0x104091f7 execute > /home/segher/src/gcc/gcc/final.c:4736 > > and that line is > gcc_assert (prev_nonnote_insn (insn) == last_ignored_compare); Ah, this'll be while m68k was still a cc0 target. Yeah, I should probably just skip the whole pass for cc0. > - nds32: > > /tmp/ccC8Czca.s: Assembler messages: > /tmp/ccC8Czca.s:3144: Error: Unrecognized operand/register, lmw.bi [$fp+(-60)],[$fp],$r11,0x0. > > /tmp/ccl8o20c.s: Assembler messages: > /tmp/ccl8o20c.s:2449: Error: Unrecognized operand/register, lmw.bi $r9,[$fp],[$fp+(-132)],0x0. > > /tmp/ccZxjwHd.s: Assembler messages: > /tmp/ccZxjwHd.s:4776: Error: Unrecognized operand/register, lmw.bi [$fp+(-52)],[$fp],[$fp+(-56)],0x0. > > /tmp/cczjOS3d.s: Assembler messages: > /tmp/cczjOS3d.s:2336: Error: Unrecognized operand/register, lmw.bi $r16,[$fp],$r7,0x0. > > and more. All lmw.bi... target issue? Yeah, looks like it wasn't expecting this pattern to be generated automatically before RA, so it doesn't have constraints (and probably couldn't, since the registers need to be consecutive). > - sh (that's sh4-linux): > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > 1638 | } > | ^ > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > (insn 18 17 19 2 (set (reg:SI 0 r0) > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > (nil)))) > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out Would have to look more at this one. Seems odd that it can't allocate R0 when it's already the destination and when R0 can't be live before the insn. But there again, this is reload, so my enthuasiasm for looking is a bit limited :-) > Looking at just binary size, which is a good stand-in for how many insns > it combined: > > R2 > arm64 99.709% > ia64 99.651% > s390 99.734% > sparc 99.877% > sparc64 100.011% > > (These are those that are not between 99.9% and 100.0%). > > So only sparc64 regressed, and just a tiny bit (I can look at what that > is, if there is interest). But 32-bit sparc improved, and s390, arm64, > and ia64 got actual benefit. > > Again this is just code size, not analysing the actually changed code. OK. Certainly not an earth-shattering improvement then, but not entirely worthless either. > I did look at the powerpc64le changes. It is almost completely load- > with-update (and store-with-update) insns that make the difference, but > there are also some dot insns. The extra mr. are usually not a good > idea, but the extsw. are. Sometimes this causes *more* insns in the end > (register move insns), but that is the exception. > > This mr. problem is there with combine already, btw. In the end it is > caused by this just not being something good to do on pseudos, it would > be better to do this after RA, in a peephole or similar. OTOH it isn't > actually really important for performance either way. > > Btw, does the new pass use TARGET_LEGITIMATE_COMBINED_INSN? It probably > should. (That would be the hook where we would probably want to prevent > generating mr. insns). No, it doesn't use that yet, but I agree it should. Will fix. I see combine also tests cannot_copy_insn_p. I'm not sure whether that's appropriate for the new pass or not. Arguably it's not copying the instruction, it's just moving it to be in parallel with something else. (But then that's largely true of the combine case too.) Thanks, Richard
Hi! On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > I am wondering the other way around :-) Is what you do for combine2 > > something that would be more generally applicable/useful? That's what > > I'm trying to find out :-) > > > > What combine does could use some improvement, if you want to hear a > > more direct motivations. LOG_LINKS just skip references we cannot > > handle (and some more), so we always have to do modified_between etc., > > which hurts. > > The trade-offs behind the choice of representation are very specific > to the pass. Yes, but hopefully not so specific that every pass needs a completely different representation ;-) > >> >> Target Tests Delta Best Worst Median > >> >> avr-elf 1341 -111401 -13824 680 -10 > >> > > >> > Things like this are kind of suspicious :-) > >> > >> Yeah. This mostly seems to come from mopping up the extra moves created > >> by make_more_copies. So we have combinations like: > >> > >> 58: r70:SF=r94:SF > >> REG_DEAD r94:SF > >> 60: r22:SF=r70:SF > >> REG_DEAD r70:SF > > > > Why didn't combine do this? A target problem? > > Seems to be because combine rejects hard-reg destinations whose classes > are likely spilled (cant_combine_insn_p). Ah, okay. And that is required to prevent ICEs, in combine2 as well then -- ICEs in RA. There should be a better way to do this. > This SF argument register > happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject > the combination based on POINTER_X_REGS being likely spilled. static bool avr_class_likely_spilled_p (reg_class_t c) { return (c != ALL_REGS && (AVR_TINY ? 1 : c != ADDW_REGS)); } So this target severely shackles combine. Does it have to? If so, why not with combine2? > >> So there's only one case in which it isn't a win, but the number of > >> tests is tiny. So I agree there's no justification for trying this in > >> combine proper as things stand (and I wasn't arguing otherwise FWIW). > >> I'd still like to keep it in the new pass because it does help > >> *sometimes* and there's no sign yet that it has a noticeable > >> compile-time cost. > > > > So when does it help? I can only think of cases where there are > > problems elsewhere. > > The full list of affected tests (all at -O2 -ftree-vectorize) are: I'll have to look at this closer later, sorry. Segher
On Mon, Nov 25, 2019 at 09:40:36PM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > - i386 goes into an infinite loop compiling, or at least an hour or so... > > Erm I forgot too record what it was compiling. I did attach a GDB... It > > is something from lra_create_live_ranges. > > Hmm. This one is actually worrying me -- it's not obviously a simple problem, or a target problem, or a pre-existing problem. > Ah, this'll be while m68k was still a cc0 target. Yeah, I should probably > just skip the whole pass for cc0. Yes, tree of last friday or saturday or so. And yup if you don't handle cc0 yet, yes you want to skip it completely :-) > > - sh (that's sh4-linux): > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > > 1638 | } > > | ^ > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > > (insn 18 17 19 2 (set (reg:SI 0 r0) > > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > > (nil)))) > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out > > Would have to look more at this one. Seems odd that it can't allocate > R0 when it's already the destination and when R0 can't be live before > the insn. But there again, this is reload, so my enthuasiasm for looking > is a bit limited :-) It wants to use r0 in some other insn, so it needs to spill it here, but cannot. This is what class_likely_spilled is for. > > Looking at just binary size, which is a good stand-in for how many insns > > it combined: > > > > R2 > > arm64 99.709% > > ia64 99.651% > > s390 99.734% > > sparc 99.877% > > sparc64 100.011% > > > > (These are those that are not between 99.9% and 100.0%). > > > > So only sparc64 regressed, and just a tiny bit (I can look at what that > > is, if there is interest). But 32-bit sparc improved, and s390, arm64, > > and ia64 got actual benefit. > > > > Again this is just code size, not analysing the actually changed code. > > OK. Certainly not an earth-shattering improvement then, but not > entirely worthless either. I usually takes 0.2% as "definitely useful" for combine improvements, so there are a few targets that have that. There can be improvements that are important for a target even if they do not improve code size much, of course, and it can identify weaknesses in the backend code, so you always need to look at what really changes. > I see combine also tests cannot_copy_insn_p. I'm not sure whether that's > appropriate for the new pass or not. Arguably it's not copying the > instruction, it's just moving it to be in parallel with something else. > (But then that's largely true of the combine case too.) combine tests this only for the cases where it *does* have to copy an insn: when the dest if i0, i1, or i2 doesn't die, it is added as another arm to the (parallel) result. Segher
Segher Boessenkool <segher@kernel.crashing.org> writes: > Hi! > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote: >> Segher Boessenkool <segher@kernel.crashing.org> writes: >> > I am wondering the other way around :-) Is what you do for combine2 >> > something that would be more generally applicable/useful? That's what >> > I'm trying to find out :-) >> > >> > What combine does could use some improvement, if you want to hear a >> > more direct motivations. LOG_LINKS just skip references we cannot >> > handle (and some more), so we always have to do modified_between etc., >> > which hurts. >> >> The trade-offs behind the choice of representation are very specific >> to the pass. > > Yes, but hopefully not so specific that every pass needs a completely > different representation ;-) Well, it depends. Most passes make do with df (without DU/UD-chains). But since DU/UD-chains are naturally quadratic in the general case, and are expensive to keep up to date, each DU/UD pass is going to have make some compromises. It doesn't seem too bad that passes make different compromises based on what they're trying to do. (combine: single use per definition; fwprop.c: track all uses, but for dominating definitions only; sched: fudged via a param; regrename: single definition/multiple use chains optimised for renmaing; combine2: full live range information, but limited use list; etc.) So yeah, if passes want to make roughly the same compromises, it would obviously be good if they shared a representation. But since each pass does something different, I don't think it's a bad sign that they make different compromises and use different representations. So I don't think a new pass with a new representation is in itself a sign of failure. >> >> >> Target Tests Delta Best Worst Median >> >> >> avr-elf 1341 -111401 -13824 680 -10 >> >> > >> >> > Things like this are kind of suspicious :-) >> >> >> >> Yeah. This mostly seems to come from mopping up the extra moves created >> >> by make_more_copies. So we have combinations like: >> >> >> >> 58: r70:SF=r94:SF >> >> REG_DEAD r94:SF >> >> 60: r22:SF=r70:SF >> >> REG_DEAD r70:SF >> > >> > Why didn't combine do this? A target problem? >> >> Seems to be because combine rejects hard-reg destinations whose classes >> are likely spilled (cant_combine_insn_p). > > Ah, okay. And that is required to prevent ICEs, in combine2 as well > then -- ICEs in RA. Not in this case though. The final instruction is a hardreg<-pseudo move whatever happens. There's nothing special about r70 compared to r94. > There should be a better way to do this. ISTM we should be checking for whichever cases actually cause the RA failures. E.g. to take on extreme example, if all the following are true: - an insn has a single alternative - an insn has a single non-earyclobber output - an insn has no parallel clobbers - an insn has no auto-inc/decs - an insn has a hard register destination that satisfies its constraints - the hard register is defined in its original location then there should be no problem. The insn shouldn't need any output reloads that would conflict with the hard register. It also doesn't extend the live range of the output. Obviously that's a lot of conditions :-) And IMO they should be built up the other way around: reject specific cases that are known to cause problems, based on information about the matched insn. But I think the avr example shows that there's a real problem with using REGNO_REG_CLASS for this too. REGNO_REG_CLASS gives the smallest enclosing class, which might not be the most relevant one in context. (It isn't here, since we're just passing arguments to functions.) >> This SF argument register >> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject >> the combination based on POINTER_X_REGS being likely spilled. > > static bool > avr_class_likely_spilled_p (reg_class_t c) > { > return (c != ALL_REGS && > (AVR_TINY ? 1 : c != ADDW_REGS)); > } > > So this target severely shackles combine. Does it have to? If so, why > not with combine2? As far as the above example goes, I think returning true for POINTER_X_REGS is the right thing to do. It only has two 8-bit registers, and they act as a pair when used as a pointer. Thanks, Richard
On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote: > >> Segher Boessenkool <segher@kernel.crashing.org> writes: > >> > I am wondering the other way around :-) Is what you do for combine2 > >> > something that would be more generally applicable/useful? That's what > >> > I'm trying to find out :-) > >> > > >> > What combine does could use some improvement, if you want to hear a > >> > more direct motivations. LOG_LINKS just skip references we cannot > >> > handle (and some more), so we always have to do modified_between etc., > >> > which hurts. > >> > >> The trade-offs behind the choice of representation are very specific > >> to the pass. > > > > Yes, but hopefully not so specific that every pass needs a completely > > different representation ;-) > > Well, it depends. Most passes make do with df (without DU/UD-chains). > But since DU/UD-chains are naturally quadratic in the general case, > and are expensive to keep up to date, each DU/UD pass is going to have > make some compromises. It doesn't seem too bad that passes make > different compromises based on what they're trying to do. (combine: > single use per definition; fwprop.c: track all uses, but for dominating > definitions only; sched: fudged via a param; regrename: single > definition/multiple use chains optimised for renmaing; combine2: full > live range information, but limited use list; etc.) combine actually *calculates* DU chains almost completely, it just throws away most of that information (it wants to have LOG_LINKS, as it did ages ago). The only thing stopping us from doing that right now is that not all uses are counted (some are skipped). Since combine works only within BBs, DU chains are linear to compute, and UD chains are trivial (and just linear to compute). Updating is quadratic in general, sure. Luckily in most realistic cases it is cheap (most, sigh) (insns aren't combined to very far away). > So yeah, if passes want to make roughly the same compromises, it would > obviously be good if they shared a representation. But since each pass > does something different, I don't think it's a bad sign that they make > different compromises and use different representations. > > So I don't think a new pass with a new representation is in itself a > sign of failure. Oh, I don't think so either. I just wonder if it would be useful more generically :-) > >> >> >> Target Tests Delta Best Worst Median > >> >> >> avr-elf 1341 -111401 -13824 680 -10 > >> >> > > >> >> > Things like this are kind of suspicious :-) > >> >> > >> >> Yeah. This mostly seems to come from mopping up the extra moves created > >> >> by make_more_copies. So we have combinations like: > >> >> > >> >> 58: r70:SF=r94:SF > >> >> REG_DEAD r94:SF > >> >> 60: r22:SF=r70:SF > >> >> REG_DEAD r70:SF > >> > > >> > Why didn't combine do this? A target problem? > >> > >> Seems to be because combine rejects hard-reg destinations whose classes > >> are likely spilled (cant_combine_insn_p). > > > > Ah, okay. And that is required to prevent ICEs, in combine2 as well > > then -- ICEs in RA. > > Not in this case though. The final instruction is a hardreg<-pseudo move > whatever happens. There's nothing special about r70 compared to r94. So the target hook could be improved? Or, this doesn't matter anyway, the extra register move does not prevent any combinations, and RA should get rid of it when that is beneficial. But you see smaller code in the end, hrm. Segher
On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote: > > Segher Boessenkool <segher@kernel.crashing.org> writes: > > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote: > > >> Segher Boessenkool <segher@kernel.crashing.org> writes: > > >> > I am wondering the other way around :-) Is what you do for combine2 > > >> > something that would be more generally applicable/useful? That's what > > >> > I'm trying to find out :-) > > >> > > > >> > What combine does could use some improvement, if you want to hear a > > >> > more direct motivations. LOG_LINKS just skip references we cannot > > >> > handle (and some more), so we always have to do modified_between etc., > > >> > which hurts. > > >> > > >> The trade-offs behind the choice of representation are very specific > > >> to the pass. > > > > > > Yes, but hopefully not so specific that every pass needs a completely > > > different representation ;-) > > > > Well, it depends. Most passes make do with df (without DU/UD-chains). > > But since DU/UD-chains are naturally quadratic in the general case, > > and are expensive to keep up to date, each DU/UD pass is going to have > > make some compromises. It doesn't seem too bad that passes make > > different compromises based on what they're trying to do. (combine: > > single use per definition; fwprop.c: track all uses, but for dominating > > definitions only; sched: fudged via a param; regrename: single > > definition/multiple use chains optimised for renmaing; combine2: full > > live range information, but limited use list; etc.) > > combine actually *calculates* DU chains almost completely, it just throws > away most of that information (it wants to have LOG_LINKS, as it did ages > ago). The only thing stopping us from doing that right now is that not > all uses are counted (some are skipped). > > Since combine works only within BBs, DU chains are linear to compute, and > UD chains are trivial (and just linear to compute). quadraticness appears for RTL DU/UD chains because of partial definitions, that doesn't change for BBs so even there computing is them is quadratic (because recording them is). The situation is simply having N partial defs all reaching M uses which gives you a chain of size N * M. Now - for combine you don't want partial defs, so for simplicity we could choose to _not_ record DU/UD chains whenever we see a partial def for a pseudo (and mark those as "bad"). Or, slightly enhanced, we can handle DU/UD chains for regions where there is no partial definition and add a "fake" D denoting (there are [multiple] defs beyond that might be partial). Depending on the use-case that should suffice and make the problem linear. I think you want to ask sth like "is REG changed [partially] between its use in insn A and the def in insn B" and you want to answer that by using REGs UD chain for that. If you only ever reached the def in insn B via the "pruned" chain then this would work, likewise for the chain we do not compute any UD chain for REG. > Updating is quadratic in general, sure. Luckily in most realistic cases > it is cheap (most, sigh) (insns aren't combined to very far away). Updating is linear as well if you can disregard partial defs. Updating cannot be quadratic if compute is linear ;) > > So yeah, if passes want to make roughly the same compromises, it would > > obviously be good if they shared a representation. But since each pass > > does something different, I don't think it's a bad sign that they make > > different compromises and use different representations. > > > > So I don't think a new pass with a new representation is in itself a > > sign of failure. > > Oh, I don't think so either. I just wonder if it would be useful more > generically :-) > > > >> >> >> Target Tests Delta Best Worst Median > > >> >> >> avr-elf 1341 -111401 -13824 680 -10 > > >> >> > > > >> >> > Things like this are kind of suspicious :-) > > >> >> > > >> >> Yeah. This mostly seems to come from mopping up the extra moves created > > >> >> by make_more_copies. So we have combinations like: > > >> >> > > >> >> 58: r70:SF=r94:SF > > >> >> REG_DEAD r94:SF > > >> >> 60: r22:SF=r70:SF > > >> >> REG_DEAD r70:SF > > >> > > > >> > Why didn't combine do this? A target problem? > > >> > > >> Seems to be because combine rejects hard-reg destinations whose classes > > >> are likely spilled (cant_combine_insn_p). > > > > > > Ah, okay. And that is required to prevent ICEs, in combine2 as well > > > then -- ICEs in RA. > > > > Not in this case though. The final instruction is a hardreg<-pseudo move > > whatever happens. There's nothing special about r70 compared to r94. > > So the target hook could be improved? Or, this doesn't matter anyway, > the extra register move does not prevent any combinations, and RA should > get rid of it when that is beneficial. > > But you see smaller code in the end, hrm. > > > Segher
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool > <segher@kernel.crashing.org> wrote: >> >> On Mon, Nov 25, 2019 at 11:08:47PM +0000, Richard Sandiford wrote: >> > Segher Boessenkool <segher@kernel.crashing.org> writes: >> > > On Mon, Nov 25, 2019 at 09:16:52PM +0000, Richard Sandiford wrote: >> > >> Segher Boessenkool <segher@kernel.crashing.org> writes: >> > >> > I am wondering the other way around :-) Is what you do for combine2 >> > >> > something that would be more generally applicable/useful? That's what >> > >> > I'm trying to find out :-) >> > >> > >> > >> > What combine does could use some improvement, if you want to hear a >> > >> > more direct motivations. LOG_LINKS just skip references we cannot >> > >> > handle (and some more), so we always have to do modified_between etc., >> > >> > which hurts. >> > >> >> > >> The trade-offs behind the choice of representation are very specific >> > >> to the pass. >> > > >> > > Yes, but hopefully not so specific that every pass needs a completely >> > > different representation ;-) >> > >> > Well, it depends. Most passes make do with df (without DU/UD-chains). >> > But since DU/UD-chains are naturally quadratic in the general case, >> > and are expensive to keep up to date, each DU/UD pass is going to have >> > make some compromises. It doesn't seem too bad that passes make >> > different compromises based on what they're trying to do. (combine: >> > single use per definition; fwprop.c: track all uses, but for dominating >> > definitions only; sched: fudged via a param; regrename: single >> > definition/multiple use chains optimised for renmaing; combine2: full >> > live range information, but limited use list; etc.) >> >> combine actually *calculates* DU chains almost completely, it just throws >> away most of that information (it wants to have LOG_LINKS, as it did ages >> ago). The only thing stopping us from doing that right now is that not >> all uses are counted (some are skipped). >> >> Since combine works only within BBs, DU chains are linear to compute, and >> UD chains are trivial (and just linear to compute). > > quadraticness appears for RTL DU/UD chains because of partial definitions, > that doesn't change for BBs so even there computing is them is quadratic > (because recording them is). The situation is simply having N partial > defs all reaching M uses which gives you a chain of size N * M. > > Now - for combine you don't want partial defs, so for simplicity we could > choose to _not_ record DU/UD chains whenever we see a partial def for > a pseudo (and mark those as "bad"). Or, slightly enhanced, we can > handle DU/UD chains for regions where there is no partial definition > and add a "fake" D denoting (there are [multiple] defs beyond that > might be partial). Depending on the use-case that should suffice and > make the problem linear. > > I think you want to ask sth like "is REG changed [partially] between > its use in insn A and the def in insn B" and you want to answer that by using > REGs UD chain for that. If you only ever reached the def in insn B via the > "pruned" chain then this would work, likewise for the chain we do not compute > any UD chain for REG. (Passing over this as I think it's about what current combine wants.) >> Updating is quadratic in general, sure. Luckily in most realistic cases >> it is cheap (most, sigh) (insns aren't combined to very far away). > > Updating is linear as well if you can disregard partial defs. > Updating cannot be quadratic if compute is linear ;) This was based on the assumption that we'd do an update after each combination, so that the pass still sees correct info. That then makes the updates across one run of the pass quadratic, since the number of successful combinations is O(ninsns). As far as the new pass goes: the pass would be quadratic if we tried to combine each use in single-def DU chain with its definition. It would also be quadratic if we tried to parallelise each pair of uses in a DU chain. So if we did have full DU chains in the new pass, we'd also need some limit N on the number of uses we try to combine with. And if we're only going to try combining with N uses, then it seemed better to track only N uses "by name", rather than pay the cost of tracking all uses by name but ignoring the information for some of them. All we care about for other uses is whether they would prevent a move. We can track that using a simple point-based live range, where points are LUIDs with gaps in between for new insns. So the new pass uses a list of N specific uses and a single live range. Querying whether a particular definition is live at a particular point is then a constant-time operation. So is updating the info after a successful combination (potentially including a move). That still seems like a reasonable way of representing this, given what the pass wants to do. Moving to full DU chains would IMO just make the pass more expensive with no obvious benefit. Thanks, Richard
On Wed, Nov 27, 2019 at 09:29:27AM +0100, Richard Biener wrote: > On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > combine actually *calculates* DU chains almost completely, it just throws > > away most of that information (it wants to have LOG_LINKS, as it did ages > > ago). The only thing stopping us from doing that right now is that not > > all uses are counted (some are skipped). > > > > Since combine works only within BBs, DU chains are linear to compute, and > > UD chains are trivial (and just linear to compute). > > quadraticness appears for RTL DU/UD chains because of partial definitions, > that doesn't change for BBs so even there computing is them is quadratic > (because recording them is). The situation is simply having N partial > defs all reaching M uses which gives you a chain of size N * M. And both N and M are constants here (bounded by a constant). The only dimensions we care about are those the user can grow unlimited: number of registers, number of instructions, number of functions, that kind of thing. The control flow graph in a basic block is a DAG, making most of this linear to compute. Only updating it after every separate change is not easily linear in total. > Updating is linear as well if you can disregard partial defs. Updating cannot > be quadratic if compute is linear ;) Sure it can. Updating has to be O(1) (amortized) per change for the whole pass to be O(n). If it is O(n) per change you are likely O(n^2) in total. I don't see how to make combine itself O(1) per change, but yeah I can see how that can work (or almost work) for something simpler (and less weighed down by history :-) ). Segher
On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote: > > > > - sh (that's sh4-linux): > > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > > > 1638 | } > > > | ^ > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > > > (insn 18 17 19 2 (set (reg:SI 0 r0) > > > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > > > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > > > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > > > (nil)))) > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out > > > > Would have to look more at this one. Seems odd that it can't allocate > > R0 when it's already the destination and when R0 can't be live before > > the insn. But there again, this is reload, so my enthuasiasm for looking > > is a bit limited :-) > > It wants to use r0 in some other insn, so it needs to spill it here, but > cannot. This is what class_likely_spilled is for. > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled explicitly, but it's got a R0_REGS class with only one said reg in it. So the default impl of class_likely_spilled should do its thing. LRA is available on SH and often fixes the R0 problems -- but not always. Maybe it got better over time, haven't checked. Could you re-run the SH build tests with -mlra, please ? Cheers, Oleg
On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote: > On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote: > > > > > > - sh (that's sh4-linux): > > > > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > > > > 1638 | } > > > > | ^ > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > > > > (insn 18 17 19 2 (set (reg:SI 0 r0) > > > > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > > > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > > > > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > > > > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > > > > (nil)))) > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out > > > > > > Would have to look more at this one. Seems odd that it can't allocate > > > R0 when it's already the destination and when R0 can't be live before > > > the insn. But there again, this is reload, so my enthuasiasm for looking > > > is a bit limited :-) > > > > It wants to use r0 in some other insn, so it needs to spill it here, but > > cannot. This is what class_likely_spilled is for. > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled > explicitly, but it's got a R0_REGS class with only one said reg in it. > So the default impl of class_likely_spilled should do its thing. Yes, good point. So what happened here? Is it just RA messing things up, unrelated to the new pass? Segher
On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote: > On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote: > > On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote: > > > > > > > > - sh (that's sh4-linux): > > > > > > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field': > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS' > > > > > 1638 | } > > > > > | ^ > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn: > > > > > (insn 18 17 19 2 (set (reg:SI 0 r0) > > > > > (mem:SI (plus:SI (reg:SI 4 r4 [178]) > > > > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i} > > > > > (expr_list:REG_DEAD (reg:SI 4 r4 [178]) > > > > > (expr_list:REG_DEAD (reg:SI 6 r6 [171]) > > > > > (nil)))) > > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out > > > > > > > > Would have to look more at this one. Seems odd that it can't allocate > > > > R0 when it's already the destination and when R0 can't be live before > > > > the insn. But there again, this is reload, so my enthuasiasm for looking > > > > is a bit limited :-) > > > > > > It wants to use r0 in some other insn, so it needs to spill it here, but > > > cannot. This is what class_likely_spilled is for. > > > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled > > explicitly, but it's got a R0_REGS class with only one said reg in it. > > So the default impl of class_likely_spilled should do its thing. > > Yes, good point. So what happened here? "Something, somewhere, went terribly wrong"... insn 18 wants to do mov.l @(r4,r6),r0 But it can't because the reg+reg address mode has a R0 constraint itself. So it needs to be changed to mov r4,r0 mov.l @(r0,r6),r0 And it can't handle that. Or only sometimes? Don't remember. > Is it just RA messing things > up, unrelated to the new pass? > Yep, I think so. The additional pass seems to create "tougher" code so reload passes out earlier than usual. We've had the same issue when trying address mode selection optimization. In fact that was one huge showstopper. Cheers, Oleg
Here's a revised version based on the feedback so far. Changes in v2: - Don't move instructions that set or use allocatable hard registers. - Check legitimate_combined_insn - Check cannot_copy_insn_p when keeping the original insn in parallel - Disable the pass if HAVE_cc0 I compared v1 and v2 in the same way as before and the new restrictions didn't make much difference (as hoped). Also bootstrapped & regression- tested on aarch64-linux-gnu and x86_64-linux-gnu with run-combine defaulting to 6 (unlike in the patch, where the new pass is disabled by default). Thanks, Richard 2019-12-05 Richard Sandiford <richard.sandiford@arm.com> gcc/ * Makefile.in (OBJS): Add combine2.o * params.opt (--param=run-combine): New option. * doc/invoke.texi: Document it. * tree-pass.h (make_pass_combine2_before): Declare. (make_pass_combine2_after): Likewise. * passes.def: Add them. * timevar.def (TV_COMBINE2): New timevar. * cfgrtl.h (update_cfg_for_uncondjump): Declare. * combine.c (update_cfg_for_uncondjump): Move to... * cfgrtl.c (update_cfg_for_uncondjump): ...here. * simplify-rtx.c (simplify_truncation): Handle comparisons. * recog.h (validate_simplify_replace_rtx): Declare. * recog.c (validate_simplify_replace_rtx_1): New function. (validate_simplify_replace_rtx_uses): Likewise. (validate_simplify_replace_rtx): Likewise. * combine2.c: New file. Index: gcc/Makefile.in =================================================================== --- gcc/Makefile.in 2019-12-03 18:06:09.885650522 +0000 +++ gcc/Makefile.in 2019-12-05 10:11:50.637631870 +0000 @@ -1261,6 +1261,7 @@ OBJS = \ cgraphunit.o \ cgraphclones.o \ combine.o \ + combine2.o \ combine-stack-adj.o \ compare-elim.o \ context.o \ Index: gcc/params.opt =================================================================== --- gcc/params.opt 2019-12-02 17:38:20.072423250 +0000 +++ gcc/params.opt 2019-12-05 10:11:50.653631761 +0000 @@ -760,6 +760,10 @@ Use internal function id in profile look Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param Maximum depth of a loop nest to fully value-number optimistically. +-param=run-combine= +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass. + -param=sccvn-max-alias-queries-per-access= Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param Maximum number of disambiguations to perform per memory access. Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi 2019-12-02 17:38:18.364434903 +0000 +++ gcc/doc/invoke.texi 2019-12-05 10:11:50.653631761 +0000 @@ -11797,6 +11797,11 @@ in combiner for a pseudo register as las @item max-combine-insns The maximum number of instructions the RTL combiner tries to combine. +@item run-combine +Choose which of the 3 available combine passes to run: bit 1 for the main +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 +for a later variant of the combine pass. + @item integer-share-limit Small integer constants can use a shared data structure, reducing the compiler's memory usage and increasing its speed. This sets the maximum Index: gcc/tree-pass.h =================================================================== --- gcc/tree-pass.h 2019-11-19 16:25:28.000000000 +0000 +++ gcc/tree-pass.h 2019-12-05 10:11:50.657631731 +0000 @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt); extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt); extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt); extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt); extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt); extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt); extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt); Index: gcc/passes.def =================================================================== --- gcc/passes.def 2019-11-19 16:25:28.000000000 +0000 +++ gcc/passes.def 2019-12-05 10:11:50.653631761 +0000 @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_inc_dec); NEXT_PASS (pass_initialize_regs); NEXT_PASS (pass_ud_rtl_dce); + NEXT_PASS (pass_combine2_before); NEXT_PASS (pass_combine); + NEXT_PASS (pass_combine2_after); NEXT_PASS (pass_if_after_combine); NEXT_PASS (pass_jump_after_combine); NEXT_PASS (pass_partition_blocks); Index: gcc/timevar.def =================================================================== --- gcc/timevar.def 2019-11-19 16:25:28.000000000 +0000 +++ gcc/timevar.def 2019-12-05 10:11:50.657631731 +0000 @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , " DEFTIMEVAR (TV_CSE2 , "CSE 2") DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction") DEFTIMEVAR (TV_COMBINE , "combiner") +DEFTIMEVAR (TV_COMBINE2 , "second combiner") DEFTIMEVAR (TV_IFCVT , "if-conversion") DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") DEFTIMEVAR (TV_SMS , "sms modulo scheduling") Index: gcc/cfgrtl.h =================================================================== --- gcc/cfgrtl.h 2019-11-19 16:25:28.000000000 +0000 +++ gcc/cfgrtl.h 2019-12-05 10:11:50.641631840 +0000 @@ -47,6 +47,7 @@ extern void fixup_partitions (void); extern bool purge_dead_edges (basic_block); extern bool purge_all_dead_edges (void); extern bool fixup_abnormal_edges (void); +extern void update_cfg_for_uncondjump (rtx_insn *); extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *); extern void relink_block_chain (bool); extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *); Index: gcc/combine.c =================================================================== --- gcc/combine.c 2019-11-29 13:04:14.458669072 +0000 +++ gcc/combine.c 2019-12-05 10:11:50.645631815 +0000 @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg) && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT; } -/* Delete the unconditional jump INSN and adjust the CFG correspondingly. - Note that the INSN should be deleted *after* removing dead edges, so - that the kept edge is the fallthrough edge for a (set (pc) (pc)) - but not for a (set (pc) (label_ref FOO)). */ - -static void -update_cfg_for_uncondjump (rtx_insn *insn) -{ - basic_block bb = BLOCK_FOR_INSN (insn); - gcc_assert (BB_END (bb) == insn); - - purge_dead_edges (bb); - - delete_insn (insn); - if (EDGE_COUNT (bb->succs) == 1) - { - rtx_insn *insn; - - single_succ_edge (bb)->flags |= EDGE_FALLTHRU; - - /* Remove barriers from the footer if there are any. */ - for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) - if (BARRIER_P (insn)) - { - if (PREV_INSN (insn)) - SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); - else - BB_FOOTER (bb) = NEXT_INSN (insn); - if (NEXT_INSN (insn)) - SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); - } - else if (LABEL_P (insn)) - break; - } -} - /* Return whether PAT is a PARALLEL of exactly N register SETs followed by an arbitrary number of CLOBBERs. */ static bool @@ -15098,7 +15062,10 @@ const pass_data pass_data_combine = {} /* opt_pass methods: */ - virtual bool gate (function *) { return (optimize > 0); } + virtual bool gate (function *) + { + return optimize > 0 && (param_run_combine & 2) != 0; + } virtual unsigned int execute (function *) { return rest_of_handle_combine (); Index: gcc/cfgrtl.c =================================================================== --- gcc/cfgrtl.c 2019-11-19 16:25:28.000000000 +0000 +++ gcc/cfgrtl.c 2019-12-05 10:11:50.641631840 +0000 @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void) return inserted; } +/* Delete the unconditional jump INSN and adjust the CFG correspondingly. + Note that the INSN should be deleted *after* removing dead edges, so + that the kept edge is the fallthrough edge for a (set (pc) (pc)) + but not for a (set (pc) (label_ref FOO)). */ + +void +update_cfg_for_uncondjump (rtx_insn *insn) +{ + basic_block bb = BLOCK_FOR_INSN (insn); + gcc_assert (BB_END (bb) == insn); + + purge_dead_edges (bb); + + delete_insn (insn); + if (EDGE_COUNT (bb->succs) == 1) + { + rtx_insn *insn; + + single_succ_edge (bb)->flags |= EDGE_FALLTHRU; + + /* Remove barriers from the footer if there are any. */ + for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) + if (BARRIER_P (insn)) + { + if (PREV_INSN (insn)) + SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); + else + BB_FOOTER (bb) = NEXT_INSN (insn); + if (NEXT_INSN (insn)) + SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); + } + else if (LABEL_P (insn)) + break; + } +} + /* Cut the insns from FIRST to LAST out of the insns stream. */ rtx_insn * Index: gcc/simplify-rtx.c =================================================================== --- gcc/simplify-rtx.c 2019-11-19 16:31:13.504240251 +0000 +++ gcc/simplify-rtx.c 2019-12-05 10:11:50.657631731 +0000 @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode, && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1) return constm1_rtx; + /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result + in a narrower mode if useful. */ + if (COMPARISON_P (op)) + return simplify_gen_relational (GET_CODE (op), mode, VOIDmode, + XEXP (op, 0), XEXP (op, 1)); + return NULL_RTX; } Index: gcc/recog.h =================================================================== --- gcc/recog.h 2019-11-26 22:04:57.419370912 +0000 +++ gcc/recog.h 2019-12-05 10:11:50.657631731 +0000 @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *); extern void validate_replace_src_group (rtx, rtx, rtx_insn *); extern bool validate_simplify_insn (rtx_insn *insn); +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx); extern int num_changes_pending (void); extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode); Index: gcc/recog.c =================================================================== --- gcc/recog.c 2019-11-29 13:04:13.978672241 +0000 +++ gcc/recog.c 2019-12-05 10:11:50.657631731 +0000 @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn) } return ((num_changes_pending () > 0) && (apply_change_group () > 0)); } + +/* A subroutine of validate_simplify_replace_rtx. Apply the replacement + described by R to LOC. Return true on success; leave the caller + to clean up on failure. */ + +static bool +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc) +{ + rtx x = *loc; + enum rtx_code code = GET_CODE (x); + machine_mode mode = GET_MODE (x); + + if (rtx_equal_p (x, r.from)) + { + validate_unshare_change (r.insn, loc, r.to, 1); + return true; + } + + /* Recursively apply the substitution and see if we can simplify + the result. This specifically shouldn't use simplify_gen_*, + since we want to avoid generating new expressions where possible. */ + int old_num_changes = num_validated_changes (); + rtx newx = NULL_RTX; + bool recurse_p = false; + switch (GET_RTX_CLASS (code)) + { + case RTX_UNARY: + { + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))) + return false; + + newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode); + break; + } + + case RTX_BIN_ARITH: + case RTX_COMM_ARITH: + { + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + newx = simplify_binary_operation (code, mode, + XEXP (x, 0), XEXP (x, 1)); + break; + } + + case RTX_COMPARE: + case RTX_COMM_COMPARE: + { + machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode + ? GET_MODE (XEXP (x, 0)) + : GET_MODE (XEXP (x, 1))); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + newx = simplify_relational_operation (code, mode, op_mode, + XEXP (x, 0), XEXP (x, 1)); + break; + } + + case RTX_TERNARY: + case RTX_BITFIELD_OPS: + { + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2))) + return false; + + newx = simplify_ternary_operation (code, mode, op0_mode, + XEXP (x, 0), XEXP (x, 1), + XEXP (x, 2)); + break; + } + + case RTX_EXTRA: + if (code == SUBREG) + { + machine_mode inner_mode = GET_MODE (SUBREG_REG (x)); + if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x))) + return false; + + rtx inner = SUBREG_REG (x); + newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x)); + /* Reject the same cases that simplify_gen_subreg would. */ + if (!newx + && (GET_CODE (inner) == SUBREG + || GET_CODE (inner) == CONCAT + || GET_MODE (inner) == VOIDmode + || !validate_subreg (mode, inner_mode, + inner, SUBREG_BYTE (x)))) + return false; + break; + } + else + recurse_p = true; + break; + + case RTX_OBJ: + if (code == LO_SUM) + { + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + /* (lo_sum (high x) y) -> y where x and y have the same base. */ + rtx op0 = XEXP (x, 0); + rtx op1 = XEXP (x, 1); + if (GET_CODE (op0) == HIGH) + { + rtx base0, base1, offset0, offset1; + split_const (XEXP (op0, 0), &base0, &offset0); + split_const (op1, &base1, &offset1); + if (rtx_equal_p (base0, base1)) + newx = op1; + } + } + else if (code == REG) + { + if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from)) + return false; + } + else + recurse_p = true; + break; + + case RTX_CONST_OBJ: + break; + + case RTX_AUTOINC: + if (reg_overlap_mentioned_p (XEXP (x, 0), r.from)) + return false; + recurse_p = true; + break; + + case RTX_MATCH: + case RTX_INSN: + gcc_unreachable (); + } + + if (recurse_p) + { + const char *fmt = GET_RTX_FORMAT (code); + for (int i = 0; fmt[i]; i++) + switch (fmt[i]) + { + case 'E': + for (int j = 0; j < XVECLEN (x, i); j++) + if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j))) + return false; + break; + + case 'e': + if (XEXP (x, i) + && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i))) + return false; + break; + } + } + + if (newx && !rtx_equal_p (x, newx)) + { + /* There's no longer any point unsharing the substitutions made + for subexpressions, since we'll just copy this one instead. */ + for (int i = old_num_changes; i < num_changes; ++i) + changes[i].unshare = false; + validate_unshare_change (r.insn, loc, newx, 1); + } + + return true; +} + +/* A note_uses callback for validate_simplify_replace_rtx. + DATA points to a validate_replace_src_data object. */ + +static void +validate_simplify_replace_rtx_uses (rtx *loc, void *data) +{ + validate_replace_src_data &r = *(validate_replace_src_data *) data; + if (r.insn && !validate_simplify_replace_rtx_1 (r, loc)) + r.insn = NULL; +} + +/* Try to perform the equivalent of: + + newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX); + validate_change (INSN, LOC, newx, 1); + + but without generating as much garbage rtl when the resulting + pattern doesn't match. + + Return true if we were able to replace all uses of OLD_RTX in *LOC + and if the result conforms to general rtx rules (e.g. for whether + subregs are meaningful). + + When returning true, add all replacements to the current validation group, + leaving the caller to test it in the normal way. Leave both *LOC and the + validation group unchanged on failure. */ + +bool +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc, + rtx old_rtx, rtx new_rtx) +{ + validate_replace_src_data r; + r.from = old_rtx; + r.to = new_rtx; + r.insn = insn; + + unsigned int num_changes = num_validated_changes (); + note_uses (loc, validate_simplify_replace_rtx_uses, &r); + if (!r.insn) + { + cancel_changes (num_changes); + return false; + } + return true; +} /* Return 1 if OP is a valid general operand for machine mode MODE. This is either a register reference, a memory reference, Index: gcc/combine2.c =================================================================== --- /dev/null 2019-09-17 11:41:18.176664108 +0100 +++ gcc/combine2.c 2019-12-05 10:11:50.645631815 +0000 @@ -0,0 +1,1658 @@ +/* Combine instructions + Copyright (C) 2019 Free Software Foundation, Inc. + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free +Software Foundation; either version 3, or (at your option) any later +version. + +GCC is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +<http://www.gnu.org/licenses/>. */ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "rtl.h" +#include "df.h" +#include "tree-pass.h" +#include "memmodel.h" +#include "emit-rtl.h" +#include "insn-config.h" +#include "recog.h" +#include "print-rtl.h" +#include "rtl-iter.h" +#include "predict.h" +#include "cfgcleanup.h" +#include "cfghooks.h" +#include "cfgrtl.h" +#include "alias.h" +#include "valtrack.h" +#include "target.h" + +/* This pass tries to combine instructions in the following ways: + + (1) If we have two dependent instructions: + + I1: (set DEST1 SRC1) + I2: (...DEST1...) + + and I2 is the only user of DEST1, the pass tries to combine them into: + + I2: (...SRC1...) + + (2) If we have two dependent instructions: + + I1: (set DEST1 SRC1) + I2: (...DEST1...) + + the pass tries to combine them into: + + I2: (parallel [(set DEST1 SRC1) (...SRC1...)]) + + or: + + I2: (parallel [(...SRC1...) (set DEST1 SRC1)]) + + (3) If we have two independent instructions: + + I1: (set DEST1 SRC1) + I2: (set DEST2 SRC2) + + that read from memory or from the same register, the pass tries to + combine them into: + + I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)]) + + or: + + I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)]) + + If the combined form is a valid instruction, the pass tries to find a + place between I1 and I2 inclusive for the new instruction. If there + are multiple valid locations, it tries to pick the best one by taking + the effect on register pressure into account. + + If a combination succeeds and produces a single set, the pass tries to + combine the new form with earlier or later instructions. + + The pass currently optimizes each basic block separately. It walks + the instructions in reverse order, building up live ranges for registers + and memory. It then uses these live ranges to look for possible + combination opportunities and to decide where the combined instructions + could be placed. + + The pass represents positions in the block using point numbers, + with higher numbers indicating earlier instructions. The numbering + scheme is that: + + - the end of the current instruction sequence has an even base point B. + + - instructions initially have odd-numbered points B + 1, B + 3, etc. + with B + 1 being the final instruction in the sequence. + + - even points after B represent gaps between instructions where combined + instructions could be placed. + + Thus even points initially represent no instructions and odd points + initially represent single instructions. However, when picking a + place for a combined instruction, the pass may choose somewhere + inbetween the original two instructions, so that over time a point + may come to represent several instructions. When this happens, + the pass maintains the invariant that all instructions with the same + point number are independent of each other and thus can be treated as + acting in parallel (or as acting in any arbitrary sequence). + + TODOs: + + - Handle 3-instruction combinations, and possibly more. + + - Handle existing clobbers more efficiently. At the moment we can't + move an instruction that clobbers R across another instruction that + clobbers R. + + - Allow hard register clobbers to be added, like combine does. + + - Perhaps work on EBBs, or SESE regions. */ + +namespace { + +/* The number of explicit uses to record in a live range. */ +const unsigned int NUM_RANGE_USERS = 4; + +/* The maximum number of instructions that we can combine at once. */ +const unsigned int MAX_COMBINE_INSNS = 2; + +/* A fake cost for instructions that we haven't costed yet. */ +const unsigned int UNKNOWN_COST = ~0U; + +class combine2 +{ +public: + combine2 (function *); + ~combine2 (); + + void execute (); + +private: + struct insn_info_rec; + + /* Describes the live range of a register or of memory. For simplicity, + we treat memory as a single entity. + + If we had a fully-accurate live range, updating it to account for a + moved instruction would be a linear-time operation. Doing this for + each combination would then make the pass quadratic. We therefore + just maintain a list of NUM_RANGE_USERS use insns and use simple, + conservatively-correct behavior for the rest. */ + struct live_range_rec + { + /* Which instruction provides the dominating definition, or null if + we don't know yet. */ + insn_info_rec *producer; + + /* A selection of instructions that use the resource, in program order. */ + insn_info_rec *users[NUM_RANGE_USERS]; + + /* An inclusive range of points that covers instructions not mentioned + in USERS. Both values are zero if there are no such instructions. + + Once we've included a use U at point P in this range, we continue + to assume that some kind of use exists at P whatever happens to U + afterwards. */ + unsigned int first_extra_use; + unsigned int last_extra_use; + + /* The register number this range describes, or INVALID_REGNUM + for memory. */ + unsigned int regno; + + /* Forms a linked list of ranges for the same resource, in program + order. */ + live_range_rec *prev_range; + live_range_rec *next_range; + }; + + /* Pass-specific information about an instruction. */ + struct insn_info_rec + { + /* The instruction itself. */ + rtx_insn *insn; + + /* A null-terminated list of live ranges for the things that this + instruction defines. */ + live_range_rec **defs; + + /* A null-terminated list of live ranges for the things that this + instruction uses. */ + live_range_rec **uses; + + /* The point at which the instruction appears. */ + unsigned int point; + + /* The cost of the instruction, or UNKNOWN_COST if we haven't + measured it yet. */ + unsigned int cost; + }; + + /* Describes one attempt to combine instructions. */ + struct combination_attempt_rec + { + /* The instruction that we're currently trying to optimize. + If the combination succeeds, we'll use this insn_info_rec + to describe the new instruction. */ + insn_info_rec *new_home; + + /* The instructions we're combining, in program order. */ + insn_info_rec *sequence[MAX_COMBINE_INSNS]; + + /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the + live range that describes the substituted register. */ + live_range_rec *def_use_range; + + /* The earliest and latest points at which we could insert the + combined instruction. */ + unsigned int earliest_point; + unsigned int latest_point; + + /* The cost of the new instruction, once we have a successful match. */ + unsigned int new_cost; + }; + + /* Pass-specific information about a register. */ + struct reg_info_rec + { + /* The live range associated with the last reference to the register. */ + live_range_rec *range; + + /* The point at which the last reference occurred. */ + unsigned int next_ref; + + /* True if the register is currently live. We record this here rather + than in a separate bitmap because (a) there's a natural hole for + it on LP64 hosts and (b) we only refer to it when updating the + other fields, and so recording it here should give better locality. */ + unsigned int live_p : 1; + }; + + live_range_rec *new_live_range (unsigned int, live_range_rec *); + live_range_rec *reg_live_range (unsigned int); + live_range_rec *mem_live_range (); + bool add_range_use (live_range_rec *, insn_info_rec *); + void remove_range_use (live_range_rec *, insn_info_rec *); + bool has_single_use_p (live_range_rec *); + bool known_last_use_p (live_range_rec *, insn_info_rec *); + unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *); + unsigned int find_latest_point (insn_info_rec *, insn_info_rec *); + bool start_combination (combination_attempt_rec &, insn_info_rec *, + insn_info_rec *, live_range_rec * = NULL); + bool verify_combination (combination_attempt_rec &); + int estimate_reg_pressure_delta (insn_info_rec *); + void commit_combination (combination_attempt_rec &, bool); + bool try_parallel_sets (combination_attempt_rec &, rtx, rtx); + bool try_parallelize_insns (combination_attempt_rec &); + bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool); + bool try_combine_def_use (combination_attempt_rec &, rtx, rtx); + bool try_combine_two_uses (combination_attempt_rec &); + bool try_combine (insn_info_rec *, rtx, unsigned int); + bool optimize_insn (insn_info_rec *); + void record_defs (insn_info_rec *); + void record_reg_use (insn_info_rec *, df_ref); + void record_uses (insn_info_rec *); + void process_insn (insn_info_rec *); + void start_sequence (); + + /* The function we're optimizing. */ + function *m_fn; + + /* The highest pseudo register number plus one. */ + unsigned int m_num_regs; + + /* The current basic block. */ + basic_block m_bb; + + /* True if we should optimize the current basic block for speed. */ + bool m_optimize_for_speed_p; + + /* The point number to allocate to the next instruction we visit + in the backward traversal. */ + unsigned int m_point; + + /* The point number corresponding to the end of the current + instruction sequence, i.e. the lowest point number about which + we still have valid information. */ + unsigned int m_end_of_sequence; + + /* The point number corresponding to the end of the current basic block. + This is the same as M_END_OF_SEQUENCE when processing the last + instruction sequence in a basic block. */ + unsigned int m_end_of_bb; + + /* The memory live range, or null if we haven't yet found a memory + reference in the current instruction sequence. */ + live_range_rec *m_mem_range; + + /* Gives information about each register. We track both hard and + pseudo registers. */ + auto_vec<reg_info_rec> m_reg_info; + + /* A bitmap of registers whose entry in m_reg_info is valid. */ + auto_sbitmap m_valid_regs; + + /* If nonnuull, an unused 2-element PARALLEL that we can use to test + instruction combinations. */ + rtx m_spare_parallel; + + /* A bitmap of instructions that we've already tried to combine with. */ + auto_bitmap m_tried_insns; + + /* A temporary bitmap used to hold register numbers. */ + auto_bitmap m_true_deps; + + /* An obstack used for allocating insn_info_recs and for building + up their lists of definitions and uses. */ + obstack m_insn_obstack; + + /* An obstack used for allocating live_range_recs. */ + obstack m_range_obstack; + + /* Start-of-object pointers for the two obstacks. */ + char *m_insn_obstack_start; + char *m_range_obstack_start; + + /* A list of instructions that we've optimized and whose new forms + change the cfg. */ + auto_vec<rtx_insn *> m_cfg_altering_insns; + + /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */ + auto_bitmap m_cfg_altering_insn_ids; + + /* We can insert new instructions at point P * 2 by inserting them + after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new + instructions at point P * 2 + 1 by inserting them before + M_POINTS[P - M_END_OF_SEQUENCE / 2]. */ + auto_vec<rtx_insn *, 256> m_points; +}; + +combine2::combine2 (function *fn) + : m_fn (fn), + m_num_regs (max_reg_num ()), + m_bb (NULL), + m_optimize_for_speed_p (false), + m_point (2), + m_end_of_sequence (m_point), + m_end_of_bb (m_point), + m_mem_range (NULL), + m_reg_info (m_num_regs), + m_valid_regs (m_num_regs), + m_spare_parallel (NULL_RTX) +{ + gcc_obstack_init (&m_insn_obstack); + gcc_obstack_init (&m_range_obstack); + m_reg_info.quick_grow (m_num_regs); + bitmap_clear (m_valid_regs); + m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0); + m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0); +} + +combine2::~combine2 () +{ + obstack_free (&m_insn_obstack, NULL); + obstack_free (&m_range_obstack, NULL); +} + +/* Return true if extending the live range of REGNO might introduce a + spill failure during register allocation. We deliberately don't check + targetm.class_likely_spilled_p since: + + (a) in the right circumstances, any allocatable hard register could + trigger a spill failure; + + (b) using REGNO_REG_CLASS to get the class would on many targets lead + to an artificial distinction between general registers that happen + to be in a small class for a rarely-used constraint and those + whose class is GENERAL_REGS itself. + + (c) there should be few cases in which moving references to allocatable + hard registers is important before RA. */ + +static bool +move_could_cause_spill_failure_p (unsigned int regno) +{ + return (regno != INVALID_REGNUM + && HARD_REGISTER_NUM_P (regno) + && !fixed_regs[regno]); +} + +/* Return true if it's possible in principle to combine INSN with + other instructions. ALLOW_ASMS_P is true if the caller can cope + with asm statements. */ + +static bool +combinable_insn_p (rtx_insn *insn, bool allow_asms_p) +{ + rtx pattern = PATTERN (insn); + + if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER) + return false; + + if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX)) + return false; + + if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0) + return false; + + return true; +} + +/* Return true if it's possible in principle to move INSN somewhere else, + as long as all dependencies are satisfied. */ + +static bool +movable_insn_p (rtx_insn *insn) +{ + if (JUMP_P (insn)) + return false; + + if (volatile_refs_p (PATTERN (insn))) + return false; + + return true; +} + +/* A note_stores callback. Set the bool at *DATA to true if DEST is in + memory. */ + +static void +find_mem_def (rtx dest, const_rtx, void *data) +{ + /* note_stores has stripped things like subregs and zero_extracts, + so we don't need to worry about them here. */ + if (MEM_P (dest)) + *(bool *) data = true; +} + +/* Return true if instruction INSN writes to memory. */ + +static bool +insn_writes_mem_p (rtx_insn *insn) +{ + bool saw_mem_p = false; + note_stores (insn, find_mem_def, &saw_mem_p); + return saw_mem_p; +} + +/* A note_uses callback. Set the bool at DATA to true if *LOC reads + from variable memory. */ + +static void +find_mem_use (rtx *loc, void *data) +{ + subrtx_iterator::array_type array; + FOR_EACH_SUBRTX (iter, array, *loc, NONCONST) + if (MEM_P (*iter) && !MEM_READONLY_P (*iter)) + { + *(bool *) data = true; + break; + } +} + +/* Return true if instruction INSN reads memory, including via notes. */ + +static bool +insn_reads_mem_p (rtx_insn *insn) +{ + bool saw_mem_p = false; + note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p); + for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1)) + if (REG_NOTE_KIND (note) == REG_EQUAL + || REG_NOTE_KIND (note) == REG_EQUIV) + note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p); + return saw_mem_p; +} + +/* Create and return a new live range for REGNO. NEXT is the next range + in program order, or null if this is the first live range in the + sequence. */ + +combine2::live_range_rec * +combine2::new_live_range (unsigned int regno, live_range_rec *next) +{ + live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec); + memset (range, 0, sizeof (*range)); + + range->regno = regno; + range->next_range = next; + if (next) + next->prev_range = range; + return range; +} + +/* Return the current live range for register REGNO, creating a new + one if necessary. */ + +combine2::live_range_rec * +combine2::reg_live_range (unsigned int regno) +{ + /* Initialize the liveness flag, if it isn't already valid for this BB. */ + bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno); + if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb) + m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno); + + /* See if we already have a live range associated with the current + instruction sequence. */ + live_range_rec *range = NULL; + if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence) + range = m_reg_info[regno].range; + + /* Create a new range if this is the first reference to REGNO in the + current instruction sequence or if the current range has been closed + off by a definition. */ + if (!range || range->producer) + { + range = new_live_range (regno, range); + + /* If the register is live after the current sequence, treat that + as a fake use at the end of the sequence. */ + if (!range->next_range && m_reg_info[regno].live_p) + range->first_extra_use = range->last_extra_use = m_end_of_sequence; + + /* Record that this is now the current range for REGNO. */ + if (first_ref_p) + bitmap_set_bit (m_valid_regs, regno); + m_reg_info[regno].range = range; + m_reg_info[regno].next_ref = m_point; + } + return range; +} + +/* Return the current live range for memory, treating memory as a single + entity. Create a new live range if necessary. */ + +combine2::live_range_rec * +combine2::mem_live_range () +{ + if (!m_mem_range || m_mem_range->producer) + m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range); + return m_mem_range; +} + +/* Record that instruction USER uses the resource described by RANGE. + Return true if this is new information. */ + +bool +combine2::add_range_use (live_range_rec *range, insn_info_rec *user) +{ + /* See if we've already recorded the instruction, or if there's a + spare use slot we can use. */ + unsigned int i = 0; + for (; i < NUM_RANGE_USERS && range->users[i]; ++i) + if (range->users[i] == user) + return false; + + if (i == NUM_RANGE_USERS) + { + /* Since we've processed USER recently, assume that it's more + interesting to record explicitly than the last user in the + current list. Evict that last user and describe it in the + overflow "extra use" range instead. */ + insn_info_rec *ousted_user = range->users[--i]; + if (range->first_extra_use < ousted_user->point) + range->first_extra_use = ousted_user->point; + if (range->last_extra_use > ousted_user->point) + range->last_extra_use = ousted_user->point; + } + + /* Insert USER while keeping the list sorted. */ + for (; i > 0 && range->users[i - 1]->point < user->point; --i) + range->users[i] = range->users[i - 1]; + range->users[i] = user; + return true; +} + +/* Remove USER from the uses recorded for RANGE, if we can. + There's nothing we can do if USER was described in the + overflow "extra use" range. */ + +void +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user) +{ + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + if (range->users[i] == user) + { + for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j) + range->users[j] = range->users[j + 1]; + range->users[NUM_RANGE_USERS - 1] = NULL; + break; + } +} + +/* Return true if RANGE has a single known user. */ + +bool +combine2::has_single_use_p (live_range_rec *range) +{ + return range->users[0] && !range->users[1] && !range->first_extra_use; +} + +/* Return true if we know that USER is the last user of RANGE. */ + +bool +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) +{ + if (range->last_extra_use <= user->point) + return false; + + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) + if (range->users[i] == user) + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; + else if (range->users[i]->point == user->point) + return false; + + gcc_unreachable (); +} + +/* Find the earliest point that we could move I2 up in order to combine + it with I1. Ignore any dependencies between I1 and I2; leave the + caller to deal with those instead. */ + +unsigned int +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1) +{ + if (!movable_insn_p (i2->insn)) + return i2->point; + + /* Don't allow sets to be moved earlier if doing so could introduce + a spill failure. */ + if (prev_real_insn (i2->insn) != i1->insn) + for (live_range_rec **def = i2->defs; *def; ++def) + if (move_could_cause_spill_failure_p ((*def)->regno)) + return i2->point; + + /* Start by optimistically assuming that we can move the instruction + all the way up to I1. */ + unsigned int point = i1->point; + + /* Make sure that the new position preserves all necessary true dependencies + on earlier instructions. */ + for (live_range_rec **use = i2->uses; *use; ++use) + { + live_range_rec *range = *use; + if (range->producer + && range->producer != i1 + && point >= range->producer->point) + point = range->producer->point - 1; + } + + /* Make sure that the new position preserves all necessary output and + anti dependencies on earlier instructions. */ + for (live_range_rec **def = i2->defs; *def; ++def) + if (live_range_rec *range = (*def)->prev_range) + { + if (range->producer + && range->producer != i1 + && point >= range->producer->point) + point = range->producer->point - 1; + + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) + if (range->users[i] && range->users[i] != i1) + { + if (point >= range->users[i]->point) + point = range->users[i]->point - 1; + break; + } + + if (range->last_extra_use && point >= range->last_extra_use) + point = range->last_extra_use - 1; + } + + return point; +} + +/* Find the latest point that we could move I1 down in order to combine + it with I2. Ignore any dependencies between I1 and I2; leave the + caller to deal with those instead. */ + +unsigned int +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2) +{ + if (!movable_insn_p (i1->insn)) + return i1->point; + + /* Start by optimistically assuming that we can move the instruction + all the way down to I2. */ + unsigned int point = i2->point; + + /* Make sure that the new position preserves all necessary anti dependencies + on later instructions. */ + for (live_range_rec **use = i1->uses; *use; ++use) + if (live_range_rec *range = (*use)->next_range) + if (range->producer != i2 && point <= range->producer->point) + point = range->producer->point + 1; + + /* Make sure that the new position preserves all necessary output and + true dependencies on later instructions. */ + for (live_range_rec **def = i1->defs; *def; ++def) + { + live_range_rec *range = *def; + + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + if (range->users[i] != i2) + { + if (range->users[i] && point <= range->users[i]->point) + point = range->users[i]->point + 1; + break; + } + + if (range->first_extra_use && point <= range->first_extra_use) + point = range->first_extra_use + 1; + + live_range_rec *next_range = range->next_range; + if (next_range + && next_range->producer != i2 + && point <= next_range->producer->point) + point = next_range->producer->point + 1; + } + + /* Don't allow the live range of a register to be extended if doing + so could introduce a spill failure. */ + if (prev_real_insn (i2->insn) != i1->insn) + for (live_range_rec **use = i1->uses; *use; ++use) + { + live_range_rec *range = *use; + if (move_could_cause_spill_failure_p (range->regno)) + { + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) + if (range->users[i]) + { + if (point < range->users[i]->point) + point = range->users[i]->point; + break; + } + + if (range->last_extra_use && point < range->last_extra_use) + point = range->last_extra_use; + } + } + + return point; +} + +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2, + where I1 is the instruction that we're currently trying to optimize. + If DEF_USE_RANGE is nonnull, I1 defines the value described by + DEF_USE_RANGE and I2 uses it. */ + +bool +combine2::start_combination (combination_attempt_rec &attempt, + insn_info_rec *i1, insn_info_rec *i2, + live_range_rec *def_use_range) +{ + attempt.new_home = i1; + attempt.sequence[0] = i1; + attempt.sequence[1] = i2; + if (attempt.sequence[0]->point < attempt.sequence[1]->point) + std::swap (attempt.sequence[0], attempt.sequence[1]); + attempt.def_use_range = def_use_range; + + /* Check that the instructions have no true dependencies other than + DEF_USE_RANGE. */ + bitmap_clear (m_true_deps); + for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def) + if (*def != def_use_range) + bitmap_set_bit (m_true_deps, (*def)->regno); + for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use) + if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno)) + return false; + + /* Calculate the range of points at which the combined instruction + could live. */ + attempt.earliest_point = find_earliest_point (attempt.sequence[1], + attempt.sequence[0]); + attempt.latest_point = find_latest_point (attempt.sequence[0], + attempt.sequence[1]); + if (attempt.earliest_point < attempt.latest_point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "cannot combine %d and %d: no suitable" + " location for combined insn\n", + INSN_UID (attempt.sequence[0]->insn), + INSN_UID (attempt.sequence[1]->insn)); + return false; + } + + /* Make sure we have valid costs for the original instructions before + we start changing their patterns. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + if (attempt.sequence[i]->cost == UNKNOWN_COST) + attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn, + m_optimize_for_speed_p); + return true; +} + +/* Check whether the combination attempt described by ATTEMPT matches + an .md instruction (or matches its constraints, in the case of an + asm statement). If so, calculate the cost of the new instruction + and check whether it's cheap enough. */ + +bool +combine2::verify_combination (combination_attempt_rec &attempt) +{ + rtx_insn *insn = attempt.sequence[1]->insn; + + bool ok_p = verify_changes (0); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + if (!ok_p) + fprintf (dump_file, "failed to match this instruction:\n"); + else if (const char *name = get_insn_name (INSN_CODE (insn))) + fprintf (dump_file, "successfully matched this instruction to %s:\n", + name); + else + fprintf (dump_file, "successfully matched this instruction:\n"); + print_rtl_single (dump_file, PATTERN (insn)); + } + if (!ok_p) + return false; + + if (INSN_CODE (insn) >= 0 && !targetm.legitimate_combined_insn (insn)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "instruction rejected by target\n"); + return false; + } + + unsigned int cost1 = attempt.sequence[0]->cost; + unsigned int cost2 = attempt.sequence[1]->cost; + attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p); + ok_p = (attempt.new_cost <= cost1 + cost2); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n", + cost1, cost2, attempt.new_cost, + ok_p ? "keeping replacement" : "rejecting replacement"); + if (!ok_p) + return false; + + confirm_change_group (); + return true; +} + +/* Return true if we should consider register REGNO when calculating + register pressure estimates. */ + +static bool +count_reg_pressure_p (unsigned int regno) +{ + if (regno == INVALID_REGNUM) + return false; + + /* Unallocatable registers aren't interesting. */ + if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno]) + return false; + + return true; +} + +/* Try to estimate the effect that the original form of INSN_INFO + had on register pressure, in the form "born - dying". */ + +int +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info) +{ + int delta = 0; + + for (live_range_rec **def = insn_info->defs; *def; ++def) + if (count_reg_pressure_p ((*def)->regno)) + delta += 1; + + for (live_range_rec **use = insn_info->uses; *use; ++use) + if (count_reg_pressure_p ((*use)->regno) + && known_last_use_p (*use, insn_info)) + delta -= 1; + + return delta; +} + +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete + FROM_INSN. Copy any useful information to TO_INSN before doing that. */ + +static void +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn) +{ + INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn); + INSN_CODE (to_insn) = INSN_CODE (from_insn); + REG_NOTES (to_insn) = REG_NOTES (from_insn); +} + +/* The combination attempt in ATTEMPT has succeeded and is currently + part of an open validate_change group. Commit to making the change + and decide where the new instruction should go. + + KEPT_DEF_P is true if the new instruction continues to perform + the definition described by ATTEMPT.def_use_range. */ + +void +combine2::commit_combination (combination_attempt_rec &attempt, + bool kept_def_p) +{ + insn_info_rec *new_home = attempt.new_home; + rtx_insn *old_insn = attempt.sequence[0]->insn; + rtx_insn *new_insn = attempt.sequence[1]->insn; + + /* Remove any notes that are no longer relevant. */ + bool single_set_p = single_set (new_insn); + for (rtx *note_ptr = ®_NOTES (new_insn); *note_ptr; ) + { + rtx note = *note_ptr; + bool keep_p = true; + switch (REG_NOTE_KIND (note)) + { + case REG_EQUAL: + case REG_EQUIV: + case REG_NOALIAS: + keep_p = single_set_p; + break; + + case REG_UNUSED: + keep_p = false; + break; + + default: + break; + } + if (keep_p) + note_ptr = &XEXP (*note_ptr, 1); + else + { + *note_ptr = XEXP (*note_ptr, 1); + free_EXPR_LIST_node (note); + } + } + + /* Complete the open validate_change group. */ + confirm_change_group (); + + /* Decide where the new instruction should go. */ + unsigned int new_point = attempt.latest_point; + if (new_point != attempt.earliest_point + && prev_real_insn (new_insn) != old_insn) + { + /* Prefer the earlier point if the combined instruction reduces + register pressure and the latest point if it increases register + pressure. + + The choice isn't obvious in the event of a tie, but picking + the earliest point should reduce the number of times that + we need to invalidate debug insns. */ + int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]); + int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]); + bool move_up_p = (delta1 + delta2 <= 0); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "register pressure delta = %d + %d; using %s position\n", + delta1, delta2, move_up_p ? "earliest" : "latest"); + if (move_up_p) + new_point = attempt.earliest_point; + } + + /* Translate inserting at NEW_POINT into inserting before or after + a particular insn. */ + rtx_insn *anchor = NULL; + bool before_p = (new_point & 1); + if (new_point != attempt.sequence[1]->point + && new_point != attempt.sequence[0]->point) + { + anchor = m_points[(new_point - m_end_of_sequence) / 2]; + rtx_insn *other_side = (before_p + ? prev_real_insn (anchor) + : next_real_insn (anchor)); + /* Inserting next to an insn X and then deleting X is just a + roundabout way of using X as the insertion point. */ + if (anchor == new_insn || other_side == new_insn) + new_point = attempt.sequence[1]->point; + else if (anchor == old_insn || other_side == old_insn) + new_point = attempt.sequence[0]->point; + } + + /* Actually perform the move. */ + if (new_point == attempt.sequence[1]->point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "using insn %d to hold the combined pattern\n", + INSN_UID (new_insn)); + set_insn_deleted (old_insn); + } + else if (new_point == attempt.sequence[0]->point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "using insn %d to hold the combined pattern\n", + INSN_UID (old_insn)); + PATTERN (old_insn) = PATTERN (new_insn); + transfer_insn (old_insn, new_insn); + std::swap (old_insn, new_insn); + set_insn_deleted (old_insn); + } + else + { + /* We need to insert a new instruction. We can't simply move + NEW_INSN because it acts as an insertion anchor in m_points. */ + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "inserting combined insn %s insn %d\n", + before_p ? "before" : "after", INSN_UID (anchor)); + + rtx_insn *added_insn = (before_p + ? emit_insn_before (PATTERN (new_insn), anchor) + : emit_insn_after (PATTERN (new_insn), anchor)); + transfer_insn (added_insn, new_insn); + set_insn_deleted (old_insn); + set_insn_deleted (new_insn); + new_insn = added_insn; + } + df_insn_rescan (new_insn); + + /* Unlink the old uses. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) + remove_range_use (*use, attempt.sequence[i]); + + /* Work out which registers the new pattern uses. */ + bitmap_clear (m_true_deps); + df_ref use; + FOR_EACH_INSN_USE (use, new_insn) + { + rtx reg = DF_REF_REAL_REG (use); + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); + } + FOR_EACH_INSN_EQ_USE (use, new_insn) + { + rtx reg = DF_REF_REAL_REG (use); + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); + } + + /* Describe the combined instruction in NEW_HOME. */ + new_home->insn = new_insn; + new_home->point = new_point; + new_home->cost = attempt.new_cost; + + /* Build up a list of definitions for the combined instructions + and update all the ranges accordingly. It shouldn't matter + which order we do this in. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def) + if (kept_def_p || *def != attempt.def_use_range) + { + obstack_ptr_grow (&m_insn_obstack, *def); + (*def)->producer = new_home; + } + obstack_ptr_grow (&m_insn_obstack, NULL); + new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); + + /* Build up a list of uses for the combined instructions and update + all the ranges accordingly. Again, it shouldn't matter which + order we do this in. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) + { + live_range_rec *range = *use; + if (range != attempt.def_use_range + && (range->regno == INVALID_REGNUM + ? insn_reads_mem_p (new_insn) + : bitmap_bit_p (m_true_deps, range->regno)) + && add_range_use (range, new_home)) + obstack_ptr_grow (&m_insn_obstack, range); + } + obstack_ptr_grow (&m_insn_obstack, NULL); + new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); + + /* There shouldn't be any remaining references to other instructions + in the combination. Invalidate their contents to make lingering + references a noisy failure. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + if (attempt.sequence[i] != new_home) + { + attempt.sequence[i]->insn = NULL; + attempt.sequence[i]->point = ~0U; + } + + /* Unlink the def-use range. */ + if (!kept_def_p && attempt.def_use_range) + { + live_range_rec *range = attempt.def_use_range; + if (range->prev_range) + range->prev_range->next_range = range->next_range; + else + m_reg_info[range->regno].range = range->next_range; + if (range->next_range) + range->next_range->prev_range = range->prev_range; + } + + /* Record instructions whose new form alters the cfg. */ + rtx pattern = PATTERN (new_insn); + if ((returnjump_p (new_insn) + || any_uncondjump_p (new_insn) + || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx)) + && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn))) + m_cfg_altering_insns.safe_push (new_insn); +} + +/* Return true if X1 and X2 are memories and if X1 does not have + a higher alignment than X2. */ + +static bool +dubious_mem_pair_p (rtx x1, rtx x2) +{ + return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2); +} + +/* Try implement ATTEMPT using (parallel [SET1 SET2]). */ + +bool +combine2::try_parallel_sets (combination_attempt_rec &attempt, + rtx set1, rtx set2) +{ + rtx_insn *insn = attempt.sequence[1]->insn; + + /* Combining two loads or two stores can be useful on targets that + allow them to be treated as a single access. However, we use a + very peephole approach to picking the pairs, so we need to be + relatively confident that we're making a good choice. + + For now just aim for cases in which the memory references are + consecutive and the first reference has a higher alignment. + We can leave the target to test the consecutive part; whatever test + we added here might be different from the target's, and in any case + it's fine if the target accepts other well-aligned cases too. */ + if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2)) + || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2))) + return false; + + /* Cache the PARALLEL rtx between attempts so that we don't generate + too much garbage rtl. */ + if (!m_spare_parallel) + { + rtvec vec = gen_rtvec (2, set1, set2); + m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec); + } + else + { + XVECEXP (m_spare_parallel, 0, 0) = set1; + XVECEXP (m_spare_parallel, 0, 1) = set2; + } + + unsigned int num_changes = num_validated_changes (); + validate_change (insn, &PATTERN (insn), m_spare_parallel, true); + if (verify_combination (attempt)) + { + m_spare_parallel = NULL_RTX; + return true; + } + cancel_changes (num_changes); + return false; +} + +/* Try to parallelize the two instructions in ATTEMPT. */ + +bool +combine2::try_parallelize_insns (combination_attempt_rec &attempt) +{ + rtx_insn *i1_insn = attempt.sequence[0]->insn; + rtx_insn *i2_insn = attempt.sequence[1]->insn; + + /* Can't parallelize asm statements. */ + if (asm_noperands (PATTERN (i1_insn)) >= 0 + || asm_noperands (PATTERN (i2_insn)) >= 0) + return false; + + /* For now, just handle the case in which both instructions are + single sets. We could handle more than 2 sets as well, but few + targets support that anyway. */ + rtx set1 = single_set (i1_insn); + if (!set1) + return false; + rtx set2 = single_set (i2_insn); + if (!set2) + return false; + + /* Make sure that we have structural proof that the destinations + are independent. Things like alias analysis rely on semantic + information and assume no undefined behavior, which is rarely a + good enough guarantee to allow a useful instruction combination. */ + rtx dest1 = SET_DEST (set1); + rtx dest2 = SET_DEST (set2); + if (MEM_P (dest1) + ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false) + : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2)) + return false; + + /* Try the sets in both orders. */ + if (try_parallel_sets (attempt, set1, set2) + || try_parallel_sets (attempt, set2, set1)) + { + commit_combination (attempt, true); + if (MAY_HAVE_DEBUG_BIND_INSNS + && attempt.new_home->insn != i1_insn) + propagate_for_debug (i1_insn, attempt.new_home->insn, + SET_DEST (set1), SET_SRC (set1), m_bb); + return true; + } + return false; +} + +/* Replace DEST with SRC in the register notes for INSN. */ + +static void +substitute_into_note (rtx_insn *insn, rtx dest, rtx src) +{ + for (rtx *note_ptr = ®_NOTES (insn); *note_ptr; ) + { + rtx note = *note_ptr; + bool keep_p = true; + switch (REG_NOTE_KIND (note)) + { + case REG_EQUAL: + case REG_EQUIV: + keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0), + dest, src); + break; + + default: + break; + } + if (keep_p) + note_ptr = &XEXP (*note_ptr, 1); + else + { + *note_ptr = XEXP (*note_ptr, 1); + free_EXPR_LIST_node (note); + } + } +} + +/* A subroutine of try_combine_def_use. Try replacing DEST with SRC + in ATTEMPT. SRC might be either the original SET_SRC passed to the + parent routine or a value pulled from a note; SRC_IS_NOTE_P is true + in the latter case. */ + +bool +combine2::try_combine_def_use_1 (combination_attempt_rec &attempt, + rtx dest, rtx src, bool src_is_note_p) +{ + rtx_insn *def_insn = attempt.sequence[0]->insn; + rtx_insn *use_insn = attempt.sequence[1]->insn; + + /* Mimic combine's behavior by not combining moves from allocatable hard + registers (e.g. when copying parameters or function return values). */ + if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) + return false; + + /* Don't mess with volatile references. For one thing, we don't yet + know how many copies of SRC we'll need. */ + if (volatile_refs_p (src)) + return false; + + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "trying to combine %d and %d%s:\n", + INSN_UID (def_insn), INSN_UID (use_insn), + src_is_note_p ? " using equal/equiv note" : ""); + dump_insn_slim (dump_file, def_insn); + dump_insn_slim (dump_file, use_insn); + } + + unsigned int num_changes = num_validated_changes (); + if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn), + dest, src)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "combination failed -- unable to substitute" + " all uses\n"); + return false; + } + + /* Try matching the instruction on its own if DEST isn't used elsewhere. */ + if (has_single_use_p (attempt.def_use_range) + && verify_combination (attempt)) + { + live_range_rec *next_range = attempt.def_use_range->next_range; + substitute_into_note (use_insn, dest, src); + commit_combination (attempt, false); + if (MAY_HAVE_DEBUG_BIND_INSNS) + { + rtx_insn *end_of_range = (next_range + ? next_range->producer->insn + : BB_END (m_bb)); + propagate_for_debug (def_insn, end_of_range, dest, src, m_bb); + } + return true; + } + + /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN + pattern. */ + if ((!targetm.cannot_copy_insn_p || !targetm.cannot_copy_insn_p (def_insn)) + && try_parallelize_insns (attempt)) + return true; + + cancel_changes (num_changes); + return false; +} + +/* ATTEMPT describes an attempt to substitute the result of the first + instruction into the second instruction. Try to implement it, + given that the first instruction sets DEST to SRC. */ + +bool +combine2::try_combine_def_use (combination_attempt_rec &attempt, + rtx dest, rtx src) +{ + rtx_insn *def_insn = attempt.sequence[0]->insn; + rtx_insn *use_insn = attempt.sequence[1]->insn; + rtx def_note = find_reg_equal_equiv_note (def_insn); + + /* First try combining the instructions in their original form. */ + if (try_combine_def_use_1 (attempt, dest, src, false)) + return true; + + /* Try to replace DEST with a REG_EQUAL/EQUIV value instead. */ + if (def_note + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) + return true; + + /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try + using that instead of the main pattern. */ + for (rtx *link_ptr = ®_NOTES (use_insn); *link_ptr; + link_ptr = &XEXP (*link_ptr, 1)) + { + rtx use_note = *link_ptr; + if (REG_NOTE_KIND (use_note) != REG_EQUAL + && REG_NOTE_KIND (use_note) != REG_EQUIV) + continue; + + rtx use_set = single_set (use_insn); + if (!use_set) + break; + + if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0))) + continue; + + /* Try snipping out the note and putting it in the SET instead. */ + validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1); + validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1); + + if (try_combine_def_use_1 (attempt, dest, src, false)) + return true; + + if (def_note + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) + return true; + + cancel_changes (0); + } + + return false; +} + +/* ATTEMPT describes an attempt to combine two instructions that use + the same resource. Try to implement it, returning true on success. */ + +bool +combine2::try_combine_two_uses (combination_attempt_rec &attempt) +{ + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "trying to parallelize %d and %d:\n", + INSN_UID (attempt.sequence[0]->insn), + INSN_UID (attempt.sequence[1]->insn)); + dump_insn_slim (dump_file, attempt.sequence[0]->insn); + dump_insn_slim (dump_file, attempt.sequence[1]->insn); + } + + return try_parallelize_insns (attempt); +} + +/* Try to optimize instruction INSN_INFO. Return true on success. */ + +bool +combine2::optimize_insn (insn_info_rec *i1) +{ + combination_attempt_rec attempt; + + if (!combinable_insn_p (i1->insn, false)) + return false; + + rtx set = single_set (i1->insn); + if (!set) + return false; + + /* First try combining INSN with a user of its result. */ + rtx dest = SET_DEST (set); + rtx src = SET_SRC (set); + if (REG_P (dest) && REG_NREGS (dest) == 1) + for (live_range_rec **def = i1->defs; *def; ++def) + if ((*def)->regno == REGNO (dest)) + { + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + { + insn_info_rec *use = (*def)->users[i]; + if (use + && combinable_insn_p (use->insn, has_single_use_p (*def)) + && start_combination (attempt, i1, use, *def) + && try_combine_def_use (attempt, dest, src)) + return true; + } + break; + } + + /* Try parallelizing INSN and another instruction that uses the same + resource. */ + bitmap_clear (m_tried_insns); + for (live_range_rec **use = i1->uses; *use; ++use) + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + { + insn_info_rec *i2 = (*use)->users[i]; + if (i2 + && i2 != i1 + && combinable_insn_p (i2->insn, false) + && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn)) + && start_combination (attempt, i1, i2) + && try_combine_two_uses (attempt)) + return true; + } + + return false; +} + +/* Record all register and memory definitions in INSN_INFO and fill in its + "defs" list. */ + +void +combine2::record_defs (insn_info_rec *insn_info) +{ + rtx_insn *insn = insn_info->insn; + + /* Record register definitions. */ + df_ref def; + FOR_EACH_INSN_DEF (def, insn) + { + rtx reg = DF_REF_REAL_REG (def); + unsigned int end_regno = END_REGNO (reg); + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) + { + live_range_rec *range = reg_live_range (regno); + range->producer = insn_info; + m_reg_info[regno].live_p = false; + obstack_ptr_grow (&m_insn_obstack, range); + } + } + + /* If the instruction writes to memory, record that too. */ + if (insn_writes_mem_p (insn)) + { + live_range_rec *range = mem_live_range (); + range->producer = insn_info; + obstack_ptr_grow (&m_insn_obstack, range); + } + + /* Complete the list of definitions. */ + obstack_ptr_grow (&m_insn_obstack, NULL); + insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); +} + +/* Record that INSN_INFO contains register use USE. If this requires + new entries to be added to INSN_INFO->uses, add those entries to the + list we're building in m_insn_obstack. */ + +void +combine2::record_reg_use (insn_info_rec *insn_info, df_ref use) +{ + rtx reg = DF_REF_REAL_REG (use); + unsigned int end_regno = END_REGNO (reg); + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) + { + live_range_rec *range = reg_live_range (regno); + if (add_range_use (range, insn_info)) + obstack_ptr_grow (&m_insn_obstack, range); + m_reg_info[regno].live_p = true; + } +} + +/* Record all register and memory uses in INSN_INFO and fill in its + "uses" list. */ + +void +combine2::record_uses (insn_info_rec *insn_info) +{ + rtx_insn *insn = insn_info->insn; + + /* Record register uses in the main pattern. */ + df_ref use; + FOR_EACH_INSN_USE (use, insn) + record_reg_use (insn_info, use); + + /* Treat REG_EQUAL uses as first-class uses. We don't lose much + by doing that, since it's rare for a REG_EQUAL note to mention + registers that the main pattern doesn't. It also gives us the + maximum freedom to use REG_EQUAL notes in place of the main pattern. */ + FOR_EACH_INSN_EQ_USE (use, insn) + record_reg_use (insn_info, use); + + /* Record a memory use if either the pattern or the notes read from + memory. */ + if (insn_reads_mem_p (insn)) + { + live_range_rec *range = mem_live_range (); + if (add_range_use (range, insn_info)) + obstack_ptr_grow (&m_insn_obstack, range); + } + + /* Complete the list of uses. */ + obstack_ptr_grow (&m_insn_obstack, NULL); + insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); +} + +/* Start a new instruction sequence, discarding all information about + the previous one. */ + +void +combine2::start_sequence (void) +{ + m_end_of_sequence = m_point; + m_mem_range = NULL; + m_points.truncate (0); + obstack_free (&m_insn_obstack, m_insn_obstack_start); + obstack_free (&m_range_obstack, m_range_obstack_start); +} + +/* Run the pass on the current function. */ + +void +combine2::execute (void) +{ + df_analyze (); + FOR_EACH_BB_FN (m_bb, cfun) + { + m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb); + m_end_of_bb = m_point; + start_sequence (); + + rtx_insn *insn, *prev; + FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + /* The current m_point represents the end of the sequence if + INSN is the last instruction in the sequence, otherwise it + represents the gap between INSN and the next instruction. + m_point + 1 represents INSN itself. + + Instructions can be added to m_point by inserting them + after INSN. They can be added to m_point + 1 by inserting + them before INSN. */ + m_points.safe_push (insn); + m_point += 1; + + insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec); + insn_info->insn = insn; + insn_info->point = m_point; + insn_info->cost = UNKNOWN_COST; + + record_defs (insn_info); + record_uses (insn_info); + + /* Set up m_point for the next instruction. */ + m_point += 1; + + if (CALL_P (insn)) + start_sequence (); + else + while (optimize_insn (insn_info)) + gcc_assert (insn_info->insn); + } + } + + /* If an instruction changes the cfg, update the containing block + accordingly. */ + rtx_insn *insn; + unsigned int i; + FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn) + if (JUMP_P (insn)) + { + mark_jump_label (PATTERN (insn), insn, 0); + update_cfg_for_uncondjump (insn); + } + else + { + remove_edge (split_block (BLOCK_FOR_INSN (insn), insn)); + emit_barrier_after_bb (BLOCK_FOR_INSN (insn)); + } + + /* Propagate the above block-local cfg changes to the rest of the cfg. */ + if (!m_cfg_altering_insns.is_empty ()) + { + if (dom_info_available_p (CDI_DOMINATORS)) + free_dominance_info (CDI_DOMINATORS); + timevar_push (TV_JUMP); + rebuild_jump_labels (get_insns ()); + cleanup_cfg (0); + timevar_pop (TV_JUMP); + } +} + +const pass_data pass_data_combine2 = +{ + RTL_PASS, /* type */ + "combine2", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_COMBINE2, /* tv_id */ + 0, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_combine2 : public rtl_opt_pass +{ +public: + pass_combine2 (gcc::context *ctxt, int flag) + : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag) + {} + + bool + gate (function *) OVERRIDE + { + return optimize && (param_run_combine & m_flag) != 0 && !HAVE_cc0; + } + + unsigned int + execute (function *f) OVERRIDE + { + combine2 (f).execute (); + return 0; + } + +private: + unsigned int m_flag; +}; // class pass_combine2 + +} // anon namespace + +rtl_opt_pass * +make_pass_combine2_before (gcc::context *ctxt) +{ + return new pass_combine2 (ctxt, 1); +} + +rtl_opt_pass * +make_pass_combine2_after (gcc::context *ctxt) +{ + return new pass_combine2 (ctxt, 4); +}
On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote: > On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote: > > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled > > > explicitly, but it's got a R0_REGS class with only one said reg in it. > > > So the default impl of class_likely_spilled should do its thing. > > > > Yes, good point. So what happened here? > > "Something, somewhere, went terribly wrong"... > > insn 18 wants to do > > mov.l @(r4,r6),r0 > > But it can't because the reg+reg address mode has a R0 constraint > itself. So it needs to be changed to > > mov r4,r0 > mov.l @(r0,r6),r0 > > And it can't handle that. Or only sometimes? Don't remember. > > > Is it just RA messing things > > up, unrelated to the new pass? > > Yep, I think so. The additional pass seems to create "tougher" code so > reload passes out earlier than usual. We've had the same issue when > trying address mode selection optimization. In fact that was one huge > showstopper. So maybe you should have a define_insn_and_split that allows any two regs and replaces one by r0 if neither is (and a move to r0 before the load)? Split after reload of course. It may be admitting defeat, but it may even result in better code as well ;-) Segher
On Fri, 2019-12-06 at 16:51 -0600, Segher Boessenkool wrote: > On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote: > > On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote: > > > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled > > > > explicitly, but it's got a R0_REGS class with only one said reg in it. > > > > So the default impl of class_likely_spilled should do its thing. > > > > > > Yes, good point. So what happened here? > > > > "Something, somewhere, went terribly wrong"... > > > > insn 18 wants to do > > > > mov.l @(r4,r6),r0 > > > > But it can't because the reg+reg address mode has a R0 constraint > > itself. So it needs to be changed to > > > > mov r4,r0 > > mov.l @(r0,r6),r0 > > > > And it can't handle that. Or only sometimes? Don't remember. > > > > > Is it just RA messing things > > > up, unrelated to the new pass? > > > > Yep, I think so. The additional pass seems to create "tougher" code so > > reload passes out earlier than usual. We've had the same issue when > > trying address mode selection optimization. In fact that was one huge > > showstopper. > > So maybe you should have a define_insn_and_split that allows any two > regs and replaces one by r0 if neither is (and a move to r0 before the > load)? Split after reload of course. > > It may be admitting defeat, but it may even result in better code as > well ;-) > AFAIR I've tried that already and it was just like running in circles. Means it didn't help. Perhaps if R0_REGS was hidden from RA altogether it might work. But that sounds like opening a whole other can of worms. Another idea I was entertaining was to do a custom RTL pass to pre-allocate all R0 constraints before the real full RA. But then the whole reload stuff would still have the same issue as above. So all the wallpapering is just moot. Proper fix of the actual problem would be more appropriate. Cheers, Oleg
Index: gcc/Makefile.in =================================================================== --- gcc/Makefile.in 2019-11-14 14:34:27.599783740 +0000 +++ gcc/Makefile.in 2019-11-17 23:15:31.188500613 +0000 @@ -1261,6 +1261,7 @@ OBJS = \ cgraphunit.o \ cgraphclones.o \ combine.o \ + combine2.o \ combine-stack-adj.o \ compare-elim.o \ context.o \ Index: gcc/params.opt =================================================================== --- gcc/params.opt 2019-11-14 14:34:26.339792215 +0000 +++ gcc/params.opt 2019-11-17 23:15:31.200500531 +0000 @@ -768,6 +768,10 @@ Use internal function id in profile look Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param Maximum depth of a loop nest to fully value-number optimistically. +-param=run-combine= +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass. + -param=sccvn-max-alias-queries-per-access= Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param Maximum number of disambiguations to perform per memory access. Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi 2019-11-16 10:43:45.597105823 +0000 +++ gcc/doc/invoke.texi 2019-11-17 23:15:31.200500531 +0000 @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las @item max-combine-insns The maximum number of instructions the RTL combiner tries to combine. +@item run-combine +Choose which of the 3 available combine passes to run: bit 1 for the main +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 +for a later variant of the combine pass. + @item integer-share-limit Small integer constants can use a shared data structure, reducing the compiler's memory usage and increasing its speed. This sets the maximum Index: gcc/tree-pass.h =================================================================== --- gcc/tree-pass.h 2019-10-29 08:29:03.096444049 +0000 +++ gcc/tree-pass.h 2019-11-17 23:15:31.204500501 +0000 @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt); extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt); extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt); extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt); extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt); extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt); extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt); Index: gcc/passes.def =================================================================== --- gcc/passes.def 2019-10-29 08:29:03.224443133 +0000 +++ gcc/passes.def 2019-11-17 23:15:31.200500531 +0000 @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_inc_dec); NEXT_PASS (pass_initialize_regs); NEXT_PASS (pass_ud_rtl_dce); + NEXT_PASS (pass_combine2_before); NEXT_PASS (pass_combine); + NEXT_PASS (pass_combine2_after); NEXT_PASS (pass_if_after_combine); NEXT_PASS (pass_jump_after_combine); NEXT_PASS (pass_partition_blocks); Index: gcc/timevar.def =================================================================== --- gcc/timevar.def 2019-10-11 15:43:53.403498517 +0100 +++ gcc/timevar.def 2019-11-17 23:15:31.204500501 +0000 @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , " DEFTIMEVAR (TV_CSE2 , "CSE 2") DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction") DEFTIMEVAR (TV_COMBINE , "combiner") +DEFTIMEVAR (TV_COMBINE2 , "second combiner") DEFTIMEVAR (TV_IFCVT , "if-conversion") DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") DEFTIMEVAR (TV_SMS , "sms modulo scheduling") Index: gcc/cfgrtl.h =================================================================== --- gcc/cfgrtl.h 2019-03-08 18:15:39.320730391 +0000 +++ gcc/cfgrtl.h 2019-11-17 23:15:31.192500584 +0000 @@ -47,6 +47,7 @@ extern void fixup_partitions (void); extern bool purge_dead_edges (basic_block); extern bool purge_all_dead_edges (void); extern bool fixup_abnormal_edges (void); +extern void update_cfg_for_uncondjump (rtx_insn *); extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *); extern void relink_block_chain (bool); extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *); Index: gcc/combine.c =================================================================== --- gcc/combine.c 2019-11-13 08:42:45.537368745 +0000 +++ gcc/combine.c 2019-11-17 23:15:31.192500584 +0000 @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg) && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT; } -/* Delete the unconditional jump INSN and adjust the CFG correspondingly. - Note that the INSN should be deleted *after* removing dead edges, so - that the kept edge is the fallthrough edge for a (set (pc) (pc)) - but not for a (set (pc) (label_ref FOO)). */ - -static void -update_cfg_for_uncondjump (rtx_insn *insn) -{ - basic_block bb = BLOCK_FOR_INSN (insn); - gcc_assert (BB_END (bb) == insn); - - purge_dead_edges (bb); - - delete_insn (insn); - if (EDGE_COUNT (bb->succs) == 1) - { - rtx_insn *insn; - - single_succ_edge (bb)->flags |= EDGE_FALLTHRU; - - /* Remove barriers from the footer if there are any. */ - for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) - if (BARRIER_P (insn)) - { - if (PREV_INSN (insn)) - SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); - else - BB_FOOTER (bb) = NEXT_INSN (insn); - if (NEXT_INSN (insn)) - SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); - } - else if (LABEL_P (insn)) - break; - } -} - /* Return whether PAT is a PARALLEL of exactly N register SETs followed by an arbitrary number of CLOBBERs. */ static bool @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine = {} /* opt_pass methods: */ - virtual bool gate (function *) { return (optimize > 0); } + virtual bool gate (function *) + { + return optimize > 0 && (param_run_combine & 2) != 0; + } virtual unsigned int execute (function *) { return rest_of_handle_combine (); Index: gcc/cfgrtl.c =================================================================== --- gcc/cfgrtl.c 2019-10-17 14:22:55.523309009 +0100 +++ gcc/cfgrtl.c 2019-11-17 23:15:31.188500613 +0000 @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void) return inserted; } +/* Delete the unconditional jump INSN and adjust the CFG correspondingly. + Note that the INSN should be deleted *after* removing dead edges, so + that the kept edge is the fallthrough edge for a (set (pc) (pc)) + but not for a (set (pc) (label_ref FOO)). */ + +void +update_cfg_for_uncondjump (rtx_insn *insn) +{ + basic_block bb = BLOCK_FOR_INSN (insn); + gcc_assert (BB_END (bb) == insn); + + purge_dead_edges (bb); + + delete_insn (insn); + if (EDGE_COUNT (bb->succs) == 1) + { + rtx_insn *insn; + + single_succ_edge (bb)->flags |= EDGE_FALLTHRU; + + /* Remove barriers from the footer if there are any. */ + for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) + if (BARRIER_P (insn)) + { + if (PREV_INSN (insn)) + SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); + else + BB_FOOTER (bb) = NEXT_INSN (insn); + if (NEXT_INSN (insn)) + SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); + } + else if (LABEL_P (insn)) + break; + } +} + /* Cut the insns from FIRST to LAST out of the insns stream. */ rtx_insn * Index: gcc/simplify-rtx.c =================================================================== --- gcc/simplify-rtx.c 2019-11-16 15:33:36.642840131 +0000 +++ gcc/simplify-rtx.c 2019-11-17 23:15:31.204500501 +0000 @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode, && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1) return constm1_rtx; + /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result + in a narrower mode if useful. */ + if (COMPARISON_P (op)) + return simplify_gen_relational (GET_CODE (op), mode, VOIDmode, + XEXP (op, 0), XEXP (op, 1)); + return NULL_RTX; } Index: gcc/recog.h =================================================================== --- gcc/recog.h 2019-09-09 18:58:28.860430363 +0100 +++ gcc/recog.h 2019-11-17 23:15:31.204500501 +0000 @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *); extern void validate_replace_src_group (rtx, rtx, rtx_insn *); extern bool validate_simplify_insn (rtx_insn *insn); +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx); extern int num_changes_pending (void); extern int next_insn_tests_no_inequality (rtx_insn *); extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode); Index: gcc/recog.c =================================================================== --- gcc/recog.c 2019-10-01 09:55:35.150088599 +0100 +++ gcc/recog.c 2019-11-17 23:15:31.204500501 +0000 @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn) } return ((num_changes_pending () > 0) && (apply_change_group () > 0)); } + +/* A subroutine of validate_simplify_replace_rtx. Apply the replacement + described by R to LOC. Return true on success; leave the caller + to clean up on failure. */ + +static bool +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc) +{ + rtx x = *loc; + enum rtx_code code = GET_CODE (x); + machine_mode mode = GET_MODE (x); + + if (rtx_equal_p (x, r.from)) + { + validate_unshare_change (r.insn, loc, r.to, 1); + return true; + } + + /* Recursively apply the substitution and see if we can simplify + the result. This specifically shouldn't use simplify_gen_*, + since we want to avoid generating new expressions where possible. */ + int old_num_changes = num_validated_changes (); + rtx newx = NULL_RTX; + bool recurse_p = false; + switch (GET_RTX_CLASS (code)) + { + case RTX_UNARY: + { + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))) + return false; + + newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode); + break; + } + + case RTX_BIN_ARITH: + case RTX_COMM_ARITH: + { + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + newx = simplify_binary_operation (code, mode, + XEXP (x, 0), XEXP (x, 1)); + break; + } + + case RTX_COMPARE: + case RTX_COMM_COMPARE: + { + machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode + ? GET_MODE (XEXP (x, 0)) + : GET_MODE (XEXP (x, 1))); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + newx = simplify_relational_operation (code, mode, op_mode, + XEXP (x, 0), XEXP (x, 1)); + break; + } + + case RTX_TERNARY: + case RTX_BITFIELD_OPS: + { + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2))) + return false; + + newx = simplify_ternary_operation (code, mode, op0_mode, + XEXP (x, 0), XEXP (x, 1), + XEXP (x, 2)); + break; + } + + case RTX_EXTRA: + if (code == SUBREG) + { + machine_mode inner_mode = GET_MODE (SUBREG_REG (x)); + if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x))) + return false; + + rtx inner = SUBREG_REG (x); + newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x)); + /* Reject the same cases that simplify_gen_subreg would. */ + if (!newx + && (GET_CODE (inner) == SUBREG + || GET_CODE (inner) == CONCAT + || GET_MODE (inner) == VOIDmode + || !validate_subreg (mode, inner_mode, + inner, SUBREG_BYTE (x)))) + return false; + break; + } + else + recurse_p = true; + break; + + case RTX_OBJ: + if (code == LO_SUM) + { + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) + return false; + + /* (lo_sum (high x) y) -> y where x and y have the same base. */ + rtx op0 = XEXP (x, 0); + rtx op1 = XEXP (x, 1); + if (GET_CODE (op0) == HIGH) + { + rtx base0, base1, offset0, offset1; + split_const (XEXP (op0, 0), &base0, &offset0); + split_const (op1, &base1, &offset1); + if (rtx_equal_p (base0, base1)) + newx = op1; + } + } + else if (code == REG) + { + if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from)) + return false; + } + else + recurse_p = true; + break; + + case RTX_CONST_OBJ: + break; + + case RTX_AUTOINC: + if (reg_overlap_mentioned_p (XEXP (x, 0), r.from)) + return false; + recurse_p = true; + break; + + case RTX_MATCH: + case RTX_INSN: + gcc_unreachable (); + } + + if (recurse_p) + { + const char *fmt = GET_RTX_FORMAT (code); + for (int i = 0; fmt[i]; i++) + switch (fmt[i]) + { + case 'E': + for (int j = 0; j < XVECLEN (x, i); j++) + if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j))) + return false; + break; + + case 'e': + if (XEXP (x, i) + && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i))) + return false; + break; + } + } + + if (newx && !rtx_equal_p (x, newx)) + { + /* There's no longer any point unsharing the substitutions made + for subexpressions, since we'll just copy this one instead. */ + for (int i = old_num_changes; i < num_changes; ++i) + changes[i].unshare = false; + validate_unshare_change (r.insn, loc, newx, 1); + } + + return true; +} + +/* A note_uses callback for validate_simplify_replace_rtx. + DATA points to a validate_replace_src_data object. */ + +static void +validate_simplify_replace_rtx_uses (rtx *loc, void *data) +{ + validate_replace_src_data &r = *(validate_replace_src_data *) data; + if (r.insn && !validate_simplify_replace_rtx_1 (r, loc)) + r.insn = NULL; +} + +/* Try to perform the equivalent of: + + newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX); + validate_change (INSN, LOC, newx, 1); + + but without generating as much garbage rtl when the resulting + pattern doesn't match. + + Return true if we were able to replace all uses of OLD_RTX in *LOC + and if the result conforms to general rtx rules (e.g. for whether + subregs are meaningful). + + When returning true, add all replacements to the current validation group, + leaving the caller to test it in the normal way. Leave both *LOC and the + validation group unchanged on failure. */ + +bool +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc, + rtx old_rtx, rtx new_rtx) +{ + validate_replace_src_data r; + r.from = old_rtx; + r.to = new_rtx; + r.insn = insn; + + unsigned int num_changes = num_validated_changes (); + note_uses (loc, validate_simplify_replace_rtx_uses, &r); + if (!r.insn) + { + cancel_changes (num_changes); + return false; + } + return true; +} /* Return 1 if the insn using CC0 set by INSN does not contain any ordered tests applied to the condition codes. Index: gcc/combine2.c =================================================================== --- /dev/null 2019-09-17 11:41:18.176664108 +0100 +++ gcc/combine2.c 2019-11-17 23:15:31.196500559 +0000 @@ -0,0 +1,1576 @@ +/* Combine instructions + Copyright (C) 2019 Free Software Foundation, Inc. + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free +Software Foundation; either version 3, or (at your option) any later +version. + +GCC is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +<http://www.gnu.org/licenses/>. */ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "rtl.h" +#include "df.h" +#include "tree-pass.h" +#include "memmodel.h" +#include "emit-rtl.h" +#include "insn-config.h" +#include "recog.h" +#include "print-rtl.h" +#include "rtl-iter.h" +#include "predict.h" +#include "cfgcleanup.h" +#include "cfghooks.h" +#include "cfgrtl.h" +#include "alias.h" +#include "valtrack.h" + +/* This pass tries to combine instructions in the following ways: + + (1) If we have two dependent instructions: + + I1: (set DEST1 SRC1) + I2: (...DEST1...) + + and I2 is the only user of DEST1, the pass tries to combine them into: + + I2: (...SRC1...) + + (2) If we have two dependent instructions: + + I1: (set DEST1 SRC1) + I2: (...DEST1...) + + the pass tries to combine them into: + + I2: (parallel [(set DEST1 SRC1) (...SRC1...)]) + + or: + + I2: (parallel [(...SRC1...) (set DEST1 SRC1)]) + + (3) If we have two independent instructions: + + I1: (set DEST1 SRC1) + I2: (set DEST2 SRC2) + + that read from memory or from the same register, the pass tries to + combine them into: + + I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)]) + + or: + + I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)]) + + If the combined form is a valid instruction, the pass tries to find a + place between I1 and I2 inclusive for the new instruction. If there + are multiple valid locations, it tries to pick the best one by taking + the effect on register pressure into account. + + If a combination succeeds and produces a single set, the pass tries to + combine the new form with earlier or later instructions. + + The pass currently optimizes each basic block separately. It walks + the instructions in reverse order, building up live ranges for registers + and memory. It then uses these live ranges to look for possible + combination opportunities and to decide where the combined instructions + could be placed. + + The pass represents positions in the block using point numbers, + with higher numbers indicating earlier instructions. The numbering + scheme is that: + + - the end of the current instruction sequence has an even base point B. + + - instructions initially have odd-numbered points B + 1, B + 3, etc. + with B + 1 being the final instruction in the sequence. + + - even points after B represent gaps between instructions where combined + instructions could be placed. + + Thus even points initially represent no instructions and odd points + initially represent single instructions. However, when picking a + place for a combined instruction, the pass may choose somewhere + inbetween the original two instructions, so that over time a point + may come to represent several instructions. When this happens, + the pass maintains the invariant that all instructions with the same + point number are independent of each other and thus can be treated as + acting in parallel (or as acting in any arbitrary sequence). + + TODOs: + + - Handle 3-instruction combinations, and possibly more. + + - Handle existing clobbers more efficiently. At the moment we can't + move an instruction that clobbers R across another instruction that + clobbers R. + + - Allow hard register clobbers to be added, like combine does. + + - Perhaps work on EBBs, or SESE regions. */ + +namespace { + +/* The number of explicit uses to record in a live range. */ +const unsigned int NUM_RANGE_USERS = 4; + +/* The maximum number of instructions that we can combine at once. */ +const unsigned int MAX_COMBINE_INSNS = 2; + +/* A fake cost for instructions that we haven't costed yet. */ +const unsigned int UNKNOWN_COST = ~0U; + +class combine2 +{ +public: + combine2 (function *); + ~combine2 (); + + void execute (); + +private: + struct insn_info_rec; + + /* Describes the live range of a register or of memory. For simplicity, + we treat memory as a single entity. + + If we had a fully-accurate live range, updating it to account for a + moved instruction would be a linear-time operation. Doing this for + each combination would then make the pass quadratic. We therefore + just maintain a list of NUM_RANGE_USERS use insns and use simple, + conservatively-correct behavior for the rest. */ + struct live_range_rec + { + /* Which instruction provides the dominating definition, or null if + we don't know yet. */ + insn_info_rec *producer; + + /* A selection of instructions that use the resource, in program order. */ + insn_info_rec *users[NUM_RANGE_USERS]; + + /* An inclusive range of points that covers instructions not mentioned + in USERS. Both values are zero if there are no such instructions. + + Once we've included a use U at point P in this range, we continue + to assume that some kind of use exists at P whatever happens to U + afterwards. */ + unsigned int first_extra_use; + unsigned int last_extra_use; + + /* The register number this range describes, or INVALID_REGNUM + for memory. */ + unsigned int regno; + + /* Forms a linked list of ranges for the same resource, in program + order. */ + live_range_rec *prev_range; + live_range_rec *next_range; + }; + + /* Pass-specific information about an instruction. */ + struct insn_info_rec + { + /* The instruction itself. */ + rtx_insn *insn; + + /* A null-terminated list of live ranges for the things that this + instruction defines. */ + live_range_rec **defs; + + /* A null-terminated list of live ranges for the things that this + instruction uses. */ + live_range_rec **uses; + + /* The point at which the instruction appears. */ + unsigned int point; + + /* The cost of the instruction, or UNKNOWN_COST if we haven't + measured it yet. */ + unsigned int cost; + }; + + /* Describes one attempt to combine instructions. */ + struct combination_attempt_rec + { + /* The instruction that we're currently trying to optimize. + If the combination succeeds, we'll use this insn_info_rec + to describe the new instruction. */ + insn_info_rec *new_home; + + /* The instructions we're combining, in program order. */ + insn_info_rec *sequence[MAX_COMBINE_INSNS]; + + /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the + live range that describes the substituted register. */ + live_range_rec *def_use_range; + + /* The earliest and latest points at which we could insert the + combined instruction. */ + unsigned int earliest_point; + unsigned int latest_point; + + /* The cost of the new instruction, once we have a successful match. */ + unsigned int new_cost; + }; + + /* Pass-specific information about a register. */ + struct reg_info_rec + { + /* The live range associated with the last reference to the register. */ + live_range_rec *range; + + /* The point at which the last reference occurred. */ + unsigned int next_ref; + + /* True if the register is currently live. We record this here rather + than in a separate bitmap because (a) there's a natural hole for + it on LP64 hosts and (b) we only refer to it when updating the + other fields, and so recording it here should give better locality. */ + unsigned int live_p : 1; + }; + + live_range_rec *new_live_range (unsigned int, live_range_rec *); + live_range_rec *reg_live_range (unsigned int); + live_range_rec *mem_live_range (); + bool add_range_use (live_range_rec *, insn_info_rec *); + void remove_range_use (live_range_rec *, insn_info_rec *); + bool has_single_use_p (live_range_rec *); + bool known_last_use_p (live_range_rec *, insn_info_rec *); + unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *); + unsigned int find_latest_point (insn_info_rec *, insn_info_rec *); + bool start_combination (combination_attempt_rec &, insn_info_rec *, + insn_info_rec *, live_range_rec * = NULL); + bool verify_combination (combination_attempt_rec &); + int estimate_reg_pressure_delta (insn_info_rec *); + void commit_combination (combination_attempt_rec &, bool); + bool try_parallel_sets (combination_attempt_rec &, rtx, rtx); + bool try_parallelize_insns (combination_attempt_rec &); + bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool); + bool try_combine_def_use (combination_attempt_rec &, rtx, rtx); + bool try_combine_two_uses (combination_attempt_rec &); + bool try_combine (insn_info_rec *, rtx, unsigned int); + bool optimize_insn (insn_info_rec *); + void record_defs (insn_info_rec *); + void record_reg_use (insn_info_rec *, df_ref); + void record_uses (insn_info_rec *); + void process_insn (insn_info_rec *); + void start_sequence (); + + /* The function we're optimizing. */ + function *m_fn; + + /* The highest pseudo register number plus one. */ + unsigned int m_num_regs; + + /* The current basic block. */ + basic_block m_bb; + + /* True if we should optimize the current basic block for speed. */ + bool m_optimize_for_speed_p; + + /* The point number to allocate to the next instruction we visit + in the backward traversal. */ + unsigned int m_point; + + /* The point number corresponding to the end of the current + instruction sequence, i.e. the lowest point number about which + we still have valid information. */ + unsigned int m_end_of_sequence; + + /* The point number corresponding to the end of the current basic block. + This is the same as M_END_OF_SEQUENCE when processing the last + instruction sequence in a basic block. */ + unsigned int m_end_of_bb; + + /* The memory live range, or null if we haven't yet found a memory + reference in the current instruction sequence. */ + live_range_rec *m_mem_range; + + /* Gives information about each register. We track both hard and + pseudo registers. */ + auto_vec<reg_info_rec> m_reg_info; + + /* A bitmap of registers whose entry in m_reg_info is valid. */ + auto_sbitmap m_valid_regs; + + /* If nonnuull, an unused 2-element PARALLEL that we can use to test + instruction combinations. */ + rtx m_spare_parallel; + + /* A bitmap of instructions that we've already tried to combine with. */ + auto_bitmap m_tried_insns; + + /* A temporary bitmap used to hold register numbers. */ + auto_bitmap m_true_deps; + + /* An obstack used for allocating insn_info_recs and for building + up their lists of definitions and uses. */ + obstack m_insn_obstack; + + /* An obstack used for allocating live_range_recs. */ + obstack m_range_obstack; + + /* Start-of-object pointers for the two obstacks. */ + char *m_insn_obstack_start; + char *m_range_obstack_start; + + /* A list of instructions that we've optimized and whose new forms + change the cfg. */ + auto_vec<rtx_insn *> m_cfg_altering_insns; + + /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */ + auto_bitmap m_cfg_altering_insn_ids; + + /* We can insert new instructions at point P * 2 by inserting them + after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new + instructions at point P * 2 + 1 by inserting them before + M_POINTS[P - M_END_OF_SEQUENCE / 2]. */ + auto_vec<rtx_insn *, 256> m_points; +}; + +combine2::combine2 (function *fn) + : m_fn (fn), + m_num_regs (max_reg_num ()), + m_bb (NULL), + m_optimize_for_speed_p (false), + m_point (2), + m_end_of_sequence (m_point), + m_end_of_bb (m_point), + m_mem_range (NULL), + m_reg_info (m_num_regs), + m_valid_regs (m_num_regs), + m_spare_parallel (NULL_RTX) +{ + gcc_obstack_init (&m_insn_obstack); + gcc_obstack_init (&m_range_obstack); + m_reg_info.quick_grow (m_num_regs); + bitmap_clear (m_valid_regs); + m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0); + m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0); +} + +combine2::~combine2 () +{ + obstack_free (&m_insn_obstack, NULL); + obstack_free (&m_range_obstack, NULL); +} + +/* Return true if it's possible in principle to combine INSN with + other instructions. ALLOW_ASMS_P is true if the caller can cope + with asm statements. */ + +static bool +combinable_insn_p (rtx_insn *insn, bool allow_asms_p) +{ + rtx pattern = PATTERN (insn); + + if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER) + return false; + + if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX)) + return false; + + if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0) + return false; + + return true; +} + +/* Return true if it's possible in principle to move INSN somewhere else, + as long as all dependencies are satisfied. */ + +static bool +movable_insn_p (rtx_insn *insn) +{ + if (JUMP_P (insn)) + return false; + + if (volatile_refs_p (PATTERN (insn))) + return false; + + return true; +} + +/* Create and return a new live range for REGNO. NEXT is the next range + in program order, or null if this is the first live range in the + sequence. */ + +combine2::live_range_rec * +combine2::new_live_range (unsigned int regno, live_range_rec *next) +{ + live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec); + memset (range, 0, sizeof (*range)); + + range->regno = regno; + range->next_range = next; + if (next) + next->prev_range = range; + return range; +} + +/* Return the current live range for register REGNO, creating a new + one if necessary. */ + +combine2::live_range_rec * +combine2::reg_live_range (unsigned int regno) +{ + /* Initialize the liveness flag, if it isn't already valid for this BB. */ + bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno); + if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb) + m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno); + + /* See if we already have a live range associated with the current + instruction sequence. */ + live_range_rec *range = NULL; + if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence) + range = m_reg_info[regno].range; + + /* Create a new range if this is the first reference to REGNO in the + current instruction sequence or if the current range has been closed + off by a definition. */ + if (!range || range->producer) + { + range = new_live_range (regno, range); + + /* If the register is live after the current sequence, treat that + as a fake use at the end of the sequence. */ + if (!range->next_range && m_reg_info[regno].live_p) + range->first_extra_use = range->last_extra_use = m_end_of_sequence; + + /* Record that this is now the current range for REGNO. */ + if (first_ref_p) + bitmap_set_bit (m_valid_regs, regno); + m_reg_info[regno].range = range; + m_reg_info[regno].next_ref = m_point; + } + return range; +} + +/* Return the current live range for memory, treating memory as a single + entity. Create a new live range if necessary. */ + +combine2::live_range_rec * +combine2::mem_live_range () +{ + if (!m_mem_range || m_mem_range->producer) + m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range); + return m_mem_range; +} + +/* Record that instruction USER uses the resource described by RANGE. + Return true if this is new information. */ + +bool +combine2::add_range_use (live_range_rec *range, insn_info_rec *user) +{ + /* See if we've already recorded the instruction, or if there's a + spare use slot we can use. */ + unsigned int i = 0; + for (; i < NUM_RANGE_USERS && range->users[i]; ++i) + if (range->users[i] == user) + return false; + + if (i == NUM_RANGE_USERS) + { + /* Since we've processed USER recently, assume that it's more + interesting to record explicitly than the last user in the + current list. Evict that last user and describe it in the + overflow "extra use" range instead. */ + insn_info_rec *ousted_user = range->users[--i]; + if (range->first_extra_use < ousted_user->point) + range->first_extra_use = ousted_user->point; + if (range->last_extra_use > ousted_user->point) + range->last_extra_use = ousted_user->point; + } + + /* Insert USER while keeping the list sorted. */ + for (; i > 0 && range->users[i - 1]->point < user->point; --i) + range->users[i] = range->users[i - 1]; + range->users[i] = user; + return true; +} + +/* Remove USER from the uses recorded for RANGE, if we can. + There's nothing we can do if USER was described in the + overflow "extra use" range. */ + +void +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user) +{ + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + if (range->users[i] == user) + { + for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j) + range->users[j] = range->users[j + 1]; + range->users[NUM_RANGE_USERS - 1] = NULL; + break; + } +} + +/* Return true if RANGE has a single known user. */ + +bool +combine2::has_single_use_p (live_range_rec *range) +{ + return range->users[0] && !range->users[1] && !range->first_extra_use; +} + +/* Return true if we know that USER is the last user of RANGE. */ + +bool +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) +{ + if (range->last_extra_use <= user->point) + return false; + + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) + if (range->users[i] == user) + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; + else if (range->users[i]->point == user->point) + return false; + + gcc_unreachable (); +} + +/* Find the earliest point that we could move I2 up in order to combine + it with I1. Ignore any dependencies between I1 and I2; leave the + caller to deal with those instead. */ + +unsigned int +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1) +{ + if (!movable_insn_p (i2->insn)) + return i2->point; + + /* Start by optimistically assuming that we can move the instruction + all the way up to I1. */ + unsigned int point = i1->point; + + /* Make sure that the new position preserves all necessary true dependencies + on earlier instructions. */ + for (live_range_rec **use = i2->uses; *use; ++use) + { + live_range_rec *range = *use; + if (range->producer + && range->producer != i1 + && point >= range->producer->point) + point = range->producer->point - 1; + } + + /* Make sure that the new position preserves all necessary output and + anti dependencies on earlier instructions. */ + for (live_range_rec **def = i2->defs; *def; ++def) + if (live_range_rec *range = (*def)->prev_range) + { + if (range->producer + && range->producer != i1 + && point >= range->producer->point) + point = range->producer->point - 1; + + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) + if (range->users[i] && range->users[i] != i1) + { + if (point >= range->users[i]->point) + point = range->users[i]->point - 1; + break; + } + + if (range->last_extra_use && point >= range->last_extra_use) + point = range->last_extra_use - 1; + } + + return point; +} + +/* Find the latest point that we could move I1 down in order to combine + it with I2. Ignore any dependencies between I1 and I2; leave the + caller to deal with those instead. */ + +unsigned int +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2) +{ + if (!movable_insn_p (i1->insn)) + return i1->point; + + /* Start by optimistically assuming that we can move the instruction + all the way down to I2. */ + unsigned int point = i2->point; + + /* Make sure that the new position preserves all necessary anti dependencies + on later instructions. */ + for (live_range_rec **use = i1->uses; *use; ++use) + if (live_range_rec *range = (*use)->next_range) + if (range->producer != i2 && point <= range->producer->point) + point = range->producer->point + 1; + + /* Make sure that the new position preserves all necessary output and + true dependencies on later instructions. */ + for (live_range_rec **def = i1->defs; *def; ++def) + { + live_range_rec *range = *def; + + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + if (range->users[i] != i2) + { + if (range->users[i] && point <= range->users[i]->point) + point = range->users[i]->point + 1; + break; + } + + if (range->first_extra_use && point <= range->first_extra_use) + point = range->first_extra_use + 1; + + live_range_rec *next_range = range->next_range; + if (next_range + && next_range->producer != i2 + && point <= next_range->producer->point) + point = next_range->producer->point + 1; + } + + return point; +} + +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2, + where I1 is the instruction that we're currently trying to optimize. + If DEF_USE_RANGE is nonnull, I1 defines the value described by + DEF_USE_RANGE and I2 uses it. */ + +bool +combine2::start_combination (combination_attempt_rec &attempt, + insn_info_rec *i1, insn_info_rec *i2, + live_range_rec *def_use_range) +{ + attempt.new_home = i1; + attempt.sequence[0] = i1; + attempt.sequence[1] = i2; + if (attempt.sequence[0]->point < attempt.sequence[1]->point) + std::swap (attempt.sequence[0], attempt.sequence[1]); + attempt.def_use_range = def_use_range; + + /* Check that the instructions have no true dependencies other than + DEF_USE_RANGE. */ + bitmap_clear (m_true_deps); + for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def) + if (*def != def_use_range) + bitmap_set_bit (m_true_deps, (*def)->regno); + for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use) + if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno)) + return false; + + /* Calculate the range of points at which the combined instruction + could live. */ + attempt.earliest_point = find_earliest_point (attempt.sequence[1], + attempt.sequence[0]); + attempt.latest_point = find_latest_point (attempt.sequence[0], + attempt.sequence[1]); + if (attempt.earliest_point < attempt.latest_point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "cannot combine %d and %d: no suitable" + " location for combined insn\n", + INSN_UID (attempt.sequence[0]->insn), + INSN_UID (attempt.sequence[1]->insn)); + return false; + } + + /* Make sure we have valid costs for the original instructions before + we start changing their patterns. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + if (attempt.sequence[i]->cost == UNKNOWN_COST) + attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn, + m_optimize_for_speed_p); + return true; +} + +/* Check whether the combination attempt described by ATTEMPT matches + an .md instruction (or matches its constraints, in the case of an + asm statement). If so, calculate the cost of the new instruction + and check whether it's cheap enough. */ + +bool +combine2::verify_combination (combination_attempt_rec &attempt) +{ + rtx_insn *insn = attempt.sequence[1]->insn; + + bool ok_p = verify_changes (0); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + if (!ok_p) + fprintf (dump_file, "failed to match this instruction:\n"); + else if (const char *name = get_insn_name (INSN_CODE (insn))) + fprintf (dump_file, "successfully matched this instruction to %s:\n", + name); + else + fprintf (dump_file, "successfully matched this instruction:\n"); + print_rtl_single (dump_file, PATTERN (insn)); + } + if (!ok_p) + return false; + + unsigned int cost1 = attempt.sequence[0]->cost; + unsigned int cost2 = attempt.sequence[1]->cost; + attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p); + ok_p = (attempt.new_cost <= cost1 + cost2); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n", + cost1, cost2, attempt.new_cost, + ok_p ? "keeping replacement" : "rejecting replacement"); + if (!ok_p) + return false; + + confirm_change_group (); + return true; +} + +/* Return true if we should consider register REGNO when calculating + register pressure estimates. */ + +static bool +count_reg_pressure_p (unsigned int regno) +{ + if (regno == INVALID_REGNUM) + return false; + + /* Unallocatable registers aren't interesting. */ + if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno]) + return false; + + return true; +} + +/* Try to estimate the effect that the original form of INSN_INFO + had on register pressure, in the form "born - dying". */ + +int +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info) +{ + int delta = 0; + + for (live_range_rec **def = insn_info->defs; *def; ++def) + if (count_reg_pressure_p ((*def)->regno)) + delta += 1; + + for (live_range_rec **use = insn_info->uses; *use; ++use) + if (count_reg_pressure_p ((*use)->regno) + && known_last_use_p (*use, insn_info)) + delta -= 1; + + return delta; +} + +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete + FROM_INSN. Copy any useful information to TO_INSN before doing that. */ + +static void +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn) +{ + INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn); + INSN_CODE (to_insn) = INSN_CODE (from_insn); + REG_NOTES (to_insn) = REG_NOTES (from_insn); +} + +/* The combination attempt in ATTEMPT has succeeded and is currently + part of an open validate_change group. Commit to making the change + and decide where the new instruction should go. + + KEPT_DEF_P is true if the new instruction continues to perform + the definition described by ATTEMPT.def_use_range. */ + +void +combine2::commit_combination (combination_attempt_rec &attempt, + bool kept_def_p) +{ + insn_info_rec *new_home = attempt.new_home; + rtx_insn *old_insn = attempt.sequence[0]->insn; + rtx_insn *new_insn = attempt.sequence[1]->insn; + + /* Remove any notes that are no longer relevant. */ + bool single_set_p = single_set (new_insn); + for (rtx *note_ptr = ®_NOTES (new_insn); *note_ptr; ) + { + rtx note = *note_ptr; + bool keep_p = true; + switch (REG_NOTE_KIND (note)) + { + case REG_EQUAL: + case REG_EQUIV: + case REG_NOALIAS: + keep_p = single_set_p; + break; + + case REG_UNUSED: + keep_p = false; + break; + + default: + break; + } + if (keep_p) + note_ptr = &XEXP (*note_ptr, 1); + else + { + *note_ptr = XEXP (*note_ptr, 1); + free_EXPR_LIST_node (note); + } + } + + /* Complete the open validate_change group. */ + confirm_change_group (); + + /* Decide where the new instruction should go. */ + unsigned int new_point = attempt.latest_point; + if (new_point != attempt.earliest_point + && prev_real_insn (new_insn) != old_insn) + { + /* Prefer the earlier point if the combined instruction reduces + register pressure and the latest point if it increases register + pressure. + + The choice isn't obvious in the event of a tie, but picking + the earliest point should reduce the number of times that + we need to invalidate debug insns. */ + int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]); + int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]); + bool move_up_p = (delta1 + delta2 <= 0); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "register pressure delta = %d + %d; using %s position\n", + delta1, delta2, move_up_p ? "earliest" : "latest"); + if (move_up_p) + new_point = attempt.earliest_point; + } + + /* Translate inserting at NEW_POINT into inserting before or after + a particular insn. */ + rtx_insn *anchor = NULL; + bool before_p = (new_point & 1); + if (new_point != attempt.sequence[1]->point + && new_point != attempt.sequence[0]->point) + { + anchor = m_points[(new_point - m_end_of_sequence) / 2]; + rtx_insn *other_side = (before_p + ? prev_real_insn (anchor) + : next_real_insn (anchor)); + /* Inserting next to an insn X and then deleting X is just a + roundabout way of using X as the insertion point. */ + if (anchor == new_insn || other_side == new_insn) + new_point = attempt.sequence[1]->point; + else if (anchor == old_insn || other_side == old_insn) + new_point = attempt.sequence[0]->point; + } + + /* Actually perform the move. */ + if (new_point == attempt.sequence[1]->point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "using insn %d to hold the combined pattern\n", + INSN_UID (new_insn)); + set_insn_deleted (old_insn); + } + else if (new_point == attempt.sequence[0]->point) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "using insn %d to hold the combined pattern\n", + INSN_UID (old_insn)); + PATTERN (old_insn) = PATTERN (new_insn); + transfer_insn (old_insn, new_insn); + std::swap (old_insn, new_insn); + set_insn_deleted (old_insn); + } + else + { + /* We need to insert a new instruction. We can't simply move + NEW_INSN because it acts as an insertion anchor in m_points. */ + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "inserting combined insn %s insn %d\n", + before_p ? "before" : "after", INSN_UID (anchor)); + + rtx_insn *added_insn = (before_p + ? emit_insn_before (PATTERN (new_insn), anchor) + : emit_insn_after (PATTERN (new_insn), anchor)); + transfer_insn (added_insn, new_insn); + set_insn_deleted (old_insn); + set_insn_deleted (new_insn); + new_insn = added_insn; + } + df_insn_rescan (new_insn); + + /* Unlink the old uses. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) + remove_range_use (*use, attempt.sequence[i]); + + /* Work out which registers the new pattern uses. */ + bitmap_clear (m_true_deps); + df_ref use; + FOR_EACH_INSN_USE (use, new_insn) + { + rtx reg = DF_REF_REAL_REG (use); + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); + } + FOR_EACH_INSN_EQ_USE (use, new_insn) + { + rtx reg = DF_REF_REAL_REG (use); + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); + } + + /* Describe the combined instruction in NEW_HOME. */ + new_home->insn = new_insn; + new_home->point = new_point; + new_home->cost = attempt.new_cost; + + /* Build up a list of definitions for the combined instructions + and update all the ranges accordingly. It shouldn't matter + which order we do this in. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def) + if (kept_def_p || *def != attempt.def_use_range) + { + obstack_ptr_grow (&m_insn_obstack, *def); + (*def)->producer = new_home; + } + obstack_ptr_grow (&m_insn_obstack, NULL); + new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); + + /* Build up a list of uses for the combined instructions and update + all the ranges accordingly. Again, it shouldn't matter which + order we do this in. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) + if (*use != attempt.def_use_range + && add_range_use (*use, new_home)) + obstack_ptr_grow (&m_insn_obstack, *use); + obstack_ptr_grow (&m_insn_obstack, NULL); + new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); + + /* There shouldn't be any remaining references to other instructions + in the combination. Invalidate their contents to make lingering + references a noisy failure. */ + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) + if (attempt.sequence[i] != new_home) + { + attempt.sequence[i]->insn = NULL; + attempt.sequence[i]->point = ~0U; + } + + /* Unlink the def-use range. */ + if (!kept_def_p && attempt.def_use_range) + { + live_range_rec *range = attempt.def_use_range; + if (range->prev_range) + range->prev_range->next_range = range->next_range; + else + m_reg_info[range->regno].range = range->next_range; + if (range->next_range) + range->next_range->prev_range = range->prev_range; + } + + /* Record instructions whose new form alters the cfg. */ + rtx pattern = PATTERN (new_insn); + if ((returnjump_p (new_insn) + || any_uncondjump_p (new_insn) + || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx)) + && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn))) + m_cfg_altering_insns.safe_push (new_insn); +} + +/* Return true if X1 and X2 are memories and if X1 does not have + a higher alignment than X2. */ + +static bool +dubious_mem_pair_p (rtx x1, rtx x2) +{ + return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2); +} + +/* Try implement ATTEMPT using (parallel [SET1 SET2]). */ + +bool +combine2::try_parallel_sets (combination_attempt_rec &attempt, + rtx set1, rtx set2) +{ + rtx_insn *insn = attempt.sequence[1]->insn; + + /* Combining two loads or two stores can be useful on targets that + allow them to be treated as a single access. However, we use a + very peephole approach to picking the pairs, so we need to be + relatively confident that we're making a good choice. + + For now just aim for cases in which the memory references are + consecutive and the first reference has a higher alignment. + We can leave the target to test the consecutive part; whatever test + we added here might be different from the target's, and in any case + it's fine if the target accepts other well-aligned cases too. */ + if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2)) + || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2))) + return false; + + /* Cache the PARALLEL rtx between attempts so that we don't generate + too much garbage rtl. */ + if (!m_spare_parallel) + { + rtvec vec = gen_rtvec (2, set1, set2); + m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec); + } + else + { + XVECEXP (m_spare_parallel, 0, 0) = set1; + XVECEXP (m_spare_parallel, 0, 1) = set2; + } + + unsigned int num_changes = num_validated_changes (); + validate_change (insn, &PATTERN (insn), m_spare_parallel, true); + if (verify_combination (attempt)) + { + m_spare_parallel = NULL_RTX; + return true; + } + cancel_changes (num_changes); + return false; +} + +/* Try to parallelize the two instructions in ATTEMPT. */ + +bool +combine2::try_parallelize_insns (combination_attempt_rec &attempt) +{ + rtx_insn *i1_insn = attempt.sequence[0]->insn; + rtx_insn *i2_insn = attempt.sequence[1]->insn; + + /* Can't parallelize asm statements. */ + if (asm_noperands (PATTERN (i1_insn)) >= 0 + || asm_noperands (PATTERN (i2_insn)) >= 0) + return false; + + /* For now, just handle the case in which both instructions are + single sets. We could handle more than 2 sets as well, but few + targets support that anyway. */ + rtx set1 = single_set (i1_insn); + if (!set1) + return false; + rtx set2 = single_set (i2_insn); + if (!set2) + return false; + + /* Make sure that we have structural proof that the destinations + are independent. Things like alias analysis rely on semantic + information and assume no undefined behavior, which is rarely a + good enough guarantee to allow a useful instruction combination. */ + rtx dest1 = SET_DEST (set1); + rtx dest2 = SET_DEST (set2); + if (MEM_P (dest1) + ? MEM_P (dest2) && nonoverlapping_memrefs_p (dest1, dest2, false) + : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2)) + return false; + + /* Try the sets in both orders. */ + if (try_parallel_sets (attempt, set1, set2) + || try_parallel_sets (attempt, set2, set1)) + { + commit_combination (attempt, true); + if (MAY_HAVE_DEBUG_BIND_INSNS + && attempt.new_home->insn != i1_insn) + propagate_for_debug (i1_insn, attempt.new_home->insn, + SET_DEST (set1), SET_SRC (set1), m_bb); + return true; + } + return false; +} + +/* Replace DEST with SRC in the register notes for INSN. */ + +static void +substitute_into_note (rtx_insn *insn, rtx dest, rtx src) +{ + for (rtx *note_ptr = ®_NOTES (insn); *note_ptr; ) + { + rtx note = *note_ptr; + bool keep_p = true; + switch (REG_NOTE_KIND (note)) + { + case REG_EQUAL: + case REG_EQUIV: + keep_p = validate_simplify_replace_rtx (insn, &XEXP (note, 0), + dest, src); + break; + + default: + break; + } + if (keep_p) + note_ptr = &XEXP (*note_ptr, 1); + else + { + *note_ptr = XEXP (*note_ptr, 1); + free_EXPR_LIST_node (note); + } + } +} + +/* A subroutine of try_combine_def_use. Try replacing DEST with SRC + in ATTEMPT. SRC might be either the original SET_SRC passed to the + parent routine or a value pulled from a note; SRC_IS_NOTE_P is true + in the latter case. */ + +bool +combine2::try_combine_def_use_1 (combination_attempt_rec &attempt, + rtx dest, rtx src, bool src_is_note_p) +{ + rtx_insn *def_insn = attempt.sequence[0]->insn; + rtx_insn *use_insn = attempt.sequence[1]->insn; + + /* Mimic combine's behavior by not combining moves from allocatable hard + registers (e.g. when copying parameters or function return values). */ + if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) + return false; + + /* Don't mess with volatile references. For one thing, we don't yet + know how many copies of SRC we'll need. */ + if (volatile_refs_p (src)) + return false; + + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "trying to combine %d and %d%s:\n", + INSN_UID (def_insn), INSN_UID (use_insn), + src_is_note_p ? " using equal/equiv note" : ""); + dump_insn_slim (dump_file, def_insn); + dump_insn_slim (dump_file, use_insn); + } + + unsigned int num_changes = num_validated_changes (); + if (!validate_simplify_replace_rtx (use_insn, &PATTERN (use_insn), + dest, src)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "combination failed -- unable to substitute" + " all uses\n"); + return false; + } + + /* Try matching the instruction on its own if DEST isn't used elsewhere. */ + if (has_single_use_p (attempt.def_use_range) + && verify_combination (attempt)) + { + live_range_rec *next_range = attempt.def_use_range->next_range; + substitute_into_note (use_insn, dest, src); + commit_combination (attempt, false); + if (MAY_HAVE_DEBUG_BIND_INSNS) + { + rtx_insn *end_of_range = (next_range + ? next_range->producer->insn + : BB_END (m_bb)); + propagate_for_debug (def_insn, end_of_range, dest, src, m_bb); + } + return true; + } + + /* Try doing the new USE_INSN pattern in parallel with the DEF_INSN + pattern. */ + if (try_parallelize_insns (attempt)) + return true; + + cancel_changes (num_changes); + return false; +} + +/* ATTEMPT describes an attempt to substitute the result of the first + instruction into the second instruction. Try to implement it, + given that the first instruction sets DEST to SRC. */ + +bool +combine2::try_combine_def_use (combination_attempt_rec &attempt, + rtx dest, rtx src) +{ + rtx_insn *def_insn = attempt.sequence[0]->insn; + rtx_insn *use_insn = attempt.sequence[1]->insn; + rtx def_note = find_reg_equal_equiv_note (def_insn); + + /* First try combining the instructions in their original form. */ + if (try_combine_def_use_1 (attempt, dest, src, false)) + return true; + + /* Try to replace DEST with a REG_EQUAL/EQUIV value instead. */ + if (def_note + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) + return true; + + /* If USE_INSN has a REG_EQUAL/EQUIV note that refers to DEST, try + using that instead of the main pattern. */ + for (rtx *link_ptr = ®_NOTES (use_insn); *link_ptr; + link_ptr = &XEXP (*link_ptr, 1)) + { + rtx use_note = *link_ptr; + if (REG_NOTE_KIND (use_note) != REG_EQUAL + && REG_NOTE_KIND (use_note) != REG_EQUIV) + continue; + + rtx use_set = single_set (use_insn); + if (!use_set) + break; + + if (!reg_overlap_mentioned_p (dest, XEXP (use_note, 0))) + continue; + + /* Try snipping out the note and putting it in the SET instead. */ + validate_change (use_insn, link_ptr, XEXP (use_note, 1), 1); + validate_change (use_insn, &SET_SRC (use_set), XEXP (use_note, 0), 1); + + if (try_combine_def_use_1 (attempt, dest, src, false)) + return true; + + if (def_note + && try_combine_def_use_1 (attempt, dest, XEXP (def_note, 0), true)) + return true; + + cancel_changes (0); + } + + return false; +} + +/* ATTEMPT describes an attempt to combine two instructions that use + the same resource. Try to implement it, returning true on success. */ + +bool +combine2::try_combine_two_uses (combination_attempt_rec &attempt) +{ + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "trying to parallelize %d and %d:\n", + INSN_UID (attempt.sequence[0]->insn), + INSN_UID (attempt.sequence[1]->insn)); + dump_insn_slim (dump_file, attempt.sequence[0]->insn); + dump_insn_slim (dump_file, attempt.sequence[1]->insn); + } + + return try_parallelize_insns (attempt); +} + +/* Try to optimize instruction INSN_INFO. Return true on success. */ + +bool +combine2::optimize_insn (insn_info_rec *i1) +{ + combination_attempt_rec attempt; + + if (!combinable_insn_p (i1->insn, false)) + return false; + + rtx set = single_set (i1->insn); + if (!set) + return false; + + /* First try combining INSN with a user of its result. */ + rtx dest = SET_DEST (set); + rtx src = SET_SRC (set); + if (REG_P (dest) && REG_NREGS (dest) == 1) + for (live_range_rec **def = i1->defs; *def; ++def) + if ((*def)->regno == REGNO (dest)) + { + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + { + insn_info_rec *use = (*def)->users[i]; + if (use + && combinable_insn_p (use->insn, has_single_use_p (*def)) + && start_combination (attempt, i1, use, *def) + && try_combine_def_use (attempt, dest, src)) + return true; + } + break; + } + + /* Try parallelizing INSN and another instruction that uses the same + resource. */ + bitmap_clear (m_tried_insns); + for (live_range_rec **use = i1->uses; *use; ++use) + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) + { + insn_info_rec *i2 = (*use)->users[i]; + if (i2 + && i2 != i1 + && combinable_insn_p (i2->insn, false) + && bitmap_set_bit (m_tried_insns, INSN_UID (i2->insn)) + && start_combination (attempt, i1, i2) + && try_combine_two_uses (attempt)) + return true; + } + + return false; +} + +/* A note_stores callback. Set the bool at *DATA to true if DEST is in + memory. */ + +static void +find_mem_def (rtx dest, const_rtx, void *data) +{ + /* note_stores has stripped things like subregs and zero_extracts, + so we don't need to worry about them here. */ + if (MEM_P (dest)) + *(bool *) data = true; +} + +/* Record all register and memory definitions in INSN_INFO and fill in its + "defs" list. */ + +void +combine2::record_defs (insn_info_rec *insn_info) +{ + rtx_insn *insn = insn_info->insn; + + /* Record register definitions. */ + df_ref def; + FOR_EACH_INSN_DEF (def, insn) + { + rtx reg = DF_REF_REAL_REG (def); + unsigned int end_regno = END_REGNO (reg); + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) + { + live_range_rec *range = reg_live_range (regno); + range->producer = insn_info; + m_reg_info[regno].live_p = false; + obstack_ptr_grow (&m_insn_obstack, range); + } + } + + /* If the instruction writes to memory, record that too. */ + bool saw_mem_p = false; + note_stores (insn, find_mem_def, &saw_mem_p); + if (saw_mem_p) + { + live_range_rec *range = mem_live_range (); + range->producer = insn_info; + obstack_ptr_grow (&m_insn_obstack, range); + } + + /* Complete the list of definitions. */ + obstack_ptr_grow (&m_insn_obstack, NULL); + insn_info->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); +} + +/* Record that INSN_INFO contains register use USE. If this requires + new entries to be added to INSN_INFO->uses, add those entries to the + list we're building in m_insn_obstack. */ + +void +combine2::record_reg_use (insn_info_rec *insn_info, df_ref use) +{ + rtx reg = DF_REF_REAL_REG (use); + unsigned int end_regno = END_REGNO (reg); + for (unsigned int regno = REGNO (reg); regno < end_regno; ++regno) + { + live_range_rec *range = reg_live_range (regno); + if (add_range_use (range, insn_info)) + obstack_ptr_grow (&m_insn_obstack, range); + m_reg_info[regno].live_p = true; + } +} + +/* A note_uses callback. Set the bool at DATA to true if *LOC reads + from variable memory. */ + +static void +find_mem_use (rtx *loc, void *data) +{ + subrtx_iterator::array_type array; + FOR_EACH_SUBRTX (iter, array, *loc, NONCONST) + if (MEM_P (*iter) && !MEM_READONLY_P (*iter)) + { + *(bool *) data = true; + break; + } +} + +/* Record all register and memory uses in INSN_INFO and fill in its + "uses" list. */ + +void +combine2::record_uses (insn_info_rec *insn_info) +{ + rtx_insn *insn = insn_info->insn; + + /* Record register uses in the main pattern. */ + df_ref use; + FOR_EACH_INSN_USE (use, insn) + record_reg_use (insn_info, use); + + /* Treat REG_EQUAL uses as first-class uses. We don't lose much + by doing that, since it's rare for a REG_EQUAL note to mention + registers that the main pattern doesn't. It also gives us the + maximum freedom to use REG_EQUAL notes in place of the main pattern. */ + FOR_EACH_INSN_EQ_USE (use, insn) + record_reg_use (insn_info, use); + + /* Record a memory use if either the pattern or the notes read from + memory. */ + bool saw_mem_p = false; + note_uses (&PATTERN (insn), find_mem_use, &saw_mem_p); + for (rtx note = REG_NOTES (insn); !saw_mem_p && note; note = XEXP (note, 1)) + if (REG_NOTE_KIND (note) == REG_EQUAL + || REG_NOTE_KIND (note) == REG_EQUIV) + note_uses (&XEXP (note, 0), find_mem_use, &saw_mem_p); + if (saw_mem_p) + { + live_range_rec *range = mem_live_range (); + if (add_range_use (range, insn_info)) + obstack_ptr_grow (&m_insn_obstack, range); + } + + /* Complete the list of uses. */ + obstack_ptr_grow (&m_insn_obstack, NULL); + insn_info->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); +} + +/* Start a new instruction sequence, discarding all information about + the previous one. */ + +void +combine2::start_sequence (void) +{ + m_end_of_sequence = m_point; + m_mem_range = NULL; + m_points.truncate (0); + obstack_free (&m_insn_obstack, m_insn_obstack_start); + obstack_free (&m_range_obstack, m_range_obstack_start); +} + +/* Run the pass on the current function. */ + +void +combine2::execute (void) +{ + df_analyze (); + FOR_EACH_BB_FN (m_bb, cfun) + { + m_optimize_for_speed_p = optimize_bb_for_speed_p (m_bb); + m_end_of_bb = m_point; + start_sequence (); + + rtx_insn *insn, *prev; + FOR_BB_INSNS_REVERSE_SAFE (m_bb, insn, prev) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + /* The current m_point represents the end of the sequence if + INSN is the last instruction in the sequence, otherwise it + represents the gap between INSN and the next instruction. + m_point + 1 represents INSN itself. + + Instructions can be added to m_point by inserting them + after INSN. They can be added to m_point + 1 by inserting + them before INSN. */ + m_points.safe_push (insn); + m_point += 1; + + insn_info_rec *insn_info = XOBNEW (&m_insn_obstack, insn_info_rec); + insn_info->insn = insn; + insn_info->point = m_point; + insn_info->cost = UNKNOWN_COST; + + record_defs (insn_info); + record_uses (insn_info); + + /* Set up m_point for the next instruction. */ + m_point += 1; + + if (CALL_P (insn)) + start_sequence (); + else + while (optimize_insn (insn_info)) + gcc_assert (insn_info->insn); + } + } + + /* If an instruction changes the cfg, update the containing block + accordingly. */ + rtx_insn *insn; + unsigned int i; + FOR_EACH_VEC_ELT (m_cfg_altering_insns, i, insn) + if (JUMP_P (insn)) + { + mark_jump_label (PATTERN (insn), insn, 0); + update_cfg_for_uncondjump (insn); + } + else + { + remove_edge (split_block (BLOCK_FOR_INSN (insn), insn)); + emit_barrier_after_bb (BLOCK_FOR_INSN (insn)); + } + + /* Propagate the above block-local cfg changes to the rest of the cfg. */ + if (!m_cfg_altering_insns.is_empty ()) + { + if (dom_info_available_p (CDI_DOMINATORS)) + free_dominance_info (CDI_DOMINATORS); + timevar_push (TV_JUMP); + rebuild_jump_labels (get_insns ()); + cleanup_cfg (0); + timevar_pop (TV_JUMP); + } +} + +const pass_data pass_data_combine2 = +{ + RTL_PASS, /* type */ + "combine2", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_COMBINE2, /* tv_id */ + 0, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_combine2 : public rtl_opt_pass +{ +public: + pass_combine2 (gcc::context *ctxt, int flag) + : rtl_opt_pass (pass_data_combine2, ctxt), m_flag (flag) + {} + + bool + gate (function *) OVERRIDE + { + return optimize && (param_run_combine & m_flag) != 0; + } + + unsigned int + execute (function *f) OVERRIDE + { + combine2 (f).execute (); + return 0; + } + +private: + unsigned int m_flag; +}; // class pass_combine2 + +} // anon namespace + +rtl_opt_pass * +make_pass_combine2_before (gcc::context *ctxt) +{ + return new pass_combine2 (ctxt, 1); +} + +rtl_opt_pass * +make_pass_combine2_after (gcc::context *ctxt) +{ + return new pass_combine2 (ctxt, 4); +}