
[committed,PR,rtl-optimization/87761] Improve MIPS splitters to sometimes avoid unnecessary cross-unit register copies

Message ID b2864cfb-1b41-59e3-59cf-e5baf13ad5ad@redhat.com

Commit Message

Jeff Law March 22, 2019, 6:16 p.m. UTC
I cry uncle!

So I finally started looking at the fpr-moves regression in this BZ.  No
surprise this is further fallout from the combiner changes.

Going into register allocation we have something like this:

(insn 13 7 14 2 (set (reg:TF 196)
        (reg:TF 44 $f12 [ d ])) "j.c":7:1 376 {*movtf}
     (expr_list:REG_DEAD (reg:TF 44 $f12 [ d ])
        (nil)))
(insn 14 13 10 2 (set (reg:DI 197)
        (reg:DI 6 $6 [ x ])) "j.c":7:1 313 {*movdi_64bit}
     (expr_list:REG_DEAD (reg:DI 6 $6 [ x ])
        (nil)))
(insn 10 14 15 2 (set (mem:TF (reg:DI 197) [1 *x_2(D)+0 S16 A128])
        (reg:TF 196)) "j.c":8:6 376 {*movtf}
     (expr_list:REG_DEAD (reg:DI 197)
        (expr_list:REG_DEAD (reg:TF 196)
            (nil))))


Prior to the combine changes, all of that would have been squashed into
a single nice insn.  One might reasonably ask if combine's avoidance of
hard regs should be loosened for consecutive insns -- combining
consecutive insns isn't going to increase the live range of these hard
regs.  I wandered around combine a bit and decided it wasn't really
feasible to restrict things that way.

So as a starting point, assume the RTL above is what we're going to have
to deal with.

We end up allocating (reg 196) into a GPR.  It's "cheaper".  So
post-reload it looks like:

> (insn 13 7 10 2 (set (reg:TF 2 $2 [196])
>         (reg:TF 44 $f12 [ d ])) "j.c":7:1 376 {*movtf}
>      (nil))
> (insn 10 13 15 2 (set (mem:TF (reg:DI 6 $6 [197]) [1 *x_2(D)+0 S16 A128])
>         (reg:TF 2 $2 [196])) "j.c":8:6 376 {*movtf}
>      (nil))

The MIPS splitter turns that into this mess:

> (insn 17 7 18 2 (set (reg:DI 2 $2 [196])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 0 [0])
>             ] UNSPEC_STORE_WORD)) "j.c":7:1 407 {store_wordtf}
>      (nil))
> (insn 18 17 19 2 (set (reg:DI 3 $3 [+8 ])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 1 [0x1])
>             ] UNSPEC_STORE_WORD)) "j.c":7:1 407 {store_wordtf}
>      (nil))
> (insn 19 18 20 2 (set (mem:DI (reg:DI 6 $6 [197]) [1 *x_2(D)+0 S8 A128])
>         (reg:DI 2 $2 [196])) "j.c":8:6 313 {*movdi_64bit}
>      (nil))
> (insn 20 19 15 2 (set (mem:DI (plus:DI (reg:DI 6 $6 [197])
>                 (const_int 8 [0x8])) [1 *x_2(D)+8 S8 A64])
>         (reg:DI 3 $3 [+8 ])) "j.c":8:6 313 {*movdi_64bit}
>      (nil))


Ugh.  Note the unspecs in the first two insns.  They're going to prevent
regcprop from doing its job with this mess.  I have no idea why we use
UNSPECs here rather than representing this properly in RTL, but I'm
willing to assume it's for a good reason.

So one thought was to have a pass of regcprop earlier.  That fixes this
MIPS issue nicely *and* simplifies the solution to the fix-r4000
regression!  But it regresses x86 in a couple of ways.  The x86
regressions can then be fixed by moving the REE pass to just before the
early regcprop pass.  That's really promising, so I throw it into the
tester and look for fallout.

Example fallout on visium where we have this after reload:

> (insn 2 5 11 2 (set (reg/v:QI 10 r10 [orig:66 a ] [66])
>         (reg:QI 1 r1 [74])) "k.c":7:1 1 {*movqi_insn}
>      (nil))
> (insn 11 2 10 2 (set (reg:QI 9 r9 [71])
>         (plus:QI (reg/v:QI 10 r10 [orig:66 a ] [66])
>             (reg/v:QI 2 r2 [orig:67 b ] [67]))) "k.c":8:10 28 {*addqi3_insn}
>      (nil))
> (insn 10 11 12 2 (set (reg:QI 1 r1 [orig:64 _7+1 ] [64])
>         (const_int 0 [0])) "k.c":8:10 1 {*movqi_insn}
>      (expr_list:REG_EQUAL (const_int 0 [0])
>         (nil)))
> (jump_insn 12 10 30 2 (set (pc)
>         (if_then_else (eq (reg:QI 9 r9 [71])
>                 (unspec:QI [
>                         (reg/v:QI 10 r10 [orig:66 a ] [66])
>                         (reg/v:QI 2 r2 [orig:67 b ] [67])
>                     ] UNSPEC_ADDV))
>             (label_ref:SI 17)
>             (pc))) "k.c":8:10 234 {*cbranchqi4_addv_insn}
>      (int_list:REG_BR_PROB 1073204964 (nil)))

The early regcprop pass will propagate r1 for r10 in insn 11, but it
can't propagate into the use in insn 12 (because of the set of r1 in
insn 10).  So we regress visium in unpleasant ways (the operands of the
plus apparently need to match the jump at insn 12 to get the code we
want).  We can't move the cmpelim pass earlier because it depends on
splitting, and the point of this approach is to run hard cprop before
splitting.  Note this was just one regression; rl78 regressed too, and
likely others would have as well if they had stronger target-specific
testsuites.

Anyway, we could look at doing more pass juggling, but I suspect we'd
just be permuting the problems.  We could perhaps do some pass
duplication, but that seems so wasteful of compile-time given the cases
we're looking at.

So I'm just punting.  In the MIPS splitter, we can peek ahead one real
insn and try to forward propagate the source operand at split time.  We
still generate the split insns as well.  So after post-reload splitting
we have something like this:

> (insn 17 7 18 2 (set (reg:DI 2 $2 [196])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 0 [0])
>             ] UNSPEC_STORE_WORD)) "j.c":7:1 407 {store_wordtf}
>      (nil))
> (insn 18 17 19 2 (set (reg:DI 3 $3 [+8 ])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 1 [0x1])
>             ] UNSPEC_STORE_WORD)) "j.c":7:1 407 {store_wordtf}
>      (nil))
> (insn 19 18 20 2 (set (mem:DI (reg:DI 6 $6 [197]) [1 *x_2(D)+0 S8 A128])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 0 [0])
>             ] UNSPEC_STORE_WORD)) "j.c":8:6 -1
>      (nil))
> (insn 20 19 15 2 (set (mem:DI (plus:DI (reg:DI 6 $6 [197])
>                 (const_int 8 [0x8])) [1 *x_2(D)+8 S8 A64])
>         (unspec:DI [
>                 (reg:TF 44 $f12 [ d ])
>                 (const_int 1 [0x1])
>             ] UNSPEC_STORE_WORD)) "j.c":8:6 -1
>      (nil))

Note how we split the second move too, and the source operand is now
$f12.  Insns 17 and 18 will get removed as dead by DCE and we
ultimately get the desired code.

But just to be clear, this is a gross hack to avoid playing whack-a-mole
with pass ordering and duplication.  If it triggers in real-world code,
it should ultimately improve performance, as it can make the cross-unit
copies dead and gives the scheduler additional freedom.  But I don't
really expect it to trigger much in real-world code.

It fixes fpr-moves-5 (noted as a regression in the BZ) and fpr-moves-6
(not noted as a regression in the BZ) on mips64-linux-gnu and
mips64el-linux-gnu.  Bootstrapped on mipsisa32r2-linux-gnu as well.


Installing on the trunk.

Jeff

ps.  Now back to the fix-r4000 regression :-)

	PR rtl-optimization/87761
	* config/mips/mips-protos.h (mips_split_move): Add new argument.
	* config/mips/mips.c (mips_emit_move_or_split): Pass NULL for INSN
	into mips_split_move.
	(mips_split_move): Accept new INSN argument.  Try to forward SRC
	into the next instruction.
	(mips_split_move_insn): Pass INSN through to mips_split_move.

Comments

Segher Boessenkool March 24, 2019, 5:42 p.m. UTC | #1
Hi!

On Fri, Mar 22, 2019 at 12:16:30PM -0600, Jeff Law wrote:
> So I finally started looking at the fpr-moves regression in this BZ.  No
> surprise this is further fallout from the combiner changes.

It *exposed* the problem, yes :-)

> One might reasonably ask if combine's avoidance of
> hard regs should be loosened for consecutive insns -- combining
> consecutive insns isn't going to increase the live range of these hard
> regs.

But that is not the only reason we don't want to forward hard registers.
One of the important reasons is correctness: we could create situations
that reload/LRA cannot handle (cannot *possibly* handle).  Another reason
is that combine should not try to do register allocation's job: we on
average get better code after the combine hard reg change.

(And what are consecutive insns, anyway?  This is before scheduling.)

> We end up allocating (reg 196) into a GPR.  It's "cheaper".

Is this a problem in itself already?  (I don't know MIPS terribly well).

> So one thought was to have a pass of regcprop earlier.  That fixes this
> MIPS issue nicely *and* simplifies the solution to the fix-r4000
> regression!  But regresses x86 in a couple ways.  The x86 regressions
> can then be fixed by moving the REE pass to just before the early
> regcprop pass.  That's real promising, so I throw it into the tester and
> look for fallout.
> 
> Example fallout on visium where we have this after reload:

Yeah, reordering or duplicating late passes is something for stage 1, and
even then it might just not work out.

> So I'm just punting.  In the MIPS splitter, we can peek ahead one real
> insn and try to forward propagate the source operand at split time.  We
> still generate the split insns as well.  So after post-reload splitting
> we have something like this:

> Note how we split the second move too and the source operand is now
> $f12.    insns 17 and 18 will get removed as dead by DCE and we
> ultimately get the desired code.

For splitters after reload you have to do a lot of work manually, to get
reasonable code.  This is a problem everywhere :-(

An obvious fix is to split *before* reload, too, but that doesn't work for
those few splitters that depend on RA results.


Segher
Jeff Law March 25, 2019, 5:39 p.m. UTC | #2
On 3/24/19 11:42 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Mar 22, 2019 at 12:16:30PM -0600, Jeff Law wrote:
>> So I finally started looking at the fpr-moves regression in this BZ.  No
>> surprise this is further fallout from the combiner changes.
> 
> It *exposed* the problem, yes :-)
Well, it was one of the regressions that resulted from that patch.  I'm
a bit disappointed in how many were just punted rather than really
analyzed and either mitigated or explicitly moved out to gcc-10 because
mitigation was too painful, with marginal benefit, to do at this stage.


> 
>> One might reasonably ask if combine's avoidance of
>> hard regs should be loosened for consecutive insns -- combining
>> consecutive insns isn't going to increase the live range of these hard
>> regs.
> 
> But that is not the only reason we don't want to forward hard registers.
> One of the important reasons is correctness: we could create situations
> that reload/LRA cannot handle (cannot *possibly* handle).  Another reason
> is that combine should not try to do register allocation's job: we on
> average get better code after the combine hard reg change.
No doubt.  My point was that we can and should have been looking for
ways to mitigate the fallout.  Which in my mind includes further
investigation of whether or not we've hit the right balance for what
should and should not be combined.

> 
> (And what are consecutive insns, anyway?  This is before scheduling.)
The insns at combine time looked like:

> (insn 13 8 5 2 (set (reg:TF 196)
>         (reg:TF 44 $f12 [ d ])) "j.c":7:1 -1
>      (expr_list:REG_DEAD (reg:TF 44 $f12 [ d ])
>         (nil)))
> (note 5 13 14 2 NOTE_INSN_DELETED)
> (insn 14 5 6 2 (set (reg:DI 197)
>         (reg:DI 6 $6 [ x ])) "j.c":7:1 -1
>      (expr_list:REG_DEAD (reg:DI 6 $6 [ x ])
>         (nil)))
> (note 6 14 7 2 NOTE_INSN_DELETED)
> (note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
> (insn 10 7 0 2 (set (mem:TF (reg:DI 197) [1 *x_2(D)+0 S16 A128])
>         (reg:TF 196)) "j.c":8:6 376 {*movtf}
>      (expr_list:REG_DEAD (reg:DI 197)
>         (expr_list:REG_DEAD (reg:TF 196)
>             (nil))))


Without reviewing the thread which led to the combine.c change I can't
say offhand if there's something we can/should relax in combine to
handle this better.  My point was that it should have been more
thoroughly investigated and mitigations attempted once the regressions
were noted.


> 
>> We end up allocating (reg 196) into a GPR.  It's "cheaper".
> 
> Is this a problem in itself already?  (I don't know MIPS terribly well).
It may be.  While I've done extensive MIPS work in the past, it
predates the TF/TI support there, so I don't have a sense of the
relative costs for TF/TI.  The costs certainly looked fishy when I
first looked at them.  But I wasn't up for tackling that right now
either :-)  It's an area Vlad continues to investigate (hard reg
costing is a known weak spot for IRA/LRA).


>> So I'm just punting.  In the MIPS splitter, we can peek ahead one real
>> insn and try to forward propagate the source operand at split time.  We
>> still generate the split insns as well.  So after post-reload splitting
>> we have something like this:
> 
>> Note how we split the second move too and the source operand is now
>> $f12.    insns 17 and 18 will get removed as dead by DCE and we
>> ultimately get the desired code.
> 
> For splitters after reload you have to do a lot of work manually, to get
> reasonable code.  This is a problem everywhere :-(
Yup.  In fact, we've seen multiple BZs this cycle where a DCE pass right
before/after splitting would help.  In both cases we ended up hacking up
the splitters a bit.  It's one of the items I think we'll want to
revisit during gcc-10 stage1 development.

> 
> An obvious fix is to split *before* reload, too, but that doesn't work for
> those few splitters that depend on RA results.
Right.

jeff
Segher Boessenkool March 26, 2019, 1:32 a.m. UTC | #3
On Mon, Mar 25, 2019 at 11:39:27AM -0600, Jeff Law wrote:
> On 3/24/19 11:42 AM, Segher Boessenkool wrote:
> > On Fri, Mar 22, 2019 at 12:16:30PM -0600, Jeff Law wrote:
> >> So I finally started looking at the fpr-moves regression in this BZ.  No
> >> surprise this is further fallout from the combiner changes.
> > 
> > It *exposed* the problem, yes :-)
> Well, it was one of the regressions that resulted from that patch.  I'm
> a bit disappointed in how many were just punted rather than really
> analyzed and either mitigated or explicitly moved out to gcc-10 because
> mitigation was too painful, with marginal benefit, to do at this stage.
> 
> >> One might reasonably ask if combine's avoidance of
> >> hard regs should be loosened for consecutive insns -- combining
> >> consecutive insns isn't going to increase the live range of these hard
> >> regs.
> > 
> > But that is not the only reason we don't want to forward hard registers.
> > One of the important reasons is correctness: we could create situations
> > that reload/LRA cannot handle (cannot *possibly* handle).  Another reason
> > is that combine should not try to do register allocation's job: we on
> > average get better code after the combine hard reg change.
> No doubt.  My point was that we can and should have been looking for
> ways to mitigate the fallout.  Which in my mind includes further
> investigation of whether or not we've hit the right balance for what
> should and should not be combined.

Do you have some example of something that happened after the change, that
did not happen before the change (when using a var instead of a function
argument, say)?

> > For splitters after reload you have to do a lot of work manually, to get
> > reasonable code.  This is a problem everywhere :-(
> Yup.  In fact, we've seen multiple BZs this cycle where a DCE pass right
> before/after splitting would help.  In both cases we ended up hacking up
> the splitters a bit.  It's one of the items I think we'll want to
> revisit during gcc-10 stage1 development.

I would like to even see something CSE-like, cprop-like, and combine-like.

But most of all, people should stop doing splitters after reload if there
is any way to avoid it!


Segher

Patch

diff --git a/gcc/config/mips/mips-protos.h b/gcc/config/mips/mips-protos.h
index 64afb350fb7..32070fdb8c9 100644
--- a/gcc/config/mips/mips-protos.h
+++ b/gcc/config/mips/mips-protos.h
@@ -214,7 +214,7 @@  extern bool mips_legitimize_move (machine_mode, rtx, rtx);
 
 extern rtx mips_subword (rtx, bool);
 extern bool mips_split_move_p (rtx, rtx, enum mips_split_type);
-extern void mips_split_move (rtx, rtx, enum mips_split_type);
+extern void mips_split_move (rtx, rtx, enum mips_split_type, rtx);
 extern bool mips_split_move_insn_p (rtx, rtx, rtx);
 extern void mips_split_move_insn (rtx, rtx, rtx);
 extern void mips_split_128bit_move (rtx, rtx);
diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
index 48f324410b9..1de33b28c38 100644
--- a/gcc/config/mips/mips.c
+++ b/gcc/config/mips/mips.c
@@ -3031,7 +3031,7 @@  static void
 mips_emit_move_or_split (rtx dest, rtx src, enum mips_split_type split_type)
 {
   if (mips_split_move_p (dest, src, split_type))
-    mips_split_move (dest, src, split_type);
+    mips_split_move (dest, src, split_type, NULL);
   else
     mips_emit_move (dest, src);
 }
@@ -4780,10 +4780,11 @@  mips_split_move_p (rtx dest, rtx src, enum mips_split_type split_type)
 }
 
 /* Split a move from SRC to DEST, given that mips_split_move_p holds.
-   SPLIT_TYPE describes the split condition.  */
+   SPLIT_TYPE describes the split condition.  INSN is the insn being
+   split, if we know it, NULL otherwise.  */
 
 void
-mips_split_move (rtx dest, rtx src, enum mips_split_type split_type)
+mips_split_move (rtx dest, rtx src, enum mips_split_type split_type, rtx insn_)
 {
   rtx low_dest;
 
@@ -4843,6 +4844,21 @@  mips_split_move (rtx dest, rtx src, enum mips_split_type split_type)
 	  mips_emit_move (mips_subword (dest, true), mips_subword (src, true));
 	}
     }
+
+  /* This is a hack.  See if the next insn uses DEST and if so, see if we
+     can forward SRC for DEST.  This is most useful if the next insn is a
+     simple store.   */
+  rtx_insn *insn = (rtx_insn *)insn_;
+  if (insn)
+    {
+      rtx_insn *next = next_nonnote_nondebug_insn_bb (insn);
+      if (next)
+	{
+	  rtx set = single_set (next);
+	  if (set && SET_SRC (set) == dest)
+	    validate_change (next, &SET_SRC (set), src, false);
+	}
+    }
 }
 
 /* Return the split type for instruction INSN.  */
@@ -5070,7 +5086,7 @@  mips_split_move_insn_p (rtx dest, rtx src, rtx insn)
 void
 mips_split_move_insn (rtx dest, rtx src, rtx insn)
 {
-  mips_split_move (dest, src, mips_insn_split_type (insn));
+  mips_split_move (dest, src, mips_insn_split_type (insn), insn);
 }
 
 /* Return the appropriate instructions to move SRC into DEST.  Assume