diff mbox

combine/dce patch for PR36003, PR42575

Message ID 201006272305.05010.ebotcazou@adacore.com
State New
Headers show

Commit Message

Eric Botcazou June 27, 2010, 9:05 p.m. UTC
> Surely that's for RA/reload/sched which are naturally expensive?  Things
> like lower-subreg, ifcvt, dce tend to show up with 0% when I look at the
> time report if they show up with any time at all.

See Steven's recent figures: while lower-subreg is indeed cheap, ifcvt1 + 
ifcvt2 account for 0.91% and dce for 0.45% of the compilation time, without 
the DF overhead.  The combined DF overhead is 11%.

> Anyhow.  Here's a very simple implementation, grafted onto lower-subreg,
> which gets rid of the unnecessary insn in PR42575.  It also gets rid of
> all the byte_lr code :)  I have another version where it's done as an
> extra pass inside DCE.

lower-subreg should already be able to handle this but it cannot because 
expand_doubleword_mult generates unnecessarily tangled RTL.  The first 
attached patch is sufficient to untangle it, but this has an unexpected side 
effect on x86 (turning an LEA into ADD + MOVE).

Another approach is to teach lower-subreg to handle MULT on the RHS specially, 
as it already does for ASM_OPERANDS.  This is the second attached patch.  It 
has no effect on x86 (and SPARC and others) because of the ??? comment.

Both very lightly tested.


The part that gets rid of byte DF is OK if it is approved by a DF maintainer.


	* optabs.c (expand_doubleword_mult): If no target is specified, copy
	the result of the widening multiplication into a temporary before
	adjusting it.
	* stmt.c (expand_return): Do not assign the return value a constant
	temporary if it is scalar.


	* lower-subreg.c (special_source_operand): New static function.
	(simple_move): Use it to recognize special source operands.
	(resolve_simple_move): Handle special source operands specially.
	(decompose_multiword_subreg): Do not treat ASM_OPERANDS specially.

Comments

Paolo Bonzini June 28, 2010, 9:47 a.m. UTC | #1
On 06/27/2010 11:05 PM, Eric Botcazou wrote:
> +      /* ??? This probably disables the MULT case for several platforms.  */
> +      if (GET_CODE (PATTERN (insn)) == PARALLEL)
> +	return NULL_RTX;

single_set would be a better match, no?

Paolo
Eric Botcazou June 28, 2010, 9:47 a.m. UTC | #2
> single_set would be a better match, no?

What do you mean exactly?  That single_set returns NULL_RTX for PARALLEL?
Paolo Bonzini June 28, 2010, 9:54 a.m. UTC | #3
On 06/28/2010 11:47 AM, Eric Botcazou wrote:
>> single_set would be a better match, no?
>
> What do you mean exactly?  That single_set returns NULL_RTX for PARALLEL?

That it doesn't return NULL_RTX for CLOBBERs, USEs, or SETs whose output 
will not be used (like, most of the time, the CC set).

Paolo
Eric Botcazou June 28, 2010, 10:12 a.m. UTC | #4
> That it doesn't return NULL_RTX for CLOBBERs, USEs, or SETs whose output
> will not be used (like, most of the time, the CC set).

Yes, but that's irrelevant here, single_set is already called a few lines 
above.  The problem is that resolve_simple_move doesn't know (yet) how to 
reinstantiate the other members of the PARALLEL when it is copying the SRC.
Bernd Schmidt June 28, 2010, 10:38 a.m. UTC | #5
On 06/27/2010 11:05 PM, Eric Botcazou wrote:

> lower-subreg should already be able to handle this but it cannot because 
> expand_doubleword_mult generates unnecessarily tangled RTL.  The first 
> attached patch is sufficient to untangle it, but this has an unexpected side 
> effect on x86 (turning an LEA into ADD + MOVE).

It also seems to generally increase stack frames on ARM on a number of
testcases I looked at and often pessimizes multiplication code:

        ldr     ip, [r4, #40]        |   ldr     r1, [r4, #40]
        ldr     lr, .L351+232        |   ldr     ip, .L351+232
        umull   r0, r1, ip, lr       |   umull   r8, r9, r1, ip
        ldr     ip, [r4, #44]        |   ldr     r1, [r4, #44]
        mla     r1, lr, ip, r1       |   mov     r0, r8
                                     >   mla     r1, ip, r1, r9

I haven't looked at it further.

> Another approach is to teach lower-subreg to handle MULT on the RHS specially, 
> as it already does for ASM_OPERANDS.  This is the second attached patch.  It 
> has no effect on x86 (and SPARC and others) because of the ??? comment.

This produces better results in a number of cases, but it's not a clear
win.  In smallish testcases it often produces extra instructions due to
the general problem IRA currently has with tracking lifetimes of DImode
subwords.  If that gets fixed we might consider the patch, but right now
I don't think we can.

        smull   r4, r5, r2, ip
+       mov     ip, r4
        ldr     r6, [r0, r3]
-       mov     ip, r4, lsr #31
+       mov     r4, r5
+       mov     ip, ip, lsr #31

It also seems somewhat hackish to me to treat some operations specially
for no really good reason.  Why do we care about the type of the source
operand?  Can't we just decompose anything that occurs in a SET_DEST?
The fact that it does nothing on some targets is also a minus IMO.


Bernd
Eric Botcazou June 29, 2010, 5:38 p.m. UTC | #6
> This produces better results in a number of cases, but it's not a clear
> win.  In smallish testcases it often produces extra instructions due to
> the general problem IRA currently has with tracking lifetimes of DImode
> subwords.  If that gets fixed we might consider the patch, but right now
> I don't think we can.
>
>         smull   r4, r5, r2, ip
> +       mov     ip, r4
>         ldr     r6, [r0, r3]
> -       mov     ip, r4, lsr #31
> +       mov     r4, r5
> +       mov     ip, ip, lsr #31

The original testcase is smallish too so it would be interesting to understand 
why it helps for certain smallish testcases and pessimizes for others.

How does your can_decompose_p patch come into play here?

> It also seems somewhat hackish to me to treat some operations specially
> for no really good reason.  Why do we care about the type of the source
> operand?  Can't we just decompose anything that occurs in a SET_DEST?

We aren't trying to enhance lower-subreg, only to solve a problem related to
double-word multiplication.  The pass has a special provision for ASM_OPERANDS 
on the RHS because of a certain x86 pattern, it isn't unreasonable to extend 
it to another common pattern if this proves beneficial in most cases.

> The fact that it does nothing on some targets is also a minus IMO.

This can presumably be easily fixed.
Bernd Schmidt July 2, 2010, 2:22 p.m. UTC | #7
On 06/29/2010 07:38 PM, Eric Botcazou wrote:
>> This produces better results in a number of cases, but it's not a clear
>> win.  In smallish testcases it often produces extra instructions due to
>> the general problem IRA currently has with tracking lifetimes of DImode
>> subwords.  If that gets fixed we might consider the patch, but right now
>> I don't think we can.
>>
>>         smull   r4, r5, r2, ip
>> +       mov     ip, r4
>>         ldr     r6, [r0, r3]
>> -       mov     ip, r4, lsr #31
>> +       mov     r4, r5
>> +       mov     ip, ip, lsr #31
> 
> The original testcase is smallish too so it would be interesting to understand 
> why it helps for certain smallish testcases and pessimizes for others.

As I said, it's the known IRA deficiency.  Your patch creates more
instances of

 (set (reg:DI x) (something))
 (set (reg:SI a) (subreg x 0))
 (set (reg:SI b) (subreg x 1))

where IRA treats x as conflicting with a.  If we can fix this in IRA, we
should retest this patch (or something similar).

> How does your can_decompose_p patch come into play here?

Not at all, I think.  It's a different problem, we're not creating
superfluous DImode pseudos here.

>> It also seems somewhat hackish to me to treat some operations specially
>> for no really good reason.  Why do we care about the type of the source
>> operand?  Can't we just decompose anything that occurs in a SET_DEST?
> 
> We aren't trying to enhance lower-subreg,

I think we are, if that's the best solution for the problem.

IMO my mini-DCE is unambiguously beneficial as it only deletes
instructions, it should be free in the normal case since it's guarded by
the "First see if there are any multi-word pseudo-registers" test, and
cheap even when there are multi-word pseudos.  It's not explicitly tied
to multiplications.  On the other hand it has very little effect in
general beyond fixing the testcase.  Since we may be able to do better,
I'm withdrawing the patch for now, until we decide whether IRA will be
enhanced to cope better with DImode lifetimes.

Once (if) IRA is fixed, a plausible approach for lower-subreg would be
to decompose DImode regs that are set once in a non-decomposable
context.  After the initializing insn, we can move its subregs to new
SImode pseudos, just as we're doing for ASM_OPERANDS now.


Bernd
diff mbox

Patch

Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 161426)
+++ lower-subreg.c	(working copy)
@@ -67,6 +67,24 @@  static bitmap non_decomposable_context;
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
+/* Return nonzero if X is a special source operand for which we may want to
+   decompose the destination.  The value is the number of sub-operands.  */
+
+static int
+special_source_operand (rtx x)
+{
+  /* We handle ASM_OPERANDS as it is beneficial for things like x86 rdtsc
+     which returns a DImode value.  */
+  if (GET_CODE (x) == ASM_OPERANDS)
+    return 1;
+
+  /* We handle MULT as it is beneficial for double-word multiplication.  */
+  if (GET_CODE (x) == MULT)
+    return 2;
+
+  return 0;
+}
+
 /* Return whether X is a simple object which we can take a word_mode
    subreg of.  */
 
@@ -100,11 +118,12 @@  simple_move_operand (rtx x)
 static rtx
 simple_move (rtx insn)
 {
-  rtx x;
-  rtx set;
+  const int n = recog_data.n_operands;
   enum machine_mode mode;
+  rtx set, x;
+  int sso;
 
-  if (recog_data.n_operands != 2)
+  if (n < 2)
     return NULL_RTX;
 
   set = single_set (insn);
@@ -118,13 +137,22 @@  simple_move (rtx insn)
     return NULL_RTX;
 
   x = SET_SRC (set);
-  if (x != recog_data.operand[0] && x != recog_data.operand[1])
-    return NULL_RTX;
-  /* For the src we can handle ASM_OPERANDS, and it is beneficial for
-     things like x86 rdtsc which returns a DImode value.  */
-  if (GET_CODE (x) != ASM_OPERANDS
-      && !simple_move_operand (x))
-    return NULL_RTX;
+  sso = special_source_operand (x);
+  if (sso)
+    {
+      if (n != 1 + sso)
+	return NULL_RTX;
+      /* ??? This probably disables the MULT case for several platforms.  */
+      if (GET_CODE (PATTERN (insn)) == PARALLEL)
+	return NULL_RTX;
+    }
+  else
+    {
+      if (x != recog_data.operand[0] && x != recog_data.operand[1])
+	return NULL_RTX;
+      if (!simple_move_operand (x))
+	return NULL_RTX;
+    }
 
   /* We try to decompose in integer modes, to avoid generating
      inefficient code copying between integer and floating point
@@ -714,16 +742,23 @@  resolve_simple_move (rtx set, rtx insn)
       gcc_assert (acg);
     }
 
+  /* If SRC is a special operand, we need to move via a temporary register.  */
+  if (special_source_operand (src))
+    {
+      int acg;
+      rtx reg = gen_reg_rtx (orig_mode);
+      for_each_rtx (&src, resolve_subreg_use, NULL_RTX);
+      acg = apply_change_group ();
+      gcc_assert (acg);
+      emit_move_insn (reg, src);
+      src = reg;
+    }
+
   /* If SRC is a register which we can't decompose, or has side
      effects, we need to move via a temporary register.  */
-
-  if (!can_decompose_p (src)
-      || side_effects_p (src)
-      || GET_CODE (src) == ASM_OPERANDS)
+  else if (!can_decompose_p (src) || side_effects_p (src))
     {
-      rtx reg;
-
-      reg = gen_reg_rtx (orig_mode);
+      rtx reg = gen_reg_rtx (orig_mode);
       emit_move_insn (reg, src);
       src = reg;
     }
@@ -1104,7 +1139,7 @@  decompose_multiword_subregs (void)
 	{
 	  rtx set;
 	  enum classify_move_insn cmi;
-	  int i, n;
+	  int i;
 
 	  if (!INSN_P (insn)
 	      || GET_CODE (PATTERN (insn)) == CLOBBER
@@ -1129,25 +1164,11 @@  decompose_multiword_subregs (void)
 		cmi = SIMPLE_MOVE;
 	    }
 
-	  n = recog_data.n_operands;
-	  for (i = 0; i < n; ++i)
-	    {
-	      for_each_rtx (&recog_data.operand[i],
-			    find_decomposable_subregs,
-			    &cmi);
-
-	      /* We handle ASM_OPERANDS as a special case to support
-		 things like x86 rdtsc which returns a DImode value.
-		 We can decompose the output, which will certainly be
-		 operand 0, but not the inputs.  */
-
-	      if (cmi == SIMPLE_MOVE
-		  && GET_CODE (SET_SRC (set)) == ASM_OPERANDS)
-		{
-		  gcc_assert (i == 0);
-		  cmi = NOT_SIMPLE_MOVE;
-		}
-	    }
+	  for (i = 0; i < recog_data.n_operands; ++i)
+	    for_each_rtx (&recog_data.operand[i],
+			  find_decomposable_subregs,
+			  &cmi);
+
 	}
     }