
[combine,RFC] Don't transform sign and zero extends inside mults

Message ID 56376FFF.3070008@arm.com
State New

Commit Message

Kyrylo Tkachov Nov. 2, 2015, 2:15 p.m. UTC
Hi all,

This patch attempts to restrict combine from transforming ZERO_EXTEND and SIGN_EXTEND operations into and-bitmask
and weird SUBREG expressions when they appear inside MULT expressions. This is because a MULT rtx containing these
extend operations is usually a well understood widening multiply operation.
However, if the costs for simple zero or sign-extend moves happen to line up in a particular way,
expand_compound_operation will end up mangling a perfectly innocent extend+mult+add rtx like:
  (set (reg/f:DI 393)
     (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
             (sign_extend:DI (reg:SI 606)))
          (reg:DI 600)))

into:
  (set (reg/f:DI 393)
     (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
                 (const_int 202 [0xca]))
             (sign_extend:DI (reg/v:SI 425 [ selected ])))
          (reg:DI 600)))

because it decides that the and-subreg form is cheaper than a sign-extend, even though
the resultant expression can't hope to match a widening multiply-add pattern anymore.
The and-subreg form may well be cheaper as the SET_SRC of a simple set, but is unlikely
to be meaningful when inside a MULT expression.
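For reference, a minimal C function of the kind that expands to such an extend+mult+add rtx
(a hypothetical illustration, not the testcase from the patch) is:

  /* Hypothetical example: a signed widening multiply-accumulate.
     With the sign_extends preserved inside the MULT, aarch64 can match
     this as a single smaddl; with the and-subreg form it cannot.  */
  long long
  widening_madd (int a, int b, long long acc)
  {
    return acc + (long long) a * (long long) b;
  }
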
AFAIK the accepted way of representing widening multiplies is with MULT expressions containing
zero/sign-extends, so rather than adding duplicate patterns in the backend to represent the different
ways an extend operation can be expressed by combine, I think we should stop combine from mangling
the representation where possible.

This patch fixes that by stopping the transformation on the extend operands of a MULT when the target
has a mode-appropriate widening multiply optab, in the two places where I saw this happen.

With this patch on aarch64 I saw increased generation of smaddl and umaddl instructions in places
where the combination of smull and add instructions had previously been rejected because the extend
operands were being transformed into these weird equivalent expressions.

Overall, I see 24% more [su]maddl instructions being generated for SPEC2006, with no performance regressions.

The testcase in the patch is the most minimal one I could get that demonstrates the issue I'm trying to solve.

Does this approach look ok?

Thanks,
Kyrill

2015-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * combine.c (subst): Restrict transformation when handling extend
     ops inside a MULT.
     (force_to_mode, binop case): Likewise.

2015-11-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/umaddl_combine_1.c: New test.

Comments

Jeff Law Nov. 2, 2015, 10:31 p.m. UTC | #1
On 11/02/2015 07:15 AM, Kyrill Tkachov wrote:
> Hi all,
>
> This patch attempts to restrict combine from transforming ZERO_EXTEND
> and SIGN_EXTEND operations into and-bitmask
> and weird SUBREG expressions when they appear inside MULT expressions.
> This is because a MULT rtx containing these
> extend operations is usually a well understood widening multiply operation.
> However, if the costs for simple zero or sign-extend moves happen to
> line up in a particular way,
> expand_compound_operation will end up mangling a perfectly innocent
> extend+mult+add rtx like:
>   (set (reg/f:DI 393)
>      (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
>              (sign_extend:DI (reg:SI 606)))
>           (reg:DI 600)))
>
> into:
>   (set (reg/f:DI 393)
>      (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
>                  (const_int 202 [0xca]))
>              (sign_extend:DI (reg/v:SI 425 [ selected ])))
>           (reg:DI 600)))
Going to leave the review side of this for Segher.

If you decide to go forward, there's a section in md.texi WRT 
canonicalization of these RTL codes that probably would need updating. 
Just search for "canonicalization" section and read down a ways.


Jeff
Kyrylo Tkachov Nov. 4, 2015, 11:37 a.m. UTC | #2
Hi Jeff,

On 02/11/15 22:31, Jeff Law wrote:
> On 11/02/2015 07:15 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> This patch attempts to restrict combine from transforming ZERO_EXTEND
>> and SIGN_EXTEND operations into and-bitmask
>> and weird SUBREG expressions when they appear inside MULT expressions.
>> This is because a MULT rtx containing these
>> extend operations is usually a well understood widening multiply operation.
>> However, if the costs for simple zero or sign-extend moves happen to
>> line up in a particular way,
>> expand_compound_operation will end up mangling a perfectly innocent
>> extend+mult+add rtx like:
>>   (set (reg/f:DI 393)
>>      (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
>>              (sign_extend:DI (reg:SI 606)))
>>           (reg:DI 600)))
>>
>> into:
>>   (set (reg/f:DI 393)
>>      (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
>>                  (const_int 202 [0xca]))
>>              (sign_extend:DI (reg/v:SI 425 [ selected ])))
>>           (reg:DI 600)))
> Going to leave the review side of this for Segher.
>
> If you decide to go forward, there's a section in md.texi WRT canonicalization of these RTL codes that probably would need updating. Just search for "canonicalization" section and read down a ways.
>

You mean document a canonical form for these operations as (mult (extend op1) (extend op2))?

>
> Jeff
>
>
Kyrylo Tkachov Nov. 4, 2015, 1:32 p.m. UTC | #3
On 04/11/15 11:37, Kyrill Tkachov wrote:
>
> On 02/11/15 22:31, Jeff Law wrote:
>> On 11/02/2015 07:15 AM, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>> This patch attempts to restrict combine from transforming ZERO_EXTEND
>>> and SIGN_EXTEND operations into and-bitmask
>>> and weird SUBREG expressions when they appear inside MULT expressions.
>>> This is because a MULT rtx containing these
>>> extend operations is usually a well understood widening multiply operation.
>>> However, if the costs for simple zero or sign-extend moves happen to
>>> line up in a particular way,
>>> expand_compound_operation will end up mangling a perfectly innocent
>>> extend+mult+add rtx like:
>>>   (set (reg/f:DI 393)
>>>      (plus:DI (mult:DI (sign_extend:DI (reg/v:SI 425 [ selected ]))
>>>              (sign_extend:DI (reg:SI 606)))
>>>           (reg:DI 600)))
>>>
>>> into:
>>>   (set (reg/f:DI 393)
>>>      (plus:DI (mult:DI (and:DI (subreg:DI (reg:SI 606) 0)
>>>                  (const_int 202 [0xca]))
>>>              (sign_extend:DI (reg/v:SI 425 [ selected ])))
>>>           (reg:DI 600)))
>> Going to leave the review side of this for Segher.
>>
>> If you decide to go forward, there's a section in md.texi WRT canonicalization of these RTL codes that probably would need updating. Just search for "canonicalization" section and read down a ways.
>>
>
> You mean document a canonical form for these operations as (mult (extend op1) (extend op2)) ?
>

Looking at the issue more deeply, I think the concrete problem is the code in expand_compound_operation:
<code>
   /* If we reach here, we want to return a pair of shifts.  The inner
      shift is a left shift of BITSIZE - POS - LEN bits.  The outer
      shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
      logical depending on the value of UNSIGNEDP.

      If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will be
      converted into an AND of a shift.

      We must check for the case where the left shift would have a negative
      count.  This can happen in a case like (x >> 31) & 255 on machines
      that can't shift by a constant.  On those machines, we would first
      combine the shift with the AND to produce a variable-position
      extraction.  Then the constant of 31 would be substituted in
      to produce such a position.  */

   modewidth = GET_MODE_PRECISION (GET_MODE (x));
   if (modewidth >= pos + len)
     {
       machine_mode mode = GET_MODE (x);
       tem = gen_lowpart (mode, XEXP (x, 0));
       if (!tem || GET_CODE (tem) == CLOBBER)
     return x;
       tem = simplify_shift_const (NULL_RTX, ASHIFT, mode,
                   tem, modewidth - pos - len);
       tem = simplify_shift_const (NULL_RTX, unsignedp ? LSHIFTRT : ASHIFTRT,
                   mode, tem, modewidth - len);
     }
   else if (unsignedp && len < HOST_BITS_PER_WIDE_INT)
     tem = simplify_and_const_int (NULL_RTX, GET_MODE (x),
                   simplify_shift_const (NULL_RTX, LSHIFTRT,
                             GET_MODE (x),
                             XEXP (x, 0), pos),
                   ((unsigned HOST_WIDE_INT) 1 << len) - 1);
   else
     /* Any other cases we can't handle.  */
     return x;
</code>

This code ends up unconditionally transforming (zero_extend:DI (reg:SI 125)) into:
(and:DI (subreg:DI (reg:SI 125) 0)
     (const_int 1 [0x1]))

which then gets substituted as an operand of the mult, and nothing matches after that.
Archaeology shows this code has been there since at least 1992. I guess my question is:
why do we do this unconditionally? Should we be checking whether the zero_extend form
is cheaper than whatever simplify_shift_const returns?
I don't think what expand_compound_operation is trying to do here is canonicalisation...
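To make the question concrete, a rough sketch of the kind of check I have in mind (untested, and
the exact rtx cost interface may differ from what's on trunk) would be something like:

<code>
  /* Sketch only: before committing to the shift/and rewrite held in TEM,
     compare its cost against the original extend form X and keep the
     original if the rewrite is not strictly cheaper.  Assumes a
     set_src_cost-style query; the exact arguments depend on the tree.  */
  if (set_src_cost (tem, GET_MODE (x), optimize_this_for_speed_p)
      >= set_src_cost (x, GET_MODE (x), optimize_this_for_speed_p))
    return x;
</code>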

Thanks,
Kyrill


>>
>> Jeff
>>
>>
>

Patch

commit 2005243d896cbeb3d22421a63f080699ab357610
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Fri Oct 30 13:56:10 2015 +0000

    [combine] Don't transform sign and zero extends inside mults

diff --git a/gcc/combine.c b/gcc/combine.c
index fa1a93f..be0f5ae 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -5369,6 +5369,7 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 		  n_occurrences++;
 		}
 	      else
+		{
 		/* If we are in a SET_DEST, suppress most cases unless we
 		   have gone inside a MEM, in which case we want to
 		   simplify the address.  We assume here that things that
@@ -5376,15 +5377,33 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 		   parts in the first expression.  This is true for SUBREG,
 		   STRICT_LOW_PART, and ZERO_EXTRACT, which are the only
 		   things aside from REG and MEM that should appear in a
-		   SET_DEST.  */
-		new_rtx = subst (XEXP (x, i), from, to,
+		   SET_DEST.  Also, restrict transformations on SIGN_EXTENDs
+		   and ZERO_EXTENDs if they appear inside multiplications if
+		   the target supports widening multiplies.  This is to avoid
+		   mangling such expressions beyond recognition.  */
+		  bool restrict_extend_p = false;
+		  rtx_code inner_code = GET_CODE (XEXP (x, i));
+
+		  if (code == MULT
+		      && (inner_code == SIGN_EXTEND
+			  || inner_code == ZERO_EXTEND)
+		      && widening_optab_handler (inner_code == ZERO_EXTEND
+						    ? umul_widen_optab
+						    : smul_widen_optab,
+						  GET_MODE (x),
+						  GET_MODE (XEXP (XEXP (x, i), 0)))
+			 != CODE_FOR_nothing)
+		    restrict_extend_p = true;
+
+		  new_rtx = subst (XEXP (x, i), from, to,
 			     (((in_dest
 				&& (code == SUBREG || code == STRICT_LOW_PART
 				    || code == ZERO_EXTRACT))
 			       || code == SET)
-			      && i == 0),
+			      && i == 0) || restrict_extend_p,
 				 code == IF_THEN_ELSE && i == 0,
 				 unique_copy);
+		}
 
 	      /* If we found that we will have to reject this combination,
 		 indicate that by returning the CLOBBER ourselves, rather than
@@ -8509,8 +8528,30 @@  force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
       /* For most binary operations, just propagate into the operation and
 	 change the mode if we have an operation of that mode.  */
 
-      op0 = force_to_mode (XEXP (x, 0), mode, mask, next_select);
-      op1 = force_to_mode (XEXP (x, 1), mode, mask, next_select);
+      op0 = XEXP (x, 0);
+      op1 = XEXP (x, 1);
+
+      /* If we have a widening multiply avoid messing with the
+	 ZERO/SIGN_EXTEND operands so that we can still match the
+	 appropriate smul/umul patterns.  */
+      if (GET_CODE (x) == MULT
+	  && GET_CODE (op0) == GET_CODE (op1)
+	  && (GET_CODE (op0) == SIGN_EXTEND
+	      || GET_CODE (op0) == ZERO_EXTEND)
+	  && GET_MODE (op0) == GET_MODE (op1)
+	  && widening_optab_handler (GET_CODE (op0) == ZERO_EXTEND
+				      ? umul_widen_optab
+				      : smul_widen_optab,
+				     GET_MODE (x),
+				     GET_MODE (XEXP (op0, 0)))
+	       != CODE_FOR_nothing)
+	;
+      else
+	{
+	  op0 = force_to_mode (op0, mode, mask, next_select);
+	  op1 = force_to_mode (op1, mode, mask, next_select);
+	}
+
 
       /* If we ended up truncating both operands, truncate the result of the
 	 operation instead.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
new file mode 100644
index 0000000..703c94d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a53" } */
+
+enum reg_class
+{
+  NO_REGS,
+  AD_REGS,
+  ALL_REGS, LIM_REG_CLASSES
+};
+
+extern enum reg_class
+  reg_class_subclasses[((int) LIM_REG_CLASSES)][((int) LIM_REG_CLASSES)];
+
+void
+init_reg_sets_1 ()
+{
+  unsigned int i, j;
+  {
+    for (j = i + 1; j < ((int) LIM_REG_CLASSES); j++)
+      {
+	enum reg_class *p;
+	p = &reg_class_subclasses[j][0];
+	while (*p != LIM_REG_CLASSES)
+	  p++;
+      }
+  }
+}
+
+/* { dg-final { scan-assembler-not "umull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */