
[combine,RFC] Don't transform sign and zero extends inside mults

Message ID 20151104235015.GA13203@gate.crashing.org
State New

Commit Message

Segher Boessenkool Nov. 4, 2015, 11:50 p.m. UTC
Hi Kyrill,

On Mon, Nov 02, 2015 at 02:15:27PM +0000, Kyrill Tkachov wrote:
> This patch attempts to prevent combine from transforming ZERO_EXTEND and
> SIGN_EXTEND operations into and-bitmask and weird SUBREG expressions when
> they appear inside MULT expressions, because a MULT rtx containing these
> extend operations is usually a well-understood widening multiply operation.

Right.  I have often wanted combine to try plain substitution as well
as substitution and simplification: simplification (which uses nonzero_bits
etc.) often makes a mess of even moderately complex instructions.

But since we do not have that yet, and your 24% number is nicely impressive,
let's try to do no simplification for widening mults, as in your patch.
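
(For concreteness, a minimal example of the widening multiply in question;
the function name here is made up for illustration:

long long
wmul (int a, int b)
{
  return (long long) a * b;
}

On a 64-bit target this expands to a MULT of two extends,

(mult:DI (sign_extend:DI (reg:SI ...))
         (sign_extend:DI (reg:SI ...)))

which on aarch64 should match a single smull instruction.)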

> With this patch on aarch64 I saw increased generation of smaddl and umaddl
> instructions where previously the combination of smull and add instructions
> was rejected because the extend operands were being transformed into these
> weird equivalent expressions.
> 
> Overall, I see 24% more [su]maddl instructions generated for SPEC2006,
> with no performance regressions.
> 
> The testcase in the patch is the smallest one I could produce that
> demonstrates the issue I'm trying to solve.
> 
> Does this approach look ok?

In broad outline it does.  Could you try this patch instead?  It is not
tested (other than building an aarch64 cross and trying your test case);
I'll do that tomorrow if you like the result.
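
(Similarly, the [su]maddl fusion discussed above corresponds to source like
the following; the function name is again made up:

long long
wmadd (int a, int b, long long c)
{
  return c + (long long) a * b;
}

which aarch64 can compile to a single smaddl x0, w0, w1, x2 once combine
merges the widening multiply with the add.)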


Segher

Comments

Kyrylo Tkachov Nov. 5, 2015, 12:01 p.m. UTC | #1
Hi Segher,

On 04/11/15 23:50, Segher Boessenkool wrote:
> Hi Kyrill,
>
> On Mon, Nov 02, 2015 at 02:15:27PM +0000, Kyrill Tkachov wrote:
>> This patch attempts to prevent combine from transforming ZERO_EXTEND and
>> SIGN_EXTEND operations into and-bitmask and weird SUBREG expressions when
>> they appear inside MULT expressions, because a MULT rtx containing these
>> extend operations is usually a well-understood widening multiply operation.
> Right.  I have often wanted combine to try plain substitution as well
> as substitution and simplification: simplification (which uses nonzero_bits
> etc.) often makes a mess of even moderately complex instructions.
>
> But since we do not have that yet, and your 24% number is nicely impressive,
> let's try to do no simplification for widening mults, as in your patch.
>
>> With this patch on aarch64 I saw increased generation of smaddl and umaddl
>> instructions where previously the combination of smull and add instructions
>> was rejected because the extend operands were being transformed into these
>> weird equivalent expressions.
>>
>> Overall, I see 24% more [su]maddl instructions generated for SPEC2006,
>> with no performance regressions.
>>
>> The testcase in the patch is the smallest one I could produce that
>> demonstrates the issue I'm trying to solve.
>>
>> Does this approach look ok?
> In broad outline it does.  Could you try this patch instead?  It is not
> tested (other than building an aarch64 cross and trying your test case);
> I'll do that tomorrow if you like the result.

Thanks, that looks less intrusive.  I tried it out on arm and aarch64.
It does work on the aarch64 testcase.  However, there is also a correctness
regression, which I'll try to explain inline...

>
>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index c3db2e0..3bf7ffb 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -5284,6 +5284,15 @@ subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
>   	      || GET_CODE (SET_DEST (x)) == PC))
>   	fmt = "ie";
>   
> +      /* Substituting into the operands of a widening MULT is not likely
> +	 to create RTL matching a machine insn.  */
> +      if (code == MULT
> +	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +	return x;

I think we should also add:
       && REG_P (XEXP (XEXP (x, 0), 0))
       && REG_P (XEXP (XEXP (x, 1), 0))

to the condition.  Otherwise I see regressions in the arm testsuite, in
particular the gcc.target/arm/smlatb-1.s test, which tries to match the pattern
(define_insn "*maddhisi4tb"
   [(set (match_operand:SI 0 "s_register_operand" "=r")
     (plus:SI (mult:SI (ashiftrt:SI
                (match_operand:SI 1 "s_register_operand" "r")
                (const_int 16))
               (sign_extend:SI
                (match_operand:HI 2 "s_register_operand" "r")))
          (match_operand:SI 3 "s_register_operand" "r")))]


There we have a sign_extend of a shift that we want to convert to the form
the pattern expects (this is the ARM smlatb instruction, which multiplies
the top 16 bits of one register by the bottom 16 bits of another and
accumulates).  So adding the REG_P checks fixes that for me, as spelled
out below.
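
That is, the subst condition with those checks folded in would read
(an untested sketch):

       if (code == MULT
           && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
               || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
           && REG_P (XEXP (XEXP (x, 0), 0))
           && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
               || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND)
           && REG_P (XEXP (XEXP (x, 1), 0)))
         return x;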

For the correctness issue I saw on aarch64, the smallest case I could
reduce it to is:

short int a[16], b[16];
void
f5 (void)
{
  a[0] = b[0] / 14;
}

We synthesise the division and lose one of the intermediate immediates.
The good codegen is:
f5:
        adrp    x1, b
        mov     w0, 9363            // <--- This gets lost!
        movk    w0, 0x9249, lsl 16  // <--- This gets lost!
        adrp    x2, a
        ldrsh   w1, [x1, #:lo12:b]
        smull   x0, w1, w0
        lsr     x0, x0, 32
        add     w0, w1, w0
        asr     w0, w0, 3
        sub     w0, w0, w1, asr 31
        strh    w0, [x2, #:lo12:a]
        ret
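
(The two instructions that get lost build the magic multiplier for the
division by 14: 9363 is 0x2493, the movk puts 0x9249 in the top half,
giving 0x92492493 = 2454267027 = ceil (2^35 / 14), which read as a signed
32-bit value is the -1840700269 that appears in the subst below.)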

The bad one with this patch is:
        adrp    x1, b
        adrp    x2, a
        ldrsh   w1, [x1, #:lo12:b]
        smull   x0, w1, w0
        lsr     x0, x0, 32
        add     w0, w1, w0
        asr     w0, w0, 3
        sub     w0, w0, w1, asr 31
        strh    w0, [x2, #:lo12:a]
        ret


The problematic hunk is the subst one.  In the expression 'x':

(mult:DI (sign_extend:DI (reg:SI 80 [ b ]))
         (sign_extend:DI (reg:SI 82)))

it tries to substitute 'from': (reg:SI 82)
with 'to': (const_int -1840700269 [0xffffffff92492493])

Since we now return just 'x', combine thinks that the substitution
succeeded and eliminates the immediate move (the mov/movk pair above).
Is there a way that subst can signal some kind of "failed to substitute"
result?  If not, I got it to work by using that check to set the in_dest
argument in the subsequent recursive call to subst, in a similar way to my
original patch, but I was hoping to avoid overloading the meaning of
in_dest.  A rough sketch of that fallback follows below.
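
For reference, a rough and untested sketch of that fallback, using the
widening-mult check to drive the in_dest argument of the recursive call
(the variable name and the exact recursion site are simplified from the
real combine.c code):

      /* Hypothetical: substitute into the operands of a widening MULT,
         but suppress the follow-up simplification by recursing with
         in_dest set.  */
      int widening_mult_p
        = (code == MULT
           && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
               || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
           && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
               || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND));
      ...
      new_rtx = subst (XEXP (x, i), from, to,
                       in_dest || widening_mult_p, in_cond, unique_copy);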

Thanks,
Kyrill

> +
>         /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
>   	 constant.  */
>         if (fmt[0] == 'e')
> @@ -8455,6 +8464,15 @@ force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
>         /* ... fall through ...  */
>   
>       case MULT:
> +      /* Substituting into the operands of a widening MULT is not likely to
> +	 create RTL matching a machine insn.  */
> +      if (code == MULT
> +	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +	return gen_lowpart_or_truncate (mode, x);
> +
>         /* For PLUS, MINUS and MULT, we need any bits less significant than the
>   	 most significant bit in MASK since carries from those bits will
>   	 affect the bits we are interested in.  */

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index c3db2e0..3bf7ffb 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -5284,6 +5284,15 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	      || GET_CODE (SET_DEST (x)) == PC))
 	fmt = "ie";
 
+      /* Substituting into the operands of a widening MULT is not likely
+	 to create RTL matching a machine insn.  */
+      if (code == MULT
+	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
+	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
+	return x;
+
       /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
 	 constant.  */
       if (fmt[0] == 'e')
@@ -8455,6 +8464,15 @@  force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
       /* ... fall through ...  */
 
     case MULT:
+      /* Substituting into the operands of a widening MULT is not likely to
+	 create RTL matching a machine insn.  */
+      if (code == MULT
+	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
+	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
+	return gen_lowpart_or_truncate (mode, x);
+
       /* For PLUS, MINUS and MULT, we need any bits less significant than the
 	 most significant bit in MASK since carries from those bits will
 	 affect the bits we are interested in.  */