
[combine,RFC] Don't transform sign and zero extends inside mults

Message ID 563CB6DE.7070106@arm.com
State New

Commit Message

Kyrylo Tkachov Nov. 6, 2015, 2:19 p.m. UTC
On 06/11/15 00:56, Segher Boessenkool wrote:
> On Thu, Nov 05, 2015 at 12:01:26PM +0000, Kyrill Tkachov wrote:
>> Thanks, that looks less intrusive. I did try it out on arm and aarch64.
>> It does work on the aarch64 testcase. However, there's also a correctness
>> regression; I'll try to explain inline.
>>> diff --git a/gcc/combine.c b/gcc/combine.c
>>> index c3db2e0..3bf7ffb 100644
>>> --- a/gcc/combine.c
>>> +++ b/gcc/combine.c
>>> @@ -5284,6 +5284,15 @@ subst (rtx x, rtx from, rtx to, int in_dest, int
>>> in_cond, int unique_copy)
>>>   	      || GET_CODE (SET_DEST (x)) == PC))
>>>   	fmt = "ie";
>>>   
>>> +      /* Substituting into the operands of a widening MULT is not likely
>>> +	 to create RTL matching a machine insn.  */
>>> +      if (code == MULT
>>> +	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
>>> +	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
>>> +	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
>>> +	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
>>> +	return x;
>> I think we should also add:
>>        && REG_P (XEXP (XEXP (x, 0), 0))
>>        && REG_P (XEXP (XEXP (x, 1), 0))
>>
>> to the condition. Otherwise I've seen regressions in the arm testsuite, in
>> particular the gcc.target/arm/smlatb-1.s test, which tries to match the pattern:
>> (define_insn "*maddhisi4tb"
>>    [(set (match_operand:SI 0 "s_register_operand" "=r")
>>      (plus:SI (mult:SI (ashiftrt:SI
>>                 (match_operand:SI 1 "s_register_operand" "r")
>>                 (const_int 16))
>>                (sign_extend:SI
>>                 (match_operand:HI 2 "s_register_operand" "r")))
>>           (match_operand:SI 3 "s_register_operand" "r")))]
>>
>>
>> There we have a sign_extend of a shift that we want to convert to the form
>> that the pattern expects. So adding the checks for REG_P fixes that for me.
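
To make that concrete, the test exercises code roughly of this shape (my
reconstruction, not the actual testcase):

/* Rough reconstruction of what smlatb-1 exercises; the function name
   and signature are illustrative.  */
int
smlatb_like (int x, short y, int acc)
{
  /* (x >> 16) is the sign-extended top half of x; y sign-extends to
     SImode.  On ARM this should fold into a single smlatb.  */
  return acc + (x >> 16) * y;
}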
> I'll have a look at this; I thought it should be handled with the new patch
> (attached), but maybe not.
>
>> For the correctness issue I saw on aarch64 the shortest case I could reduce
>> is:
>> short int a[16], b[16];
>> void
>> f5 (void)
>> {
>>    a[0] = b[0] / 14;
>> }
> (Without -mcpu=cortex-a53, or you just get a divide insn).
>
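
(A gloss of mine, not from the thread: division by a constant is expanded
as a widening multiply by a precomputed reciprocal, which is why this
testcase reaches the widening-MULT handling in combine.  A sketch of the
idea, using a constant I chose for the 16-bit input range rather than
GCC's actual magic number:)

/* Sketch only: 74899 = ceil(2^20 / 14), exact for 0 <= v < 2^16.
   GCC picks its own constant, shift and sign fix-ups.  */
int
div14 (short n)
{
  int v = n < 0 ? -n : n;                         /* |n| <= 32768 */
  int q = (int) (((long long) v * 74899) >> 20);  /* widening multiply */
  return n < 0 ? -q : q;
}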
>> Is there a way that subst can signal some kind of "failed to substitute"
>> result?
> Yep, see new patch.  The "from == to" condition is for when subst is called
> just to simplify some code (normally with pc_rtx, pc_rtx).

Indeed, this looks better but it still needs the REG_P checks for the inner
operands of the extends to not screw up the arm case.
Here's my proposed extension. I've also modified the testcase slightly to not use
an uninitialised variable. It still demonstrates the issue we're trying to solve.

Bootstrapped and tested on arm and aarch64.
I'll let you put it through its paces on your setup :)

P.S. Do we want to restrict this to targets that have a widening mul optab like I did
in the original patch?

Thanks,
Kyrill

2015-11-06  Segher Boessenkool  <segher@kernel.crashing.org>
             Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * combine.c (subst): Don't substitute or simplify when
     handling register-wise widening multiply.
     (force_to_mode): Likewise.

2015-11-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/umaddl_combine_1.c: New test.

>> If not, I got it to work by using that check to set the in_dest variable for
>> the
>> subsequent recursive call to subst, in a similar way to my original patch,
>> but I was
>> hoping to avoid overloading the meaning of in_dest.
> Yes me too :-)
>
>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index c3db2e0..c806db9 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -5284,6 +5284,20 @@ subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
>   	      || GET_CODE (SET_DEST (x)) == PC))
>   	fmt = "ie";
>   
> +      /* Substituting into the operands of a widening MULT is not likely
> +	 to create RTL matching a machine insn.  */
> +      if (code == MULT
> +	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +	{
> +	  if (from == to)
> +	    return x;
> +	  else
> +	    return gen_rtx_CLOBBER (GET_MODE (x), const0_rtx);
> +	}
> +
>         /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
>   	 constant.  */
>         if (fmt[0] == 'e')
> @@ -8455,6 +8469,15 @@ force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
>         /* ... fall through ...  */
>   
>       case MULT:
> +      /* Substituting into the operands of a widening MULT is not likely to
> +	 create RTL matching a machine insn.  */
> +      if (code == MULT
> +	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
> +	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
> +	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND))
> +	return gen_lowpart_or_truncate (mode, x);
> +
>         /* For PLUS, MINUS and MULT, we need any bits less significant than the
>   	 most significant bit in MASK since carries from those bits will
>   	 affect the bits we are interested in.  */

Comments

Jeff Law Nov. 6, 2015, 9:14 p.m. UTC | #1
On 11/06/2015 07:19 AM, Kyrill Tkachov wrote:
>>> I think we should also add:
>>>        && REG_P (XEXP (XEXP (x, 0), 0))
>>>        && REG_P (XEXP (XEXP (x, 1), 0))
I tend to agree.

>> Yep, see new patch.  The "from == to" condition is for when subst is
>> called
>> just to simplify some code (normally with pc_rtx, pc_rtx).
>
> Indeed, this looks better but it still needs the REG_P checks for the inner
> operands of the extends to not screw up the arm case.
> Here's my proposed extension. I've also modified the testcase slightly
> to not use
> an uninitialised variable. It still demonstrates the issue we're trying to
> solve.
>
> Bootstrapped and tested on arm and aarch64.
> I'll let you put it through its paces on your setup :)
>
> P.S. Do we want to restrict this to targets that have a widening mul
> optab like I did in the original patch?
I don't think it's necessary or desirable.  Consider the case where the 
target has a limited widening multiply insn -- we may not want to expose 
it to the RTL expanders due to some limitation, but we still want the 
combiner to be able to create it under the right circumstances.

>
> Thanks,
> Kyrill
>
> 2015-11-06  Segher Boessenkool  <segher@kernel.crashing.org>
>              Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>      * combine.c (subst): Don't substitute or simplify when
>      handling register-wise widening multiply.
>      (force_to_mode): Likewise.
>
> 2015-11-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>      * gcc.target/aarch64/umaddl_combine_1.c: New test.
I'll let Segher give the final yes/no on this, but it generally looks 
good to me.

jeff
Segher Boessenkool Nov. 6, 2015, 10 p.m. UTC | #2
[ reordered a bit ]

On Fri, Nov 06, 2015 at 02:14:12PM -0700, Jeff Law wrote:
> On 11/06/2015 07:19 AM, Kyrill Tkachov wrote:
> >>>I think we should also add:
> >>>       && REG_P (XEXP (XEXP (x, 0), 0))
> >>>       && REG_P (XEXP (XEXP (x, 1), 0))
> I tend to agree.

> >Indeed, this looks better but it still needs the REG_P checks for the inner
> >operands of the extends to not screw up the arm case.

> >P.S. Do we want to restrict this to targets that have a widening mul
> >optab like I did in the original patch?
> I don't think it's necessary or desirable.

With the REG_P checks added, simplification is now only stopped for
widening muls of registers, so all is fine -- any such construct _is_
a widening multiplication!

This patch stops combine from generating widening muls of anything
but registers (immediates, memory, ...).  This probably is a reasonable
tradeoff for all targets, even those (if any) that have such insns.
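
To make that concrete (my example, not from the thread), the canonical
case the check now protects is a plain widening multiply of two register
values, e.g. a single smull/umull on aarch64:

/* Minimal widening multiply; both extended operands are registers,
   so the new check applies and combine leaves the MULT alone.  */
long long
wmul (int a, int b)
{
  return (long long) a * b;
}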

> >I'll let you put it through its paces on your setup :)

> I'll let Segher give the final yes/no on this, but it generally looks 
> good to me.

It looks okay to me too.  Testing now; combine patches have a tendency
to do unforeseen things on other targets ;-)


Segher
Segher Boessenkool Nov. 8, 2015, 8:58 p.m. UTC | #3
On Fri, Nov 06, 2015 at 04:00:08PM -0600, Segher Boessenkool wrote:
> This patch stops combine from generating widening muls of anything
> but registers (immediates, memory, ...).  This probably is a reasonable
> tradeoff for all targets, even those (if any) that have such insns.
> 
> > >I'll let you put it through its paces on your setup :)
> 
> > I'll let Segher give the final yes/no on this, but it generally looks 
> > good to me.
> 
> It looks okay to me too.  Testing now; combine patches have a tendency
> to do unforeseen things on other targets ;-)

Testing shows it makes a difference only very rarely.  For many targets
it makes no difference, for a few it is a small win.  For 32-bit x86 it
creates slightly bigger code.

I think it looks good, but let's wait to hear Uros' opinion.


Segher
Uros Bizjak Nov. 9, 2015, 7:52 a.m. UTC | #4
On Sun, Nov 8, 2015 at 9:58 PM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> On Fri, Nov 06, 2015 at 04:00:08PM -0600, Segher Boessenkool wrote:
>> This patch stops combine from generating widening muls of anything
>> but registers (immediates, memory, ...).  This probably is a reasonable
>> tradeoff for all targets, even those (if any) that have such insns.
>>
>> > >I'll let you put it through its paces on your setup :)
>>
>> > I'll let Segher give the final yes/no on this, but it generally looks
>> > good to me.
>>
>> It looks okay to me too.  Testing now; combine patches have a tendency
>> to do unforeseen things on other targets ;-)
>
> Testing shows it makes a difference only very rarely.  For many targets
> it makes no difference, for a few it is a small win.  For 32-bit x86 it
> creates slightly bigger code.
>
> I think it looks good, but let's wait to hear Uros' opinion.

From the original patch submission, it looks like this patch would
also benefit x86_32.

Regarding the above code size increase -  do you perhaps have a
testcase, to see what causes the difference? It isn't necessary due to
the patch, but perhaps some loads are moved to the insn and aren't
CSE'd anymore.

Uros.
Segher Boessenkool Nov. 9, 2015, 9:51 a.m. UTC | #5
On Mon, Nov 09, 2015 at 08:52:13AM +0100, Uros Bizjak wrote:
> On Sun, Nov 8, 2015 at 9:58 PM, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > On Fri, Nov 06, 2015 at 04:00:08PM -0600, Segher Boessenkool wrote:
> >> This patch stops combine from generating widening muls of anything
> >> but registers (immediates, memory, ...).  This probably is a reasonable
> >> tradeoff for all targets, even those (if any) that have such insns.
> >>
> >> > >I'll let you put it through its paces on your setup :)
> >>
> >> > I'll let Segher give the final yes/no on this, but it generally looks
> >> > good to me.
> >>
> >> It looks okay to me too.  Testing now; combine patches have a tendency
> >> to do unforeseen things on other targets ;-)
> >
> > Testing shows it makes a difference only very rarely.  For many targets
> > it makes no difference, for a few it is a small win.  For 32-bit x86 it
> > creates slightly bigger code.
> >
> > I think it looks good, but let's wait to hear Uros' opinion.
> 
> From the original patch submission, it looks like this patch would
> also benefit x86_32.

Yes, that is what I thought too.

> Regarding the above code size increase -  do you perhaps have a
> testcase, to see what causes the difference?

I could extract some.  It happens quite rarely on usual code.

> It isn't necessarily due to
> the patch, but perhaps some loads are moved to the insn and aren't
> CSE'd anymore.


Segher
Segher Boessenkool Nov. 10, 2015, 7:53 p.m. UTC | #6
On Mon, Nov 09, 2015 at 03:51:32AM -0600, Segher Boessenkool wrote:
> > From the original patch submission, it looks like this patch would
> > also benefit x86_32.
> 
> Yes, that is what I thought too.
> 
> > Regarding the above code size increase -  do you perhaps have a
> > testcase, to see what causes the difference?
> 
> I could extract some.  It happens quite rarely on usual code.
> 
> > It isn't necessarily due to
> > the patch, but perhaps some loads are moved to the insn and aren't
> > CSE'd anymore.

I don't have a small testcase yet.

What causes the degradation is that sometimes we end up with imul reg,reg
instead of imul mem,reg.  In the normal case we already have imul mem,reg
after expand, so the patch doesn't change anything in the normal case.
Even if expand didn't do it, fwprop would, I think.
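
A sketch of the affected shape (my guess, not an extracted testcase):
when one multiplicand comes from memory, the patch keeps the load out of
the widening multiply, so i386 can end up with a separate mov plus a
register-register multiply:

/* Hypothetical shape; on i386 the widening 32x32->64 multiply can take
   a memory operand, which combine will no longer form after the patch.  */
unsigned long long
f (unsigned int *p, unsigned int x)
{
  return (unsigned long long) *p * x;
}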

It also isn't LRA that is doing it; the MEMs in this case are not on the stack.
Maybe as you say some CSE pass.

For x86_64, which has many more registers than i386, often a peephole
fires that turns a  mov reg,reg ; imul mem,reg  into a  mov mem,reg ;
imul reg,reg, which makes the generated machine code identical with
or without the patch (tested on a Linux build, 12MB text).

The i386 size regression is 0.01% btw (comparable to the gains for
other targets).


Segher
Kyrylo Tkachov Nov. 13, 2015, 10:10 a.m. UTC | #7
Hi all,

On 10/11/15 19:53, Segher Boessenkool wrote:
> On Mon, Nov 09, 2015 at 03:51:32AM -0600, Segher Boessenkool wrote:
>>> From the original patch submission, it looks like this patch would
>>> also benefit x86_32.
>> Yes, that is what I thought too.
>>
>>> Regarding the above code size increase -  do you perhaps have a
>>> testcase, to see what causes the difference?
>> I could extract some.  It happens quite rarely on usual code.
>>
>>> It isn't necessarily due to
>>> the patch, but perhaps some loads are moved to the insn and aren't
>>> CSE'd anymore.
> I don't have a small testcase yet.
>
> What causes the degradation is that sometimes we end up with imul reg,reg
> instead of imul mem,reg.  In the normal case we already have imul mem,reg
> after expand, so the patch doesn't change anything in the normal case.
> Even if expand didn't do it, fwprop would, I think.
>
> It also isn't LRA that is doing it; the MEMs in this case are not on the stack.
> Maybe as you say some CSE pass.
>
> For x86_64, which has many more registers than i386, often a peephole
> fires that turns a  mov reg,reg ; imul mem,reg  into a  mov mem,reg ;
> imul reg,reg, which makes the generated machine code identical with
> or without the patch (tested on a Linux build, 12MB text).
>
> The i386 size regression is 0.01% btw (comparable to the gains for
> other targets).

So how shall we proceed? Should we try to come up with
a testcase? I wonder what RTL patterns combine
now tries on i386 that don't match...

Kyrill

>
>
> Segher
>
Uros Bizjak Nov. 13, 2015, 10:17 a.m. UTC | #8
On Fri, Nov 13, 2015 at 11:10 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
> Hi all,
>
> On 10/11/15 19:53, Segher Boessenkool wrote:
>>
>> On Mon, Nov 09, 2015 at 03:51:32AM -0600, Segher Boessenkool wrote:
>>>>
>>>> From the original patch submission, it looks like this patch would
>>>> also benefit x86_32.
>>>
>>> Yes, that is what I thought too.
>>>
>>>> Regarding the above code size increase -  do you perhaps have a
>>>> testcase, to see what causes the difference?
>>>
>>> I could extract some.  It happens quite rarely on usual code.
>>>
>>>> It isn't necessarily due to
>>>> the patch, but perhaps some loads are moved to the insn and aren't
>>>> CSE'd anymore.
>>
>> I don't have a small testcase yet.
>>
>> What causes the degradation is that sometimes we end up with imul reg,reg
>> instead of imul mem,reg.  In the normal case we already have imul mem,reg
>> after expand, so the patch doesn't change anything in the normal case.
>> Even if expand didn't do it, fwprop would, I think.
>>
>> It also isn't LRA that is doing it; the MEMs in this case are not on the stack.
>> Maybe as you say some CSE pass.
>>
>> For x86_64, which has many more registers than i386, often a peephole
>> fires that turns a  mov reg,reg ; imul mem,reg  into a  mov mem,reg ;
>> imul reg,reg, which makes the generated machine code identical with
>> or without the patch (tested on a Linux build, 12MB text).
>>
>> The i386 size regression is 0.01% btw (comparable to the gains for
>> other targets).
>
>
> So how shall we proceed? Should we try to come up with
> a testcase? I wonder what RTL patterns combine
> now tries on i386 that don't match...

IMO, this is such a small code-size regression that it should not
block the patch submission. It would be nice to know what causes the
increase (in case this is some systematic oversight), but we can live
with it.

Uros.
Segher Boessenkool Nov. 13, 2015, 3:01 p.m. UTC | #9
On Fri, Nov 13, 2015 at 11:17:38AM +0100, Uros Bizjak wrote:
> IMO, this is such a small code-size regression that it should not
> block the patch submission.

In that case: Kyrill, the patch is okay for trunk.  Thanks!

> It would be nice to know what causes the
> increase (in case this is some systematic oversight), but we can live
> with it.

After the patch it will no longer combine an imul reg,reg (+ mov) into an
imul mem,reg.  _Most_ cases that end up as mem,reg are already expanded
as such, but not all.  It's hard to make a smallish testcase.


Segher

Patch

commit 24d4a7f9f32e9290e95f940c1ac57fc9be26687c
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Fri Oct 30 13:56:10 2015 +0000

    [combine] Don't transform sign and zero extends inside mults

diff --git a/gcc/combine.c b/gcc/combine.c
index fa1a93f..26b14bf 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -5286,6 +5286,22 @@  subst (rtx x, rtx from, rtx to, int in_dest, int in_cond, int unique_copy)
 	      || GET_CODE (SET_DEST (x)) == PC))
 	fmt = "ie";
 
+      /* Substituting into the operands of a widening MULT is not likely
+	 to create RTL matching a machine insn.  */
+      if (code == MULT
+	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
+	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND)
+	  && REG_P (XEXP (XEXP (x, 0), 0))
+	  && REG_P (XEXP (XEXP (x, 1), 0)))
+	{
+	  if (from == to)
+	    return x;
+	  else
+	    return gen_rtx_CLOBBER (GET_MODE (x), const0_rtx);
+	}
+
       /* Get the mode of operand 0 in case X is now a SIGN_EXTEND of a
 	 constant.  */
       if (fmt[0] == 'e')
@@ -8447,6 +8463,17 @@  force_to_mode (rtx x, machine_mode mode, unsigned HOST_WIDE_INT mask,
       /* ... fall through ...  */
 
     case MULT:
+      /* Substituting into the operands of a widening MULT is not likely to
+	 create RTL matching a machine insn.  */
+      if (code == MULT
+	  && (GET_CODE (XEXP (x, 0)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 0)) == SIGN_EXTEND)
+	  && (GET_CODE (XEXP (x, 1)) == ZERO_EXTEND
+	      || GET_CODE (XEXP (x, 1)) == SIGN_EXTEND)
+	  && REG_P (XEXP (XEXP (x, 0), 0))
+	  && REG_P (XEXP (XEXP (x, 1), 0)))
+	return gen_lowpart_or_truncate (mode, x);
+
       /* For PLUS, MINUS and MULT, we need any bits less significant than the
 	 most significant bit in MASK since carries from those bits will
 	 affect the bits we are interested in.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
new file mode 100644
index 0000000..430277f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/umaddl_combine_1.c
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a53" } */
+
+enum reg_class
+{
+  NO_REGS,
+  AD_REGS,
+  ALL_REGS, LIM_REG_CLASSES
+};
+
+extern enum reg_class
+  reg_class_subclasses[((int) LIM_REG_CLASSES)][((int) LIM_REG_CLASSES)];
+
+void
+init_reg_sets_1 (unsigned int i)
+{
+  unsigned int j;
+  {
+    for (j = i + 1; j < ((int) LIM_REG_CLASSES); j++)
+      {
+	enum reg_class *p;
+	p = &reg_class_subclasses[j][0];
+	while (*p != LIM_REG_CLASSES)
+	  p++;
+      }
+  }
+}
+
+/* { dg-final { scan-assembler-not "umull\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */