middle-end: Improve RTL expansion in expand_mul_overflow,

Message ID 014301d653dd$10c28670$32479350$@nextmovesoftware.com
State New
Series middle-end: Improve RTL expansion in expand_mul_overflow,

Commit Message

Roger Sayle July 6, 2020, 9:33 p.m. UTC
This patch improves the RTL that the middle-end generates for testing
signed overflow following a widening multiplication.  During this
expansion the middle-end generates a truncation which can get used
multiple times.  Placing this intermediate value in a pseudo register
reduces the amount of code generated on platforms where this truncation
requires an explicit instruction.
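
For reference, a minimal sketch (not from the testsuite; the real tests
are in c-c++-common/torture/builtin-arith-overflow-1.c) of the kind of
source that reaches this code path:

/* Sketch: a signed multiply-overflow check.  On targets like nvptx
   this expands via a double-width multiply whose truncated low half
   is used both as the result and in the overflow test.  */
int mul_overflows (int x, int y, int *res)
{
  return __builtin_mul_overflow (x, y, res);
}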

This simple call to force_reg eliminates 368 lines of the -S output
from testsuite/c-c++-common/torture/builtin-arith-overflow-1.c on
nvptx-none.  An example difference is in t120_1smul, where the following
7-instruction sequence, in which the 1st and 6th instructions perform
the same truncation:

< cvt.u32.u64     %r31, %r28;           <- truncate %r28
< shr.s32 %r30, %r31, 31;
< cvt.u32.u64     %r32, %r29;
< setp.eq.u32     %r33, %r30, %r32;
< selp.u32        %r24, 0, 1, %r33;
< cvt.u32.u64     %r25, %r28;           <- truncate %r28
< setp.eq.u32     %r34, %r24, 0;

is now generated as a 4-instruction sequence without duplication:

> cvt.u32.u64     %r30, %r28;
> shr.s32 %r31, %r30, 31;
> cvt.u32.u64     %r32, %r29;
> setp.eq.u32     %r33, %r31, %r32;

On x86_64-pc-linux-gnu, where SUBREGs are free, this patch generates
exactly the same builtin-arith-overflow-1.s as before.

This patch has been tested on both x86_64-pc-linux-gnu with
"make bootstrap" and nvptx-none with "make", with no new
testsuite regressions on either platform.
Ok for mainline?


2020-07-06  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog:
	* internal-fn.c (expand_mul_overflow): When checking for signed
	overflow from a widening multiplication, we access the truncated
	lowpart RES twice, so keep this value in a pseudo register.


Thanks in advance,
Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

Comments

Richard Sandiford July 9, 2020, 8:17 a.m. UTC | #1
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> This patch improves the RTL that the middle-end generates for testing
> signed overflow following a widening multiplication.  During this
> expansion the middle-end generates a truncation which can get used
> multiple times.  Placing this intermediate value in a pseudo register
> reduces the amount of code generated on platforms where this truncation
> requires an explicit instruction.
>
> This simple call to force_reg eliminates 368 lines of the -S output
> from testsuite/c-c++-common/torture/builtin-arith-overflow-1.c on
> nvptx-none.  An example difference is in t120_1smul, where the following
> 7-instruction sequence, in which the 1st and 6th instructions perform
> the same truncation:
>
> < cvt.u32.u64     %r31, %r28;           <- truncate %r28
> < shr.s32 %r30, %r31, 31;
> < cvt.u32.u64     %r32, %r29;
> < setp.eq.u32     %r33, %r30, %r32;
> < selp.u32        %r24, 0, 1, %r33;
> < cvt.u32.u64     %r25, %r28;           <- truncate %r28
> < setp.eq.u32     %r34, %r24, 0;
>
> is now generated as a 4-instruction sequence without duplication:
>
>> cvt.u32.u64     %r30, %r28;
>> shr.s32 %r31, %r30, 31;
>> cvt.u32.u64     %r32, %r29;
>> setp.eq.u32     %r33, %r31, %r32;
>
> On x86_64-pc-linux-gnu, where SUBREGs are free, this patch generates
> exactly the same builtin-arith-overflow-1.s as before.
>
> This patch has been tested on both x86_64-pc-linux-gnu with
> "make bootstrap" and nvptx-none with "make", with no new
> testsuite regressions on either platform.
> Ok for mainline?
>
>
> 2020-07-06  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog:
> 	* internal-fn.c (expand_mul_overflow): When checking for signed
> 	overflow from a widening multiplication, we access the truncated
> 	lowpart RES twice, so keep this value in a pseudo register.
>
>
> Thanks in advance,
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index 0be2eb4..d1bd6cc 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
>  				     profile_probability::very_likely ());
>  	  else
>  	    {
> +	      /* RES is used more than once, place it in a pseudo.  */
> +	      res = force_reg (mode, res);
> +
>  	      rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
>  					  NULL_RTX, 0);
>  	      /* RES is low half of the double width result, HIPART

In general, this can be dangerous performance-wise on targets where
subregs are free.  If the move survives to the register allocators,
it increases the risk that the move will become a machine insn.
(The RA will prefer to tie the registers, but that isn't guaranteed.)

But more fundamentally, this can hurt if the SUBREG_REG is live at
the same time as the new pseudo, since the RA then has to treat them
as separate quantities.  From your results, that obviously doesn't
occur in the test case, but I'm not 100% confident that it won't
occur elsewhere.
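
As a concrete sketch of that concern (an assumed example, not one from
the testsuite): if the double-width value stays live past the
truncation, the pseudo that force_reg introduces and the original
SUBREG_REG become two simultaneously-live quantities:

long long f (int x, int y)
{
  long long p = (long long) x * y;   /* double-width product */
  int lo = (int) p;                  /* lowpart; a free subreg on many targets */
  return p + lo;                     /* p and lo are live at the same time */
}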

If target-independent code is going to optimise for “no subreg operand”
targets like nvptx, I think it needs to know that the target wants that.

Thanks,
Richard
Jakub Jelinek July 9, 2020, 8:29 a.m. UTC | #2
On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
> >  				     profile_probability::very_likely ());
> >  	  else
> >  	    {
> > +	      /* RES is used more than once, place it in a pseudo.  */
> > +	      res = force_reg (mode, res);
> > +
> >  	      rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
> >  					  NULL_RTX, 0);
> >  	      /* RES is low half of the double width result, HIPART
> 
> In general, this can be dangerous performance-wise on targets where
> subregs are free.  If the move survives to the register allocators,
> it increases the risk that the move will become a machine insn.
> (The RA will prefer to tie the registers, but that isn't guaranteed.)
> 
> But more fundamentally, this can hurt if the SUBREG_REG is live at
> the same time as the new pseudo, since the RA then has to treat them
> as separate quantities.  From your results, that obviously doesn't
> occur in the test case, but I'm not 100% confident that it won't
> occur elsewhere.
> 
> If target-independent code is going to optimise for “no subreg operand”
> targets like nvptx, I think it needs to know that the target wants that.

Isn't that though what the expander is doing in lots of places?
Force operands into pseudos especially if optimize to hope for better CSE
etc., and hope combine does its job to undo it when it is better to be
propagated?
It is true that if res is used several times, then combine will not
propagate it due to multiple uses, so the question I guess is why as Roger
says we get the same code before/after (which pass undoes that; RA?).

	Jakub
Richard Sandiford July 9, 2020, 8:46 a.m. UTC | #3
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > --- a/gcc/internal-fn.c
>> > +++ b/gcc/internal-fn.c
>> > @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
>> >  				     profile_probability::very_likely ());
>> >  	  else
>> >  	    {
>> > +	      /* RES is used more than once, place it in a pseudo.  */
>> > +	      res = force_reg (mode, res);
>> > +
>> >  	      rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
>> >  					  NULL_RTX, 0);
>> >  	      /* RES is low half of the double width result, HIPART
>> 
>> In general, this can be dangerous performance-wise on targets where
>> subregs are free.  If the move survives to the register allocators,
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)
>> 
>> But more fundamentally, this can hurt if the SUBREG_REG is live at
>> the same time as the new pseudo, since the RA then has to treat them
>> as separate quantities.  From your results, that obviously doesn't
>> occur in the test case, but I'm not 100% confident that it won't
>> occur elsewhere.
>> 
>> If target-independent code is going to optimise for “no subreg operand”
>> targets like nvptx, I think it needs to know that the target wants that.
>
> Isn't that though what the expander is doing in lots of places?
> Force operands into pseudos especially if optimize to hope for better CSE
> etc., and hope combine does its job to undo it when it is better to be
> propagated?
> It is true that if res is used several times, then combine will not
> propagate it due to multiple uses, so the question I guess is why as Roger
> says we get the same code before/after (which pass undoes that; RA?).

I'd imagine fwprop.  That's just a guess though.

But I don't see what this force_reg achieves on “free subreg” targets,
even for CSE.  The SUBREG_REG is still set as a full REG value and
can be CSEd in the normal way.  And because the subreg itself is free,
trying to CSE the subregs is likely to be actively harmful for the
reason above: we can then have the SUBREG_REG and the CSEd subreg
live at the same time.

I.e. it's not usually worth extending the lifetime of a previous
subreg result when a new subreg on the same value would have no cost.

Maybe in the old days it made more sense, because we relied on RTL
optimisers to do constant propagation and folding, and so a subreg
move might later become a constant move, with the constant then being
CSEable.  Is that what you mean?  But that should be much less necessary
now.  There shouldn't be many cases in which only the RTL optimisers can
prove that an internal function argument is constant.

Thanks,
Richard
Roger Sayle July 10, 2020, 10:12 a.m. UTC | #4
Hi Richard,

> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > +	      res = force_reg (mode, res);
>> 
>> In general, this can be dangerous performance-wise on targets where 
>> subregs are free.  If the move survives to the register allocators, 
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)

I believe I can use exactly the same test case, but compiled on ARM, to
demonstrate that placing reused intermediate values in pseudo registers
instead of SUBREGs is potentially preferable and at least no worse
(certainly not "dangerous").

The minimal test case for t120_1smul is:

void bar(void);
int t120_1smul (int x, int y)
{
  int r;
  if (__builtin_smul_overflow (x, y, &r))
    bar ();
  return r;
}

which when compiled on arm-linux-gnueabihf generates, both with and
without my patch, the very impressive, near-optimal sequence:

t120_1smul:
        mov     r3, r0
        smull   r0, r2, r3, r1
        cmp     r2, r0, asr #31
        bxeq    lr
        str     lr, [sp, #-4]!
        sub     sp, sp, #12
        str     r0, [sp, #4]
        bl      bar
        ldr     r0, [sp, #4]
        add     sp, sp, #12
        ldr     pc, [sp], #4

Notice that the first argument (output) of smull, r0, is used multiple
times, but thanks to fwprop both the SUBREG and the pseudo forms
are equivalent to the result register, r0, which reload respects.  That
GCC puts the lopart of the widening multiplication directly in the
result register is awesome.  The use of force_reg in my patch clearly
doesn't hurt.

In fact, the use of a pseudo helps insulate some dubious code
expanded by the ARM backend; the previous assignment to the register
that is forced into a pseudo register actually looks like:

(insn 10 9 11 2 (set (subreg:SI (reg:DI 120) 0)
        (ashiftrt:SI (subreg:SI (reg:DI 119) 4)
            (const_int 0 [0]))) "t120_1smul.c":5:7 -1
     (nil))

Fortunately this cryptic move instruction eventually gets cleaned up
by the middle-end.

[Minor aside: this shift by zero can be avoided by the following patch]
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index e15d286..5225df6 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -31324,7 +31324,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code, rtx out, rtx in,

          if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
            emit_insn (SET (out, const0_rtx));
-         emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+         if (adj_amount != const0_rtx)
+           emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+         else
+           emit_insn (SET (out_down, in_up));
          if (code == ASHIFTRT)
            emit_insn (gen_ashrsi3 (out_up, in_up,
                                    GEN_INT (31)));

But to (perhaps) convince you (folks) that pseudo registers are preferable to SUBREGs, we need
to take a look at the first two instructions generated in the above test case.

Instead of the current:
        mov     r3, r0
        smull   r0, r2, r3, r1

The optimal sequence on ARM (if my understanding of the ARM documentation on register
constraints is correct) would be the single instruction:

        smull   r0, r2, r1, r0

The first three argument registers must be different, but the fourth and final register doesn't
conflict with the first three.  We know that we require the first output to be r0, but we can use
the commutativity of multiplication to swap r0 and r1 in the third and fourth positions, saving
both an instruction and a register!

Alas, my attempts to convince GCC to generate this sequence have been unsuccessful.
The closest I can get is by using something like:

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a6a31f8..ece9175 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -2409,11 +2409,11 @@
 )

 (define_insn "<US>mull"
-  [(set (match_operand:SI 0 "s_register_operand" "=r,&r")
+  [(set (match_operand:SI 0 "s_register_operand" "=r,&r,&r,&r")
        (mult:SI
-        (match_operand:SI 2 "s_register_operand" "%r,r")
-        (match_operand:SI 3 "s_register_operand" "r,r")))
-   (set (match_operand:SI 1 "s_register_operand" "=r,&r")
+        (match_operand:SI 2 "s_register_operand" "%r,r,r,r")
+        (match_operand:SI 3 "s_register_operand" "r,r,0,1")))
+   (set (match_operand:SI 1 "s_register_operand" "=r,&r,&r,&r")
        (truncate:SI
         (lshiftrt:DI
          (mult:DI (SE:DI (match_dup 2)) (SE:DI (match_dup 3)))
@@ -2422,7 +2422,7 @@
   "<US>mull%?\\t%0, %1, %2, %3"
   [(set_attr "type" "umull")
    (set_attr "predicable" "yes")
-   (set_attr "arch" "v6,nov6")]
+   (set_attr "arch" "v6,nov6,nov6,nov6")]
 )

This is one step closer, with reload/GCC now generating:

        mov     r3, r1
        smull   r0, r1, r3, r0

This has the correct form, with r0 as the first and fourth arguments, and requires
one fewer register; but notice that, mysteriously, we still have the (unnecessary) mov instruction!

I suspect that this may be a consequence of using SUBREGs instead of pseudo registers.
Perhaps, by insisting that the DImode result of this instruction is constrained to exist
in consecutive registers, the ARM backend produces worse code than it needs to?

If anyone can convince reload to generate a single "smull r0, r2, r1, r0" for this
testcase on ARM, then I'll admit that they are a better compiler wrangler than me.


I hope this was convincing.  The take-home message is that the single-line force_reg
change really is safe.  I'm happy to follow up with design philosophy, RTL sharing,
and ARM-relevant supporting arguments on why pseudo register usage is good.


Best regards,
Roger
--
Richard Sandiford July 10, 2020, 10:37 a.m. UTC | #5
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> Hi Richard,
>
>> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>>> > +	      res = force_reg (mode, res);
>>> 
>>> In general, this can be dangerous performance-wise on targets where 
>>> subregs are free.  If the move survives to the register allocators, 
>>> it increases the risk that the move will become a machine insn.
>>> (The RA will prefer to tie the registers, but that isn't guaranteed.)
>
> I believe I can use exactly the same test case, but compiled on ARM, to
> demonstrate that placing reused intermediate values in pseudo registers
> instead of SUBREGs is potentially preferable and at least no worse
> (certainly not "dangerous").

OK.  Sorry for the noise then, please go ahead with the patch.

Thanks,
Richard
Segher Boessenkool July 11, 2020, 7:47 a.m. UTC | #6
Hi all,

On Thu, Jul 09, 2020 at 10:29:44AM +0200, Jakub Jelinek via Gcc-patches wrote:
> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
> > If target-independent code is going to optimise for “no subreg operand”
> > targets like nvptx, I think it needs to know that the target wants that.
> 
> Isn't that though what the expander is doing in lots of places?

Not nearly enough, in fact.

The expander should output very simple code, and let the RTL optimisers
figure out what would make good machine code.  Doing "optimisations"
during expand just leads to worse code (and makes the compiler itself
harder to maintain).  Expand should translate Gimple to simple RTL, and
that is it.

A lot of what expand does was useful when people used -O0 and -O1 in
production, but is harmful in the modern world.

> Force operands into pseudos especially if optimize to hope for better CSE

Not just CSE, but pretty much *all* RTL optimisations can do a better
job if they start with simple non-obfuscated code.

> etc., and hope combine does its job to undo it when it is better to be
> propagated?

It does, usually.  Not if a move is the only thing in a block, but it is
RA's job to decide in that case, anyway.

> It is true that if res is used several times, then combine will not
> propagate it due to multiple uses,

Nope, combine can handle that just fine :-)  It won't do this if the
resulting code is more expensive than what it started with, of course,
but combine can handle multiple uses just fine, in general.  What it
cannot handle well is multiple sets of the same pseudo.  You shouldn't
do that anyway, combine is not the only pass that will not handle this.
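
A sketch of the distinction, as assumed expander-style fragments (not
code from the patch; the functions named are the usual GCC expand-time
APIs):

  /* Multiple uses of one pseudo: combine copes with this.  */
  rtx t = force_reg (SImode, val);
  rtx a = expand_simple_binop (SImode, PLUS, t, op1,
			       NULL_RTX, 0, OPTAB_LIB_WIDEN);
  rtx b = expand_simple_binop (SImode, MINUS, t, op2,
			       NULL_RTX, 0, OPTAB_LIB_WIDEN);

  /* Multiple sets of one pseudo: avoid this at expand time.  */
  rtx u = gen_reg_rtx (SImode);
  emit_move_insn (u, op1);
  /* ... insns that use u ... */
  emit_move_insn (u, op2);	/* second set of the same pseudo */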

> so the question I guess is why as Roger
> says we get the same code before/after (which pass undoes that; RA?).

combine can do it just fine, in the simpler cases.  RA can do more cases.


Segher

Patch

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0be2eb4..d1bd6cc 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -1627,6 +1627,9 @@  expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
 				     profile_probability::very_likely ());
 	  else
 	    {
+	      /* RES is used more than once, place it in a pseudo.  */
+	      res = force_reg (mode, res);
+
 	      rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
 					  NULL_RTX, 0);
 	      /* RES is low half of the double width result, HIPART