
RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961))

Message ID 20160617000706.GB25170@gate.crashing.org
State New

Commit Message

Segher Boessenkool June 17, 2016, 12:07 a.m. UTC
On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again by noticing the direct overlap of
> argument and return registers.  The obstacle is
> 
> (insn 7 4 8 2 (set (reg:V2DF 93)
>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>             (reg/v:DF 92 [ aa ]))) 
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> 
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> 
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
> 
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).
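
(For reference, the simplification simplify_subreg is missing has roughly
this shape; the sketch below reuses the register numbers from the dump above
and assumes 8-byte DF elements, so it is illustrative rather than the exact
patch:

    (subreg:DF (vec_concat:V2DF (reg:DF 91) (reg:DF 92)) 0)   ->  (reg:DF 91)
    (subreg:DF (vec_concat:V2DF (reg:DF 91) (reg:DF 92)) 8)   ->  (reg:DF 92)

i.e. a subreg that exactly covers one operand of the vec_concat can be
replaced by that operand.)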

Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later; then
it tries a parallel of A with C.  That usually does not match an insn for
the target.

If this were a 3->2 (or 4->2) combination, or if A or C is a no-op move
(so that it will disappear in later combines), combine will break the
parallel into two and see if that matches.  We can in fact do that for
2->2 combinations as well: it removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not lower
the number of RTL insns.
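
As a concrete sketch (with made-up register numbers, not taken from the PR
above), a 2->2 combination looks like:

    A:  (set (reg 100) (ashift (reg 101) (const_int 3)))
    B:  (set (reg 102) (plus (reg 103) (reg 100)))
  =>
    C:  (set (reg 102) (plus (reg 103) (ashift (reg 101) (const_int 3))))

where A is kept because (reg 100) still has other uses.  Previously combine
would only try the parallel [A; C] as a single insn here.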

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed length insn targets), and increases it a small bit on some
variable length insn targets (doing an op twice, instead of doing it once
and doing a move).  It looks to be all good there too, but there are so
many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?


Segher

Comments

Kyrill Tkachov June 20, 2016, 1:39 p.m. UTC | #1
Hi Segher,

On 17/06/16 01:07, Segher Boessenkool wrote:
> On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
>> With the proposed cost change for vector construction we will end up
>> vectorizing the testcase in PR68961 again (on x86_64 and likely
>> on ppc64le as well after that target gets adjustments).  Currently
>> we can't optimize that away again by noticing the direct overlap of
>> argument and return registers.  The obstacle is
>>
>> (insn 7 4 8 2 (set (reg:V2DF 93)
>>          (vec_concat:V2DF (reg/v:DF 91 [ a ])
>>              (reg/v:DF 92 [ aa ])))
>> ...
>> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>>          (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>>          (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>>
>> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>>
>> First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> (easy fix below).
>>
>> Then combine doesn't like to simplify the multi-use (it tries some
>> parallel it seems).
> Combine will not do a 2->2 combination currently.  Say it is combining
> A with a later B into C, and the result of A is used again later; then
> it tries a parallel of A with C.  That usually does not match an insn for
> the target.
>
> If this were a 3->2 (or 4->2) combination, or if A or C is a no-op move
> (so that it will disappear in later combines), combine will break the
> parallel into two and see if that matches.  We can in fact do that for
> 2->2 combinations as well: it removes a log_link (from A to B), so
> combine cannot get into an infinite loop, even though it does not lower
> the number of RTL insns.
>
> So I tried out the patch below.  It decreases code size on most targets
> (mostly fixed length insn targets), and increases it a small bit on some
> variable length insn targets (doing an op twice, instead of doing it once
> and doing a move).  It looks to be all good there too, but there are so
> many changes that it is almost impossible to really check.
>
> So: can people try this out with their favourite benchmarks, please?

I hope to give this a run on AArch64, but I'll probably only manage to get to it next week.
In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
Some benchmarks decrease in size, others increase.  One recurring theme I spotted is
shifts being repeatedly combined with arithmetic operations rather than being computed
once and the result reused.  For example:
     lsl    x30, x15, 3
     add    x4, x5, x30
     add    x9, x7, x30
     add    x24, x8, x30
     add    x10, x0, x30
     add    x2, x22, x30

becoming (modulo regalloc fluctuations):
     add    x14, x2, x15, lsl 3
     add    x13, x22, x15, lsl 3
     add    x21, x4, x15, lsl 3
     add    x6, x0, x15, lsl 3
     add    x3, x30, x15, lsl 3

which, while saving one instruction, can be harmful overall because the extra shift operation
in the arithmetic instructions can increase their latency.  I believe the aarch64
rtx costs should convey this information.  Do you expect RTX costs to gate this behaviour?

Some occurrences that hurt code size look like:
     cmp    x8, x11, asr 5

being transformed into:
     asr    x12, x11, 5
     cmp    x12, x8, uxtw //zero-extend x8
with the user of the condition code inverted to match the change in the order of operands
to the comparison.
I haven't looked at the RTL dumps yet to figure out why this is happening; it could be a backend
RTL representation issue.

Kyrill


>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index 6b5d000..2c99b4e 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>   	   && XVECLEN (newpat, 0) == 2
>   	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
>   	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
> -	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
> -		  || set_noop_p (XVECEXP (newpat, 0, 1)))
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
>   	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
>
Segher Boessenkool June 20, 2016, 2:37 p.m. UTC | #2
On Mon, Jun 20, 2016 at 02:39:06PM +0100, Kyrill Tkachov wrote:
> >So I tried out the patch below.  It decreases code size on most targets
> >(mostly fixed length insn targets), and increases it a small bit on some
> >variable length insn targets (doing an op twice, instead of doing it once
> >and doing a move).  It looks to be all good there too, but there are so
> >many changes that it is almost impossible to really check.
> >
> >So: can people try this out with their favourite benchmarks, please?
> 
> I hope to give this a run on AArch64, but I'll probably only manage to get
> to it next week.
> In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
> Some benchmarks decrease in size, others increase.  One recurring theme I
> spotted is shifts being repeatedly combined with arithmetic operations
> rather than being computed once and the result reused.  For example:
>     lsl    x30, x15, 3
>     add    x4, x5, x30
>     add    x9, x7, x30
>     add    x24, x8, x30
>     add    x10, x0, x30
>     add    x2, x22, x30
> 
> becoming (modulo regalloc fluctuations):
>     add    x14, x2, x15, lsl 3
>     add    x13, x22, x15, lsl 3
>     add    x21, x4, x15, lsl 3
>     add    x6, x0, x15, lsl 3
>     add    x3, x30, x15, lsl 3
> 
> which, while saving one instruction, can be harmful overall because the
> extra shift operation in the arithmetic instructions can increase their
> latency.  I believe the aarch64 rtx costs should convey this information.
> Do you expect RTX costs to gate this behaviour?

Yes, RTX costs are used for *all* of combine's combinations.  So it seems
your add,lsl patterns have the same cost as a plain add?  If they were more
expensive, combine would reject this combination.
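
(A hypothetical sketch of how a target could charge for the embedded shift
so that combine's cost check rejects such a combination; this is not the
actual aarch64 implementation, and everything except the GCC rtx_costs hook
interface is made up for illustration:

    static bool
    example_rtx_costs (rtx x, machine_mode mode, int outer_code,
                       int opno, int *total, bool speed)
    {
      /* A PLUS whose first operand is a shift: model it as a plain add
         plus one extra cycle for the shifter when optimizing for speed.  */
      if (GET_CODE (x) == PLUS && GET_CODE (XEXP (x, 0)) == ASHIFT)
        {
          *total = COSTS_N_INSNS (1) + (speed ? COSTS_N_INSNS (1) : 0);
          return true;   /* Cost is final; do not recurse into operands.  */
        }
      return false;      /* Fall back to the generic cost computation.  */
    }

With something like that, an add with an embedded shift costs more than the
plain add it replaces while the shift insn has to stay anyway, so
combine_validate_cost would reject the 2->2 combination.)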

> Some occurrences that hurt code size look like:
>     cmp    x8, x11, asr 5
> 
> being transformed into:
>     asr    x12, x11, 5
>     cmp    x12, x8, uxtw //zero-extend x8
> with the user of the condition code inverted to match the change in the
> order of operands to the comparison.
> I haven't looked at the RTL dumps yet to figure out why this is happening;
> it could be a backend RTL representation issue.

That could be a target thing, yes; hard to tell.  It's not clear to me
what combination is made here (if any).


Segher

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@  try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 	   && XVECLEN (newpat, 0) == 2
 	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
 	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-		  || set_noop_p (XVECEXP (newpat, 0, 1)))
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT