Message ID: 20160617000706.GB25170@gate.crashing.org
State: New
Hi Segher,

On 17/06/16 01:07, Segher Boessenkool wrote:
> On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
>> With the proposed cost change for vector construction we will end up
>> vectorizing the testcase in PR68961 again (on x86_64 and likely
>> on ppc64le as well after that target gets adjustments). Currently
>> we can't optimize that away again noticing the direct overlap of
>> argument and return registers. The obstacle is
>>
>> (insn 7 4 8 2 (set (reg:V2DF 93)
>>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>>             (reg/v:DF 92 [ aa ])))
>> ...
>> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>>
>> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>>
>> First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> (easy fix below).
>>
>> Then combine doesn't like to simplify the multi-use (it tries some
>> parallel it seems).
>
> Combine will not do a 2->2 combination currently. Say it is combining
> A with a later B into C, and the result of A is used again later; then
> it tries a parallel of A with C. That usually does not match an insn
> for the target.
>
> If this were a 3->2 (or 4->2) combination, or A or C are no-op moves
> (so that they will disappear in later combines), combine will break the
> parallel into two and see if that matches. We can in fact do that for
> 2->2 combinations as well: this removes a log_link (from A to B), so
> combine cannot get into an infinite loop, even though it does not make
> the number of RTL insns lower.
>
> So I tried out the patch below. It decreases code size on most targets
> (mostly fixed-length-insn targets), and increases it a small bit on some
> variable-length-insn targets (doing an op twice, instead of doing it
> once and doing a move). It looks to be all good there too, but there
> are so many changes that it is almost impossible to really check.
>
> So: can people try this out with their favourite benchmarks, please?

I hope to give this a run on AArch64, but I'll probably manage to get to
it only next week.

In the meantime I've had a quick look at some SPEC2006 codegen on
aarch64. Some benchmarks decrease in size, others increase. One
recurring theme I spotted is shifts being repeatedly combined with
arithmetic operations rather than being computed once and the result
reused. For example:

  lsl x30, x15, 3
  add x4, x5, x30
  add x9, x7, x30
  add x24, x8, x30
  add x10, x0, x30
  add x2, x22, x30

becoming (modulo regalloc fluctuations):

  add x14, x2, x15, lsl 3
  add x13, x22, x15, lsl 3
  add x21, x4, x15, lsl 3
  add x6, x0, x15, lsl 3
  add x3, x30, x15, lsl 3

which, while saving one instruction, can be harmful overall because the
extra shift operation in the arithmetic instructions can increase their
latency. I believe the aarch64 rtx costs should convey this information.
Do you expect RTX costs to gate this behaviour?

Some occurrences that hurt code size look like:

  cmp x8, x11, asr 5

being transformed into:

  asr x12, x11, 5
  cmp x12, x8, uxtw  // zero-extend x8

with the user of the condition code inverted to match the change in the
order of operands to the comparison. I haven't looked at the RTL dumps
yet to figure out why this is happening; it could be a backend RTL
representation issue.
Kyrill

>
> Segher
>
>
> diff --git a/gcc/combine.c b/gcc/combine.c
> index 6b5d000..2c99b4e 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>         && XVECLEN (newpat, 0) == 2
>         && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
>         && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
> -       && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
> -           || set_noop_p (XVECEXP (newpat, 0, 1)))
>         && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
>         && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
>         && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
>
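Richard's remark above that simplify_subreg doesn't handle subregs of a
vec_concat refers to a fix that is not included in this excerpt. Purely
as a hypothetical sketch of the kind of handling he describes (ignoring
endianness corner cases; the variable names follow gcc/simplify-rtx.c
conventions, but this is not the committed change), a subreg that lines
up exactly with one operand of the VEC_CONCAT can reduce to that operand:

  /* Hypothetical sketch, not the committed fix: inside simplify_subreg,
     OP is the inner expression, OUTERMODE the mode being extracted, and
     BYTE the subreg offset.  */
  if (GET_CODE (op) == VEC_CONCAT)
    {
      unsigned HOST_WIDE_INT first_size
        = GET_MODE_SIZE (GET_MODE (XEXP (op, 0)));

      /* The subreg covers exactly the first operand of the concat...  */
      if (byte == 0 && outermode == GET_MODE (XEXP (op, 0)))
        return XEXP (op, 0);

      /* ...or exactly the second one.  */
      if (byte == first_size && outermode == GET_MODE (XEXP (op, 1)))
        return XEXP (op, 1);
    }

With something along these lines, (subreg:DF (vec_concat:V2DF a aa) 0)
folds back to a and the offset-8 subreg folds back to aa, which is what
the PR68961 testcase needs.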
On Mon, Jun 20, 2016 at 02:39:06PM +0100, Kyrill Tkachov wrote:
> > So I tried out the patch below. It decreases code size on most targets
> > (mostly fixed-length-insn targets), and increases it a small bit on
> > some variable-length-insn targets (doing an op twice, instead of doing
> > it once and doing a move). It looks to be all good there too, but
> > there are so many changes that it is almost impossible to really check.
> >
> > So: can people try this out with their favourite benchmarks, please?
>
> I hope to give this a run on AArch64, but I'll probably manage to get
> to it only next week.
>
> In the meantime I've had a quick look at some SPEC2006 codegen on
> aarch64. Some benchmarks decrease in size, others increase. One
> recurring theme I spotted is shifts being repeatedly combined with
> arithmetic operations rather than being computed once and the result
> reused. For example:
>
>   lsl x30, x15, 3
>   add x4, x5, x30
>   add x9, x7, x30
>   add x24, x8, x30
>   add x10, x0, x30
>   add x2, x22, x30
>
> becoming (modulo regalloc fluctuations):
>
>   add x14, x2, x15, lsl 3
>   add x13, x22, x15, lsl 3
>   add x21, x4, x15, lsl 3
>   add x6, x0, x15, lsl 3
>   add x3, x30, x15, lsl 3
>
> which, while saving one instruction, can be harmful overall because the
> extra shift operation in the arithmetic instructions can increase their
> latency. I believe the aarch64 rtx costs should convey this information.
> Do you expect RTX costs to gate this behaviour?

Yes, RTX costs are used for *all* of combine's combinations. So it seems
your add,lsl patterns are the same cost as a plain add? If it were more
expensive, combine would reject this combination.

> Some occurrences that hurt code size look like:
>
>   cmp x8, x11, asr 5
>
> being transformed into:
>
>   asr x12, x11, 5
>   cmp x12, x8, uxtw  // zero-extend x8
>
> with the user of the condition code inverted to match the change in the
> order of operands to the comparison. I haven't looked at the RTL dumps
> yet to figure out why this is happening; it could be a backend RTL
> representation issue.

That could be a target thing, yes, hard to tell; it's not clear to me
what combination is made here (if any).


Segher
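Concretely, combine compares the total cost of the instructions before
and after a combination and rejects the change when the cost goes up, so
a target can gate this by pricing the embedded shift. A minimal
hypothetical sketch of such a TARGET_RTX_COSTS fragment (the hook name
and the cost figures are invented for illustration; on AArch64 the real
hook is aarch64_rtx_costs):

  /* Hypothetical rtx_costs hook: make an add with an embedded shift
     cost more than a plain add when optimizing for speed, so that
     combine declines to fold a multi-use shift into every user.
     Parameters mirror the GCC 6 TARGET_RTX_COSTS signature.  */
  static bool
  example_rtx_costs (rtx x, machine_mode mode, int outer_code,
                     int opno, int *total, bool speed)
  {
    if (GET_CODE (x) == PLUS
        && (GET_CODE (XEXP (x, 0)) == ASHIFT
            || GET_CODE (XEXP (x, 1)) == ASHIFT))
      {
        /* Still one instruction for size, but charge an extra cycle of
           latency for speed (the figure is made up).  */
        *total = COSTS_N_INSNS (speed ? 2 : 1);
        return true;   /* Cost is final; don't recurse into operands.  */
      }
    return false;      /* Defer everything else to the generic costs.  */
  }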
diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
        && XVECLEN (newpat, 0) == 2
        && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
        && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-       && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-           || set_noop_p (XVECEXP (newpat, 0, 1)))
        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
        && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
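The two lines deleted here are what restricted this path to 3->2 (or
4->2) combinations and to parallels containing a no-op set. For context,
a condensed paraphrase of the splitting that the surrounding try_combine
code performs when this condition holds (simplified for illustration,
not a verbatim excerpt of combine.c):

  /* The unrecognized PARALLEL holds two SETs; if neither one reads the
     other's destination, they can be emitted as two separate insns.  */
  rtx set0 = XVECEXP (newpat, 0, 0);
  rtx set1 = XVECEXP (newpat, 0, 1);

  if (!reg_referenced_p (SET_DEST (set1), set0)
      && !reg_referenced_p (SET_DEST (set0), set1))
    {
      newi2pat = set0;   /* One SET becomes the new i2...  */
      newpat = set1;     /* ...the other the new i3.  */
      /* Safe against looping: reaching this point consumed the
         log_link from A to B, even though the insn count stays the
         same.  */
    }

After the patch, a plain 2->2 combination whose result is such a
PARALLEL also takes this path instead of being given up on.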