Message ID: 014301d653dd$10c28670$32479350$@nextmovesoftware.com
State: New
Series: middle-end: Improve RTL expansion in expand_mul_overflow
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> This patch improves the RTL that the middle-end generates for testing
> signed overflow following a widening multiplication.  During this
> expansion the middle-end generates a truncation which can get used
> multiple times.  Placing this intermediate value in a pseudo register
> reduces the amount of code generated on platforms where this truncation
> requires an explicit instruction.
>
> This simple call to force_reg eliminates 368 lines of the -S output
> from testsuite/c-c++-common/torture/builtin-arith-overflow-1.c on
> nvptx-none.  An example difference is in t120_1smul, where the following
> 7-instruction sequence, in which the 1st and 6th instructions perform
> the same truncation:
>
> <	cvt.u32.u64	%r31, %r28;	<- truncate %r28
> <	shr.s32	%r30, %r31, 31;
> <	cvt.u32.u64	%r32, %r29;
> <	setp.eq.u32	%r33, %r30, %r32;
> <	selp.u32	%r24, 0, 1, %r33;
> <	cvt.u32.u64	%r25, %r28;	<- truncate %r28
> <	setp.eq.u32	%r34, %r24, 0;
>
> is now generated as a 4-instruction sequence without duplication:
>
> >	cvt.u32.u64	%r30, %r28;
> >	shr.s32	%r31, %r30, 31;
> >	cvt.u32.u64	%r32, %r29;
> >	setp.eq.u32	%r33, %r31, %r32;
>
> On x86_64-pc-linux-gnu, where SUBREGs are free, this patch generates
> exactly the same builtin-arith-overflow-1.s as before.
>
> This patch has been tested on both x86_64-pc-linux-gnu with
> "make bootstrap" and nvptx-none with "make", with no new
> testsuite regressions on either platform.
> Ok for mainline?
>
>
> 2020-07-06  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog:
> 	* internal-fn.c (expand_mul_overflow): When checking for signed
> 	overflow from a widening multiplication, we access the truncated
> 	lowpart RES twice, so keep this value in a pseudo register.
>
>
> Thanks in advance,
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index 0be2eb4..d1bd6cc 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
> 				     profile_probability::very_likely ());
>        else
> 	{
> +	  /* RES is used more than once, place it in a pseudo.  */
> +	  res = force_reg (mode, res);
> +
> 	  rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
> 				      NULL_RTX, 0);
> 	  /* RES is low half of the double width result, HIPART

In general, this can be dangerous performance-wise on targets where
subregs are free.  If the move survives to the register allocators,
it increases the risk that the move will become a machine insn.
(The RA will prefer to tie the registers, but that isn't guaranteed.)

But more fundamentally, this can hurt if the SUBREG_REG is live at
the same time as the new pseudo, since the RA then has to treat them
as separate quantities.  From your results, that obviously doesn't
occur in the test case, but I'm not 100% confident that it won't
occur elsewhere.

If target-independent code is going to optimise for “no subreg operand”
targets like nvptx, I think it needs to know that the target wants that.

Thanks,
Richard
On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
> > 				     profile_probability::very_likely ());
> >        else
> > 	{
> > +	  /* RES is used more than once, place it in a pseudo.  */
> > +	  res = force_reg (mode, res);
> > +
> > 	  rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
> > 				      NULL_RTX, 0);
> > 	  /* RES is low half of the double width result, HIPART
>
> In general, this can be dangerous performance-wise on targets where
> subregs are free.  If the move survives to the register allocators,
> it increases the risk that the move will become a machine insn.
> (The RA will prefer to tie the registers, but that isn't guaranteed.)
>
> But more fundamentally, this can hurt if the SUBREG_REG is live at
> the same time as the new pseudo, since the RA then has to treat them
> as separate quantities.  From your results, that obviously doesn't
> occur in the test case, but I'm not 100% confident that it won't
> occur elsewhere.
>
> If target-independent code is going to optimise for “no subreg operand”
> targets like nvptx, I think it needs to know that the target wants that.

Isn't that though what the expander is doing in lots of places?
Force operands into pseudos especially if optimize to hope for better CSE
etc., and hope combine does its job to undo it when it is better to be
propagated?
It is true that if res is used several times, then combine will not
propagate it due to multiple uses, so the question I guess is why as Roger
says we get the same code before/after (which pass undoes that; RA?).

	Jakub
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > --- a/gcc/internal-fn.c
>> > +++ b/gcc/internal-fn.c
>> > @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
>> > 				     profile_probability::very_likely ());
>> >        else
>> > 	{
>> > +	  /* RES is used more than once, place it in a pseudo.  */
>> > +	  res = force_reg (mode, res);
>> > +
>> > 	  rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
>> > 				      NULL_RTX, 0);
>> > 	  /* RES is low half of the double width result, HIPART
>>
>> In general, this can be dangerous performance-wise on targets where
>> subregs are free.  If the move survives to the register allocators,
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)
>>
>> But more fundamentally, this can hurt if the SUBREG_REG is live at
>> the same time as the new pseudo, since the RA then has to treat them
>> as separate quantities.  From your results, that obviously doesn't
>> occur in the test case, but I'm not 100% confident that it won't
>> occur elsewhere.
>>
>> If target-independent code is going to optimise for “no subreg operand”
>> targets like nvptx, I think it needs to know that the target wants that.
>
> Isn't that though what the expander is doing in lots of places?
> Force operands into pseudos especially if optimize to hope for better CSE
> etc., and hope combine does its job to undo it when it is better to be
> propagated?
> It is true that if res is used several times, then combine will not
> propagate it due to multiple uses, so the question I guess is why as Roger
> says we get the same code before/after (which pass undoes that; RA?).

I'd imagine fwprop.  That's just a guess though.

But I don't see what this force_reg achieves on “free subreg” targets,
even for CSE.  The SUBREG_REG is still set as a full REG value and can
be CSEd in the normal way.  And because the subreg itself is free,
trying to CSE the subregs is likely to be actively harmful for the
reason above: we can then have the SUBREG_REG and the CSEd subreg live
at the same time.  I.e. it's not usually worth extending the lifetime
of a previous subreg result when a new subreg on the same value would
have no cost.

Maybe in the old days it made more sense, because we relied on RTL
optimisers to do constant propagation and folding, and so a subreg move
might later become a constant move, with the constant then being
CSEable.  Is that what you mean?  But that should be much less
necessary now.  There shouldn't be many cases in which only the RTL
optimisers can prove that an internal function argument is constant.

Thanks,
Richard
Hi Richard,

> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>> > +	  res = force_reg (mode, res);
>>
>> In general, this can be dangerous performance-wise on targets where
>> subregs are free.  If the move survives to the register allocators,
>> it increases the risk that the move will become a machine insn.
>> (The RA will prefer to tie the registers, but that isn't guaranteed.)

I believe I can use exactly the same test case, but compiled on ARM, to
demonstrate that placing reused intermediate values in pseudo registers
instead of SUBREGs is potentially preferable and at least no worse
(certainly not "dangerous").

The minimal test case for t120_smul is:

void bar(void);

int t120_1smul (int x, int y)
{
  int r;
  if (__builtin_smul_overflow (x, y, &r))
    bar ();
  return r;
}

which when compiled on arm-linux-gnueabihf generates, both with and
without my patch, the very impressive, near-optimal sequence:

t120_1smul:
	mov	r3, r0
	smull	r0, r2, r3, r1
	cmp	r2, r0, asr #31
	bxeq	lr
	str	lr, [sp, #-4]!
	sub	sp, sp, #12
	str	r0, [sp, #4]
	bl	bar
	ldr	r0, [sp, #4]
	add	sp, sp, #12
	ldr	pc, [sp], #4

Notice that the first argument (output) of smull, r0, is used multiple
times, but thanks to fwprop both the SUBREG and the pseudo forms are
equivalent to the result register, r0, which reload respects.  That GCC
puts the lopart of the widening multiplication directly in the result
register is awesome.  The use of force_reg in my patch clearly doesn't
hurt.

In fact, the use of a pseudo helps insulate some dubious code expanded
by the ARM backend; the previous assignment to the register that is
forced into a pseudo register actually looks like:

(insn 10 9 11 2 (set (subreg:SI (reg:DI 120) 0)
        (ashiftrt:SI (subreg:SI (reg:DI 119) 4)
            (const_int 0 [0]))) "t120_1smul.c":5:7 -1
     (nil))

Fortunately this cryptic move instruction eventually gets cleaned up by
the middle-end.

[Minor aside: this shift by zero can be avoided by the following patch.]

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index e15d286..5225df6 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -31324,7 +31324,10 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code, rtx out, rtx in,
       if (REG_P (out) && REG_P (in) && REGNO (out) != REGNO (in))
 	emit_insn (SET (out, const0_rtx));

-      emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+      if (adj_amount != const0_rtx)
+	emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+      else
+	emit_insn (SET (out_down, in_up));
       if (code == ASHIFTRT)
 	emit_insn (gen_ashrsi3 (out_up, in_up, GEN_INT (31)));

But to (perhaps) convince you (folks) that pseudo registers are
preferable to SUBREGs, we need to take a look at the first two
instructions generated in the above test case.  Instead of the current:

	mov	r3, r0
	smull	r0, r2, r3, r1

the optimal sequence on ARM (if my understanding of the ARM
documentation on register constraints is correct) would be the single
instruction:

	smull	r0, r2, r1, r0

as the first three argument registers must be different, but the fourth
and final register doesn't conflict with the first three.  We know that
we require the first output to be r0, but we can use the commutativity
of multiplication to swap r0 and r1 in the third and fourth positions,
saving both an instruction and a register!  Alas my attempts to convince
GCC to generate this sequence have been unsuccessful.

The closest I can get is by using something like:

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index a6a31f8..ece9175 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -2409,11 +2409,11 @@
 )

 (define_insn "<US>mull"
-  [(set (match_operand:SI 0 "s_register_operand" "=r,&r")
+  [(set (match_operand:SI 0 "s_register_operand" "=r,&r,&r,&r")
 	(mult:SI
-	 (match_operand:SI 2 "s_register_operand" "%r,r")
-	 (match_operand:SI 3 "s_register_operand" "r,r")))
-   (set (match_operand:SI 1 "s_register_operand" "=r,&r")
+	 (match_operand:SI 2 "s_register_operand" "%r,r,r,r")
+	 (match_operand:SI 3 "s_register_operand" "r,r,0,1")))
+   (set (match_operand:SI 1 "s_register_operand" "=r,&r,&r,&r")
 	(truncate:SI
 	 (lshiftrt:DI
 	  (mult:DI (SE:DI (match_dup 2)) (SE:DI (match_dup 3)))
@@ -2422,7 +2422,7 @@
   "<US>mull%?\\t%0, %1, %2, %3"
   [(set_attr "type" "umull")
    (set_attr "predicable" "yes")
-   (set_attr "arch" "v6,nov6")]
+   (set_attr "arch" "v6,nov6,nov6,nov6")]
 )

which is one step closer, with reload/GCC now generating:

	mov	r3, r1
	smull	r0, r1, r3, r0

This has the correct form, with r0 as the first and fourth arguments,
and requires one less register, but notice that mysteriously we still
have the (unnecessary) mov instruction!  I suspect that this may be a
consequence of using SUBREGs instead of pseudo registers.  Perhaps, by
insisting that the DImode result of this instruction is constrained to
exist in consecutive registers, the ARM backend produces worse code
than it needs to?

If anyone can convince reload to generate a single
"smull r0, r2, r1, r0" for this test case on ARM, then I'll admit that
they are a better compiler wrangler than me.

I hope this was convincing.  The take-home message is that the
single-line force_reg change really is safe.  I'm happy to follow up
with design philosophy, RTL sharing, and ARM-relevant supporting
arguments on why pseudo register usage is good.

Best regards,
Roger
--
"Roger Sayle" <roger@nextmovesoftware.com> writes:
> Hi Richard,
>
>> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
>>> > +	  res = force_reg (mode, res);
>>>
>>> In general, this can be dangerous performance-wise on targets where
>>> subregs are free.  If the move survives to the register allocators,
>>> it increases the risk that the move will become a machine insn.
>>> (The RA will prefer to tie the registers, but that isn't guaranteed.)
>
> I believe I can use exactly the same test case, but compiled on ARM, to
> demonstrate that placing reused intermediate values in pseudo registers
> instead of SUBREGs is potentially preferable and at least no worse
> (certainly not "dangerous").

OK.  Sorry for the noise then, please go ahead with the patch.

Thanks,
Richard
Hi all,

On Thu, Jul 09, 2020 at 10:29:44AM +0200, Jakub Jelinek via Gcc-patches wrote:
> On Thu, Jul 09, 2020 at 09:17:46AM +0100, Richard Sandiford wrote:
> > If target-independent code is going to optimise for “no subreg operand”
> > targets like nvptx, I think it needs to know that the target wants that.
>
> Isn't that though what the expander is doing in lots of places?

Not nearly enough, in fact.  The expander should output very simple
code, and let the RTL optimisers figure out what would make good
machine code.  Doing "optimisations" during expand just leads to worse
code (and makes the compiler itself harder to maintain).  Expand should
translate Gimple to simple RTL, and that is it.  A lot of what expand
does was useful when people used -O0 and -O1 in production, but is
harmful in the modern world.

> Force operands into pseudos especially if optimize to hope for better CSE

Not just CSE, but pretty much *all* RTL optimisations can do a better
job if they start with simple non-obfuscated code.

> etc., and hope combine does its job to undo it when it is better to be
> propagated?

It does, usually.  Not if a move is the only thing in a block, but it
is RA's job to decide in that case, anyway.

> It is true that if res is used several times, then combine will not
> propagate it due to multiple uses,

Nope, combine can handle that just fine :-)  It won't do this if the
resulting code is more expensive than what it started with, of course,
but combine can handle multiple uses just fine, in general.

What it cannot handle well is multiple sets of the same pseudo.  You
shouldn't do that anyway; combine is not the only pass that will not
handle this.

> so the question I guess is why as Roger
> says we get the same code before/after (which pass undoes that; RA?).

combine can do it just fine, in the simpler cases.  RA can do more
cases.


Segher
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0be2eb4..d1bd6cc 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
 				     profile_probability::very_likely ());
       else
 	{
+	  /* RES is used more than once, place it in a pseudo.  */
+	  res = force_reg (mode, res);
+
 	  rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
 				      NULL_RTX, 0);
 	  /* RES is low half of the double width result, HIPART