Message ID | 013401d8aa7f$bd006d80$37014880$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | [x86,take,#2] Add peephole2 to reduce double word register shuffling | expand |
On Sun, Aug 7, 2022 at 7:04 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This is a resubmission of my patch from June to fix some forms of > inefficient > register allocation using an additional peephole2 in i386.md. > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596064.html > > Since the original, a number of supporting patches/improvements have > been reviewed and approved, making this peephole even more effective. > Hence for the simple function: > __int128 foo(__int128 x, __int128 y) { return x+y; } > > mainline GCC on x86_64 with -O2 currently generates: > movq %rsi, %rax > movq %rdi, %r8 > movq %rax, %rdi > movq %rdx, %rax > movq %rcx, %rdx > addq %r8, %rax > adcq %rdi, %rdx > ret > > with this patch we now generate (a three insn improvement): > movq %rdx, %rax > movq %rcx, %rdx > addq %rdi, %rax > adcq %rsi, %rdx > ret > > Back in June the review of the original patch stalled, as peephole2 > isn't the ideal place to fix this (with which I fully agree), and this patch > is really just a workaround for a deeper deficiency in reload/lra. > To address this I've now filed a new enhancement PR in Bugzilla, > PR rtl-optimization/106518, that describes that underlying issue, > which might make an interesting (GSoC) project for anyone brave > (fool hardy) enough to tweak GCC's register allocation. > > By comparison, this single peephole can't adversely affect other targets, > and should the happy day come that it's no longer required, at worst > would just become a harmless legacy transform that no longer triggers. > > I'm also investigating Uros' suggestion that it may be possible for RTL > expansion to do a better job expanding the function prologue, but > ultimately the hard register placement constraints are fixed by the > target ABI, and poor allocation/assignment of hard registers is the > responsibility/fault of the register allocation passes. > But it may still be possible to reduce register pressure, but avoiding the > use of SUBREGs (which keep the source and destination double words > live during shuffling) along the lines of Richard's CONCAT suggestion. > > This patch has been retested again mainline using make bootstrap and > make -k check, both with and without --target_board=unix{-m32}, > with no new failures. Ok mainline? > > > 2022-08-07 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > PR target/43644 > PR rtl-optimization/97756 > PR rtl-optimization/98438 > * config/i386/i386.md (define_peephole2): Recognize double word > swap sequences, and replace them with more efficient idioms, > including using xchg when optimizing for size. > > gcc/testsuite/ChangeLog > PR target/43644 > * gcc.target/i386/pr43644.c: New test case. +;; Replace a double word swap that requires 4 mov insns with a +;; 3 mov insn implementation (or an xchg when optimizing for size). +(define_peephole2 + [(set (match_operand:DWIH 0 "general_reg_operand") + (match_operand:DWIH 1 "general_reg_operand")) + (set (match_operand:DWIH 2 "general_reg_operand") + (match_operand:DWIH 3 "general_reg_operand")) + (clobber (match_operand:<DWI> 4 "general_reg_operand")) + (set (match_dup 3) (match_dup 0)) + (set (match_dup 1) (match_dup 2))] + "REGNO (operands[0]) != REGNO (operands[3]) + && REGNO (operands[1]) != REGNO (operands[2]) + && REGNO (operands[1]) != REGNO (operands[3]) + && REGNO (operands[3]) == REGNO (operands[4]) + && peep2_reg_dead_p (4, operands[0]) + && peep2_reg_dead_p (5, operands[2])" + [(parallel [(set (match_dup 1) (match_dup 3)) + (set (match_dup 3) (match_dup 1))])] I'm not sure it is correct to remove the clobber here. Some RTL expert should comment on this change. + if (!optimize_insn_for_size_p ()) + { + rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0] + : operands[2]; Hm, this is a strange relation, and it is not obvious why it is done in that way. Usually, REGNO (op1) != REGNO (op2) does the trick. At least a comment should be added here. Uros. + emit_move_insn (tmp, operands[1]); + emit_move_insn (operands[1], operands[3]); + emit_move_insn (operands[3], tmp); + DONE; + } > > Thanks in advance, > Roger > -- >
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 298e4b3..a11fd5b 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3039,6 +3039,36 @@ [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 1) (match_dup 0))])]) +;; Replace a double word swap that requires 4 mov insns with a +;; 3 mov insn implementation (or an xchg when optimizing for size). +(define_peephole2 + [(set (match_operand:DWIH 0 "general_reg_operand") + (match_operand:DWIH 1 "general_reg_operand")) + (set (match_operand:DWIH 2 "general_reg_operand") + (match_operand:DWIH 3 "general_reg_operand")) + (clobber (match_operand:<DWI> 4 "general_reg_operand")) + (set (match_dup 3) (match_dup 0)) + (set (match_dup 1) (match_dup 2))] + "REGNO (operands[0]) != REGNO (operands[3]) + && REGNO (operands[1]) != REGNO (operands[2]) + && REGNO (operands[1]) != REGNO (operands[3]) + && REGNO (operands[3]) == REGNO (operands[4]) + && peep2_reg_dead_p (4, operands[0]) + && peep2_reg_dead_p (5, operands[2])" + [(parallel [(set (match_dup 1) (match_dup 3)) + (set (match_dup 3) (match_dup 1))])] +{ + if (!optimize_insn_for_size_p ()) + { + rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0] + : operands[2]; + emit_move_insn (tmp, operands[1]); + emit_move_insn (operands[1], operands[3]); + emit_move_insn (operands[3], tmp); + DONE; + } +}) + (define_expand "movstrict<mode>" [(set (strict_low_part (match_operand:SWI12 0 "register_operand")) (match_operand:SWI12 1 "general_operand"))] diff --git a/gcc/testsuite/gcc.target/i386/pr43644.c b/gcc/testsuite/gcc.target/i386/pr43644.c new file mode 100644 index 0000000..ffdf31c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr43644.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 foo(__int128 x, __int128 y) +{ + return x+y; +} + +/* { dg-final { scan-assembler-times "movq" 2 } } */ +/* { dg-final { scan-assembler-not "push" } } */ +/* { dg-final { scan-assembler-not "pop" } } */