Message ID | 20201021082529.1508350-1-luoxhu@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | rs6000: Don't split constant operator add before reload, move to temp register for future optimization | expand |
On Wed, Oct 21, 2020 at 03:25:29AM -0500, Xionghu Luo wrote: > Don't split code from add<mode>3 for SDI to allow a later pass to split. This is very problematic. > This allows later logic to hoist out constant load in add instructions. Later logic should be able to do that any way (I do not say that works perfectly, mind; it no doubt could be improved). > In loop, lis+ori could be hoisted out to improve performance compared with > previous addis+addi (About 15% on typical case), weak point is > one more register is used and one more instruction is generated. i.e.: Yes, better performance on one testcase, and worse code always :-( > addis 3,3,0x6765 > addi 3,3,0x4321 > > => > > lis 9,0x6765 > ori 9,9,0x4321 > add 3,3,9 This is the typical kind of clumsy code you get if you generate RTL that matches actual machine instructions too late ("split too late"). So, please make it possible to hoist 2-insn-immediate sequences out of loops, *without* changing them to fake 1-insn things. Some comments on the patch: > + rs6000_emit_move (tmp, operands[2], <MODE>mode); > + emit_insn (gen_add<mode>3 (operands[0], operands[1], tmp)); > DONE; > } > + else > + { You don't need an else here: everything in the "if" has done a DONE or FAIL. You can just keep the existing code as-is, there is no need to obfuscate the code :-) > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/add-const.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile { target { lp64 } } } */ { target lp64 } (no extra braces needed) > +/* Ensure the lis,ori are generated, which indicates they have > + been hoisted outside of the loop. */ That is a very fragile test. > --- a/gcc/testsuite/gcc.target/powerpc/prefix-add.c > +++ b/gcc/testsuite/gcc.target/powerpc/prefix-add.c > @@ -2,13 +2,13 @@ > /* { dg-require-effective-target powerpc_prefixed_addr } */ > /* { dg-options "-O2 -mdejagnu-cpu=power10" } */ > > -/* Test that PADDI is generated to add a large constant. */ > +/* Test that PLI is generated to add a large constant. */ Nope, that is a bad idea. The test tested that we generate good code, we should keep it that way. Segher
Hi, On 2020/10/27 05:10, Segher Boessenkool wrote: > On Wed, Oct 21, 2020 at 03:25:29AM -0500, Xionghu Luo wrote: >> Don't split code from add<mode>3 for SDI to allow a later pass to split. > > This is very problematic. > >> This allows later logic to hoist out constant load in add instructions. > > Later logic should be able to do that any way (I do not say that works > perfectly, mind; it no doubt could be improved). > >> In loop, lis+ori could be hoisted out to improve performance compared with >> previous addis+addi (About 15% on typical case), weak point is >> one more register is used and one more instruction is generated. i.e.: > > Yes, better performance on one testcase, and worse code always :-( > >> addis 3,3,0x6765 >> addi 3,3,0x4321 >> >> => >> >> lis 9,0x6765 >> ori 9,9,0x4321 >> add 3,3,9 > > This is the typical kind of clumsy code you get if you generate RTL that > matches actual machine instructions too late ("split too late"). > > So, please make it possible to hoist 2-insn-immediate sequences out of > loops, *without* changing them to fake 1-insn things. > As we discussed offline, addis+addi is not quite possible to be hoisted out of loops as not invariant, update the patch as below, thanks: [PATCH v2] rs6000: Split constant operator add in split1 instead of expander Currently, ADD with positive 32bit constant is split to addis+addi in expander, which seems too early to optimize the constant load out of loop compared with other targets. This patch use a temp register to load the constant and do two register addition in expander same as negative 32bit constant add. This allows loop invariant pass to hoist out constant load before add instructions, then split1 pass will split the load to lis+ori after combine. Performance could be improved by 15% on typical case compared with previous addis+addi in loop. (1) 0x67654321 addis 3,3,0x6765 addi 3,3,0x4321 => lis 9,0x6765 ori 9,9,0x4321 add 3,3,9 (2) 0x00008fff addis 9,9,0x1 addi 3,9,-28673 => li 10,0 ori 10,10,0x8fff add 3,3,10 Regression and bootstrap tested pass on P8LE. gcc/ChangeLog: 2020-10-21 Xiong Hu Luo <luoxhu@linux.ibm.com> * config/rs6000/rs6000.md (add<mode>3 for SDI): Don't split before reload, move constant to temp register for add. (define_split): Split const from split1. gcc/testsuite/ChangeLog: 2020-10-21 Xiong Hu Luo <luoxhu@linux.ibm.com> * gcc.target/powerpc/add-const.c: New test. --- gcc/config/rs6000/rs6000.md | 38 ++++++++++++-------- gcc/testsuite/gcc.target/powerpc/add-const.c | 18 ++++++++++ 2 files changed, 41 insertions(+), 15 deletions(-) create mode 100644 gcc/testsuite/gcc.target/powerpc/add-const.c diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index 5e5ad9f7c3d..b52e9555962 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -1750,18 +1750,26 @@ (define_expand "add<mode>3" if (CONST_INT_P (operands[2]) && !add_operand (operands[2], <MODE>mode)) { - rtx tmp = ((!can_create_pseudo_p () - || rtx_equal_p (operands[0], operands[1])) - ? operands[0] : gen_reg_rtx (<MODE>mode)); - - /* Adding a constant to r0 is not a valid insn, so use a different - strategy in that case. */ - if (reg_or_subregno (operands[1]) == 0 || reg_or_subregno (tmp) == 0) + bool reg0 = reg_or_subregno (operands[0]) == 0; + if (can_create_pseudo_p () || reg0) { - if (operands[0] == operands[1]) - FAIL; - rs6000_emit_move (operands[0], operands[2], <MODE>mode); - emit_insn (gen_add<mode>3 (operands[0], operands[1], operands[0])); + + rtx tmp = (!can_create_pseudo_p () + || rtx_equal_p (operands[0], operands[1])) + ? operands[0] : gen_reg_rtx (<MODE>mode); + + /* Adding a constant to r0 is not a valid insn, so use a different + strategy in that case. See stack-limit.c, need generate + "24: %0:DI=0x20fa0; 25: %0:DI=%14:DI+%0:DI" in pro_and_epilogue + when can_create_pseudo_p is false. */ + if (reg0 == 0 || reg_or_subregno (tmp) == 0) + { + if (operands[0] == operands[1]) + FAIL; + } + + rs6000_emit_move (tmp, operands[2], <MODE>mode); + emit_insn (gen_add<mode>3 (operands[0], operands[1], tmp)); DONE; } @@ -1775,8 +1783,8 @@ (define_expand "add<mode>3" /* The ordering here is important for the prolog expander. When space is allocated from the stack, adding 'low' first may produce a temporary deallocation (which would be bad). */ - emit_insn (gen_add<mode>3 (tmp, operands[1], GEN_INT (rest))); - emit_insn (gen_add<mode>3 (operands[0], tmp, GEN_INT (low))); + emit_insn (gen_add<mode>3 (operands[0], operands[1], GEN_INT (rest))); + emit_insn (gen_add<mode>3 (operands[0], operands[0], GEN_INT (low))); DONE; } }) @@ -9118,7 +9126,7 @@ (define_split ;; When non-easy constants can go in the TOC, this should use ;; easy_fp_constant predicate. (define_split - [(set (match_operand:DI 0 "int_reg_operand_not_pseudo") + [(set (match_operand:DI 0 "int_reg_operand") (match_operand:DI 1 "const_int_operand"))] "TARGET_POWERPC64 && num_insns_constant (operands[1], DImode) > 1" [(set (match_dup 0) (match_dup 2)) @@ -9131,7 +9139,7 @@ (define_split }) (define_split - [(set (match_operand:DI 0 "int_reg_operand_not_pseudo") + [(set (match_operand:DI 0 "int_reg_operand") (match_operand:DI 1 "const_scalar_int_operand"))] "TARGET_POWERPC64 && num_insns_constant (operands[1], DImode) > 1" [(set (match_dup 0) (match_dup 2)) diff --git a/gcc/testsuite/gcc.target/powerpc/add-const.c b/gcc/testsuite/gcc.target/powerpc/add-const.c new file mode 100644 index 00000000000..3ffdbcfab89 --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/add-const.c @@ -0,0 +1,18 @@ +/* { dg-do compile { target lp64 } } */ +/* { dg-options "-O3 -fno-unroll-loops" } */ + +/* Ensure the lis,ori are generated, which indicates they have + been hoisted outside of the loop. */ + +typedef unsigned long ulong; +ulong +foo (ulong n, ulong h) +{ + int i; + for (i = 0; i < n; i++) + h = ((h + 8) | h) + 0x67654321; + return h; +} + +/* { dg-final { scan-assembler-times {\mlis\M} 1 } } */ +/* { dg-final { scan-assembler-times {\mori\M} 1 } } */
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md index 4c2fe7fa312..af577da669e 100644 --- a/gcc/config/rs6000/predicates.md +++ b/gcc/config/rs6000/predicates.md @@ -859,8 +859,7 @@ (define_special_predicate "indexed_address_mem" (define_predicate "add_operand" (if_then_else (match_code "const_int") (match_test "satisfies_constraint_I (op) - || satisfies_constraint_L (op) - || satisfies_constraint_eI (op)") + || satisfies_constraint_L (op)") (match_operand 0 "gpc_reg_operand"))) ;; Return 1 if the operand is either a non-special register, or 0, or -1. diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index 779bfd11237..facf6e12114 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -1750,34 +1750,44 @@ (define_expand "add<mode>3" if (CONST_INT_P (operands[2]) && !add_operand (operands[2], <MODE>mode)) { - rtx tmp = ((!can_create_pseudo_p () - || rtx_equal_p (operands[0], operands[1])) - ? operands[0] : gen_reg_rtx (<MODE>mode)); + bool reg0 = (reg_or_subregno (operands[0]) == 0); + if (can_create_pseudo_p () || reg0) + { + + rtx tmp = (!can_create_pseudo_p () + || rtx_equal_p (operands[0], operands[1])) + ? operands[0] : gen_reg_rtx (<MODE>mode); /* Adding a constant to r0 is not a valid insn, so use a different - strategy in that case. */ - if (reg_or_subregno (operands[1]) == 0 || reg_or_subregno (tmp) == 0) - { - if (operands[0] == operands[1]) - FAIL; - rs6000_emit_move (operands[0], operands[2], <MODE>mode); - emit_insn (gen_add<mode>3 (operands[0], operands[1], operands[0])); + strategy in that case. See stack-limit.c, need generate + "24: %0:DI=0x20fa0; 25: %0:DI=%14:DI+%0:DI" in pro_and_epilogue + when can_create_pseudo_p is false. */ + if (reg0 == 0 || reg_or_subregno (tmp) == 0) + { + if (operands[0] == operands[1]) + FAIL; + } + + rs6000_emit_move (tmp, operands[2], <MODE>mode); + emit_insn (gen_add<mode>3 (operands[0], operands[1], tmp)); DONE; } + else + { + HOST_WIDE_INT val = INTVAL (operands[2]); + HOST_WIDE_INT low = ((val & 0xffff) ^ 0x8000) - 0x8000; + HOST_WIDE_INT rest = trunc_int_for_mode (val - low, <MODE>mode); - HOST_WIDE_INT val = INTVAL (operands[2]); - HOST_WIDE_INT low = ((val & 0xffff) ^ 0x8000) - 0x8000; - HOST_WIDE_INT rest = trunc_int_for_mode (val - low, <MODE>mode); - - if (<MODE>mode == DImode && !satisfies_constraint_L (GEN_INT (rest))) - FAIL; + if (<MODE>mode == DImode && !satisfies_constraint_L (GEN_INT (rest))) + FAIL; - /* The ordering here is important for the prolog expander. - When space is allocated from the stack, adding 'low' first may - produce a temporary deallocation (which would be bad). */ - emit_insn (gen_add<mode>3 (tmp, operands[1], GEN_INT (rest))); - emit_insn (gen_add<mode>3 (operands[0], tmp, GEN_INT (low))); - DONE; + /* The ordering here is important for the prolog expander. + When space is allocated from the stack, adding 'low' first may + produce a temporary deallocation (which would be bad). */ + emit_insn (gen_add<mode>3 (operands[0], operands[1], GEN_INT (rest))); + emit_insn (gen_add<mode>3 (operands[0], operands[0], GEN_INT (low))); + DONE; + } } }) diff --git a/gcc/testsuite/gcc.target/powerpc/add-const.c b/gcc/testsuite/gcc.target/powerpc/add-const.c new file mode 100644 index 00000000000..eeeed091032 --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/add-const.c @@ -0,0 +1,18 @@ +/* { dg-do compile { target { lp64 } } } */ +/* { dg-options "-O3 -fno-unroll-loops" } */ + +/* Ensure the lis,ori are generated, which indicates they have + been hoisted outside of the loop. */ + +typedef unsigned long ulong; +ulong +foo (ulong n, ulong h) +{ + int i; + for (i = 0; i < n; i++) + h = ((h + 8) | h) + 0x67654321; + return h; +} + +/* { dg-final { scan-assembler-times {\mlis\M} 1 } } */ +/* { dg-final { scan-assembler-times {\mori\M} 1 } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/prefix-add.c b/gcc/testsuite/gcc.target/powerpc/prefix-add.c index 0027406e457..13af39e51ec 100644 --- a/gcc/testsuite/gcc.target/powerpc/prefix-add.c +++ b/gcc/testsuite/gcc.target/powerpc/prefix-add.c @@ -2,13 +2,13 @@ /* { dg-require-effective-target powerpc_prefixed_addr } */ /* { dg-options "-O2 -mdejagnu-cpu=power10" } */ -/* Test that PADDI is generated to add a large constant. */ +/* Test that PLI is generated to add a large constant. */ unsigned long add (unsigned long a) { return a + 0x12345U; } -/* { dg-final { scan-assembler {\mpaddi\M} } } */ +/* { dg-final { scan-assembler {\mpli\M} } } */ /* { dg-final { scan-assembler-not {\maddi\M} } } */ /* { dg-final { scan-assembler-not {\maddis\M} } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/prefix-no-update.c b/gcc/testsuite/gcc.target/powerpc/prefix-no-update.c index 837fcd77c0b..26221d3eec5 100644 --- a/gcc/testsuite/gcc.target/powerpc/prefix-no-update.c +++ b/gcc/testsuite/gcc.target/powerpc/prefix-no-update.c @@ -42,7 +42,7 @@ struct foo *dec_store (struct foo *p, unsigned int *q) /* { dg-final { scan-assembler-times {\mlwz\M} 2 } } */ /* { dg-final { scan-assembler-times {\mstw\M} 2 } } */ -/* { dg-final { scan-assembler-times {\mpaddi\M} 4 } } */ +/* { dg-final { scan-assembler-times {\mpli\M} 4 } } */ /* { dg-final { scan-assembler-times {\mplwz\M} 2 } } */ /* { dg-final { scan-assembler-times {\mpstw\M} 2 } } */ /* { dg-final { scan-assembler-not {\mplwzu\M} } } */