Message ID | mptimpq3m08.fsf@arm.com
---|---
State | New
Series | Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles
On 9/17/19 6:50 PM, Richard Sandiford wrote:
[ ... ]
> This patch tries to avoid the problem by not adding register shuffle
> copies if there appears to be no chance that the two operands could be
> allocated to the same register.
> The table below summarises
> the tests that had more or fewer assembly lines after the patch
> (always a bad metric, but it's just to get a flavour):
>
> Target            Tests  Delta  Best  Worst  Median
> ======            =====  =====  ====  =====  ======
> x86_64-linux-gnu     39   -577  -164     23      -1

Hmmm, this certainly sounds interesting enough to try on its own merits, even if it's not committed by tomorrow morning ... Fascinating analysis - thanks!
On 09/17/2019 12:50 PM, Richard Sandiford wrote:
> If an insn requires two operands to be tied, and the input operand dies
> in the insn, IRA acts as though there were a copy from the input to the
> output with the same execution frequency as the insn.  Allocating the
> same register to the input and the output then saves the cost of a move.
>
> If there is no such tie, but an input operand nevertheless dies
> in the insn, IRA creates a similar move, but with an eighth of the
> frequency.  This helps to ensure that chains of instructions reuse
> registers in a natural way, rather than using arbitrarily different
> registers for no reason.
>
> This heuristic seems to work well in the vast majority of cases.
> However, for SVE, the handling of untied operands ends up creating
> copies between dying predicate registers and vector outputs, even though
> vector and predicate registers are distinct classes and can never be
> tied.  This is a particular problem because the dying predicate tends
> to be the loop control predicate, which is used by most instructions
> in a vector loop and so (rightly) has a very high allocation priority.
> Any copies involving the loop predicate therefore tend to get processed
> before copies involving only vector registers.  The end result is that
> we tend to allocate the output of the last vector instruction in a loop
> ahead of its natural place in the allocation order and don't benefit
> from chains created between vector registers.
>
> This patch tries to avoid the problem by not adding register shuffle
> copies if there appears to be no chance that the two operands could be
> allocated to the same register.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu (which needs the
> pr82361-*.c tweak I just posted).  Also tested by comparing the
> gcc.c-torture, gcc.dg and g++.dg output for one target per CPU
> directory at -O2 -ftree-vectorize.
> The table below summarises
> the tests that had more or fewer assembly lines after the patch
> (always a bad metric, but it's just to get a flavour):
>
> Target                 Tests   Delta   Best  Worst  Median
> ======                 =====   =====   ====  =====  ======
> aarch64-linux-gnu         11     -16    -15      9      -1
> aarch64_be-linux-gnu      10     -16    -15      9      -1
> alpha-linux-gnu           10      12     -5     10       2
> amdgcn-amdhsa             51      48   -138    174      -1
> arc-elf                    4      -3     -2      2      -2
> bfin-elf                  53      19    -13      5       1
> cris-elf                   1      -1     -1     -1      -1
> frv-linux-gnu            139      22    -16     12      -1
> h8300-elf                  2       2      1      1       1
> hppa64-hp-hpux11.23       20       5     -3      3       1
> i686-apple-darwin         42    -305    -87     11      -1
> i686-pc-linux-gnu         32   -3205   -402     15     -15
> ia64-linux-gnu           774    -636   -274    261      -1
> m32r-elf                 179   -1386   -161     31      -2
> m68k-linux-gnu            21     -64    -18      4      -2
> mipsel-linux-gnu          61       8    -25     34       1
> mipsisa64-linux-gnu       40     -11    -10     13      -1
> mmix                       4       1     -2      6      -2
> mn10300-elf                9       0     -6      5       1
> pdp11                     13      -6     -5      2       1
> powerpc-ibm-aix7.0       152    1736   -142    481      -1
> powerpc64-linux-gnu      111   -1366   -370     39      -1
> powerpc64le-linux-gnu    136   -1183   -409    241      -1
> pru-elf                   45      24     -6     17       1
> riscv32-elf               22     -83    -62     10      -2
> riscv64-elf               26     -78    -19      4      -1
> s390-linux-gnu            43    -145    -32     16      -1
> s390x-linux-gnu           40    -285    -96     16      -1
> visium-elf               125    -317    -50     10      -1
> x86_64-darwin             40     -83    -42     13      -1
> x86_64-linux-gnu          39    -577   -164     23      -1
> xstormy16-elf             26     -12     -4      4      -1

As usual, a great report and patch explanation, Richard.  Thank you for finding this problem and resolving it.  The patch is OK for the trunk.

>
> 2019-09-17  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
> 	* ira-conflicts.c (can_use_same_reg_p): New function.
> 	(process_reg_shuffles): Take an insn parameter.  Ignore cases
> 	in which input operand op_num could seemingly never be allocated
> 	to the same register as the destination.
> 	(add_insn_allocno_copies): Update call to process_reg_shuffles.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/sve/cond_convert_1.c: Remove XFAILs.
> 	* gcc.target/aarch64/sve/cond_convert_4.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_unary_2.c: Likewise.
Index: gcc/ira-conflicts.c
===================================================================
--- gcc/ira-conflicts.c	2019-09-12 10:52:55.000000000 +0100
+++ gcc/ira-conflicts.c	2019-09-17 17:43:06.426031052 +0100
@@ -325,12 +325,37 @@ process_regs_for_copy (rtx reg1, rtx reg
   return true;
 }
 
-/* Process all of the output registers of the current insn which are
-   not bound (BOUND_P) and the input register REG (its operand number
+/* Return true if output operand OUTPUT and input operand INPUT of
+   INSN can use the same register class for at least one alternative.
+   INSN is already described in recog_data and recog_op_alt.  */
+static bool
+can_use_same_reg_p (rtx_insn *insn, int output, int input)
+{
+  alternative_mask preferred = get_preferred_alternatives (insn);
+  for (int nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+    {
+      if (!TEST_BIT (preferred, nalt))
+	continue;
+
+      const operand_alternative *op_alt
+	= &recog_op_alt[nalt * recog_data.n_operands];
+      if (op_alt[input].matches == output)
+	return true;
+
+      if (ira_reg_class_intersect[op_alt[input].cl][op_alt[output].cl]
+	  != NO_REGS)
+	return true;
+    }
+  return false;
+}
+
+/* Process all of the output registers of the current insn (INSN) which
+   are not bound (BOUND_P) and the input register REG (its operand number
    OP_NUM) which dies in the insn as if there were a move insn between
    them with frequency FREQ.  */
 static void
-process_reg_shuffles (rtx reg, int op_num, int freq, bool *bound_p)
+process_reg_shuffles (rtx_insn *insn, rtx reg, int op_num, int freq,
+		      bool *bound_p)
 {
   int i;
   rtx another_reg;
@@ -342,7 +367,13 @@ process_reg_shuffles (rtx reg, int op_nu
 
       if (!REG_SUBREG_P (another_reg) || op_num == i
	  || recog_data.operand_type[i] != OP_OUT
-	  || bound_p[i])
+	  || bound_p[i]
+	  || (!can_use_same_reg_p (insn, i, op_num)
+	      && (recog_data.constraints[op_num][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num + 1))
+	      && (op_num == 0
+		  || recog_data.constraints[op_num - 1][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num - 1))))
	continue;
 
       process_regs_for_copy (reg, another_reg, false, NULL, freq);
@@ -412,7 +443,8 @@ add_insn_allocno_copies (rtx_insn *insn)
	 the corresponding allocno copies.  The cost will not
	 correspond to a real move insn cost, so make the frequency
	 smaller.  */
-      process_reg_shuffles (operand, i, freq < 8 ? 1 : freq / 8, bound_p);
+      process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
+			    bound_p);
     }
 }
 
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-09-17 17:43:06.426031052 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-09-17 17:43:06.430031021 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-08-14 11:53:04.636898923 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-09-17 17:43:06.430031021 +0100
@@ -54,8 +54,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX for the
-   floating-point functions.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\t} 6 } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */