Message ID | HE1PR0801MB204484FA92B21DFD4AB7FF4F83030@HE1PR0801MB2044.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | [AArch64] Improve clz patterns | expand |
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > Although GCC should understand the limited range of clz/ctz/cls results, > Combine sometimes behaves oddly and duplicates ctz to remove a > sign extension. Avoid this by adding an explicit AND with 127 in the > patterns. Deepsjeng performance improves by ~0.6%. Could you go into more detail about what the before and after code looks like, and what combine is doing? Like you say, this sounds like a target-independent thing on face value. Either way, something like this needs a testcase. Thanks, Richard > > Bootstrap OK. > > ChangeLog: > 2020-02-03 Wilco Dijkstra <wdijkstr@arm.com> > > * config/aarch64/aarch64.md (clz<mode>2): Mask the clz result. > (clrsb<mode>2): Likewise. > (ctz<mode>2): Likewise. > -- > > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index 5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -4794,7 +4794,8 @@ (define_insn "*and_one_cmpl_<SHIFT:optab><mode>3_compare0_no_reuse" > > (define_insn "clz<mode>2" > [(set (match_operand:GPI 0 "register_operand" "=r") > -(clz:GPI (match_operand:GPI 1 "register_operand" "r")))] > +(and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r")) > + (const_int 127)))] > "" > "clz\\t%<w>0, %<w>1" > [(set_attr "type" "clz")] > @@ -4848,7 +4849,8 @@ (define_expand "popcount<mode>2" > > (define_insn "clrsb<mode>2" > [(set (match_operand:GPI 0 "register_operand" "=r") > - (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))] > +(and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")) > + (const_int 127)))] > "" > "cls\\t%<w>0, %<w>1" > [(set_attr "type" "clz")] > @@ -4869,7 +4871,8 @@ (define_insn "rbit<mode>2" > > (define_insn_and_split "ctz<mode>2" > [(set (match_operand:GPI 0 "register_operand" "=r") > - (ctz:GPI (match_operand:GPI 1 "register_operand" "r")))] > + (and:GPI (ctz:GPI (match_operand:GPI 1 "register_operand" "r")) > +(const_int 127)))] > "" > "#" > "reload_completed"
Hi Richard, > Could you go into more detail about what the before and after code > looks like, and what combine is doing? Like you say, this sounds > like a target-independent thing on face value. It is indeed, but it seems specific to instructions where we have range information which allows it to remove a redundant sign-extend. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 for the full details. > Either way, something like this needs a testcase. Sure I've added the testcase from pr93565, see below. Cheers, Wilco Although GCC should understand the limited range of clz/ctz/cls results, Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary sign extension. Avoid this by adding an explicit AND with 127 in the patterns. Deepsjeng performance improves by ~0.6%. Bootstrap OK. ChangeLog: 2020-02-04 Wilco Dijkstra <wdijkstr@arm.com> PR rtl-optimization/93565 * config/aarch64/aarch64.md (clz<mode>2): Mask the clz result. (clrsb<mode>2): Likewise. (ctz<mode>2): Likewise. * gcc.target/aarch64/pr93565.c: New test. -- diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4794,7 +4794,8 @@ (define_insn "*and_one_cmpl_<SHIFT:optab><mode>3_compare0_no_reuse" (define_insn "clz<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (clz:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "clz\\t%<w>0, %<w>1" [(set_attr "type" "clz")] @@ -4848,7 +4849,8 @@ (define_expand "popcount<mode>2" (define_insn "clrsb<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "cls\\t%<w>0, %<w>1" [(set_attr "type" "clz")] @@ -4869,7 +4871,8 @@ (define_insn "rbit<mode>2" (define_insn_and_split "ctz<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (ctz:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (ctz:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "#" "reload_completed" diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c new file mode 100644 index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +static const unsigned long long magic = 0x03f08c5392f756cdULL; + +static const char table[64] = { + 0, 1, 12, 2, 13, 22, 17, 3, + 14, 33, 23, 36, 18, 58, 28, 4, + 62, 15, 34, 26, 24, 48, 50, 37, + 19, 55, 59, 52, 29, 44, 39, 5, + 63, 11, 21, 16, 32, 35, 57, 27, + 61, 25, 47, 49, 54, 51, 43, 38, + 10, 20, 31, 56, 60, 46, 53, 42, + 9, 30, 45, 41, 8, 40, 7, 6, +}; + +static inline int ctz1 (unsigned long b) +{ + unsigned long lsb = b & -b; + return table[(lsb * magic) >> 58]; +} + +void f (unsigned long x, int *p) +{ + if (x != 0) + { + int a = ctz1 (x); + *p = a | p[a]; + } +} + +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */ +/* { dg-final { scan-assembler-times "clz\t" 1 } } */ +
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > Hi Richard, > >> Could you go into more detail about what the before and after code >> looks like, and what combine is doing? Like you say, this sounds >> like a target-independent thing on face value. > > It is indeed, but it seems specific to instructions where we have range > information which allows it to remove a redundant sign-extend. > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 for the full details. > >> Either way, something like this needs a testcase. > > Sure I've added the testcase from pr93565, see below. > > Cheers, > Wilco > > > Although GCC should understand the limited range of clz/ctz/cls results, > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary > sign extension. Avoid this by adding an explicit AND with 127 in the > patterns. Deepsjeng performance improves by ~0.6%. > > Bootstrap OK. > > ChangeLog: > 2020-02-04 Wilco Dijkstra <wdijkstr@arm.com> > > PR rtl-optimization/93565 > * config/aarch64/aarch64.md (clz<mode>2): Mask the clz result. > (clrsb<mode>2): Likewise. > (ctz<mode>2): Likewise. > > * gcc.target/aarch64/pr93565.c: New test. I think this only works because the (and ...) wrapper stops combine from matching a plain ctz: Failed to match this instruction: (set (reg:DI 100 [ _9+-4 ]) (ctz:DI (reg/v:DI 97 [ x ]))) I.e. the patch doesn't work by making the range explicit. It just works by wrapping the pattern in something that combine will never generate naturally. Something like (ashift:DI ... (const_int 0)) would give the same result. If I add a second pattern that is never generated directly but can be matched by combine: (define_insn_and_split "*ctz<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") (ctz:GPI (match_operand:GPI 1 "register_operand" "r")))] "" "#" "reload_completed" [(const_int 0)] " emit_insn (gen_rbit<mode>2 (operands[0], operands[1])); emit_insn (gen_clz<mode>2 (operands[0], operands[0])); DONE; ") then we get the double ctz again. So I don't think this is OK even as a workaround. We should fix it in target-independent code instead. Thanks, Richard
On Fri, Feb 07, 2020 at 06:01:44PM +0000, Richard Sandiford wrote: > Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > > Although GCC should understand the limited range of clz/ctz/cls results, > > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary > > sign extension. Avoid this by adding an explicit AND with 127 in the > > patterns. Deepsjeng performance improves by ~0.6%. > > PR rtl-optimization/93565 > So I don't think this is OK even as a workaround. We should fix it in > target-independent code instead. Yes, but combine shouldn't do CSE; combine does the *opposite* of CSE, in some regards. You shouldn't do CSE without a cost model for it, in any case. Segher
Hi Richard, Right, so this is an alternative approach using costs - Combine won't try to duplicate instructions if it increases costs, so increasing the ctz cost to 2 instructions (which is the correct cost for ctz anyway) ensures we still get efficient code for this example: [AArch64] Set ctz rtx_cost (PR93565) Although GCC should understand the limited range of clz/ctz/cls results, Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary sign extension. Avoid this by setting the cost for ctz to be higher than that of a simple ALU instruction. Deepsjeng performance improves by ~0.6%. Bootstrap OK. ChangeLog: 2020-02-12 Wilco Dijkstra <wdijkstr@arm.com> PR rtl-optimization/93565 * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs. * gcc.target/aarch64/pr93565.c: New test. -- diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED, return false; + case CTZ: + *cost = COSTS_N_INSNS (2); + + if (speed) + *cost += extra_cost->alu.clz + extra_cost->alu.rev; + return false; + case COMPARE: op0 = XEXP (x, 0); op1 = XEXP (x, 1); diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c new file mode 100644 index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +static const unsigned long long magic = 0x03f08c5392f756cdULL; + +static const char table[64] = { + 0, 1, 12, 2, 13, 22, 17, 3, + 14, 33, 23, 36, 18, 58, 28, 4, + 62, 15, 34, 26, 24, 48, 50, 37, + 19, 55, 59, 52, 29, 44, 39, 5, + 63, 11, 21, 16, 32, 35, 57, 27, + 61, 25, 47, 49, 54, 51, 43, 38, + 10, 20, 31, 56, 60, 46, 53, 42, + 9, 30, 45, 41, 8, 40, 7, 6, +}; + +static inline int ctz1 (unsigned long b) +{ + unsigned long lsb = b & -b; + return table[(lsb * magic) >> 58]; +} + +void f (unsigned long x, int *p) +{ + if (x != 0) + { + int a = ctz1 (x); + *p = a | p[a]; + } +} + +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */ +/* { dg-final { scan-assembler-times "clz\t" 1 } } */ +
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > Hi Richard, > > Right, so this is an alternative approach using costs - Combine won't try to > duplicate instructions if it increases costs, so increasing the ctz cost to 2 > instructions (which is the correct cost for ctz anyway) ...agreed... > ensures we still get efficient code for this example: > > [AArch64] Set ctz rtx_cost (PR93565) > > Although GCC should understand the limited range of clz/ctz/cls results, > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary > sign extension. Avoid this by setting the cost for ctz to be higher than > that of a simple ALU instruction. Deepsjeng performance improves by ~0.6%. > > Bootstrap OK. > > ChangeLog: > 2020-02-12 Wilco Dijkstra <wdijkstr@arm.com> > > PR rtl-optimization/93565 > * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs. > > * gcc.target/aarch64/pr93565.c: New test. OK, thanks. Could you remove the bit about combine behaving oddly when you commit though? I think this was simply a target bug and combine was being given duff information. Richard > > -- > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED, > > return false; > > + case CTZ: > + *cost = COSTS_N_INSNS (2); > + > + if (speed) > +*cost += extra_cost->alu.clz + extra_cost->alu.rev; > + return false; > + > case COMPARE: > op0 = XEXP (x, 0); > op1 = XEXP (x, 1); > diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c > new file mode 100644 > index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c > @@ -0,0 +1,34 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2" } */ > + > +static const unsigned long long magic = 0x03f08c5392f756cdULL; > + > +static const char table[64] = { > + 0, 1, 12, 2, 13, 22, 17, 3, > + 14, 33, 23, 36, 18, 58, 28, 4, > + 62, 15, 34, 26, 24, 48, 50, 37, > + 19, 55, 59, 52, 29, 44, 39, 5, > + 63, 11, 21, 16, 32, 35, 57, 27, > + 61, 25, 47, 49, 54, 51, 43, 38, > + 10, 20, 31, 56, 60, 46, 53, 42, > + 9, 30, 45, 41, 8, 40, 7, 6, > +}; > + > +static inline int ctz1 (unsigned long b) > +{ > + unsigned long lsb = b & -b; > + return table[(lsb * magic) >> 58]; > +} > + > +void f (unsigned long x, int *p) > +{ > + if (x != 0) > + { > + int a = ctz1 (x); > + *p = a | p[a]; > + } > +} > + > +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */ > +/* { dg-final { scan-assembler-times "clz\t" 1 } } */ > +
On Wed, Feb 12, 2020 at 9:56 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > > Hi Richard, > > > > Right, so this is an alternative approach using costs - Combine won't try to > > duplicate instructions if it increases costs, so increasing the ctz cost to 2 > > instructions (which is the correct cost for ctz anyway) > > ...agreed... Yes I agree a better cost model for CTZ/CLZ is the right solution but I disagree with 2 ALU instruction as the cost. It should either be the same cost as a multiply or have its own cost entry. For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4 cycles, the same as the cost as a multiple; on OcteonTX2 it is 5 cycles (again the same cost as a multiple). Thanks, Andrew Pinski > > > ensures we still get efficient code for this example: > > > > [AArch64] Set ctz rtx_cost (PR93565) > > > > Although GCC should understand the limited range of clz/ctz/cls results, > > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary > > sign extension. Avoid this by setting the cost for ctz to be higher than > > that of a simple ALU instruction. Deepsjeng performance improves by ~0.6%. > > > > Bootstrap OK. > > > > ChangeLog: > > 2020-02-12 Wilco Dijkstra <wdijkstr@arm.com> > > > > PR rtl-optimization/93565 > > * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs. > > > > * gcc.target/aarch64/pr93565.c: New test. > > OK, thanks. Could you remove the bit about combine behaving oddly when > you commit though? I think this was simply a target bug and combine > was being given duff information. > > Richard > > > > > -- > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED, > > > > return false; > > > > + case CTZ: > > + *cost = COSTS_N_INSNS (2); > > + > > + if (speed) > > +*cost += extra_cost->alu.clz + extra_cost->alu.rev; > > + return false; > > + > > case COMPARE: > > op0 = XEXP (x, 0); > > op1 = XEXP (x, 1); > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c > > new file mode 100644 > > index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c > > @@ -0,0 +1,34 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2" } */ > > + > > +static const unsigned long long magic = 0x03f08c5392f756cdULL; > > + > > +static const char table[64] = { > > + 0, 1, 12, 2, 13, 22, 17, 3, > > + 14, 33, 23, 36, 18, 58, 28, 4, > > + 62, 15, 34, 26, 24, 48, 50, 37, > > + 19, 55, 59, 52, 29, 44, 39, 5, > > + 63, 11, 21, 16, 32, 35, 57, 27, > > + 61, 25, 47, 49, 54, 51, 43, 38, > > + 10, 20, 31, 56, 60, 46, 53, 42, > > + 9, 30, 45, 41, 8, 40, 7, 6, > > +}; > > + > > +static inline int ctz1 (unsigned long b) > > +{ > > + unsigned long lsb = b & -b; > > + return table[(lsb * magic) >> 58]; > > +} > > + > > +void f (unsigned long x, int *p) > > +{ > > + if (x != 0) > > + { > > + int a = ctz1 (x); > > + *p = a | p[a]; > > + } > > +} > > + > > +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */ > > +/* { dg-final { scan-assembler-times "clz\t" 1 } } */ > > +
Hi Richard, See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565#c8 - the problem is more generic like I suspected and it's easy to create similar examples. So while this turned out to be an easy worksaround for ctz, there general case is harder to avoid since you still want to allow beneficial multi-use cases (such as merging a shift into 2 ALU instructions). Cheers, Wilco
Hi Andrew, > Yes I agree a better cost model for CTZ/CLZ is the right solution but > I disagree with 2 ALU instruction as the cost. It should either be > the same cost as a multiply or have its own cost entry. > For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4 > cycles, the same as the cost as a multiple; on OcteonTX2 it is 5 > cycles (again the same cost as a multiple). + if (speed) + *cost += extra_cost->alu.clz + extra_cost->alu.rev; + return false; So if the cost of clz and ctz are similar enough, this will use the defined per-cpu costs. Cheers, Wilco
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4794,7 +4794,8 @@ (define_insn "*and_one_cmpl_<SHIFT:optab><mode>3_compare0_no_reuse" (define_insn "clz<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (clz:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "clz\\t%<w>0, %<w>1" [(set_attr "type" "clz")] @@ -4848,7 +4849,8 @@ (define_expand "popcount<mode>2" (define_insn "clrsb<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "cls\\t%<w>0, %<w>1" [(set_attr "type" "clz")] @@ -4869,7 +4871,8 @@ (define_insn "rbit<mode>2" (define_insn_and_split "ctz<mode>2" [(set (match_operand:GPI 0 "register_operand" "=r") - (ctz:GPI (match_operand:GPI 1 "register_operand" "r")))] + (and:GPI (ctz:GPI (match_operand:GPI 1 "register_operand" "r")) + (const_int 127)))] "" "#" "reload_completed"