Message ID | ce5777e2-694a-fdf5-5aa8-20e1a91e5e66@linux.ibm.com |
---|---|
State | New |
Series | [v2] Fold (add -1; zero_ext; add +1) operations to zero_ext when not overflow (PR37451, part of PR61837) |
Hi!

On Tue, May 12, 2020 at 02:48:40PM +0800, luoxhu wrote:
> diff --git a/gcc/testsuite/gcc.target/powerpc/doloop-2.c b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
> new file mode 100644
> index 00000000000..dc8516bb0ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile { target powerpc*-*-* } } */

Just { dg-do compile } please, *everything* that runs this testsuite is
powerpc*-*-*; but compile is the default as well, so you can leave that
line out completely, if you want.

> +/* { dg-final { scan-assembler-not "-1" } } */

This will fail the test for the string "-1" anywhere in the file. Like,
if it was called "doloop-1.c" it would fail, or "doloop-12345.c". \m
and \M can help for that last case, but you probably want to make the
regex a bit more selective ;-)

(And, document what it doesn't want to see, if it isn't really obvious?)


Segher
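To make the substring problem concrete, here is a small C sketch. It is only an illustration under stated assumptions: the testsuite evaluates these patterns with Tcl's regexp, whereas this sketch uses POSIX regcomp/regexec (which has no \m or \M), the helper name line_matches is hypothetical, and the two sample assembler lines are made up for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the POSIX extended regex PATTERN
   matches somewhere in LINE, 0 if it does not, -1 if PATTERN is invalid.  */
static int
line_matches (const char *pattern, const char *line)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  int rc = regexec (&re, line, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}

/* Made-up sample lines of assembler output (assumptions, not real
   compiler output).  */
static const char file_line[] = "\t.file\t\"doloop-1.c\"";
static const char addi_line[] = "\taddi 9,4,-1";

/* The bare "-1" pattern matches any line containing that substring.
   The second pattern only accepts an addi whose last operand is -1.  */
static const char loose[] = "-1";
static const char selective[] = "addi [^,]*,[^,]*,-1$";
```

With these definitions, line_matches (loose, file_line) is true even though the line contains no instruction at all, while the selective pattern matches only the addi line.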
luoxhu <luoxhu@linux.ibm.com> writes:
> +      /* Fold (add -1; zero_ext; add +1) operations to zero_ext. i.e:
> +
> +         73: r145:SI=r123:DI#0-0x1
> +         74: r144:DI=zero_extend (r145:SI)
> +         75: r143:DI=r144:DI+0x1
> +         ...
> +         31: r135:CC=cmp (r123:DI,0)
> +         72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;clobber
> +         scratch;clobber scratch;}

Minor, but it might be worth stubbing out the clobbers, since they're
not really necessary to understand the comment:

         72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;...}

> +
> +         r123:DI#0-0x1 is param count derived from loop->niter_expr equal to the
> +         loop iterations, if loop iterations expression doesn't overflow, then
> +         (zero_extend (r123:DI#0-1))+1 could be simplified to zero_extend only.
> +       */
> +      bool simplify_zext = false;

I think it'd be easier to follow if this was split out into
a subroutine, rather than having the simplify_zext variable.

> +      rtx extop0 = XEXP (count, 0);
> +      if (GET_CODE (count) == ZERO_EXTEND && GET_CODE (extop0) == PLUS)

This isn't valid: we can only do XEXP (count, 0) *after* checking
for a ZERO_EXTEND. (It'd be good to test the patch with
--enable-checking=yes,extra,rtl , which hopefully would have
caught this.)

> +        {
> +          rtx addop0 = XEXP (extop0, 0);
> +          rtx addop1 = XEXP (extop0, 1);
> +
> +          int nonoverflow = 0;
> +          unsigned int_mode
> +            = GET_MODE_PRECISION (as_a<scalar_int_mode> GET_MODE (addop0));

Heh. I wondered at first how on earth this compiled. It looked like
there was a missing "(...)" around the GET_MODE. But of course,
GET_MODE adds its own parentheses, so it all works out. :-)

Please add the "(...)" anyway though. We shouldn't rely on that.

"int_mode" seems a bit of a confusing name, since it's actually a precision
in bits rather than a mode.

> +          unsigned HOST_WIDE_INT int_mode_max
> +            = (HOST_WIDE_INT_1U << (int_mode - 1) << 1) - 1;
> +          if (get_max_loop_iterations (loop, &iterations)
> +              && wi::ltu_p (iterations, int_mode_max))

You could use GET_MODE_MASK instead of int_mode_max here.

For extra safety, it would be good to add a HWI_COMPUTABLE_P test,
to make sure that using HWIs is valid.

> +            nonoverflow = 1;
> +
> +          if (nonoverflow

Having the nonoverflow variable doesn't seem necessary. We could
just fuse the two "if" conditions together.

> +              && CONST_SCALAR_INT_P (addop1)
> +              && GET_MODE_PRECISION (mode) == int_mode * 2

This GET_MODE_PRECISION condition also shouldn't be necessary.
If we can prove that the subtraction doesn't wrap, we can extend
to any wider mode, not just to double the width.

> +              && addop1 == GEN_INT (-1))

This can just be:

    addop1 == constm1_rtx

There's then no need for the CONST_SCALAR_INT_P check.

Thanks,
Richard
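The int_mode_max expression quoted in the review builds an all-ones mask for a given precision, which is the value GET_MODE_MASK provides for a mode. As a rough stand-alone illustration (a sketch, not GCC code), the same arithmetic can be written in plain C. Here uint64_t stands in for unsigned HOST_WIDE_INT, and the function name mask_for_precision is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask for PRECISION bits, assuming 1 <= PRECISION <= 64.
   The shift is split in two so that PRECISION == 64 never shifts by
   the full width of the type, which would be undefined behaviour in C.
   uint64_t is an assumed stand-in for unsigned HOST_WIDE_INT.  */
static uint64_t
mask_for_precision (unsigned precision)
{
  return (((uint64_t) 1 << (precision - 1)) << 1) - 1;
}
```

The sketch only works while the precision fits in the host integer type, which is the situation the suggested HWI-computability test is meant to guarantee.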
On 2020/5/13 02:24, Richard Sandiford wrote:
> luoxhu <luoxhu@linux.ibm.com> writes:
>> +      /* Fold (add -1; zero_ext; add +1) operations to zero_ext. i.e:
>> +
>> +         73: r145:SI=r123:DI#0-0x1
>> +         74: r144:DI=zero_extend (r145:SI)
>> +         75: r143:DI=r144:DI+0x1
>> +         ...
>> +         31: r135:CC=cmp (r123:DI,0)
>> +         72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;clobber
>> +         scratch;clobber scratch;}
>
> Minor, but it might be worth stubbing out the clobbers, since they're
> not really necessary to understand the comment:
>
>          72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;...}
>
>> +
>> +         r123:DI#0-0x1 is param count derived from loop->niter_expr equal to the
>> +         loop iterations, if loop iterations expression doesn't overflow, then
>> +         (zero_extend (r123:DI#0-1))+1 could be simplified to zero_extend only.
>> +       */
>> +      bool simplify_zext = false;
>
> I think it'd be easier to follow if this was split out into
> a subroutine, rather than having the simplify_zext variable.
>
>> +      rtx extop0 = XEXP (count, 0);
>> +      if (GET_CODE (count) == ZERO_EXTEND && GET_CODE (extop0) == PLUS)
>
> This isn't valid: we can only do XEXP (count, 0) *after* checking
> for a ZERO_EXTEND. (It'd be good to test the patch with
> --enable-checking=yes,extra,rtl , which hopefully would have
> caught this.)
>
>> +        {
>> +          rtx addop0 = XEXP (extop0, 0);
>> +          rtx addop1 = XEXP (extop0, 1);
>> +
>> +          int nonoverflow = 0;
>> +          unsigned int_mode
>> +            = GET_MODE_PRECISION (as_a<scalar_int_mode> GET_MODE (addop0));
>
> Heh. I wondered at first how on earth this compiled. It looked like
> there was a missing "(...)" around the GET_MODE. But of course,
> GET_MODE adds its own parentheses, so it all works out. :-)
>
> Please add the "(...)" anyway though. We shouldn't rely on that.
>
> "int_mode" seems a bit of a confusing name, since it's actually a precision
> in bits rather than a mode.
>
>> +          unsigned HOST_WIDE_INT int_mode_max
>> +            = (HOST_WIDE_INT_1U << (int_mode - 1) << 1) - 1;
>> +          if (get_max_loop_iterations (loop, &iterations)
>> +              && wi::ltu_p (iterations, int_mode_max))
>
> You could use GET_MODE_MASK instead of int_mode_max here.
>
> For extra safety, it would be good to add a HWI_COMPUTABLE_P test,
> to make sure that using HWIs is valid.
>
>> +            nonoverflow = 1;
>> +
>> +          if (nonoverflow
>
> Having the nonoverflow variable doesn't seem necessary. We could
> just fuse the two "if" conditions together.
>
>> +              && CONST_SCALAR_INT_P (addop1)
>> +              && GET_MODE_PRECISION (mode) == int_mode * 2
>
> This GET_MODE_PRECISION condition also shouldn't be necessary.
> If we can prove that the subtraction doesn't wrap, we can extend
> to any wider mode, not just to double the width.
>
>> +              && addop1 == GEN_INT (-1))
>
> This can just be:
>
>     addop1 == constm1_rtx
>
> There's then no need for the CONST_SCALAR_INT_P check.
>
> Thanks,
> Richard
>

Thanks for all your great comments. I have addressed them all in the
update below. "--enable-checking=yes,extra,rtl" did catch the ICE,
although it comes with a performance penalty.


This "subtract/extend/add" sequence has existed for a long time and still
annoys us (PR37451, part of PR61837) when converting from 32 bits to 64 bits,
as the ctr register is used as 64 bits on powerpc64. Andrew Pinski had a
patch, but it caused some issues and was reverted by Joseph S. Myers
(PR37451, PR37782).

Andrew:
http://gcc.gnu.org/ml/gcc-patches/2008-09/msg01070.html
http://gcc.gnu.org/ml/gcc-patches/2008-10/msg01321.html
Joseph:
https://gcc.gnu.org/legacy-ml/gcc-patches/2011-11/msg02405.html

We can still simplify "subtract/zero_ext/add" to "zero_ext" when the number
of loop iterations is known to be less than MODE_MAX (that is, we only
simplify when counter+0x1 does not overflow).

Bootstrapped and regression tested on Power8-LE.

gcc/ChangeLog

2020-05-14 Xiong Hu Luo <luoxhu@linux.ibm.com>

        PR rtl-optimization/37451, part of PR target/61837
        * loop-doloop.c (doloop_simplify_count): New function. Simplify
        (add -1; zero_ext; add +1) to zero_ext when not wrapping.
        (doloop_modify): Call doloop_simplify_count.

gcc/testsuite/ChangeLog

2020-05-14 Xiong Hu Luo <luoxhu@linux.ibm.com>

        PR rtl-optimization/37451, part of PR target/61837
        * gcc.target/powerpc/doloop-2.c: New test.
---
 gcc/loop-doloop.c                           | 38 ++++++++++++++++++++-
 gcc/testsuite/gcc.target/powerpc/doloop-2.c | 29 ++++++++++++++++
 2 files changed, 66 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/doloop-2.c

diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
index db6a014e43d..02282d45bd5 100644
--- a/gcc/loop-doloop.c
+++ b/gcc/loop-doloop.c
@@ -397,6 +397,42 @@ add_test (rtx cond, edge *e, basic_block dest)
   return true;
 }
 
+/* Fold (add -1; zero_ext; add +1) operations to zero_ext if not wrapping. i.e:
+
+   73: r145:SI=r123:DI#0-0x1
+   74: r144:DI=zero_extend (r145:SI)
+   75: r143:DI=r144:DI+0x1
+   ...
+   31: r135:CC=cmp (r123:DI,0)
+   72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;...}
+
+   r123:DI#0-0x1 is param count derived from loop->niter_expr equal to number of
+   loop iterations, if loop iterations expression doesn't overflow, then
+   (zero_extend (r123:DI#0-1))+1 can be simplified to zero_extend. */
+
+static rtx
+doloop_simplify_count (class loop *loop, scalar_int_mode mode, rtx count)
+{
+  widest_int iterations;
+  if (GET_CODE (count) == ZERO_EXTEND)
+    {
+      rtx extop0 = XEXP (count, 0);
+      if (GET_CODE (extop0) == PLUS)
+        {
+          rtx addop0 = XEXP (extop0, 0);
+          rtx addop1 = XEXP (extop0, 1);
+
+          if (get_max_loop_iterations (loop, &iterations)
+              && wi::ltu_p (iterations, GET_MODE_MASK (GET_MODE (addop0)))
+              && addop1 == constm1_rtx)
+            return simplify_gen_unary (ZERO_EXTEND, mode, addop0,
+                                       GET_MODE (addop0));
+        }
+    }
+
+  return simplify_gen_binary (PLUS, mode, count, const1_rtx);
+}
+
 /* Modify the loop to use the low-overhead looping insn where LOOP
    describes the loop, DESC describes the number of iterations of the
    loop, and DOLOOP_INSN is the low-overhead looping insn to emit at the
@@ -477,7 +513,7 @@ doloop_modify (class loop *loop, class niter_desc *desc,
     }
 
   if (increment_count)
-    count = simplify_gen_binary (PLUS, mode, count, const1_rtx);
+    count = doloop_simplify_count (loop, mode, count);
 
   /* Insert initialization of the count register into the loop header. */
   start_sequence ();
diff --git a/gcc/testsuite/gcc.target/powerpc/doloop-2.c b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
new file mode 100644
index 00000000000..3199fe56d35
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-unroll-loops" } */
+
+unsigned int
+foo1 (unsigned int l, int *a)
+{
+  unsigned int i;
+  for(i = 0;i < l; i++)
+    a[i] = i;
+  return l;
+}
+
+int
+foo2 (int l, int *a)
+{
+  int i;
+  for(i = 0;i < l; i++)
+    a[i] = i;
+  return l;
+}
+
+/* The place where we were getting an extra -1 is when converting from 32bits
+   to 64bits as the ctr register is used as 64bits on powerpc64. We should be
+   able to do this loop without "add -1/zero_ext/add 1" to the l to get the
+   number of iterations of this loop still doing a do-loop. */
+
+/* { dg-final { scan-assembler-not {(?n)\maddi .*,.*,-1$} } } */
+/* { dg-final { scan-assembler-times "bdnz" 2 } } */
+/* { dg-final { scan-assembler-times "mtctr" 2 } } */
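The "when not wrapping" condition in the patch above can be checked with ordinary fixed-width integers. The following is a minimal C model rather than anything taken from GCC: uint32_t plays the role of SImode, uint64_t the role of DImode, and the function names count_unfolded and count_folded are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of (zero_extend:DI (plus:SI n -1)) + 1, the unsimplified count.  */
static uint64_t
count_unfolded (uint32_t n)
{
  uint32_t narrow = n - 1u;      /* wraps to 0xffffffff when n == 0 */
  return (uint64_t) narrow + 1;  /* zero-extend, then add 1 in 64 bits */
}

/* Model of (zero_extend:DI n), the folded count.  */
static uint64_t
count_folded (uint32_t n)
{
  return (uint64_t) n;
}
```

The two functions agree for every n except 0, where the narrow subtraction wraps and the unfolded form yields 0x100000000 instead of 0. That single case is why the fold needs a guard that rules out wrapping.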
luoxhu <luoxhu@linux.ibm.com> writes:
> This "subtract/extend/add" sequence has existed for a long time and still
> annoys us (PR37451, part of PR61837) when converting from 32 bits to 64 bits,
> as the ctr register is used as 64 bits on powerpc64. Andrew Pinski had a
> patch, but it caused some issues and was reverted by Joseph S. Myers
> (PR37451, PR37782).
>
> Andrew:
> http://gcc.gnu.org/ml/gcc-patches/2008-09/msg01070.html
> http://gcc.gnu.org/ml/gcc-patches/2008-10/msg01321.html
> Joseph:
> https://gcc.gnu.org/legacy-ml/gcc-patches/2011-11/msg02405.html
>
> We can still simplify "subtract/zero_ext/add" to "zero_ext" when the number
> of loop iterations is known to be less than MODE_MAX (that is, we only
> simplify when counter+0x1 does not overflow).
>
> Bootstrapped and regression tested on Power8-LE.
>
> gcc/ChangeLog
>
> 2020-05-14 Xiong Hu Luo <luoxhu@linux.ibm.com>
>
>         PR rtl-optimization/37451, part of PR target/61837
>         * loop-doloop.c (doloop_simplify_count): New function. Simplify
>         (add -1; zero_ext; add +1) to zero_ext when not wrapping.
>         (doloop_modify): Call doloop_simplify_count.
>
> gcc/testsuite/ChangeLog
>
> 2020-05-14 Xiong Hu Luo <luoxhu@linux.ibm.com>
>
>         PR rtl-optimization/37451, part of PR target/61837
>         * gcc.target/powerpc/doloop-2.c: New test.

OK, thanks.

Richard

> ---
>  gcc/loop-doloop.c                           | 38 ++++++++++++++++++++-
>  gcc/testsuite/gcc.target/powerpc/doloop-2.c | 29 ++++++++++++++++
>  2 files changed, 66 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/doloop-2.c
>
> diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
> index db6a014e43d..02282d45bd5 100644
> --- a/gcc/loop-doloop.c
> +++ b/gcc/loop-doloop.c
> @@ -397,6 +397,42 @@ add_test (rtx cond, edge *e, basic_block dest)
>    return true;
>  }
>
> +/* Fold (add -1; zero_ext; add +1) operations to zero_ext if not wrapping. i.e:
> +
> +   73: r145:SI=r123:DI#0-0x1
> +   74: r144:DI=zero_extend (r145:SI)
> +   75: r143:DI=r144:DI+0x1
> +   ...
> +   31: r135:CC=cmp (r123:DI,0)
> +   72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;...}
> +
> +   r123:DI#0-0x1 is param count derived from loop->niter_expr equal to number of
> +   loop iterations, if loop iterations expression doesn't overflow, then
> +   (zero_extend (r123:DI#0-1))+1 can be simplified to zero_extend. */
> +
> +static rtx
> +doloop_simplify_count (class loop *loop, scalar_int_mode mode, rtx count)
> +{
> +  widest_int iterations;
> +  if (GET_CODE (count) == ZERO_EXTEND)
> +    {
> +      rtx extop0 = XEXP (count, 0);
> +      if (GET_CODE (extop0) == PLUS)
> +        {
> +          rtx addop0 = XEXP (extop0, 0);
> +          rtx addop1 = XEXP (extop0, 1);
> +
> +          if (get_max_loop_iterations (loop, &iterations)
> +              && wi::ltu_p (iterations, GET_MODE_MASK (GET_MODE (addop0)))
> +              && addop1 == constm1_rtx)
> +            return simplify_gen_unary (ZERO_EXTEND, mode, addop0,
> +                                       GET_MODE (addop0));
> +        }
> +    }
> +
> +  return simplify_gen_binary (PLUS, mode, count, const1_rtx);
> +}
> +
>  /* Modify the loop to use the low-overhead looping insn where LOOP
>     describes the loop, DESC describes the number of iterations of the
>     loop, and DOLOOP_INSN is the low-overhead looping insn to emit at the
> @@ -477,7 +513,7 @@ doloop_modify (class loop *loop, class niter_desc *desc,
>      }
>
>    if (increment_count)
> -    count = simplify_gen_binary (PLUS, mode, count, const1_rtx);
> +    count = doloop_simplify_count (loop, mode, count);
>
>    /* Insert initialization of the count register into the loop header. */
>    start_sequence ();
> diff --git a/gcc/testsuite/gcc.target/powerpc/doloop-2.c b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
> new file mode 100644
> index 00000000000..3199fe56d35
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-unroll-loops" } */
> +
> +unsigned int
> +foo1 (unsigned int l, int *a)
> +{
> +  unsigned int i;
> +  for(i = 0;i < l; i++)
> +    a[i] = i;
> +  return l;
> +}
> +
> +int
> +foo2 (int l, int *a)
> +{
> +  int i;
> +  for(i = 0;i < l; i++)
> +    a[i] = i;
> +  return l;
> +}
> +
> +/* The place where we were getting an extra -1 is when converting from 32bits
> +   to 64bits as the ctr register is used as 64bits on powerpc64. We should be
> +   able to do this loop without "add -1/zero_ext/add 1" to the l to get the
> +   number of iterations of this loop still doing a do-loop. */
> +
> +/* { dg-final { scan-assembler-not {(?n)\maddi .*,.*,-1$} } } */
> +/* { dg-final { scan-assembler-times "bdnz" 2 } } */
> +/* { dg-final { scan-assembler-times "mtctr" 2 } } */
diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
index db6a014e43d..16372382a22 100644
--- a/gcc/loop-doloop.c
+++ b/gcc/loop-doloop.c
@@ -477,7 +477,51 @@ doloop_modify (class loop *loop, class niter_desc *desc,
     }
 
   if (increment_count)
-    count = simplify_gen_binary (PLUS, mode, count, const1_rtx);
+    {
+      /* Fold (add -1; zero_ext; add +1) operations to zero_ext. i.e:
+
+         73: r145:SI=r123:DI#0-0x1
+         74: r144:DI=zero_extend (r145:SI)
+         75: r143:DI=r144:DI+0x1
+         ...
+         31: r135:CC=cmp (r123:DI,0)
+         72: {pc={(r143:DI!=0x1)?L70:pc};r143:DI=r143:DI-0x1;clobber
+         scratch;clobber scratch;}
+
+         r123:DI#0-0x1 is param count derived from loop->niter_expr equal to the
+         loop iterations, if loop iterations expression doesn't overflow, then
+         (zero_extend (r123:DI#0-1))+1 could be simplified to zero_extend only.
+       */
+      bool simplify_zext = false;
+      rtx extop0 = XEXP (count, 0);
+      if (GET_CODE (count) == ZERO_EXTEND && GET_CODE (extop0) == PLUS)
+        {
+          rtx addop0 = XEXP (extop0, 0);
+          rtx addop1 = XEXP (extop0, 1);
+
+          int nonoverflow = 0;
+          unsigned int_mode
+            = GET_MODE_PRECISION (as_a<scalar_int_mode> GET_MODE (addop0));
+          unsigned HOST_WIDE_INT int_mode_max
+            = (HOST_WIDE_INT_1U << (int_mode - 1) << 1) - 1;
+          if (get_max_loop_iterations (loop, &iterations)
+              && wi::ltu_p (iterations, int_mode_max))
+            nonoverflow = 1;
+
+          if (nonoverflow
+              && CONST_SCALAR_INT_P (addop1)
+              && GET_MODE_PRECISION (mode) == int_mode * 2
+              && addop1 == GEN_INT (-1))
+            {
+              count = simplify_gen_unary (ZERO_EXTEND, mode, addop0,
+                                          GET_MODE (addop0));
+              simplify_zext = true;
+            }
+        }
+
+      if (!simplify_zext)
+        count = simplify_gen_binary (PLUS, mode, count, const1_rtx);
+    }
 
   /* Insert initialization of the count register into the loop header. */
   start_sequence ();
diff --git a/gcc/testsuite/gcc.target/powerpc/doloop-2.c b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
new file mode 100644
index 00000000000..dc8516bb0ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/doloop-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target powerpc*-*-* } } */
+/* { dg-options "-O2 -fno-unroll-loops" } */
+
+int f(int l, int *a)
+{
+  int i;
+  for(i = 0;i < l; i++)
+    a[i] = i;
+  return l;
+}
+
+/* { dg-final { scan-assembler-not "-1" } } */
+/* { dg-final { scan-assembler "bdnz" } } */
+/* { dg-final { scan-assembler-times "mtctr" 1 } } */
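The hunk above reads XEXP (count, 0) before testing GET_CODE (count), which the review flags as invalid. As a toy illustration of the intended ordering (test the code first, then touch the operand), here is a self-contained C sketch. The toy_rtx type, its codes, and the function name are all hypothetical and far simpler than GCC's real rtx representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for an RTL expression node.  */
enum toy_code { TOY_REG, TOY_PLUS, TOY_ZERO_EXTEND };

struct toy_rtx
{
  enum toy_code code;
  struct toy_rtx *op[2];  /* only meaningful for codes that have operands */
};

/* Return the inner PLUS if X is (zero_extend (plus ...)), else NULL.
   The operand of X is read only after its code has been checked.  */
static struct toy_rtx *
zext_of_plus_operand (struct toy_rtx *x)
{
  if (x->code != TOY_ZERO_EXTEND)
    return NULL;
  struct toy_rtx *inner = x->op[0];
  if (inner == NULL || inner->code != TOY_PLUS)
    return NULL;
  return inner;
}
```

Nesting the checks this way mirrors the shape that the review asks for: the outer code is tested on its own, and the operand access sits inside that test.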