Message ID: 007e01d7dc8f$7b87f260$7297d720$@nextmovesoftware.com
State: New
Series: [take,2] ivopts: Improve code generated for very simple loops.
On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> Hi Richard,
> Many thanks for the patch review.
>
> On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> > On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> > <roger@nextmovesoftware.com> wrote:
> > >
> > > This patch tidies up the code that GCC generates for simple loops, by
> > > selecting/generating a simpler loop bound expression in ivopts.
> > > Previously:
> > >
> > > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> > >   int i;
> > >   for (i = 0; i < n; i++) {
> > >     out[i] = in[i];
> > >   }
> > > }
> > >
> > > on x86_64 generated:
> > > v1:     testl   %edx, %edx
> > >         je      .L1
> > >         movl    %edx, %edx
> > >         xorl    %eax, %eax
> > > .L3:    movq    (%rdi,%rax,8), %rcx
> > >         movq    %rcx, (%rsi,%rax,8)
> > >         addq    $1, %rax
> > >         cmpq    %rax, %rdx
> > >         jne     .L3
> > > .L1:    ret
> > >
> > > and now instead generates:
> > > v1:     testl   %edx, %edx
> > >         je      .L1
> > >         movl    %edx, %edx
> > >         xorl    %eax, %eax
> > >         leaq    0(,%rdx,8), %rcx
> > > .L3:    movq    (%rdi,%rax), %rdx
> > >         movq    %rdx, (%rsi,%rax)
> > >         addq    $8, %rax
> > >         cmpq    %rax, %rcx
> > >         jne     .L3
> > > .L1:    ret
> >
> > Is that actually better?  IIRC the addressing modes are both complex and we
> > now have an extra lea?
>
> Technically the induction variable elimination is removing two multiplies by 8
> (or left shifts) from the body of the loop and replacing them with a single
> multiply by 8 prior to the loop, which is exactly the induction variable
> optimization that ivopts is designed to do.  It's true that with x86's complex
> addressing modes these "multiplications" are free, but ivopts is run on all
> targets including those without indexed addressing, and even on x86 there's
> a benefit from shorter instruction encodings.
>
> > For this case I see we generate
> >
> >   _15 = n_10(D) + 4294967295;
> >   _8 = (unsigned long) _15;
> >   _7 = _8 + 1;
> >
> > where n is unsigned int so if we know that n is not zero we can simplify the
> > addition and conveniently the loop header test provides this guarantee.
> > IIRC there were some attempts to enhance match.pd for some cases of such
> > expressions.
>
> Exactly.  The loop optimizers are generating the expression (x-1)+1 and
> assuming that the middle-end will simplify this to x.  Unfortunately, the
> change of modes (designed to prevent overflow) actually makes this
> impossible to simplify, due to concerns about overflow.
>
> > +  /* If AFTER_ADJUST is required, the code below generates the equivalent
> > +   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> > +   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> > +   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
> > +   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> > +   * class for common idioms that we know are safe.  */
> >
> > No '* ' each line.
>
> Doh! Thanks.  Sometimes I hate vi.
>
> > I wonder if the non-overflowing can be captured by
> >
> >   integer_onep (iv->step)
> >   && max_stmt_executions (loop, &max)
>
> Unfortunately, max_stmt_executions is intended to return the wide_int count
> of iterations, either the known constant value or a profile-based estimate, while
> the optimizations I'm proposing work with symbolic values from variable iteration
> counts.  When the iteration count is a constant, fold-const is already able to
> simplify (x-1)+1 to an integer constant even with type conversions.
>
> > if we then do (niter + 1) * step instead of niter*step + step would that do the
> > same?
>
> Yes.
> I've extended the scope of the patch to now also handle loops of the
> form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
> now use (end-beg) as the incremented niter, so the invariant expression now
> becomes (end-beg)*4 instead of the currently generated: ((end-beg)-1)*4 + 4.
>
> I'm assuming that the niter*step + step is by design (for the general case),
> so I'm only tweaking the (common) corner cases, where it's easy to see that
> it's safe to substitute a simpler expression.  For more general affine recurrences
> in complex loop nests, niter*step + step may be preferred.
>
> > That said - what the change does is actually ensure that we CSE niter + 1
> > with the bound of the simplified exit test?
>
> Not quite, this simply provides a simplified expression for "niter + 1" that
> takes advantage of the implicit range information we have.  You're right
> in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
> more generally.  In this case, I suspect it's just a historical artifact that
> much of the literature on (iv) loop optimizations dates back to Fortran
> and Algol where arrays were 1-based rather than 0-based, so decades
> later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction
> for purely historical reasons.
>
> > The patch doesn't add any testcase.
>
> The three new attached tests check that the critical invariants have a
> simpler form, and hopefully shouldn't be affected by whether the optimizer
> and/or backend costs actually decide to perform this iv substitution or not.

The testcases might depend on lp64 though, did you test them with -m32?
IMHO it's fine to require lp64 here.

I'm a bit unsure about adding this special-casing in cand_value_at
in general - it does seem that we're doing sth wrong elsewhere - either by
not simplifying even though enough knowledge is there or by throwing away
knowledge earlier (during niter analysis?).
Anyway, the patch does look quite safe - can you show some statistics on
how many times there's extra simplification this way during, say, bootstrap?

Thanks,
Richard.

> > Thanks,
> > Richard.
>
> This revised patch has been tested on x86_64-pc-linux-gnu with a
> make bootstrap and make -k check with no new failures.
>
> 2021-11-18  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> 	* tree-ssa-loop-ivopts.c (cand_value_at): Take a class
> 	tree_niter_desc* argument instead of just a tree for NITER.
> 	If we require the iv candidate value at the end of the final
> 	loop iteration, try using the original loop bound as the
> 	NITER for sufficiently simple loops.
> 	(may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
> 	* gcc.dg/tree-ssa/ivopts-5.c: New test case.
> 	* gcc.dg/tree-ssa/ivopts-6.c: New test case.
> 	* gcc.dg/tree-ssa/ivopts-7.c: New test case.
> 	* gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>
> Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
> at the start of Stage 3).
>
> Roger
> --
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 4a498ab..8f98595 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5034,28 +5034,57 @@ determine_group_iv_cost_address (struct ivopts_data *data,
   return !sum_cost.infinite_cost_p ();
 }
 
-/* Computes value of candidate CAND at position AT in iteration NITER, and
-   stores it to VAL.  */
+/* Computes value of candidate CAND at position AT in iteration DESC->NITER,
+   and stores it to VAL.  */
 
 static void
-cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at, tree niter,
-	       aff_tree *val)
+cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at,
+	       class tree_niter_desc *desc, aff_tree *val)
 {
   aff_tree step, delta, nit;
   struct iv *iv = cand->iv;
   tree type = TREE_TYPE (iv->base);
+  tree niter = desc->niter;
+  bool after_adjust = stmt_after_increment (loop, cand, at);
   tree steptype;
+
   if (POINTER_TYPE_P (type))
     steptype = sizetype;
   else
     steptype = unsigned_type_for (type);
 
+  /* If AFTER_ADJUST is required, the code below generates the equivalent
+     of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
+     BASE + (NITER + 1) * STEP, especially when NITER is often of the form
+     SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
+     doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
+     class for common idioms that we know are safe.  */
+  if (after_adjust
+      && desc->control.no_overflow
+      && integer_onep (desc->control.step)
+      && (desc->cmp == LT_EXPR
+	  || desc->cmp == NE_EXPR)
+      && TREE_CODE (desc->bound) == SSA_NAME)
+    {
+      if (integer_onep (desc->control.base))
+	{
+	  niter = desc->bound;
+	  after_adjust = false;
+	}
+      else if (TREE_CODE (niter) == MINUS_EXPR
+	       && integer_onep (TREE_OPERAND (niter, 1)))
+	{
+	  niter = TREE_OPERAND (niter, 0);
+	  after_adjust = false;
+	}
+    }
+
   tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &step);
   aff_combination_convert (&step, steptype);
   tree_to_aff_combination (niter, TREE_TYPE (niter), &nit);
   aff_combination_convert (&nit, steptype);
   aff_combination_mult (&nit, &step, &delta);
-  if (stmt_after_increment (loop, cand, at))
+  if (after_adjust)
     aff_combination_add (&delta, &step);
 
   tree_to_aff_combination (iv->base, type, val);
@@ -5406,7 +5435,7 @@ may_eliminate_iv (struct ivopts_data *data,
       return true;
     }
 
-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
 			 aff_combination_to_tree (&bnd));
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
new file mode 100644
index 0000000..7a7661c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val)
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
new file mode 100644
index 0000000..773bfe4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i != sz; i++)
+    if (mem[i] == val)
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
new file mode 100644
index 0000000..3d2de48
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int beg, int end, int val)
+{
+  int i;
+  for (i = beg; i < end; i++)
+    if (mem[i] == val)
+      return &mem[i];
+  return 0;
+}
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) \\(\\(unsigned int\\) end_\[0-9\]\\(D\\) - \\(unsigned int\\) beg_\[0-9\]\\(D\\)\\)" "ivopts" } } */
+
diff --git a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
index a5d953b..706eed8 100644
--- a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
+++ b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { { i?86-*-* x86_64-*-* s390*-*-* } && lp64 } } } */
 /* { dg-options "-O2 -fdump-tree-vrp2-details" } */
-/* { dg-final { scan-tree-dump-times "gimple_simplified to" 4 "vrp2" } } */
+/* { dg-final { scan-tree-dump-times "gimple_simplified to" 1 "vrp2" } } */
 
 void v1 (unsigned long *in, unsigned long *out, unsigned int n) {