diff mbox series

[take,2] ivopts: Improve code generated for very simple loops.

Message ID 007e01d7dc8f$7b87f260$7297d720$@nextmovesoftware.com
State New
Headers show
Series [take,2] ivopts: Improve code generated for very simple loops. | expand

Commit Message

Roger Sayle Nov. 18, 2021, 3:18 p.m. UTC
Hi Richard,
Many thanks for the patch review.

On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >
> > This patch tidies up the code that GCC generates for simple loops, by
> > selecting/generating a simpler loop bound expression in ivopts.
> > Previously:
> >
> > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> >   int i;
> >   for (i = 0; i < n; i++) {
> >     out[i] = in[i];
> >   }
> > }
> >
> > on x86_64 generated:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> > .L3:    movq    (%rdi,%rax,8), %rcx
> >         movq    %rcx, (%rsi,%rax,8)
> >         addq    $1, %rax
> >         cmpq    %rax, %rdx
> >         jne     .L3
> > .L1:    ret
> >
> > and now instead generates:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> >         leaq    0(,%rdx,8), %rcx
> > .L3:    movq    (%rdi,%rax), %rdx
> >         movq    %rdx, (%rsi,%rax)
> >         addq    $8, %rax
> >         cmpq    %rax, %rcx
> >         jne     .L3
> > .L1:    ret
> 
> Is that actually better?  IIRC the addressing modes are both complex and we
> now have an extra lea?


Technically the induction variable elimination is removing two multiplies by 8
(or left shifts) from the body of the loop and replacing them with a single
multiply by 8 prior to the loop, which is exactly the induction variable 
optimization that ivopts is designed to do.  It's true that with x86's complex
addressing modes these "multiplications" are free, but ivopts is run on all
targets including those without indexed addressing, and even on x86 there's
a benefit from shorter instruction encodings.
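As a rough C-level sketch of what the new assembly corresponds to (an
illustration only, not ivopts' internal representation; v1_reduced is a
hypothetical hand-written equivalent), the scaled-index loop is rewritten
to march a byte offset, with the single multiply by 8 hoisted in front:

```c
/* Before: two scaled (index * 8) address computations per iteration.  */
void
v1_indexed (unsigned long *in, unsigned long *out, unsigned int n)
{
  for (unsigned long i = 0; i < n; i++)
    out[i] = in[i];
}

/* After: the multiply by 8 is hoisted (the leaq in the new assembly),
   and the loop steps a byte offset by 8 (the addq $8), so the body
   needs no scaling at all.  */
void
v1_reduced (unsigned long *in, unsigned long *out, unsigned int n)
{
  unsigned long bytes = (unsigned long) n * 8;
  for (unsigned long off = 0; off != bytes; off += 8)
    *(unsigned long *) ((char *) out + off)
      = *(unsigned long *) ((char *) in + off);
}
```

On a target without indexed addressing the second form saves two address
computations per iteration; even with free scaling it saves encoding bytes.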

> For this case I see we generate
>   _15 = n_10(D) + 4294967295;
>   _8 = (unsigned long) _15;
>   _7 = _8 + 1;
> 
> where n is unsigned int so if we know that n is not zero we can simplify the
> addition and conveniently the loop header test provides this guarantee.
> IIRC there were some attempts to enhance match.pd for some cases of such
> expressions.

Exactly.  The loop optimizers are generating the expression (x-1)+1 and
assuming that the middle-end will simplify this to x.  Unfortunately, the
intervening change of type (itself introduced to prevent overflow) makes
this impossible to simplify, as the middle-end can no longer rule out
overflow in the narrower type.
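A minimal sketch of the blocked fold (assuming 32-bit unsigned int and
64-bit unsigned long, as on x86_64; the function name is just for
illustration):

```c
/* The value ivopts computes for niter + 1: the subtraction wraps in the
   narrow type, the widening cast freezes the wrapped value, and the
   trailing + 1 is performed in the wide type.  */
unsigned long
widened_round_trip (unsigned int n)
{
  return (unsigned long) (n - 1) + 1;
}
```

For n == 0 this yields 0x100000000 rather than 0, so without the
loop-header guarantee that n != 0 the fold back to n would be wrong, and
the middle-end rightly refuses it.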

> +  /* If AFTER_ADJUST is required, the code below generates the equivalent
> +   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> +   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> +   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
> +   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> +   * class for common idioms that we know are safe.  */
> 
> No '* ' each line.
Doh!  Thanks.  Sometimes I hate vi.

> I wonder if the non-overflowing can be captured by
>     integer_onep (iv->step)
>     && max_stmt_executions (loop, &max)

Unfortunately, max_stmt_executions is intended to return the wide_int count
of iterations, either the known constant value or a profile-based estimate, while
the optimizations I'm proposing work with symbolic values from variable iteration
counts.  When the iteration count is a constant, fold-const is already able to
simplify (x-1)+1 to an integer constant even with type conversions.

> if we then do (niter + 1) * step instead of niter*step + step would that do the
> same?

Yes.  I've extended the scope of the patch to now also handle loops of the
form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
now use (end-beg) as the incremented niter, so the invariant expression now
becomes (end-beg)*4 instead of the currently generated:  ((end-beg)-1)*4 + 4.

I'm assuming that the niter*step + step is by design (for the general case),
so I'm only tweaking the (common) corner cases, where it's easy to see that
it's safe to substitute a simpler expression.  For more general affine recurrences
in complex loop nests, niter*step + step may be preferred.
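For concreteness, the two bound expressions can be sketched as follows
(beg and end stand in for hypothetical loop bounds of a 4-byte-element
loop; again assumes 32-bit int and 64-bit unsigned long):

```c
/* The currently generated invariant: ((end-beg)-1)*4 + 4.  */
unsigned long
old_bound (int beg, int end)
{
  unsigned int niter = (unsigned int) end - (unsigned int) beg - 1;
  return (unsigned long) niter * 4 + 4;
}

/* The simplified invariant: (end-beg)*4.  */
unsigned long
new_bound (int beg, int end)
{
  unsigned int n = (unsigned int) end - (unsigned int) beg;
  return (unsigned long) n * 4;
}
```

The two agree whenever the loop is actually entered (end > beg), which is
exactly what the loop-header test guarantees; for end == beg they differ
(0x400000000 versus 0), which is why the substitution is only safe in
these guarded cases.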

> That said - what the change does is actually ensure that we CSE niter + 1
> with the bound of the simplified exit test?

Not quite; this simply provides a simplified expression for "niter + 1" that
takes advantage of the implicit range information we have.  You're right
in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
more generally.  In this case, I suspect it's just a historical artifact that
much of the literature on (iv) loop optimizations dates back to Fortran
and Algol, where arrays were 1-based rather than 0-based, so decades
later GCC 11 on x86_64 sometimes requires an extra inc/dec instruction.

> The patch doesn't add any testcase.

The three new attached tests check that the critical invariants have a
simpler form, and hopefully shouldn't be affected by whether the optimizer
and/or backend costs actually decide to perform this iv substitution or not.

> Thanks,
> Richard.


This revised patch has been tested on x86_64-pc-linux-gnu with a
make bootstrap and make -k check with no new failures.

2021-11-18  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* tree-ssa-loop-ivopts.c (cand_value_at): Take a class
	tree_niter_desc* argument instead of just a tree for NITER.
	If we require the iv candidate value at the end of the final
	loop iteration, try using the original loop bound as the
	NITER for sufficiently simple loops.
	(may_eliminate_iv): Update (only) call to cand_value_at.

gcc/testsuite/ChangeLog
	* gcc.dg/tree-ssa/ivopts-5.c: New test case.
	* gcc.dg/tree-ssa/ivopts-6.c: New test case.
	* gcc.dg/tree-ssa/ivopts-7.c: New test case.
	* gcc.dg/wrapped-binop-simplify.c: Update expected test result.


Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
at the start of Stage 3).

Roger
--

Comments

Richard Biener Nov. 23, 2021, 12:46 p.m. UTC | #1
On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [earlier quoted discussion snipped; see the original message above]
>
> The three new attached tests check that the critical invariants have a
> simpler form, and hopefully shouldn't be affected by whether the optimizer
> and/or backend costs actually decide to perform this iv substitution or not.

The testcases might depend on lp64, though; did you test them with -m32?
IMHO it's fine to require lp64 here.

I'm a bit unsure about adding this special-casing in cand_value_at in
general - it does seem that we're doing something wrong elsewhere - either
by not simplifying even though enough knowledge is there, or by
throwing away knowledge earlier (during niter analysis?).

Anyway, the patch does look quite safe - can you show some
statistics on how many times there's extra simplification this way
during, say, a bootstrap?

Thanks,
Richard.

diff mbox series

Patch

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 4a498ab..8f98595 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5034,28 +5034,57 @@  determine_group_iv_cost_address (struct ivopts_data *data,
   return !sum_cost.infinite_cost_p ();
 }
 
-/* Computes value of candidate CAND at position AT in iteration NITER, and
-   stores it to VAL.  */
+/* Computes value of candidate CAND at position AT in iteration DESC->NITER,
+   and stores it to VAL.  */
 
 static void
-cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at, tree niter,
-	       aff_tree *val)
+cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at,
+	       class tree_niter_desc *desc, aff_tree *val)
 {
   aff_tree step, delta, nit;
   struct iv *iv = cand->iv;
   tree type = TREE_TYPE (iv->base);
+  tree niter = desc->niter;
+  bool after_adjust = stmt_after_increment (loop, cand, at);
   tree steptype;
+
   if (POINTER_TYPE_P (type))
     steptype = sizetype;
   else
     steptype = unsigned_type_for (type);
 
+  /* If AFTER_ADJUST is required, the code below generates the equivalent
+     of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
+     BASE + (NITER + 1) * STEP, especially when NITER is often of the form
+     SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
+     doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
+     class for common idioms that we know are safe.  */
+  if (after_adjust
+      && desc->control.no_overflow
+      && integer_onep (desc->control.step)
+      && (desc->cmp == LT_EXPR
+	  || desc->cmp == NE_EXPR)
+      && TREE_CODE (desc->bound) == SSA_NAME)
+    {
+      if (integer_onep (desc->control.base))
+	{
+	  niter = desc->bound;
+	  after_adjust = false;
+	}
+      else if (TREE_CODE (niter) == MINUS_EXPR
+	       && integer_onep (TREE_OPERAND (niter, 1)))
+	{
+	  niter = TREE_OPERAND (niter, 0);
+	  after_adjust = false;
+	}
+    }
+
   tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &step);
   aff_combination_convert (&step, steptype);
   tree_to_aff_combination (niter, TREE_TYPE (niter), &nit);
   aff_combination_convert (&nit, steptype);
   aff_combination_mult (&nit, &step, &delta);
-  if (stmt_after_increment (loop, cand, at))
+  if (after_adjust)
     aff_combination_add (&delta, &step);
 
   tree_to_aff_combination (iv->base, type, val);
@@ -5406,7 +5435,7 @@  may_eliminate_iv (struct ivopts_data *data,
       return true;
     }
 
-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
 			 aff_combination_to_tree (&bnd));
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
new file mode 100644
index 0000000..7a7661c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
new file mode 100644
index 0000000..773bfe4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i != sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
new file mode 100644
index 0000000..3d2de48
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int beg, int end, int val)
+{
+  int i;
+  for (i = beg; i < end; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) \\(\\(unsigned int\\) end_\[0-9\]\\(D\\) - \\(unsigned int\\) beg_\[0-9\]\\(D\\)\\)" "ivopts" } } */
+
diff --git a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
index a5d953b..706eed8 100644
--- a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
+++ b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile { target { { i?86-*-* x86_64-*-* s390*-*-* } && lp64 } } } */
 /* { dg-options "-O2 -fdump-tree-vrp2-details" } */
-/* { dg-final { scan-tree-dump-times "gimple_simplified to" 4 "vrp2" } } */
+/* { dg-final { scan-tree-dump-times "gimple_simplified to" 1 "vrp2" } } */
 
 void v1 (unsigned long *in, unsigned long *out, unsigned int n)
 {