diff mbox series

[take,3] ivopts: Improve code generated for very simple loops.

Message ID 052601d7e228$af71a9b0$0e54fd10$@nextmovesoftware.com
State New
Headers show
Series [take,3] ivopts: Improve code generated for very simple loops. | expand

Commit Message

Roger Sayle Nov. 25, 2021, 6:17 p.m. UTC
On Tue, Nov 23, 2021 at 12:46PM Richard Biener < richard.guenther@gmail.com> wrote:
> On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > > The patch doesn't add any testcase.
> >
> > The three new attached tests check that the critical invariants have a
> > simpler form, and hopefully shouldn't be affected by whether the
> > optimizer and/or backend costs actually decide to perform this iv substitution
> or not.
> 
> The testcases might depend on lp64 though, did you test them with -m32?
> IMHO it's fine to require lp64 here.

Great catch.  You're right that when the loop index has the same precision as the
target's pointer, that fold is (already) able to simplify the ((EXPR)-1)+1, so that with
-m32 my new tests ivopts-[567].c fail.  I've added "require lp64" to those tests, but
I've also added two more tests, using char and unsigned char for the loop expression,
which are optimized on both ilp32 and lp64.

For example, with -O2 -m32, we see the following improvements in ivopts-8.c:
diff ivopts-8.old.s ivopts-8.new.s
14,16c14,15
<       subl    $1, %ecx
<       movzbl  %cl, %ecx
<       leal    4(%eax,%ecx,4), %ecx
---
>       movsbl  %cl, %ecx
>       leal    (%eax,%ecx,4), %ecx

This might also explain why GCC currently generates sub-optimal code.  Back when
ivopts was written, most folks were on i686, so the generated code was optimal.
But with the transition to x86_64, the code is correct, just slightly less efficient.

> I'm a bit unsure about adding this special-casing in cand_value_at in general - it
> does seem that we're doing sth wrong elsewhere - either by not simplifying even
> though enough knowledge is there or by throwing away knowledge earlier
> (during niter analysis?).

I agree this approach is a bit ugly.  Conceptually, an alternative might be to avoid
throwing away knowledge earlier, during niter analysis, by adding an extra tree field
to the tree_niter_desc structure, so that it returns both niter0 (the iteration count
at the top of the loop) and niter1 (the iteration count at the bottom of the loop),
so that later passes (cand_value_at) can use the tree that's relevant.  Alas, this too is
ugly, and inefficient as we're creating/folding trees that may never be used/useful.
A compromise might be to add an enum field describing how the niter was
calculated to tree_niter_desc, and this can be inspected/used by cand_value_at.
The current patch figures this out by examining the other fields already in
tree_niter_desc.


> Anyway, the patch does look quite safe - can you show some statistics in how
> many time there's extra simplification this way during say bootstrap?

Certainly.  During stage2 and stage3 of a bootstrap on x86_64-pc-linux-gnu,
cand_value_at is called 500657 times.  The majority of calls,
447607 (89.4%), request the value at the end of the loop (after_adjust),
while 53050 (10.6%) request the value at the start of the loop.
102437 calls (20.5%) are optimized by clause 1 [0..N loops]
27939 calls (5.6%) are optimized by clause 2 [beg..end loops]

Looking for opportunities to improve things further, I see that
319608 calls (63.8%) have a LT_EXPR exit test.
160965 calls (32.2%) have a NE_EXPR exit test.
20084 calls (4.0%) have a GT_EXPR exit test.
so handling descending loops wouldn’t be a big win.
I'll investigate whether (constant) step sizes other than 1 are
(i) sufficiently common and (ii) benefit from improved folding.


This revised patch has been test on x86_64-pc-linux-gnu with a
make bootstrap and make -k check, both with and without
--target-board=unix{-m32}, with no new failures.
Ok for mainline?

2021-11-25  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* tree-ssa-loop-ivopts.c (cand_value_at): Take a class
	tree_niter_desc* argument instead of just a tree for NITER.
	If we require the iv candidate value at the end of the final
	loop iteration, try using the original loop bound as the
	NITER for sufficiently simple loops.
	(may_eliminate_iv): Update (only) call to cand_value_at.

gcc/testsuite
	* gcc.dg/wrapped-binop-simplify.c: Update expected test result.
	* gcc.dg/tree-ssa/ivopts-5.c: New test case.
	* gcc.dg/tree-ssa/ivopts-6.c: New test case.
	* gcc.dg/tree-ssa/ivopts-7.c: New test case.
	* gcc.dg/tree-ssa/ivopts-8.c: New test case.
	* gcc.dg/tree-ssa/ivopts-9.c: New test case.


Roger
--

Comments

Richard Biener Nov. 26, 2021, 7:36 a.m. UTC | #1
On Thu, Nov 25, 2021 at 7:17 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> On Tue, Nov 23, 2021 at 12:46PM Richard Biener < richard.guenther@gmail.com> wrote:
> > On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > > The patch doesn't add any testcase.
> > >
> > > The three new attached tests check that the critical invariants have a
> > > simpler form, and hopefully shouldn't be affected by whether the
> > > optimizer and/or backend costs actually decide to perform this iv substitution
> > or not.
> >
> > The testcases might depend on lp64 though, did you test them with -m32?
> > IMHO it's fine to require lp64 here.
>
> Great catch.  You're right that when the loop index has the same precision as the
> target's pointer, that fold is (already) able to simplify the ((EXPR)-1)+1, so that with
> -m32 my new tests ivopts-[567].c fail.  I've added "require lp64" to those tests, but
> I've also added two more tests, using char and unsigned char for the loop expression,
> which are optimized on both ilp32 and lp64.
>
> For example, with -O2 -m32, we see the following improvements in ivopts-8.c:
> diff ivopts-8.old.s ivopts-8.new.s
> 14,16c14,15
> <       subl    $1, %ecx
> <       movzbl  %cl, %ecx
> <       leal    4(%eax,%ecx,4), %ecx
> ---
> >       movsbl  %cl, %ecx
> >       leal    (%eax,%ecx,4), %ecx
>
> This might also explain why GCC currently generates sub-optimal code.  Back when
> ivopts was written, most folks were on i686, so the generated code was optimal.
> But with the transition to x86_64, the code is correct, just slightly less efficient.
>
> > I'm a bit unsure about adding this special-casing in cand_value_at in general - it
> > does seem that we're doing sth wrong elsewhere - either by not simplifying even
> > though enough knowledge is there or by throwing away knowledge earlier
> > (during niter analysis?).
>
> I agree this approach is a bit ugly.  Conceptually, an alternative might be to avoid
> throwing away knowledge earlier, during niter analysis, by adding an extra tree field
> to the tree_niter_desc structure, so that it returns both niter0 (the iteration count
> at the top of the loop) and niter1 (the iteration count at the bottom of the loop),
> so that later passes (cand_value_at) can use the tree that's relevant.

Yes, I also thought of this but I wasn't sure we always have that.  I also
wouldn't think of it as too ugly, but well ... it would definitely be useful
elsewhere.  Btw, it's generally the number of executions of the latch vs.
the number of executions of the header - currently what niter analysis
computes is the number of executions of the latch.  There are loops where
the number of iterations of the header is not representable in the IV.

>  Alas, this too is
> ugly, and inefficient as we're creating/folding trees that may never be used/useful.
> A compromise might be to add an enum field describing how the niter was
> calculated to tree_niter_desc, and this can be inspected/used by cand_value_at.
> The current patch figures this out by examining the other fields already in
> tree_niter_desc.
>
>
> > Anyway, the patch does look quite safe - can you show some statistics in how
> > many time there's extra simplification this way during say bootstrap?
>
> Certainly.  During stage2 and stage3 of a bootstrap on x86_64-pc-linux-gnu,
> cand_value_at is called 500657 times.  The majority of calls,
> 447607 (89.4%), request the value at the end of the loop (after_adjust),
> while 53050 (10.6%) request the value at the start of the loop.
> 102437 calls (20.5%) are optimized by clause 1 [0..N loops]
> 27939 calls (5.6%) are optimized by clause 2 [beg..end loops]

Thanks for the detailed data.

> Looking for opportunities to improve things further, I see that
> 319608 calls (63.8%) have a LT_EXPR exit test.
> 160965 calls (32.2%) have a NE_EXPR exit test.
> 20084 calls (4.0%) have a GT_EXPR exit test.
> so handling descending loops wouldn’t be a big win.
> I'll investigate whether (constant) step sizes other than 1 are
> (i) sufficiently common and (ii) benefit from improved folding.
>
>
> This revised patch has been test on x86_64-pc-linux-gnu with a
> make bootstrap and make -k check, both with and without
> --target-board=unix{-m32}, with no new failures.
> Ok for mainline?

OK.

Thanks,
Richard.

> 2021-11-25  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * tree-ssa-loop-ivopts.c (cand_value_at): Take a class
>         tree_niter_desc* argument instead of just a tree for NITER.
>         If we require the iv candidate value at the end of the final
>         loop iteration, try using the original loop bound as the
>         NITER for sufficiently simple loops.
>         (may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
>         * gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>         * gcc.dg/tree-ssa/ivopts-5.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-6.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-7.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-8.c: New test case.
>         * gcc.dg/tree-ssa/ivopts-9.c: New test case.
>
>
> Roger
> --
>
diff mbox series

Patch

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 4769b65..067f823 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5030,28 +5030,57 @@  determine_group_iv_cost_address (struct ivopts_data *data,
   return !sum_cost.infinite_cost_p ();
 }
 
-/* Computes value of candidate CAND at position AT in iteration NITER, and
-   stores it to VAL.  */
+/* Computes value of candidate CAND at position AT in iteration DESC->NITER,
+   and stores it to VAL.  */
 
 static void
-cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at, tree niter,
-	       aff_tree *val)
+cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at,
+	       class tree_niter_desc *desc, aff_tree *val)
 {
   aff_tree step, delta, nit;
   struct iv *iv = cand->iv;
   tree type = TREE_TYPE (iv->base);
+  tree niter = desc->niter;
+  bool after_adjust = stmt_after_increment (loop, cand, at);
   tree steptype;
+
   if (POINTER_TYPE_P (type))
     steptype = sizetype;
   else
     steptype = unsigned_type_for (type);
 
+  /* If AFTER_ADJUST is required, the code below generates the equivalent
+     of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
+     BASE + (NITER + 1) * STEP, especially when NITER is often of the form
+     SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
+     doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
+     class for common idioms that we know are safe.  */
+  if (after_adjust
+      && desc->control.no_overflow
+      && integer_onep (desc->control.step)
+      && (desc->cmp == LT_EXPR
+	  || desc->cmp == NE_EXPR)
+      && TREE_CODE (desc->bound) == SSA_NAME)
+    {
+      if (integer_onep (desc->control.base))
+	{
+	  niter = desc->bound;
+	  after_adjust = false;
+	}
+      else if (TREE_CODE (niter) == MINUS_EXPR
+	       && integer_onep (TREE_OPERAND (niter, 1)))
+	{
+	  niter = TREE_OPERAND (niter, 0);
+	  after_adjust = false;
+	}
+    }
+
   tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &step);
   aff_combination_convert (&step, steptype);
   tree_to_aff_combination (niter, TREE_TYPE (niter), &nit);
   aff_combination_convert (&nit, steptype);
   aff_combination_mult (&nit, &step, &delta);
-  if (stmt_after_increment (loop, cand, at))
+  if (after_adjust)
     aff_combination_add (&delta, &step);
 
   tree_to_aff_combination (iv->base, type, val);
@@ -5402,7 +5431,7 @@  may_eliminate_iv (struct ivopts_data *data,
       return true;
     }
 
-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
 			 aff_combination_to_tree (&bnd));
diff --git a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
index a5d953b..706eed8 100644
--- a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
+++ b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile { target { { i?86-*-* x86_64-*-* s390*-*-* } && lp64 } } } */
 /* { dg-options "-O2 -fdump-tree-vrp2-details" } */
-/* { dg-final { scan-tree-dump-times "gimple_simplified to" 4 "vrp2" } } */
+/* { dg-final { scan-tree-dump-times "gimple_simplified to" 1 "vrp2" } } */
 
 void v1 (unsigned long *in, unsigned long *out, unsigned int n)
 {
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
new file mode 100644
index 0000000..a6af497
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
new file mode 100644
index 0000000..8383154
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i != sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
new file mode 100644
index 0000000..44f5603
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int beg, int end, int val)
+{
+  int i;
+  for (i = beg; i < end; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) \\(\\(unsigned int\\) end_\[0-9\]\\(D\\) - \\(unsigned int\\) beg_\[0-9\]\\(D\\)\\)" "ivopts" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-8.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-8.c
new file mode 100644
index 0000000..9e89a03
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-8.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, char sz, int val)
+{
+  char i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]*\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]*\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-9.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-9.c
new file mode 100644
index 0000000..22b4a12
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-9.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, unsigned char sz, int val)
+{
+  unsigned char i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */