diff mbox

[P2,PR,tree-optimization/68398] Refine when we allow the FSM threader to create irreducible inner loops

Message ID 56A9188F.1080803@redhat.com
State New
Headers show

Commit Message

Jeff Law Jan. 27, 2016, 7:20 p.m. UTC
Right now we only allow the FSM threader to create an irreducible inner 
loop when it's able to eliminate a multiway branch.  The theory being 
that the multiway branch is very expensive and potentially outweighs the 
value of loop optimizations.

That was a fine start, but as seen in 68398, further refinement is 
desirable.

Essentially in the 68398 case, we want to thread through the latch to a 
point in the loop that does not dominate the latch -- thus creating an 
irreducible inner loop.  But in the 68398 case, we aren't eliminating a 
multiway branch, just a simple conditional branch.

I stared at the jump threading paths, cfg and overall code for a long 
time over the last few days pondering what characteristics are important 
here.  In the end it seems so simple.

If the path has a high number of blocks relative to the number of 
statements in the path, then the path (and subsequent irreducible loop) 
really isn't likely to be helped by loop optimizations to begin with -- 
essentially the jump thread path is just a series of conditional 
branches where we know the result of the last one -- and there's very 
little computation going on in the path.

So that's what I'm keying on -- the ratio of blocks to statements in the 
path.   The higher that ratio, the more inclined we are to allow 
creation of the irreducible inner loop.

FWIW, I did look at simply allowing irreducible loops after the loop 
optimizers were complete -- that doesn't resolve the issues here. We 
need it to happen early so that the rest of the optimization pipeline 
(including dom, vrp2, dom2) see the jump-threaded code and refine it 
further.

We may still want to make that kind of change, but I'd like to see code 
that benefits from such a change rather than blindly going forward with it.

Going back to that ratio of blocks to statements.  We were counting PHIs 
in the thread path against the statement count.  Except for the final 
block, those PHIs are going to be const/copy propagated away.  So the 
code to count the PHIs was moved out of the loop and just looks at PHIs 
in the final block in the path.

Second we want to be able to compare the number of statements and the 
number of blocks in a jump threading path.  In particular we'd like to 
look at ratios between the two.  So there's two controlling PARAMS to 
scale the raw numbers, which then makes it easier to build ratios.  So 
we can say that once the number of blocks exceeds the number of 
statements by 1.5X, then allow creation of the irreducible loop (that's 
done by scaling the blocks by 3 and statements by 2.

Testing.  For the testcase in 68398, this improves things by another 
percent or two in terms of raw instruction counts.  I get something like 
206k-207k instruction references.

For pr66752-3.c the heuristic fires and we handle all 3 jump threads 
during VRP1 and obviously we continue to collapse away all 3 tests of FLAG.

For vrp46.c, ssa-dom-thread-2{c,d}.c I added statements on the jump 
threading path so that the heuristic wouldn't fire to preserve the 
spirit of those tests.

ssa-dom-thread-2{h,g}.c new tests, copied from the original 
ssa-dom-thread-2{c,d} which show the heuristic firing.

ssa-dom-thread-7.c does more threading early as the heuristic fires 
there.  So its expected output was adjusted.


Bootstrapped and regression tested on x86_64 linux.  Installing on the 
trunk.

This also fixes 3 minor issues Bernd spotted in the last round of changes.

Jeff
commit a5b3ed5d29ceb18b9e257894661f328462f581c8
Author: law <law@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Wed Jan 27 19:19:47 2016 +0000

    	PR tree-optimization/68398
    	* params.def (PARAM_FSM_SCALE_PATH_STMTS): New parameter.
    	(PARAM_FSM_SCALE_PATH_BLOCKS): Likewise.
    	* tree-ssa-threadbackward.c (fsm_find_control_statement_thread_paths):
    	Only count PHIs in the last block in the path.  The others will
    	const/copy propagate away.  Add heuristic to allow more irreducible
    	subloops to be created when it is likely profitable to do so.
    
    	* tree-ssa-threadbackward.c (fsm_find_control_statement_thread_paths):
    	Fix typo in comment.  Use gsi_after_labels and remove the GIMPLE_LABEL
    	check from within the loop.  Use gsi_next_nondebug rather than gsi_next.
    
    	PR tree-optimization/68398
    	* gcc.dg/tree-ssa/pr66752-3.c: Update expected output.
    	* gcc.dg/tree-ssa/ssa-dom-thread-2c.c: Add extra statements on thread
    	path to avoid new heuristic allowing more irreducible regions
    	* gcc.dg/tree-ssa/ssa-dom-thread-2d.c: Likewise.
    	* gcc.dg/tree-ssa/vrp46.c: Likewise.
    	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update expected output.
    	* gcc.dg/tree-ssa/ssa-dom-thread-2g.c: New test.
    	* gcc.dg/tree-ssa/ssa-dom-thread-2h.c: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@232897 138bc75d-0d04-0410-961f-82ee72b054a4

Comments

Bernhard Reutner-Fischer Jan. 27, 2016, 7:39 p.m. UTC | #1
On January 27, 2016 8:20:47 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:

>
>This also fixes 3 minor issues Bernd spotted in the last round of
>changes.

Bernhard, not Bernd i suppose. Note that the !is_gimple_debug (stmt) should now be redundant in the hunk below.
Thanks,

@@ -280,33 +280,19 @@ fsm_find_control_statement_thread_paths (tree name,
 		  break;
 		}
 
-	      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	      for (gsi = gsi_after_labels (bb);
+		   !gsi_end_p (gsi);
+		   gsi_next_nondebug (&gsi))
 		{
 		  gimple *stmt = gsi_stmt (gsi);
 		  /* Do not count empty statements and labels.  */
 		  if (gimple_code (stmt) != GIMPLE_NOP
-		      && gimple_code (stmt) != GIMPLE_LABEL
 		      && !(gimple_code (stmt) == GIMPLE_ASSIGN
 			   && gimple_assign_rhs_code (stmt) == ASSERT_EXPR)
 		      && !is_gimple_debug (stmt))
 		    ++n_insns;
 		}

>
>Jeff
Jeff Law Jan. 27, 2016, 7:48 p.m. UTC | #2
On 01/27/2016 12:39 PM, Bernhard Reutner-Fischer wrote:
> On January 27, 2016 8:20:47 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>
>>
>> This also fixes 3 minor issues Bernd spotted in the last round of
>> changes.
>
> Bernhard, not Bernd i suppose. Note that the !is_gimple_debug (stmt) should now be redundant in the hunk below.
> Thanks,
Terribly sorry.  Yes, it should have been Bernhard.

The is_gimple_debug is not redundant -- the first statement in the block 
could be a debug statement.  If we had a gsi-first-non-debug-non-label 
function, we could use that and then the is_gimple_debug would be redundant

jeff
Jakub Jelinek Jan. 27, 2016, 7:52 p.m. UTC | #3
On Wed, Jan 27, 2016 at 12:48:31PM -0700, Jeff Law wrote:
> On 01/27/2016 12:39 PM, Bernhard Reutner-Fischer wrote:
> >On January 27, 2016 8:20:47 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
> >
> >>
> >>This also fixes 3 minor issues Bernd spotted in the last round of
> >>changes.
> >
> >Bernhard, not Bernd i suppose. Note that the !is_gimple_debug (stmt) should now be redundant in the hunk below.
> >Thanks,
> Terribly sorry.  Yes, it should have been Bernhard.
> 
> The is_gimple_debug is not redundant -- the first statement in the block
> could be a debug statement.  If we had a gsi-first-non-debug-non-label
> function, we could use that and then the is_gimple_debug would be redundant

But then using gsi_next_nondebug doesn't make much sense.

	Jakub
diff mbox

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 1973060..76c7af2 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@ 
+2016-01-27  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/68398
+	PR tree-optimization/69196
+	* params.def (PARAM_FSM_SCALE_PATH_STMTS): New parameter.
+	(PARAM_FSM_SCALE_PATH_BLOCKS): Likewise.
+	* tree-ssa-threadbackward.c (fsm_find_control_statement_thread_paths):
+	Only count PHIs in the last block in the path.  The others will
+	const/copy propagate away.  Add heuristic to allow more irreducible
+	subloops to be created when it is likely profitable to do so.
+
+	* tree-ssa-threadbackward.c (fsm_find_control_statement_thread_paths):
+	Fix typo in comment.  Use gsi_after_labels and remove the GIMPLE_LABEL
+	check from within the loop.  Use gsi_next_nondebug rather than gsi_next.
+
 2016-01-27  Jakub Jelinek  <jakub@redhat.com>
 
 	PR lto/69254
diff --git a/gcc/params.def b/gcc/params.def
index 88971c7..0722ad7 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1145,6 +1145,16 @@  DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE,
 	  "constructor generated by Pointer Bounds Checker.",
 	  5000, 100, 0)
 
+DEFPARAM (PARAM_FSM_SCALE_PATH_STMTS,
+	  "fsm-scale-path-stmts",
+	  "Scale factor to apply to the number of statements in a threading path when comparing to the number of (scaled) blocks.",
+	  2, 1, 10)
+
+DEFPARAM (PARAM_FSM_SCALE_PATH_BLOCKS,
+	  "fsm-scale-path-blocks",
+	  "Scale factor to apply to the number of blocks in a threading path when comparing to the number of (scaled) statements.",
+	  3, 1, 10)
+
 DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS,
 	  "max-fsm-thread-path-insns",
 	  "Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path.",
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 2cad0c2..22a124a 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,16 @@ 
+2016-01-25  Jeff Law  <law@redhat.com>
+
+	PR tree-optimization/68398
+        PR tree-optimization/69196
+	* gcc.dg/tree-ssa/pr66752-3.c: Update expected output.
+	* gcc.dg/tree-ssa/ssa-dom-thread-2c.c: Add extra statements on thread
+	path to avoid new heuristic allowing more irreducible regions
+	* gcc.dg/tree-ssa/ssa-dom-thread-2d.c: Likewise.
+	* gcc.dg/tree-ssa/vrp46.c: Likewise.
+	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update expected output.
+	* gcc.dg/tree-ssa/ssa-dom-thread-2g.c: New test.
+	* gcc.dg/tree-ssa/ssa-dom-thread-2h.c: Likewise.
+
 2016-01-27  Marek Polacek  <polacek@redhat.com>
 
 	PR c/68062
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index 6eeaca5..2949cbb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -32,10 +32,9 @@  foo (int N, int c, int b, int *a)
    pt--;
 }
 
-/* There are 3 FSM jump threading opportunities, one of which will
-   get filtered.  */
-/* { dg-final { scan-tree-dump-times "Registering FSM" 2 "vrp1"} } */
-/* { dg-final { scan-tree-dump-times "FSM would create irreducible loop" 1 "vrp1"} } */
+/* There are 3 FSM jump threading opportunities, all of which will be
+   realized, which will eliminate testing of FLAG, completely.  */
+/* { dg-final { scan-tree-dump-times "Registering FSM" 3 "vrp1"} } */
 
 /* There should be no assignments or references to FLAG.  */
 /* { dg-final { scan-tree-dump-not "flag" "optimized"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2c.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2c.c
index f3e37bd..f914676 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2c.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2c.c
@@ -1,4 +1,4 @@ 
-/* { dg-do compile } */ 
+/* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats" } */
 
 void foo();
@@ -15,6 +15,9 @@  void dont_thread_1 (void)
 
   do
     {
+      bla ();
+      bla ();
+      bla ();
       if (first)
 	foo ();
       else
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2d.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2d.c
index 9b4637a..a04aabf 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2d.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2d.c
@@ -1,4 +1,4 @@ 
-/* { dg-do compile } */ 
+/* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats" } */
 
 void foo();
@@ -13,6 +13,9 @@  void dont_thread_2 (int first)
 
   do
     {
+      bla ();
+      bla ();
+      bla ();
       if (first)
 	foo ();
       else
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2g.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2g.c
new file mode 100644
index 0000000..6d1ff5d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2g.c
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats" } */
+
+void foo();
+void bla();
+void bar();
+
+void dont_thread_1 (void)
+{
+  int i = 0;
+  int first = 1;
+
+  do
+    {
+      if (first)
+	foo ();
+      else
+	bar ();
+
+      first = 0;
+      bla ();
+    } while (i++ < 100);
+}
+
+/* { dg-final { scan-tree-dump "Jumps threaded: 2" "vrp1"} } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 1" "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2h.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2h.c
new file mode 100644
index 0000000..61705e1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2h.c
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats" } */
+
+void foo();
+void bla();
+void bar();
+
+/* Avoid threading in the following case, to prevent creating subloops.  */
+
+void dont_thread_2 (int first)
+{
+  int i = 0;
+
+  do
+    {
+      if (first)
+	foo ();
+      else
+	bar ();
+
+      first = 0;
+      bla ();
+    } while (i++ < 100);
+}
+
+/* Peeling off the first iteration would make threading through
+   the loop latch safe, but we don't do that currently.  */
+/* { dg-final { scan-tree-dump "Jumps threaded: 1" "vrp1"} } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 1" "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 93c767c..a7a737b 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,9 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats -fdump-tree-dom3-stats" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 7"  "vrp1" } } */
+/* { dg-options "-O2 -fdump-tree-vrp1-stats -fdump-tree-dom2-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats" } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 19"  "vrp1" } } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 12" "dom2" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 3"  "dom3" } } */
+/* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
+/* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */
 
 enum STATE {
   S0=0,
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp46.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp46.c
index 64c65d0..8923eb4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/vrp46.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp46.c
@@ -12,6 +12,8 @@  func_18 ( int t )
   for (0; 1; ++l_889)
     {
       int t1 = 0;
+      func_98 (0);
+      func_98 (0);
       if (func_81 (1))
 	{
 	  int rhs = l_895;
diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 131630e..735009c 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -266,7 +266,7 @@  fsm_find_control_statement_thread_paths (tree name,
 	  basic_block bb = (*path)[j];
 
 	  /* Remember, blocks in the path are stored in opposite order
-	     in the PATH array.  The last entry in the array reprensents
+	     in the PATH array.  The last entry in the array represents
 	     the block with an outgoing edge that we will redirect to the
 	     jump threading path.  Thus we don't care about that block's
 	     loop father, nor how many statements are in that block because
@@ -280,33 +280,19 @@  fsm_find_control_statement_thread_paths (tree name,
 		  break;
 		}
 
-	      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	      for (gsi = gsi_after_labels (bb);
+		   !gsi_end_p (gsi);
+		   gsi_next_nondebug (&gsi))
 		{
 		  gimple *stmt = gsi_stmt (gsi);
 		  /* Do not count empty statements and labels.  */
 		  if (gimple_code (stmt) != GIMPLE_NOP
-		      && gimple_code (stmt) != GIMPLE_LABEL
 		      && !(gimple_code (stmt) == GIMPLE_ASSIGN
 			   && gimple_assign_rhs_code (stmt) == ASSERT_EXPR)
 		      && !is_gimple_debug (stmt))
 		    ++n_insns;
 		}
 
-	      gphi_iterator gsip;
-	      for (gsip = gsi_start_phis (bb);
-		   !gsi_end_p (gsip);
-		   gsi_next (&gsip))
-		{
-		  gphi *phi = gsip.phi ();
-		  tree dst = gimple_phi_result (phi);
-
-		  /* We consider any non-virtual PHI as a statement since it
-		     count result in a constant assignment or copy
-		     operation.  */
-		  if (!virtual_operand_p (dst))
-		    ++n_insns;
-		}
-
 	      /* We do not look at the block with the threaded branch
 		 in this loop.  So if any block with a last statement that
 		 is a GIMPLE_SWITCH or GIMPLE_GOTO is seen, then we have a
@@ -360,6 +346,24 @@  fsm_find_control_statement_thread_paths (tree name,
 	      == DOMST_NONDOMINATING))
 	creates_irreducible_loop = true;
 
+      /* PHIs in the final target and only the final target will need
+	 to be duplicated.  So only count those against the number
+	 of statements.  */
+      gphi_iterator gsip;
+      for (gsip = gsi_start_phis (taken_edge->dest);
+	   !gsi_end_p (gsip);
+	   gsi_next (&gsip))
+	{
+	  gphi *phi = gsip.phi ();
+	  tree dst = gimple_phi_result (phi);
+
+	  /* We consider any non-virtual PHI as a statement since it
+	     count result in a constant assignment or copy
+	     operation.  */
+	  if (!virtual_operand_p (dst))
+	    ++n_insns;
+	}
+
       if (path_crosses_loops)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
@@ -379,10 +383,18 @@  fsm_find_control_statement_thread_paths (tree name,
 	  continue;
 	}
 
-      /* We avoid creating irreducible loops unless we thread through
+      /* We avoid creating irreducible inner loops unless we thread through
 	 a multiway branch, in which case we have deemed it worth losing other
-	 loop optimizations later.  */
-      if (!threaded_multiway_branch && creates_irreducible_loop)
+	 loop optimizations later.
+
+	 We also consider it worth creating an irreducible inner loop if
+	 the number of copied statement is low relative to the length of
+	 the path -- in that case there's little the traditional loop optimizer
+	 would have done anyway, so an irreducible loop is not so bad.  */
+      if (!threaded_multiway_branch && creates_irreducible_loop
+	  && (n_insns * PARAM_VALUE (PARAM_FSM_SCALE_PATH_STMTS)
+	      > path_length * PARAM_VALUE (PARAM_FSM_SCALE_PATH_BLOCKS)))
+
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file,