[tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).

Message ID 1371647944-9788-1-git-send-email-james.greenhalgh@arm.com
State New

Commit Message

James Greenhalgh June 19, 2013, 1:19 p.m. UTC
> -----Original Message-----
> From: Steve Ellcey [mailto:sellcey@mips.com]
> Sent: 14 June 2013 19:07
>
> With my version the compiler calls gimple_duplicate_sese_region from
> duplicate_blocks 60 times.  With your patch it calls
> gimple_duplicate_sese_region from duplicate_thread_path 16 times.
>

Hi Steve,

You are quite right. With -finline-limit=1000 I see
a big difference in performance. The cause of this is
the code added to
tree-ssa-threadedge.c (simplify_control_stmt_condition).

If we have failed to simplify to a gimple_min_invariant,
we want to look for thread paths to the value given by
gimple_goto_dest, rather than the SSA_NAME_VALUE of that
value.

This improves the performance on my x86_64 toolchain to the
same level as your patch. I see "Registered 20 jump paths"
printed 3 times in dom1, for a total of 60 thread paths.
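
As a reminder, the shape we are chasing is the FSA-style dispatch loop
from PR 54742 / CoreMark; a hand-reduced illustration (not code from
the patch or from the benchmark):

  int
  run_fsa (const int *in, int n)
  {
    int state = 0, seen = 0;
    for (int i = 0; i < n; i++)
      switch (state)  /* The condition is a PHI of the constants below.  */
        {
        case 0:
          state = in[i] ? 1 : 2;  /* Constant assignments mean...  */
          break;
        case 1:
          seen++;
          state = 2;
          break;
        default:
          state = 0;  /* ...each case can thread straight to the next.  */
          break;
        }
    return seen;
  }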

I've also fixed another couple of bugs I spotted, improved
logging of results and added the parameters that were in
your patch.

I did investigate changing the search strategy back to yours,
but I saw no impact on the thread paths found.

Please let me know if this fixes the performance issues you
were seeing and if you have any other comments.

FWIW I've bootstrapped and regression tested this version of
the patch on x86_64 and ARM with no regressions.

Thanks,
James Greenhalgh

---

Changes from v1:

---
gcc/

2013-06-19  James Greenhalgh  <james.greenhalgh@arm.com>

	* params.def (PARAM_MAX_THREAD_PATH_INSNS): New.
	(PARAM_MAX_THREAD_PATHS): Likewise.
	* tree-ssa-threadedge.c
	(simplify_control_stmt_condition): If we can't simplify
	cond, return it unmodified.
	(max_insn_count): Do not initialize.
	(max_path_count): Likewise.
	(find_control_statement_thread_paths): Catch the case where
	a path has already been computed (thus no further path
	exists); add sanity checking.
	(thread_across_edge): Initialize max_{insn,path}_count.
	* tree-ssa-threadupdate.c
	(duplicate_thread_path): Add sanity check, logging.
	(thread_through_all_blocks): Thread paths, even if no
	threaded_edges were found.
	(register_thread_paths): Improve logging.

---

Changelog:

gcc/

2013-06-19  James Greenhalgh  <james.greenhalgh@arm.com>

	PR tree-optimization/54742
	* params.def (PARAM_MAX_THREAD_PATH_INSNS): New.
	(PARAM_MAX_THREAD_PATHS): Likewise.
	* tree-cfg.c (gimple_duplicate_sese_region): Memoize loop latch
	and loop header blocks if copying across a latch/header.
	* tree-flow.h (thread_paths): New struct.
	(register_thread_paths): Declare.
	* tree-ssa-threadedge.c
	(simplify_control_stmt_condition): Permit returning something
	not in gimple_min_invariant form.
	(max_insn_count): Declare.
	(max_path_count): Likewise.
	(find_thread_path_1): New function.
	(find_thread_path): Likewise.
	(save_new_thread_path): Likewise.
	(find_control_statement_thread_paths): Likewise.
	(thread_across_edge): Handle non gimple_min_invariant cases.
	* tree-ssa-threadupdate.c (thread_paths_vec): New.
	(remove_edge_from_thread_candidates): New function.
	(duplicate_thread_path): Likewise.
	(copy_control_statement_thread_paths): Likewise.
	(thread_through_all_blocks): Handle thread_paths.
	(register_thread_paths): New function.

Comments

Steve Ellcey June 19, 2013, 6:33 p.m. UTC | #1
On Wed, 2013-06-19 at 14:19 +0100, James Greenhalgh wrote:

> Please let me know if this fixes the performance issues you
> were seeing and if you have any other comments.
> 
> FWIW I've bootstrapped and regression tested this version of
> the patch on x86_64 and ARM with no regressions.
> 
> Thanks,
> James Greenhalgh


James,

This patch does give me the same performance as my original patch, so
that is excellent.  While testing it I noticed that the final executable
is larger with your patch than with mine.  Here are the sizes of the
bare-metal executables I created using the same flags I sent you
earlier: the first has no switch optimization, the second one uses my
plugin optimization, and the third uses your latest patch.  I haven't
looked into why the size difference between your patch and mine exists;
do you see a size difference on your platforms?  I am not sure if path
threading in general is turned off for -Os, but it probably should be.

% ll -art coremark.fsf*elf
-rwxr-xr-x 1 sellcey src 413812 Jun 19 11:11 coremark.fsf.1.elf
-rwxr-xr-x 1 sellcey src 414676 Jun 19 11:11 coremark.fsf.2.elf
-rwxr-xr-x 1 sellcey src 416402 Jun 19 11:11 coremark.fsf.3.elf


Steve Ellcey
sellcey@mips.com
James Greenhalgh June 21, 2013, 4:43 p.m. UTC | #2
>  While testing it I noticed that the final executable
> is larger with your patch than with mine.  Here are the sizes of the
> bare-metal executables I created using the same flags I sent you
> earlier: the first has no switch optimization, the second one uses my
> plugin optimization, and the third uses your latest patch.  I haven't
> looked into why the size difference between your patch and mine exists;
> do you see a size difference on your platforms?

Yes, I do, but after playing around with it, this seems very dependent
on pass ordering.

I've built various arm-none-eabi compilers to test with:

  clean: a compiler without path threading.
  steve.pass: your original pass patch.
  james: my patch (which will be called within the vrp and dom passes).

  steve.only-after-vrp1: moves your pass to immediately after the first
    call to vrp.

  steve.before, steve.after, steve.after-vrp-before-dom,
  steve.before-vrp-after-dom: run your pass immediately before or after
    both vrp and both dom passes.

  james.ch: my patch, rerunning pass_ch after dom1.

Then, building with flags:

  -finline-limit=1000 -funroll-all-loops
  -finline-functions [[-ftree-switch-shortcut]] -O3 -mthumb

And passing the resulting binary through:

$ arm-none-eabi-strip blob.*

I see:

$ size blob.arm.* | sort -n

   text	   data	    bss	    dec	    hex	filename
  53984	   2548	    296	  56828	   ddfc	../blobs/blob.arm.clean
  54464	   2548	    296	  57308	   dfdc	../blobs/blob.arm.steve.pass
  54496	   2548	    296	  57340	   dffc	../blobs/blob.arm.steve.after
  54496	   2548	    296	  57340	   dffc	../blobs/blob.arm.steve.after-vrp-before-dom
  54504	   2548	    296	  57348	   e004	../blobs/blob.arm.james.ch
  54504	   2548	    296	  57348	   e004	../blobs/blob.arm.steve.only-after-vrp1
  54656	   2548	    296	  57500	   e09c	../blobs/blob.arm.james
  54704	   2548	    296	  57548	   e0cc	../blobs/blob.arm.steve.before-vrp-after-dom
  54736	   2548	    296	  57580	   e0ec	../blobs/blob.arm.steve.before

So to my mind, this is all far too tied up in pass ordering details to
resolve. Given that all the threading opportunities for my patch are found
in dom1 and how fragile the positioning of dom1 is, there is not a great
deal I can do to modify the ordering.

The biggest improvement I could find comes from rerunning pass_ch
immediately after dom1, though I'm not sure what the cost of that
would be.

I wonder if you or others have any thoughts on what the right thing to
do would be?

> I am not sure if path threading in general is turned off for -Os, but it
> probably should be.

I agree; jump threading is on at -Os, but path threading should not be.
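
For the record, the gate could be as simple as extending the condition
that guards the new path search in thread_across_edge (a sketch only,
not something the current patch does; it leans on the existing
optimize_function_for_size_p predicate):

      else if (cond && !is_gimple_min_invariant (cond)
	       /* Hypothetical: path threading duplicates whole blocks,
		  so skip the search when optimizing for size.  */
	       && !optimize_function_for_size_p (cfun))
	{
	  ...
	}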

Thanks,
James
Steve Ellcey June 21, 2013, 9:21 p.m. UTC | #3
On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote:

> So to my mind, this is all far too tied up in pass ordering details to
> resolve. Given that all the threading opportunities for my patch are found
> in dom1 and how fragile the positioning of dom1 is, there is not a great
> deal I can do to modify the ordering.
> 
> The biggest improvement I could find comes from rerunning pass_ch
> immediately after dom1, though I'm not sure what the cost of that
> would be.
> 
> I wonder if you or others have any thoughts on what the right thing to
> do would be?
> 
> > I am not sure if path threading in general is turned off for -Os, but it
> > probably should be.
> 
> I agree; jump threading is on at -Os, but path threading should not be.
> 
> Thanks,
> James

I think that since it seems to be more a space issue than a time issue,
the way you currently have it is reasonable.  As long as the
optimization does not happen at -Os, I think we can probably live
with the space increase.

Steve Ellcey
sellcey@mips.com
Richard Biener Oct. 18, 2013, 10:55 a.m. UTC | #4
On Fri, Jun 21, 2013 at 11:21 PM, Steve Ellcey <sellcey@mips.com> wrote:
> On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote:
>
>> So to my mind, this is all far too tied up in pass ordering details to
>> resolve. Given that all the threading opportunities for my patch are found
>> in dom1 and how fragile the positioning of dom1 is, there is not a great
>> deal I can do to modify the ordering.
>>
>> The biggest improvement I could find comes from rerunning pass_ch
>> immediately after dom1, though I'm not sure what the cost of that
>> would be.
>>
>> I wonder if you or others have any thoughts on what the right thing to
>> do would be?
>>
>> > I am not sure if path threading in general is turned off for -Os, but it
>> > probably should be.
>>
>> I agree; jump threading is on at -Os, but path threading should not be.
>>
>> Thanks,
>> James
>
> I think that since it seems to be more a space issue than a time issue,
> the way you currently have it is reasonable.  As long as the
> optimization does not happen at -Os, I think we can probably live
> with the space increase.

I suppose that with Jeff's recent work on jump-threading through paths
this case is handled and the patch in this thread is obsolete or can
be reworked?

Thanks,
Richard.

> Steve Ellcey
> sellcey@mips.com
>
>
James Greenhalgh Oct. 18, 2013, 11:05 a.m. UTC | #5
On Fri, Oct 18, 2013 at 11:55:08AM +0100, Richard Biener wrote:
> I suppose that with Jeff's recent work on jump-threading through paths
> this case is handled and the patch in this thread is obsolete or can
> be reworked?

Yes, this patch is now obsolete; Jeff's solution is much more
elegant :-)

Thanks,
James
Jeff Law Oct. 18, 2013, 3:17 p.m. UTC | #6
On 10/18/13 04:55, Richard Biener wrote:
> On Fri, Jun 21, 2013 at 11:21 PM, Steve Ellcey <sellcey@mips.com> wrote:
>> On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote:
>>
>>> So to my mind, this is all far too tied up in pass ordering details to
>>> resolve. Given that all the threading opportunities for my patch are found
>>> in dom1 and how fragile the positioning of dom1 is, there is not a great
>>> deal I can do to modify the ordering.
>>>
>>> The biggest improvement I could find comes from rerunning pass_ch
>>> immediately after dom1, though I'm not sure what the cost of that
>>> would be.
>>>
>>> I wonder if you or others have any thoughts on what the right thing to
>>> do would be?
>>>
>>>> I am not sure if path threading in general is turned off for -Os, but it
>>>> probably should be.
>>>
>>> I agree; jump threading is on at -Os, but path threading should not be.
>>>
>>> Thanks,
>>> James
>>
>> I think that since it seems to be more a space issue than a time issue,
>> the way you currently have it is reasonable.  As long as the
>> optimization does not happen at -Os, I think we can probably live
>> with the space increase.
>
> I suppose that with Jeff's recent work on jump-threading through paths
> this case is handled and the patch in this thread is obsolete or can
> be reworked?
It trivially falls out of the work I'm doing.  The biggest question will
be what to do about the loop structures.  For the FSA case from CoreMark
the loop structures get mucked up way beyond repair.

My thoughts were to generally continue to not allow major damage to the 
loop structures when threading jumps --  with the exception being if we 
can eliminate an indirect or tablejump at the top of a loop.  Possibly 
further restricting it to the jump threading done after our loop 
optimizers have completed.

I haven't looked at that heuristic yet as I've primarily focused on 
having the right infrastructure to correctly update the SSA and CFG in 
the general case.



jeff
Jeff Law Oct. 18, 2013, 3:27 p.m. UTC | #7
On 10/18/13 05:05, James Greenhalgh wrote:
> On Fri, Oct 18, 2013 at 11:55:08AM +0100, Richard Biener wrote:
>> I suppose that with Jeff's recent work on jump-threading through paths
>> this case is handled and the patch in this thread is obsolete or can
>> be reworked?
>
> Yes, this patch is now obsolete; Jeff's solution is much more
> elegant :-)
Not sure if I'd use the term elegant :-)  My solution is certainly more
general though.  There's still a couple of rounds of staging bits as I
clean up the rough edges.

The need to store a full thread path to get the SSA graph update correct
was unexpected; it shows up in cases where we have a joiner block,
followed by multiple threadable blocks without side effects, followed by
a threadable block with side effects.
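
To make that concrete, each recorded entry wants to carry an edge plus
what to do with its source block; something of this shape (an
illustration of the idea only, not the actual interface):

  /* Illustrative only.  "edge" is the CFG edge type from basic-block.h.  */
  enum copy_kind { COPY_JOINER, COPY_SIDE_EFFECTS, NO_COPY };

  struct path_entry
  {
    edge e;                   /* Edge into the next block on the path.  */
    enum copy_kind kind;      /* How to duplicate the source block.  */
  };

  /* A full thread path is then a vector of these entries, rather than
     the fixed (incoming, joiner, outgoing) edge triple used today.  */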

Jeff
Patch

diff --git a/gcc/params.def b/gcc/params.def
index 3c52651..25d36a6 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -123,6 +123,19 @@  DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY,
 	  "Maximum probability of the entry BB of split region (in percent relative to entry BB of the function) to make partial inlining happen",
 	  70, 0, 0)
 
+/* Maximum number of instructions to copy when duplicating blocks
+   on a jump thread path.  */
+DEFPARAM (PARAM_MAX_THREAD_PATH_INSNS,
+		"max-thread-path-insns",
+		"Maximum number of instructions to copy when duplicating blocks on a jump thread path",
+		100, 1, 999999)
+
+/* Maximum number of jump thread paths to duplicate.  */
+DEFPARAM (PARAM_MAX_THREAD_PATHS,
+		"max-thread-paths",
+		"Maximum number of new jump thread paths to create",
+		50, 1, 999999)
+
 /* Limit the number of expansions created by the variable expansion
    optimization to avoid register pressure.  */
 DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 4b91a35..6dcd2e4 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -5717,10 +5717,12 @@  gimple_duplicate_sese_region (edge entry, edge exit,
 {
   unsigned i;
   bool free_region_copy = false, copying_header = false;
+  bool save_loop_details = false;
   struct loop *loop = entry->dest->loop_father;
   edge exit_copy;
   vec<basic_block> doms;
   edge redirected;
+  int memo_loop_header_no = -1, memo_loop_latch_no = -1;
   int total_freq = 0, entry_freq = 0;
   gcov_type total_count = 0, entry_count = 0;
 
@@ -5738,9 +5740,17 @@  gimple_duplicate_sese_region (edge entry, edge exit,
       if (region[i]->loop_father != loop)
 	return false;
 
-      if (region[i] != entry->dest
-	  && region[i] == loop->header)
-	return false;
+      /* If we are copying an abnormally shaped region, keep track of
+	 which blocks will become our loop header and latch.  */
+      if ((region[i] != entry->dest && region[i] == loop->header)
+	  || (region[i] != entry->src && region[i] == loop->latch))
+	{
+	  save_loop_details = true;
+	  if (region[i] == loop->header)
+	    memo_loop_header_no = i;
+	  else
+	    memo_loop_latch_no = i;
+	}
     }
 
   set_loop_copy (loop, loop);
@@ -5821,6 +5831,15 @@  gimple_duplicate_sese_region (edge entry, edge exit,
       loop->latch = exit->src;
     }
 
+  /* Restore loop details if we were asked to save them.  */
+  if (save_loop_details)
+    {
+      if (memo_loop_header_no >= 0)
+	loop->header = region_copy[memo_loop_header_no];
+      if (memo_loop_latch_no >= 0)
+	loop->latch = region_copy[memo_loop_latch_no];
+    }
+
   /* Redirect the entry and add the phi node arguments.  */
   redirected = redirect_edge_and_branch (entry, get_bb_copy (entry->dest));
   gcc_assert (redirected != NULL);
diff --git a/gcc/tree-flow.h b/gcc/tree-flow.h
index 24fcfbf..c16a1ba 100644
--- a/gcc/tree-flow.h
+++ b/gcc/tree-flow.h
@@ -604,6 +604,13 @@  struct tree_niter_desc
   enum tree_code cmp;
 };
 
+typedef struct thread_paths
+{
+  basic_block **paths;
+  int *path_sizes;
+  int path_count;
+} thread_paths;
+
 /* In tree-ssa-phiopt.c */
 bool empty_block_p (basic_block);
 basic_block *blocks_in_phiopt_order (void);
@@ -751,6 +758,7 @@  bool may_be_nonaddressable_p (tree expr);
 /* In tree-ssa-threadupdate.c.  */
 extern bool thread_through_all_blocks (bool);
 extern void register_jump_thread (edge, edge, edge);
+extern void register_thread_paths (thread_paths *);
 
 /* In gimplify.c  */
 tree force_gimple_operand_1 (tree, gimple_seq *, gimple_predicate, tree);
diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index b31e961..7e6152e 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -542,6 +542,7 @@  simplify_control_stmt_condition (edge e,
      rather than use a relational operator.  These are simpler to handle.  */
   if (TREE_CODE (cond) == SSA_NAME)
     {
+      tree original_lhs = cond;
       cached_lhs = cond;
 
       /* Get the variable's current value from the equivalence chains.
@@ -563,6 +564,12 @@  simplify_control_stmt_condition (edge e,
 	 pass specific callback to try and simplify it further.  */
       if (cached_lhs && ! is_gimple_min_invariant (cached_lhs))
         cached_lhs = (*simplify) (stmt, stmt);
+
+      /* We couldn't find an invariant.  But, callers of this
+	 function may be able to do something useful with the
+	 original, unsimplified condition.  */
+      if (!cached_lhs)
+	cached_lhs = original_lhs;
     }
   else
     cached_lhs = NULL;
@@ -829,6 +836,218 @@  phi_args_equal_on_edges (edge e1, edge e2)
   return true;
 }
 
+static int max_insn_count;
+static int max_path_count;
+
+/* Helper function for find_thread_path.  VISITED_BBS is used to make sure
+   we don't fall into an infinite loop.  */
+
+static bool
+find_thread_path_1 (basic_block start_bb, basic_block end_bb,
+		    struct pointer_set_t *visited_bbs,
+		    vec<basic_block, va_gc> *&path)
+{
+  edge_iterator ei;
+  edge e;
+
+  if (start_bb == end_bb)
+    {
+      vec_safe_push (path, start_bb);
+      return true;
+    }
+
+  if (!pointer_set_insert (visited_bbs, start_bb))
+    {
+      FOR_EACH_EDGE (e, ei, start_bb->succs)
+	if (find_thread_path_1 (e->dest, end_bb, visited_bbs, path))
+	  {
+	    vec_safe_push (path, start_bb);
+	    return true;
+	  }
+    }
+  return false;
+}
+
+/* Return true if there is at least one path from START_BB to END_BB.  */
+
+static bool
+find_thread_path (basic_block start_bb, basic_block end_bb,
+		  vec<basic_block, va_gc> *&path)
+{
+  edge_iterator ei;
+  edge e;
+  struct pointer_set_t *visited_bbs;
+  bool p = false;
+
+  if (start_bb == end_bb)
+    {
+      vec_safe_push (path, start_bb);
+      return true;
+    }
+
+  visited_bbs = pointer_set_create ();
+  if (!pointer_set_insert (visited_bbs, start_bb))
+    {
+      FOR_EACH_EDGE (e, ei, start_bb->succs)
+	if (find_thread_path_1 (e->dest, end_bb, visited_bbs, path))
+	  {
+	    vec_safe_push (path, start_bb);
+	    p = true;
+	    break;
+	  }
+    }
+  pointer_set_destroy (visited_bbs);
+  return p;
+}
+
+/* We save the paths we want to copy in PATHS.  PATHS->path_count is the
+   number of paths saved, PATHS->paths[i] is the list of basic blocks in
+   one path.  Each path starts with the block where a variable is assigned
+   a constant value (PATHS->paths[i][0]) and ends with the control statement
+   block (PATHS->paths[i][PATHS->path_sizes[i]-2]) and then the block
+   that the control statement is going to go to given the constant value
+   of the variable (PATHS->paths[i][PATHS->path_sizes[i]-1]).
+
+   PATH[0] is the block with the control statement,
+   PATH[n - 1] is the block where the control statement variable
+     is assigned a constant value,
+   and the entries in between make a (reverse) path between the two.
+
+   We don't want to change PATH; we want to leave that alone and
+   copy the path to PATHS->paths so that we wind up with an array
+   of paths that we want to update.  We also want to add the block that the
+   control statement is going to go to on to the list so that we know
+   which exit from the control statement is important.  */
+
+static void
+save_new_thread_path (vec<basic_block, va_gc> *&path,
+		      tree val, thread_paths *paths)
+{
+  int i;
+  int insn_count;
+  basic_block bb;
+  edge taken_edge;
+  gimple_stmt_iterator gsi;
+  int path_count = paths->path_count;
+  int n = path->length ();
+
+  if (n <= 1)
+    return;
+
+  if (path_count >= max_path_count)
+    return;
+
+  /* Put the blocks in 'correct' order and add in where we want to go after
+     the control statement.  We want to leave PATH untouched for future
+     calls.  */
+
+  paths->paths[path_count] = XNEWVEC (basic_block, n + 1);
+  for (i = 0; i < n; i++)
+    paths->paths[path_count][i] = (*path)[n - i - 1];
+
+  taken_edge = find_taken_edge ((*path)[0], val);
+  paths->paths[path_count][n] = taken_edge->dest;
+
+  paths->path_sizes[path_count] = n + 1;
+
+  /* Count how many instructions are in the blocks we are going to
+     duplicate, and if there are too many, do not save this path
+     (return without incrementing PATHS->path_count).  */
+
+  insn_count = 0;
+  for (i = 1; i < n; i++)
+    {
+      bb = paths->paths[path_count][i];
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	insn_count += 1;
+    }
+
+  if (insn_count > max_insn_count)
+    {
+      /* Free the path now, or we would leak it on the next call.  */
+      XDELETEVEC (paths->paths[path_count]);
+      return;
+    }
+
+  paths->path_count++;
+}
+
+/* CTRL_STMT is a COND_EXPR or SWITCH_EXPR statement whose controlling
+   expression is the variable EXPR.  We trace the value of the variable
+   back through any PHI nodes looking for places where it gets a constant
+   value, and save the path in PATH.  Then we call save_new_thread_path
+   to create a list of such paths.  */
+
+static void
+find_control_statement_thread_paths (tree expr, gimple ctrl_stmt,
+				     struct pointer_set_t *visited_phis,
+				     thread_paths *paths,
+				     vec<basic_block, va_gc> *&path)
+{
+  gimple def_stmt;
+  tree var;
+  unsigned int i;
+  edge e;
+  edge_iterator ei;
+  basic_block var_bb;
+  int e_count;
+  vec<basic_block, va_gc> *next_path;
+  basic_block bbx = path->last ();
+
+  var = SSA_NAME_VAR (expr);
+  def_stmt = SSA_NAME_DEF_STMT (expr);
+  var_bb = gimple_bb (def_stmt);
+
+  if (var == NULL || var_bb == NULL)
+    return;
+
+  /* We have a variable definition (var) that is defined in var_bb.
+     We want to put the path from var_bb to the current bb into
+     next_path.  If there is more than one path, skip this and don't
+     try to do the optimization.  */
+
+  e_count = 0;
+  vec_alloc (next_path, n_basic_blocks);
+  if (var_bb != bbx)
+    {
+      FOR_EACH_EDGE (e, ei, bbx->preds)
+	{
+	  if (find_thread_path (var_bb, e->src, next_path))
+	    e_count = e_count + 1;
+	  if (e_count > 1)
+	    break;
+	}
+      if (e_count != 1)
+	{
+	  vec_free (next_path);
+	  return;
+	}
+    }
+
+  vec_safe_splice (path, next_path);
+  gcc_assert (path->last () == var_bb);
+
+  if ((gimple_code (def_stmt) == GIMPLE_PHI)
+      && !pointer_set_insert (visited_phis, def_stmt))
+    {
+      for (i = 0; i < gimple_phi_num_args (def_stmt); i++)
+	{
+	  tree arg = gimple_phi_arg_def (def_stmt, i);
+	  vec_safe_push (path, gimple_phi_arg_edge (def_stmt, i)->src);
+	  if (arg && (TREE_CODE (arg) == INTEGER_CST))
+	    save_new_thread_path (path, arg, paths);
+	  else if (arg && (TREE_CODE (arg) == SSA_NAME))
+	    find_control_statement_thread_paths (arg, ctrl_stmt,
+						 visited_phis,
+						 paths, path);
+	  path->pop ();
+	}
+    }
+
+  vec_safe_truncate (path, (path->length () - next_path->length ()));
+  vec_free (next_path);
+}
+
 /* We are exiting E->src, see if E->dest ends with a conditional
    jump which has a known value when reached via E.
 
@@ -943,12 +1162,50 @@  thread_across_edge (gimple dummy_cond,
 	  register_jump_thread (e, taken_edge, NULL);
 	  return;
 	}
+      else if (cond && !is_gimple_min_invariant (cond))
+	{
+	  /* Try to find paths from a control statement back through
+	     the PHI nodes which would affect that control statement.
+	     Record these opportunities for later.  */
+	  thread_paths *paths = XNEW (thread_paths);
+	  struct pointer_set_t *visited_phis;
+	  max_insn_count = PARAM_VALUE (PARAM_MAX_THREAD_PATH_INSNS);
+	  max_path_count = PARAM_VALUE (PARAM_MAX_THREAD_PATHS);
+	  vec<basic_block, va_gc> *path;
+	  vec_alloc (path, n_basic_blocks);
+
+	  paths->path_count = 0;
+	  paths->path_sizes = XNEWVEC (int, max_path_count);
+	  paths->paths = XNEWVEC (basic_block *, max_path_count);
+
+	  visited_phis = pointer_set_create ();
+
+	  vec_safe_push (path, e->dest);
+	  find_control_statement_thread_paths (cond, stmt, visited_phis,
+					       paths, path);
+
+	  pointer_set_destroy (visited_phis);
+	  vec_free (path);
+
+	  if (paths->path_count > 0)
+	    {
+	      register_thread_paths (paths);
+	      return;
+	    }
+
+	  /* No paths were found, so clean up here.  Had we registered
+	     any paths, tree-ssa-threadupdate.c would own this memory.  */
+	  XDELETEVEC (paths->paths);
+	  XDELETEVEC (paths->path_sizes);
+	  XDELETE (paths);
+	}
     }
 
- /* We were unable to determine what out edge from E->dest is taken.  However,
-    we might still be able to thread through successors of E->dest.  This
-    often occurs when E->dest is a joiner block which then fans back out
-    based on redundant tests.
+ /* We were unable to determine what out edge from E->dest is taken, and we
+    could not find any threadable paths.  However, we might still be able
+    to thread through successors of E->dest.  This often occurs when
+    E->dest is a joiner block which then fans back out based on redundant
+    tests.
 
     If so, we'll copy E->dest and redirect the appropriate predecessor to
     the copy.  Within the copy of E->dest, we'll thread one or more edges
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 0e4cbc9..5d7fecc 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -169,6 +169,10 @@  struct ssa_local_info_t
    (original_edge, target_edge).  */
 static vec<edge> threaded_edges;
 
+/* For more advanced jump threading, make a note of the whole
+   interesting path.  */
+static vec<thread_paths*> thread_paths_vec;
+
 /* When we start updating the CFG for threading, data necessary for jump
    threading is attached to the AUX field for the incoming edge.  Use these
    macros to access the underlying structure attached to the AUX field.  */
@@ -1183,6 +1187,88 @@  mark_threaded_blocks (bitmap threaded_blocks)
   BITMAP_FREE(tmp);
 }
 
+/* Remove any edge triplet from the candidate jump-threads in threaded_edges
+   for which E is a member.  */
+
+static void
+remove_edge_from_thread_candidates (edge e)
+{
+  unsigned int i;
+
+  for (i = 0; i < threaded_edges.length (); i += 3)
+    {
+      if (threaded_edges[i] == e
+	  || threaded_edges[i + 1] == e
+	  || threaded_edges[i + 2] == e)
+	{
+	  /* By doing an unordered remove in reverse order, we
+	     maintain the intended (triplet of edges) element
+	     ordering.  */
+	  threaded_edges.unordered_remove (i + 2);
+	  threaded_edges.unordered_remove (i + 1);
+	  threaded_edges.unordered_remove (i);
+	  /* Rescan the previous three elements.  */
+	  i -= 3;
+	}
+    }
+}
+
+/* Call gimple_duplicate_sese_region to duplicate the blocks in BB_LIST.
+   We free and recalculate all SSA, loop, CFG and dominance information
+   afterwards because the region being copied is not really
+   SESE and so we cannot trust gimple_duplicate_sese_region to correctly
+   update the dataflow information.  */
+
+static void
+duplicate_thread_path (basic_block *bb_list, int bb_count)
+{
+  edge orig_edge, exit_edge;
+  bool success;
+  orig_edge = find_edge (bb_list[0], bb_list[1]);
+  exit_edge = find_edge (bb_list[bb_count - 2], bb_list[bb_count - 1]);
+
+  /* This can happen if we have removed edges from this basic
+     block through other calls to duplicate_thread_path.
+     Give up early.  */
+  if (!orig_edge || !exit_edge)
+    return;
+
+  success = gimple_duplicate_sese_region (orig_edge, exit_edge,
+					  &bb_list[1], bb_count - 2,
+					  NULL, 0);
+
+  if (success)
+    {
+      /* Some edges will no longer be safe to thread across.  In the
+	 very worst case, we will miss an optimization opportunity
+	 by doing this.  */
+      remove_edge_from_thread_candidates (orig_edge);
+      remove_edge_from_thread_candidates (exit_edge);
+      /* Clean up our mess.  */
+      free_dominance_info (CDI_DOMINATORS);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "  Threaded jump path from %d to %d\n",
+		 orig_edge->src->index, exit_edge->dest->index);
+
+      thread_stats.num_threaded_edges += bb_count;
+    }
+}
+
+/* Go through the paths saved in PATHS->paths and make copies of them.  */
+
+static void
+copy_control_statement_thread_paths (thread_paths *paths)
+{
+  int i;
+  for (i = 0; i < paths->path_count; i++)
+    {
+      /* For each path, loop through and copy each block on the path
+	 (except the first one, where the constant is assigned, and the
+	 final one, where the control statement goes to).  */
+      if (!single_pred_p (paths->paths[i][1]))
+	duplicate_thread_path (paths->paths[i], paths->path_sizes[i]);
+    }
+}
 
 /* Walk through all blocks and thread incoming edges to the appropriate
    outgoing edge for each edge pair recorded in THREADED_EDGES.
@@ -1204,10 +1290,26 @@  thread_through_all_blocks (bool may_peel_loop_headers)
   bitmap threaded_blocks;
   struct loop *loop;
   loop_iterator li;
+  int j;
+  thread_paths *paths;
 
   /* We must know about loops in order to preserve them.  */
   gcc_assert (current_loops != NULL);
 
+  /* Threading the paths in thread_paths_vec can cause major damage to
+     the dominance information, SSA names and loop structures.  Do
+     these paths first.  */
+  while (!thread_paths_vec.is_empty ())
+    {
+      paths = thread_paths_vec.pop ();
+      copy_control_statement_thread_paths (paths);
+      XDELETEVEC (paths->path_sizes);
+      for (j = 0; j < paths->path_count; j++)
+	XDELETEVEC (paths->paths[j]);
+      XDELETEVEC (paths->paths);
+      XDELETE (paths);
+    }
+
   if (!threaded_edges.exists ())
     return false;
 
@@ -1283,3 +1385,19 @@  register_jump_thread (edge e, edge e2, edge e3)
   threaded_edges.safe_push (e2);
   threaded_edges.safe_push (e3);
 }
+
+/* Register thread_paths found through analysis of expressions
+   used by control statements.  */
+
+void
+register_thread_paths (thread_paths *tps)
+{
+  if (!thread_paths_vec.exists ())
+    thread_paths_vec.create (15);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+	     "  Registered %d jump paths\n", tps->path_count);
+
+  thread_paths_vec.safe_push (tps);
+}