Patchwork 0005-Search-all-dominated-blocks-for-expressions-to-hoist.patch

login
register
mail settings
Submitter Maxim Kuvyrkov
Date June 24, 2010, 3:53 p.m.
Message ID <4C237F81.9090807@codesourcery.com>
Download mbox | patch
Permalink /patch/56806/
State New
Headers show

Comments

Maxim Kuvyrkov - June 24, 2010, 3:53 p.m.
On 6/23/10 11:08 PM, Maxim Kuvyrkov wrote:
...
>> This can be addressed with a walk over the dominator tree after we
>> compute VBEout. Start with the root and descend in the tree keeping a
>> bitset of expressions that should be alive up the tree. If current node
>>
>> 1. has a single successor,
>> 2. has i'th expression set in VBEout,
>> 3. the successor has i'th expression set in VBEout,
>> 4. current node doesn't generate i'th expression,
>> 5. i'th expression is not marked in the bitset as required up the tree,
>>
>> than we can hoist i'th expression in the successor with the same result
>> as in the current node and not unnecessarily extend live ranges. There
>> maybe a couple more details to the above, but the problem should be
>> easily fixable.
>
> This is implemented as cleanup_code_hoist_vbeout() function. The
> solution it produces is OK from correctness point of view (it removes
> bits from VBEout), but, please, *check my reasoning* to make sure it
> doesn't remove from VBEout expressions it shouldn't.

There is a flaw in the implementation I posted yesterday.  VBEout sets 
have to be cleaned up considering data both downward and upward the 
dominator tree; see new example and comments in compute_code_hoist_vbeinout.

This updated patch corrects the cleaning routine and adds several 
comments to annotate its actions.

Does this look OK?
Jeff Law - June 30, 2010, 6:12 p.m.
On 06/24/10 09:53, Maxim Kuvyrkov wrote:
> On 6/23/10 11:08 PM, Maxim Kuvyrkov wrote:
> ...
>>> This can be addressed with a walk over the dominator tree after we
>>> compute VBEout. Start with the root and descend in the tree keeping a
>>> bitset of expressions that should be alive up the tree. If current node
>>>
>>> 1. has a single successor,
>>> 2. has i'th expression set in VBEout,
>>> 3. the successor has i'th expression set in VBEout,
>>> 4. current node doesn't generate i'th expression,
>>> 5. i'th expression is not marked in the bitset as required up the tree,
>>>
>>> than we can hoist i'th expression in the successor with the same result
>>> as in the current node and not unnecessarily extend live ranges. There
>>> maybe a couple more details to the above, but the problem should be
>>> easily fixable.
>>
>> This is implemented as cleanup_code_hoist_vbeout() function. The
>> solution it produces is OK from correctness point of view (it removes
>> bits from VBEout), but, please, *check my reasoning* to make sure it
>> doesn't remove from VBEout expressions it shouldn't.
>
> There is a flaw in the implementation I posted yesterday.  VBEout sets 
> have to be cleaned up considering data both downward and upward the 
> dominator tree; see new example and comments in 
> compute_code_hoist_vbeinout.
>
> This updated patch corrects the cleaning routine and adds several 
> comments to annotate its actions.
>
> Does this look OK?


It looks like you've got a bi-directional dataflow problem to clean up 
VBEout, in general we really want to avoid bi-directional problems.   Is 
there some reason you're not using a lowest common ancestor algorithm here?

In your comments you have the following CFG:



+  /* We cannot cleanup VBEout in a single traversal.  There has to be both
+     upward and downward links when computing VBEout of current block to
+     avoid removing bits that shouldn't be removed.  E.g., consider
+     the following dominator tree; '*' marks blocks which compute same
+     expression, the expression can be freely moved; the expected result
+     is that we move computations of '*' from (3) and (6) to (2).
+
+       2
+      / \
+     3*  4
+        / \
+       5   6*
+
+     Doing a depth-first search over this tree without and upward link
+     will remove the expression from VBEout[4] (there's no point of 
hoisting
+     the expression to (4) if it's not computed in both (5) and (6).
+     When cleaning up VBEout[2] we won't see the expression as needed 
in (4),
+     so we will remove it from VBEout[2] leaving it to (3) to calculate
+     it's own copy of '*'.

The first paragraph of your comment implies that we'd want to hoist the 
expression from 3 & 6 into 2.  However, that's not a valid 
transformation as the path 2, 4, 5 does not evaluate the expression in 
the original CFG.  VBEout for block 4 should be false since the 
expression is not evaluated in block #5.

Jeff
Maxim Kuvyrkov - June 30, 2010, 7:30 p.m.
On 6/30/10 10:12 PM, Jeff Law wrote:
> On 06/24/10 09:53, Maxim Kuvyrkov wrote:
>> On 6/23/10 11:08 PM, Maxim Kuvyrkov wrote:
>> ...
>>>> This can be addressed with a walk over the dominator tree after we
>>>> compute VBEout. Start with the root and descend in the tree keeping a
>>>> bitset of expressions that should be alive up the tree. If current node
>>>>
>>>> 1. has a single successor,
>>>> 2. has i'th expression set in VBEout,
>>>> 3. the successor has i'th expression set in VBEout,
>>>> 4. current node doesn't generate i'th expression,
>>>> 5. i'th expression is not marked in the bitset as required up the tree,
>>>>
>>>> than we can hoist i'th expression in the successor with the same result
>>>> as in the current node and not unnecessarily extend live ranges. There
>>>> maybe a couple more details to the above, but the problem should be
>>>> easily fixable.
>>>
>>> This is implemented as cleanup_code_hoist_vbeout() function. The
>>> solution it produces is OK from correctness point of view (it removes
>>> bits from VBEout), but, please, *check my reasoning* to make sure it
>>> doesn't remove from VBEout expressions it shouldn't.
>>
>> There is a flaw in the implementation I posted yesterday. VBEout sets
>> have to be cleaned up considering data both downward and upward the
>> dominator tree; see new example and comments in
>> compute_code_hoist_vbeinout.
>>
>> This updated patch corrects the cleaning routine and adds several
>> comments to annotate its actions.
>>
>> Does this look OK?
>
>
> It looks like you've got a bi-directional dataflow problem to clean up
> VBEout, in general we really want to avoid bi-directional problems. Is
> there some reason you're not using a lowest common ancestor algorithm here?

A dataflow problem seems simpler for this case.  The problem uses 
bi-directional links to compute a set of expressions that will be 
subtracted from VBEout on each iteration, it never adds new expressions 
to destination sets.  I, therefore, claim that the fact that this 
particular problem uses bi-directional links does not really have any 
significant negative impact.

Is there any reason bi-directional dataflow problems should be avoided 
at all cost?

>
> In your comments you have the following CFG:
>
>
>
> + /* We cannot cleanup VBEout in a single traversal. There has to be both
> + upward and downward links when computing VBEout of current block to
> + avoid removing bits that shouldn't be removed. E.g., consider
> + the following dominator tree; '*' marks blocks which compute same
> + expression, the expression can be freely moved; the expected result
> + is that we move computations of '*' from (3) and (6) to (2).
> +
> + 2
> + / \
> + 3* 4
> + / \
> + 5 6*
> +
> + Doing a depth-first search over this tree without and upward link
> + will remove the expression from VBEout[4] (there's no point of hoisting
> + the expression to (4) if it's not computed in both (5) and (6).
> + When cleaning up VBEout[2] we won't see the expression as needed in (4),
> + so we will remove it from VBEout[2] leaving it to (3) to calculate
> + it's own copy of '*'.
>
> The first paragraph of your comment implies that we'd want to hoist the
> expression from 3 & 6 into 2. However, that's not a valid transformation
> as the path 2, 4, 5 does not evaluate the expression in the original
> CFG. VBEout for block 4 should be false since the expression is not
> evaluated in block #5.

Interesting.  While working on implementing cleanup of VBEout sets I 
oscillated several times between bottom-up dominator tree walk and 
iterative walk.  At some point I got myself convinced that dominator 
walk would do just as fine the job as iterative walk.  Then I come up 
with the above case which needs more than a single walk to get to the 
right solution.  Now you pointed out that the above case is invalid, 
which makes me think that a dominator walk would suffice after all.

Still, the iterative solution looks better to me as it makes it crystal 
clear that only expressions that definitely won't be hoisted to BB will 
be removed from BB's VBEout.

Does this make sense?

Thanks,
Jeff Law - July 1, 2010, 4:36 p.m.
On 06/30/10 13:30, Maxim Kuvyrkov wrote:
>>
>> It looks like you've got a bi-directional dataflow problem to clean up
>> VBEout, in general we really want to avoid bi-directional problems. Is
>> there some reason you're not using a lowest common ancestor algorithm 
>> here?
>
>
> A dataflow problem seems simpler for this case.  The problem uses 
> bi-directional links to compute a set of expressions that will be 
> subtracted from VBEout on each iteration, it never adds new 
> expressions to destination sets.  I, therefore, claim that the fact 
> that this particular problem uses bi-directional links does not really 
> have any significant negative impact.
While this specific case may be reasonably safe, in general we really 
want to avoid bi-directional dataflow solvers as it's often hard to 
prove termination, it's hard to evaluate their compile-time performance, 
they're typically more difficult to understand for anyone reading the 
code, and (personal opinion here) often they're a symptom of not solving 
the right problem.

> Interesting.  While working on implementing cleanup of VBEout sets I 
> oscillated several times between bottom-up dominator tree walk and 
> iterative walk.  At some point I got myself convinced that dominator 
> walk would do just as fine the job as iterative walk.  Then I come up 
> with the above case which needs more than a single walk to get to the 
> right solution.  Now you pointed out that the above case is invalid, 
> which makes me think that a dominator walk would suffice after all.
>
> Still, the iterative solution looks better to me as it makes it 
> crystal clear that only expressions that definitely won't be hoisted 
> to BB will be removed from BB's VBEout.
>
> Does this make sense?
A little bit.  But it's still unclear why we're not using lowest common 
ancestor here.  It seems that tells us precisely where we want to hoist 
to avoid unnecessary movements.  Or is this to prevent hoisting into an 
LCA, then hoisting the expression again on a later pass further up the 
dominator tree?

The other issues that I think are still unanswered:

   1. Why precisely to do we need transpout for hoisting.  We should 
only be hoisting a very busy expression into a block which dominates the 
original evaluations.  We don't insert on edges, so we don't have to 
worry about splitting abnormals.  The only thing I can think of is 
perhaps the hoisting might require new edges in the destination block in 
the expression potentially traps?!?

   2. Assuming there's a good reason for #1, for correctness we should 
drop transpout and instead use the same method as PRE.

Jeff
>
> Thanks,
>
Maxim Kuvyrkov - July 2, 2010, 4:08 p.m.
On 7/1/10 8:36 PM, Jeff Law wrote:
> On 06/30/10 13:30, Maxim Kuvyrkov wrote:
>>>
>>> It looks like you've got a bi-directional dataflow problem to clean up
>>> VBEout, in general we really want to avoid bi-directional problems. Is
>>> there some reason you're not using a lowest common ancestor algorithm
>>> here?
>>
>>
>> A dataflow problem seems simpler for this case. The problem uses
>> bi-directional links to compute a set of expressions that will be
>> subtracted from VBEout on each iteration, it never adds new
>> expressions to destination sets. I, therefore, claim that the fact
>> that this particular problem uses bi-directional links does not really
>> have any significant negative impact.
> While this specific case may be reasonably safe, in general we really
> want to avoid bi-directional dataflow solvers as it's often hard to
> prove termination, it's hard to evaluate their compile-time performance,
> they're typically more difficult to understand for anyone reading the
> code, and (personal opinion here) often they're a symptom of not solving
> the right problem.
>
>> Interesting. While working on implementing cleanup of VBEout sets I
>> oscillated several times between bottom-up dominator tree walk and
>> iterative walk. At some point I got myself convinced that dominator
>> walk would do just as fine the job as iterative walk. Then I come up
>> with the above case which needs more than a single walk to get to the
>> right solution. Now you pointed out that the above case is invalid,
>> which makes me think that a dominator walk would suffice after all.
>>
>> Still, the iterative solution looks better to me as it makes it
>> crystal clear that only expressions that definitely won't be hoisted
>> to BB will be removed from BB's VBEout.
>>
>> Does this make sense?
> A little bit. But it's still unclear why we're not using lowest common
> ancestor here.

It appears we were thinking about different approaches to using LCA to 
solve this problem, the algorithm I thought you were suggesting would've 
been bulky and slow.

Now I see that the problem can be solved reasonably fast with LCA too. 
I don't yet have all the details figured out, so if you have a clear 
picture of the algorithm in mind, please don't hold it to yourself ;)

> 1. Why precisely to do we need transpout for hoisting. We should only be
> hoisting a very busy expression into a block which dominates the
> original evaluations. We don't insert on edges, so we don't have to
> worry about splitting abnormals. The only thing I can think of is
> perhaps the hoisting might require new edges in the destination block in
> the expression potentially traps?!?

I can't give a definitive answer to if and why hoisting needs transpout. 
  It seems to me transpout can be safely removed if we just avoid 
propagating data across complex edges in compute_vbeinout and 
hoist_expr_reaches_here_p.

That said, I would not check in such a clean up in the same patch as 
improving code hoisting to look into non-immediately-dominated blocks. 
Let's keep bugs these two changes can introduce separate.

>
> 2. Assuming there's a good reason for #1, for correctness we should drop
> transpout and instead use the same method as PRE.

I don't think there is.  Anyway, we will find out once I or someone else 
implement removal of transpout.

Jeff, do you remember why transpout sets were introduced in the first place?

Thanks,
Jeff Law - July 7, 2010, 4:40 p.m.
On 07/02/10 10:08, Maxim Kuvyrkov wrote:
>
> It appears we were thinking about different approaches to using LCA to 
> solve this problem, the algorithm I thought you were suggesting 
> would've been bulky and slow.
>
> Now I see that the problem can be solved reasonably fast with LCA too. 
> I don't yet have all the details figured out, so if you have a clear 
> picture of the algorithm in mind, please don't hold it to yourself ;)
Isn't it fairly simple?  If we have an expression we want to hoist from 
some set of blocks; isn't the destination the LCA within the dominator 
tree of those blocks?

I realize that LCA is usually formulated as the LCA between two blocks, 
but isn't it relatively easy to compute the LCA for pairs and 
recurse/iterate?

Or is this problem some kind of implementation detail that I've missed?

>
>> 1. Why precisely to do we need transpout for hoisting. We should only be
>> hoisting a very busy expression into a block which dominates the
>> original evaluations. We don't insert on edges, so we don't have to
>> worry about splitting abnormals. The only thing I can think of is
>> perhaps the hoisting might require new edges in the destination block in
>> the expression potentially traps?!?
>
> I can't give a definitive answer to if and why hoisting needs 
> transpout.  It seems to me transpout can be safely removed if we just 
> avoid propagating data across complex edges in compute_vbeinout and 
> hoist_expr_reaches_here_p.
>
> That said, I would not check in such a clean up in the same patch as 
> improving code hoisting to look into non-immediately-dominated blocks. 
> Let's keep bugs these two changes can introduce separate.
If you want to keep the changes separate (and there's certainly value in 
doing that), the way to go is fix the correctness issue first, then the 
improvement of the optimization.  Particularly when we're still 
iterating on the implementation of the optimization.


> I don't think there is.  Anyway, we will find out once I or someone 
> else implement removal of transpout.
>
> Jeff, do you remember why transpout sets were introduced in the first 
> place?
I don't.  It was in the first external version of hoisting I posted and 
the first development version I checked into Cygnus's internal tree.

Since there's no edge insertions with hoisting, the only potential 
problem I can see is when the hoisted expression itself can trigger 
traversal of an abnormal edge and the block we want to put the 
expression does not have an edge to the handler.  In that case we'd need 
to add edges in the cfg and I don't see any compensation code in gcse.c 
to deal with that case.

Assuming that's the situation we need to avoid then we really need to 
switch the pre-like code since it detects expressions which can cause 
traversal of the abnormal edge much better.   It's a fairly simple patch 
since it just prunes some expressions from the local tables before 
running the dataflow solver.

Jeff

Patch

diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index 135c0c2..1bf192d 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -854,6 +854,8 @@  extern VEC (basic_block, heap) *get_dominated_by (enum cdi_direction, basic_bloc
 extern VEC (basic_block, heap) *get_dominated_by_region (enum cdi_direction,
 							 basic_block *,
 							 unsigned);
+extern VEC (basic_block, heap) *get_dominated_to_depth (enum cdi_direction,
+							basic_block, int);
 extern VEC (basic_block, heap) *get_all_dominated_blocks (enum cdi_direction,
 							  basic_block);
 extern void add_to_dominance_info (enum cdi_direction, basic_block);
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 9e517e9..05ebcf0 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8195,6 +8195,12 @@  when @option{-ftree-vectorize} is used.  The number of iterations after
 vectorization needs to be greater than the value specified by this option
 to allow vectorization.  The default value is 0.
 
+@item max-hoist-depth
+The depth of search in the dominator tree for expressions to hoist.
+This is used to avoid quadratic behavior in hoisting algorithm.
+The value of 0 will avoid limiting the search, but may slow down compilation
+of huge functions.  The default value is 30.
+
 @item max-unrolled-insns
 The maximum number of instructions that a loop should have if that loop
 is unrolled, and if the loop is unrolled, it determines how many times
diff --git a/gcc/dominance.c b/gcc/dominance.c
index 9c2dcf0..7861439 100644
--- a/gcc/dominance.c
+++ b/gcc/dominance.c
@@ -783,16 +783,20 @@  get_dominated_by_region (enum cdi_direction dir, basic_block *region,
 }
 
 /* Returns the list of basic blocks including BB dominated by BB, in the
-   direction DIR.  The vector will be sorted in preorder.  */
+   direction DIR up to DEPTH in the dominator tree.  The DEPTH of zero will
+   produce a vector containing all dominated blocks.  The vector will be sorted
+   in preorder.  */
 
 VEC (basic_block, heap) *
-get_all_dominated_blocks (enum cdi_direction dir, basic_block bb)
+get_dominated_to_depth (enum cdi_direction dir, basic_block bb, int depth)
 {
   VEC(basic_block, heap) *bbs = NULL;
   unsigned i;
+  unsigned next_level_start;
 
   i = 0;
   VEC_safe_push (basic_block, heap, bbs, bb);
+  next_level_start = 1; /* = VEC_length (basic_block, bbs); */
 
   do
     {
@@ -803,12 +807,24 @@  get_all_dominated_blocks (enum cdi_direction dir, basic_block bb)
 	   son;
 	   son = next_dom_son (dir, son))
 	VEC_safe_push (basic_block, heap, bbs, son);
+
+      if (i == next_level_start && --depth)
+	next_level_start = VEC_length (basic_block, bbs);
     }
-  while (i < VEC_length (basic_block, bbs));
+  while (i < next_level_start);
 
   return bbs;
 }
 
+/* Returns the list of basic blocks including BB dominated by BB, in the
+   direction DIR.  The vector will be sorted in preorder.  */
+
+VEC (basic_block, heap) *
+get_all_dominated_blocks (enum cdi_direction dir, basic_block bb)
+{
+  return get_dominated_to_depth (dir, bb, 0);
+}
+
 /* Redirect all edges pointing to BB to TO.  */
 void
 redirect_immediate_dominators (enum cdi_direction dir, basic_block bb,
diff --git a/gcc/gcse.c b/gcc/gcse.c
index 39660d5..572cfdb 100644
--- a/gcc/gcse.c
+++ b/gcc/gcse.c
@@ -4155,6 +4155,7 @@  compute_code_hoist_vbeinout (void)
 {
   int changed, passes;
   basic_block bb;
+  sbitmap tmp1, tmp2;
 
   sbitmap_vector_zero (hoist_vbeout, last_basic_block);
   sbitmap_vector_zero (hoist_vbein, last_basic_block);
@@ -4203,6 +4204,130 @@  compute_code_hoist_vbeinout (void)
 	  dump_sbitmap_file (dump_file, hoist_vbeout[bb->index]);
 	}
     }
+
+  /* Now cleanup VBEout to avoid moving expressions too far up.
+
+     We follow two rules to clean up VBEout[BB]:
+
+     1. If BB does not have any dominated blocks, nothing will ever be hoisted
+     to BB, so we can just wipe its VBEout clean.
+
+     2. If an expression can be hoisted both to BB and to a *single* successor
+     of BB in the dominator tree, then there is no point of hoisting
+     the expression to BB over BB's successor.  Doing otherwise would
+     unnecessarily extend live ranges.  */
+
+  /* Wipe VBEout of leaf blocks in the dominator tree.  */
+  FOR_EACH_BB (bb)
+    if (first_dom_son (CDI_DOMINATORS, bb) == NULL)
+      sbitmap_zero (hoist_vbeout[bb->index]);
+
+  tmp1 = sbitmap_alloc (expr_hash_table.n_elems);
+  tmp2 = sbitmap_alloc (expr_hash_table.n_elems);
+
+  /* We cannot cleanup VBEout in a single traversal.  There has to be both
+     upward and downward links when computing VBEout of current block to
+     avoid removing bits that shouldn't be removed.  E.g., consider
+     the following dominator tree; '*' marks blocks which compute same
+     expression, the expression can be freely moved; the expected result
+     is that we move computations of '*' from (3) and (6) to (2).
+
+       2
+      / \
+     3*  4
+        / \
+       5   6*
+
+     Doing a depth-first search over this tree without and upward link
+     will remove the expression from VBEout[4] (there's no point of hoisting
+     the expression to (4) if it's not computed in both (5) and (6).
+     When cleaning up VBEout[2] we won't see the expression as needed in (4),
+     so we will remove it from VBEout[2] leaving it to (3) to calculate
+     it's own copy of '*'.
+
+     Therefore, we use iterative algorithm to solve this problem with both
+     upward and downward links.  The algorithm obviously converges as at
+     each iteration we make VBEout sets only smaller.  */
+
+  passes = 0;
+  changed = 1;
+
+  while (changed)
+    {
+      changed = 0;
+
+      FOR_EACH_BB (bb)
+        {
+	  basic_block son;
+	  bool first_p = true;
+
+	  /* Walk through dominated blocks and calculate the set of expressions
+	     that are needed in any one, and only one, of the blocks.
+	     TMP1 is the basis of what we want to remove from VBEout[BB].  */
+	  for (son = first_dom_son (CDI_DOMINATORS, bb);
+	       son != NULL;
+	       son = next_dom_son (CDI_DOMINATORS, son))
+	    {
+	      if (first_p)
+		{
+		  sbitmap_copy (tmp1, hoist_vbeout[son->index]);
+		  first_p = false;
+		}
+	      else
+		sbitmap_difference (tmp1, tmp1, hoist_vbeout[son->index]);
+	    }
+
+	  if (!first_p)
+	    {
+	      /* Now trim TMP1 to avoid removing too much.  */
+
+	      if (bb->prev_bb != ENTRY_BLOCK_PTR)
+		/* Remove epxressions from TMP1 that are needed upwards.
+		   These are VBEout[parent] minus expressions that are
+		   killed in BB (and, hence, don't get to VBEout[parent] from
+		   BB).  */
+		{
+		  basic_block parent;
+
+		  parent = get_immediate_dominator (CDI_DOMINATORS, bb);
+
+		  sbitmap_difference (tmp2, hoist_vbeout[parent->index],
+				      transp[bb->index]);
+
+		  sbitmap_difference (tmp1, tmp1, tmp2);
+		}
+
+	      /* Never remove any of expressions computed in BB from
+		 VBEout[BB].  */
+	      sbitmap_difference (tmp1, tmp1, comp[bb->index]);
+
+	      if (sbitmap_any_common_bits (hoist_vbeout[bb->index], tmp1))
+		/* There is at least one bit that can be removed from
+		   VBEout[BB].  */
+		{
+		  sbitmap_difference (hoist_vbeout[bb->index],
+				      hoist_vbeout[bb->index], tmp1);
+		  changed = 1;
+		}
+	    }
+	}
+
+      passes++;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "hoisting vbeout cleanup: %d passes\n", passes);
+
+      FOR_EACH_BB (bb)
+        {
+	  fprintf (dump_file, "vbeout(%d): ", bb->index);
+	  dump_sbitmap_file (dump_file, hoist_vbeout[bb->index]);
+	}
+    }
+
+  sbitmap_free (tmp1);
+  sbitmap_free (tmp2);
 }
 
 /* Top level routine to do the dataflow analysis needed by code hoisting.  */
@@ -4212,8 +4337,8 @@  compute_code_hoist_data (void)
 {
   compute_local_properties (transp, comp, antloc, &expr_hash_table);
   compute_transpout ();
-  compute_code_hoist_vbeinout ();
   calculate_dominance_info (CDI_DOMINATORS);
+  compute_code_hoist_vbeinout ();
   if (dump_file)
     fprintf (dump_file, "\n");
 }
@@ -4306,7 +4431,8 @@  hoist_code (void)
       int found = 0;
       int insn_inserted_p;
 
-      domby = get_dominated_by (CDI_DOMINATORS, bb);
+      domby = get_dominated_to_depth (CDI_DOMINATORS, bb, MAX_HOIST_DEPTH);
+
       /* Examine each expression that is very busy at the exit of this
 	 block.  These are the potentially hoistable expressions.  */
       for (i = 0; i < hoist_vbeout[bb->index]->n_bits; i++)
@@ -4397,7 +4523,11 @@  hoist_code (void)
 		     it would be safe to compute it at the start of the
 		     dominated block.  Now we have to determine if the
 		     expression would reach the dominated block if it was
-		     placed at the end of BB.  */
+		     placed at the end of BB.
+		     Note: the fact that hoist_exprs has i-th bit set means
+		     that /some/, not necesserilly all, occurences from
+		     the dominated blocks can be hoisted to BB.  Here we check
+		     if a specific occurence can be hoisted to BB.  */
 		  if (hoist_expr_reaches_here_p (bb, i, dominated, NULL))
 		    {
 		      struct expr *expr = index_map[i];
@@ -4410,6 +4540,12 @@  hoist_code (void)
 			occr = occr->next;
 
 		      gcc_assert (occr);
+
+		      /* An occurence might've been already deleted
+			 while processing a dominator of BB.  */
+		      if (occr->deleted_p)
+			continue;
+
 		      insn = occr->insn;
 		      set = single_set (insn);
 		      gcc_assert (set);
diff --git a/gcc/params.def b/gcc/params.def
index 35650ff..f08d482 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -219,6 +219,14 @@  DEFPARAM(PARAM_GCSE_AFTER_RELOAD_CRITICAL_FRACTION,
 	"gcse-after-reload-critical-fraction",
 	"The threshold ratio of critical edges execution count that permit performing redundancy elimination after reload",
         10, 0, 0)
+/* How deep from a given basic block the dominator tree should be searched
+   for expressions to hoist to the block.  The value of 0 will avoid limiting
+   the search.  */
+DEFPARAM(PARAM_MAX_HOIST_DEPTH,
+	 "max-hoist-depth",
+	 "Maximum depth of search in the dominator tree for expressions to hoist",
+	 30, 0, 0)
+
 /* This parameter limits the number of insns in a loop that will be unrolled,
    and by how much the loop is unrolled.
 
diff --git a/gcc/params.h b/gcc/params.h
index 833fc3b..c0404ca 100644
--- a/gcc/params.h
+++ b/gcc/params.h
@@ -125,6 +125,8 @@  typedef enum compiler_param
   PARAM_VALUE (PARAM_GCSE_AFTER_RELOAD_PARTIAL_FRACTION)
 #define GCSE_AFTER_RELOAD_CRITICAL_FRACTION \
   PARAM_VALUE (PARAM_GCSE_AFTER_RELOAD_CRITICAL_FRACTION)
+#define MAX_HOIST_DEPTH \
+  PARAM_VALUE (PARAM_MAX_HOIST_DEPTH)
 #define MAX_UNROLLED_INSNS \
   PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS)
 #define MAX_SMS_LOOP_NUMBER \