Patchwork 0003-Improve-VBEout-computation.patch

login
register
mail settings
Submitter Maxim Kuvyrkov
Date June 16, 2010, 3:56 p.m.
Message ID <4C18F446.20508@codesourcery.com>
Download mbox | patch
Permalink /patch/55903/
State New
Headers show

Comments

Maxim Kuvyrkov - June 16, 2010, 3:56 p.m.
Code hoisting algorithm from Muchnick, which is the one implemented in 
GCC, has a quirk in that it does include expressions calculated in basic 
block in its VBEout set.  This seems odd to me; if an expression is 
calculated in BB and is available at BB's end, then we do want to hoist 
expressions from BB's successors to the end of BB.  This patch 
implements this.

The patch also adds an open-ended comment which describes another 
possible improvement to the algorithm.

OK to apply?

Thank you,
Paolo Bonzini - June 16, 2010, 4:57 p.m.
+	 that are calculated along every path from BB.
+	 E.g., it will not try to optimize the following case:
+
+		  2
+                  | \
+		  3* |
+	          | /
+		  4
+                 / \
+                5*  6
+
+	 ;; "*" marks basic blocks that calculate same expression
+	 ;; Ideally, all calculation would be moved to block 2.

No, this pessimizes the path 2->4->6.  It may cause a massive number of 
expressions to be hoisted above switch statements, see PR24123 for a 
similar failure.

Paolo
Maxim Kuvyrkov - June 16, 2010, 5:03 p.m.
On 6/16/10 8:57 PM, Paolo Bonzini wrote:
> +  that are calculated along every path from BB.
> + E.g., it will not try to optimize the following case:
> +
> + 2
> + | \
> + 3* |
> + | /
> + 4
> + / \
> + 5* 6
> +
> + ;; "*" marks basic blocks that calculate same expression
> + ;; Ideally, all calculation would be moved to block 2.
>
> No, this pessimizes the path 2->4->6.

Code hoisting is used only when optimizing for size, otherwise PRE is 
used.  Maybe I'm missing something, but /speed/ regression of the path 
2->4->6 is acceptable as long as overall code size goes down.

Thank you,
Paolo Bonzini - June 16, 2010, 5:10 p.m.
On 06/16/2010 07:03 PM, Maxim Kuvyrkov wrote:
> On 6/16/10 8:57 PM, Paolo Bonzini wrote:
>> + that are calculated along every path from BB.
>> + E.g., it will not try to optimize the following case:
>> +
>> + 2
>> + | \
>> + 3* |
>> + | /
>> + 4
>> + / \
>> + 5* 6
>> +
>> + ;; "*" marks basic blocks that calculate same expression
>> + ;; Ideally, all calculation would be moved to block 2.
>>
>> No, this pessimizes the path 2->4->6.
>
> Code hoisting is used only when optimizing for size, otherwise PRE is
> used. Maybe I'm missing something, but /speed/ regression of the path
> 2->4->6 is acceptable as long as overall code size goes down.

The traditional "code hoisting" optimization doesn't penalize speed 
(except possibly for increased register pressure and possibly extra 
spilling), so you're not doing the same optimization after your patch 
anymore.

And I'm talking about _massive_ speed regressions when this behavior 
happens.  Try timing

    double i, k, l;
    k = l = 0;
    for (i=1;i<1000000000;i++)
      {
        double j;
        if (i < 10)
          k += 1/i;
        if (i > 999999990)
          l += 1/i;
      }

with and without your patch.  If I understood correctly, you'll compute 
999999999 divisions instead of 18.

Paolo
Maxim Kuvyrkov - June 16, 2010, 5:19 p.m.
On 6/16/10 9:10 PM, Paolo Bonzini wrote:
> On 06/16/2010 07:03 PM, Maxim Kuvyrkov wrote:
>> On 6/16/10 8:57 PM, Paolo Bonzini wrote:
>>> + that are calculated along every path from BB.
>>> + E.g., it will not try to optimize the following case:
>>> +
>>> + 2
>>> + | \
>>> + 3* |
>>> + | /
>>> + 4
>>> + / \
>>> + 5* 6
>>> +
>>> + ;; "*" marks basic blocks that calculate same expression
>>> + ;; Ideally, all calculation would be moved to block 2.
>>>
>>> No, this pessimizes the path 2->4->6.
...
> with and without your patch.

To make sure we're on the same page: the patch consists of two parts:

1. the big open-ended comment that describes an unimplemented possible 
improvement; I assume this what you're referring to; and

2. a smaller comment and 2 lines of code that implement a different 
improvement to VBEout computation that is in line with traditional 
algorithm and doesn't penalize speed.

I do not make strong claims if the change described in (1) is an actual 
improvement.
Paolo Bonzini - June 16, 2010, 5:27 p.m.
On 06/16/2010 07:19 PM, Maxim Kuvyrkov wrote:
> the patch consists of two parts:
>
> 1. the big open-ended comment that describes an unimplemented possible
> improvement; I assume this what you're referring to; and

Yes.

> 2. a smaller comment and 2 lines of code that implement a different
> improvement to VBEout computation that is in line with traditional
> algorithm and doesn't penalize speed.

Got it now. :)

I was confused by the

+	      sbitmap_intersection_of_succs (hoist_vbeout[bb->index],
+					     hoist_vbein, bb->index);

after the comment, which however is just the preexisting code moved into 
braces.

Regarding (2), I think it's fine.  Still wondering about one thing 
though: if an expression is available at the end of BB and computed in 
BB, it is fully redundant and it should be PRE's task to remove it, 
right?  Maybe you're hitting the problem that our RTL PRE is not cascading?

Paolo
Maxim Kuvyrkov - June 16, 2010, 5:32 p.m.
On 6/16/10 9:27 PM, Paolo Bonzini wrote:
...
> Regarding (2), I think it's fine. Still wondering about one thing
> though: if an expression is available at the end of BB and computed in
> BB, it is fully redundant and it should be PRE's task to remove it,
> right? Maybe you're hitting the problem that our RTL PRE is not cascading?

PRE often increases code size, so we run it when optimizing for speed. 
When optimizing for size hoist is run in place of PRE.  Does this answer 
your question?

Regards,
Paolo Bonzini - June 16, 2010, 5:50 p.m.
On 06/16/2010 07:32 PM, Maxim Kuvyrkov wrote:
> On 6/16/10 9:27 PM, Paolo Bonzini wrote:
> ...
>> Regarding (2), I think it's fine. Still wondering about one thing
>> though: if an expression is available at the end of BB and computed in
>> BB, it is fully redundant and it should be PRE's task to remove it,
>> right? Maybe you're hitting the problem that our RTL PRE is not
>> cascading?
>
> PRE often increases code size, so we run it when optimizing for speed.
> When optimizing for size hoist is run in place of PRE. Does this answer
> your question?

Yes.  It looks like a valuable addition indeed.

Can you please add a comment saying "this allows the hoisting pass to 
also perform elimination of fully redundant expressions"?

Thanks!

Paolo
Jeff Law - June 21, 2010, 6:02 p.m.
On 06/16/10 11:50, Paolo Bonzini wrote:
> On 06/16/2010 07:32 PM, Maxim Kuvyrkov wrote:
>> On 6/16/10 9:27 PM, Paolo Bonzini wrote:
>> ...
>>> Regarding (2), I think it's fine. Still wondering about one thing
>>> though: if an expression is available at the end of BB and computed in
>>> BB, it is fully redundant and it should be PRE's task to remove it,
>>> right? Maybe you're hitting the problem that our RTL PRE is not
>>> cascading?
>>
>> PRE often increases code size, so we run it when optimizing for speed.
>> When optimizing for size hoist is run in place of PRE. Does this answer
>> your question?
>
> Yes.  It looks like a valuable addition indeed.
>
> Can you please add a comment saying "this allows the hoisting pass to 
> also perform elimination of fully redundant expressions"?
What's "funny" is gcse.c used to have a classic fully-redundant 
expression elimination pass which was used when optimizing for size 
(instead of a PRE based algorithm).  It had bitrotted over time and it 
was removed.    If we can make hoisting do the job (or a sizeable 
portion of the job) without having to compute any additional dataflow, 
that's good, very good.

jeff
Jeff Law - June 21, 2010, 6:08 p.m.
On 06/16/10 11:03, Maxim Kuvyrkov wrote:
> On 6/16/10 8:57 PM, Paolo Bonzini wrote:
>> +  that are calculated along every path from BB.
>> + E.g., it will not try to optimize the following case:
>> +
>> + 2
>> + | \
>> + 3* |
>> + | /
>> + 4
>> + / \
>> + 5* 6
>> +
>> + ;; "*" marks basic blocks that calculate same expression
>> + ;; Ideally, all calculation would be moved to block 2.
>>
>> No, this pessimizes the path 2->4->6.
>
> Code hoisting is used only when optimizing for size, otherwise PRE is 
> used.  Maybe I'm missing something, but /speed/ regression of the path 
> 2->4->6 is acceptable as long as overall code size goes down.
Well, within certain limits, a speed regression would be acceptable.  
But what is more important is the correctness issue.

Hoisting the expression into block #2 would introduce an evaluation of 
the expression on a path which did not have an evaluation in the 
original code (2->4->6) -- which could potentially cause a conforming 
program to begin to fail.  This can occur for memory references or other 
instructions that might potentially trap/fault on invalid input.

Jeff
Jeff Law - June 21, 2010, 6:19 p.m.
On 06/16/10 09:56, Maxim Kuvyrkov wrote:
> Code hoisting algorithm from Muchnick, which is the one implemented in 
> GCC, has a quirk in that it does include expressions calculated in 
> basic block in its VBEout set.  This seems odd to me; if an expression 
> is calculated in BB and is available at BB's end, then we do want to 
> hoist expressions from BB's successors to the end of BB.  This patch 
> implements this.
>
> The patch also adds an open-ended comment which describes another 
> possible improvement to the algorithm.
I think the comment that has sparked the confusion needs to be removed 
as the proposed optimization is incorrect, at least in cases where the 
expression may trap.


+      /* Enable if debugging VBE data flow problem.  */
+      if (dump_file && 0)
+        {
+          fprintf (dump_file, "vbein (%d): ", bb->index);
+          dump_sbitmap_file (dump_file, hoist_vbein[bb->index]);
+          fprintf (dump_file, "vbeout(%d): ", bb->index);
+          dump_sbitmap_file (dump_file, hoist_vbeout[bb->index]);
+        }

ISTM you should move this code to the end of compute_code_hoist_vbeinout 
and have it iterate over each block -- you don't want to dump the VBE 
state at each transition change, ISTM you want to dump the vbein/vbeout 
for each block when the solution is complete.

Jeff
Maxim Kuvyrkov - June 22, 2010, 12:02 p.m.
On 6/21/10 10:19 PM, Jeff Law wrote:
...
> + /* Enable if debugging VBE data flow problem. */
> + if (dump_file && 0)
> + {
> + fprintf (dump_file, "vbein (%d): ", bb->index);
> + dump_sbitmap_file (dump_file, hoist_vbein[bb->index]);
> + fprintf (dump_file, "vbeout(%d): ", bb->index);
> + dump_sbitmap_file (dump_file, hoist_vbeout[bb->index]);
> + }
>
> ISTM you should move this code to the end of compute_code_hoist_vbeinout
> and have it iterate over each block -- you don't want to dump the VBE
> state at each transition change, ISTM you want to dump the vbein/vbeout
> for each block when the solution is complete.

The intent of this code is to see intermediate states of the data flow 
problem, but, anyway, I'll move it out of the inner loop to dump only 
the final result.

I'll post another version of the patch in a couple of days when I finish 
reworking other pieces of improvements to hoisting.

Thanks!

Patch

diff --git a/gcc/gcse.c b/gcc/gcse.c
index 00f3841..74986a4 100644
--- a/gcc/gcse.c
+++ b/gcc/gcse.c
@@ -4171,13 +4171,52 @@  compute_code_hoist_vbeinout (void)
       FOR_EACH_BB_REVERSE (bb)
 	{
 	  if (bb->next_bb != EXIT_BLOCK_PTR)
-	    sbitmap_intersection_of_succs (hoist_vbeout[bb->index],
-					   hoist_vbein, bb->index);
+	    {
+	      /* ??? Code hoisting algorithm from Muchnick seems to be
+		 overly conservative in considering only those expressions
+		 that are calculated along every path from BB.
+		 E.g., it will not try to optimize the following case:
+
+		  2
+                  | \
+		  3* |
+	          | /
+		  4
+                 / \
+                5*  6
+
+		 ;; "*" marks basic blocks that calculate same expression
+		 ;; Ideally, all calculation would be moved to block 2.
+
+		 One way to improve the algorith can be to exclude
+		 certain edges from intersection of successors' hoist_vbein's.
+
+		 E.g., if a basic block has a fallthru edge that the control
+		 takes with >80% probability, then just copy hoist_vbein
+		 of the destination block to hoist_vbeout of BB.  */
+	      sbitmap_intersection_of_succs (hoist_vbeout[bb->index],
+					     hoist_vbein, bb->index);
+
+	      /* ??? Another quirk of Muchnick is that vbeout[BB] does not
+		 include expressions calculated in BB itself and available
+		 at its end.  Fix this.  */
+	      sbitmap_a_or_b (hoist_vbeout[bb->index],
+			      hoist_vbeout[bb->index], comp[bb->index]);
+	    }
 
 	  changed |= sbitmap_a_or_b_and_c_cg (hoist_vbein[bb->index],
 					      antloc[bb->index],
 					      hoist_vbeout[bb->index],
 					      transp[bb->index]);
+
+	  /* Enable if debugging VBE data flow problem.  */
+	  if (dump_file && 0)
+	    {
+	      fprintf (dump_file, "vbein (%d): ", bb->index);
+	      dump_sbitmap_file (dump_file, hoist_vbein[bb->index]);
+	      fprintf (dump_file, "vbeout(%d): ", bb->index);
+	      dump_sbitmap_file (dump_file, hoist_vbeout[bb->index]);
+	    }
 	}
 
       passes++;
@@ -4298,6 +4337,11 @@  hoist_code (void)
 	  if (TEST_BIT (hoist_vbeout[bb->index], i)
 	      && TEST_BIT (transpout[bb->index], i))
 	    {
+	      /* If an expression is computed in BB and is available at end of
+		 BB, hoist all occurences dominated by BB to BB.  */
+	      if (TEST_BIT (comp[bb->index], i))
+		hoistable++;
+
 	      /* We've found a potentially hoistable expression, now
 		 we look at every block BB dominates to see if it
 		 computes the expression.  */