
ifcvt/crossjump patch: Fix PR 42496, 21803

Message ID 4C48B32C.30007@codesourcery.com
State New

Commit Message

Bernd Schmidt July 22, 2010, 9:07 p.m. UTC
On 07/22/2010 09:47 PM, Eric Botcazou wrote:
>> Here's a new patch.  A testcase is included; as I mentioned before this
>> triggers quite frequently.  This is PR44374.
>>
>> I've moved and reused code from dead_or_predicable for a new function
>> can_move_insns_across.  The tests in dead_or_predicable were still
>> somewhat ad-hoc, after the patch I believe it's using the exact
>> necessary and sufficient conditions for moving code.
> 
> I'll look into it tomorrow.

Before you do that, here's a new version.  This corrects a few errors in
the register lifetime handling, and adds support for moving across two
basic blocks, which is very useful for switch statements but happens in
other cases as well.  A new testcase demonstrates the usefulness; I'm
tempted to add it to gcc.dg but I wouldn't like to try to verify that it
passes on all ports...
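
To illustrate (a hand-written sketch, not the new testcase itself), the
kind of source this fires on looks like:

extern void g (void);
extern void h (void);

void
f (int x, int *p)
{
  switch (x)
    {
    case 1:
      *p = 5;	/* Identical head statement in every arm...  */
      g ();
      break;
    default:
      *p = 5;	/* ...so one copy can move before the switch jump.  */
      h ();
      break;
    }
}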


Bernd
PR rtl-optimization/44374
	* ifcvt.c (find_memory): Remove function.
	(dead_or_predicable): Use can_move_insns_across.
	* df.h (can_move_insns_across): Declare function.
	(simulate_backwards_to_point): Declare function.
	* cfgcleanup.c (block_was_dirty): New static variable.
	(try_head_merge_bb): New static function.
	(try_optimize_cfg): Call it.  Call df_analyze if block_was_dirty
	is set.
	* df-problems.c: Include "target.h".
	(df_simulate_find_uses): New static function.
	(MEMREF_NORMAL, MEMREF_VOLATILE): New macros.
	(find_memory, find_memory_store): New static functions.
	(simulate_backwards_to_point): New function.
	(can_move_insns_across): New function.
	* Makefile.in (df-problems.o): Update dependencies.

testsuite/
	PR rtl-optimization/44374
	* gcc.target/arm/headmerge-1.c: New test.
	* gcc.target/arm/headmerge-2.c: New test.

Comments

Eric Botcazou July 23, 2010, 10:05 p.m. UTC | #1
> Before you do that, here's a new version.  This corrects a few errors in
> the register lifetime handling, and adds support for moving across two
> basic blocks, which is very useful for switch statements but happens in
> other cases as well.  

This implementation really moves insns whereas cross-jumping, the reversed  
transformation, is implemented by means of operations on the CFG.  Although 
this is probably not as straightforward in this direction, did you consider 
the CFG approach instead?  Wouldn't it simplify the integration in 
the cfgcleanup.c framework a little?
Bernd Schmidt July 23, 2010, 10:12 p.m. UTC | #2
On 07/24/2010 12:05 AM, Eric Botcazou wrote:
>> Before you do that, here's a new version.  This corrects a few errors in
>> the register lifetime handling, and adds support for moving across two
>> basic blocks, which is very useful for switch statements but happens in
>> other cases as well.  
> 
> This implementation really moves insns whereas cross-jumping, the reversed  
> transformation, is implemented by means of operations on the CFG.  Although 
> this is probably not as straightforward in this direction, did you consider 
> the CFG approach instead?  Wouldn't it simplify the integration in 
> the cfgcleanup.c framework a little?

Please be more specific about what you envision.


Bernd
Eric Botcazou July 24, 2010, 1:07 p.m. UTC | #3
> Please be more specific about what you envision.

See try_crossjump_to_edge and try_crossjump_bb: no code is actually moved, 
blocks are split and edges redirected instead.
Bernd Schmidt July 26, 2010, 9:42 a.m. UTC | #4
On 07/24/2010 03:07 PM, Eric Botcazou wrote:
>> Please be more specific about what you envision.
> 
> See try_crossjump_to_edge and try_crossjump_bb: no code is actually moved, 
> blocks are split and edges redirected instead.

Yeah, but that can't work for this optimization. I don't see how you can
do it without moving insns across jumps.  Please give an example of how
you would transform code.


Bernd
Paolo Bonzini July 26, 2010, 1:40 p.m. UTC | #5
On 07/26/2010 11:42 AM, Bernd Schmidt wrote:
> On 07/24/2010 03:07 PM, Eric Botcazou wrote:
>>> Please be more specific about what you envision.
>>
>> See try_crossjump_to_edge and try_crossjump_bb: no code is actually moved,
>> blocks are split and edges redirected instead.
>
> Yeah, but that can't work for this optimization. I don't see how you can
> do it without moving insns across jumps.  Please give an example of how
> you would transform code.

You could:
- split the destination BB before the jump (into BB11 and BB12)
- split the source BBs after the last moved instruction (into BB21 and 
BB22, BB31 and BB32, etc.)
- redirect the jumps to BBn1 (n>=2) to go to BBn2.
- graft BB21 between BB11 and BB12, remove all BBn1 for n>2
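
In terms of the existing CFG hooks, an untested sketch of those steps 
(with hypothetical variable names) would be something like:

  /* DEST_BB ends in the jump; SRC[i] are its successors, LAST_HEAD[i]
     the last common insn at the head of SRC[i].  */
  basic_block bb12 = split_block (dest_bb, PREV_INSN (jump))->dest;
  for (i = 0; i < n; i++)
    {
      basic_block tail = split_block (src[i], last_head[i])->dest;
      redirect_edge_and_branch (find_edge (bb12, src[i]), tail);
    }
  /* Then graft the first head block between DEST_BB and BB12, and
     delete the remaining, now-unreachable head blocks.  */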

I don't know if this is worth it, though; it can always be done later.

Paolo
Bernd Schmidt July 26, 2010, 1:49 p.m. UTC | #6
On 07/26/2010 03:40 PM, Paolo Bonzini wrote:
> You could:
> - split the destination BB before the jump (into BB11 and BB12)
> - split the source BBs after the last moved instruction (into BB21 and
> BB22, BB31 and BB32, etc.)
> - redirect the jumps to BBn1 (n>=2) to go to BBn2.
> - graft BB21 between BB11 and BB12, remove all BBn1 for n>2

How is this simpler and better than just having a single line calling
reorder_insns?  It seems pointless given that it produces the same
result, with a lot more effort.
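
(Concretely, the single line in question, from try_head_merge_bb in the
patch below, is

  reorder_insns (headptr[0], currptr[0], PREV_INSN (move_before));

plus a delete_insn_chain call for each of the other copies.)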


Bernd
Paolo Bonzini July 26, 2010, 1:56 p.m. UTC | #7
On 07/26/2010 03:49 PM, Bernd Schmidt wrote:
> On 07/26/2010 03:40 PM, Paolo Bonzini wrote:
>> You could:
>> - split the destination BB before the jump (into BB11 and BB12)
>> - split the source BBs after the last moved instruction (into BB21 and
>> BB22, BB31 and BB32, etc.)
>> - redirect the jumps to BBn1 (n>=2) to go to BBn2.
>> - graft BB21 between BB11 and BB12, remove all BBn1 for n>2
>
> How is this simpler and better than just having a single line calling
> reorder_insns?  It seems pointless given that it produces the same
> result, with a lot more effort.

I can't say I disagree (even if you include in the picture deleting the 
duplicated insns in other basic blocks).

Paolo
Eric Botcazou July 27, 2010, 7:44 a.m. UTC | #8
> How is this simpler and better than just having a single line calling
> reorder_insns?  It seems pointless given that it produces the same
> result, with a lot more effort.

It's the canonical way of doing this kind of transformations these days.  The 
underlying machinery is supposed to do all the heavy lifting, you just have 
to drive it.

In particular, I want to avoid kludges like:

+/* Set to true if we couldn't run an optimization due to stale liveness
+   information; we should run df_analyze to enable more opportunities.  */
+static bool block_was_dirty;

@@ -2182,6 +2449,9 @@ try_optimize_cfg (int mode)
 	      && try_crossjump_bb (mode, EXIT_BLOCK_PTR))
 	    changed = true;
 
+	  if (block_was_dirty)
+	    df_analyze ();
+
 #ifdef ENABLE_CHECKING
 	  if (changed)
 	    verify_flow_info ();

that shouldn't be necessary.
Bernd Schmidt July 27, 2010, 9:21 a.m. UTC | #9
On 07/27/2010 09:44 AM, Eric Botcazou wrote:
>> How is this simpler and better than just having a single line calling
>> reorder_insns?  It seems pointless given that it produces the same
>> result, with a lot more effort.
> 
> It's the canonical way of doing this kind of transformations these days.  The 
> underlying machinery is supposed to do all the heavy lifting, you just have 
> to drive it.

That's a non-argument, and false IMO.  Not every optimization can be
represented cleanly as a set of CFG manipulations, and this one can't.
Are you saying we should restrict ourselves to not doing it?

> In particular, I want to avoid kludges like:
> 
> +/* Set to true if we couldn't run an optimization due to stale liveness
> +   information; we should run df_analyze to enable more opportunities.  */
> +static bool block_was_dirty;
> 
> @@ -2182,6 +2449,9 @@ try_optimize_cfg (int mode)
>  	      && try_crossjump_bb (mode, EXIT_BLOCK_PTR))
>  	    changed = true;
>  
> +	  if (block_was_dirty)
> +	    df_analyze ();
> +
>  #ifdef ENABLE_CHECKING
>  	  if (changed)
>  	    verify_flow_info ();
> 
> that shouldn't be necessary.

Still not an argument.  Why shouldn't it be necessary?  It is logical
that by moving code, we change the liveness of registers.  We have to
verify the liveness of registers before moving code, hence, to iterate,
we have to recompute it.


Bernd
Bernd Schmidt July 27, 2010, 1:27 p.m. UTC | #10
On 07/27/2010 11:21 AM, Bernd Schmidt wrote:
> On 07/27/2010 09:44 AM, Eric Botcazou wrote:
>> In particular, I want to avoid kludges like:
>>
>> +/* Set to true if we couldn't run an optimization due to stale liveness
>> +   information; we should run df_analyze to enable more opportunities.  */
>> +static bool block_was_dirty;
>>
>> @@ -2182,6 +2449,9 @@ try_optimize_cfg (int mode)
>>  	      && try_crossjump_bb (mode, EXIT_BLOCK_PTR))
>>  	    changed = true;
>>  
>> +	  if (block_was_dirty)
>> +	    df_analyze ();
>> +
>>  #ifdef ENABLE_CHECKING
>>  	  if (changed)
>>  	    verify_flow_info ();
>>
>> that shouldn't be necessary.
> 
> Still not an argument.  Why shouldn't it be necessary?  It is logical
> that by moving code, we change the liveness of registers.  We have to
> verify the liveness of registers before moving code, hence, to iterate,
> we have to recompute it.

BTW, this is essentially the same thing that's done in the main loop in
ifcvt.c (do you also see it as a kludge there?), where I originally
implemented this optimization, and you were the one who suggested I move
it to cfgcleanup.c.  Maybe you misunderstood the optimization back then
and thought it was just CFG manipulation?  That's simply not the case;
the analogy with crossjumping doesn't entirely hold.

Please explain what you are thinking.  If you have a clever way to do
it, show it.  If you have just not thought it through sufficiently,
please do not continue to hold up a useful improvement.


Bernd
Jeff Law July 27, 2010, 3:27 p.m. UTC | #11
On 07/23/10 16:05, Eric Botcazou wrote:
>> Before you do that, here's a new version.  This corrects a few errors in
>> the register lifetime handling, and adds support for moving across two
>> basic blocks, which is very useful for switch statements but happens in
>> other cases as well.
> This implementation really moves insns whereas cross-jumping, the reversed
> transformation, is implemented by means of operations on the CFG.  Although
> this is probably not as straightforward in this direction, did you consider
> the CFG approach instead?  Wouldn't it simplify the integration in
> the cfgcleanup.c framework a little?
It's probably worth noting that these optimizations are more effective 
when they're allowed to move insns.   So while limiting to CFG 
approaches may simplify things, it also leads to fewer opportunities to 
commonize code.

jeff
Jeff Law July 27, 2010, 5:09 p.m. UTC | #12
On 07/27/10 03:21, Bernd Schmidt wrote:
> In particular, I want to avoid kludges like:
>> +/* Set to true if we couldn't run an optimization due to stale liveness
>> +   information; we should run df_analyze to enable more opportunities.  */
>> +static bool block_was_dirty;
>>
>> @@ -2182,6 +2449,9 @@ try_optimize_cfg (int mode)
>>   	&&  try_crossjump_bb (mode, EXIT_BLOCK_PTR))
>>   	    changed = true;
>>
>> +	  if (block_was_dirty)
>> +	    df_analyze ();
>> +
>>   #ifdef ENABLE_CHECKING
>>   	  if (changed)
>>   	    verify_flow_info ();
>>
>> that shouldn't be necessary.
> Still not an argument.  Why shouldn't it be necessary?  It is logical
> that by moving code, we change the liveness of registers.  We have to
> verify the liveness of registers before moving code, hence, to iterate,
> we have to recompute it.

It seems to me implementing this optimization well requires insn 
movement which is going to affect register lifetimes.   Furthermore, 
this optimization is sitting inside a while (changed) style loop.  At 
the least we need to mark blocks where the life data has become 
inaccurate so that we don't mis-optimize based on inaccurate life data.  
I haven't thought deeply about the problem, but it may well be the case 
that as the cfgcleanup loop iterates new opportunities may be exposed 
and thus it'd be useful to go ahead and update the life information.

What I'm more concerned about is placement of this optimization in 
cfgcleanup -- one could argue this optimization isn't strictly a cfg 
cleanup given the need to move insns from one block to another (contrast 
to our cross jumping implementation which just scrambles the cfg).

One could formulate a head merging algorithm which worked solely on the 
CFG, but I doubt it's going to be very effective.

jeff
Bernd Schmidt July 27, 2010, 9:35 p.m. UTC | #13
Thanks for looking at this, Jeff.

On 07/27/2010 07:09 PM, Jeff Law wrote:
> It seems to me implementing this optimization well requires insn
> movement which is going to affect register lifetimes.   Furthermore,
> this optimization is sitting inside a while (changed) style loop.  At
> the least we need to mark blocks where the life data has become
> inaccurate so that we don't mis-optimize based on inaccurate life data. 
> I haven't thought deeply about the problem, but it may well be the case
> that as the cfgcleanup loop iterates new opportunities may be exposed
> and thus it'd be useful to go ahead and update the life information.

I'm fairly certain I've observed this to happen.  Note that other
transformations done in cfgcleanup can make life information inaccurate
before we even try to run head merging, making it impossible to do the
analysis.

> What I'm more concerned about is placement of this optimization in
> cfgcleanup -- one could argue this optimization isn't strictly a cfg
> cleanup given the need to move insns from one block to another (contrast
> to our cross jumping implementation which just scrambles the cfg).

Originally I'd placed this in ifcvt.c, by analogy with find_if_case_1
and find_if_case_2, which do some very similar transformations.  Eric
requested I move it to cfgcleanup.c and now seems unhappy about the
consequences.  I can see two possible reasons for the request: handling
switch statements as well as ifs (also possible in ifcvt by placing it
right at the top of find_if_header or doing it before calling
find_if_header), and the possibility that other cfg cleanups expose more
opportunities.

I think in theory, it is probably more powerful to do it in cfg cleanup,
which is why I did not object to the request to do it there.  However, I
did not see anything wrong in principle with the original patch that did
it in ifcvt (and I think it could have been applied at the time, and
would have improved gcc).

I wouldn't necessarily mind putting it back into ifcvt, but we might
need to insert calls to cleanup_cfg to get the full benefit (possibly
restricted to situations where we made a block empty except for a jump).

> One could formulate a head merging algorithm which worked solely on the
> CFG, but I doubt it's going to be very effective.

Well, I don't know what if anything Eric has in mind, but assuming we have

BB1
lots of stuff
if (x) goto A;
BB2
y = 1;
goto C;
BB3
A: y = 1;
goto D;

how can we possibly avoid code movement?  The whole purpose is that we
want to have only one copy of the assignment, and that only works if
it's before the jump.  Never mind that we couldn't merge it into the end
of BB1 since the jump can't be in the middle of a basic block.  So it
seems fairly obvious to me that any kind of simple CFG manipulation
fails entirely to achieve the right result.
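
To spell out what we want to end up with:

BB1
lots of stuff
y = 1;
if (x) goto A;
BB2
goto C;
BB3
A: goto D;

with just the one copy of the assignment, ahead of the jump.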

Even if it can be thought of as "reverse" cross jumping, in terms of
implementation the transformations in ifcvt are a somewhat better match
than the crossjump code.  For one thing, we need almost exactly the same
code to check liveness information.

I think we can also discard the suggestion of simulating the effect of a
single reorder_insns call with a series of complex CFG transformations,
as that seems entirely pointless.


Bernd
Eric Botcazou July 27, 2010, 10:16 p.m. UTC | #14
> It's probably worth noting that these optimizations are more effective
> when they're allowed to move insns.   So while limiting to CFG
> approaches may simplify things, it also leads to fewer opportunities to
> commonize code.

Do you have a concrete example of such an optimization that would be doable 
with code movements but not with CFG manipulations?
Eric Botcazou July 27, 2010, 10:18 p.m. UTC | #15
> That's a non-argument, and false IMO.  Not every optimization can be
> represented cleanly as a set of CFG manipulations, and this one can't.

I wrote "this kind of transformations", not "all transformations".

What about the algorithm sketched by Paolo?

> Are you saying we should restrict ourselves to not doing it?

I'm saying that optimizations run in cfgcleanup.c must play by the rules.

> Still not an argument.  Why shouldn't it be necessary?  It is logical
> that by moving code, we change the liveness of registers.  We have to
> verify the liveness of registers before moving code, hence, to iterate,
> we have to recompute it.

Because this will be done automatically if you use the appropriate API.
Eric Botcazou July 27, 2010, 10:28 p.m. UTC | #16
> BTW, this is essentially the same thing that's done in the main loop in
> ifcvt.c (do you also see it as a kludge there?), where I originally
> implemented this optimization, and you were the one who suggested I move
> it to cfgcleanup.c.  Maybe you misunderstood the optimization back then
> and thought it was just CFG manipulation?  That's simply not the case;
> the analogy with crossjumping doesn't entirely hold.

Yes, I still think that it will be more useful in cfgcleanup.c.  And it's 
another form of code commonization so I still think that implementing it using 
CFG manipulations is the best approach.

> Please explain what you are thinking.  If you have a clever way to do
> it, show it.  If you have just not thought it through sufficiently,
> please do not continue to hold up a useful improvement.

Fair enough.  Since I don't have enough time at the moment to experiment 
myself, I guess I have to withdraw my objections.

Please run this by another maintainer though.
Eric Botcazou July 27, 2010, 10:38 p.m. UTC | #17
> Well, I don't know what if anything Eric has in mind, but assuming we have
>
> BB1
> lots of stuff
> if (x) goto A;
> BB2
> y = 1;
> goto C;
> BB3
> A: y = 1;
> goto D;
>
> how can we possibly avoid code movement?

Split BB2 and BB3 after "y = 1;" and redirect the edges from BB1.  Then split 
BB1 before the test and insert one instance of the common heads.
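
That is, roughly:

BB1  lots of stuff
BB1' y = 1;
     if (x) goto A';
BB2' goto C;
BB3' A': goto D;

where BB1' holds the single inserted instance of the common head.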
Bernd Schmidt July 27, 2010, 10:45 p.m. UTC | #18
On 07/28/2010 12:18 AM, Eric Botcazou wrote:
>> That's a non-argument, and false IMO.  Not every optimization can be
>> represented cleanly as a set of CFG manipulations, and this one can't.
> 
> I wrote "this kind of transformations", not "all transformations".
> 
> What about the algorithm sketched by Paolo?

Paolo's "algorithm" was
- split the destination BB before the jump (into BB11 and BB12)
- split the source BBs after the last moved instruction (into BB21 and
BB22, BB31 and BB32, etc.)
- redirect the jumps to BBn1 (n>=2) to go to BBn2.
- graft BB21 between BB11 and BB12, remove all BBn1 for n>2

which has exactly the effect of one call to reorder_insns and a few more
to delete_insn, except it creates lots of garbage BBs only to delete
them immediately again.

What exactly is gained by that?  Certainly not readability.  You're
moving insns, so reorder_insns is the correct API.  The suggestion is
obviously absurd in my eyes.

>> Are you saying we should restrict ourselves to not doing it?
> 
> I'm saying that optimizations run in cfgcleanup.c must play by the rules.

If your "rules" lead to an absurd result, the rules are bogus.  Who
decided those "rules" anyway?

>> Still not an argument.  Why shouldn't it be necessary?  It is logical
>> that by moving code, we change the liveness of registers.  We have to
>> verify the liveness of registers before moving code, hence, to iterate,
>> we have to recompute it.
> 
> Because this will be done automatically if you use the appropriate API.

Then I don't think we have the "appropriate API".  The CFG contortions
mentioned above certainly do nothing to solve this problem.  I still
think you're simply missing something here.

> Fair enough.  Since I don't have enough time at the moment to experiment 
> myself, I guess I have to withdraw my objections.
> 
> Please run this by another maintainer though.

Thanks.


Bernd
Paolo Bonzini July 27, 2010, 11:04 p.m. UTC | #19
On Wed, Jul 28, 2010 at 00:18, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> That's a non-argument, and false IMO.  Not every optimization can be
>> represented cleanly as a set of CFG manipulations, and this one can't.
>
> I wrote "this kind of transformations", not "all transformations".
>
> What about the algorithm sketched by Paolo?

That's what Bernd referred to when he said, "I think we can also
discard the suggestion of simulating the effect of a single
reorder_insns call with a series of complex CFG transformations, as
that seems entirely pointless."

I actually agree with him.  I don't think it is _that_ complex
(particularly because my sketch did more than a single reorder_insns),
but I agree it is pointless.  It is faking that head merging is a pure
CFG transformation when in fact it isn't.

Paolo
Eric Botcazou July 28, 2010, 8:35 a.m. UTC | #20
> What exactly is gained by that?  Certainly not readability.

Avoiding kludges like the one I already mentioned.

> You're moving insns, so reorder_insns is the correct API.  The suggestion is
> obviously absurd in my eyes.

The first sentence is equally absurd these days because of CFG layout mode.
Bernd Schmidt July 28, 2010, 10:04 a.m. UTC | #21
On 07/28/2010 10:35 AM, Eric Botcazou wrote:
>> What exactly is gained by that?  Certainly not readability.
> 
> Avoiding kludges like the one I already mentioned.

As I already pointed out, it's
a) not a kludge, but quite necessary and done like that elsewhere
b) not avoided by pretending a reorder_insns operation is a CFG operation.

Hence, your comment makes no sense.  You've never replied with anything
of substance to either point a) or point b).

>> You're moving insns, so reorder_insns is the correct API.  The suggestion is
>> obviously absurd in my eyes.
> 
> The first sentence is equally absurd these days because of CFG layout mode.

Explain how that is relevant to the current discussion.  I believe it's
completely beside the point, as that's only concerned with the layout of
basic blocks, not with the placement of insns within them.  Moving insns
from one basic block to another (while explicitly avoiding touching
things like final jump insns) doesn't affect that, so IMO you're still
not making any sense.

Maybe that's the thing you're missing?  No control flow insns are ever
touched or moved at all by the patch.  The CFG is in every case the same
afterwards as it was before (although it may be cleaned up, but that's a
different job done already by the other code in cfgcleanup).  That's a
pretty strong hint that we're not dealing with a CFG operation.

Your attempts at patch review for this issue all have been drive-by
one-liners like this, which were at best insufficiently explained, and
at worst completely nonsensical.  It has been extremely frustrating.


Bernd
Jeff Law July 28, 2010, 4:52 p.m. UTC | #22
On 07/27/10 16:28, Eric Botcazou wrote:
>> BTW, this is essentially the same thing that's done in the main loop in
>> ifcvt.c (do you also see it as a kludge there?), where I originally
>> implemented this optimization, and you were the one who suggested I move
>> it to cfgcleanup.c.  Maybe you misunderstood the optimization back then
>> and thought it was just CFG manipulation?  That's simply not the case;
>> the analogy with crossjumping doesn't entirely hold.
> Yes, I still think that it will be more useful in cfgcleanup.c.
While I don't think head merging is a perfect fit for cfgcleanup.c, I 
don't think it's worth a huge argument.  I'll go along with the final 
version (whatever it looks like) in cfgcleanup.c

>   And it's
> another form of code commonization so I still think that implementing it using
> CFG manipulations is the best approach.
I think this is the root of the disagreement.  I'll keep looking at it.


> Please run this by another maintainer though.
I'll own reviewing.

Jeff
Jeff Law July 28, 2010, 5:03 p.m. UTC | #23
On 07/27/10 16:38, Eric Botcazou wrote:
>> Well, I don't know what if anything Eric has in mind, but assuming we have
>>
>> BB1
>> lots of stuff
>> if (x) goto A;
>> BB2
>> y = 1;
>> goto C;
>> BB3
>> A: y = 1;
>> goto D;
>>
>> how can we possibly avoid code movement?
> Split BB2 and BB3 after "y = 1;" and redirect the edges from BB1.  Then split
> BB1 before the test and insert one instance of the common heads.
Which to me seems more convoluted than just moving insns implementing 
the common code.  And I think that's the whole point behind the 
disagreement.  While we *can* formulate this as a series of CFG 
manipulations I think it actually makes the resulting transformation 
more difficult to understand.

Also note that trying to pluck common insns that aren't at the head of 
the blocks is more difficult to do as a pure CFG transformation (it's 
doable, just ugly) while it ought to be trivial to just move them around.


Jeff
Jeff Law July 28, 2010, 5:05 p.m. UTC | #24
On 07/27/10 16:16, Eric Botcazou wrote:
>> It's probably worth noting that these optimizations are more effective
>> when they're allowed to move insns.   So while limiting to CFG
>> approaches may simplify things, it also leads to fewer opportunities to
>> commonize code.
> Do you have a concrete example of such an optimization that would be doable
> with code movements but not with CFG manipulations?
Think about plucking a common insn from the middle of a block rather 
than strictly at the head or tail.

Jeff
Bernd Schmidt July 28, 2010, 5:27 p.m. UTC | #25
On 07/28/2010 07:05 PM, Jeff Law wrote:
> On 07/27/10 16:16, Eric Botcazou wrote:
>>> It's probably worth noting that these optimizations are more
>>> effective when they're allowed to move insns.   So while limiting
>>> to CFG approaches may simplify things, it also leads to fewer
>>> opportunities to commonize code.
>> Do you have a concrete example of such an optimization that would
>> be doable with code movements but not with CFG manipulations?
> Think about plucking a common insn from the middle of a block rather 
> than strictly at the head or tail.

In a sense that's what happens here if you consider the middle as
anything between the code_label or note_insn_basic_block and the final jump.

Of course, moving out of the middle can also be disguised with a CFG
scheme like the one Paolo described (you just need to split a block
twice), but the real question is, why would anyone want to do that?
Just to follow arbitrary, misunderstood rules?

> Which to me seems more convoluted than just moving insns implementing
> the common code.  And I think that's the whole point behind the
> disagreement.

There must be something else, Eric seems to think using CFG manipulation
would somehow eliminate the need to verify register lifetimes.  If you
agree with me that this is simply impossible, I think we can bury that
part of the issue.


Bernd
Jeff Law July 28, 2010, 6:25 p.m. UTC | #26
On 07/27/10 16:45, Bernd Schmidt wrote:
> On 07/28/2010 12:18 AM, Eric Botcazou wrote:
>>> That's a non-argument, and false IMO.  Not every optimization can be
>>> represented cleanly as a set of CFG manipulations, and this one can't.
>> I wrote "this kind of transformations", not "all transformations".
>>
>> What about the algorithm sketched by Paolo?
> Paolo's "algorithm" was
> - split the destination BB before the jump (into BB11 and BB12)
> - split the source BBs after the last moved instruction (into BB21 and
> BB22, BB31 and BB32, etc.)
> - redirect the jumps to BBn1 (n>=2) to go to BBn2.
> - graft BB21 between BB11 and BB12, remove all BBn1 for n>2
>
> which has exactly the effect of one call to reorder_insns and a few more
> to delete_insn, except it creates lots of garbage BBs only to delete
> them immediately again.
>
> What exactly is gained by that?  Certainly not readability.  You're
> moving insns, so reorder_insns is the correct API.  The suggestion is
> obviously absurd in my eyes.

Can we all agree that the problem can be viewed as either cfg 
manipulations or insn movement and that what we're really arguing about 
is which is the most appropriate way to view the problem?

If so, then we really need to determine which implementation is the 
easiest to understand, implement & maintain.


>> I'm saying that optimizations run in cfgcleanup.c must play by the rules.
> If your "rules" lead to an absurd result, the rules are bogus.  Who
> decided those "rules" anyway?
I'm not aware of any such rule.   I can see the value in placing such 
rules on cfgcleanup.c's worker bees, which is part of the reason why I 
originally suggested this optimization (if implemented as insn movement) 
be placed somewhere other than cfgcleanup.


Jeff
Paolo Bonzini July 28, 2010, 6:30 p.m. UTC | #27
On 07/28/2010 08:25 PM, Jeff Law wrote:
>>> I'm saying that optimizations run in cfgcleanup.c must play by the
>>> rules.
>> If your "rules" lead to an absurd result, the rules are bogus. Who
>> decided those "rules" anyway?
>
> I'm not aware of any such rule. I can see the value in placing such rules
> on cfgcleanup.c's worker bees, which is part of the reason why I
> originally suggested this optimization (if implemented as insn movement)
> be placed somewhere other than cfgcleanup.

Personally I don't care about how the pass is implemented, I think it 
fits more in cfgcleanup.c anyway than in if-conversion (because it 
doesn't remove the conditional execution).

There may be another advantage in putting it in cfgcleanup; using flags 
to control head-merging may be more suitable than the relatively rigid 
pass manager.

Paolo
Jeff Law July 28, 2010, 7:39 p.m. UTC | #28
On 07/28/10 04:04, Bernd Schmidt wrote:
>
> As I already pointed out, it's
> a) not a kludge, but quite necessary and done like that elsewhere
> b) not avoided by pretending a reorder_insns operation is a CFG operation.
>
> Hence, your comment makes no sense.  You've never replied with anything
> of substance to either point a) or point b).
Presumably when we split the blocks we trigger a DF update, either 
immediate or deferred.  At least that's what makes sense to me (since in 
the CFG formulation there isn't any real insn movement, just block 
splitting and edge redirection).  While I see value in having the DF 
update happen automagically as a result of our CFG manipulations, I'm 
still not seeing how CFG manipulations are a cleaner way to express this 
optimization.


>
> Maybe that's the thing you're missing?  No control flow insns are ever
> touched or moved at all by the patch.  The CFG is in every case the same
> afterwards as it was before (although it may be cleaned up, but that's a
> different job done already by the other code in cfglcleanup).  That's a
> pretty strong hint that we're not dealing with a CFG operation.
I doubt that's what Eric is missing -- there's really two ways to 
formulate this optimization, moving insns without manipulating the CFG 
and strictly with CFG manipulations.  It appears to me y'all differ on 
which of the implementations is preferred.

> Your attempts at patch review for this issue all have been drive-by
> one-liners like this, which were at best insufficiently explained, and
> at worst completely nonsensical.  It has been extremely frustrating.
I can see it's frustrating for both of you -- I'd ask that everyone 
remember that you both are advocating a solution you believe in.  It's 
not personal, but just a matter of different technical opinions.

jeff
Bernd Schmidt July 28, 2010, 7:54 p.m. UTC | #29
On 07/28/2010 09:39 PM, Jeff Law wrote:
>  On 07/28/10 04:04, Bernd Schmidt wrote:
>>
>> As I already pointed out, it's
>> a) not a kludge, but quite necessary and done like that elsewhere
>> b) not avoided by pretending a reorder_insns operation is a CFG
>> operation.
>>
>> Hence, your comment makes no sense.  You've never replied with anything
>> of substance to either point a) or point b).
> Presumably when we split the blocks we trigger a DF update, either
> immediate or deferred.

All I can see in cfgcleanup or cfgrtl are manual calls to
df_set_bb_dirty to show we don't know anything about register lives in
modified blocks.  This is also done by my new pass if it modifies stuff.
 If we want to get accurate lifetime data (and we need it to verify we
can move insns), we'll then have to call df_analyze at some point as far
as I know.  We certainly don't want to do that for every change we make,
so it has to happen at the top level when every other possibility has
been exhausted.  So I don't see how we'd avoid the "kludge".  Again, see
ifcvt.c, it's precisely the same structure.

The df_analyze call is new simply because no code in cfgcleanup.c
currently needs to look at register lifetimes.


Bernd
Bernd Schmidt July 28, 2010, 9:30 p.m. UTC | #30
On 07/26/2010 03:40 PM, Paolo Bonzini wrote:
> - split the destination BB before the jump (into BB11 and BB12)
> - split the source BBs after the last moved instruction (into BB21 and
> BB22, BB31 and BB32, etc.)
> - redirect the jumps to BBn1 (n>=2) to go to BBn2.
> - graft BB21 between BB11 and BB12, remove all BBn1 for n>2

The funny thing is that when you look at merge_blocks_move and its
subroutines, which presumably would be used in the last step, you'll
find it's implemented using... reorder_insns!

So, the above creates several useless BB structures, moves them about,
deletes them again, performs the roughly same number of reorder_insns
and delete_insn calls as the code it would be replacing, only to end up
with the same CFG as when it started.  All in order to pretend we're
doing a CFG operation.


Bernd
Eric Botcazou July 29, 2010, 8:17 a.m. UTC | #31
> I think this is the root of the disagreement.  I'll keep looking at it.
>
> > Please run this by another maintainer though.
>
> I'll own reviewing.

Thanks.
Eric Botcazou July 29, 2010, 8:28 a.m. UTC | #32
> Can we all agree that the problem can be viewed as either cfg
> manipulations or insn movement and that what we're really arguing about
> is which is the most appropriate way to view the problem?

Yes.

> If so, then we really need to determine which implementation is the
> easiest to understand, implement & maintain.

Yes, and I think only experiments can give a definitive answer.  That's all 
I requested from Bernd.
Jeff Law July 29, 2010, 2:25 p.m. UTC | #33
On 07/28/10 15:30, Bernd Schmidt wrote:
> On 07/26/2010 03:40 PM, Paolo Bonzini wrote:
>> - split the destination BB before the jump (into BB11 and BB12)
>> - split the source BBs after the last moved instruction (into BB21 and
>> BB22, BB31 and BB32, etc.)
>> - redirect the jumps to BBn1 (n>=2) to go to BBn2.
>> - graft BB21 between BB11 and BB12, remove all BBn1 for n>2
> The funny thing is that when you look at merge_blocks_move and its
> subroutines, which presumably would be used in the last step, you'll
> find it's implemented using... reorder_insns!
Quite amusing.  I guess that eliminates the need to ponder an 
implementation of reorder_insns as a CFG manipulation and how that would 
affect readability of the resulting code -- I wasn't going to suggest we 
actually make that change only ponder the impacts as a proxy for the 
main issue we're stuck on.

Jeff
Jeff Law July 29, 2010, 3:26 p.m. UTC | #34
On 07/28/10 13:54, Bernd Schmidt wrote:
> On 07/28/2010 09:39 PM, Jeff Law wrote:
>>   On 07/28/10 04:04, Bernd Schmidt wrote:
>>> As I already pointed out, it's
>>> a) not a kludge, but quite necessary and done like that elsewhere
>>> b) not avoided by pretending a reorder_insns operation is a CFG
>>> operation.
>>>
>>> Hence, your comment makes no sense.  You've never replied with anything
>>> of substance to either point a) or point b).
>> Presumably when we split the blocks we trigger a DF update, either
>> immediate or deferred.
> All I can see in cfgcleanup or cfgrtl are manual calls to
> df_set_bb_dirty to show we don't know anything about register lives in
> modified blocks.  This is also done by my new pass if it modifies stuff.
So we've got the markers to allow us to do a deferred update, but with 
nothing needing the accurate DF info in the cfgcleanup loop, we just 
ignore the markers (within the context of that loop).    Which certainly 
makes sense if nothing in cfgcleanup has needed accurate register 
lifetime info until now.

I'm having an awful hard time seeing what modeling this optimization as 
a series of CFG manipulations is going to win us right now.   It's more 
complex than using reorder_insns, it doesn't keep the DF information 
up-to-date, and generates unnecessary churn in the CFG and other data 
structures.

Jeff
Paolo Bonzini July 29, 2010, 4:19 p.m. UTC | #35
On 07/29/2010 05:26 PM, Jeff Law wrote:
>>>
>> All I can see in cfgcleanup or cfgrtl are manual calls to
>> df_set_bb_dirty to show we don't know anything about register lives in
>> modified blocks.  This is also done by my new pass if it modifies stuff.
>
> So we've got the markers to allow us to do a deferred update, but with
> nothing needing the accurate DF info in the cfgcleanup loop, we just
> ignore the markers (within the context of that loop).    Which certainly
> makes sense if nothing in cfgcleanup has needed accurate register
> lifetime until now.

Correct.  _Operand scan_ can be made automatic, but not dataflow analysis.

> I'm having an awful hard time seeing what modeling this optimization as
> a series of CFG manipulations is going to win us right now.   It's more
> complex than using reorder_insns, it doesn't keep the DF information
> up-to-date, and generates unnecessary churn in the CFG and other data
> structures.

Just to state it once more, I agree.

I have two remaining doubts, which were overshadowed by the discussion so 
far.  Don't worry, it's small stuff. :)

I'd like to have a note to the reader that df_analyze is only invoked 
when you do crossjumping.  Please add an assert like

   if (block_was_dirty)
     {
       gcc_assert (mode & CLEANUP_CROSSJUMP);
       df_analyze ();
     }

We do not use dataflow otherwise, and it is not necessary to call it 
gratuitously.  Passes know that CFG cleanup destroys dataflow and call 
it themselves if necessary.

Second, crossjumping is now more expensive.  Does it really buy much to 
iterate it?  Something like

   mode &= ~CLEANUP_CROSSJUMP;

just before iterating may still leave it "good enough".  Steven, do you 
remember anything?  This can be done separately after the patch 
goes in.

Paolo
Bernd Schmidt July 29, 2010, 5 p.m. UTC | #36
On 07/29/2010 06:19 PM, Paolo Bonzini wrote:
> I'd like to have a note to the reader that df_analyze is only invoked
> when you do crossjumping.  Please add an assert like
> 
>   if (block_was_dirty)
>     {
>       gcc_assert (mode & CLEANUP_CROSSJUMP);
>       df_analyze ();
>     }

Can do.

> We do not use dataflow otherwise, and it is not necessary to call it
> gratuitously.  Passes know that CFG cleanup destroys dataflow and call
> it themselves if necessary.

Then again, we probably won't lose much by calling df_analyze during
cfgcleanup if the following pass needs it anyway - right?

>> Second, crossjumping is now more expensive.  Does it really buy much to
>> iterate it?  Something like
> 
>   mode &= ~CLEANUP_CROSSJUMP;
> 
> just before iterating may still leave it "good enough".

A quick experiment shows that this causes many missed opportunities.
(I placed it after the run_fast_dce call.)

Another issue that I stumbled across is that cfgcleanup uses the
df_bb_dirty flag for other reasons: it uses it to test whether a block
has changed previously, and retries its optimizations only if that is
the case.  We probably need a different flag to indicate "block changed
during cfgcleanup" and set it from df_bb_dirty just before calling
df_analyze.
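
A minimal sketch of what I mean (hypothetical, not in the posted patch):

  if (block_was_dirty)
    {
      basic_block b;
      /* Snapshot df's dirty bits into a separate marker before
	 df_analyze clears them.  */
      FOR_EACH_BB (b)
	if (df_get_bb_dirty (b))
	  b->aux = (void *) 1;	/* changed during this cfgcleanup run */
      df_analyze ();
    }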


Bernd
Paolo Bonzini July 29, 2010, 5:10 p.m. UTC | #37
On 07/29/2010 07:00 PM, Bernd Schmidt wrote:
> On 07/29/2010 06:19 PM, Paolo Bonzini wrote:
>> I'd like to have a note to the reader that df_analyze is only invoked
>> when you do crossjumping.  Please add an assert like
>>
>>    if (block_was_dirty)
>>      {
>>        gcc_assert (mode & CLEANUP_CROSSJUMP);
>>        df_analyze ();
>>      }
>
> Can do.
>
>> We do not use dataflow otherwise, and it is not necessary to call it
>> gratuitously.  Passes know that CFG cleanup destroys dataflow and call
>> it themselves if necessary.
>
> Then again, we probably won't lose much by calling df_analyze during
> cfgcleanup if the following pass needs it anyway - right?

What I meant is I want to document that it's for a special case.  I 
wouldn't like someone to randomly remove the if just because it happens 
to fix his bug.  Certainly I didn't want to imply any further change. :-)

>> Second, crossjumping is now more expensive.  Does it really buy much to
>> iterate it?  Something like
>>
>>    mode &= ~CLEANUP_CROSSJUMP;
>>
>> just before iterating may still leave it "good enough".
>
> A quick experiment shows that this causes many missed opportunities.
> (I placed it after the run_fast_dce call.)

Thanks.  Wishful thinking.

Paolo
Jeff Law July 29, 2010, 5:23 p.m. UTC | #38
On 07/27/10 15:35, Bernd Schmidt wrote:
> Thanks for looking at this, Jeff.
>
> On 07/27/2010 07:09 PM, Jeff Law wrote:
>> It seems to me implementing this optimization well requires insn
>> movement which is going to affect register lifetimes.   Furthermore,
>> this optimization is sitting inside a while (changed) style loop.  At
>> the least we need to mark blocks where the life data has become
>> inaccurate so that we don't mis-optimize based on inaccurate life data.
>> I haven't thought deeply about the problem, but it may well be the case
>> that as the cfgcleanup loop iterates new opportunities may be exposed
>> and thus it'd be useful to go ahead and update the life information.
> I'm fairly certain I've observed this to happen.  Note that other
> transformations done in cfgcleanup can make life information inaccurate
> before we even try to run head merging, making it impossible to do the
> analysis.
If that occurs, we will have set block_was_dirty and changed, which tells 
head merging to do nothing.  At the bottom of the loop we call 
df_analyze to get things up-to-date, then start the next iteration, which 
then gives head merging its chance at any blocks which were dirty on 
the previous iteration, right?

Presumably there are no blocks marked as dirty when cfgcleanup starts :-)

Jeff
Bernd Schmidt July 29, 2010, 5:28 p.m. UTC | #39
On 07/29/2010 07:23 PM, Jeff Law wrote:
>  On 07/27/10 15:35, Bernd Schmidt wrote:
>> Thanks for looking at this, Jeff.
>>
>> On 07/27/2010 07:09 PM, Jeff Law wrote:
>>> It seems to me implementing this optimization well requires insn
>>> movement which is going to affect register lifetimes.   Furthermore,
>>> this optimization is sitting inside a while (changed) style loop.  At
>>> the least we need to mark blocks where the life data has become
>>> inaccurate so that we don't mis-optimize based on inaccurate life data.
>>> I haven't thought deeply about the problem, but it may well be the case
>>> that as the cfgcleanup loop iterates new opportunities may be exposed
>>> and thus it'd be useful to go ahead and update the life information.
>> I'm fairly certain I've observed this to happen.  Note that other
>> transformations done in cfgcleanup can make life information inaccurate
>> before we even try to run head merging, making it impossible to do the
>> analysis.
> If that occurs, we will have set block_was_dirty and changed which tells
> head merging to do nothing.  At the bottom of the loop we call
> df_analyze to get things up-to-date, then start the next iteration which
> then allows head merging its chance at any blocks which were dirty on
> the previous iteration, right?

That's the plan.

> Presumably there are no blocks marked as dirty when cfgcleanup starts :-)

I guess there might be if the previous pass modifies things that affect
liveness.  The effect would be to postpone head-merging until the second
pass of try_cleanup_cfg.  Also, as I said other cfgcleanup
transformations can mark blocks dirty.  If there's a lot going on we'll
just have to iterate until nothing can be improved anymore.


Bernd
Steven Bosscher July 29, 2010, 10:39 p.m. UTC | #40
On Thu, Jul 29, 2010 at 6:19 PM, Paolo Bonzini <bonzini@gnu.org> wrote:

> Second, crossjumping is now more expensive.  Does it really buy much to
> iterate it?  Something like
>
>  mode &= ~CLEANUP_CROSSJUMP;
>
> just before iterating may still leave it "good enough".  Steven, do you
> remember anything?  This can be done separately after the patch goes
> in.

Iterating is often helpful. Crossjumping only merges single pairs of
basic blocks per iteration, but never across a control flow statement.
If you iterate, you usually find that the previous iteration exposed
further opportunities. And crossjumping is not very expensive anyway.

<plug>
I just hope someone picks up the patches of PR20070 for pre-reload
crossjumping; that's even more helpful than iterating.
</plug>

Ciao!
Steven

Patch

Index: ifcvt.c
===================================================================
--- ifcvt.c	(revision 162372)
+++ ifcvt.c	(working copy)
@@ -101,7 +101,6 @@  static int noce_find_if_block (basic_blo
 static int cond_exec_find_if_block (ce_if_block_t *);
 static int find_if_case_1 (basic_block, edge, edge);
 static int find_if_case_2 (basic_block, edge, edge);
-static int find_memory (rtx *, void *);
 static int dead_or_predicable (basic_block, basic_block, basic_block,
 			       basic_block, int);
 static void noce_emit_move_insn (rtx, rtx);
@@ -3877,15 +3876,6 @@  find_if_case_2 (basic_block test_bb, edg
   return TRUE;
 }
 
-/* A subroutine of dead_or_predicable called through for_each_rtx.
-   Return 1 if a memory is found.  */
-
-static int
-find_memory (rtx *px, void *data ATTRIBUTE_UNUSED)
-{
-  return MEM_P (*px);
-}
-
 /* Used by the code above to perform the actual rtl transformations.
    Return TRUE if successful.
 
@@ -3987,131 +3977,38 @@  dead_or_predicable (basic_block test_bb,
       earliest = jump;
     }
 #endif
+  /* If we allocated new pseudos (e.g. in the conditional move
+     expander called from noce_emit_cmove), we must resize the
+     array first.  */
+  if (max_regno < max_reg_num ())
+    max_regno = max_reg_num ();
+
   /* Try the NCE path if the CE path did not result in any changes.  */
   if (n_validated_changes == 0)
     {
+      rtx cond;
+      regset live;
+      bool success;
+
       /* In the non-conditional execution case, we have to verify that there
 	 are no trapping operations, no calls, no references to memory, and
 	 that any registers modified are dead at the branch site.  */
 
-      rtx insn, cond, prev;
-      bitmap merge_set, merge_set_noclobber, test_live, test_set;
-      unsigned i, fail = 0;
-      bitmap_iterator bi;
-
-      /* Check for no calls or trapping operations.  */
-      for (insn = head; ; insn = NEXT_INSN (insn))
-	{
-	  if (CALL_P (insn))
-	    return FALSE;
-	  if (NONDEBUG_INSN_P (insn))
-	    {
-	      if (may_trap_p (PATTERN (insn)))
-		return FALSE;
-
-	      /* ??? Even non-trapping memories such as stack frame
-		 references must be avoided.  For stores, we collect
-		 no lifetime info; for reads, we'd have to assert
-		 true_dependence false against every store in the
-		 TEST range.  */
-	      if (for_each_rtx (&PATTERN (insn), find_memory, NULL))
-		return FALSE;
-	    }
-	  if (insn == end)
-	    break;
-	}
-
-      if (! any_condjump_p (jump))
+      if (!any_condjump_p (jump))
 	return FALSE;
 
       /* Find the extent of the conditional.  */
       cond = noce_get_condition (jump, &earliest, false);
-      if (! cond)
+      if (!cond)
 	return FALSE;
 
-      /* Collect:
-	   MERGE_SET = set of registers set in MERGE_BB
-	   MERGE_SET_NOCLOBBER = like MERGE_SET, but only includes registers
-	     that are really set, not just clobbered.
-	   TEST_LIVE = set of registers live at EARLIEST
-	   TEST_SET = set of registers set between EARLIEST and the
-	     end of the block.  */
-
-      merge_set = BITMAP_ALLOC (&reg_obstack);
-      merge_set_noclobber = BITMAP_ALLOC (&reg_obstack);
-      test_live = BITMAP_ALLOC (&reg_obstack);
-      test_set = BITMAP_ALLOC (&reg_obstack);
-
-      /* ??? bb->local_set is only valid during calculate_global_regs_live,
-	 so we must recompute usage for MERGE_BB.  Not so bad, I suppose,
-         since we've already asserted that MERGE_BB is small.  */
-      /* If we allocated new pseudos (e.g. in the conditional move
-	 expander called from noce_emit_cmove), we must resize the
-	 array first.  */
-      if (max_regno < max_reg_num ())
-	max_regno = max_reg_num ();
-
-      FOR_BB_INSNS (merge_bb, insn)
-	{
-	  if (NONDEBUG_INSN_P (insn))
-	    {
-	      df_simulate_find_defs (insn, merge_set);
-	      df_simulate_find_noclobber_defs (insn, merge_set_noclobber);
-	    }
-	}
-
-      /* For small register class machines, don't lengthen lifetimes of
-	 hard registers before reload.  */
-      if (! reload_completed
-	  && targetm.small_register_classes_for_mode_p (VOIDmode))
-	{
-          EXECUTE_IF_SET_IN_BITMAP (merge_set_noclobber, 0, i, bi)
-	    {
-	      if (i < FIRST_PSEUDO_REGISTER
-		  && ! fixed_regs[i]
-		  && ! global_regs[i])
-		fail = 1;
-	    }
-	}
-
-      /* For TEST, we're interested in a range of insns, not a whole block.
-	 Moreover, we're interested in the insns live from OTHER_BB.  */
-
-      /* The loop below takes the set of live registers
-         after JUMP, and calculates the live set before EARLIEST. */
-      bitmap_copy (test_live, df_get_live_in (other_bb));
-      df_simulate_initialize_backwards (test_bb, test_live);
-      for (insn = jump; ; insn = prev)
-	{
-	  if (INSN_P (insn))
-	    {
-	      df_simulate_find_defs (insn, test_set);
-	      df_simulate_one_insn_backwards (test_bb, insn, test_live);
-	    }
-	  prev = PREV_INSN (insn);
-	  if (insn == earliest)
-	    break;
-	}
-
-      /* We can perform the transformation if
-	   MERGE_SET_NOCLOBBER & TEST_SET
-	 and
-	   MERGE_SET & TEST_LIVE)
-	 and
-	   TEST_SET & DF_LIVE_IN (merge_bb)
-	 are empty.  */
-
-      if (bitmap_intersect_p (test_set, merge_set_noclobber)
-	  || bitmap_intersect_p (test_live, merge_set)
-	  || bitmap_intersect_p (test_set, df_get_live_in (merge_bb)))
-	fail = 1;
-
-      BITMAP_FREE (merge_set_noclobber);
-      BITMAP_FREE (merge_set);
-      BITMAP_FREE (test_live);
-      BITMAP_FREE (test_set);
-
-      if (fail)
+      live = BITMAP_ALLOC (&reg_obstack);
+      simulate_backwards_to_point (merge_bb, live, end);
+      success = can_move_insns_across (head, end, earliest, jump,
+				       merge_bb, live,
+				       df_get_live_in (other_bb), NULL);
+      BITMAP_FREE (live);
+      if (!success)
 	return FALSE;
     }
 
Index: df.h
===================================================================
--- df.h	(revision 162372)
+++ df.h	(working copy)
@@ -992,7 +992,9 @@  extern void df_simulate_one_insn_backwar
 extern void df_simulate_finalize_backwards (basic_block, bitmap);
 extern void df_simulate_initialize_forwards (basic_block, bitmap);
 extern void df_simulate_one_insn_forwards (basic_block, rtx, bitmap);
-
+extern void simulate_backwards_to_point (basic_block, regset, rtx);
+extern bool can_move_insns_across (rtx, rtx, rtx, rtx, basic_block, regset,
+				   regset, rtx *);
 /* Functions defined in df-scan.c.  */
 
 extern void df_scan_alloc (bitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c	(revision 162372)
+++ cfgcleanup.c	(working copy)
@@ -66,6 +66,10 @@  static bool first_pass;
 /* Set to true if crossjumps occured in the latest run of try_optimize_cfg.  */
 static bool crossjumps_occured;
 
+/* Set to true if we couldn't run an optimization due to stale liveness
+   information; we should run df_analyze to enable more opportunities.  */
+static bool block_was_dirty;
+
 static bool try_crossjump_to_edge (int, edge, edge);
 static bool try_crossjump_bb (int, basic_block);
 static bool outgoing_edges_match (int, basic_block, basic_block);
@@ -1927,6 +1931,261 @@  try_crossjump_bb (int mode, basic_block 
   return changed;
 }
 
+/* Search the successors of BB for common insn sequences.  When found,
+   share code between them by moving it across the basic block
+   boundary.  Return true if any changes made.  */
+
+static bool
+try_head_merge_bb (basic_block bb)
+{
+  basic_block final_dest_bb = NULL;
+  int max_match = INT_MAX;
+  edge e0;
+  rtx *headptr, *currptr;
+  bool changed, moveall;
+  unsigned ix;
+  rtx e0_last_head, cond, move_before;
+  unsigned nedges = EDGE_COUNT (bb->succs);
+  rtx jump = BB_END (bb);
+  regset live, live_union;
+
+  /* Nothing to do if there are not at least two outgoing edges.  */
+  if (nedges < 2)
+    return false;
+
+  /* Don't crossjump if this block ends in a computed jump,
+     unless we are optimizing for size.  */
+  if (optimize_bb_for_size_p (bb)
+      && bb != EXIT_BLOCK_PTR
+      && computed_jump_p (BB_END (bb)))
+    return false;
+
+  cond = get_condition (jump, &move_before, true, false);
+  if (cond == NULL_RTX)
+    move_before = jump;
+
+  for (ix = 0; ix < nedges; ix++)
+    {
+      edge e = EDGE_SUCC (bb, ix);
+      basic_block other_bb = e->dest;
+
+      if (df_get_bb_dirty (other_bb))
+	{
+	  block_was_dirty = true;
+	  return false;
+	}
+
+      if (e->flags & EDGE_ABNORMAL)
+	return false;
+
+      /* Normally, all destination blocks must only be reachable from this
+	 block, i.e. they must have one incoming edge.
+
+	 There is one special case we can handle, that of multiple consecutive
+	 jumps where the first jumps to one of the targets of the second jump.
+	 This happens frequently in switch statements for default labels.
+	 The structure is as follows:
+	 FINAL_DEST_BB
+	 ....
+	 if (cond) jump A;
+	 fall through
+	 BB
+	 jump with targets A, B, C, D...
+	 A
+	 has two incoming edges, from FINAL_DEST_BB and BB
+
+	 In this case, we can try to move the insns through BB and into
+	 FINAL_DEST_BB.  */
+      if (EDGE_COUNT (other_bb->preds) != 1)
+	{
+	  edge incoming_edge, incoming_bb_other_edge;
+	  edge_iterator ei;
+
+	  if (final_dest_bb != NULL
+	      || EDGE_COUNT (other_bb->preds) != 2)
+	    return false;
+
+	  /* We must be able to move the insns across the whole block.  */
+	  move_before = BB_HEAD (bb);
+	  while (!NONDEBUG_INSN_P (move_before))
+	    move_before = NEXT_INSN (move_before);
+
+	  FOR_EACH_EDGE (incoming_edge, ei, bb->preds)
+	    if (incoming_edge->dest == bb)
+	      break;
+	  final_dest_bb = incoming_edge->src;
+	  if (EDGE_COUNT (final_dest_bb->succs) != 2)
+	    return false;
+	  FOR_EACH_EDGE (incoming_bb_other_edge, ei, final_dest_bb->succs)
+	    if (incoming_bb_other_edge != incoming_edge)
+	      break;
+	  if (incoming_bb_other_edge->dest != other_bb)
+	    return false;
+	}
+    }
+
+  e0 = EDGE_SUCC (bb, 0);
+  e0_last_head = NULL_RTX;
+  changed = false;
+
+  for (ix = 1; ix < nedges; ix++)
+    {
+      edge e = EDGE_SUCC (bb, ix);
+      rtx e0_last, e_last;
+      int nmatch;
+
+      nmatch = flow_find_head_matching_sequence (e0->dest, e->dest,
+						 &e0_last, &e_last, 0);
+      if (nmatch == 0)
+	return false;
+
+      if (nmatch < max_match)
+	{
+	  max_match = nmatch;
+	  e0_last_head = e0_last;
+	}
+    }
+
+  /* If we matched an entire block, we probably have to avoid moving the
+     last insn.  */
+  if (max_match > 0
+      && e0_last_head == BB_END (e0->dest)
+      && (find_reg_note (e0_last_head, REG_EH_REGION, 0)
+	  || control_flow_insn_p (e0_last_head)))
+    {
+      max_match--;
+      if (max_match == 0)
+	return false;
+      do
+	e0_last_head = prev_real_insn (e0_last_head);
+      while (DEBUG_INSN_P (e0_last_head));
+    }
+
+  if (max_match == 0)
+    return false;
+
+  /* We must find a union of the live registers at each of the end points.  */
+  live = BITMAP_ALLOC (NULL);
+  live_union = BITMAP_ALLOC (NULL);
+
+  currptr = XNEWVEC (rtx, nedges);
+  headptr = XNEWVEC (rtx, nedges);
+
+  for (ix = 0; ix < nedges; ix++)
+    {
+      int j;
+      basic_block merge_bb = EDGE_SUCC (bb, ix)->dest;
+      rtx head = BB_HEAD (merge_bb);
+
+      while (!NONDEBUG_INSN_P (head))
+	head = NEXT_INSN (head);
+      headptr[ix] = head;
+      currptr[ix] = head;
+
+      /* Compute the end point and live information.  */
+      for (j = 1; j < max_match; j++)
+	do
+	  head = NEXT_INSN (head);
+	while (!NONDEBUG_INSN_P (head));
+      simulate_backwards_to_point (merge_bb, live, head);
+      IOR_REG_SET (live_union, live);
+    }
+
+  /* If we're moving across two blocks, verify the validity of the
+     first move, then adjust the target and let the loop below deal
+     with the final move.  */
+  if (final_dest_bb != NULL)
+    {
+      rtx move_upto;
+
+      moveall = can_move_insns_across (currptr[0], e0_last_head, move_before,
+				       jump, e0->dest, live_union,
+				       NULL, &move_upto);
+      if (!moveall)
+	e0_last_head = move_upto;
+      if (e0_last_head == NULL_RTX)
+	goto out;
+
+      jump = BB_END (final_dest_bb);
+      cond = get_condition (jump, &move_before, true, false);
+      if (cond == NULL_RTX)
+	move_before = jump;
+    }
+
+  do
+    {
+      rtx move_upto;
+      moveall = can_move_insns_across (currptr[0], e0_last_head,
+				       move_before, jump, e0->dest, live_union,
+				       NULL, &move_upto);
+      if (!moveall && move_upto == NULL_RTX)
+	{
+	  if (jump == move_before)
+	    break;
+
+	  /* Try again, using a different insertion point.  */
+	  move_before = jump;
+	  continue;
+	}
+
+      if (final_dest_bb && !moveall)
+	/* We haven't checked whether a partial move would be OK for the first
+	   move, so we have to fail this case.  */
+	break;
+
+      changed = true;
+      for (;;)
+	{
+	  if (currptr[0] == move_upto)
+	    break;
+	  for (ix = 0; ix < nedges; ix++)
+	    {
+	      rtx curr = currptr[ix];
+	      do
+		curr = NEXT_INSN (curr);
+	      while (!NONDEBUG_INSN_P (curr));
+	      currptr[ix] = curr;
+	    }
+	}
+
+      reorder_insns (headptr[0], currptr[0], PREV_INSN (move_before));
+      df_set_bb_dirty (EDGE_SUCC (bb, 0)->dest);
+      if (final_dest_bb != NULL)
+	df_set_bb_dirty (final_dest_bb);
+      df_set_bb_dirty (bb);
+      for (ix = 1; ix < nedges; ix++)
+	{
+	  df_set_bb_dirty (EDGE_SUCC (bb, ix)->dest);
+	  delete_insn_chain (headptr[ix], currptr[ix], false);
+	}
+      if (!moveall)
+	{
+	  if (jump == move_before)
+	    break;
+
+	  /* Try again, using a different insertion point.  */
+	  move_before = jump;
+	  for (ix = 0; ix < nedges; ix++)
+	    {
+	      rtx curr = currptr[ix];
+	      do
+		curr = NEXT_INSN (curr);
+	      while (!NONDEBUG_INSN_P (curr));
+	      currptr[ix] = headptr[ix] = curr;
+	    }
+	}
+    }
+  while (!moveall);
+
+ out:
+  free (currptr);
+  free (headptr);
+  BITMAP_FREE (live);
+  BITMAP_FREE (live_union);
+
+  crossjumps_occured |= changed;
+
+  return changed;
+}
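+
+/* An illustrative sketch (source-level, not the RTL the pass really
+   operates on): head merging rewrites
+
+     if (cond) { r = 120; foo1 (r); } else { r = 120; foo2 (r); }
+
+   into
+
+     r = 120;
+     if (cond) foo1 (r); else foo2 (r);
+
+   i.e. the matching head of every successor block is emitted once,
+   before the branch.  */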
+
 /* Return true if BB contains just bb note, or bb note followed
    by only DEBUG_INSNs.  */
 
@@ -1972,6 +2231,7 @@  try_optimize_cfg (int mode)
 	 one predecessor, they may be combined.  */
       do
 	{
+	  block_was_dirty = false;
 	  changed = false;
 	  iterations++;
 
@@ -2170,6 +2430,13 @@  try_optimize_cfg (int mode)
 		  && try_crossjump_bb (mode, b))
 		changed_here = true;
 
+	      if ((mode & CLEANUP_CROSSJUMP)
+		  /* This can lengthen register lifetimes.  Do it only after
+		     reload.  */
+		  && reload_completed
+		  && try_head_merge_bb (b))
+		changed_here = true;
+
 	      /* Don't get confused by the index shift caused by
 		 deleting blocks.  */
 	      if (!changed_here)
@@ -2182,6 +2449,9 @@  try_optimize_cfg (int mode)
 	      && try_crossjump_bb (mode, EXIT_BLOCK_PTR))
 	    changed = true;
 
+	  if (block_was_dirty)
+	    df_analyze ();
+
 #ifdef ENABLE_CHECKING
 	  if (changed)
 	    verify_flow_info ();
@@ -2366,8 +2636,7 @@  cleanup_cfg (int mode)
 	  if ((mode & CLEANUP_EXPENSIVE) && !reload_completed
 	      && !delete_trivially_dead_insns (get_insns (), max_reg_num ()))
 	    break;
-	  else if ((mode & CLEANUP_CROSSJUMP)
-		   && crossjumps_occured)
+	  if ((mode & CLEANUP_CROSSJUMP) && crossjumps_occured)
 	    run_fast_dce ();
 	}
       else
Index: df-problems.c
===================================================================
--- df-problems.c	(revision 162372)
+++ df-problems.c	(working copy)
@@ -39,6 +39,7 @@  along with GCC; see the file COPYING3.  
 #include "basic-block.h"
 #include "sbitmap.h"
 #include "bitmap.h"
+#include "target.h"
 #include "timevar.h"
 #include "df.h"
 #include "except.h"
@@ -3804,6 +3805,27 @@  df_simulate_find_defs (rtx insn, bitmap 
     }
 }
 
+/* Find the set of uses for INSN.  This includes partial and conditional
+   defs, since they read the register's previous value as well as
+   writing it.  */
+
+static void
+df_simulate_find_uses (rtx insn, bitmap uses)
+{
+  df_ref *rec;
+  unsigned int uid = INSN_UID (insn);
+
+  for (rec = DF_INSN_UID_DEFS (uid); *rec; rec++)
+    {
+      df_ref def = *rec;
+      if (DF_REF_FLAGS (def) & (DF_REF_PARTIAL | DF_REF_CONDITIONAL))
+	bitmap_set_bit (uses, DF_REF_REGNO (def));
+    }
+  for (rec = DF_INSN_UID_USES (uid); *rec; rec++)
+    {
+      df_ref use = *rec;
+      bitmap_set_bit (uses, DF_REF_REGNO (use));
+    }
+}
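+
+/* For instance (illustrative RTL only), a partial def such as
+
+     (set (strict_low_part (subreg:QI (reg:SI 1) 0)) (reg:QI 2))
+
+   leaves the upper bytes of reg 1 unchanged, so the insn reads the
+   register as well as writing it; df marks such defs DF_REF_PARTIAL,
+   and we record them in USES above.  */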
+
 /* Find the set of real DEFs, which are not clobbers, for INSN.  */
 
 void
@@ -4031,7 +4053,297 @@  df_simulate_one_insn_forwards (basic_blo
     }
   df_simulate_fixup_sets (bb, live);
 }
+
+/* Used by the next two functions to encode information about the
+   memory references we found.  */
+#define MEMREF_NORMAL 1
+#define MEMREF_VOLATILE 2
+
+/* A subroutine of can_move_insns_across called through for_each_rtx.
+   Return either MEMREF_NORMAL or MEMREF_VOLATILE if a memory reference
+   is found.  */
+
+static int
+find_memory (rtx *px, void *data ATTRIBUTE_UNUSED)
+{
+  rtx x = *px;
+  if (!MEM_P (x))
+    return 0;
+  if (MEM_VOLATILE_P (x))
+    return MEMREF_VOLATILE;
+  if (MEM_READONLY_P (x))
+    return 0;
+
+  return MEMREF_NORMAL;
+}
+
+/* A subroutine of can_move_insns_across called through note_stores.
+   DATA points to an integer in which we set either the bit for
+   MEMREF_NORMAL or the bit for MEMREF_VOLATILE if we find a MEM
+   of either kind.  */
+
+static void
+find_memory_stores (rtx x, const_rtx pat ATTRIBUTE_UNUSED,
+		    void *data)
+{
+  int *pflags = (int *)data;
+  if (GET_CODE (x) == SUBREG)
+    x = SUBREG_REG (x);
+  /* Treat stores to SP as stores to memory; this prevents problems
+     when there are references to the stack frame.  */
+  if (x == stack_pointer_rtx)
+    *pflags |= MEMREF_VOLATILE;
+  if (!MEM_P (x))
+    return;
+  *pflags |= MEM_VOLATILE_P (x) ? MEMREF_VOLATILE : MEMREF_NORMAL;
+}
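+
+/* As a hypothetical example, for an insn whose pattern is
+
+     (set (mem:SI (reg:SI 0)) (reg:SI 1))
+
+   note_stores calls find_memory_stores on the destination MEM and sets
+   MEMREF_NORMAL in *DATA; a volatile MEM, or a SET of the stack
+   pointer, would set MEMREF_VOLATILE instead.  */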
+
+/* Scan BB backwards, using df_simulate functions to keep track of
+   register lifetimes, stopping just above insn POINT.  On return, LIVE
+   holds the set of registers live after POINT.  */
+
+void
+simulate_backwards_to_point (basic_block bb, regset live, rtx point)
+{
+  rtx insn;
+  bitmap_copy (live, df_get_live_out (bb));
+  df_simulate_initialize_backwards (bb, live);
+
+  /* Scan and update life information until we reach the point we're
+     interested in.  */
+  for (insn = BB_END (bb); insn != point; insn = PREV_INSN (insn))
+    df_simulate_one_insn_backwards (bb, insn, live);
+}
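+
+/* A usage sketch, assuming the df live sets for BB are up to date:
+
+     regset live = BITMAP_ALLOC (NULL);
+     simulate_backwards_to_point (bb, live, point);
+     ... LIVE now holds the registers live after POINT ...
+     BITMAP_FREE (live);
+
+   try_head_merge_bb uses this pattern to compute the live set at the
+   end of each matched head sequence.  */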
+
+/* Return true if it is safe to move a group of insns, described by
+   the range FROM to TO, backwards across another group of insns,
+   described by ACROSS_FROM to ACROSS_TO.  It is assumed that there
+   are no insns between ACROSS_TO and FROM, but they may be in
+   different basic blocks; MERGE_BB is the block from which the
+   insns will be moved.  The caller must pass in a regset MERGE_LIVE
+   which specifies the registers live after TO.
+
+   This function may be called in one of two cases: either we try to
+   move identical instructions from all successor blocks into their
+   predecessor, or we try to move from only one successor block.  If
+   OTHER_BRANCH_LIVE is nonnull, it indicates that we're dealing with
+   the second case.  It should contain a set of registers live at the
+   end of ACROSS_TO which must not be clobbered by moving the insns.
+   In that case, we're also more careful about moving memory references
+   and trapping insns.
+
+   We return false if it is not safe to move the entire group, but it
+   may still be possible to move a subgroup.  PMOVE_UPTO, if nonnull,
+   is set to point at the last moveable insn in such a case.  */
+
+bool
+can_move_insns_across (rtx from, rtx to, rtx across_from, rtx across_to,
+		       basic_block merge_bb, regset merge_live,
+		       regset other_branch_live, rtx *pmove_upto)
+{
+  rtx insn, next, max_to;
+  bitmap merge_set, merge_use, local_merge_live;
+  bitmap test_set, test_use;
+  unsigned i, fail = 0;
+  bitmap_iterator bi;
+  int memrefs_in_across = 0;
+  int mem_sets_in_across = 0;
+  bool trapping_insns_in_across = false;
+
+  if (pmove_upto != NULL)
+    *pmove_upto = NULL_RTX;
+
+  /* Find real bounds, ignoring debug insns.  */
+  while (!NONDEBUG_INSN_P (from) && from != to)
+    from = NEXT_INSN (from);
+  while (!NONDEBUG_INSN_P (to) && from != to)
+    to = PREV_INSN (to);
+
+  for (insn = across_to; ; insn = next)
+    {
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  memrefs_in_across |= for_each_rtx (&PATTERN (insn), find_memory,
+					     NULL);
+	  note_stores (PATTERN (insn), find_memory_stores,
+		       &mem_sets_in_across);
+	  /* This is used just to find sets of the stack pointer.  */
+	  memrefs_in_across |= mem_sets_in_across;
+	  trapping_insns_in_across |= may_trap_p (PATTERN (insn));
+	}
+      next = PREV_INSN (insn);
+      if (insn == across_from)
+	break;
+    }
+
+  /* Collect:
+     MERGE_SET = set of registers set in MERGE_BB
+     MERGE_USE = set of registers used in MERGE_BB and live at its top
+     MERGE_LIVE = set of registers live at the point inside the MERGE
+     range that we've reached during scanning
+     TEST_SET = set of registers set between ACROSS_FROM and ACROSS_TO
+     TEST_USE = set of registers used between ACROSS_FROM and ACROSS_TO,
+     and live before ACROSS_FROM.  */
+
+  merge_set = BITMAP_ALLOC (&reg_obstack);
+  merge_use = BITMAP_ALLOC (&reg_obstack);
+  local_merge_live = BITMAP_ALLOC (&reg_obstack);
+  test_set = BITMAP_ALLOC (&reg_obstack);
+  test_use = BITMAP_ALLOC (&reg_obstack);
+
+  /* Compute the set of registers set and used in the ACROSS range.  */
+  if (other_branch_live != NULL)
+    bitmap_copy (test_use, other_branch_live);
+  df_simulate_initialize_backwards (merge_bb, test_use);
+  for (insn = across_to; ; insn = next)
+    {
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  df_simulate_find_defs (insn, test_set);
+	  df_simulate_defs (insn, test_use);
+	  df_simulate_uses (insn, test_use);
+	}
+      next = PREV_INSN (insn);
+      if (insn == across_from)
+	break;
+    }
+
+  /* Compute an upper bound for the number of insns moved, by finding
+     the first insn in MERGE that sets a register in TEST_USE, or uses
+     a register in TEST_SET.  We also check for calls, trapping operations,
+     and memory references.  */
+  max_to = NULL_RTX;
+  for (insn = from; ; insn = next)
+    {
+      if (CALL_P (insn))
+	break;
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  if (may_trap_p (PATTERN (insn))
+	      && (trapping_insns_in_across || other_branch_live != NULL))
+	    break;
+
+	  /* We cannot move memory stores past each other, or move memory
+	     reads past stores, at least not without tracking them and
+	     calling true_dependence on every pair.
+
+	     If there is no other branch and no memory references or
+	     sets in the ACROSS range, we can move memory references
+	     freely, even volatile ones.
+
+	     Otherwise, the rules are as follows: volatile memory
+	     references and stores can't be moved at all, and any type
+	     of memory reference can't be moved if there are volatile
+	     accesses or stores in the ACROSS range.  That leaves
+	     normal reads, which can be moved, as the trapping case is
+	     dealt with elsewhere.  */
+	  if (other_branch_live != NULL || memrefs_in_across != 0)
+	    {
+	      int mem_ref_flags = 0;
+	      int mem_set_flags = 0;
+	      note_stores (PATTERN (insn), find_memory_stores, &mem_set_flags);
+	      mem_ref_flags = for_each_rtx (&PATTERN (insn), find_memory,
+					    NULL);
+	      /* Catch sets of the stack pointer.  */
+	      mem_ref_flags |= mem_set_flags;
+
+	      if ((mem_ref_flags | mem_set_flags) & MEMREF_VOLATILE)
+		break;
+	      if ((memrefs_in_across & MEMREF_VOLATILE) && mem_ref_flags != 0)
+		break;
+	      if (mem_set_flags != 0
+		  || (mem_sets_in_across != 0 && mem_ref_flags != 0))
+		break;
+	    }
+	  df_simulate_find_uses (insn, merge_use);
+	  /* We're only interested in uses which use a value live at
+	     the top, not one previously set in this block.  */
+	  bitmap_and_compl_into (merge_use, merge_set);
+	  df_simulate_find_defs (insn, merge_set);
+	  if (bitmap_intersect_p (merge_set, test_use)
+	      || bitmap_intersect_p (merge_use, test_set))
+	    break;
+	  max_to = insn;
+	}
+      next = NEXT_INSN (insn);
+      if (insn == to)
+	break;
+    }
+  if (max_to != to)
+    fail = 1;
+
+  if (max_to == NULL_RTX || (fail && pmove_upto == NULL))
+    goto out;
+
+  /* Now, lower this upper bound by also taking into account that
+     a range of insns moved across ACROSS must not leave a register
+     live at the end that will be clobbered in ACROSS.  We need to
+     find a point where TEST_SET & LIVE == 0.
+
+     Insns in the MERGE range that set registers which are also set
+     in the ACROSS range may still be moved as long as we also move
+     later insns which use the results of the set, and make the
+     register dead again.  This is verified by the condition stated
+     above.  We only need to test it for registers that are set in
+     the moved region.
+
+     MERGE_LIVE is provided by the caller and holds live registers after
+     TO.  */
+  bitmap_copy (local_merge_live, merge_live);
+  for (insn = to; insn != max_to; insn = PREV_INSN (insn))
+    df_simulate_one_insn_backwards (merge_bb, insn, local_merge_live);
+
+  /* We're not interested in registers that aren't set in the moved
+     region at all.  */
+  bitmap_and_into (local_merge_live, merge_set);
+  for (;;)
+    {
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  if (!bitmap_intersect_p (test_set, local_merge_live))
+	    {
+	      max_to = insn;
+	      break;
+	    }
+
+	  df_simulate_one_insn_backwards (merge_bb, insn,
+					  local_merge_live);
+	}
+      if (insn == from)
+	{
+	  fail = 1;
+	  goto out;
+	}
+      insn = PREV_INSN (insn);
+    }
 
+  if (max_to != to)
+    fail = 1;
+
+  if (pmove_upto)
+    *pmove_upto = max_to;
+
+  /* For small register class machines, don't lengthen lifetimes of
+     hard registers before reload.  */
+  if (! reload_completed
+      && targetm.small_register_classes_for_mode_p (VOIDmode))
+    {
+      EXECUTE_IF_SET_IN_BITMAP (merge_set, 0, i, bi)
+	{
+	  if (i < FIRST_PSEUDO_REGISTER
+	      && ! fixed_regs[i]
+	      && ! global_regs[i])
+	    fail = 1;
+	}
+    }
+
+ out:
+  BITMAP_FREE (merge_set);
+  BITMAP_FREE (merge_use);
+  BITMAP_FREE (local_merge_live);
+  BITMAP_FREE (test_set);
+  BITMAP_FREE (test_use);
+
+  return !fail;
+}
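+
+/* A minimal caller sketch (hypothetical variables), hoisting the insns
+   FROM..TO in MERGE_BB above the range ACROSS_FROM..ACROSS_TO:
+
+     rtx move_upto;
+     if (can_move_insns_across (from, to, across_from, across_to,
+                                merge_bb, merge_live, NULL, &move_upto))
+       reorder_insns (from, to, PREV_INSN (across_from));
+     else if (move_upto != NULL_RTX)
+       ... only FROM..MOVE_UPTO is safe to move ...  */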
 
 
 /*----------------------------------------------------------------------------
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 162372)
+++ Makefile.in	(working copy)
@@ -3154,7 +3154,7 @@  df-core.o : df-core.c $(CONFIG_H) $(SYST
 df-problems.o : df-problems.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
    $(RTL_H) insn-config.h $(RECOG_H) $(FUNCTION_H) $(REGS_H) alloc-pool.h \
    hard-reg-set.h $(BASIC_BLOCK_H) $(DF_H) $(BITMAP_H) sbitmap.h $(TIMEVAR_H) \
-   $(TM_P_H) $(FLAGS_H) output.h $(EXCEPT_H) dce.h vecprim.h
+   $(TM_P_H) $(TARGET_H) $(FLAGS_H) output.h $(EXCEPT_H) dce.h vecprim.h
 df-scan.o : df-scan.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    insn-config.h $(RECOG_H) $(FUNCTION_H) $(REGS_H) alloc-pool.h \
    hard-reg-set.h $(BASIC_BLOCK_H) $(DF_H) $(BITMAP_H) sbitmap.h $(TIMEVAR_H) \
Index: testsuite/gcc.target/arm/headmerge-1.c
===================================================================
--- testsuite/gcc.target/arm/headmerge-1.c	(revision 0)
+++ testsuite/gcc.target/arm/headmerge-1.c	(revision 0)
@@ -0,0 +1,14 @@ 
+/* { dg-do compile }  */
+/* { dg-options "-O2" }  */
+/* { dg-final { scan-assembler-times "#120" 1 } } */
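+
+/* Both arms pass the constant 120; head merging should hoist the single
+   constant load above the conditional branch, so the literal appears
+   only once in the assembler output.  */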
+
+extern void foo1 (int);
+extern void foo2 (int);
+
+void t (int x, int y)
+{
+  if (y < 5)
+    foo1 (120);
+  else
+    foo2 (120);
+}
Index: testsuite/gcc.target/arm/headmerge-2.c
===================================================================
--- testsuite/gcc.target/arm/headmerge-2.c	(revision 0)
+++ testsuite/gcc.target/arm/headmerge-2.c	(revision 0)
@@ -0,0 +1,35 @@ 
+/* { dg-do compile }  */
+/* { dg-options "-O2" }  */
+/* { dg-final { scan-assembler-times "120" 1 } } */
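+
+/* Every arm of the switch passes 120; this is intended to exercise the
+   two-block case of head merging, where the common constant load is
+   moved across the switch jump into the preceding block.  */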
+
+extern void foo1 (int);
+extern void foo2 (int);
+extern void foo3 (int);
+extern void foo4 (int);
+extern void foo5 (int);
+extern void foo6 (int);
+
+void t (int x, int y)
+{
+  switch (y)
+    {
+    case 1:
+      foo1 (120);
+      break;
+    case 5:
+      foo2 (120);
+      break;
+    case 7:
+      foo3 (120);
+      break;
+    case 10:
+      foo4 (120);
+      break;
+    case 13:
+      foo5 (120);
+      break;
+    default:
+      foo6 (120);
+      break;
+    }
+}