From patchwork Sun Jul 17 18:33:01 2011
X-Patchwork-Submitter: Tom de Vries
X-Patchwork-Id: 105080
Message-ID: <4E232ADD.3020803@codesourcery.com>
Date: Sun, 17 Jul 2011 20:33:01 +0200
From: Tom de Vries
To: Richard Guenther
CC: Steven Bosscher, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
References: <4DEF4408.4040001@codesourcery.com> <4DF24C2C.7080808@codesourcery.com> <4E1C3A19.8060706@codesourcery.com>
In-Reply-To:

On 07/12/2011 04:07 PM, Richard Guenther wrote:
> On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries wrote:
>> Hi Richard,
>>
>> here's a new version of the pass. I attempted to address your comments as
>> much as possible. The pass was bootstrapped and reg-tested on x86_64.
>>
>> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries wrote:
>>>> Hi Richard,
>>>>
>>>> thanks for the review.
>>>>
>>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the
>>>>>> following table (%, lower is better).
>>>>>>
>>>>>>                 none            pic
>>>>>>             thumb1  thumb2  thumb1  thumb2
>>>>>> spec2000     99.9    99.9    99.8    99.8
>>>>>>
>>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure
>>>>>> that the optimizations proposed in PR20070 would fix this PR.
>>>>>>
>>>>>> The problem in this PR is that when compiling with -O2, the example below
>>>>>> should only have one call to free. The original problem is formulated in
>>>>>> terms of -Os, but currently we generate one call to free with -Os,
>>>>>> although still not the smallest code possible. I'll show here the -O2
>>>>>> case, since that's similar to the original PR.
>>>>
>>>> Example A. (naming it for reference below)
>>>>
>>>>>> #include <stdio.h>
>>>>>> void foo (char*, FILE*);
>>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>>> {
>>>>>>   char fileName[1000];
>>>>>>   FILE *fp;
>>>>>>   sprintf(fileName, outputFileName);
>>>>>>   if (access(fileName, 1) == 0) {
>>>>>>     free(ctx);
>>>>>>     return 0;
>>>>>>   }
>>>>>>
>>>>>>   fp = fopen(fileName, 0);
>>>>>>   if (fp == 0) {
>>>>>>     free(ctx);
>>>>>>     return 0;
>>>>>>   }
>>>>>>
>>>>>>   foo(outputFileName, fp);
>>>>>>
>>>>>>   return ctx;
>>>>>> }
>>>>>>
>>>>>> AFAIU, there are 2 complementary methods of rtl optimization proposed in
>>>>>> PR20070:
>>>>>> - Merging 2 blocks which are identical except for input registers, by
>>>>>>   using a conditional move to choose between the different input
>>>>>>   registers.
>>>>>> - Merging 2 blocks which have different local registers, by ignoring
>>>>>>   those differences.
>>>>>>
>>>>>> Blocks .L6 and .L7 have no difference in local registers, but they have a
>>>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>>>> conditional move would probably be beneficial in terms of size, but it's
>>>>>> not clear what condition the conditional move should be using.
>>>>>> Calculating such a condition would add size and lengthen the execution
>>>>>> path.
>>>>>>
>>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>>> ...
>>>>>>         push    {r4, r5, lr}
>>>>>>         mov     r4, r0
>>>>>>         sub     sp, sp, #1004
>>>>>>         mov     r5, r1
>>>>>>         mov     r0, sp
>>>>>>         mov     r1, r4
>>>>>>         bl      sprintf
>>>>>>         mov     r0, sp
>>>>>>         movs    r1, #1
>>>>>>         bl      access
>>>>>>         mov     r3, r0
>>>>>>         cbz     r0, .L6
>>>>>>         movs    r1, #0
>>>>>>         mov     r0, sp
>>>>>>         bl      fopen
>>>>>>         mov     r1, r0
>>>>>>         cbz     r0, .L7
>>>>>>         mov     r0, r4
>>>>>>         bl      foo
>>>>>> .L3:
>>>>>>         mov     r0, r5
>>>>>>         add     sp, sp, #1004
>>>>>>         pop     {r4, r5, pc}
>>>>>> .L6:
>>>>>>         mov     r0, r5
>>>>>>         mov     r5, r3
>>>>>>         bl      free
>>>>>>         b       .L3
>>>>>> .L7:
>>>>>>         mov     r0, r5
>>>>>>         mov     r5, r1
>>>>>>         bl      free
>>>>>>         b       .L3
>>>>>> ...
>>>>>>
>>>>>> The proposed patch solves the problem by dealing with the 2 blocks at a
>>>>>> level where they are still identical: at gimple level. It detects that
>>>>>> the 2 blocks are identical, and removes one of them.
>>>>>>
>>>>>> The following table shows the impact of the patch on the example in terms
>>>>>> of size for -march=armv7-a:
>>>>>>
>>>>>>             without  with  delta
>>>>>> Os       :      108   104     -4
>>>>>> O2       :      120   104    -16
>>>>>> Os thumb :       68    64     -4
>>>>>> O2 thumb :       76    64    -12
>>>>>>
>>>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>>>> replacement of 2 moves by a constant set, which also shortens the
>>>>>> execution path. The patch ensures optimal code for both -O2 and -Os.
>>>>>>
>>>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore
>>>>>> those differences in comparison. Without this feature, we would only
>>>>>> match blocks with resultless operations, due to the SSA nature of gimple.
>>>>>> For example, with this feature, we reduce the following function to its
>>>>>> minimum at gimple level, rather than at rtl level.
>>>>
>>>> Example B. (naming it for reference below)
>>>>
>>>>>> int f(int c, int b, int d)
>>>>>> {
>>>>>>   int r, e;
>>>>>>
>>>>>>   if (c)
>>>>>>     r = b + d;
>>>>>>   else
>>>>>>     {
>>>>>>       e = b + d;
>>>>>>       r = e;
>>>>>>     }
>>>>>>
>>>>>>   return r;
>>>>>> }
>>>>>>
>>>>>> ;; Function f (f)
>>>>>>
>>>>>> f (int c, int b, int d)
>>>>>> {
>>>>>>   int e;
>>>>>>
>>>>>> <bb 2>:
>>>>>>   e_6 = b_3(D) + d_4(D);
>>>>>>   return e_6;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> I'll send the patch with the testcases in a separate email.
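(For intuition, an editorial illustration, not output of the pass: merging the
two identical { free (ctx); return 0; } blocks of example A corresponds at
source level to something like the function below. The name
hprofStartupp_merged is made up for the illustration.)

  #include <stdio.h>
  void foo (char*, FILE*);
  char *
  hprofStartupp_merged (char *outputFileName, char *ctx)
  {
    char fileName[1000];
    FILE *fp = 0;
    sprintf (fileName, outputFileName);
    /* Both failure paths now share a single free/return block; the
       short-circuit || preserves the original control flow.  */
    if (access (fileName, 1) == 0
        || (fp = fopen (fileName, 0)) == 0)
      {
        free (ctx);
        return 0;
      }
    foo (outputFileName, fp);
    return ctx;
  }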
>>>>>> >>>>>> OK for trunk? >>>>> >>>>> I don't like that you hook this into cleanup_tree_cfg - that is called >>>>> _way_ too often. >>>>> >>>> >>>> Here is a reworked patch that addresses several concerns, particularly the >>>> compile time overhead. >>>> >>>> Changes: >>>> - The optimization is now in a separate file. >>>> - The optimization is now a pass rather than a cleanup. That allowed me to >>>> remove the test for pass-local flags. >>>> New is the pass driver tail_merge_optimize, based on >>>> tree-cfgcleanup.c:cleanup_tree_cfg_1. >>>> - The pass is run once, on SSA. Before, the patch would >>>> fix example A only before SSA and example B only on SSA. >>>> In order to fix example A on SSA, I added these changes: >>>> - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p) >>>> - insert vop phi in bb2, and use that one (update_vuses) >>>> - complete pt_solutions_equal_p. >>>> >>>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM. >>>> >>>> I placed the pass at the earliest point where it fixes example B: After copy >>>> propagation and dead code elimination, specifically, after the first invocation >>>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled? >>> >>> It's probably reasonable to run it after IPA inlining has taken place which >>> means insert it somewhen after the second pass_fre (I'd suggest after >>> pass_merge_phi). >>> >> >> I placed it there, but I ran into some interaction with >> pass_late_warn_uninitialized. Addition of the pass makes test >> gcc.dg/uninit-pred-2_c.c fail. >> >> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning >> (test for bogus messages, line 43) >> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning >> (test for warnings, line 45) >> >> int foo_2 (int n, int m, int r) >> { >> int flag = 0; >> int v; >> >> if (n) >> { >> v = r; >> flag = 1; >> } >> >> if (m) g++; >> else bar (); >> >> if (flag) >> blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" } >> else >> blah (v); { dg-warning "uninitialized" "real uninitialized var warning" } >> >> return 0; >> } >> >> The pass replaces the second call to blah with the first one, and eliminates >> the if. After that, the uninitialized warning is issued for the line number >> of the first call to blah, while at source level the warning only makes sense >> for the second call to blah. >> >> Shall I try putting the pass after pass_late_warn_uninitialized? > > No, simply pass -fno-tree-tail-merge in the testcase. > >>> But my general comment still applies - I don't like the structural >>> comparison code at all and you should really use the value-numbering >>> machineries we have >> >> I now use sccvn. > > Good. > >>> or even better, merge this pass with FRE itself >>> (or DOM if that suits you more). For FRE you'd want to hook into >>> tree-ssa-pre.c:eliminate(). >>> >> >> If we need to do the transformation after pass_late_warn_uninitialized, it needs >> to stay on its own, I suppose. > > I suppose part of the high cost of the pass is running SCCVN, so it > makes sense to share that with the existing FRE run. Done. > Any reason > you use VN_NOWALK? > No, that was just a first-try value. >>>>> This also duplicates the literal matching done on the RTL level - instead >>>>> I think this optimization would be more related to value-numbering >>>>> (either that of SCCVN/FRE/PRE or that of DOM which also does >>>>> jump-threading). 
>>>> >>>> The pass currently does just duplicate block elimination, not cross-jumping. >>>> If we would like to extend this to cross-jumping, I think we need to do the >>>> reverse of value numbering: walk backwards over the bb, and keep track of the >>>> way values are used rather than defined. This will allows us to make a cut >>>> halfway a basic block. >>> >>> I don't understand - I propose to do literal matching but using value-numbering >>> for tracking equivalences to avoid literal matching for stuff we know is >>> equivalent. In fact I think it will be mostly calls and stores where we >>> need to do literal matching, but never intermediate computations on >>> registers. >>> >> >> I tried to implement that scheme now. >> >>> But maybe I miss something here. >>> >>>> In general, we cannot do cut halfway a basic block in the current implementation >>>> (of value numbering and forward matching), since we assume equivalence of the >>>> incoming vops at bb entry. This assumption is in general only valid if we indeed >>>> replace the entire block by another entire block. >>> >>> Why are VOPs of concern at all? >>> >> >> In the previous version, I inserted the phis for the vops manually. >> In the current version of the pass, I let TODO_update_ssa_only_virtuals deal >> with vops, so it's not relevant anymore. >> >>>> I imagine that a cross-jumping heuristic would be based on the length of the >>>> match and the amount of non-vop phis it would introduce. Then value numbering >>>> would be something orthogonal to this optimization, which would reduce amount of >>>> phis needed for a cross-jump. >>>> I think it would make sense to use SCCVN value numbering at the point that we >>>> have this backward matching. >>>> >>>> I'm not sure whether it's a good idea to try to replace the current forward >>>> local value numbering with SCCVN value numbering, since we currently declare >>>> vops equal, which are, in the global sense, not equal. And once we go to >>>> backward matching, we'll need something to keep track of the uses, and we can >>>> reuse the current infrastructure for that, but not the SCCVN value numbering. >>>> >>>> Does that make any sense? >>> >>> Ok, let me think about this a bit. >>> >> >> I tried to to be more clear on this in the header comment of the pass. >> >>> For now about the patch in general. The functions need renaming to >>> something more sensible now that this isn't cfg-cleanup anymore. >>> >>> I miss a general overview of the pass - it's hard to reverse engineer >>> its working for me. >> >> I added a header comment. >> >>> Like (working backwards), you are detecting >>> duplicate predecessors >>> - that obviously doesn't work for duplicates >>> without any successors, like those ending in noreturn calls. >>> >> >> Merging of blocks without successors works now. >> >>> + n = EDGE_COUNT (bb->preds); >>> + >>> + for (i = 0; i < n; ++i) >>> + { >>> + e1 = EDGE_PRED (bb, i); >>> + if (e1->flags & EDGE_COMPLEX) >>> + continue; >>> + for (j = i + 1; j < n; ++j) >>> + { >>> >>> that's quadratic in the number of predecessors. >>> >> >> The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS. >> Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs >> per worklist iteration. >> >>> + /* Block e1->src might be deleted. If bb and e1->src are the same >>> + block, delete e2->src instead, by swapping e1 and e2. */ >>> + e1_swapped = (bb == e1->src) ? e2: e1; >>> + e2_swapped = (bb == e1->src) ? 
e1: e2; >>> >>> is that because you incrementally merge preds two at a time? As you >>> are deleting blocks don't you need to adjust the quadratic walking? >>> Thus, with say four equivalent preds won't your code crash anyway? >>> >> >> I think it was to make calculation of dominator info easier, but I use now >> functions from dominance.c for that, so this piece of code is gone. >> >>> I think the code needs to delay the CFG manipulation to the end >>> of this function. >>> >> >> I now delay the cfg manipulation till after each analysis phase. >> >>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and >>> + E2 are either: >>> + - equal, or >>> + - defined locally in E1->src and E2->src. >>> + In the latter case, register the alternatives in *PHI_EQUIV. */ >>> + >>> +static bool >>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2) >>> +{ >>> + int n1 = e1->dest_idx; >>> + int n2 = e2->dest_idx; >>> + gimple_stmt_iterator gsi; >>> + basic_block dest = e1->dest; >>> + gcc_assert (dest == e2->dest); >>> >>> too many asserts in general - I'd say for this case pass in the destination >>> block as argument. >>> >>> + gcc_assert (val1 != NULL_TREE); >>> + gcc_assert (val2 != NULL_TREE); >>> >>> superfluous. >>> >>> +static bool >>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2) >>> ... >>> + VEC (edge,heap) *redirected_edges; >>> + gcc_assert (bb == e2->dest); >>> >>> same. >>> >>> + if (e1->flags != e2->flags) >>> + return false; >>> >>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches >>> by swapping edges in the preds. >>> >> >> That's handled now. >> >>> + /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2 >>> + have the same successors. */ >>> + if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1) >>> + return false; >>> >>> hm, ok - that would need fixing, too. Same or mergeable successors >>> of course, which makes me wonder if doing this whole transformation >>> incrementally and locally is a good idea ;) Also >>> >> >> Also handled now. >> >>> + /* Calculate the changes to be made to the dominator info. >>> + Calculate bb2_dom. */ >>> ... >>> >>> wouldn't be necessary I suppose (just throw away dom info after the >>> pass). >>> >>> That is, I'd globally record BB equivalences (thus, "value-number" >>> BBs) and apply the CFG manipulations at a single point. >>> >> >> I delay the cfg manipulation till after each analysis phase. Delaying the cfg >> manipulation till the end of the pass instead might make the analysis code more >> convoluted. >> >>> Btw, I miss where you insert PHI nodes for all uses that flow in >>> from the preds preds - you do that for VOPs but not for real >>> operands? >>> >> >> Indeed, inserting phis for non-vops is a todo. >> >>> + /* Replace uses of vuse2 with uses of the phi. */ >>> + for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi)) >>> + { >>> >>> why not walk immediate uses of the old PHI and SET_USE to >>> the new one instead (for those uses in the duplicate BB of course)? >>> >> >> And I no longer insert VOP phis, but let a TODO handle that, so this code is gone. > > Ok. Somewhat costly in comparison though. > I tried to add that back, guarded by update_vops. Handled in update_vuses, vop_phi, insn_vops, vop_at_entry, replace_block_by. 
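For reference, this is roughly the shape of the manual vop update (a
simplified sketch of what update_vuses/vop_phi in the patch below have to do;
the name vop_phi_sketch is mine, and the single-predecessor and no-mismatch
cases are omitted):

  /* BB2 replaces BB1, so BB2 gains BB1's incoming edges.  If the vop
     state at entry differs per incoming edge, merge the states with a
     virtual phi and rewrite the vuses in BB2 to its result.  */
  static void
  vop_phi_sketch (basic_block bb2, tree vuse1, edge e1, tree vuse2, edge e2)
  {
    gimple phi = create_phi_node (SSA_NAME_VAR (vuse2), bb2);
    add_phi_arg (phi, vuse1, e1, UNKNOWN_LOCATION);
    add_phi_arg (phi, vuse2, e2, UNKNOWN_LOCATION);
    /* ... then walk the immediate uses of vuse2 and SET_USE those
       located in bb2 to gimple_phi_result (phi) ...  */
  }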
>>> + case GSS_CALL: >>> + if (!pt_solution_equal_p (gimple_call_use_set (s1), >>> + gimple_call_use_set (s2)) >>> >>> I don't understand why you are concerned about equality of >>> points-to information. Why not simply ior it (pt_solution_ior_into - note >>> they are shared so you need to unshare them first). >>> >> >> I let a todo handle the alias info now. > > Hmm, that's not going to work if it's needed for correctness. > Should be handed my merge_calls now. >>> +/* Return true if p1 and p2 can be considered equal. */ >>> + >>> +static bool >>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2) >>> >>> would go into tree-ssa-structalias.c instead. >>> >>> +static bool >>> +gimple_base_equal_p (gimple s1, gimple s2) >>> +{ >>> ... >>> + if (gimple_modified_p (s1) || gimple_modified_p (s2)) >>> + return false; >>> >>> that shouldn't be of concern. >>> >>> + if (s1->gsbase.subcode != s2->gsbase.subcode) >>> + return false; >>> >>> for assigns that are of class GIMPLE_SINGLE_RHS we do not >>> update subcode during transformations so it can differ for now >>> equal statements. >>> >> >> handled properly now. >> >>> I'm not sure if a splay tree for the SSA name version equivalency >>> map is the best representation - I would have used a simple >>> array of num_ssa_names size and assign value-numbers >>> (the lesser version for example). >>> >>> Thus equiv_insert would do >>> >>> value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2)); >>> values[SSA_NAME_VERSION (val1)] = value; >>> values[SSA_NAME_VERSION (val2)] = value; >>> >>> if the names are not defined in bb1 resp. bb2 we would have to insert >>> a PHI node in the merged block - that would be a cost thingy for >>> doing this value-numbering in a more global way. >>> >> >> local value numbering code has been removed. >> >>> You don't seem to be concerned about the SSA names points-to >>> information, but it surely has the same issues as that of the calls >>> (so either they should be equal or they should be conservatively >>> merged). But as-is you don't allow any values to flow into the >>> merged blocks that are not equal for both edges, no? >>> >> >> Correct, that's still a todo. >> >>> + TV_TREE_CFG, /* tv_id */ >>> >>> add a new timevar. We wan to be able to turn the pass off, >>> so also add a new option (I can see it might make debugging harder >>> in some cases). >>> >> >> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE. >> >>> Can you assess the effect of the patch on GCC itself (for example >>> when building cc1?)? What's the size benefit and the compile-time >>> overhead? >>> >> >> effect on building cc1: >> >> real user sys >> without: 19m50.158s 19m 2.090s 0m20.860s >> with: 19m59.456s 19m17.170s 0m20.350s >> ---------- >> +15.080s >> +1.31% > > That's quite a lot of time. > Measurement for this version: real user sys without 19m59.995s 19m 9.970s 0m21.050s with 19m56.160s 19m14.830s 0m21.530s ---------- +4.86s +0.42% text data bss dec hex filename 17547657 41736 1364384 18953777 1213631 without/cc1 17211049 41736 1364384 18617169 11c1351 with/cc1 -------- -336608 -1.92% >> $ size without/cc1 with/cc1 >> text data bss dec hex filename >> 17515986 41320 1364352 18921658 120b8ba without/cc1 >> 17399226 41320 1364352 18804898 11ef0a2 with/cc1 >> -------- >> -116760 >> -0.67% >> >> OK for trunk, provided build & reg-testing on ARM is ok? > > I miss additions to the testsuite. > I will send an updated patch on thread http://gcc.gnu.org/ml/gcc-patches/2011-06/msg00625.html. 
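Coming back to the points-to merging for calls above, the conservative merge
boils down to something like this sketch (merge_calls in the patch is the
authoritative version; merge_call_pt_sketch is a made-up name, and I assume
pt_solution_ior_into_shared takes (dest, src) like pt_solution_ior_into):

  /* Union the points-to info of CALL2 into CALL1, so that CALL1 stays
     correct when it replaces CALL2.  */
  static void
  merge_call_pt_sketch (gimple call1, gimple call2)
  {
    /* The pt_solution bitmaps may be shared between statements, so a
       plain pt_solution_ior_into could corrupt other users; the patch
       adds pt_solution_ior_into_shared for exactly this.  */
    pt_solution_ior_into_shared (gimple_call_use_set (call1),
                                 gimple_call_use_set (call2));
    pt_solution_ior_into_shared (gimple_call_clobber_set (call1),
                                 gimple_call_clobber_set (call2));
  }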
> > +static bool > +bb_dominated_by_p (basic_block bb1, basic_block bb2) > > please use > > + if (TREE_CODE (val1) == SSA_NAME) > + { > + if (!same_preds > && !SSA_NAME_IS_DEFAULT_DEF (val1) > && !dominated_by_p (bb2, gimple_bb (SSA_NAME_DEF_STMT (val1))) > return false; > > instead. All stmts should have a BB apart from def stmts of default defs > (which are gimple_nops). > Done. > +/* Return the canonical scc_vn tree for X, if we can use scc_vn_info. > + Otherwise, return X. */ > + > +static tree > +gvn_val (tree x) > +{ > + return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME) > + ? VN_INFO ((x))->valnum : x); > +} > > I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead > which seems to perform the same. Done. > Do you ever call the above > when scc_vn_ok is false or x is NULL? Not in this version. Earlier, I also ran the pass if sccvn bailed out, but pre and fre only run if sccvn succeeded. > > +static bool > +gvn_uses_equal (tree val1, tree val2, basic_block bb1, > + basic_block bb2, bool same_preds) > +{ > + gimple def1, def2; > + basic_block def1_bb, def2_bb; > + > + if (val1 == NULL_TREE || val2 == NULL_TREE) > + return false; > > does this ever happen? Not in the current version. Removed. > > + if (gvn_val (val1) != gvn_val (val2)) > + return false; > > I suppose a shortcut > > if (val1 == val2) > return true; > > is possible? > Indeed. Added. > +static int *bb_size; > + > +/* Init bb_size administration. */ > + > +static void > +init_bb_size (void) > +{ > > if you need more per-BB info you can hook it into bb->aux. What's the > size used for (I guess I'll see below ...)? > > + for (gsi = gsi_start_nondebug_bb (bb); > + !gsi_end_p (gsi); gsi_next_nondebug (&gsi)) > + size++; > > is pretty rough. I guess for a quick false result for comparing BBs > (which means you could initialize the info lazily?) Done. > > +struct same_succ > +{ > + /* The bbs that have the same successor bbs. */ > + bitmap bbs; > + /* The successor bbs. */ > + bitmap succs; > + /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for > + bb. */ > + bitmap inverse; > + /* The edge flags for each of the successor bbs. */ > + VEC (int, heap) *succ_flags; > + /* Indicates whether the struct is in the worklist. */ > + bool in_worklist; > +}; > > looks somewhat odd at first sight - maybe a overall comment what this > is used for is missing. Well, let's see. > Tried to add an overall comment. > +static hashval_t > +same_succ_hash (const void *ve) > +{ > + const_same_succ_t e = (const_same_succ_t)ve; > + hashval_t hashval = bitmap_hash (e->succs); > + int flags; > + unsigned int i; > + unsigned int first = bitmap_first_set_bit (e->bbs); > + int size = bb_size [first]; > + gimple_stmt_iterator gsi; > + gimple stmt; > + basic_block bb = BASIC_BLOCK (first); > + > + hashval = iterative_hash_hashval_t (size, hashval); > + for (gsi = gsi_start_nondebug_bb (bb); > + !gsi_end_p (gsi); gsi_next_nondebug (&gsi)) > + { > + stmt = gsi_stmt (gsi); > + hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval); > + if (!is_gimple_call (stmt)) > + continue; > + if (gimple_call_internal_p (stmt)) > + hashval = iterative_hash_hashval_t > + ((hashval_t) gimple_call_internal_fn (stmt), hashval); > + else > + hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval); > > you could also keep a cache of the BB hash as you keep a cache > of the size (if this function is called multiple times per BB). Right, I forgot it's a closed hash table. Added the cache of the bb hash. 
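Concretely, caching the hash buys an O(1) early-out in the equality callback
of the closed hash table, before any statement is compared; this is the first
check in same_succ_equal in the patch below:

  /* hashval is computed once, when the entry is inserted.  */
  if (e1->hashval != e2->hashval)
    return 0;
  /* ... only on a genuine hash match do we compare the successor
     bitmaps, the edge flags and the statements themselves ...  */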
> The hash looks relatively weak - for all asignments it will hash > in GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code. Done. > The call handling OTOH looks overly complicated to me ;) > That's an attempt to handle compiling insn-recog.c efficiently. All bbs without successors are grouped together, and even after selecting on same function name, there are still thousands of bbs in that group. I added now the args as well. > The hash will be dependent on stmt ordering even if that doesn't matter, > like > > i = i + 1; > j = j - 1; > > vs. the swapped variant. Right, that's a todo, added that in the header comment. > Similar the successor edges are not sorted, > so true/false edges may be in different order. > I keep a cache of the successor edge flags, in order of bbs. > Not sure yet if your comparison function would make those BBs > unequal anyway. > > +static bool > +inverse_flags (const_same_succ_t e1, const_same_succ_t e2) > +{ > + int f1a, f1b, f2a, f2b; > + int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); > + > + if (VEC_length (int, e1->succ_flags) != 2) > + return false; > ... > > I wonder why you keep a VEC of successor edges in same_succ_t > instead of using the embedded successor edge vector in the basic_block > structure? > To keep the information in order of bbs, for quick comparison. > + bb_to_same_succ[bb->index] = *slot; > > looks like a candidate for per-BB info in bb->aux, too. > Done. > +static void > +find_same_succ (void) > +{ > + int i; > + same_succ_t same = same_succ_alloc (); > + > + for (i = 0; i < last_basic_block; ++i) > + { > + find_same_succ_bb (BASIC_BLOCK (i), &same); > + if (same == NULL) > + same = same_succ_alloc (); > + } > > I suppose you want FOR_EACH_BB (excluding entry/exit block) or > FOR_ALL_BB (including them). The above also can > have BASIC_BLOCK(i) == NULL. Similar in other places. > Done. > + for (i = 0; i < n1; ++i) > + { > + ei = EDGE_PRED (bb1, i); > + for (j = 0; j < n2; ++j) > + { > + ej = EDGE_PRED (bb2, j); > + if (ei->src != ej->src) > + continue; > + nr_matches++; > + break; > + } > + } > > FOR_EACH_EDGE (ei, iterator, bb1->preds) > if (!find_edge (ei->src, bb2)) > return false; > > is easier to parse. > Done. > +static bool > +gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond) > +{ > + tree var, var_type; > + bool honor_nans; > + > + if (is_gimple_assign (s1) > + && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS) > + return true; > > the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code > (TREE_CODE of gimple_assign_rhs1 actually). > > +static bool > +gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond) > > I wonder if you still need this given .. > > +static bool > +gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond) > +{ > + unsigned int i; > + enum gimple_statement_structure_enum gss; > + tree lhs1, lhs2; > + basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2); > + > + /* Handle omp gimples conservatively. */ > + if (is_gimple_omp (s1) || is_gimple_omp (s2)) > + return false; > + > + /* Handle lhs. */ > + lhs1 = gimple_get_lhs (s1); > + lhs2 = gimple_get_lhs (s2); > + if (lhs1 != NULL_TREE && lhs2 != NULL_TREE) > + return (same_preds && TREE_CODE (lhs1) == SSA_NAME > + && TREE_CODE (lhs2) == SSA_NAME > + && gvn_val (lhs1) == gvn_val (lhs2)); > + else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE)) > + return false; > > all lhs equivalency is defered to GVN (which means all GIMPLE_ASSIGN > and GIMPLE_CALL stmts with a lhs). 
> > That leaves the case of calls without a lhs. I'd rather structure this > function like > > if (gimple_code (s1) != gimple_code (s2)) > return false; > swithc (gimple_code (s1)) > { > case GIMPLE_CALL: > ... compare arguments ... > if equal ok, if not and we have a lhs use GVN. > > case GIMPLE_ASSIGN: > ... compare GVN of the LHS ... > > case GIMPLE_COND: > ... compare operands ... > > default: > return false; > } > > Done. > +static bool > +bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds, > + bool inv_cond) > + > +{ > > you don't do an early out by comparing the pre-computed sizes. This function is only called for bb with the same size. The hash table equal fuction does have the size comparison. > Mind > you can have hashtable collisions where they still differ (did you > check hashtable stats on it? how is the collision rate?) > I managed to lower the collision rate by specifying n_basic_blocks as hash table size. While compiling insn-recog.c, the highest collision rate for a function is 2.69. > +static bool > +bb_has_non_vop_phi (basic_block bb) > +{ > + gimple_seq phis = phi_nodes (bb); > + gimple phi; > + > + if (phis == NULL) > + return false; > + > + if (!gimple_seq_singleton_p (phis)) > + return true; > + > + phi = gimple_seq_first_stmt (phis); > + return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi))); > > return is_gimple_reg (gimple_phi_result (phi)); > Done. > +static void > +update_debug_stmts (void) > +{ > + int i; > + basic_block bb; > + > + for (i = 0; i < last_basic_block; ++i) > + { > + gimple stmt; > + gimple_stmt_iterator gsi; > + > + bb = BASIC_BLOCK (i); > > FOR_EACH_BB > > it must be possible to avoid scanning basic-blocks that are not affected > by the transform, no? In fact the only affected basic-blocks should be > those that were merged with another block? Done. I also check for MAY_HAVE_DEBUG_STMTS now. > > + /* Mark vops for updating. Without this, TODO_update_ssa_only_virtuals > + won't do anything. */ > + mark_sym_for_renaming (gimple_vop (cfun)); > > it won't insert any PHIs, that's correct. Still somewhat ugly, a manual > update of PHI nodes should be possible. > Added. I'm trying to be lazy about it though: + bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0 + || !symbol_marked_for_renaming (gimple_vop (cfun))); If we're going to insert those phis anyway given the current todo, we don't bother. > + if (dump_file) > + { > + fprintf (dump_file, "Before TODOs.\n"); > > with TDF_DETAILS only please. > Done. > + free_dominance_info (CDI_DOMINATORS); > > if you keep dominance info up-to-date there is no need to free it. > Indeed. And by not freeing it, the info was checked, and I hit validation errors in that info, so the updating code had problems. I reverted back to calculating when needed and freeing when changed. > + TODO_verify_ssa | TODO_verify_stmts > + | TODO_verify_flow | TODO_update_ssa_only_virtuals > + | TODO_rebuild_alias > > please no TODO_rebuild_alias, simply remove it - alias info in merged > paths should be compatible enough if there is value-equivalence between > SSA names. At least you can't rely on TODO_rebuild_alias for > correctness - it is skipped if IPA PTA was run for example. > Done. > + | TODO_cleanup_cfg > > is that needed? If so return it from your execute function if you changed > anything only. But I doubt your transformation introduces cleanup > opportunities? 
> If all the predeccessor blocks of a block are merged, the block and it's remaining predecessor block might be merged, so that is a cleanup opportunity. Removed for the moment. > New options and params need documentation in doc/invoke.texi. > Added. > Thanks, > Richard. > Bootstrapped and reg-tested on x86_64. Ok for trunk (after ARM testing)? Thanks, - Tom 2011-07-17 Tom de Vries PR middle-end/43864 * tree-ssa-tail-merge.c: New file. (struct same_succ): Define. (same_succ_t, const_same_succ_t): New typedef. (struct bb_cluster): Define. (bb_cluster_t, const_bb_cluster_t): New typedef. (struct aux_bb_info): Define. (BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define. (gvn_uses_equal): New function. (same_succ_print, same_succ_print_traverse, same_succ_hash) (inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete) (same_succ_reset): New function. (same_succ_htab, same_succ_edge_flags) (deleted_bbs, deleted_bb_preds): New var. (debug_same_succ): New function. (worklist): New var. (print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ) (init_worklist, delete_worklist, delete_basic_block_same_succ) (same_succ_flush_bbs, update_worklist): New function. (print_cluster, debug_cluster, same_predecessors) (add_bb_to_cluster, new_cluster, delete_cluster): New function. (all_clusters): New var. (alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors) (merge_clusters, set_cluster): New function. (gimple_equal_p, find_duplicate, same_phi_alternatives_1) (same_phi_alternatives, bb_has_non_vop_phi, find_clusters_1) (find_clusters): New function. (merge_calls, update_vuses, vop_phi, insn_vops, vop_at_entry) (replace_block_by): New function. (update_bbs): New var. (apply_clusters): New function. (update_debug_stmt, update_debug_stmts): New function. (tail_merge_optimize): New function. tree-flow.h (tail_merge_optimize): Declare. * tree-ssa-pre.c (execute_pre): Use tail_merge_optimize. * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o. (tree-ssa-tail-merge.o): New rule. * opts.c (default_options_table): Set OPT_ftree_tail_merge by default at OPT_LEVELS_2_PLUS. * tree-ssa-sccvn.c (vn_valueize): Move to ... * tree-ssa-sccvn.h (vn_valueize): Here. * tree-ssa-alias.h (pt_solution_ior_into_shared): Declare. * tree-ssa-structalias.c (find_what_var_points_to): Factor out and use ... (pt_solution_share): New function. (pt_solution_unshare, pt_solution_ior_into_shared): New function. (delete_points_to_sets): Nullify shared_bitmap_table after deletion. * timevar.def (TV_TREE_TAIL_MERGE): New timevar. * common.opt (ftree-tail-merge): New switch. * params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter. * doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge. (-ftree-tail-merge, max-tail-merge-comparisons): New item. Index: gcc/tree-ssa-tail-merge.c =================================================================== --- gcc/tree-ssa-tail-merge.c (revision 0) +++ gcc/tree-ssa-tail-merge.c (revision 0) @@ -0,0 +1,1676 @@ +/* Tail merging for gimple. + Copyright (C) 2011 Free Software Foundation, Inc. + Contributed by Tom de Vries (tom@codesourcery.com) + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3, or (at your option) +any later version. 
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* Pass overview.
+
+
+   MOTIVATIONAL EXAMPLE
+
+   gimple representation of gcc/testsuite/gcc.dg/pr43864.c at
+
+   hprofStartupp (charD.1 * outputFileNameD.2600, charD.1 * ctxD.2601)
+   {
+     struct FILED.1638 * fpD.2605;
+     charD.1 fileNameD.2604[1000];
+     intD.0 D.3915;
+     const charD.1 * restrict outputFileName.0D.3914;
+
+     # BLOCK 2 freq:10000
+     # PRED: ENTRY [100.0%]  (fallthru,exec)
+     # PT = nonlocal { D.3926 } (restr)
+     outputFileName.0D.3914_3
+       = (const charD.1 * restrict) outputFileNameD.2600_2(D);
+     # .MEMD.3923_13 = VDEF <.MEMD.3923_12(D)>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     sprintfD.759 (&fileNameD.2604, outputFileName.0D.3914_3);
+     # .MEMD.3923_14 = VDEF <.MEMD.3923_13>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     D.3915_4 = accessD.2606 (&fileNameD.2604, 1);
+     if (D.3915_4 == 0)
+       goto <bb 3>;
+     else
+       goto <bb 4>;
+     # SUCC: 3 [10.0%]  (true,exec) 4 [90.0%]  (false,exec)
+
+     # BLOCK 3 freq:1000
+     # PRED: 2 [10.0%]  (true,exec)
+     # .MEMD.3923_15 = VDEF <.MEMD.3923_14>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 4 freq:9000
+     # PRED: 2 [90.0%]  (false,exec)
+     # .MEMD.3923_16 = VDEF <.MEMD.3923_14>
+     # PT = nonlocal escaped
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fpD.2605_8 = fopenD.1805 (&fileNameD.2604[0], 0B);
+     if (fpD.2605_8 == 0B)
+       goto <bb 5>;
+     else
+       goto <bb 6>;
+     # SUCC: 5 [1.9%]  (true,exec) 6 [98.1%]  (false,exec)
+
+     # BLOCK 5 freq:173
+     # PRED: 4 [1.9%]  (true,exec)
+     # .MEMD.3923_17 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 6 freq:8827
+     # PRED: 4 [98.1%]  (false,exec)
+     # .MEMD.3923_18 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fooD.2599 (outputFileNameD.2600_2(D), fpD.2605_8);
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 7 freq:10000
+     # PRED: 3 [100.0%]  (fallthru,exec) 5 [100.0%]  (fallthru,exec)
+             6 [100.0%]  (fallthru,exec)
+     # PT = nonlocal null
+
+     # ctxD.2601_1 = PHI <0B(3), 0B(5), ctxD.2601_5(D)(6)>
+     # .MEMD.3923_11 = PHI <.MEMD.3923_15(3), .MEMD.3923_17(5),
+                            .MEMD.3923_18(6)>
+     # VUSE <.MEMD.3923_11>
+     return ctxD.2601_1;
+     # SUCC: EXIT [100.0%]
+   }
+
+   bb 3 and bb 5 can be merged.  The blocks have different predecessors, but
+   the same successors, and the same operations.
+
+
+   CONTEXT
+
+   A technique called tail merging (or cross jumping) can fix the example
+   above.  For a block, we look for common code at the end (the tail) of the
+   predecessor blocks, and insert jumps from one block to the other.
+   The example is a special case for tail merging, in that 2 whole blocks
+   can be merged, rather than just the end parts of it.
+   We currently only focus on whole block merging, so in that sense
+   calling this pass tail merge is a bit of a misnomer.
+
+   We distinguish 2 kinds of situations in which blocks can be merged:
+   - same operations, same predecessors.  The successor edges coming from one
+     block are redirected to come from the other block.
+   - same operations, same successors.  The predecessor edges entering one
+     block are redirected to enter the other block.  Note that this operation
+     might involve introducing phi operations.
+
+   For efficient implementation, we would like to value number the blocks, and
+   have a comparison operator that tells us whether the blocks are equal.
+   Besides being runtime efficient, block value numbering should also abstract
+   from irrelevant differences in order of operations, much like normal value
+   numbering abstracts from irrelevant order of operations.
+
+   For the first situation (same operations, same predecessors), normal value
+   numbering fits well.  We can calculate a block value number based on the
+   value numbers of the defs and vdefs.
+
+   For the second situation (same operations, same successors), this approach
+   doesn't work so well.  We can illustrate this using the example.  The calls
+   to free use different vdefs: MEMD.3923_16 and MEMD.3923_14, and these will
+   remain different in value numbering, since they represent different memory
+   states.  So the resulting vdefs of the frees will be different in value
+   numbering, so the block value numbers will be different.
+
+   The reason why we call the blocks equal is not because they define the same
+   values, but because uses in the blocks use (possibly different) defs in the
+   same way.  To be able to detect this efficiently, we need to do some kind
+   of reverse value numbering, meaning number the uses rather than the defs,
+   and calculate a block value number based on the value number of the uses.
+   Ideally, a block comparison operator will also indicate which phis are
+   needed to merge the blocks.
+
+   For the moment, we don't do block value numbering, but we do insn-by-insn
+   matching, using scc value numbers to match operations with results, and
+   structural comparison otherwise, while ignoring vop mismatches.
+
+
+   IMPLEMENTATION
+
+   1. The pass first determines all groups of blocks with the same successor
+      blocks.
+   2. Within each group, it tries to determine clusters of equal basic blocks.
+   3. The clusters are applied.
+   4. The same successor groups are updated.
+   5. This process is repeated from 2 onwards, until no more changes.
+
+
+   LIMITATIONS/TODO
+
+   - block only
+   - handles only 'same operations, same successors'.
+     It handles same predecessors as a special subcase though.
+   - does not implement the reverse value numbering and block value numbering.
+   - does not abstract from statement order.  In order to do this, we need to
+     abstract from statement order in the hash function, and bb comparison
+     functions.
+   - improve memory allocation: use garbage collected memory, obstacks,
+     allocpools where appropriate.
+   - no insertion of gimple_reg phis; we only introduce vop-phis.
+   - handle blocks with gimple_reg phi_nodes.
+
+
+   SWITCHES
+
+   - ftree-tail-merge.  On at -O2.  We may have to enable it only at -Os.
*/ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "tm.h" +#include "tree.h" +#include "tm_p.h" +#include "basic-block.h" +#include "output.h" +#include "flags.h" +#include "function.h" +#include "tree-flow.h" +#include "timevar.h" +#include "bitmap.h" +#include "tree-ssa-alias.h" +#include "params.h" +#include "tree-pretty-print.h" +#include "hashtab.h" +#include "gimple-pretty-print.h" +#include "tree-ssa-sccvn.h" +#include "tree-dump.h" + +/* Describes a group of bbs with the same successors. The successor bbs are + cached in succs, and the successor edge flags are cached in succ_flags. + If a bb has the EDGE_TRUE/VALSE_VALUE flags swapped compared to succ_flags, + it's marked in inverse. + Additionally, the hash value for the struct is cached in hashval, and + in_worklist indicates whether it's currently part of worklist. */ + +struct same_succ +{ + /* The bbs that have the same successor bbs. */ + bitmap bbs; + /* The successor bbs. */ + bitmap succs; + /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for + bb. */ + bitmap inverse; + /* The edge flags for each of the successor bbs. */ + VEC (int, heap) *succ_flags; + /* Indicates whether the struct is currently in the worklist. */ + bool in_worklist; + /* The hash value of the struct. */ + hashval_t hashval; +}; +typedef struct same_succ *same_succ_t; +typedef const struct same_succ *const_same_succ_t; + +/* A group of bbs where 1 bb from bbs can replace the other bbs. */ + +struct bb_cluster +{ + /* The bbs in the cluster. */ + bitmap bbs; + /* The preds of the bbs in the cluster. */ + bitmap preds; + /* index in all_clusters vector. */ + int index; +}; +typedef struct bb_cluster *bb_cluster_t; +typedef const struct bb_cluster *const_bb_cluster_t; + +/* Per bb-info. */ + +struct aux_bb_info +{ + /* The number of non-debug statements in the bb. */ + int size; + /* The same_succ that this bb is a member of. */ + same_succ_t same_succ; + /* The cluster that this bb is a member of. */ + bb_cluster_t cluster; + /* The vop state at the exit of a bb. This is shortlived data, used to + communicate data between update_block_by and update_vuses. */ + tree vop_at_exit; +}; + +/* Macros to access the fields of struct aux_bb_info. */ + +#define BB_SIZE(bb) (((struct aux_bb_info *)bb->aux)->size) +#define BB_SAME_SUCC(bb) (((struct aux_bb_info *)bb->aux)->same_succ) +#define BB_CLUSTER(bb) (((struct aux_bb_info *)bb->aux)->cluster) +#define BB_VOP_AT_EXIT(bb) (((struct aux_bb_info *)bb->aux)->vop_at_exit) + +/* VAL1 and VAL2 are either: + - uses in BB1 and BB2, or + - phi alternatives for BB1 and BB2. + SAME_PREDS indicates whether BB1 and BB2 have the same predecessors. + Return true if the uses have the same gvn value, and if the corresponding + defs can be used in both BB1 and BB2. */ + +static bool +gvn_uses_equal (tree val1, tree val2, basic_block bb1, + basic_block bb2, bool same_preds) +{ + gcc_checking_assert (val1 != NULL_TREE && val2 != NULL_TREE); + + if (val1 == val2) + return true; + + if (vn_valueize (val1) != vn_valueize (val2)) + return false; + + /* If BB1 and BB2 have the same predecessors, the same values are defined at + entry of BB1 and BB2. Otherwise, we need to check. 
*/ + + if (TREE_CODE (val1) == SSA_NAME) + { + if (!same_preds + && !SSA_NAME_IS_DEFAULT_DEF (val1) + && !dominated_by_p (CDI_DOMINATORS, bb2, + gimple_bb (SSA_NAME_DEF_STMT (val1)))) + return false; + } + else if (!CONSTANT_CLASS_P (val1)) + return false; + + if (TREE_CODE (val2) == SSA_NAME) + { + if (!same_preds + && !SSA_NAME_IS_DEFAULT_DEF (val2) + && !dominated_by_p (CDI_DOMINATORS, bb1, + gimple_bb (SSA_NAME_DEF_STMT (val2)))) + return false; + } + else if (!CONSTANT_CLASS_P (val2)) + return false; + + return true; +} + +/* Prints E to FILE. */ + +static void +same_succ_print (FILE *file, const same_succ_t e) +{ + unsigned int i; + bitmap_print (file, e->bbs, "bbs:", "\n"); + bitmap_print (file, e->succs, "succs:", "\n"); + bitmap_print (file, e->inverse, "inverse:", "\n"); + fprintf (file, "flags:"); + for (i = 0; i < VEC_length (int, e->succ_flags); ++i) + fprintf (file, " %x", VEC_index (int, e->succ_flags, i)); + fprintf (file, "\n"); +} + +/* Prints same_succ VE to VFILE. */ + +static int +same_succ_print_traverse (void **ve, void *vfile) +{ + const same_succ_t e = *((const same_succ_t *)ve); + FILE *file = ((FILE*)vfile); + same_succ_print (file, e); + return 1; +} + +/* Calculates hash value for same_succ VE. */ + +static hashval_t +same_succ_hash (const void *ve) +{ + const_same_succ_t e = (const_same_succ_t)ve; + hashval_t hashval = bitmap_hash (e->succs); + int flags; + unsigned int i; + unsigned int first = bitmap_first_set_bit (e->bbs); + basic_block bb = BASIC_BLOCK (first); + int size = 0; + gimple_stmt_iterator gsi; + gimple stmt; + tree arg; + + for (gsi = gsi_start_nondebug_bb (bb); + !gsi_end_p (gsi); gsi_next_nondebug (&gsi)) + { + size++; + stmt = gsi_stmt (gsi); + hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval); + if (is_gimple_assign (stmt)) + hashval = iterative_hash_hashval_t (gimple_assign_rhs_code (stmt), + hashval); + if (!is_gimple_call (stmt)) + continue; + if (gimple_call_internal_p (stmt)) + hashval = iterative_hash_hashval_t + ((hashval_t) gimple_call_internal_fn (stmt), hashval); + else + hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval); + for (i = 0; i < gimple_call_num_args (stmt); i++) + { + arg = gimple_call_arg (stmt, i); + arg = vn_valueize (arg); + hashval = iterative_hash_expr (arg, hashval); + } + } + hashval = iterative_hash_hashval_t (size, hashval); + BB_SIZE (bb) = size; + + for (i = 0; i < VEC_length (int, e->succ_flags); ++i) + { + flags = VEC_index (int, e->succ_flags, i); + flags = flags & ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); + hashval = iterative_hash_hashval_t (flags, hashval); + } + return hashval; +} + +/* Returns true if E1 and E2 have 2 successors, and if the successor flags + are inverse for the EDGE_TRUE_VALUE and EDGE_FALSE_VALUE flags, and equal for + the other edge flags. */ + +static bool +inverse_flags (const_same_succ_t e1, const_same_succ_t e2) +{ + int f1a, f1b, f2a, f2b; + int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); + + if (VEC_length (int, e1->succ_flags) != 2) + return false; + + f1a = VEC_index (int, e1->succ_flags, 0); + f1b = VEC_index (int, e1->succ_flags, 1); + f2a = VEC_index (int, e2->succ_flags, 0); + f2b = VEC_index (int, e2->succ_flags, 1); + + if (f1a == f2a && f1b == f2b) + return false; + + return (f1a & mask) == (f2a & mask) && (f1b & mask) == (f2b & mask); +} + +/* Compares SAME_SUCCs VE1 and VE2. 
*/ + +static int +same_succ_equal (const void *ve1, const void *ve2) +{ + const_same_succ_t e1 = (const_same_succ_t)ve1; + const_same_succ_t e2 = (const_same_succ_t)ve2; + unsigned int i, first1, first2; + gimple_stmt_iterator gsi1, gsi2; + gimple s1, s2; + basic_block bb1, bb2; + + if (e1->hashval != e2->hashval) + return 0; + + if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK) + || bitmap_bit_p (e1->bbs, EXIT_BLOCK) + || bitmap_bit_p (e2->bbs, ENTRY_BLOCK) + || bitmap_bit_p (e2->bbs, EXIT_BLOCK)) + return 0; + + if (VEC_length (int, e1->succ_flags) != VEC_length (int, e2->succ_flags)) + return 0; + + if (!bitmap_equal_p (e1->succs, e2->succs)) + return 0; + + if (!inverse_flags (e1, e2)) + { + for (i = 0; i < VEC_length (int, e1->succ_flags); ++i) + if (VEC_index (int, e1->succ_flags, i) + != VEC_index (int, e1->succ_flags, i)) + return 0; + } + + first1 = bitmap_first_set_bit (e1->bbs); + first2 = bitmap_first_set_bit (e2->bbs); + + bb1 = BASIC_BLOCK (first1); + bb2 = BASIC_BLOCK (first2); + + if (BB_SIZE (bb1) != BB_SIZE (bb2)) + return 0; + + gsi1 = gsi_start_nondebug_bb (bb1); + gsi2 = gsi_start_nondebug_bb (bb2); + while (!(gsi_end_p (gsi1) || gsi_end_p (gsi2))) + { + s1 = gsi_stmt (gsi1); + s2 = gsi_stmt (gsi2); + if (gimple_code (s1) != gimple_code (s2)) + return 0; + if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2)) + return 0; + gsi_next_nondebug (&gsi1); + gsi_next_nondebug (&gsi2); + } + + return 1; +} + +/* Alloc and init a new SAME_SUCC. */ + +static same_succ_t +same_succ_alloc (void) +{ + same_succ_t same = XNEW (struct same_succ); + + same->bbs = BITMAP_ALLOC (NULL); + same->succs = BITMAP_ALLOC (NULL); + same->inverse = BITMAP_ALLOC (NULL); + same->succ_flags = VEC_alloc (int, heap, 10); + same->in_worklist = false; + + return same; +} + +/* Delete same_succ VE. */ + +static void +same_succ_delete (void *ve) +{ + same_succ_t e = (same_succ_t)ve; + + BITMAP_FREE (e->bbs); + BITMAP_FREE (e->succs); + BITMAP_FREE (e->inverse); + VEC_free (int, heap, e->succ_flags); + + XDELETE (ve); +} + +/* Reset same_succ SAME. */ + +static void +same_succ_reset (same_succ_t same) +{ + bitmap_clear (same->bbs); + bitmap_clear (same->succs); + bitmap_clear (same->inverse); + VEC_truncate (int, same->succ_flags, 0); +} + +/* Hash table with all same_succ entries. */ + +static htab_t same_succ_htab; + +/* Array that is used to store the edge flags for a successor. */ + +static int *same_succ_edge_flags; + +/* Bitmap that is used to mark bbs that are recently deleted. */ + +static bitmap deleted_bbs; + +/* Bitmap that is used to mark predecessors of bbs that are + deleted. */ + +static bitmap deleted_bb_preds; + +/* Prints same_succ_htab to stderr. */ + +extern void debug_same_succ (void); +DEBUG_FUNCTION void +debug_same_succ ( void) +{ + htab_traverse (same_succ_htab, same_succ_print_traverse, stderr); +} + +DEF_VEC_P (same_succ_t); +DEF_VEC_ALLOC_P (same_succ_t, heap); + +/* Vector of bbs to process. */ + +static VEC (same_succ_t, heap) *worklist; + +/* Prints worklist to FILE. */ + +static void +print_worklist (FILE *file) +{ + unsigned int i; + for (i = 0; i < VEC_length (same_succ_t, worklist); ++i) + same_succ_print (file, VEC_index (same_succ_t, worklist, i)); +} + +/* Adds SAME to worklist. */ + +static void +add_to_worklist (same_succ_t same) +{ + if (same->in_worklist) + return; + + if (bitmap_count_bits (same->bbs) < 2) + return; + + same->in_worklist = true; + VEC_safe_push (same_succ_t, heap, worklist, same); +} + +/* Add BB to same_succ_htab. 
*/ + +static void +find_same_succ_bb (basic_block bb, same_succ_t *same_p) +{ + unsigned int j; + bitmap_iterator bj; + same_succ_t same = *same_p; + same_succ_t *slot; + edge_iterator ei; + edge e; + + if (bb == NULL) + return; + bitmap_set_bit (same->bbs, bb->index); + FOR_EACH_EDGE (e, ei, bb->succs) + { + int index = e->dest->index; + bitmap_set_bit (same->succs, index); + same_succ_edge_flags[index] = e->flags; + } + EXECUTE_IF_SET_IN_BITMAP (same->succs, 0, j, bj) + VEC_safe_push (int, heap, same->succ_flags, same_succ_edge_flags[j]); + + same->hashval = same_succ_hash (same); + + slot = (same_succ_t *) htab_find_slot_with_hash (same_succ_htab, same, + same->hashval, INSERT); + if (*slot == NULL) + { + *slot = same; + BB_SAME_SUCC (bb) = same; + add_to_worklist (same); + *same_p = NULL; + } + else + { + bitmap_set_bit ((*slot)->bbs, bb->index); + BB_SAME_SUCC (bb) = *slot; + add_to_worklist (*slot); + if (inverse_flags (same, *slot)) + bitmap_set_bit ((*slot)->inverse, bb->index); + same_succ_reset (same); + } +} + +/* Find bbs with same successors. */ + +static void +find_same_succ (void) +{ + same_succ_t same = same_succ_alloc (); + basic_block bb; + + FOR_EACH_BB (bb) + { + find_same_succ_bb (bb, &same); + if (same == NULL) + same = same_succ_alloc (); + } + + same_succ_delete (same); +} + +/* Initializes worklist administration. */ + +static void +init_worklist (void) +{ + alloc_aux_for_blocks (sizeof (struct aux_bb_info)); + same_succ_htab + = htab_create (n_basic_blocks, same_succ_hash, same_succ_equal, + same_succ_delete); + same_succ_edge_flags = XCNEWVEC (int, last_basic_block); + deleted_bbs = BITMAP_ALLOC (NULL); + deleted_bb_preds = BITMAP_ALLOC (NULL); + worklist = VEC_alloc (same_succ_t, heap, n_basic_blocks); + find_same_succ (); + + if (dump_file) + { + fprintf (dump_file, "initial worklist:\n"); + print_worklist (dump_file); + } +} + +/* Deletes worklist administration. */ + +static void +delete_worklist (void) +{ + free_aux_for_blocks (); + htab_delete (same_succ_htab); + same_succ_htab = NULL; + XDELETEVEC (same_succ_edge_flags); + same_succ_edge_flags = NULL; + BITMAP_FREE (deleted_bbs); + BITMAP_FREE (deleted_bb_preds); + VEC_free (same_succ_t, heap, worklist); +} + +/* Mark BB as deleted, and mark its predecessors. */ + +static void +delete_basic_block_same_succ (basic_block bb) +{ + edge e; + edge_iterator ei; + + bitmap_set_bit (deleted_bbs, bb->index); + + FOR_EACH_EDGE (e, ei, bb->preds) + bitmap_set_bit (deleted_bb_preds, e->src->index); +} + +/* Removes all bbs in BBS from their corresponding same_succ. */ + +static void +same_succ_flush_bbs (bitmap bbs) +{ + unsigned int i; + bitmap_iterator bi; + + EXECUTE_IF_SET_IN_BITMAP (bbs, 0, i, bi) + { + basic_block bb = BASIC_BLOCK (i); + same_succ_t same = BB_SAME_SUCC (bb); + BB_SAME_SUCC (bb) = NULL; + if (bitmap_single_bit_set_p (same->bbs)) + htab_remove_elt_with_hash (same_succ_htab, same, same->hashval); + else + bitmap_clear_bit (same->bbs, i); + } +} + +/* For deleted_bb_preds, find bbs with same successors. 
*/ + +static void +update_worklist (void) +{ + unsigned int i; + bitmap_iterator bi; + basic_block bb; + same_succ_t same; + + bitmap_and_compl_into (deleted_bb_preds, deleted_bbs); + bitmap_clear_bit (deleted_bb_preds, ENTRY_BLOCK); + same_succ_flush_bbs (deleted_bbs); + same_succ_flush_bbs (deleted_bb_preds); + + EXECUTE_IF_SET_IN_BITMAP (deleted_bbs, 0, i, bi) + delete_basic_block (BASIC_BLOCK (i)); + + same = same_succ_alloc (); + EXECUTE_IF_SET_IN_BITMAP (deleted_bb_preds, 0, i, bi) + { + bb = BASIC_BLOCK (i); + gcc_assert (bb != NULL); + find_same_succ_bb (bb, &same); + if (same == NULL) + same = same_succ_alloc (); + } + same_succ_delete (same); + + bitmap_clear (deleted_bbs); + bitmap_clear (deleted_bb_preds); +} + +/* Prints cluster C to FILE. */ + +static void +print_cluster (FILE *file, bb_cluster_t c) +{ + if (c == NULL) + return; + bitmap_print (file, c->bbs, "bbs:", "\n"); + bitmap_print (file, c->preds, "preds:", "\n"); +} + +/* Prints cluster C to stderr. */ + +extern void debug_cluster (bb_cluster_t); +DEBUG_FUNCTION void +debug_cluster (bb_cluster_t c) +{ + print_cluster (stderr, c); +} + +/* Returns true if bb1 and bb2 have the same predecessors. */ + +static bool +same_predecessors (basic_block bb1, basic_block bb2) +{ + edge e; + edge_iterator ei; + unsigned int n1 = EDGE_COUNT (bb1->preds), n2 = EDGE_COUNT (bb2->preds); + + if (n1 != n2) + return false; + + FOR_EACH_EDGE (e, ei, bb1->preds) + if (!find_edge (e->src, bb2)) + return false; + + return true; +} + +/* Add BB to cluster C. Sets BB in C->bbs, and preds of BB in C->preds. */ + +static void +add_bb_to_cluster (bb_cluster_t c, basic_block bb) +{ + edge e; + edge_iterator ei; + + bitmap_set_bit (c->bbs, bb->index); + + FOR_EACH_EDGE (e, ei, bb->preds) + bitmap_set_bit (c->preds, e->src->index); +} + +/* Allocate and init new cluster. */ + +static bb_cluster_t +new_cluster (void) +{ + bb_cluster_t c; + c = XCNEW (struct bb_cluster); + c->bbs = BITMAP_ALLOC (NULL); + c->preds = BITMAP_ALLOC (NULL); + return c; +} + +/* Delete clusters. */ + +static void +delete_cluster (bb_cluster_t c) +{ + if (c == NULL) + return; + BITMAP_FREE (c->bbs); + BITMAP_FREE (c->preds); + XDELETE (c); +} + +DEF_VEC_P (bb_cluster_t); +DEF_VEC_ALLOC_P (bb_cluster_t, heap); + +/* Array that contains all clusters. */ + +static VEC (bb_cluster_t, heap) *all_clusters; + +/* Allocate all cluster vectors. */ + +static void +alloc_cluster_vectors (void) +{ + all_clusters = VEC_alloc (bb_cluster_t, heap, n_basic_blocks); +} + +/* Reset all cluster vectors. */ + +static void +reset_cluster_vectors (void) +{ + unsigned int i; + basic_block bb; + for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i) + delete_cluster (VEC_index (bb_cluster_t, all_clusters, i)); + VEC_truncate (bb_cluster_t, all_clusters, 0); + FOR_EACH_BB (bb) + BB_CLUSTER (bb) = NULL; +} + +/* Delete all cluster vectors. */ + +static void +delete_cluster_vectors (void) +{ + unsigned int i; + for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i) + delete_cluster (VEC_index (bb_cluster_t, all_clusters, i)); + VEC_free (bb_cluster_t, heap, all_clusters); +} + +/* Merge cluster C2 into C1. */ + +static void +merge_clusters (bb_cluster_t c1, bb_cluster_t c2) +{ + bitmap_ior_into (c1->bbs, c2->bbs); + bitmap_ior_into (c1->preds, c2->preds); +} + +/* Register equivalence of BB1 and BB2 (members of cluster C). Store c in + all_clusters, or merge c with existing cluster. 
+
+static void
+set_cluster (basic_block bb1, basic_block bb2)
+{
+  basic_block merge_bb, other_bb;
+  bb_cluster_t merge, old, c;
+
+  if (BB_CLUSTER (bb1) == NULL && BB_CLUSTER (bb2) == NULL)
+    {
+      c = new_cluster ();
+      add_bb_to_cluster (c, bb1);
+      add_bb_to_cluster (c, bb2);
+      BB_CLUSTER (bb1) = c;
+      BB_CLUSTER (bb2) = c;
+      c->index = VEC_length (bb_cluster_t, all_clusters);
+      VEC_safe_push (bb_cluster_t, heap, all_clusters, c);
+    }
+  else if (BB_CLUSTER (bb1) == NULL || BB_CLUSTER (bb2) == NULL)
+    {
+      merge_bb = BB_CLUSTER (bb1) == NULL ? bb2 : bb1;
+      other_bb = BB_CLUSTER (bb1) == NULL ? bb1 : bb2;
+      merge = BB_CLUSTER (merge_bb);
+      add_bb_to_cluster (merge, other_bb);
+      BB_CLUSTER (other_bb) = merge;
+    }
+  else if (BB_CLUSTER (bb1) != BB_CLUSTER (bb2))
+    {
+      unsigned int i;
+      bitmap_iterator bi;
+
+      old = BB_CLUSTER (bb2);
+      merge = BB_CLUSTER (bb1);
+      merge_clusters (merge, old);
+      EXECUTE_IF_SET_IN_BITMAP (old->bbs, 0, i, bi)
+        BB_CLUSTER (BASIC_BLOCK (i)) = merge;
+      VEC_replace (bb_cluster_t, all_clusters, old->index, NULL);
+      delete_cluster (old);
+    }
+  else
+    gcc_unreachable ();
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  SAME_PREDS indicates
+   whether gimple_bb (s1) and gimple_bb (s2) (members of SAME_SUCC) have the
+   same predecessors.  */
+
+static bool
+gimple_equal_p (same_succ_t same_succ, gimple s1, gimple s2, bool same_preds)
+{
+  unsigned int i;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+  tree t1, t2;
+  bool equal, inv_cond;
+  enum tree_code code1, code2;
+
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  switch (gimple_code (s1))
+    {
+    case GIMPLE_CALL:
+      if (gimple_call_num_args (s1) != gimple_call_num_args (s2))
+        return false;
+      if (!gimple_call_same_target_p (s1, s2))
+        return false;
+
+      equal = true;
+      for (i = 0; i < gimple_call_num_args (s1); ++i)
+        {
+          t1 = gimple_call_arg (s1, i);
+          t2 = gimple_call_arg (s2, i);
+          if (operand_equal_p (t1, t2, 0))
+            continue;
+          if (gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+            continue;
+          equal = false;
+          break;
+        }
+      if (equal)
+        return true;
+
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (lhs1 != NULL_TREE && lhs2 != NULL_TREE && same_preds
+              && TREE_CODE (lhs1) == SSA_NAME && TREE_CODE (lhs2) == SSA_NAME
+              && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_ASSIGN:
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+              && TREE_CODE (lhs2) == SSA_NAME
+              && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_COND:
+      t1 = gimple_cond_lhs (s1);
+      t2 = gimple_cond_lhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+          && !gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+        return false;
+
+      t1 = gimple_cond_rhs (s1);
+      t2 = gimple_cond_rhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+          && !gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+        return false;
+
+      code1 = gimple_expr_code (s1);
+      code2 = gimple_expr_code (s2);
+      inv_cond = (bitmap_bit_p (same_succ->inverse, bb1->index)
+                  != bitmap_bit_p (same_succ->inverse, bb2->index));
+      if (inv_cond)
+        {
+          bool honor_nans
+            = HONOR_NANS (TYPE_MODE (TREE_TYPE (gimple_cond_lhs (s1))));
+          code2 = invert_tree_comparison (code2, honor_nans);
+        }
+      return code1 == code2;
+
+    default:
+      return false;
+    }
+}
+
+/* Determines whether BB1 and BB2 (members of same_succ) are duplicates.  If
+   so, clusters them.  SAME_PREDS indicates whether BB1 and BB2 have the same
+   predecessors.  */
+
+static void
+find_duplicate (same_succ_t same_succ, basic_block bb1, basic_block bb2,
+                bool same_preds)
+{
+  gimple_stmt_iterator gsi1 = gsi_last_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_last_nondebug_bb (bb2);
+  bool end1 = gsi_end_p (gsi1);
+  bool end2 = gsi_end_p (gsi2);
+
+  while (!end1 && !end2)
+    {
+      if (!gimple_equal_p (same_succ, gsi_stmt (gsi1), gsi_stmt (gsi2),
+                           same_preds))
+        return;
+
+      gsi_prev_nondebug (&gsi1);
+      gsi_prev_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+
+  if (!(end1 && end2))
+    return;
+
+  if (dump_file)
+    fprintf (dump_file, "find_duplicates: <bb %d> duplicate of <bb %d>\n",
+             bb1->index, bb2->index);
+
+  set_cluster (bb1, bb2);
+}
+
+/* Returns whether for all phis in DEST the phi alternatives for E1 and
+   E2 are equal.  SAME_PREDS indicates whether BB1 and BB2 have the same
+   predecessors.  */
+
+static bool
+same_phi_alternatives_1 (basic_block dest, edge e1, edge e2, bool same_preds)
+{
+  int n1 = e1->dest_idx, n2 = e2->dest_idx;
+  basic_block bb1 = e1->src, bb2 = e2->src;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree lhs = gimple_phi_result (phi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      if (!is_gimple_reg (lhs))
+        continue;
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+        continue;
+      if (gvn_uses_equal (val1, val2, bb1, bb2, same_preds))
+        continue;
+
+      return false;
+    }
+
+  return true;
+}
+
+/* Returns whether for all successors of BB1 and BB2 (members of SAME_SUCC),
+   the phi alternatives for BB1 and BB2 are equal.  SAME_PREDS indicates
+   whether BB1 and BB2 have the same predecessors.  */
+
+static bool
+same_phi_alternatives (same_succ_t same_succ, basic_block bb1, basic_block bb2,
+                       bool same_preds)
+{
+  unsigned int s;
+  bitmap_iterator bs;
+  edge e1, e2;
+  basic_block succ;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->succs, 0, s, bs)
+    {
+      succ = BASIC_BLOCK (s);
+      e1 = find_edge (bb1, succ);
+      e2 = find_edge (bb2, succ);
+      if (e1->flags & EDGE_COMPLEX
+          || e2->flags & EDGE_COMPLEX)
+        return false;
+
+      /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+         the same value.  */
+      if (!same_phi_alternatives_1 (succ, e1, e2, same_preds))
+        return false;
+    }
+
+  return true;
+}
+
+/* Return true if BB has non-vop phis.  */
+
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return is_gimple_reg (gimple_phi_result (phi));
+}
+
+/* Within SAME_SUCC->bbs, find clusters of bbs which can be merged.  */
+
+static void
+find_clusters_1 (same_succ_t same_succ)
+{
+  basic_block bb1, bb2;
+  unsigned int i, j;
+  bitmap_iterator bi, bj;
+  bool same_preds;
+  int nr_comparisons;
+  int max_comparisons = PARAM_VALUE (PARAM_MAX_TAIL_MERGE_COMPARISONS);
+
+  if (same_succ == NULL)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, 0, i, bi)
+    {
+      bb1 = BASIC_BLOCK (i);
+
+      /* TODO: handle blocks with phi-nodes.  We'll have to find corresponding
+         phi-nodes in bb1 and bb2, with the same alternatives for the same
+         preds.  */
+      if (bb_has_non_vop_phi (bb1))
+        continue;
+
+      nr_comparisons = 0;
+      EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, i + 1, j, bj)
+        {
+          bb2 = BASIC_BLOCK (j);
+
+          if (bb_has_non_vop_phi (bb2))
+            continue;
+
+          if (BB_CLUSTER (bb1) != NULL
+              && BB_CLUSTER (bb1) == BB_CLUSTER (bb2))
+            continue;
+
+          /* Limit quadratic behaviour.  */
+          nr_comparisons++;
+          if (nr_comparisons > max_comparisons)
+            break;
+
+          same_preds = same_predecessors (bb1, bb2);
+
+          if (!same_phi_alternatives (same_succ, bb1, bb2, same_preds))
+            continue;
+          find_duplicate (same_succ, bb1, bb2, same_preds);
+        }
+    }
+}
+
+/* Find clusters of bbs which can be merged.  */
+
+static void
+find_clusters (void)
+{
+  same_succ_t same;
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      same = VEC_pop (same_succ_t, worklist);
+      same->in_worklist = false;
+      if (dump_file)
+        {
+          fprintf (dump_file, "processing worklist entry\n");
+          same_succ_print (dump_file, same);
+        }
+      find_clusters_1 (same);
+    }
+}
+
+/* Merge the alias info of the calls in BB1 into the calls in BB2.  */
+
+static void
+merge_calls (basic_block bb1, basic_block bb2)
+{
+  gimple_stmt_iterator gsi1 = gsi_start_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_start_nondebug_bb (bb2);
+  bool end1, end2;
+  gimple s1, s2;
+
+  end1 = gsi_end_p (gsi1);
+  end2 = gsi_end_p (gsi2);
+
+  while (true)
+    {
+      if (end1 && end2)
+        return;
+      gcc_assert (!end1 && !end2);
+      s1 = gsi_stmt (gsi1);
+      s2 = gsi_stmt (gsi2);
+
+      if (is_gimple_call (s1) && is_gimple_call (s2))
+        {
+          pt_solution_ior_into_shared (gimple_call_use_set (s2),
+                                       gimple_call_use_set (s1));
+          pt_solution_ior_into_shared (gimple_call_clobber_set (s2),
+                                       gimple_call_clobber_set (s1));
+        }
+      else
+        gcc_assert (!is_gimple_call (s1) && !is_gimple_call (s2));
+
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+}
+
+/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
+   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
+   phi is created, use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+              VEC (edge,heap) *redirected_edges)
+{
+  gimple stmt, phi = NULL;
+  tree lhs, arg, current_arg;
+  unsigned int i;
+  gimple def_stmt2;
+  source_location locus1, locus2;
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  edge_iterator ei;
+  edge e;
+
+  if (vuse2 == NULL_TREE)
+    return;
+
+  def_stmt2 = SSA_NAME_DEF_STMT (vuse2);
+
+  /* Update existing phi.  */
+  if (gimple_bb (def_stmt2) == bb2)
+    {
+      phi = def_stmt2;
+
+      for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+        {
+          e = VEC_index (edge, redirected_edges, i);
+          if (vuse1)
+            arg = vuse1;
+          else
+            arg = BB_VOP_AT_EXIT (e->src);
+          current_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+          if (current_arg == NULL)
+            {
+              locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+              add_phi_arg (phi, arg, e, locus1);
+            }
+          else
+            gcc_assert (arg == current_arg);
+        }
+      return;
+    }
+
+  /* No need to create a phi with 2 equal arguments.  */
+  if (vuse1 == vuse2)
+    return;
+
+  locus2 = gimple_location (def_stmt2);
+
+  /* Create a phi, first with default argument vuse2 for all preds.  */
+  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+  VN_INFO_GET (lhs);
+  phi = create_phi_node (lhs, bb2);
+  SSA_NAME_DEF_STMT (lhs) = phi;
+  FOR_EACH_EDGE (e, ei, bb2->preds)
+    add_phi_arg (phi, vuse2, e, locus2);
+
+  /* Now overwrite the arguments associated with the redirected edges with
+     vuse1.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+    {
+      e = VEC_index (edge, redirected_edges, i);
+      gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));
+      if (vuse1)
+        arg = vuse1;
+      else
+        arg = BB_VOP_AT_EXIT (e->src);
+      SET_PHI_ARG_DEF (phi, e->dest_idx, arg);
+      locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+      gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
+    }
+
+  /* Replace uses of vuse2 in bb2 with phi.  */
+  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
+    {
+      if (gimple_code (stmt) == GIMPLE_PHI)
+        {
+          edge e;
+          if (stmt == phi)
+            continue;
+          e = find_edge (bb2, gimple_bb (stmt));
+          if (e == NULL)
+            continue;
+          use_p = PHI_ARG_DEF_PTR_FROM_EDGE (stmt, e);
+          SET_USE (use_p, lhs);
+          update_stmt (stmt);
+        }
+      else if (gimple_bb (stmt) == bb2)
+        {
+          FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+            SET_USE (use_p, lhs);
+          update_stmt (stmt);
+        }
+    }
+}
+
+/* Returns the vop phi of BB, if any.  */
+
+static gimple
+vop_phi (basic_block bb)
+{
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      if (is_gimple_reg (gimple_phi_result (stmt)))
+        continue;
+      return stmt;
+    }
+  return NULL;
+}
+
+/* Scans the vdefs and vuses of the insns of BB, and returns the vop at entry
+   in VOP_AT_ENTRY, and the vop at exit in VOP_AT_EXIT.  */
+
+static void
+insn_vops (basic_block bb, tree *vop_at_entry, tree *vop_at_exit)
+{
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  tree vuse, vdef;
+  tree last_vdef = NULL_TREE;
+
+  if (*vop_at_entry != NULL_TREE && *vop_at_exit != NULL_TREE)
+    return;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      vuse = gimple_vuse (stmt);
+      vdef = gimple_vdef (stmt);
+      if (vuse != NULL_TREE && *vop_at_entry == NULL_TREE)
+        {
+          *vop_at_entry = vuse;
+          if (*vop_at_exit != NULL_TREE)
+            return;
+        }
+      if (vdef != NULL_TREE)
+        last_vdef = vdef;
+    }
+
+  *vop_at_exit = last_vdef != NULL_TREE ? last_vdef : *vop_at_entry;
+}
+
+/* Returns the vop at entry of BB1 in VOP_AT_ENTRY1, and the one of BB2 in
+   VOP_AT_ENTRY2, where BB1 and BB2 have the same successors.  */
+
+static void
+vop_at_entry (basic_block bb1, basic_block bb2, tree *vop_at_entry1,
+              tree *vop_at_entry2)
+{
+  gimple succ_phi, bb1_phi, bb2_phi;
+  basic_block succ;
+  tree vop_at_exit1 = NULL_TREE, vop_at_exit2 = NULL_TREE;
+  bool same_at_exit;
+
+  bb1_phi = vop_phi (bb1);
+  bb2_phi = vop_phi (bb2);
+
+  *vop_at_entry1 = bb1_phi != NULL ? gimple_phi_result (bb1_phi) : NULL_TREE;
+  *vop_at_entry2 = bb2_phi != NULL ? gimple_phi_result (bb2_phi) : NULL_TREE;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  if (EDGE_COUNT (bb1->succs) != 0)
+    {
+      succ = EDGE_SUCC (bb1, 0)->dest;
+      succ_phi = vop_phi (succ);
+      if (succ_phi != NULL)
+        {
+          vop_at_exit1
+            = PHI_ARG_DEF_FROM_EDGE (succ_phi, find_edge (bb1, succ));
+          vop_at_exit2
+            = PHI_ARG_DEF_FROM_EDGE (succ_phi, find_edge (bb2, succ));
+        }
+    }
+
+  same_at_exit = vop_at_exit1 == vop_at_exit2;
+
+  if (*vop_at_entry1 == NULL_TREE && vop_at_exit1 != NULL_TREE
+      && gimple_bb (SSA_NAME_DEF_STMT (vop_at_exit1)) != bb1)
+    *vop_at_entry1 = vop_at_exit1;
+
+  if (*vop_at_entry2 == NULL_TREE && vop_at_exit2 != NULL_TREE
+      && gimple_bb (SSA_NAME_DEF_STMT (vop_at_exit2)) != bb2)
+    *vop_at_entry2 = vop_at_exit2;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  insn_vops (bb1, vop_at_entry1, &vop_at_exit1);
+  insn_vops (bb2, vop_at_entry2, &vop_at_exit2);
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  if (same_at_exit && vop_at_exit1 != NULL_TREE
+      && *vop_at_entry2 == NULL_TREE
+      && dominated_by_p (CDI_DOMINATORS, bb2, bb1))
+    *vop_at_entry2 = vop_at_exit1;
+
+  if (same_at_exit && vop_at_exit2 != NULL_TREE
+      && *vop_at_entry1 == NULL_TREE
+      && dominated_by_p (CDI_DOMINATORS, bb1, bb2))
+    *vop_at_entry1 = vop_at_exit2;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  gcc_assert (*vop_at_entry1 == NULL_TREE && *vop_at_entry2 == NULL_TREE);
+}
+
+/* Redirects all edges from BB1 to BB2, marks BB1 for removal, and if
+   UPDATE_VOPS, inserts vop phis.  */
+
+static void
+replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
+{
+  edge pred_edge;
+  unsigned int i;
+  tree phi_vuse1, phi_vuse2, arg;
+  VEC (edge,heap) *redirected_edges = NULL;
+  edge e;
+  edge_iterator ei;
+
+  if (update_vops)
+    {
+      vop_at_entry (bb1, bb2, &phi_vuse1, &phi_vuse2);
+
+      if (phi_vuse1 != NULL_TREE
+          && gimple_bb (SSA_NAME_DEF_STMT (phi_vuse1)) == bb1)
+        {
+          FOR_EACH_EDGE (e, ei, bb1->preds)
+            {
+              arg = PHI_ARG_DEF_FROM_EDGE (SSA_NAME_DEF_STMT (phi_vuse1), e);
+              BB_VOP_AT_EXIT (e->src) = arg;
+            }
+          phi_vuse1 = NULL_TREE;
+        }
+      redirected_edges = VEC_alloc (edge, heap, 10);
+    }
+
+  delete_basic_block_same_succ (bb1);
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+      if (update_vops)
+        VEC_safe_push (edge, heap, redirected_edges, pred_edge);
+    }
+
+  if (update_vops)
+    {
+      update_vuses (phi_vuse1, phi_vuse2, bb2, redirected_edges);
+      VEC_free (edge, heap, redirected_edges);
+    }
+
+  merge_calls (bb1, bb2);
+}
+
+/* Bbs for which update_debug_stmt needs to be called.  */
+
+static bitmap update_bbs;
+
+/* For each cluster in all_clusters, merge all cluster->bbs.  Returns the
+   number of bbs removed.  Insert vop phis if UPDATE_VOPS.  */
+
+static int
+apply_clusters (bool update_vops)
+{
+  basic_block bb1, bb2;
+  bb_cluster_t c;
+  unsigned int i, j;
+  bitmap_iterator bj;
+  int nr_bbs_removed = 0;
+
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    {
+      c = VEC_index (bb_cluster_t, all_clusters, i);
+      if (c == NULL)
+        continue;
+
+      bb2 = BASIC_BLOCK (bitmap_first_set_bit (c->bbs));
+      gcc_assert (bb2 != NULL);
+
+      bitmap_set_bit (update_bbs, bb2->index);
+      EXECUTE_IF_SET_IN_BITMAP (c->bbs, 0, j, bj)
+        {
+          bb1 = BASIC_BLOCK (j);
+          gcc_assert (bb1 != NULL);
+          if (bb1 == bb2)
+            continue;
+
+          bitmap_clear_bit (update_bbs, bb1->index);
+          replace_block_by (bb1, bb2, update_vops);
+          nr_bbs_removed++;
+        }
+    }
+
+  return nr_bbs_removed;
+}
+
+/* Resets debug statement STMT if it has uses that are not dominated by their
+   defs.  */
+
+static void
+update_debug_stmt (gimple stmt)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef, bbuse;
+  gimple def_stmt;
+  tree name;
+
+  if (!gimple_debug_bind_p (stmt))
+    return;
+
+  bbuse = gimple_bb (stmt);
+  FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+    {
+      name = USE_FROM_PTR (use_p);
+      gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+      def_stmt = SSA_NAME_DEF_STMT (name);
+      gcc_assert (def_stmt != NULL);
+
+      bbdef = gimple_bb (def_stmt);
+      if (bbdef == NULL || bbuse == bbdef
+          || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+        continue;
+
+      gimple_debug_bind_reset_value (stmt);
+      update_stmt (stmt);
+    }
+}
+
+/* Resets all debug statements that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (void)
+{
+  basic_block bb;
+  bitmap_iterator bi;
+  unsigned int i;
+
+  if (!MAY_HAVE_DEBUG_STMTS)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (update_bbs, 0, i, bi)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+          stmt = gsi_stmt (gsi);
+          if (!is_gimple_debug (stmt))
+            continue;
+          update_debug_stmt (stmt);
+        }
+    }
+}
+
+/* Runs tail merge optimization.  */
+
+unsigned int
+tail_merge_optimize (unsigned int todo)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+                      || !symbol_marked_for_renaming (gimple_vop (cfun)));
+
+  if (!flag_tree_tail_merge)
+    return 0;
+
+  timevar_push (TV_TREE_TAIL_MERGE);
+
+  init_worklist ();
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      if (!loop_entered)
+        {
+          loop_entered = true;
+          alloc_cluster_vectors ();
+          update_bbs = BITMAP_ALLOC (NULL);
+        }
+      else
+        reset_cluster_vectors ();
+
+      iteration_nr++;
+      if (dump_file)
+        fprintf (dump_file, "worklist iteration #%d\n", iteration_nr);
+
+      calculate_dominance_info (CDI_DOMINATORS);
+      find_clusters ();
+      gcc_assert (VEC_empty (same_succ_t, worklist));
+      if (VEC_empty (bb_cluster_t, all_clusters))
+        break;
+
+      nr_bbs_removed = apply_clusters (update_vops);
+      nr_bbs_removed_total += nr_bbs_removed;
+      if (nr_bbs_removed == 0)
+        break;
+
+      free_dominance_info (CDI_DOMINATORS);
+      update_worklist ();
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "htab collision / search: %f\n",
+             htab_collisions (same_succ_htab));
+
+  if (nr_bbs_removed_total > 0)
+    {
+      calculate_dominance_info (CDI_DOMINATORS);
+      update_debug_stmts ();
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fprintf (dump_file, "Before TODOs.\n");
+          dump_function_to_file (current_function_decl, dump_file, dump_flags);
+        }
+
+      todo |= (TODO_verify_ssa | TODO_verify_stmts | TODO_verify_flow
+               | TODO_dump_func);
+    }
+
+  delete_worklist ();
+  if (loop_entered)
+    {
+      delete_cluster_vectors ();
+      BITMAP_FREE (update_bbs);
+    }
+
+  timevar_pop (TV_TREE_TAIL_MERGE);
+
+  return todo;
+}
Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 175801)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -2872,19 +2872,6 @@ simplify_unary_expression (gimple stmt)
   return NULL_TREE;
 }
 
-/* Valueize NAME if it is an SSA name, otherwise just return it.  */
-
-static inline tree
-vn_valueize (tree name)
-{
-  if (TREE_CODE (name) == SSA_NAME)
-    {
-      tree tem = SSA_VAL (name);
-      return tem == VN_TOP ? name : tem;
-    }
-  return name;
-}
-
 /* Try to simplify RHS using equivalences and constant folding.  */
 
 static tree
Index: gcc/tree-ssa-sccvn.h
===================================================================
--- gcc/tree-ssa-sccvn.h	(revision 175801)
+++ gcc/tree-ssa-sccvn.h	(working copy)
@@ -209,4 +209,18 @@ unsigned int get_constant_value_id (tree
 unsigned int get_or_alloc_constant_value_id (tree);
 bool value_id_constant_p (unsigned int);
 tree fully_constant_vn_reference_p (vn_reference_t);
+
+/* Valueize NAME if it is an SSA name, otherwise just return it.  */
+
+static inline tree
+vn_valueize (tree name)
+{
+  if (TREE_CODE (name) == SSA_NAME)
+    {
+      tree tem = VN_INFO (name)->valnum;
+      return tem == VN_TOP ? name : tem;
+    }
+  return name;
+}
+
 #endif /* TREE_SSA_SCCVN_H */
Index: gcc/tree-ssa-alias.h
===================================================================
--- gcc/tree-ssa-alias.h	(revision 175801)
+++ gcc/tree-ssa-alias.h	(working copy)
@@ -134,6 +134,8 @@ extern bool pt_solutions_same_restrict_b
 extern void pt_solution_reset (struct pt_solution *);
 extern void pt_solution_set (struct pt_solution *, bitmap, bool, bool);
 extern void pt_solution_set_var (struct pt_solution *, tree);
+extern void pt_solution_ior_into_shared (struct pt_solution *,
+                                         struct pt_solution *);
 extern void dump_pta_stats (FILE *);
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 175801)
+++ gcc/opts.c	(working copy)
@@ -484,6 +484,7 @@ static const struct default_options defa
     { OPT_LEVELS_2_PLUS, OPT_falign_jumps, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_labels, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 175801)
+++ gcc/timevar.def	(working copy)
@@ -127,6 +127,7 @@ DEFTIMEVAR (TV_TREE_GIMPLIFY	     , "tre
 DEFTIMEVAR (TV_TREE_EH		     , "tree eh")
 DEFTIMEVAR (TV_TREE_CFG		     , "tree CFG construction")
 DEFTIMEVAR (TV_TREE_CLEANUP_CFG	     , "tree CFG cleanup")
+DEFTIMEVAR (TV_TREE_TAIL_MERGE       , "tree tail merge")
 DEFTIMEVAR (TV_TREE_VRP              , "tree VRP")
 DEFTIMEVAR (TV_TREE_COPY_PROP	     , "tree copy propagation")
 DEFTIMEVAR (TV_FIND_REFERENCED_VARS  , "tree find ref. vars")
Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c	(revision 175801)
+++ gcc/tree-ssa-pre.c	(working copy)
@@ -4935,7 +4935,6 @@ execute_pre (bool do_fre)
   statistics_counter_event (cfun, "Constified", pre_stats.constified);
 
   clear_expression_ids ();
-  free_scc_vn ();
   if (!do_fre)
    {
      remove_dead_inserted_code ();
@@ -4945,6 +4944,9 @@ execute_pre (bool do_fre)
    scev_finalize ();
   fini_pre (do_fre);
 
+  todo |= tail_merge_optimize (todo);
+  free_scc_vn ();
+
   return todo;
 }
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 175801)
+++ gcc/common.opt	(working copy)
@@ -1937,6 +1937,10 @@ ftree-dominator-opts
 Common Report Var(flag_tree_dom) Optimization
 Enable dominator optimizations
 
+ftree-tail-merge
+Common Report Var(flag_tree_tail_merge) Optimization
+Enable tail merging on trees
+
 ftree-dse
 Common Report Var(flag_tree_dse) Optimization
 Enable dead store elimination
Index: gcc/tree-flow.h
===================================================================
--- gcc/tree-flow.h	(revision 175801)
+++ gcc/tree-flow.h	(working copy)
@@ -806,6 +806,9 @@ bool multiplier_allowed_in_address_p (HO
 unsigned multiply_by_cost (HOST_WIDE_INT, enum machine_mode, bool);
 bool may_be_nonaddressable_p (tree expr);
 
+/* In tree-ssa-tail-merge.c.  */
+extern unsigned int tail_merge_optimize (unsigned int);
+
 /* In tree-ssa-threadupdate.c.  */
 extern bool thread_through_all_blocks (bool);
 extern void register_jump_thread (edge, edge, edge);
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 175801)
+++ gcc/Makefile.in	(working copy)
@@ -1466,6 +1466,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-sink.o \
 	tree-ssa-structalias.o \
+	tree-ssa-tail-merge.o \
 	tree-ssa-ter.o \
 	tree-ssa-threadedge.o \
 	tree-ssa-threadupdate.o \
@@ -2427,6 +2428,13 @@ stor-layout.o : stor-layout.c $(CONFIG_H
    $(TREE_H) $(PARAMS_H) $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) output.h $(RTL_H) \
    $(GGC_H) $(TM_P_H) $(TARGET_H) langhooks.h $(REGS_H) gt-stor-layout.h \
    $(DIAGNOSTIC_CORE_H) $(CGRAPH_H) $(TREE_INLINE_H) $(TREE_DUMP_H) $(GIMPLE_H)
+tree-ssa-tail-merge.o: tree-ssa-tail-merge.c \
+   $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(BITMAP_H) \
+   $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
+   $(TREE_H) $(TREE_FLOW_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(FUNCTION_H) \
+   $(TIMEVAR_H) tree-ssa-sccvn.h \
+   $(CGRAPH_H) gimple-pretty-print.h tree-pretty-print.h $(PARAMS_H)
 tree-ssa-structalias.o: tree-ssa-structalias.c \
    $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(GGC_H) $(OBSTACK_H) $(BITMAP_H) \
    $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 175801)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -5688,6 +5688,48 @@ shared_bitmap_add (bitmap pt_vars)
   *slot = (void *) sbi;
 }
 
+/* Unshares the points-to bitmap of PT.  */
+
+static void
+pt_solution_unshare (struct pt_solution *pt)
+{
+  bitmap copy;
+
+  if (pt == NULL || pt->vars == NULL || shared_bitmap_table == NULL)
+    return;
+
+  copy = BITMAP_GGC_ALLOC ();
+  bitmap_copy (copy, pt->vars);
+  pt->vars = copy;
+}
+
+/* Shares the points-to bitmap of PT.  */
+
+static void
+pt_solution_share (struct pt_solution *pt)
+{
+  bitmap shared;
+
+  if (pt == NULL || pt->vars == NULL || shared_bitmap_table == NULL)
+    return;
+
+  shared = shared_bitmap_lookup (pt->vars);
+
+  if (!shared)
+    {
+      /* Share unshared bitmap.  */
+      shared_bitmap_add (pt->vars);
+      return;
+    }
+
+  /* Already using shared bitmap.  */
+  if (shared == pt->vars)
+    return;
+
+  /* Use shared bitmap.  */
+  bitmap_clear (pt->vars);
+  pt->vars = shared;
+}
+
 /* Set bits in INTO corresponding to the variable uids in solution set
    FROM.  */
 
@@ -5734,7 +5776,6 @@ find_what_var_points_to (varinfo_t orig_
   unsigned int i;
   bitmap_iterator bi;
   bitmap finished_solution;
-  bitmap result;
   varinfo_t vi;
 
   memset (pt, 0, sizeof (struct pt_solution));
@@ -5788,17 +5829,8 @@ find_what_var_points_to (varinfo_t orig_
   stats.points_to_sets_created++;
 
   set_uids_in_ptset (finished_solution, vi->solution, pt);
-  result = shared_bitmap_lookup (finished_solution);
-  if (!result)
-    {
-      shared_bitmap_add (finished_solution);
-      pt->vars = finished_solution;
-    }
-  else
-    {
-      pt->vars = result;
-      bitmap_clear (finished_solution);
-    }
+  pt->vars = finished_solution;
+  pt_solution_share (pt);
 }
 
 /* Given a pointer variable P, fill in its points-to set.  */
@@ -5921,6 +5953,25 @@ pt_solution_ior_into (struct pt_solution
   bitmap_ior_into (dest->vars, src->vars);
 }
 
+/* Like pt_solution_ior_into, but may be used if the points-to bitmap
+   of *DEST might be shared.  */
+
+void
+pt_solution_ior_into_shared (struct pt_solution *dest, struct pt_solution *src)
+{
+  if (!src->vars)
+    return;
+  if (!dest->vars)
+    {
+      dest->vars = src->vars;
+      return;
+    }
+
+  pt_solution_unshare (dest);
+  pt_solution_ior_into (dest, src);
+  pt_solution_share (dest);
+}
+
 /* Return true if the points-to solution *PT is empty.  */
 
 bool
@@ -6600,6 +6651,7 @@ delete_points_to_sets (void)
   unsigned int i;
 
   htab_delete (shared_bitmap_table);
+  shared_bitmap_table = NULL;
   if (dump_file && (dump_flags & TDF_STATS))
     fprintf (dump_file, "Points to sets created:%d\n",
	     stats.points_to_sets_created);
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 175801)
+++ gcc/params.def	(working copy)
@@ -892,6 +892,11 @@ DEFPARAM (PARAM_MAX_STORES_TO_SINK,
          "Maximum number of conditional store pairs that can be sunk",
          2, 0, 0)
 
+DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
+          "max-tail-merge-comparisons",
+          "Maximum number of similar bbs to compare a bb with",
+          10, 0, 0)
+
 
 /*
 Local variables:
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 175801)
+++ gcc/doc/invoke.texi	(working copy)
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol
 -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-pta -ftree-reassoc @gol
--ftree-sink -ftree-sra -ftree-switch-conversion @gol
+-ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
@@ -6091,7 +6091,7 @@ also turns on the following optimization
 -fsched-interblock  -fsched-spec @gol
 -fschedule-insns  -fschedule-insns2 @gol
 -fstrict-aliasing -fstrict-overflow @gol
--ftree-switch-conversion @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-pre @gol
 -ftree-vrp}
 
@@ -6974,6 +6974,11 @@ Perform conversion of simple initializat
 initializations from a scalar array.  This flag is enabled by default
 at @option{-O2} and higher.
 
+@item -ftree-tail-merge
+Merges identical blocks with the same successors.  This flag is enabled by
+default at @option{-O2} and higher.  The run time of this pass can be limited
+using the @option{max-tail-merge-comparisons} parameter.
+
 @item -ftree-dce
 @opindex ftree-dce
 Perform dead code elimination (DCE) on trees.  This flag is enabled by
@@ -8541,6 +8546,10 @@ This is used to avoid quadratic behavior
 The value of 0 will avoid limiting the search, but may slow down compilation
 of huge functions.  The default value is 30.
 
+@item max-tail-merge-comparisons
+The maximum number of similar bbs to compare a bb with.  This is used to
+avoid quadratic behaviour in tree tail merging.  The default value is 10.
+
 @item max-unrolled-insns
 The maximum number of instructions that a loop should have if that loop
 is unrolled, and if the loop is unrolled, it determines how many times
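
For reference, the new option and parameter documented above could be exercised from the command
line roughly as follows; test.c is a placeholder for any input exhibiting mergeable tails, and the
exact invocations are a sketch rather than part of the patch:

  # Tail merging is enabled by default at -O2 and higher.
  gcc -O2 -S test.c

  # The Common option gets a negative form automatically, so the pass can be disabled with:
  gcc -O2 -fno-tree-tail-merge -S test.c

  # The per-bb comparison limit (default 10) can be adjusted via the new param:
  gcc -O2 --param max-tail-merge-comparisons=20 -S test.c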