diff mbox

[tree-optimization] : Add new path Splitting pass on tree ssa representation

Message ID 5643A732.4040707@redhat.com
State New
Headers show

Commit Message

Jeff Law Nov. 11, 2015, 8:38 p.m. UTC
On 09/04/2015 11:36 AM, Ajit Kumar Agarwal wrote:

>> diff --git a/gcc/passes.def b/gcc/passes.def
>> index 6b66f8f..20ddf3d 100644
>> --- a/gcc/passes.def
>> +++ b/gcc/passes.def
>> @@ -82,6 +82,7 @@ along with GCC; see the file COPYING3.  If not see
>>    	  NEXT_PASS (pass_ccp);
>>    	  /* After CCP we rewrite no longer addressed locals into SSA
>>    	     form if possible.  */
>> +          NEXT_PASS (pass_path_split);
>>    	  NEXT_PASS (pass_forwprop);
>>    	  NEXT_PASS (pass_sra_early);
> I can't recall if we've discussed the location of the pass at all.  I'm
> not objecting to this location, but would like to hear why you chose
> this particular location in the optimization pipeline.
So returning to the question of where this would live in the 
optimization pipeline and how it interacts with if-conversion and 
vectorization.

The concern with moving it to late in the pipeline was that we'd miss 
VRP/DCE/CSE opportunities.  I'm not sure if you're aware, but we 
actually run those passes more than once.  So it would be possible to 
run path splitting after if-conversion & vectorization, but before the 
second passs of VRP & DOM.  But trying that seems to result in something 
scrambling the loop enough that the path splitting opportunity is 
missed.  That might be worth deeper investigation if we can't come up 
with some kind of heuristics to fire or suppress path splitting.

Other random notes as I look over the code:

Call the pass "path-split", not "path_split".  I don't think we have any 
passes with underscores in their names, dump files, etc.

You factored out the code for transform_duplicate.  When you create new 
functions, they should all have a block comment indicating what they do, 
return values, etc.

I asked you to trim down the #includes in tree-ssa-path-split.c  Most 
were ultimately unnecessary.  The trimmed list is just 11 headers.

Various functions in tree-ssa-path-split.c were missing their block 
comments.  There were several places in tree-ssa-path-split that I felt 
deserved a comment.  While you are familiar with the code, it's likely 
someone else will have to look at and modify this code at some point in 
the future.  The comments help make that easier.

In find_trace_loop_latch_same_as_join_blk, we find the immediate 
dominator of the latch and verify it ends in a conditional.  That's 
fine.  Then we look at the predecessors of the latch to see if one is 
succeeded only by the latch and falls through to the latch.  That is the 
block we'll end up redirecting to a copy of the latch.  Also fine.

Note how there is no testing for the relationship between the immediate 
dominator of the latch and the predecessors of the latch.  ISTM that we 
can have a fairly arbitrary region in the THEN/ELSE arms of the 
conditional.  Was this intentional?  Would it be advisable to verify 
that the THEN/ELSE arms are single blocks?  Do we want to verify that 
neither the THEN/ELSE arms transfer control other than to the latch?  Do 
we want to verify the predecessors of the latch are immediate successors 
of the latch's immediate dominator?

The is_feasible_trace routine was still checking if the block had a 
conversion and rejecting it.  I removed that check.  It does seem to me 
that we need an upper limit on the number of statements.  I wonder if we 
should factor out the maximum statements to copy code from jump 
threading and use it for both jump threading and path splitting.

Instead of creating loop with multiple latches, what ever happened to 
the idea of duplicating the latch block twice -- once into each path. 
Remove the control statement in each duplicate.  Then remove everything 
but the control statement in the original latch.


I added some direct dump support.  Essentially anytime we split the 
path, we output something like this:

Split path in loop: latch block 9, predecessor 7.

That allows tests in the testsuite to look for the "Split path in loop" 
string rather than inferring the information from the SSA graph update's 
replacement table.  It also allows us to do things like count how many 
paths get split if we have more complex tests.

On the topic of tests.  Is the one you provided something where path 
splitting results in a significant improvement?  From looking at the 
x86_64 output, I can see the path splitting transformation occur, but 
not any improvement in the final code.

While the existing test is useful, testing on code that actually 
improves as a result of path splitting is better.  Ideally we'd test 
both that path splitting occurred and that the secondary optimizations 
we wanted triggered.

The tests should go into gcc.dg/tree-ssa rather than just gcc.dg.

ANyway, here's my work-in-progress.  Your thoughts on the various 
questions, concerns, ideas noted above would be appreciated.  Obviously 
I'd like to wrap things up quickly and include this patch in gcc6.

Note, I haven't bootstrapped or regression tested this version.

Comments

Richard Biener Nov. 12, 2015, 10:58 a.m. UTC | #1
On Wed, Nov 11, 2015 at 9:38 PM, Jeff Law <law@redhat.com> wrote:
> On 09/04/2015 11:36 AM, Ajit Kumar Agarwal wrote:
>
>>> diff --git a/gcc/passes.def b/gcc/passes.def
>>> index 6b66f8f..20ddf3d 100644
>>> --- a/gcc/passes.def
>>> +++ b/gcc/passes.def
>>> @@ -82,6 +82,7 @@ along with GCC; see the file COPYING3.  If not see
>>>           NEXT_PASS (pass_ccp);
>>>           /* After CCP we rewrite no longer addressed locals into SSA
>>>              form if possible.  */
>>> +          NEXT_PASS (pass_path_split);
>>>           NEXT_PASS (pass_forwprop);
>>>           NEXT_PASS (pass_sra_early);
>>
>> I can't recall if we've discussed the location of the pass at all.  I'm
>> not objecting to this location, but would like to hear why you chose
>> this particular location in the optimization pipeline.
>
> So returning to the question of where this would live in the optimization
> pipeline and how it interacts with if-conversion and vectorization.

Note that adding passes to the early pipeline that do code duplication
is a no-no.
The early pipeline should be exclusively for things making functions
more suitable
for inlining.

> The concern with moving it to late in the pipeline was that we'd miss
> VRP/DCE/CSE opportunities.  I'm not sure if you're aware, but we actually
> run those passes more than once.  So it would be possible to run path
> splitting after if-conversion & vectorization, but before the second passs
> of VRP & DOM.  But trying that seems to result in something scrambling the
> loop enough that the path splitting opportunity is missed.  That might be
> worth deeper investigation if we can't come up with some kind of heuristics
> to fire or suppress path splitting.

As I still think it is a transform similar to tracer just put it next to that.

But IIRC you mentioned it should enable vectorization or so?  In this case
that's obviously too late.

Richard.

> Other random notes as I look over the code:
>
> Call the pass "path-split", not "path_split".  I don't think we have any
> passes with underscores in their names, dump files, etc.
>
> You factored out the code for transform_duplicate.  When you create new
> functions, they should all have a block comment indicating what they do,
> return values, etc.
>
> I asked you to trim down the #includes in tree-ssa-path-split.c  Most were
> ultimately unnecessary.  The trimmed list is just 11 headers.
>
> Various functions in tree-ssa-path-split.c were missing their block
> comments.  There were several places in tree-ssa-path-split that I felt
> deserved a comment.  While you are familiar with the code, it's likely
> someone else will have to look at and modify this code at some point in the
> future.  The comments help make that easier.
>
> In find_trace_loop_latch_same_as_join_blk, we find the immediate dominator
> of the latch and verify it ends in a conditional.  That's fine.  Then we
> look at the predecessors of the latch to see if one is succeeded only by the
> latch and falls through to the latch.  That is the block we'll end up
> redirecting to a copy of the latch.  Also fine.
>
> Note how there is no testing for the relationship between the immediate
> dominator of the latch and the predecessors of the latch.  ISTM that we can
> have a fairly arbitrary region in the THEN/ELSE arms of the conditional.
> Was this intentional?  Would it be advisable to verify that the THEN/ELSE
> arms are single blocks?  Do we want to verify that neither the THEN/ELSE
> arms transfer control other than to the latch?  Do we want to verify the
> predecessors of the latch are immediate successors of the latch's immediate
> dominator?
>
> The is_feasible_trace routine was still checking if the block had a
> conversion and rejecting it.  I removed that check.  It does seem to me that
> we need an upper limit on the number of statements.  I wonder if we should
> factor out the maximum statements to copy code from jump threading and use
> it for both jump threading and path splitting.
>
> Instead of creating loop with multiple latches, what ever happened to the
> idea of duplicating the latch block twice -- once into each path. Remove the
> control statement in each duplicate.  Then remove everything but the control
> statement in the original latch.
>
>
> I added some direct dump support.  Essentially anytime we split the path, we
> output something like this:
>
> Split path in loop: latch block 9, predecessor 7.
>
> That allows tests in the testsuite to look for the "Split path in loop"
> string rather than inferring the information from the SSA graph update's
> replacement table.  It also allows us to do things like count how many paths
> get split if we have more complex tests.
>
> On the topic of tests.  Is the one you provided something where path
> splitting results in a significant improvement?  From looking at the x86_64
> output, I can see the path splitting transformation occur, but not any
> improvement in the final code.
>
> While the existing test is useful, testing on code that actually improves as
> a result of path splitting is better.  Ideally we'd test both that path
> splitting occurred and that the secondary optimizations we wanted triggered.
>
> The tests should go into gcc.dg/tree-ssa rather than just gcc.dg.
>
> ANyway, here's my work-in-progress.  Your thoughts on the various questions,
> concerns, ideas noted above would be appreciated.  Obviously I'd like to
> wrap things up quickly and include this patch in gcc6.
>
> Note, I haven't bootstrapped or regression tested this version.
>
>
>
>
>
>
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 34d2356..6613e83 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1474,6 +1474,7 @@ OBJS = \
>         tree-ssa-loop.o \
>         tree-ssa-math-opts.o \
>         tree-ssa-operands.o \
> +       tree-ssa-path-split.o \
>         tree-ssa-phionlycprop.o \
>         tree-ssa-phiopt.o \
>         tree-ssa-phiprop.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 757ce85..9cc94e2 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2403,6 +2403,10 @@ ftree-vrp
>  Common Report Var(flag_tree_vrp) Init(0) Optimization
>  Perform Value Range Propagation on trees.
>
> +ftree-path-split
> +Common Report Var(flag_tree_path_split) Init(0) Optimization
> +Perform Path Splitting on trees for loop backedges
> +
>  funit-at-a-time
>  Common Report Var(flag_unit_at_a_time) Init(1)
>  Compile whole compilation unit at a time.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 213a9d0..b1e95da 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -354,6 +354,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-vtable-verify @gol
>  -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
> +-fdump-tree-path-split@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
>  -fdump-final-insns=@var{file} @gol
>  -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
> @@ -462,7 +463,7 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta
> @gol
>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>  -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
> --ftree-vectorize -ftree-vrp @gol
> +-ftree-vectorize -ftree-vrp @gol -ftree-path-split @gol
>  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops
> @gol
>  -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
> @@ -7169,6 +7170,11 @@ output on to @file{stderr}. If two conflicting dump
> filenames are
>  given for the same pass, then the latter option overrides the earlier
>  one.
>
> +@item path-split
> +@opindex fdump-tree-path-split
> +Dump each function after path splitting.  The file name is made by
> +appending @file{.path-split} to the source file name
> +
>  @item all
>  Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
>  and @option{lineno}.
> @@ -7811,6 +7817,7 @@ also turns on the following optimization flags:
>  -ftree-switch-conversion -ftree-tail-merge @gol
>  -ftree-pre @gol
>  -ftree-vrp @gol
> +-ftree-path-split @gol
>  -fipa-ra}
>
>  Please note the warning under @option{-fgcse} about
> @@ -8819,7 +8826,7 @@ currently enabled, but may be enabled by @option{-O2}
> in the future.
>
>  @item -ftree-sink
>  @opindex ftree-sink
> -Perform forward store motion  on trees.  This flag is
> +Perform forward store motion on trees.  This flag is
>  enabled by default at @option{-O} and higher.
>
>  @item -ftree-bit-ccp
> @@ -9125,6 +9132,13 @@ enabled by default at @option{-O2} and higher.  Null
> pointer check
>  elimination is only done if @option{-fdelete-null-pointer-checks} is
>  enabled.
>
> +@item -ftree-path-split
> +@opindex ftree-path-split
> +Perform Path Splitting on trees.  When the two execution paths of a
> +if-then-else merge at the loop latch node, try to duplicate the
> +merge node into two paths. This is enabled by default at @option{-O2}
> +and above.
> +
>  @item -fsplit-ivs-in-unroller
>  @opindex fsplit-ivs-in-unroller
>  Enables expression of values of induction variables in later iterations
> diff --git a/gcc/opts.c b/gcc/opts.c
> index 9a3fbb3..9a0b27c 100644
> --- a/gcc/opts.c
> +++ b/gcc/opts.c
> @@ -509,6 +509,7 @@ static const struct default_options
> default_options_table[] =
>      { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1
> },
>      { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },
>
>      /* -O3 optimizations.  */
>      { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> diff --git a/gcc/passes.def b/gcc/passes.def
> index c0ab6b9..e0c1cd8 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -79,6 +79,7 @@ along with GCC; see the file COPYING3.  If not see
>           NEXT_PASS (pass_remove_cgraph_callee_edges);
>           NEXT_PASS (pass_object_sizes);
>           NEXT_PASS (pass_ccp);
> +         NEXT_PASS (pass_path_split);
>           /* After CCP we rewrite no longer addressed locals into SSA
>              form if possible.  */
>           NEXT_PASS (pass_forwprop);
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> new file mode 100644
> index 0000000..4b8637b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> @@ -0,0 +1,67 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-path-split-details " } */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +
> +#define RGBMAX 255
> +
> +int
> +test()
> +{
> +  int i, Pels;
> +  unsigned char sum = 0;
> +  unsigned char xr, xg, xb;
> +  unsigned char xc, xm, xy, xk;
> +  unsigned char *ReadPtr, *EritePtr;
> +
> +  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +
> +  for (i = 0; i < 100;i++)
> +     {
> +       ReadPtr[i] = 100 - i;
> +     }
> +
> +  for (i = 0; i < 100; i++)
> +     {
> +       xr = *ReadPtr++;
> +       xg = *ReadPtr++;
> +       xb = *ReadPtr++;
> +
> +       xc = (unsigned char) (RGBMAX - xr);
> +       xm = (unsigned char) (RGBMAX - xg);
> +       xy = (unsigned char) (RGBMAX - xb);
> +
> +       if (xc < xm)
> +         {
> +           xk = (unsigned char) (xc < xy ? xc : xy);
> +         }
> +       else
> +        {
> +          xk = (unsigned char) (xm < xy ? xm : xy);
> +        }
> +
> +       xc = (unsigned char) (xc - xk);
> +       xm = (unsigned char) (xm - xk);
> +       xy = (unsigned char) (xy - xk);
> +
> +       *EritePtr++ = xc;
> +       *EritePtr++ = xm;
> +       *EritePtr++ = xy;
> +       *EritePtr++ = xk;
> +       sum += *EritePtr;
> +    }
> +  return sum;
> +}
> +
> +int
> +main()
> +{
> +  if (test() != 33)
> +    abort();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "Split path in loop:" "path-split" } } */
> diff --git a/gcc/timevar.def b/gcc/timevar.def
> index b429faf..3dba68b 100644
> --- a/gcc/timevar.def
> +++ b/gcc/timevar.def
> @@ -300,3 +300,4 @@ DEFTIMEVAR (TV_LINK              , "link JIT code")
>  DEFTIMEVAR (TV_LOAD                 , "load JIT result")
>  DEFTIMEVAR (TV_JIT_ACQUIRING_MUTEX   , "acquiring JIT mutex")
>  DEFTIMEVAR (TV_JIT_CLIENT_CODE   , "JIT client code")
> +DEFTIMEVAR (TV_TREE_PATH_SPLIT  , "tree path split")
> diff --git a/gcc/tracer.c b/gcc/tracer.c
> index 941dc20..c7b5150 100644
> --- a/gcc/tracer.c
> +++ b/gcc/tracer.c
> @@ -51,9 +51,9 @@
>  #include "tree-inline.h"
>  #include "cfgloop.h"
>  #include "fibonacci_heap.h"
> +#include "tracer.h"
>
>  static int count_insns (basic_block);
> -static bool ignore_bb_p (const_basic_block);
>  static bool better_p (const_edge, const_edge);
>  static edge find_best_successor (basic_block);
>  static edge find_best_predecessor (basic_block);
> @@ -85,7 +85,7 @@ bb_seen_p (basic_block bb)
>  }
>
>  /* Return true if we should ignore the basic block for purposes of tracing.
> */
> -static bool
> +bool
>  ignore_bb_p (const_basic_block bb)
>  {
>    if (bb->index < NUM_FIXED_BLOCKS)
> @@ -226,6 +226,24 @@ find_trace (basic_block bb, basic_block *trace)
>    return i;
>  }
>
> +/* Duplicate block BB2, placing it after BB in the CFG.  Return the
> +   newly created block.  */
> +basic_block
> +transform_duplicate (basic_block bb, basic_block bb2)
> +{
> +  edge e;
> +  basic_block copy;
> +
> +  e = find_edge (bb, bb2);
> +
> +  copy = duplicate_block (bb2, e, bb);
> +  flush_pending_stmts (e);
> +
> +  add_phi_args_after_copy (&copy, 1, NULL);
> +
> +  return (copy);
> +}
> +
>  /* Look for basic blocks in frequency order, construct traces and tail
> duplicate
>     if profitable.  */
>
> @@ -321,17 +339,8 @@ tail_duplicate (void)
>                  entries or at least rotate the loop.  */
>               && bb2->loop_father->header != bb2)
>             {
> -             edge e;
> -             basic_block copy;
> -
> -             nduplicated += counts [bb2->index];
> -
> -             e = find_edge (bb, bb2);
> -
> -             copy = duplicate_block (bb2, e, bb);
> -             flush_pending_stmts (e);
> -
> -             add_phi_args_after_copy (&copy, 1, NULL);
> +              nduplicated += counts [bb2->index];
> +              basic_block copy = transform_duplicate (bb, bb2);
>
>               /* Reconsider the original copy of block we've duplicated.
>                  Removing the most common predecessor may make it to be
> diff --git a/gcc/tracer.h b/gcc/tracer.h
> new file mode 100644
> index 0000000..454d3b7
> --- /dev/null
> +++ b/gcc/tracer.h
> @@ -0,0 +1,26 @@
> +/* Header file for Tracer.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_TRACER_H
> +#define GCC_TRACER_H
> +
> +extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
> +extern bool ignore_bb_p (const_basic_block bb);
> +
> +#endif /* GCC_TRaCER_H */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 49e22a9..6963acc 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -390,6 +390,7 @@ extern gimple_opt_pass *make_pass_tree_loop_done
> (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_path_split (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
> diff --git a/gcc/tree-ssa-path-split.c b/gcc/tree-ssa-path-split.c
> new file mode 100644
> index 0000000..33dbc4d
> --- /dev/null
> +++ b/gcc/tree-ssa-path-split.c
> @@ -0,0 +1,254 @@
> +/* Support routines for Path Splitting.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3, or (at your option)
> + any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tree-pass.h"
> +#include "cfganal.h"
> +#include "cfgloop.h"
> +#include "gimple-iterator.h"
> +#include "tracer.h"
> +
> +/* Given LOOP, if its latch has an immediate dominator that ends
> +   in a simple conditional, then search for a block that is a
> +   predecessor of the latch that falls through to the latch and
> +   has no other successors.
> +
> +   When found, put that block into TRACE[0] and the latch block into
> +   TRACE[1].  Otherwise do nothing.  */
> +
> +static void
> +find_trace_loop_latch_same_as_join_blk (loop_p loop, basic_block *trace)
> +{
> +  basic_block latch = loop->latch;
> +
> +  /* We don't support path splitting if the latch has more than two
> +     predecessors.  */
> +  if (EDGE_COUNT (latch->preds) == 2)
> +    {
> +      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
> +      gimple *last = gsi_stmt (gsi_last_bb (bb));
> +
> +      /* The immediate dominator of the latch must end in a conditional.
> */
> +      if (last && gimple_code (last) != GIMPLE_COND)
> +       return;
> +
> +      /* This looks for a predecessor of the latch which has a single
> +        fallthru successor (that is the latch).  Only one of the blocks
> +        can match that pattern.
> +
> +        Do we need to check that E->src is a successor of BB?  */
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, latch->preds)
> +       {
> +         if (!single_succ_p (e->src)
> +             || !(single_succ_edge (e->src)->flags & EDGE_FALLTHRU))
> +           break;
> +         else
> +           {
> +             trace[0] = e->src;
> +             trace[1] = latch;
> +             break;
> +           }
> +       }
> +    }
> +}
> +
> +/* Return TRUE if BB is a reasonable block to duplicate by examining
> +   its size, false otherwise.  BB will always be a loop latch block.
> +
> +   Should this use the same tests as we do for jump threading?  */
> +
> +static bool
> +is_feasible_trace (basic_block bb)
> +{
> +  int num_stmt = 0;
> +  gimple_stmt_iterator gsi;
> +
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      gimple *stmt = gsi_stmt (gsi);
> +      if (!is_gimple_debug (stmt))
> +       num_stmt++;
> +    }
> +
> +  /* We may want to limit how many statements we copy.  */
> +  if (num_stmt > 1)
> +    return true;
> +
> +  return false;
> +}
> +
> +/* If the immediate dominator of the latch of the loop is
> +   block with conditional branch, then the loop latch  is
> +   duplicated to its predecessors path preserving the SSA
> +   semantics.
> +
> +   CFG before transformation.
> +
> +   <bb 6>:
> +      xk_35 = MIN_EXPR <xy_34, xc_32>;
> +      goto <bb 8>;
> +
> +   <bb 7>:
> +      xk_36 = MIN_EXPR <xy_34, xm_33>;
> +
> +   <bb 8>:
> +      # xk_4 = PHI <xk_35(6), xk_36(7)>
> +      xc_37 = xc_32 - xk_4;
> +      xm_38 = xm_33 - xk_4;
> +      xy_39 = xy_34 - xk_4;
> +
> +   CFG After Path Splitting transformation
> +   before cleanup phase.
> +
> +   <bb 7>:
> +     xk_35 = MIN_EXPR <xy_34, xc_32>;
> +
> +   <bb 8>:
> +     # xk_29 = PHI <xk_35(7)>
> +     xc_56 = xc_32 - xk_29;
> +     xm_57 = xm_33 - xk_29;
> +     xy_58 = xy_34 - xk_29;
> +     goto <bb 11>;
> +
> +   <bb 9>:
> +     xk_36 = MIN_EXPR <xy_34, xm_33>;
> +
> +   <bb 10>:
> +     # xk_4 = PHI <xk_36(9)>
> +     xc_37 = xc_32 - xk_4;
> +     xm_38 = xm_33 - xk_4;
> +     xy_39 = xy_34 - xk_4;
> +
> +  <bb 11>: .......  */
> +
> +static bool
> +perform_path_splitting ()
> +{
> +  bool changed = false;
> +  basic_block trace[2] = {NULL, NULL};
> +  loop_p loop;
> +
> +  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
> +  initialize_original_copy_tables ();
> +  calculate_dominance_info (CDI_DOMINATORS);
> +
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +    {
> +      /* If we're optimizing for size, or should not duplicate
> +        the latch block for some other reason, then ignore this
> +        loop.  */
> +      if (ignore_bb_p (loop->latch))
> +       continue;
> +
> +      /* Get the latch node and its predecessor, storing them into
> +        trace[0] and trace[1] respectively.  */
> +      find_trace_loop_latch_same_as_join_blk (loop, trace);
> +
> +      if (trace[0] && trace[1] && is_feasible_trace (trace[1]))
> +       {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            fprintf (dump_file,
> +                    "Split path in loop: latch block %d, predecessor
> %d.\n",
> +                    trace[1]->index, trace[0]->index);
> +         transform_duplicate (trace[0], trace[1]);
> +         trace[0] = NULL;
> +         trace[1] = NULL;
> +         changed = true;
> +       }
> +    }
> +
> +  loop_optimizer_finalize ();
> +  free_original_copy_tables ();
> +  return changed;
> +}
> +
> +/* Main entry point for path splitting.  Returns TODO_cleanup_cfg if any
> +   paths where split, otherwise return zero.  */
> +
> +static unsigned int
> +execute_path_split (void)
> +{
> +  /* If we don't have at least 2 real blocks and backedges in the
> +     CFG, then there's no point in trying to perform path splitting.  */
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +      || !mark_dfs_back_edges ())
> +    return 0;
> +
> +  bool changed = perform_path_splitting();
> +  if (changed)
> +    {
> +      free_dominance_info (CDI_DOMINATORS);
> +      /* If we changed the CFG schedule loops for fixup by cleanup_cfg.  */
> +      if (current_loops)
> +       loops_state_set (LOOPS_NEED_FIXUP);
> +    }
> +
> +  return changed ? TODO_cleanup_cfg : 0;
> +
> +}
> +
> +static bool
> +gate_path_split(void)
> +{
> +  return flag_tree_path_split != 0;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_path_split =
> +{
> +  GIMPLE_PASS, /* type */
> +  "path-split", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_TREE_PATH_SPLIT, /* tv_id */
> +  PROP_ssa, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa, /* todo_flags_finish */
> +};
> +
> +class pass_path_split : public gimple_opt_pass
> +{
> +   public:
> +    pass_path_split (gcc::context *ctxt)
> +      : gimple_opt_pass (pass_data_path_split, ctxt)
> +    {}
> +   /* opt_pass methods: */
> +   opt_pass * clone () { return new pass_path_split (m_ctxt); }
> +   virtual bool gate (function *) { return gate_path_split (); }
> +   virtual unsigned int execute (function *) { return execute_path_split
> (); }
> +
> +}; // class pass_path_split
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_path_split (gcc::context *ctxt)
> +{
> +  return new pass_path_split (ctxt);
> +}
>
Jeff Law Nov. 12, 2015, 5:05 p.m. UTC | #2
On 11/12/2015 03:58 AM, Richard Biener wrote:
> On Wed, Nov 11, 2015 at 9:38 PM, Jeff Law <law@redhat.com> wrote:
>> On 09/04/2015 11:36 AM, Ajit Kumar Agarwal wrote:
>>
>>>> diff --git a/gcc/passes.def b/gcc/passes.def
>>>> index 6b66f8f..20ddf3d 100644
>>>> --- a/gcc/passes.def
>>>> +++ b/gcc/passes.def
>>>> @@ -82,6 +82,7 @@ along with GCC; see the file COPYING3.  If not see
>>>>            NEXT_PASS (pass_ccp);
>>>>            /* After CCP we rewrite no longer addressed locals into SSA
>>>>               form if possible.  */
>>>> +          NEXT_PASS (pass_path_split);
>>>>            NEXT_PASS (pass_forwprop);
>>>>            NEXT_PASS (pass_sra_early);
>>>
>>> I can't recall if we've discussed the location of the pass at all.  I'm
>>> not objecting to this location, but would like to hear why you chose
>>> this particular location in the optimization pipeline.
>>
>> So returning to the question of where this would live in the optimization
>> pipeline and how it interacts with if-conversion and vectorization.
>
> Note that adding passes to the early pipeline that do code duplication
> is a no-no.
> The early pipeline should be exclusively for things making functions
> more suitable for inlining.
I'd been experimenting with moving it down in the pipeline.  It 
certainly doesn't seem to need to be in the early optimizations.   At 
some point we force latches to have single successors which spoils the 
simplistic region recognition of the path splitting pass.

>
>> The concern with moving it to late in the pipeline was that we'd miss
>> VRP/DCE/CSE opportunities.  I'm not sure if you're aware, but we actually
>> run those passes more than once.  So it would be possible to run path
>> splitting after if-conversion & vectorization, but before the second passs
>> of VRP & DOM.  But trying that seems to result in something scrambling the
>> loop enough that the path splitting opportunity is missed.  That might be
>> worth deeper investigation if we can't come up with some kind of heuristics
>> to fire or suppress path splitting.
>
> As I still think it is a transform similar to tracer just put it next to that.
The CFG has changed shape significantly by that point.  So some 
adjustments would be needed.  Essentially it's no longer the latch that 
needs to be duplicated into the THEN/ELSE clauses, but the join point 
that's the predecessor of the latch.

But that's probably a good change to make anyway because we end up doing 
less damage to the overall shape of the CFG.  Essentially path splitting 
would look like creating superblocks by target duplication, and that's 
kind of what I expected this to look like all-along.


>
> But IIRC you mentioned it should enable vectorization or so?  In this case
> that's obviously too late.
The opposite.  Path splitting interferes with if-conversion & 
vectorization.  Path splitting mucks up the CFG enough that 
if-conversion won't fire and as a result vectorization is inhibited.  It 
also creates multi-latch loops, which isn't a great situation either.

It *may* be the case that dropping it that far down in the pipeline and 
making the modifications necessary to handle simple latches may in turn 
make the path splitting code play better with if-conversion and 
vectorization and avoid creation of multi-latch loops.  At least that's 
how it looks on paper when I draw out the CFG manipulations.

I'll do some experiments.

Jeff
Jeff Law Nov. 12, 2015, 6:32 p.m. UTC | #3
On 11/12/2015 10:05 AM, Jeff Law wrote:
>> But IIRC you mentioned it should enable vectorization or so?  In this
>> case
>> that's obviously too late.
> The opposite.  Path splitting interferes with if-conversion &
> vectorization.  Path splitting mucks up the CFG enough that
> if-conversion won't fire and as a result vectorization is inhibited.  It
> also creates multi-latch loops, which isn't a great situation either.
>
> It *may* be the case that dropping it that far down in the pipeline and
> making the modifications necessary to handle simple latches may in turn
> make the path splitting code play better with if-conversion and
> vectorization and avoid creation of multi-latch loops.  At least that's
> how it looks on paper when I draw out the CFG manipulations.
>
> I'll do some experiments.
It doesn't look too terrible to ravamp the recognition code to work 
later in the pipeline with simple latches.  Sadly that doesn't seem to 
have fixed the bad interactions with if-conversion.

*But* that does open up the possibility of moving the path splitting 
pass even deeper in the pipeline -- in particular we can move it past 
the vectorizer.  Which is may be a win.

So the big question is whether or not we'll still see enough benefits 
from having it so late in the pipeline.  It's still early enough that we 
get DOM, VRP, reassoc, forwprop, phiopt, etc.

Ajit, I'll pass along an updated patch after doing some more testing.

Jeff
Richard Biener Nov. 12, 2015, 7:40 p.m. UTC | #4
On November 12, 2015 7:32:57 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 11/12/2015 10:05 AM, Jeff Law wrote:
>>> But IIRC you mentioned it should enable vectorization or so?  In
>this
>>> case
>>> that's obviously too late.
>> The opposite.  Path splitting interferes with if-conversion &
>> vectorization.  Path splitting mucks up the CFG enough that
>> if-conversion won't fire and as a result vectorization is inhibited. 
>It
>> also creates multi-latch loops, which isn't a great situation either.
>>
>> It *may* be the case that dropping it that far down in the pipeline
>and
>> making the modifications necessary to handle simple latches may in
>turn
>> make the path splitting code play better with if-conversion and
>> vectorization and avoid creation of multi-latch loops.  At least
>that's
>> how it looks on paper when I draw out the CFG manipulations.
>>
>> I'll do some experiments.
>It doesn't look too terrible to ravamp the recognition code to work 
>later in the pipeline with simple latches.  Sadly that doesn't seem to 
>have fixed the bad interactions with if-conversion.
>
>*But* that does open up the possibility of moving the path splitting 
>pass even deeper in the pipeline -- in particular we can move it past 
>the vectorizer.  Which is may be a win.
>
>So the big question is whether or not we'll still see enough benefits 
>from having it so late in the pipeline.  It's still early enough that
>we 
>get DOM, VRP, reassoc, forwprop, phiopt, etc.
>
>Ajit, I'll pass along an updated patch after doing some more testing.

BTW, if you not use loops_normal for loop init you don't get simple latches forced (and cfg-cleanup will remove them)

Richard.

>Jeff
Jeff Law Nov. 12, 2015, 7:52 p.m. UTC | #5
On 11/12/2015 12:40 PM, Richard Biener wrote:
> On November 12, 2015 7:32:57 PM GMT+01:00, Jeff Law <law@redhat.com>
> wrote:
>> On 11/12/2015 10:05 AM, Jeff Law wrote:
>>>> But IIRC you mentioned it should enable vectorization or so?
>>>> In
>> this
>>>> case that's obviously too late.
>>> The opposite.  Path splitting interferes with if-conversion &
>>> vectorization.  Path splitting mucks up the CFG enough that
>>> if-conversion won't fire and as a result vectorization is
>>> inhibited.
>> It
>>> also creates multi-latch loops, which isn't a great situation
>>> either.
>>>
>>> It *may* be the case that dropping it that far down in the
>>> pipeline
>> and
>>> making the modifications necessary to handle simple latches may
>>> in
>> turn
>>> make the path splitting code play better with if-conversion and
>>> vectorization and avoid creation of multi-latch loops.  At least
>> that's
>>> how it looks on paper when I draw out the CFG manipulations.
>>>
>>> I'll do some experiments.
>> It doesn't look too terrible to ravamp the recognition code to
>> work later in the pipeline with simple latches.  Sadly that doesn't
>> seem to have fixed the bad interactions with if-conversion.
>>
>> *But* that does open up the possibility of moving the path
>> splitting pass even deeper in the pipeline -- in particular we can
>> move it past the vectorizer.  Which is may be a win.
>>
>> So the big question is whether or not we'll still see enough
>> benefits from having it so late in the pipeline.  It's still early
>> enough that we get DOM, VRP, reassoc, forwprop, phiopt, etc.
>>
>> Ajit, I'll pass along an updated patch after doing some more
>> testing.
>
> BTW, if you not use loops_normal for loop init you don't get simple
> latches forced (and cfg-cleanup will remove them)
I think I'd prefer to have loops start in simple-latches form and 
preserve the simple-latches form.    Detection is slightly harder, but 
transformation without creating multiple latches is easier.

jeff
diff mbox

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 34d2356..6613e83 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1474,6 +1474,7 @@  OBJS = \
 	tree-ssa-loop.o \
 	tree-ssa-math-opts.o \
 	tree-ssa-operands.o \
+	tree-ssa-path-split.o \
 	tree-ssa-phionlycprop.o \
 	tree-ssa-phiopt.o \
 	tree-ssa-phiprop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 757ce85..9cc94e2 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2403,6 +2403,10 @@  ftree-vrp
 Common Report Var(flag_tree_vrp) Init(0) Optimization
 Perform Value Range Propagation on trees.
 
+ftree-path-split
+Common Report Var(flag_tree_path_split) Init(0) Optimization
+Perform Path Splitting on trees for loop backedges
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 213a9d0..b1e95da 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -354,6 +354,7 @@  Objective-C and Objective-C++ Dialects}.
 -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
 -fdump-tree-vtable-verify @gol
 -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
+-fdump-tree-path-split@r{[}-@var{n}@r{]} @gol
 -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
 -fdump-final-insns=@var{file} @gol
 -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
@@ -462,7 +463,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp @gol -ftree-path-split @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -7169,6 +7170,11 @@  output on to @file{stderr}. If two conflicting dump filenames are
 given for the same pass, then the latter option overrides the earlier
 one.
 
+@item path-split
+@opindex fdump-tree-path-split
+Dump each function after path splitting.  The file name is made by
+appending @file{.path-split} to the source file name
+
 @item all
 Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
 and @option{lineno}.
@@ -7811,6 +7817,7 @@  also turns on the following optimization flags:
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-pre @gol
 -ftree-vrp @gol
+-ftree-path-split @gol
 -fipa-ra}
 
 Please note the warning under @option{-fgcse} about
@@ -8819,7 +8826,7 @@  currently enabled, but may be enabled by @option{-O2} in the future.
 
 @item -ftree-sink
 @opindex ftree-sink
-Perform forward store motion  on trees.  This flag is
+Perform forward store motion on trees.  This flag is
 enabled by default at @option{-O} and higher.
 
 @item -ftree-bit-ccp
@@ -9125,6 +9132,13 @@  enabled by default at @option{-O2} and higher.  Null pointer check
 elimination is only done if @option{-fdelete-null-pointer-checks} is
 enabled.
 
+@item -ftree-path-split
+@opindex ftree-path-split
+Perform Path Splitting on trees.  When the two execution paths of a
+if-then-else merge at the loop latch node, try to duplicate the
+merge node into two paths. This is enabled by default at @option{-O2}
+and above.
+
 @item -fsplit-ivs-in-unroller
 @opindex fsplit-ivs-in-unroller
 Enables expression of values of induction variables in later iterations
diff --git a/gcc/opts.c b/gcc/opts.c
index 9a3fbb3..9a0b27c 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -509,6 +509,7 @@  static const struct default_options default_options_table[] =
     { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..e0c1cd8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -79,6 +79,7 @@  along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_remove_cgraph_callee_edges);
 	  NEXT_PASS (pass_object_sizes);
 	  NEXT_PASS (pass_ccp);
+	  NEXT_PASS (pass_path_split);
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
new file mode 100644
index 0000000..4b8637b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
@@ -0,0 +1,67 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-path-split-details " } */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#define RGBMAX 255
+
+int
+test()
+{
+  int i, Pels;
+  unsigned char sum = 0;
+  unsigned char xr, xg, xb;
+  unsigned char xc, xm, xy, xk;
+  unsigned char *ReadPtr, *EritePtr;
+
+  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+
+  for (i = 0; i < 100;i++)
+     {
+       ReadPtr[i] = 100 - i;
+     }
+
+  for (i = 0; i < 100; i++)
+     {
+       xr = *ReadPtr++;
+       xg = *ReadPtr++;
+       xb = *ReadPtr++;
+
+       xc = (unsigned char) (RGBMAX - xr);
+       xm = (unsigned char) (RGBMAX - xg);
+       xy = (unsigned char) (RGBMAX - xb);
+
+       if (xc < xm)
+         {
+           xk = (unsigned char) (xc < xy ? xc : xy);
+         }
+       else
+        {
+          xk = (unsigned char) (xm < xy ? xm : xy);
+        }
+
+       xc = (unsigned char) (xc - xk);
+       xm = (unsigned char) (xm - xk);
+       xy = (unsigned char) (xy - xk);
+
+       *EritePtr++ = xc;
+       *EritePtr++ = xm;
+       *EritePtr++ = xy;
+       *EritePtr++ = xk;
+       sum += *EritePtr;
+    }
+  return sum;
+}
+
+int
+main()
+{
+  if (test() != 33)
+    abort();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Split path in loop:" "path-split" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index b429faf..3dba68b 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -300,3 +300,4 @@  DEFTIMEVAR (TV_LINK		     , "link JIT code")
 DEFTIMEVAR (TV_LOAD		     , "load JIT result")
 DEFTIMEVAR (TV_JIT_ACQUIRING_MUTEX   , "acquiring JIT mutex")
 DEFTIMEVAR (TV_JIT_CLIENT_CODE   , "JIT client code")
+DEFTIMEVAR (TV_TREE_PATH_SPLIT  , "tree path split")
diff --git a/gcc/tracer.c b/gcc/tracer.c
index 941dc20..c7b5150 100644
--- a/gcc/tracer.c
+++ b/gcc/tracer.c
@@ -51,9 +51,9 @@ 
 #include "tree-inline.h"
 #include "cfgloop.h"
 #include "fibonacci_heap.h"
+#include "tracer.h"
 
 static int count_insns (basic_block);
-static bool ignore_bb_p (const_basic_block);
 static bool better_p (const_edge, const_edge);
 static edge find_best_successor (basic_block);
 static edge find_best_predecessor (basic_block);
@@ -85,7 +85,7 @@  bb_seen_p (basic_block bb)
 }
 
 /* Return true if we should ignore the basic block for purposes of tracing.  */
-static bool
+bool
 ignore_bb_p (const_basic_block bb)
 {
   if (bb->index < NUM_FIXED_BLOCKS)
@@ -226,6 +226,24 @@  find_trace (basic_block bb, basic_block *trace)
   return i;
 }
 
+/* Duplicate block BB2, placing it after BB in the CFG.  Return the
+   newly created block.  */
+basic_block
+transform_duplicate (basic_block bb, basic_block bb2)
+{
+  edge e;
+  basic_block copy;
+
+  e = find_edge (bb, bb2);
+
+  copy = duplicate_block (bb2, e, bb);
+  flush_pending_stmts (e);
+
+  add_phi_args_after_copy (&copy, 1, NULL);
+
+  return (copy);
+}
+
 /* Look for basic blocks in frequency order, construct traces and tail duplicate
    if profitable.  */
 
@@ -321,17 +339,8 @@  tail_duplicate (void)
 		 entries or at least rotate the loop.  */
 	      && bb2->loop_father->header != bb2)
 	    {
-	      edge e;
-	      basic_block copy;
-
-	      nduplicated += counts [bb2->index];
-
-	      e = find_edge (bb, bb2);
-
-	      copy = duplicate_block (bb2, e, bb);
-	      flush_pending_stmts (e);
-
-	      add_phi_args_after_copy (&copy, 1, NULL);
+              nduplicated += counts [bb2->index];
+              basic_block copy = transform_duplicate (bb, bb2);
 
 	      /* Reconsider the original copy of block we've duplicated.
 	         Removing the most common predecessor may make it to be
diff --git a/gcc/tracer.h b/gcc/tracer.h
new file mode 100644
index 0000000..454d3b7
--- /dev/null
+++ b/gcc/tracer.h
@@ -0,0 +1,26 @@ 
+/* Header file for Tracer.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TRACER_H
+#define GCC_TRACER_H
+
+extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
+extern bool ignore_bb_p (const_basic_block bb);
+
+#endif /* GCC_TRaCER_H */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..6963acc 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -390,6 +390,7 @@  extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_path_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-path-split.c b/gcc/tree-ssa-path-split.c
new file mode 100644
index 0000000..33dbc4d
--- /dev/null
+++ b/gcc/tree-ssa-path-split.c
@@ -0,0 +1,254 @@ 
+/* Support routines for Path Splitting.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
+
+ This file is part of GCC.
+
+ GCC is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "cfganal.h"
+#include "cfgloop.h"
+#include "gimple-iterator.h"
+#include "tracer.h"
+
+/* Given LOOP, if its latch has an immediate dominator that ends
+   in a simple conditional, then search for a block that is a
+   predecessor of the latch that falls through to the latch and
+   has no other successors.
+
+   When found, put that block into TRACE[0] and the latch block into
+   TRACE[1].  Otherwise do nothing.  */
+
+static void
+find_trace_loop_latch_same_as_join_blk (loop_p loop, basic_block *trace)
+{
+  basic_block latch = loop->latch;
+
+  /* We don't support path splitting if the latch has more than two
+     predecessors.  */
+  if (EDGE_COUNT (latch->preds) == 2)
+    {
+      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
+      gimple *last = gsi_stmt (gsi_last_bb (bb));
+
+      /* The immediate dominator of the latch must end in a conditional.  */
+      if (last && gimple_code (last) != GIMPLE_COND)
+	return;
+
+      /* This looks for a predecessor of the latch which has a single
+	 fallthru successor (that is the latch).  Only one of the blocks
+	 can match that pattern. 
+
+	 Do we need to check that E->src is a successor of BB?  */
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, latch->preds)
+	{
+	  if (!single_succ_p (e->src)
+	      || !(single_succ_edge (e->src)->flags & EDGE_FALLTHRU))
+	    break;
+	  else
+	    {
+	      trace[0] = e->src;
+	      trace[1] = latch;
+	      break;
+	    }
+	}
+    }
+}
+
+/* Return TRUE if BB is a reasonable block to duplicate by examining
+   its size, false otherwise.  BB will always be a loop latch block.
+
+   Should this use the same tests as we do for jump threading?  */
+
+static bool
+is_feasible_trace (basic_block bb)
+{
+  int num_stmt = 0;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (!is_gimple_debug (stmt))
+	num_stmt++;
+    }
+
+  /* We may want to limit how many statements we copy.  */
+  if (num_stmt > 1)
+    return true;
+
+  return false;
+}
+
+/* If the immediate dominator of the latch of the loop is
+   block with conditional branch, then the loop latch  is
+   duplicated to its predecessors path preserving the SSA
+   semantics.
+
+   CFG before transformation.
+
+   <bb 6>:
+      xk_35 = MIN_EXPR <xy_34, xc_32>;
+      goto <bb 8>;
+
+   <bb 7>:
+      xk_36 = MIN_EXPR <xy_34, xm_33>;
+
+   <bb 8>:
+      # xk_4 = PHI <xk_35(6), xk_36(7)>
+      xc_37 = xc_32 - xk_4;
+      xm_38 = xm_33 - xk_4;
+      xy_39 = xy_34 - xk_4;
+
+   CFG After Path Splitting transformation
+   before cleanup phase.
+
+   <bb 7>:
+     xk_35 = MIN_EXPR <xy_34, xc_32>;
+
+   <bb 8>:
+     # xk_29 = PHI <xk_35(7)>
+     xc_56 = xc_32 - xk_29;
+     xm_57 = xm_33 - xk_29;
+     xy_58 = xy_34 - xk_29;
+     goto <bb 11>;
+
+   <bb 9>:
+     xk_36 = MIN_EXPR <xy_34, xm_33>;
+
+   <bb 10>:
+     # xk_4 = PHI <xk_36(9)>
+     xc_37 = xc_32 - xk_4;
+     xm_38 = xm_33 - xk_4;
+     xy_39 = xy_34 - xk_4;
+
+  <bb 11>: .......  */
+
+static bool
+perform_path_splitting ()
+{
+  bool changed = false;
+  basic_block trace[2] = {NULL, NULL};
+  loop_p loop;
+
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+  initialize_original_copy_tables ();
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      /* If we're optimizing for size, or should not duplicate
+	 the latch block for some other reason, then ignore this
+	 loop.  */
+      if (ignore_bb_p (loop->latch))
+	continue;
+
+      /* Get the latch node and its predecessor, storing them into
+	 trace[0] and trace[1] respectively.  */
+      find_trace_loop_latch_same_as_join_blk (loop, trace);
+
+      if (trace[0] && trace[1] && is_feasible_trace (trace[1]))
+	{
+          if (dump_file && (dump_flags & TDF_DETAILS)) 
+            fprintf (dump_file,
+		     "Split path in loop: latch block %d, predecessor %d.\n",
+		     trace[1]->index, trace[0]->index);
+	  transform_duplicate (trace[0], trace[1]);
+	  trace[0] = NULL;
+	  trace[1] = NULL;
+	  changed = true;
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_original_copy_tables ();
+  return changed;
+}
+
+/* Main entry point for path splitting.  Returns TODO_cleanup_cfg if any
+   paths where split, otherwise return zero.  */
+
+static unsigned int
+execute_path_split (void)
+{
+  /* If we don't have at least 2 real blocks and backedges in the
+     CFG, then there's no point in trying to perform path splitting.  */
+  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+      || !mark_dfs_back_edges ())
+    return 0;
+
+  bool changed = perform_path_splitting();
+  if (changed)
+    {
+      free_dominance_info (CDI_DOMINATORS);
+      /* If we changed the CFG schedule loops for fixup by cleanup_cfg.  */
+      if (current_loops)
+	loops_state_set (LOOPS_NEED_FIXUP);
+    }
+
+  return changed ? TODO_cleanup_cfg : 0;
+
+}
+
+static bool
+gate_path_split(void)
+{
+  return flag_tree_path_split != 0;
+}
+
+namespace {
+
+const pass_data pass_data_path_split =
+{
+  GIMPLE_PASS, /* type */
+  "path-split", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_TREE_PATH_SPLIT, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_path_split : public gimple_opt_pass
+{
+   public:
+    pass_path_split (gcc::context *ctxt)
+      : gimple_opt_pass (pass_data_path_split, ctxt)
+    {}
+   /* opt_pass methods: */
+   opt_pass * clone () { return new pass_path_split (m_ctxt); }
+   virtual bool gate (function *) { return gate_path_split (); }
+   virtual unsigned int execute (function *) { return execute_path_split (); }
+
+}; // class pass_path_split
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_path_split (gcc::context *ctxt)
+{
+  return new pass_path_split (ctxt);
+}