
[tree-optimization] : Add new path Splitting pass on tree ssa representation

Message ID 56450B62.4090404@redhat.com
State New

Commit Message

Jeff Law Nov. 12, 2015, 9:57 p.m. UTC
On 11/12/2015 11:32 AM, Jeff Law wrote:
> On 11/12/2015 10:05 AM, Jeff Law wrote:
>>> But IIRC you mentioned it should enable vectorization or so?  In this
>>> case
>>> that's obviously too late.
>> The opposite.  Path splitting interferes with if-conversion &
>> vectorization.  Path splitting mucks up the CFG enough that
>> if-conversion won't fire and as a result vectorization is inhibited.  It
>> also creates multi-latch loops, which isn't a great situation either.
>>
>> It *may* be the case that dropping it that far down in the pipeline and
>> making the modifications necessary to handle simple latches may in turn
>> make the path splitting code play better with if-conversion and
>> vectorization and avoid creation of multi-latch loops.  At least that's
>> how it looks on paper when I draw out the CFG manipulations.
>>
>> I'll do some experiments.
> It doesn't look too terrible to revamp the recognition code to work
> later in the pipeline with simple latches.  Sadly that doesn't seem to
> have fixed the bad interactions with if-conversion.
>
> *But* that does open up the possibility of moving the path splitting
> pass even deeper in the pipeline -- in particular we can move it past
> the vectorizer.  Which may be a win.
>
> So the big question is whether or not we'll still see enough benefits
> from having it so late in the pipeline.  It's still early enough that we
> get DOM, VRP, reassoc, forwprop, phiopt, etc.
>
> Ajit, I'll pass along an updated patch after doing some more testing.
So here's what I'm working with.  It runs after the vectorizer now.

Ajit, if you could benchmark this it would be greatly appreciated.  I 
know you saw significant improvements on one or more benchmarks in the 
past.  It'd be good to know that the updated placement of the pass 
doesn't invalidate the gains you saw.

With the updated pass placement, we don't have to worry about switching 
the pass on/off based on whether or not the vectorizer & if-conversion 
are enabled.  So that hackery is gone.

I think I've beefed up the test to identify the diamond patterns we want 
so that it's stricter in what we accept.  The call to ignore_bb_p is a 
part of that test so that we're actually looking at the right block in a 
world where we're doing this transformation with simple latches.

I've also put a graphical comment before perform_path_splitting which 
hopefully shows the CFG transformation we're making a bit clearer.

This bootstraps and regression tests cleanly on x86_64-linux-gnu.
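
[As a rough source-level illustration -- not from the patch or the message above; the function and values here are made up -- the diamond shape the pass looks for, and the effect of copying the join block into each arm, might look roughly like this:]

/* A loop whose if-then-else arms rejoin at a block that does some
   shared work before reaching the latch -- the diamond shape the
   pass targets.  */
unsigned char
clamp_sum (const unsigned char *p, int n)
{
  unsigned char sum = 0;
  for (int i = 0; i < n; i++)
    {
      unsigned char t;
      if (p[i] < 128)
        t = p[i];            /* THEN arm */
      else
        t = 255 - p[i];      /* ELSE arm */
      t = t >> 1;            /* join block: statements shared by */
      sum += t;              /* both arms, just before the latch */
    }
  return sum;
}

/* After path splitting, the join block is in effect duplicated into
   each arm, so each path to the latch is longer but straight, which
   gives CSE, DCE and friends more to work with per path:

     if (p[i] < 128)
       { t = p[i];        t = t >> 1;  sum += t; }
     else
       { t = 255 - p[i];  t = t >> 1;  sum += t; }   */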

Comments

Richard Biener Nov. 13, 2015, 10:13 a.m. UTC | #1
On Thu, Nov 12, 2015 at 10:57 PM, Jeff Law <law@redhat.com> wrote:
> On 11/12/2015 11:32 AM, Jeff Law wrote:
>>
>> On 11/12/2015 10:05 AM, Jeff Law wrote:
>>>>
>>>> But IIRC you mentioned it should enable vectorization or so?  In this
>>>> case
>>>> that's obviously too late.
>>>
>>> The opposite.  Path splitting interferes with if-conversion &
>>> vectorization.  Path splitting mucks up the CFG enough that
>>> if-conversion won't fire and as a result vectorization is inhibited.  It
>>> also creates multi-latch loops, which isn't a great situation either.
>>>
>>> It *may* be the case that dropping it that far down in the pipeline and
>>> making the modifications necessary to handle simple latches may in turn
>>> make the path splitting code play better with if-conversion and
>>> vectorization and avoid creation of multi-latch loops.  At least that's
>>> how it looks on paper when I draw out the CFG manipulations.
>>>
>>> I'll do some experiments.
>>
>> It doesn't look too terrible to revamp the recognition code to work
>> later in the pipeline with simple latches.  Sadly that doesn't seem to
>> have fixed the bad interactions with if-conversion.
>>
>> *But* that does open up the possibility of moving the path splitting
>> pass even deeper in the pipeline -- in particular we can move it past
>> the vectorizer.  Which may be a win.
>>
>> So the big question is whether or not we'll still see enough benefits
>> from having it so late in the pipeline.  It's still early enough that we
>> get DOM, VRP, reassoc, forwprop, phiopt, etc.
>>
>> Ajit, I'll pass along an updated patch after doing some more testing.
>
> So here's what I'm working with.  It runs after the vectorizer now.
>
> Ajit, if you could benchmark this it would be greatly appreciated.  I know
> you saw significant improvements on one or more benchmarks in the past.
> It'd be good to know that the updated placement of the pass doesn't
> invalidate the gains you saw.
>
> With the updated pass placement, we don't have to worry about switching the
> pass on/off based on whether or not the vectorizer & if-conversion are
> enabled.  So that hackery is gone.
>
> I think I've beefed up the test to identify the diamond patterns we want so
> that it's stricter in what we accept.  The call to ignore_bb_p is a part of
> that test so that we're actually looking at the right block in a world where
> we're doing this transformation with simple latches.
>
> I've also put a graphical comment before perform_path_splitting which
> hopefully shows the CFG transformation we're making a bit clearer.
>
> This bootstraps and regression tests cleanly on x86_64-linux-gnu.
>
>
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 34d2356..6613e83 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1474,6 +1474,7 @@ OBJS = \
>         tree-ssa-loop.o \
>         tree-ssa-math-opts.o \
>         tree-ssa-operands.o \
> +       tree-ssa-path-split.o \

gimple-ssa-path-split please.

>         tree-ssa-phionlycprop.o \
>         tree-ssa-phiopt.o \
>         tree-ssa-phiprop.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 757ce85..3e946ca 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2403,6 +2403,10 @@ ftree-vrp
>  Common Report Var(flag_tree_vrp) Init(0) Optimization
>  Perform Value Range Propagation on trees.
>
> +ftree-path-split

fsplit-paths

> +Common Report Var(flag_tree_path_split) Init(0) Optimization
> +Perform Path Splitting on trees for loop backedges.
> +
>  funit-at-a-time
>  Common Report Var(flag_unit_at_a_time) Init(1)
>  Compile whole compilation unit at a time.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 213a9d0..b1e95da 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -354,6 +354,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-vtable-verify @gol
>  -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
> +-fdump-tree-path-split@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
>  -fdump-final-insns=@var{file} @gol
>  -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
> @@ -462,7 +463,7 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta
> @gol
>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>  -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
> --ftree-vectorize -ftree-vrp @gol
> +-ftree-vectorize -ftree-vrp @gol -ftree-path-split @gol
>  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops
> @gol
>  -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
> @@ -7169,6 +7170,11 @@ output on to @file{stderr}. If two conflicting dump
> filenames are
>  given for the same pass, then the latter option overrides the earlier
>  one.
>
> +@item path-split
> +@opindex fdump-tree-path-split
> +Dump each function after path splitting.  The file name is made by
> +appending @file{.path-split} to the source file name.
> +
>  @item all
>  Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
>  and @option{lineno}.
> @@ -7811,6 +7817,7 @@ also turns on the following optimization flags:
>  -ftree-switch-conversion -ftree-tail-merge @gol
>  -ftree-pre @gol
>  -ftree-vrp @gol
> +-ftree-path-split @gol
>  -fipa-ra}
>
>  Please note the warning under @option{-fgcse} about
> @@ -8819,7 +8826,7 @@ currently enabled, but may be enabled by @option{-O2}
> in the future.
>
>  @item -ftree-sink
>  @opindex ftree-sink
> -Perform forward store motion  on trees.  This flag is
> +Perform forward store motion on trees.  This flag is
>  enabled by default at @option{-O} and higher.
>
>  @item -ftree-bit-ccp
> @@ -9125,6 +9132,13 @@ enabled by default at @option{-O2} and higher.  Null
> pointer check
>  elimination is only done if @option{-fdelete-null-pointer-checks} is
>  enabled.
>
> +@item -ftree-path-split
> +@opindex ftree-path-split
> +Perform Path Splitting on trees.  When the two execution paths of a
> +if-then-else merge at the loop latch node, try to duplicate the
> +merge node into two paths. This is enabled by default at @option{-O2}
> +and above.
> +

I think if we go into the detail of the transform we should mention the
effective result (creating a loop nest with disambiguation figuring out
which is the "better" inner loop).

>  @item -fsplit-ivs-in-unroller
>  @opindex fsplit-ivs-in-unroller
>  Enables expression of values of induction variables in later iterations
> diff --git a/gcc/opts.c b/gcc/opts.c
> index 9a3fbb3..9a0b27c 100644
> --- a/gcc/opts.c
> +++ b/gcc/opts.c
> @@ -509,6 +509,7 @@ static const struct default_options
> default_options_table[] =
>      { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1
> },
>      { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },

Is this transform a good idea for -Os?

>
>      /* -O3 optimizations.  */
>      { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> diff --git a/gcc/passes.def b/gcc/passes.def
> index c0ab6b9..4c9ef5f 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -274,6 +274,7 @@ along with GCC; see the file COPYING3.  If not see
>        POP_INSERT_PASSES ()
>        NEXT_PASS (pass_simduid_cleanup);
>        NEXT_PASS (pass_lower_vector_ssa);
> +      NEXT_PASS (pass_path_split);
>        NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_reassoc);
>        NEXT_PASS (pass_strength_reduction);
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> new file mode 100644
> index 0000000..c7e9515
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
> @@ -0,0 +1,67 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-path-split-details " } */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +
> +#define RGBMAX 255
> +
> +int
> +test()
> +{
> +  int i, Pels;
> +  unsigned char sum = 0;
> +  unsigned char xr, xg, xb;
> +  unsigned char xc, xm, xy, xk;
> +  unsigned char *ReadPtr, *EritePtr;
> +
> +  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +
> +  for (i = 0; i < 100;i++)
> +     {
> +       ReadPtr[i] = 100 - i;
> +     }
> +
> +  for (i = 0; i < 100; i++)
> +     {
> +       xr = *ReadPtr++;
> +       xg = *ReadPtr++;
> +       xb = *ReadPtr++;
> +
> +       xc = (unsigned char) (RGBMAX - xr);
> +       xm = (unsigned char) (RGBMAX - xg);
> +       xy = (unsigned char) (RGBMAX - xb);
> +
> +       if (xc < xm)
> +         {
> +           xk = (unsigned char) (xc < xy ? xc : xy);
> +         }
> +       else
> +        {
> +          xk = (unsigned char) (xm < xy ? xm : xy);
> +        }
> +
> +       xc = (unsigned char) (xc - xk);
> +       xm = (unsigned char) (xm - xk);
> +       xy = (unsigned char) (xy - xk);
> +
> +       *EritePtr++ = xc;
> +       *EritePtr++ = xm;
> +       *EritePtr++ = xy;
> +       *EritePtr++ = xk;
> +       sum += *EritePtr;
> +    }
> +  return sum;
> +}
> +
> +int
> +main()
> +{
> +  if (test() != 33)
> +    abort();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "Duplicating join block" "path-split" } } */
> diff --git a/gcc/timevar.def b/gcc/timevar.def
> index b429faf..3dba68b 100644
> --- a/gcc/timevar.def
> +++ b/gcc/timevar.def
> @@ -300,3 +300,4 @@ DEFTIMEVAR (TV_LINK              , "link JIT code")
>  DEFTIMEVAR (TV_LOAD                 , "load JIT result")
>  DEFTIMEVAR (TV_JIT_ACQUIRING_MUTEX   , "acquiring JIT mutex")
>  DEFTIMEVAR (TV_JIT_CLIENT_CODE   , "JIT client code")
> +DEFTIMEVAR (TV_TREE_PATH_SPLIT  , "tree path split")
> diff --git a/gcc/tracer.c b/gcc/tracer.c
> index 941dc20..c7b5150 100644
> --- a/gcc/tracer.c
> +++ b/gcc/tracer.c
> @@ -51,9 +51,9 @@
>  #include "tree-inline.h"
>  #include "cfgloop.h"
>  #include "fibonacci_heap.h"
> +#include "tracer.h"
>
>  static int count_insns (basic_block);
> -static bool ignore_bb_p (const_basic_block);
>  static bool better_p (const_edge, const_edge);
>  static edge find_best_successor (basic_block);
>  static edge find_best_predecessor (basic_block);
> @@ -85,7 +85,7 @@ bb_seen_p (basic_block bb)
>  }
>
>  /* Return true if we should ignore the basic block for purposes of tracing.
> */
> -static bool
> +bool
>  ignore_bb_p (const_basic_block bb)
>  {
>    if (bb->index < NUM_FIXED_BLOCKS)
> @@ -226,6 +226,24 @@ find_trace (basic_block bb, basic_block *trace)
>    return i;
>  }
>
> +/* Duplicate block BB2, placing it after BB in the CFG.  Return the
> +   newly created block.  */
> +basic_block
> +transform_duplicate (basic_block bb, basic_block bb2)
> +{
> +  edge e;
> +  basic_block copy;
> +
> +  e = find_edge (bb, bb2);
> +
> +  copy = duplicate_block (bb2, e, bb);
> +  flush_pending_stmts (e);
> +
> +  add_phi_args_after_copy (&copy, 1, NULL);
> +
> +  return (copy);
> +}
> +
>  /* Look for basic blocks in frequency order, construct traces and tail
> duplicate
>     if profitable.  */
>
> @@ -321,17 +339,8 @@ tail_duplicate (void)
>                  entries or at least rotate the loop.  */
>               && bb2->loop_father->header != bb2)
>             {
> -             edge e;
> -             basic_block copy;
> -
> -             nduplicated += counts [bb2->index];
> -
> -             e = find_edge (bb, bb2);
> -
> -             copy = duplicate_block (bb2, e, bb);
> -             flush_pending_stmts (e);
> -
> -             add_phi_args_after_copy (&copy, 1, NULL);
> +              nduplicated += counts [bb2->index];
> +              basic_block copy = transform_duplicate (bb, bb2);
>
>               /* Reconsider the original copy of block we've duplicated.
>                  Removing the most common predecessor may make it to be
> diff --git a/gcc/tracer.h b/gcc/tracer.h
> new file mode 100644
> index 0000000..cd1792a
> --- /dev/null
> +++ b/gcc/tracer.h
> @@ -0,0 +1,26 @@
> +/* Header file for Tracer.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_TRACER_H
> +#define GCC_TRACER_H
> +
> +extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
> +extern bool ignore_bb_p (const_basic_block bb);
> +
> +#endif /* GCC_TRACER_H */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 49e22a9..6963acc 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -390,6 +390,7 @@ extern gimple_opt_pass *make_pass_tree_loop_done
> (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_path_split (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
> diff --git a/gcc/tree-ssa-path-split.c b/gcc/tree-ssa-path-split.c
> new file mode 100644
> index 0000000..9f61bd4
> --- /dev/null
> +++ b/gcc/tree-ssa-path-split.c
> @@ -0,0 +1,275 @@
> +/* Support routines for Path Splitting.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3, or (at your option)
> + any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tree-pass.h"
> +#include "cfganal.h"
> +#include "cfgloop.h"
> +#include "gimple-iterator.h"
> +#include "tracer.h"
> +
> +/* Given LATCH, the latch block in a loop, see if the shape of the
> +   path reaching LATCH is suitable for path splitting.  If so, return
> +   the block that will be duplicated into its predecessor paths.  Else
> +   return NULL.  */
> +
> +static basic_block
> +find_block_to_duplicate_for_path_splitting (basic_block latch)
> +{
> +  /* We should have simple latches at this point.  So the latch should
> +     have a single successor.  This implies the predecessor of the latch
> +     likely has the loop exit.  And it's that predecessor we're most
> +     interested in. To keep things simple, we're going to require that
> +     the latch have a single predecessor too.  */
> +  if (single_succ_p (latch) && single_pred_p (latch))
> +    {
> +      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
> +      gcc_assert (single_pred_edge (latch)->src == bb);
> +
> +      /* If BB has been marked as not to be duplicated, then honor that
> +        request.  */
> +      if (ignore_bb_p (bb))
> +       return NULL;
> +
> +      gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb));
> +      /* The immediate dominator of the latch must end in a conditional.
> */
> +      if (!last || gimple_code (last) != GIMPLE_COND)
> +       return NULL;
> +
> +      /* We're hoping that BB is a join point for an IF-THEN-ELSE diamond
> +        region.  Verify that it is.
> +
> +        First, verify that BB has two predecessors (each arm of the
> +        IF-THEN-ELSE) and two successors (the latch and exit).  */
> +      if (EDGE_COUNT (bb->preds) == 2 && EDGE_COUNT (bb->succs) == 2)
> +       {
> +         /* Now verify that BB's immediate dominator ends in a
> +            conditional as well.  */
> +         basic_block bb_idom = get_immediate_dominator (CDI_DOMINATORS,
> bb);
> +         gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb_idom));
> +         if (!last || gimple_code (last) != GIMPLE_COND)
> +           return NULL;
> +
> +         /* And that BB's immediate dominator's successors are the
> +            the predecessors of BB.  */
> +         if (!find_edge (bb_idom, EDGE_PRED (bb, 0)->src)
> +             || !find_edge (bb_idom, EDGE_PRED (bb, 1)->src))
> +           return NULL;
> +
> +         /* So at this point we have a simple diamond for an IF-THEN-ELSE
> +            construct starting at BB_IDOM, with a join point at BB.  BB
> +            pass control outside the loop or to the loop latch.
> +
> +            We're going to want to create two duplicates of BB, one for
> +            each successor of BB_IDOM.  */
> +         return bb;
> +       }
> +    }
> +  return NULL;
> +}
> +
> +/* Return TRUE if BB is a reasonable block to duplicate by examining
> +   its size, false otherwise.  BB will always be a loop latch block.
> +
> +   Should this use the same tests as we do for jump threading?  */
> +
> +static bool
> +is_feasible_trace (basic_block bb)
> +{
> +  int num_stmt = 0;
> +  gimple_stmt_iterator gsi;
> +
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      gimple *stmt = gsi_stmt (gsi);
> +      if (!is_gimple_debug (stmt))
> +       num_stmt++;
> +    }
> +
> +  /* We may want to limit how many statements we copy.  */
> +  if (num_stmt > 1)
> +    return true;
> +
> +  return false;
> +}
> +
> +/* If the immediate dominator of the latch of the loop is
> +   block with conditional branch, then the loop latch  is
> +   duplicated to its predecessors path preserving the SSA
> +   semantics.
> +
> +   CFG before transformation.
> +
> +              2
> +              |
> +              |
> +        +---->3
> +        |    / \
> +        |   /   \
> +        |  4     5
> +        |   \   /
> +        |    \ /
> +        |     6
> +        |    / \
> +        |   /   \
> +        |  8     7
> +        |  |     |
> +        ---+     E
> +
> +
> +
> +    Block 8 is the latch.  We're going to make copies of block 6 (9 & 10)
> +    and wire things up so they look like this:
> +
> +              2
> +              |
> +              |
> +        +---->3
> +        |    / \
> +        |   /   \
> +        |  4     5
> +        |  |     |
> +        |  |     |
> +        |  9    10
> +        |  |\   /|
> +        |  | \ / |
> +        |  |  7  |
> +        |  |  |  |
> +        |  |  E  |
> +        |  |     |
> +        |   \   /
> +        |    \ /
> +        +-----8
> +
> +
> +    Blocks 9 and 10 will get merged into blocks 4 & 5 respectively which
> +    enables CSE, DCE and other optimizations to occur on a larger block
> +    of code.   */
> +
> +static bool
> +perform_path_splitting ()
> +{
> +  bool changed = false;
> +  loop_p loop;
> +
> +  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
> +  initialize_original_copy_tables ();
> +  calculate_dominance_info (CDI_DOMINATORS);
> +
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +    {
> +      /* See if there is a block that we can duplicate to split the
> +        path to the loop latch.  */
> +      basic_block bb = find_block_to_duplicate_for_path_splitting
> (loop->latch);
> +
> +      /* BB is the merge point for an IF-THEN-ELSE we want to transform.
> +
> +        Essentially we want to create two duplicates of BB and append
> +        a duplicate to the THEN and ELSE clauses.  This will split the
> +        path leading to the latch.  BB will be unreachable and removed.  */
> +      if (bb && is_feasible_trace (bb))
> +       {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            fprintf (dump_file,
> +                    "Duplicating join block %d into predecessor paths\n",
> +                    bb->index);
> +         basic_block pred0 = EDGE_PRED (bb, 0)->src;
> +         basic_block pred1 = EDGE_PRED (bb, 1)->src;
> +         transform_duplicate (pred0, bb);
> +         transform_duplicate (pred1, bb);
> +         changed = true;
> +       }
> +    }
> +
> +  loop_optimizer_finalize ();
> +  free_original_copy_tables ();
> +  return changed;
> +}
> +
> +/* Main entry point for path splitting.  Returns TODO_cleanup_cfg if any
> +   paths where split, otherwise return zero.  */
> +
> +static unsigned int
> +execute_path_split (void)
> +{
> +  /* If we don't have at least 2 real blocks and backedges in the
> +     CFG, then there's no point in trying to perform path splitting.  */
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +      || !mark_dfs_back_edges ())
> +    return 0;
> +
> +  bool changed = perform_path_splitting();
> +  if (changed)
> +    {
> +      free_dominance_info (CDI_DOMINATORS);
> +      /* If we changed the CFG schedule loops for fixup by cleanup_cfg.  */
> +      if (current_loops)
> +       loops_state_set (LOOPS_NEED_FIXUP);
> +    }
> +
> +  return changed ? TODO_cleanup_cfg : 0;
> +}
> +
> +static bool
> +gate_path_split (void)
> +{
> +  return flag_tree_path_split;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_path_split =
> +{
> +  GIMPLE_PASS, /* type */
> +  "path-split", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_TREE_PATH_SPLIT, /* tv_id */
> +  PROP_ssa, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa, /* todo_flags_finish */
> +};
> +
> +class pass_path_split : public gimple_opt_pass
> +{
> +   public:
> +    pass_path_split (gcc::context *ctxt)
> +      : gimple_opt_pass (pass_data_path_split, ctxt)
> +    {}
> +   /* opt_pass methods: */
> +   opt_pass * clone () { return new pass_path_split (m_ctxt); }
> +   virtual bool gate (function *) { return gate_path_split (); }
> +   virtual unsigned int execute (function *) { return execute_path_split
> (); }
> +
> +}; // class pass_path_split
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_path_split (gcc::context *ctxt)
> +{
> +  return new pass_path_split (ctxt);
> +}
>
Ajit Kumar Agarwal Nov. 13, 2015, 1:18 p.m. UTC | #2
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Friday, November 13, 2015 3:28 AM
To: Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 11/12/2015 11:32 AM, Jeff Law wrote:
> On 11/12/2015 10:05 AM, Jeff Law wrote:
>>> But IIRC you mentioned it should enable vectorization or so?  In
>>> this case that's obviously too late.
>> The opposite.  Path splitting interferes with if-conversion &
>> vectorization.  Path splitting mucks up the CFG enough that
>> if-conversion won't fire and as a result vectorization is inhibited.
>> It also creates multi-latch loops, which isn't a great situation either.
>>
>> It *may* be the case that dropping it that far down in the pipeline
>> and making the modifications necessary to handle simple latches may
>> in turn make the path splitting code play better with if-conversion
>> and vectorization and avoid creation of multi-latch loops.  At least
>> that's how it looks on paper when I draw out the CFG manipulations.
>>
>> I'll do some experiments.
> It doesn't look too terrible to revamp the recognition code to work
> later in the pipeline with simple latches.  Sadly that doesn't seem to
> have fixed the bad interactions with if-conversion.
>
> *But* that does open up the possibility of moving the path splitting
> pass even deeper in the pipeline -- in particular we can move it past
> the vectorizer.  Which may be a win.
>
> So the big question is whether or not we'll still see enough benefits
> from having it so late in the pipeline.  It's still early enough that
> we get DOM, VRP, reassoc, forwprop, phiopt, etc.
>
> Ajit, I'll pass along an updated patch after doing some more testing.


Hello Jeff:
>> So here's what I'm working with.  It runs after the vectorizer now.
>>
>> Ajit, if you could benchmark this it would be greatly appreciated.  I know you saw significant improvements on one or more benchmarks in the past.  It'd be good to know that the updated placement of the pass doesn't invalidate the gains you saw.
>>
>> With the updated pass placement, we don't have to worry about switching the pass on/off based on whether or not the vectorizer & if-conversion are enabled.  So that hackery is gone.
>>
>> I think I've beefed up the test to identify the diamond patterns we want so that it's stricter in what we accept.  The call to ignore_bb_p is a part of that test so that we're actually looking at the right block in a world where we're doing this transformation with simple latches.
>>
>> I've also put a graphical comment before perform_path_splitting which hopefully shows the CFG transformation we're making a bit clearer.
>>
>> This bootstraps and regression tests cleanly on x86_64-linux-gnu.

Thank you for the inputs.  I will build the compiler, run the SPEC CPU 2000 benchmarks for the x86 target, and respond back as soon as the run is done.
I will also run the EEMBC/Mibench benchmarks for the Microblaze target.

I will let you know the results as soon as they are available.

Thanks & Regards
Ajit
Jeff Law Nov. 13, 2015, 4:26 p.m. UTC | #3
On 11/13/2015 03:13 AM, Richard Biener wrote:

>> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
>> index 34d2356..6613e83 100644
>> --- a/gcc/Makefile.in
>> +++ b/gcc/Makefile.in
>> @@ -1474,6 +1474,7 @@ OBJS = \
>>          tree-ssa-loop.o \
>>          tree-ssa-math-opts.o \
>>          tree-ssa-operands.o \
>> +       tree-ssa-path-split.o \
>
> gimple-ssa-path-split please.
Agreed.   I'll make that change for Ajit.


>
>>          tree-ssa-phionlycprop.o \
>>          tree-ssa-phiopt.o \
>>          tree-ssa-phiprop.o \
>> diff --git a/gcc/common.opt b/gcc/common.opt
>> index 757ce85..3e946ca 100644
>> --- a/gcc/common.opt
>> +++ b/gcc/common.opt
>> @@ -2403,6 +2403,10 @@ ftree-vrp
>>   Common Report Var(flag_tree_vrp) Init(0) Optimization
>>   Perform Value Range Propagation on trees.
>>
>> +ftree-path-split
>
> fsplit-paths
And this plus related variable name fixes and such.


>>
>> +@item -ftree-path-split
>> +@opindex ftree-path-split
>> +Perform Path Splitting on trees.  When the two execution paths of a
>> +if-then-else merge at the loop latch node, try to duplicate the
>> +merge node into two paths. This is enabled by default at @option{-O2}
>> +and above.
>> +
>
> I think if we go into the detail of the transform we should mention the
> effective result (creating a loop nest with disambiguation figuring out
> which is the "better" inner loop).
It no longer creates a loop nest.  The overall shape of the CFG is 
maintained.  ie, we still have a single simple latch for the loop.  The 
blocks we create are internal to the loop.

I always struggle with the right level at which to document these 
options.   I'll take a look at this for Ajit.

BTW Do we have an API for indicating that new blocks have been added to 
a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.


>
>>   @item -fsplit-ivs-in-unroller
>>   @opindex fsplit-ivs-in-unroller
>>   Enables expression of values of induction variables in later iterations
>> diff --git a/gcc/opts.c b/gcc/opts.c
>> index 9a3fbb3..9a0b27c 100644
>> --- a/gcc/opts.c
>> +++ b/gcc/opts.c
>> @@ -509,6 +509,7 @@ static const struct default_options
>> default_options_table[] =
>>       { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1
>> },
>>       { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>>       { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
>> +    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },
>
> Is this transform a good idea for -Os?
In general, no because of the block duplication.

jeff
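
[A sketch of how that could be expressed in the pass itself -- hypothetical, not in the patch as posted; it assumes GCC's usual optimize_function_for_size_p predicate from predict.h:]

/* Hypothetical guard, not part of the submitted patch: path splitting
   duplicates the join block, so skip it for functions being optimized
   for size.  */
static bool
gate_path_split (void)
{
  return flag_tree_path_split
	 && !optimize_function_for_size_p (cfun);
}
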
Richard Biener Nov. 13, 2015, 6:09 p.m. UTC | #4
On November 13, 2015 5:26:01 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 11/13/2015 03:13 AM, Richard Biener wrote:
>
>>> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
>>> index 34d2356..6613e83 100644
>>> --- a/gcc/Makefile.in
>>> +++ b/gcc/Makefile.in
>>> @@ -1474,6 +1474,7 @@ OBJS = \
>>>          tree-ssa-loop.o \
>>>          tree-ssa-math-opts.o \
>>>          tree-ssa-operands.o \
>>> +       tree-ssa-path-split.o \
>>
>> gimple-ssa-path-split please.
>Agreed.   I'll make that change for Ajit.
>
>
>>
>>>          tree-ssa-phionlycprop.o \
>>>          tree-ssa-phiopt.o \
>>>          tree-ssa-phiprop.o \
>>> diff --git a/gcc/common.opt b/gcc/common.opt
>>> index 757ce85..3e946ca 100644
>>> --- a/gcc/common.opt
>>> +++ b/gcc/common.opt
>>> @@ -2403,6 +2403,10 @@ ftree-vrp
>>>   Common Report Var(flag_tree_vrp) Init(0) Optimization
>>>   Perform Value Range Propagation on trees.
>>>
>>> +ftree-path-split
>>
>> fsplit-paths
>And this plus related variable name fixes and such.
>
>
>>>
>>> +@item -ftree-path-split
>>> +@opindex ftree-path-split
>>> +Perform Path Splitting on trees.  When the two execution paths of a
>>> +if-then-else merge at the loop latch node, try to duplicate the
>>> +merge node into two paths. This is enabled by default at @option{-O2}
>>> +and above.
>>> +
>>
>> I think if we go into the detail of the transform we should mention the
>> effective result (creating a loop nest with disambiguation figuring out
>> which is the "better" inner loop).
>It no longer creates a loop nest.  The overall shape of the CFG is
>maintained.  ie, we still have a single simple latch for the loop.  The
>blocks we create are internal to the loop.
>
>I always struggle with the right level at which to document these
>options.   I'll take a look at this for Ajit.
>
>BTW Do we have an API for indicating that new blocks have been added to
>a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.

Please. It's called add_to_loop or so.

Richard.

>
>>
>>>   @item -fsplit-ivs-in-unroller
>>>   @opindex fsplit-ivs-in-unroller
>>>   Enables expression of values of induction variables in later iterations
>>> diff --git a/gcc/opts.c b/gcc/opts.c
>>> index 9a3fbb3..9a0b27c 100644
>>> --- a/gcc/opts.c
>>> +++ b/gcc/opts.c
>>> @@ -509,6 +509,7 @@ static const struct default_options
>>> default_options_table[] =
>>>       { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1 },
>>>       { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>>>       { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
>>> +    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },
>>
>> Is this transform a good idea for -Os?
>In general, no because of the block duplication.
>
>jeff
Jeff Law Nov. 13, 2015, 8:23 p.m. UTC | #5
On 11/13/2015 11:09 AM, Richard Biener wrote:

>>
>> BTW Do we have an API for indicating that new blocks have been added to
>> a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.
>
> Please. It's called add_to_loop or so.
Haha, the block duplication code was handling this already.  So in 
theory I can just drop the LOOPS_NEED_FIXUP completely.  Testing now.

jeff
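
[For reference, an illustration that is not part of the patch: cfgloop.h declares add_bb_to_loop for recording a new block in a loop, and duplicate_block -- which the patch's transform_duplicate helper calls -- already does that for the copy, which is why the explicit LOOPS_NEED_FIXUP request can go away.  A sketch of the explicit form a hand-built copy would need, under those assumptions:]

/* Sketch only (assumes add_bb_to_loop from cfgloop.h and the
   transform_duplicate helper added by the patch).  A pass that built
   the copy by hand, rather than via duplicate_block, would have to
   keep the loop tree valid itself instead of setting
   LOOPS_NEED_FIXUP.  */

static basic_block
split_path_copy (basic_block pred, basic_block join)
{
  basic_block copy = transform_duplicate (pred, join);

  /* duplicate_block has already recorded COPY in JOIN's loop; a
     hand-rolled copy would need the explicit call below.  */
  if (copy->loop_father == NULL)
    add_bb_to_loop (copy, join->loop_father);

  return copy;
}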

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 34d2356..6613e83 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1474,6 +1474,7 @@  OBJS = \
 	tree-ssa-loop.o \
 	tree-ssa-math-opts.o \
 	tree-ssa-operands.o \
+	tree-ssa-path-split.o \
 	tree-ssa-phionlycprop.o \
 	tree-ssa-phiopt.o \
 	tree-ssa-phiprop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 757ce85..3e946ca 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2403,6 +2403,10 @@  ftree-vrp
 Common Report Var(flag_tree_vrp) Init(0) Optimization
 Perform Value Range Propagation on trees.
 
+ftree-path-split
+Common Report Var(flag_tree_path_split) Init(0) Optimization
+Perform Path Splitting on trees for loop backedges.
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 213a9d0..b1e95da 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -354,6 +354,7 @@  Objective-C and Objective-C++ Dialects}.
 -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
 -fdump-tree-vtable-verify @gol
 -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
+-fdump-tree-path-split@r{[}-@var{n}@r{]} @gol
 -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
 -fdump-final-insns=@var{file} @gol
 -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
@@ -462,7 +463,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp @gol -ftree-path-split @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -7169,6 +7170,11 @@  output on to @file{stderr}. If two conflicting dump filenames are
 given for the same pass, then the latter option overrides the earlier
 one.
 
+@item path-split
+@opindex fdump-tree-path-split
+Dump each function after path splitting.  The file name is made by
+appending @file{.path-split} to the source file name.
+
 @item all
 Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
 and @option{lineno}.
@@ -7811,6 +7817,7 @@  also turns on the following optimization flags:
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-pre @gol
 -ftree-vrp @gol
+-ftree-path-split @gol
 -fipa-ra}
 
 Please note the warning under @option{-fgcse} about
@@ -8819,7 +8826,7 @@  currently enabled, but may be enabled by @option{-O2} in the future.
 
 @item -ftree-sink
 @opindex ftree-sink
-Perform forward store motion  on trees.  This flag is
+Perform forward store motion on trees.  This flag is
 enabled by default at @option{-O} and higher.
 
 @item -ftree-bit-ccp
@@ -9125,6 +9132,13 @@  enabled by default at @option{-O2} and higher.  Null pointer check
 elimination is only done if @option{-fdelete-null-pointer-checks} is
 enabled.
 
+@item -ftree-path-split
+@opindex ftree-path-split
+Perform Path Splitting on trees.  When the two execution paths of a
+if-then-else merge at the loop latch node, try to duplicate the
+merge node into two paths. This is enabled by default at @option{-O2}
+and above.
+
 @item -fsplit-ivs-in-unroller
 @opindex fsplit-ivs-in-unroller
 Enables expression of values of induction variables in later iterations
diff --git a/gcc/opts.c b/gcc/opts.c
index 9a3fbb3..9a0b27c 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -509,6 +509,7 @@  static const struct default_options default_options_table[] =
     { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_path_split, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..4c9ef5f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -274,6 +274,7 @@  along with GCC; see the file COPYING3.  If not see
       POP_INSERT_PASSES ()
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
+      NEXT_PASS (pass_path_split);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_reassoc);
       NEXT_PASS (pass_strength_reduction);
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
new file mode 100644
index 0000000..c7e9515
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/path-split-1.c
@@ -0,0 +1,67 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-path-split-details " } */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#define RGBMAX 255
+
+int
+test()
+{
+  int i, Pels;
+  unsigned char sum = 0;
+  unsigned char xr, xg, xb;
+  unsigned char xc, xm, xy, xk;
+  unsigned char *ReadPtr, *EritePtr;
+
+  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+
+  for (i = 0; i < 100;i++)
+     {
+       ReadPtr[i] = 100 - i;
+     }
+
+  for (i = 0; i < 100; i++)
+     {
+       xr = *ReadPtr++;
+       xg = *ReadPtr++;
+       xb = *ReadPtr++;
+
+       xc = (unsigned char) (RGBMAX - xr);
+       xm = (unsigned char) (RGBMAX - xg);
+       xy = (unsigned char) (RGBMAX - xb);
+
+       if (xc < xm)
+         {
+           xk = (unsigned char) (xc < xy ? xc : xy);
+         }
+       else
+        {
+          xk = (unsigned char) (xm < xy ? xm : xy);
+        }
+
+       xc = (unsigned char) (xc - xk);
+       xm = (unsigned char) (xm - xk);
+       xy = (unsigned char) (xy - xk);
+
+       *EritePtr++ = xc;
+       *EritePtr++ = xm;
+       *EritePtr++ = xy;
+       *EritePtr++ = xk;
+       sum += *EritePtr;
+    }
+  return sum;
+}
+
+int
+main()
+{
+  if (test() != 33)
+    abort();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Duplicating join block" "path-split" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index b429faf..3dba68b 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -300,3 +300,4 @@  DEFTIMEVAR (TV_LINK		     , "link JIT code")
 DEFTIMEVAR (TV_LOAD		     , "load JIT result")
 DEFTIMEVAR (TV_JIT_ACQUIRING_MUTEX   , "acquiring JIT mutex")
 DEFTIMEVAR (TV_JIT_CLIENT_CODE   , "JIT client code")
+DEFTIMEVAR (TV_TREE_PATH_SPLIT  , "tree path split")
diff --git a/gcc/tracer.c b/gcc/tracer.c
index 941dc20..c7b5150 100644
--- a/gcc/tracer.c
+++ b/gcc/tracer.c
@@ -51,9 +51,9 @@ 
 #include "tree-inline.h"
 #include "cfgloop.h"
 #include "fibonacci_heap.h"
+#include "tracer.h"
 
 static int count_insns (basic_block);
-static bool ignore_bb_p (const_basic_block);
 static bool better_p (const_edge, const_edge);
 static edge find_best_successor (basic_block);
 static edge find_best_predecessor (basic_block);
@@ -85,7 +85,7 @@  bb_seen_p (basic_block bb)
 }
 
 /* Return true if we should ignore the basic block for purposes of tracing.  */
-static bool
+bool
 ignore_bb_p (const_basic_block bb)
 {
   if (bb->index < NUM_FIXED_BLOCKS)
@@ -226,6 +226,24 @@  find_trace (basic_block bb, basic_block *trace)
   return i;
 }
 
+/* Duplicate block BB2, placing it after BB in the CFG.  Return the
+   newly created block.  */
+basic_block
+transform_duplicate (basic_block bb, basic_block bb2)
+{
+  edge e;
+  basic_block copy;
+
+  e = find_edge (bb, bb2);
+
+  copy = duplicate_block (bb2, e, bb);
+  flush_pending_stmts (e);
+
+  add_phi_args_after_copy (&copy, 1, NULL);
+
+  return (copy);
+}
+
 /* Look for basic blocks in frequency order, construct traces and tail duplicate
    if profitable.  */
 
@@ -321,17 +339,8 @@  tail_duplicate (void)
 		 entries or at least rotate the loop.  */
 	      && bb2->loop_father->header != bb2)
 	    {
-	      edge e;
-	      basic_block copy;
-
-	      nduplicated += counts [bb2->index];
-
-	      e = find_edge (bb, bb2);
-
-	      copy = duplicate_block (bb2, e, bb);
-	      flush_pending_stmts (e);
-
-	      add_phi_args_after_copy (&copy, 1, NULL);
+              nduplicated += counts [bb2->index];
+              basic_block copy = transform_duplicate (bb, bb2);
 
 	      /* Reconsider the original copy of block we've duplicated.
 	         Removing the most common predecessor may make it to be
diff --git a/gcc/tracer.h b/gcc/tracer.h
new file mode 100644
index 0000000..cd1792a
--- /dev/null
+++ b/gcc/tracer.h
@@ -0,0 +1,26 @@ 
+/* Header file for Tracer.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TRACER_H
+#define GCC_TRACER_H
+
+extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
+extern bool ignore_bb_p (const_basic_block bb);
+
+#endif /* GCC_TRACER_H */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..6963acc 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -390,6 +390,7 @@  extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_path_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-path-split.c b/gcc/tree-ssa-path-split.c
new file mode 100644
index 0000000..9f61bd4
--- /dev/null
+++ b/gcc/tree-ssa-path-split.c
@@ -0,0 +1,275 @@ 
+/* Support routines for Path Splitting.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
+
+ This file is part of GCC.
+
+ GCC is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "cfganal.h"
+#include "cfgloop.h"
+#include "gimple-iterator.h"
+#include "tracer.h"
+
+/* Given LATCH, the latch block in a loop, see if the shape of the
+   path reaching LATCH is suitable for path splitting.  If so, return
+   the block that will be duplicated into its predecessor paths.  Else
+   return NULL.  */
+
+static basic_block
+find_block_to_duplicate_for_path_splitting (basic_block latch)
+{
+  /* We should have simple latches at this point.  So the latch should
+     have a single successor.  This implies the predecessor of the latch
+     likely has the loop exit.  And it's that predecessor we're most
+     interested in. To keep things simple, we're going to require that
+     the latch have a single predecessor too.  */
+  if (single_succ_p (latch) && single_pred_p (latch))
+    {
+      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
+      gcc_assert (single_pred_edge (latch)->src == bb);
+
+      /* If BB has been marked as not to be duplicated, then honor that
+	 request.  */
+      if (ignore_bb_p (bb))
+	return NULL;
+
+      gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb));
+      /* The immediate dominator of the latch must end in a conditional.  */
+      if (!last || gimple_code (last) != GIMPLE_COND)
+	return NULL;
+
+      /* We're hoping that BB is a join point for an IF-THEN-ELSE diamond
+	 region.  Verify that it is.
+
+	 First, verify that BB has two predecessors (each arm of the
+	 IF-THEN-ELSE) and two successors (the latch and exit).  */
+      if (EDGE_COUNT (bb->preds) == 2 && EDGE_COUNT (bb->succs) == 2)
+	{
+	  /* Now verify that BB's immediate dominator ends in a
+	     conditional as well.  */
+	  basic_block bb_idom = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb_idom));
+	  if (!last || gimple_code (last) != GIMPLE_COND)
+	    return NULL;
+
+	  /* And that BB's immediate dominator's successors are the
+	     the predecessors of BB.  */
+	  if (!find_edge (bb_idom, EDGE_PRED (bb, 0)->src)
+	      || !find_edge (bb_idom, EDGE_PRED (bb, 1)->src))
+	    return NULL;
+
+	  /* So at this point we have a simple diamond for an IF-THEN-ELSE
+	     construct starting at BB_IDOM, with a join point at BB.  BB
+	     pass control outside the loop or to the loop latch.
+
+	     We're going to want to create two duplicates of BB, one for
+	     each successor of BB_IDOM.  */
+	  return bb;
+	}
+    }
+  return NULL;
+}
+
+/* Return TRUE if BB is a reasonable block to duplicate by examining
+   its size, false otherwise.  BB will always be a loop latch block.
+
+   Should this use the same tests as we do for jump threading?  */
+
+static bool
+is_feasible_trace (basic_block bb)
+{
+  int num_stmt = 0;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (!is_gimple_debug (stmt))
+	num_stmt++;
+    }
+
+  /* We may want to limit how many statements we copy.  */
+  if (num_stmt > 1)
+    return true;
+
+  return false;
+}
+
+/* If the immediate dominator of the latch of the loop is
+   block with conditional branch, then the loop latch  is
+   duplicated to its predecessors path preserving the SSA
+   semantics.
+
+   CFG before transformation.
+
+              2
+              |
+              |
+        +---->3
+        |    / \
+        |   /   \
+        |  4     5
+        |   \   /
+        |    \ /
+        |     6
+        |    / \
+        |   /   \
+        |  8     7
+        |  |     |
+        ---+     E
+    
+
+
+    Block 8 is the latch.  We're going to make copies of block 6 (9 & 10)
+    and wire things up so they look like this:
+
+              2
+              |
+              |
+        +---->3
+        |    / \
+        |   /   \
+        |  4     5
+        |  |     |
+        |  |     |
+        |  9    10
+        |  |\   /|
+        |  | \ / |
+        |  |  7  |
+        |  |  |  |
+        |  |  E  |
+        |  |     |
+        |   \   /
+        |    \ /
+        +-----8 
+
+
+    Blocks 9 and 10 will get merged into blocks 4 & 5 respectively which
+    enables CSE, DCE and other optimizations to occur on a larger block
+    of code.   */
+
+static bool
+perform_path_splitting ()
+{
+  bool changed = false;
+  loop_p loop;
+
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+  initialize_original_copy_tables ();
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      /* See if there is a block that we can duplicate to split the
+	 path to the loop latch.  */
+      basic_block bb = find_block_to_duplicate_for_path_splitting (loop->latch);
+
+      /* BB is the merge point for an IF-THEN-ELSE we want to transform.
+
+	 Essentially we want to create two duplicates of BB and append
+	 a duplicate to the THEN and ELSE clauses.  This will split the
+	 path leading to the latch.  BB will be unreachable and removed.  */
+      if (bb && is_feasible_trace (bb))
+	{
+          if (dump_file && (dump_flags & TDF_DETAILS)) 
+            fprintf (dump_file,
+		     "Duplicating join block %d into predecessor paths\n",
+		     bb->index);
+	  basic_block pred0 = EDGE_PRED (bb, 0)->src;
+	  basic_block pred1 = EDGE_PRED (bb, 1)->src;
+	  transform_duplicate (pred0, bb);
+	  transform_duplicate (pred1, bb);
+	  changed = true;
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_original_copy_tables ();
+  return changed;
+}
+
+/* Main entry point for path splitting.  Returns TODO_cleanup_cfg if any
+   paths where split, otherwise return zero.  */
+
+static unsigned int
+execute_path_split (void)
+{
+  /* If we don't have at least 2 real blocks and backedges in the
+     CFG, then there's no point in trying to perform path splitting.  */
+  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+      || !mark_dfs_back_edges ())
+    return 0;
+
+  bool changed = perform_path_splitting();
+  if (changed)
+    {
+      free_dominance_info (CDI_DOMINATORS);
+      /* If we changed the CFG schedule loops for fixup by cleanup_cfg.  */
+      if (current_loops)
+	loops_state_set (LOOPS_NEED_FIXUP);
+    }
+
+  return changed ? TODO_cleanup_cfg : 0;
+}
+
+static bool
+gate_path_split (void)
+{
+  return flag_tree_path_split;
+}
+
+namespace {
+
+const pass_data pass_data_path_split =
+{
+  GIMPLE_PASS, /* type */
+  "path-split", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_TREE_PATH_SPLIT, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_path_split : public gimple_opt_pass
+{
+   public:
+    pass_path_split (gcc::context *ctxt)
+      : gimple_opt_pass (pass_data_path_split, ctxt)
+    {}
+   /* opt_pass methods: */
+   opt_pass * clone () { return new pass_path_split (m_ctxt); }
+   virtual bool gate (function *) { return gate_path_split (); }
+   virtual unsigned int execute (function *) { return execute_path_split (); }
+
+}; // class pass_path_split
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_path_split (gcc::context *ctxt)
+{
+  return new pass_path_split (ctxt);
+}