diff mbox

[tree-optimization]: Add new path Splitting pass on tree ssa representation

Message ID 564673DA.3020403@redhat.com
State New

Commit Message

Jeff Law Nov. 13, 2015, 11:35 p.m. UTC
On 11/13/2015 01:23 PM, Jeff Law wrote:
> On 11/13/2015 11:09 AM, Richard Biener wrote:
>
>>>
>>> BTW Do we have an API for indicating that new blocks have been added to
>>>
>>> a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.
>>
>> Please. It's called add_to_loop or so.
> Haha, the block duplication code was handling this already.  So in
> theory I can just drop the LOOPS_NEED_FIXUP completely.  Testing now.
>
> jeff
>
Attached is the committed patch for path splitting.  As noted above, we 
didn't need the LOOPS_NEED_FIXUP in the final version, so that wart is 
gone :-)

I do find myself wondering if this can/should be generalized beyond just 
paths heading to loop backedges.  However to do so I think we'd need to 
be able to undo this transformation reliably and we'd need some 
heuristics when to duplicate to expose the redundancy vs rely on PRE 
techniques and jump threading.  I vaguely remember a paper which touched 
on these topics, but I can't seem to find it.

Anyway, bootstrapped and regression tested on x86_64-linux-gnu. 
Installed on the trunk.
commit c1891376e5dcc99ad8be2d22f9551c03f9bb2729
Author: Jeff Law <law@redhat.com>
Date:   Fri Nov 13 16:29:34 2015 -0700

    [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
    representation
    
    	* Makefile.in (OBJS): Add gimple-ssa-split-paths.o
    	* common.opt (-fsplit-paths): New flag controlling path splitting.
    	* doc/invoke.texi (fsplit-paths): Document.
    	* opts.c (default_options_table): Add -fsplit-paths to -O2.
    	* passes.def: Add split_paths pass.
    	* timevar.def (TV_SPLIT_PATHS): New timevar.
    	* tracer.c: Include "tracer.h"
    	(ignore_bb_p): No longer static.
    	(transform_duplicate): New function, broken out of tail_duplicate.
    	(tail_duplicate): Use transform_duplicate.
    	* tracer.h (ignore_bb_p): Declare
    	(transform_duplicate): Likewise.
    	* tree-pass.h (make_pass_split_paths): Declare.
    	* gimple-ssa-split-paths.c: New file.
    
    	* gcc.dg/tree-ssa/split-path-1.c: New test.

Comments

Tom de Vries Nov. 18, 2015, 7:44 a.m. UTC | #1
On 14/11/15 00:35, Jeff Law wrote:
> Anyway, bootstrapped and regression tested on x86_64-linux-gnu.
> Installed on the trunk.

>      [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>      representation
>
>      	* Makefile.in (OBJS): Add gimple-ssa-split-paths.o
>      	* common.opt (-fsplit-paths): New flag controlling path splitting.
>      	* doc/invoke.texi (fsplit-paths): Document.
>      	* opts.c (default_options_table): Add -fsplit-paths to -O2.
>      	* passes.def: Add split_paths pass.
>      	* timevar.def (TV_SPLIT_PATHS): New timevar.
>      	* tracer.c: Include "tracer.h"
>      	(ignore_bb_p): No longer static.
>      	(transform_duplicate): New function, broken out of tail_duplicate.
>      	(tail_duplicate): Use transform_duplicate.
>      	* tracer.h (ignore_bb_p): Declare
>      	(transform_duplicate): Likewise.
>      	* tree-pass.h (make_pass_split_paths): Declare.
>      	* gimple-ssa-split-paths.c: New file.
>
>      	* gcc.dg/tree-ssa/split-path-1.c: New test.

I've filed PR68402 - FAIL: gcc.dg/tree-ssa/split-path-1.c execution test 
with -m32.

Thanks,
- Tom
Ajit Kumar Agarwal Nov. 18, 2015, 2:24 p.m. UTC | #2
-----Original Message-----
From: Tom de Vries [mailto:Tom_deVries@mentor.com] 

Sent: Wednesday, November 18, 2015 1:14 PM
To: Jeff Law; Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 14/11/15 00:35, Jeff Law wrote:
> Anyway, bootstrapped and regression tested on x86_64-linux-gnu.
> Installed on the trunk.

>      [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>      representation
>
>      	* Makefile.in (OBJS): Add gimple-ssa-split-paths.o
>      	* common.opt (-fsplit-paths): New flag controlling path splitting.
>      	* doc/invoke.texi (fsplit-paths): Document.
>      	* opts.c (default_options_table): Add -fsplit-paths to -O2.
>      	* passes.def: Add split_paths pass.
>      	* timevar.def (TV_SPLIT_PATHS): New timevar.
>      	* tracer.c: Include "tracer.h"
>      	(ignore_bb_p): No longer static.
>      	(transform_duplicate): New function, broken out of tail_duplicate.
>      	(tail_duplicate): Use transform_duplicate.
>      	* tracer.h (ignore_bb_p): Declare
>      	(transform_duplicate): Likewise.
>      	* tree-pass.h (make_pass_split_paths): Declare.
>      	* gimple-ssa-split-paths.c: New file.
>
>      	* gcc.dg/tree-ssa/split-path-1.c: New test.

>> I've filed PR68402 - FAIL: gcc.dg/tree-ssa/split-path-1.c execution test with -m32.


I have fixed the above PR and the patch is submitted.

https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02217.html

Thanks & Regards
Ajit

Thanks,
- Tom
Richard Biener Dec. 3, 2015, 2:38 p.m. UTC | #3
On Sat, Nov 14, 2015 at 12:35 AM, Jeff Law <law@redhat.com> wrote:
> On 11/13/2015 01:23 PM, Jeff Law wrote:
>>
>> On 11/13/2015 11:09 AM, Richard Biener wrote:
>>
>>>>
>>>> BTW Do we have an API for indicating that new blocks have been added to
>>>>
>>>> a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.
>>>
>>>
>>> Please. It's called add_to_loop or so.
>>
>> Haha, the block duplication code was handling this already.  So in
>> theory I can just drop the LOOPS_NEED_FIXUP completely.  Testing now.
>>
>> jeff
>>
> Attached is the committed patch for path splitting.  As noted above, we
> didn't need the LOOPS_NEED_FIXUP in the final version, so that wart is gone
> :-)
>
> I do find myself wondering if this can/should be generalized beyond just
> paths heading to loop backedges.  However to do so I think we'd need to be
> able to undo this transformation reliably and we'd need some heuristics when
> to duplicate to expose the redundancy vs rely on PRE techniques and jump
> threading.  I vaguely remember a paper which touched on these topics, but I
> can't seem to find it.
>
> Anyway, bootstrapped and regression tested on x86_64-linux-gnu. Installed on
> the trunk.

This pass is now enabled by default with -Os but has no limits on the amount of
stmts it copies.  It also will make all loops with this shape have at least two
exits (if the resulting loop will be disambiguated the inner loop will
have two exits).
Having more than one exit will disable almost all loop optimizations after it.

The pass itself documents the transform it does but does zero to motivate it.

What's the benefit of this pass (apart from disrupting further optimizations)?

I can see a _single_ case where duplicating the latch will allow threading
one of the paths through the loop header to eliminate the original exit.  Then
disambiguation may create a nice nested loop out of this.  Of course that
is only profitable again if you know the remaining single exit of the inner
loop (exiting to the outer one) is executed infrequently (thus the inner loop
actually loops).

But no checks other than on the CFG shape exist (oh, it checks it will
at _least_ copy two stmts!).

Given the profitability constraints above (well, correct me if I am
wrong on these)
it looks like the whole transform should be done within the FSM threading
code which might be able to compute whether there will be an inner loop
with a single exit only.

I'm inclined to request the pass to be removed again or at least disabled by
default.

What closed source benchmark was this transform invented for?

Richard.

>
>
>
> commit c1891376e5dcc99ad8be2d22f9551c03f9bb2729
> Author: Jeff Law <law@redhat.com>
> Date:   Fri Nov 13 16:29:34 2015 -0700
>
>     [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>     representation
>
>         * Makefile.in (OBJS): Add gimple-ssa-split-paths.o
>         * common.opt (-fsplit-paths): New flag controlling path splitting.
>         * doc/invoke.texi (fsplit-paths): Document.
>         * opts.c (default_options_table): Add -fsplit-paths to -O2.
>         * passes.def: Add split_paths pass.
>         * timevar.def (TV_SPLIT_PATHS): New timevar.
>         * tracer.c: Include "tracer.h"
>         (ignore_bb_p): No longer static.
>         (transform_duplicate): New function, broken out of tail_duplicate.
>         (tail_duplicate): Use transform_duplicate.
>         * tracer.h (ignore_bb_p): Declare
>         (transform_duplicate): Likewise.
>         * tree-pass.h (make_pass_split_paths): Declare.
>         * gimple-ssa-split-paths.c: New file.
>
>         * gcc.dg/tree-ssa/split-path-1.c: New test.
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index dde2695..a7abe37 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,21 @@
> +2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
> +           Jeff Law  <law@redhat.com>
> +
> +       * Makefile.in (OBJS): Add gimple-ssa-split-paths.o
> +       * common.opt (-fsplit-paths): New flag controlling path splitting.
> +       * doc/invoke.texi (fsplit-paths): Document.
> +       * opts.c (default_options_table): Add -fsplit-paths to -O2.
> +       * passes.def: Add split_paths pass.
> +       * timevar.def (TV_SPLIT_PATHS): New timevar.
> +       * tracer.c: Include "tracer.h"
> +       (ignore_bb_p): No longer static.
> +       (transform_duplicate): New function, broken out of tail_duplicate.
> +       (tail_duplicate): Use transform_duplicate.
> +       * tracer.h (ignore_bb_p): Declare
> +       (transform_duplicate): Likewise.
> +       * tree-pass.h (make_pass_split_paths): Declare.
> +       * gimple-ssa-split-paths.c: New file.
> +
>  2015-11-13  Kai Tietz  <ktietz70@googlemail.com>
>             Marek Polacek  <polacek@redhat.com>
>             Jason Merrill  <jason@redhat.com>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index d3fd5e9..5c294df 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1277,6 +1277,7 @@ OBJS = \
>         gimple-pretty-print.o \
>         gimple-ssa-backprop.o \
>         gimple-ssa-isolate-paths.o \
> +       gimple-ssa-split-paths.o \
>         gimple-ssa-strength-reduction.o \
>         gimple-streamer-in.o \
>         gimple-streamer-out.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 757ce85..3eb520e 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2403,6 +2403,10 @@ ftree-vrp
>  Common Report Var(flag_tree_vrp) Init(0) Optimization
>  Perform Value Range Propagation on trees.
>
> +fsplit-paths
> +Common Report Var(flag_split_paths) Init(0) Optimization
> +Split paths leading to loop backedges.
> +
>  funit-at-a-time
>  Common Report Var(flag_unit_at_a_time) Init(1)
>  Compile whole compilation unit at a time.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index c18df98..eeb79e6 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -354,6 +354,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-vtable-verify @gol
>  -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
> +-fdump-tree-split-paths@r{[}-@var{n}@r{]} @gol
>  -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
>  -fdump-final-insns=@var{file} @gol
>  -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
> @@ -448,6 +449,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
>  -fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
>  -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
> +-fsplit-paths @gol
>  -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
>  -fstack-protector -fstack-protector-all -fstack-protector-strong @gol
>  -fstack-protector-explicit -fstdarg-opt -fstrict-aliasing @gol
> @@ -7171,6 +7173,11 @@ output on to @file{stderr}. If two conflicting dump filenames are
>  given for the same pass, then the latter option overrides the earlier
>  one.
>
> +@item split-paths
> +@opindex fdump-tree-split-paths
> +Dump each function after splitting paths to loop backedges.  The file
> +name is made by appending @file{.split-paths} to the source file name.
> +
>  @item all
>  Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
>  and @option{lineno}.
> @@ -7808,6 +7815,7 @@ also turns on the following optimization flags:
>  -frerun-cse-after-loop  @gol
>  -fsched-interblock  -fsched-spec @gol
>  -fschedule-insns  -fschedule-insns2 @gol
> +-fsplit-paths @gol
>  -fstrict-aliasing -fstrict-overflow @gol
>  -ftree-builtin-call-dce @gol
>  -ftree-switch-conversion -ftree-tail-merge @gol
> @@ -8821,7 +8829,7 @@ currently enabled, but may be enabled by @option{-O2} in the future.
>
>  @item -ftree-sink
>  @opindex ftree-sink
> -Perform forward store motion  on trees.  This flag is
> +Perform forward store motion on trees.  This flag is
>  enabled by default at @option{-O} and higher.
>
>  @item -ftree-bit-ccp
> @@ -9127,6 +9135,12 @@ enabled by default at @option{-O2} and higher.  Null pointer check
>  elimination is only done if @option{-fdelete-null-pointer-checks} is
>  enabled.
>
> +@item -fsplit-paths
> +@opindex fsplit-paths
> +Split paths leading to loop backedges.  This can improve dead code
> +elimination and common subexpression elimination.  This is enabled by
> +default at @option{-O2} and above.
> +
>  @item -fsplit-ivs-in-unroller
>  @opindex fsplit-ivs-in-unroller
>  Enables expression of values of induction variables in later iterations
> diff --git a/gcc/gimple-ssa-split-paths.c b/gcc/gimple-ssa-split-paths.c
> new file mode 100644
> index 0000000..602e916
> --- /dev/null
> +++ b/gcc/gimple-ssa-split-paths.c
> @@ -0,0 +1,270 @@
> +/* Support routines for Splitting Paths to loop backedges
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3, or (at your option)
> + any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tree-pass.h"
> +#include "cfganal.h"
> +#include "cfgloop.h"
> +#include "gimple-iterator.h"
> +#include "tracer.h"
> +
> +/* Given LATCH, the latch block in a loop, see if the shape of the
> +   path reaching LATCH is suitable for being split by duplication.
> +   If so, return the block that will be duplicated into its predecessor
> +   paths.  Else return NULL.  */
> +
> +static basic_block
> +find_block_to_duplicate_for_splitting_paths (basic_block latch)
> +{
> +  /* We should have simple latches at this point.  So the latch should
> +     have a single successor.  This implies the predecessor of the latch
> +     likely has the loop exit.  And it's that predecessor we're most
> +     interested in. To keep things simple, we're going to require that
> +     the latch have a single predecessor too.  */
> +  if (single_succ_p (latch) && single_pred_p (latch))
> +    {
> +      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
> +      gcc_assert (single_pred_edge (latch)->src == bb);
> +
> +      /* If BB has been marked as not to be duplicated, then honor that
> +        request.  */
> +      if (ignore_bb_p (bb))
> +       return NULL;
> +
> +      gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb));
> +      /* The immediate dominator of the latch must end in a conditional.  */
> +      if (!last || gimple_code (last) != GIMPLE_COND)
> +       return NULL;
> +
> +      /* We're hoping that BB is a join point for an IF-THEN-ELSE diamond
> +        region.  Verify that it is.
> +
> +        First, verify that BB has two predecessors (each arm of the
> +        IF-THEN-ELSE) and two successors (the latch and exit).  */
> +      if (EDGE_COUNT (bb->preds) == 2 && EDGE_COUNT (bb->succs) == 2)
> +       {
> +         /* Now verify that BB's immediate dominator ends in a
> +            conditional as well.  */
> +         basic_block bb_idom = get_immediate_dominator (CDI_DOMINATORS, bb);
> +         gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb_idom));
> +         if (!last || gimple_code (last) != GIMPLE_COND)
> +           return NULL;
> +
> +         /* And that BB's immediate dominator's successors are the
> +            predecessors of BB.  */
> +         if (!find_edge (bb_idom, EDGE_PRED (bb, 0)->src)
> +             || !find_edge (bb_idom, EDGE_PRED (bb, 1)->src))
> +           return NULL;
> +
> +         /* So at this point we have a simple diamond for an IF-THEN-ELSE
> +            construct starting at BB_IDOM, with a join point at BB.  BB
> +            passes control outside the loop or to the loop latch.
> +
> +            We're going to want to create two duplicates of BB, one for
> +            each successor of BB_IDOM.  */
> +         return bb;
> +       }
> +    }
> +  return NULL;
> +}
> +
> +/* Return TRUE if BB is a reasonable block to duplicate by examining
> +   its size, false otherwise.  BB will always be a loop latch block.
> +
> +   Should this use the same tests as we do for jump threading?  */
> +
> +static bool
> +is_feasible_trace (basic_block bb)
> +{
> +  int num_stmt = 0;
> +  gimple_stmt_iterator gsi;
> +
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      gimple *stmt = gsi_stmt (gsi);
> +      if (!is_gimple_debug (stmt))
> +       num_stmt++;
> +    }
> +
> +  /* We may want to limit how many statements we copy.  */
> +  if (num_stmt > 1)
> +    return true;
> +
> +  return false;
> +}
> +
> +/* If the immediate dominator of the loop latch ends in a
> +   conditional branch, then the latch's predecessor (the join
> +   block) is duplicated into each predecessor path, preserving
> +   the SSA semantics.
> +
> +   CFG before transformation.
> +
> +              2
> +              |
> +              |
> +        +---->3
> +        |    / \
> +        |   /   \
> +        |  4     5
> +        |   \   /
> +        |    \ /
> +        |     6
> +        |    / \
> +        |   /   \
> +        |  8     7
> +        |  |     |
> +        ---+     E
> +
> +
> +
> +    Block 8 is the latch.  We're going to make copies of block 6 (9 & 10)
> +    and wire things up so they look like this:
> +
> +              2
> +              |
> +              |
> +        +---->3
> +        |    / \
> +        |   /   \
> +        |  4     5
> +        |  |     |
> +        |  |     |
> +        |  9    10
> +        |  |\   /|
> +        |  | \ / |
> +        |  |  7  |
> +        |  |  |  |
> +        |  |  E  |
> +        |  |     |
> +        |   \   /
> +        |    \ /
> +        +-----8
> +
> +
> +    Blocks 9 and 10 will get merged into blocks 4 & 5 respectively which
> +    enables CSE, DCE and other optimizations to occur on a larger block
> +    of code.   */
> +
> +static bool
> +split_paths ()
> +{
> +  bool changed = false;
> +  loop_p loop;
> +
> +  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
> +  initialize_original_copy_tables ();
> +  calculate_dominance_info (CDI_DOMINATORS);
> +
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +    {
> +      /* See if there is a block that we can duplicate to split the
> +        path to the loop latch.  */
> +      basic_block bb = find_block_to_duplicate_for_splitting_paths (loop->latch);
> +
> +      /* BB is the merge point for an IF-THEN-ELSE we want to transform.
> +
> +        Essentially we want to create two duplicates of BB and append
> +        a duplicate to the THEN and ELSE clauses.  This will split the
> +        path leading to the latch.  BB will be unreachable and removed.  */
> +      if (bb && is_feasible_trace (bb))
> +       {
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           fprintf (dump_file,
> +                    "Duplicating join block %d into predecessor paths\n",
> +                    bb->index);
> +         basic_block pred0 = EDGE_PRED (bb, 0)->src;
> +         basic_block pred1 = EDGE_PRED (bb, 1)->src;
> +         transform_duplicate (pred0, bb);
> +         transform_duplicate (pred1, bb);
> +         changed = true;
> +       }
> +    }
> +
> +  loop_optimizer_finalize ();
> +  free_original_copy_tables ();
> +  return changed;
> +}
> +
> +/* Main entry point for splitting paths.  Returns TODO_cleanup_cfg if any
> +   paths were split, otherwise return zero.  */
> +
> +static unsigned int
> +execute_split_paths ()
> +{
> +  /* If we don't have at least 2 real blocks and backedges in the
> +     CFG, then there's no point in trying to perform path splitting.  */
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +      || !mark_dfs_back_edges ())
> +    return 0;
> +
> +  bool changed = split_paths();
> +  if (changed)
> +    free_dominance_info (CDI_DOMINATORS);
> +
> +  return changed ? TODO_cleanup_cfg : 0;
> +}
> +
> +static bool
> +gate_split_paths ()
> +{
> +  return flag_split_paths;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_split_paths =
> +{
> +  GIMPLE_PASS, /* type */
> +  "split-paths", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_SPLIT_PATHS, /* tv_id */
> +  PROP_ssa, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa, /* todo_flags_finish */
> +};
> +
> +class pass_split_paths : public gimple_opt_pass
> +{
> +   public:
> +    pass_split_paths (gcc::context *ctxt)
> +      : gimple_opt_pass (pass_data_split_paths, ctxt)
> +    {}
> +   /* opt_pass methods: */
> +   opt_pass * clone () { return new pass_split_paths (m_ctxt); }
> +   virtual bool gate (function *) { return gate_split_paths (); }
> +   virtual unsigned int execute (function *) { return execute_split_paths (); }
> +
> +}; // class pass_split_paths
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_split_paths (gcc::context *ctxt)
> +{
> +  return new pass_split_paths (ctxt);
> +}
> diff --git a/gcc/opts.c b/gcc/opts.c
> index 930ae43..be04cf5 100644
> --- a/gcc/opts.c
> +++ b/gcc/opts.c
> @@ -523,6 +523,7 @@ static const struct default_options default_options_table[] =
>      { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_fsplit_paths, NULL, 1 },
>
>      /* -O3 optimizations.  */
>      { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> diff --git a/gcc/passes.def b/gcc/passes.def
> index c0ab6b9..db822d3 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -274,6 +274,7 @@ along with GCC; see the file COPYING3.  If not see
>        POP_INSERT_PASSES ()
>        NEXT_PASS (pass_simduid_cleanup);
>        NEXT_PASS (pass_lower_vector_ssa);
> +      NEXT_PASS (pass_split_paths);
>        NEXT_PASS (pass_cse_reciprocals);
>        NEXT_PASS (pass_reassoc);
>        NEXT_PASS (pass_strength_reduction);
> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
> index 3301130..ee92aaf 100644
> --- a/gcc/testsuite/ChangeLog
> +++ b/gcc/testsuite/ChangeLog
> @@ -1,3 +1,8 @@
> +2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
> +            Jeff Law  <law@redhat.com>
> +
> +       * gcc.dg/tree-ssa/split-path-1.c: New test.
> +
>  2015-11-13  Nathan Sidwell  <nathan@codesourcery.com>
>
>         * c-c++-common/goacc/loop-auto-1.c: New.
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
> new file mode 100644
> index 0000000..1239892
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
> @@ -0,0 +1,67 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-split-paths-details " } */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +
> +#define RGBMAX 255
> +
> +int
> +test()
> +{
> +  int i, Pels;
> +  unsigned char sum = 0;
> +  unsigned char xr, xg, xb;
> +  unsigned char xc, xm, xy, xk;
> +  unsigned char *ReadPtr, *EritePtr;
> +
> +  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
> +
> +  for (i = 0; i < 100;i++)
> +     {
> +       ReadPtr[i] = 100 - i;
> +     }
> +
> +  for (i = 0; i < 100; i++)
> +     {
> +       xr = *ReadPtr++;
> +       xg = *ReadPtr++;
> +       xb = *ReadPtr++;
> +
> +       xc = (unsigned char) (RGBMAX - xr);
> +       xm = (unsigned char) (RGBMAX - xg);
> +       xy = (unsigned char) (RGBMAX - xb);
> +
> +       if (xc < xm)
> +         {
> +           xk = (unsigned char) (xc < xy ? xc : xy);
> +         }
> +       else
> +        {
> +          xk = (unsigned char) (xm < xy ? xm : xy);
> +        }
> +
> +       xc = (unsigned char) (xc - xk);
> +       xm = (unsigned char) (xm - xk);
> +       xy = (unsigned char) (xy - xk);
> +
> +       *EritePtr++ = xc;
> +       *EritePtr++ = xm;
> +       *EritePtr++ = xy;
> +       *EritePtr++ = xk;
> +       sum += *EritePtr;
> +    }
> +  return sum;
> +}
> +
> +int
> +main()
> +{
> +  if (test() != 33)
> +    abort();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump "Duplicating join block" "split-paths" } } */
> diff --git a/gcc/timevar.def b/gcc/timevar.def
> index b429faf..45e3b70 100644
> --- a/gcc/timevar.def
> +++ b/gcc/timevar.def
> @@ -252,6 +252,7 @@ DEFTIMEVAR (TV_GCSE_AFTER_RELOAD     , "load CSE after reload")
>  DEFTIMEVAR (TV_REE                  , "ree")
>  DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue")
>  DEFTIMEVAR (TV_IFCVT2               , "if-conversion 2")
> +DEFTIMEVAR (TV_SPLIT_PATHS          , "split paths")
>  DEFTIMEVAR (TV_COMBINE_STACK_ADJUST  , "combine stack adjustments")
>  DEFTIMEVAR (TV_PEEPHOLE2             , "peephole 2")
>  DEFTIMEVAR (TV_RENAME_REGISTERS      , "rename registers")
> diff --git a/gcc/tracer.c b/gcc/tracer.c
> index 941dc20..c2dba4c 100644
> --- a/gcc/tracer.c
> +++ b/gcc/tracer.c
> @@ -51,9 +51,9 @@
>  #include "tree-inline.h"
>  #include "cfgloop.h"
>  #include "fibonacci_heap.h"
> +#include "tracer.h"
>
>  static int count_insns (basic_block);
> -static bool ignore_bb_p (const_basic_block);
>  static bool better_p (const_edge, const_edge);
>  static edge find_best_successor (basic_block);
>  static edge find_best_predecessor (basic_block);
> @@ -85,7 +85,7 @@ bb_seen_p (basic_block bb)
>  }
>
>  /* Return true if we should ignore the basic block for purposes of tracing.  */
> -static bool
> +bool
>  ignore_bb_p (const_basic_block bb)
>  {
>    if (bb->index < NUM_FIXED_BLOCKS)
> @@ -226,6 +226,24 @@ find_trace (basic_block bb, basic_block *trace)
>    return i;
>  }
>
> +/* Duplicate block BB2, placing it after BB in the CFG.  Return the
> +   newly created block.  */
> +basic_block
> +transform_duplicate (basic_block bb, basic_block bb2)
> +{
> +  edge e;
> +  basic_block copy;
> +
> +  e = find_edge (bb, bb2);
> +
> +  copy = duplicate_block (bb2, e, bb);
> +  flush_pending_stmts (e);
> +
> +  add_phi_args_after_copy (&copy, 1, NULL);
> +
> +  return (copy);
> +}
> +
>  /* Look for basic blocks in frequency order, construct traces and tail duplicate
>     if profitable.  */
>
> @@ -321,17 +339,8 @@ tail_duplicate (void)
>                  entries or at least rotate the loop.  */
>               && bb2->loop_father->header != bb2)
>             {
> -             edge e;
> -             basic_block copy;
> -
>               nduplicated += counts [bb2->index];
> -
> -             e = find_edge (bb, bb2);
> -
> -             copy = duplicate_block (bb2, e, bb);
> -             flush_pending_stmts (e);
> -
> -             add_phi_args_after_copy (&copy, 1, NULL);
> +             basic_block copy = transform_duplicate (bb, bb2);
>
>               /* Reconsider the original copy of block we've duplicated.
>                  Removing the most common predecessor may make it to be
> diff --git a/gcc/tracer.h b/gcc/tracer.h
> new file mode 100644
> index 0000000..cd1792a
> --- /dev/null
> +++ b/gcc/tracer.h
> @@ -0,0 +1,26 @@
> +/* Header file for Tracer.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_TRACER_H
> +#define GCC_TRACER_H
> +
> +extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
> +extern bool ignore_bb_p (const_basic_block bb);
> +
> +#endif /* GCC_TRACER_H */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 49e22a9..da67761 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -390,6 +390,7 @@ extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_split_paths (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
>
Richard Biener Dec. 3, 2015, 2:45 p.m. UTC | #4
On Thu, Dec 3, 2015 at 3:38 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Sat, Nov 14, 2015 at 12:35 AM, Jeff Law <law@redhat.com> wrote:
>> On 11/13/2015 01:23 PM, Jeff Law wrote:
>>>
>>> On 11/13/2015 11:09 AM, Richard Biener wrote:
>>>
>>>>>
>>>>> BTW Do we have an API for indicating that new blocks have been added to
>>>>>
>>>>> a loop?  If so, then we can likely drop the LOOPS_NEED_FIXUP.
>>>>
>>>>
>>>> Please. It's called add_to_loop or so.
>>>
>>> Haha, the block duplication code was handling this already.  So in
>>> theory I can just drop the LOOPS_NEED_FIXUP completely.  Testing now.
>>>
>>> jeff
>>>
>> Attached is the committed patch for path splitting.  As noted above, we
>> didn't need the LOOPS_NEED_FIXUP in the final version, so that wart is gone
>> :-)
>>
>> I do find myself wondering if this can/should be generalized beyond just
>> paths heading to loop backedges.  However to do so I think we'd need to be
>> able to undo this transformation reliably and we'd need some heuristics when
>> to duplicate to expose the redundancy vs rely on PRE techniques and jump
>> threading.  I vaguely remember a paper which touched on these topics, but I
>> can't seem to find it.
>>
>> Anyway, bootstrapped and regression tested on x86_64-linux-gnu. Installed on
>> the trunk.
>
> This pass is now enabled by default with -Os but has no limits on the amount of
> stmts it copies.  It also will make all loops with this shape have at least two
> exits (if the resulting loop will be disambiguated the inner loop will
> have two exits).
> Having more than one exit will disable almost all loop optimizations after it.
>
> The pass itself documents the transform it does but does zero to motivate it.
>
> What's the benefit of this pass (apart from disrupting further optimizations)?
>
> I can see a _single_ case where duplicating the latch will allow threading
> one of the paths through the loop header to eliminate the original exit.  Then
> disambiguation may create a nice nested loop out of this.  Of course that
> is only profitable again if you know the remaining single exit of the inner
> loop (exiting to the outer one) is executed infrequently (thus the inner loop
> actually loops).
>
> But no checks other than on the CFG shape exist (oh, it checks it will
> at _least_ copy two stmts!).
>
> Given the profitability constraints above (well, correct me if I am
> wrong on these)
> it looks like the whole transform should be done within the FSM threading
> code which might be able to compute whether there will be an inner loop
> with a single exit only.
>
> I'm inclined to request the pass to be removed again or at least disabled by
> default.
>
> What closed source benchmark was this transform invented for?

Ah, some EEMBC one.

Btw, the testcase that was added shows

       if (xc < xm)
         {
           xk = (unsigned char) (xc < xy ? xc : xy);
         }
       else
        {
          xk = (unsigned char) (xm < xy ? xm : xy);
        }

which might be better handled by phiopt transforming it into

xk = MIN (xc, MIN (xm, xy))

phiopt1 sees (hooray to GENERIC folding)

  xc_26 = ~xr_21;
  xm_27 = ~xg_23;
  xy_28 = ~xb_25;
  if (xr_21 > xg_23)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  xk_29 = MIN_EXPR <xc_26, xy_28>;
  goto <bb 7>;

  <bb 6>:
  xk_30 = MIN_EXPR <xm_27, xy_28>;

  <bb 7>:
  # xk_4 = PHI <xk_29(5), xk_30(6)>

btw, see PR67438 for a similar testcase and the above pattern.
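At the source level the two forms are equivalent; here is a quick sketch (function names are invented for illustration) of the branchy testcase shape next to the nested-MIN form phiopt could produce:

```c
/* Branchy shape, as written in the split-path-1.c testcase.  */
unsigned char
xk_branchy (unsigned char xc, unsigned char xm, unsigned char xy)
{
  unsigned char xk;
  if (xc < xm)
    xk = (unsigned char) (xc < xy ? xc : xy);
  else
    xk = (unsigned char) (xm < xy ? xm : xy);
  return xk;
}

/* Branchless form: xk = MIN (xc, MIN (xm, xy)).  */
unsigned char
xk_min (unsigned char xc, unsigned char xm, unsigned char xy)
{
  unsigned char t = xm < xy ? xm : xy;
  return xc < t ? xc : t;
}
```

Both compute the minimum of the three values (if xc < xm the then-arm returns min(xc, xy), which equals min(xc, xm, xy); symmetrically for the else-arm), so a phiopt that recognizes the diamond can collapse the control flow entirely.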

Richard.

> Richard.
>
>>
>>
>>
>> commit c1891376e5dcc99ad8be2d22f9551c03f9bb2729
>> Author: Jeff Law <law@redhat.com>
>> Date:   Fri Nov 13 16:29:34 2015 -0700
>>
>>     [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>>     representation
>>
>>         * Makefile.in (OBJS): Add gimple-ssa-split-paths.o
>>         * common.opt (-fsplit-paths): New flag controlling path splitting.
>>         * doc/invoke.texi (fsplit-paths): Document.
>>         * opts.c (default_options_table): Add -fsplit-paths to -O2.
>>         * passes.def: Add split_paths pass.
>>         * timevar.def (TV_SPLIT_PATHS): New timevar.
>>         * tracer.c: Include "tracer.h"
>>         (ignore_bb_p): No longer static.
>>         (transform_duplicate): New function, broken out of tail_duplicate.
>>         (tail_duplicate): Use transform_duplicate.
>>         * tracer.h (ignore_bb_p): Declare
>>         (transform_duplicate): Likewise.
>>         * tree-pass.h (make_pass_split_paths): Declare.
>>         * gimple-ssa-split-paths.c: New file.
>>
>>         * gcc.dg/tree-ssa/split-path-1.c: New test.
>>
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index dde2695..a7abe37 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,21 @@
>> +2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
>> +           Jeff Law  <law@redhat.com>
>> +
>> +       * Makefile.in (OBJS): Add gimple-ssa-split-paths.o
>> +       * common.opt (-fsplit-paths): New flag controlling path splitting.
>> +       * doc/invoke.texi (fsplit-paths): Document.
>> +       * opts.c (default_options_table): Add -fsplit-paths to -O2.
>> +       * passes.def: Add split_paths pass.
>> +       * timevar.def (TV_SPLIT_PATHS): New timevar.
>> +       * tracer.c: Include "tracer.h"
>> +       (ignore_bb_p): No longer static.
>> +       (transform_duplicate): New function, broken out of tail_duplicate.
>> +       (tail_duplicate): Use transform_duplicate.
>> +       * tracer.h (ignore_bb_p): Declare
>> +       (transform_duplicate): Likewise.
>> +       * tree-pass.h (make_pass_split_paths): Declare.
>> +       * gimple-ssa-split-paths.c: New file.
>> +
>>  2015-11-13  Kai Tietz  <ktietz70@googlemail.com>
>>             Marek Polacek  <polacek@redhat.com>
>>             Jason Merrill  <jason@redhat.com>
>> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
>> index d3fd5e9..5c294df 100644
>> --- a/gcc/Makefile.in
>> +++ b/gcc/Makefile.in
>> @@ -1277,6 +1277,7 @@ OBJS = \
>>         gimple-pretty-print.o \
>>         gimple-ssa-backprop.o \
>>         gimple-ssa-isolate-paths.o \
>> +       gimple-ssa-split-paths.o \
>>         gimple-ssa-strength-reduction.o \
>>         gimple-streamer-in.o \
>>         gimple-streamer-out.o \
>> diff --git a/gcc/common.opt b/gcc/common.opt
>> index 757ce85..3eb520e 100644
>> --- a/gcc/common.opt
>> +++ b/gcc/common.opt
>> @@ -2403,6 +2403,10 @@ ftree-vrp
>>  Common Report Var(flag_tree_vrp) Init(0) Optimization
>>  Perform Value Range Propagation on trees.
>>
>> +fsplit-paths
>> +Common Report Var(flag_split_paths) Init(0) Optimization
>> +Split paths leading to loop backedges.
>> +
>>  funit-at-a-time
>>  Common Report Var(flag_unit_at_a_time) Init(1)
>>  Compile whole compilation unit at a time.
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index c18df98..eeb79e6 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -354,6 +354,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
>>  -fdump-tree-vtable-verify @gol
>>  -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
>> +-fdump-tree-split-paths@r{[}-@var{n}@r{]} @gol
>>  -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
>>  -fdump-final-insns=@var{file} @gol
>>  -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
>> @@ -448,6 +449,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
>>  -fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
>>  -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
>> +-fsplit-paths @gol
>>  -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
>>  -fstack-protector -fstack-protector-all -fstack-protector-strong @gol
>>  -fstack-protector-explicit -fstdarg-opt -fstrict-aliasing @gol
>> @@ -7171,6 +7173,11 @@ output on to @file{stderr}. If two conflicting dump
>> filenames are
>>  given for the same pass, then the latter option overrides the earlier
>>  one.
>>
>> +@item split-paths
>> +@opindex fdump-tree-split-paths
>> +Dump each function after splitting paths to loop backedges.  The file
>> +name is made by appending @file{.split-paths} to the source file name.
>> +
>>  @item all
>>  Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
>>  and @option{lineno}.
>> @@ -7808,6 +7815,7 @@ also turns on the following optimization flags:
>>  -frerun-cse-after-loop  @gol
>>  -fsched-interblock  -fsched-spec @gol
>>  -fschedule-insns  -fschedule-insns2 @gol
>> +-fsplit-paths @gol
>>  -fstrict-aliasing -fstrict-overflow @gol
>>  -ftree-builtin-call-dce @gol
>>  -ftree-switch-conversion -ftree-tail-merge @gol
>> @@ -8821,7 +8829,7 @@ currently enabled, but may be enabled by @option{-O2}
>> in the future.
>>
>>  @item -ftree-sink
>>  @opindex ftree-sink
>> -Perform forward store motion  on trees.  This flag is
>> +Perform forward store motion on trees.  This flag is
>>  enabled by default at @option{-O} and higher.
>>
>>  @item -ftree-bit-ccp
>> @@ -9127,6 +9135,12 @@ enabled by default at @option{-O2} and higher.  Null
>> pointer check
>>  elimination is only done if @option{-fdelete-null-pointer-checks} is
>>  enabled.
>>
>> +@item -fsplit-paths
>> +@opindex fsplit-paths
>> +Split paths leading to loop backedges.  This can improve dead code
>> +elimination and common subexpression elimination.  This is enabled by
>> +default at @option{-O2} and above.
>> +
>>  @item -fsplit-ivs-in-unroller
>>  @opindex fsplit-ivs-in-unroller
>>  Enables expression of values of induction variables in later iterations
>> diff --git a/gcc/gimple-ssa-split-paths.c b/gcc/gimple-ssa-split-paths.c
>> new file mode 100644
>> index 0000000..602e916
>> --- /dev/null
>> +++ b/gcc/gimple-ssa-split-paths.c
>> @@ -0,0 +1,270 @@
>> +/* Support routines for Splitting Paths to loop backedges
>> +   Copyright (C) 2015 Free Software Foundation, Inc.
>> +   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
>> +
>> + This file is part of GCC.
>> +
>> + GCC is free software; you can redistribute it and/or modify
>> + it under the terms of the GNU General Public License as published by
>> + the Free Software Foundation; either version 3, or (at your option)
>> + any later version.
>> +
>> +GCC is distributed in the hope that it will be useful,
>> +but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +GNU General Public License for more details.
>> +
>> +You should have received a copy of the GNU General Public License
>> +along with GCC; see the file COPYING3.  If not see
>> +<http://www.gnu.org/licenses/>.  */
>> +
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "tree.h"
>> +#include "gimple.h"
>> +#include "tree-pass.h"
>> +#include "cfganal.h"
>> +#include "cfgloop.h"
>> +#include "gimple-iterator.h"
>> +#include "tracer.h"
>> +
>> +/* Given LATCH, the latch block in a loop, see if the shape of the
>> +   path reaching LATCH is suitable for being split by duplication.
>> +   If so, return the block that will be duplicated into its predecessor
>> +   paths.  Else return NULL.  */
>> +
>> +static basic_block
>> +find_block_to_duplicate_for_splitting_paths (basic_block latch)
>> +{
>> +  /* We should have simple latches at this point.  So the latch should
>> +     have a single successor.  This implies the predecessor of the latch
>> +     likely has the loop exit.  And it's that predecessor we're most
>> +     interested in. To keep things simple, we're going to require that
>> +     the latch have a single predecessor too.  */
>> +  if (single_succ_p (latch) && single_pred_p (latch))
>> +    {
>> +      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
>> +      gcc_assert (single_pred_edge (latch)->src == bb);
>> +
>> +      /* If BB has been marked as not to be duplicated, then honor that
>> +        request.  */
>> +      if (ignore_bb_p (bb))
>> +       return NULL;
>> +
>> +      gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb));
>> +      /* The immediate dominator of the latch must end in a conditional.
>> */
>> +      if (!last || gimple_code (last) != GIMPLE_COND)
>> +       return NULL;
>> +
>> +      /* We're hoping that BB is a join point for an IF-THEN-ELSE diamond
>> +        region.  Verify that it is.
>> +
>> +        First, verify that BB has two predecessors (each arm of the
>> +        IF-THEN-ELSE) and two successors (the latch and exit).  */
>> +      if (EDGE_COUNT (bb->preds) == 2 && EDGE_COUNT (bb->succs) == 2)
>> +       {
>> +         /* Now verify that BB's immediate dominator ends in a
>> +            conditional as well.  */
>> +         basic_block bb_idom = get_immediate_dominator (CDI_DOMINATORS,
>> bb);
>> +         gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb_idom));
>> +         if (!last || gimple_code (last) != GIMPLE_COND)
>> +           return NULL;
>> +
>> +         /* And that BB's immediate dominator's successors are the
>> +            the predecessors of BB.  */
>> +         if (!find_edge (bb_idom, EDGE_PRED (bb, 0)->src)
>> +             || !find_edge (bb_idom, EDGE_PRED (bb, 1)->src))
>> +           return NULL;
>> +
>> +         /* So at this point we have a simple diamond for an IF-THEN-ELSE
>> +            construct starting at BB_IDOM, with a join point at BB.  BB
>> +            pass control outside the loop or to the loop latch.
>> +
>> +            We're going to want to create two duplicates of BB, one for
>> +            each successor of BB_IDOM.  */
>> +         return bb;
>> +       }
>> +    }
>> +  return NULL;
>> +}
>> +
>> +/* Return TRUE if BB is a reasonable block to duplicate by examining
>> +   its size, false otherwise.  BB will always be a loop latch block.
>> +
>> +   Should this use the same tests as we do for jump threading?  */
>> +
>> +static bool
>> +is_feasible_trace (basic_block bb)
>> +{
>> +  int num_stmt = 0;
>> +  gimple_stmt_iterator gsi;
>> +
>> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      gimple *stmt = gsi_stmt (gsi);
>> +      if (!is_gimple_debug (stmt))
>> +       num_stmt++;
>> +    }
>> +
>> +  /* We may want to limit how many statements we copy.  */
>> +  if (num_stmt > 1)
>> +    return true;
>> +
>> +  return false;
>> +}
>> +
>> +/* If the immediate dominator of the latch of the loop is
>> +   block with conditional branch, then the loop latch  is
>> +   duplicated to its predecessors path preserving the SSA
>> +   semantics.
>> +
>> +   CFG before transformation.
>> +
>> +              2
>> +              |
>> +              |
>> +        +---->3
>> +        |    / \
>> +        |   /   \
>> +        |  4     5
>> +        |   \   /
>> +        |    \ /
>> +        |     6
>> +        |    / \
>> +        |   /   \
>> +        |  8     7
>> +        |  |     |
>> +        ---+     E
>> +
>> +
>> +
>> +    Block 8 is the latch.  We're going to make copies of block 6 (9 & 10)
>> +    and wire things up so they look like this:
>> +
>> +              2
>> +              |
>> +              |
>> +        +---->3
>> +        |    / \
>> +        |   /   \
>> +        |  4     5
>> +        |  |     |
>> +        |  |     |
>> +        |  9    10
>> +        |  |\   /|
>> +        |  | \ / |
>> +        |  |  7  |
>> +        |  |  |  |
>> +        |  |  E  |
>> +        |  |     |
>> +        |   \   /
>> +        |    \ /
>> +        +-----8
>> +
>> +
>> +    Blocks 9 and 10 will get merged into blocks 4 & 5 respectively which
>> +    enables CSE, DCE and other optimizations to occur on a larger block
>> +    of code.   */
>> +
>> +static bool
>> +split_paths ()
>> +{
>> +  bool changed = false;
>> +  loop_p loop;
>> +
>> +  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
>> +  initialize_original_copy_tables ();
>> +  calculate_dominance_info (CDI_DOMINATORS);
>> +
>> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
>> +    {
>> +      /* See if there is a block that we can duplicate to split the
>> +        path to the loop latch.  */
>> +      basic_block bb = find_block_to_duplicate_for_splitting_paths
>> (loop->latch);
>> +
>> +      /* BB is the merge point for an IF-THEN-ELSE we want to transform.
>> +
>> +        Essentially we want to create two duplicates of BB and append
>> +        a duplicate to the THEN and ELSE clauses.  This will split the
>> +        path leading to the latch.  BB will be unreachable and removed.  */
>> +      if (bb && is_feasible_trace (bb))
>> +       {
>> +         if (dump_file && (dump_flags & TDF_DETAILS))
>> +           fprintf (dump_file,
>> +                    "Duplicating join block %d into predecessor paths\n",
>> +                    bb->index);
>> +         basic_block pred0 = EDGE_PRED (bb, 0)->src;
>> +         basic_block pred1 = EDGE_PRED (bb, 1)->src;
>> +         transform_duplicate (pred0, bb);
>> +         transform_duplicate (pred1, bb);
>> +         changed = true;
>> +       }
>> +    }
>> +
>> +  loop_optimizer_finalize ();
>> +  free_original_copy_tables ();
>> +  return changed;
>> +}
>> +
>> +/* Main entry point for splitting paths.  Returns TODO_cleanup_cfg if any
>> +   paths where split, otherwise return zero.  */
>> +
>> +static unsigned int
>> +execute_split_paths ()
>> +{
>> +  /* If we don't have at least 2 real blocks and backedges in the
>> +     CFG, then there's no point in trying to perform path splitting.  */
>> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
>> +      || !mark_dfs_back_edges ())
>> +    return 0;
>> +
>> +  bool changed = split_paths();
>> +  if (changed)
>> +    free_dominance_info (CDI_DOMINATORS);
>> +
>> +  return changed ? TODO_cleanup_cfg : 0;
>> +}
>> +
>> +static bool
>> +gate_split_paths ()
>> +{
>> +  return flag_split_paths;
>> +}
>> +
>> +namespace {
>> +
>> +const pass_data pass_data_split_paths =
>> +{
>> +  GIMPLE_PASS, /* type */
>> +  "split-paths", /* name */
>> +  OPTGROUP_NONE, /* optinfo_flags */
>> +  TV_SPLIT_PATHS, /* tv_id */
>> +  PROP_ssa, /* properties_required */
>> +  0, /* properties_provided */
>> +  0, /* properties_destroyed */
>> +  0, /* todo_flags_start */
>> +  TODO_update_ssa, /* todo_flags_finish */
>> +};
>> +
>> +class pass_split_paths : public gimple_opt_pass
>> +{
>> +   public:
>> +    pass_split_paths (gcc::context *ctxt)
>> +      : gimple_opt_pass (pass_data_split_paths, ctxt)
>> +    {}
>> +   /* opt_pass methods: */
>> +   opt_pass * clone () { return new pass_split_paths (m_ctxt); }
>> +   virtual bool gate (function *) { return gate_split_paths (); }
>> +   virtual unsigned int execute (function *) { return execute_split_paths
>> (); }
>> +
>> +}; // class pass_split_paths
>> +
>> +} // anon namespace
>> +
>> +gimple_opt_pass *
>> +make_pass_split_paths (gcc::context *ctxt)
>> +{
>> +  return new pass_split_paths (ctxt);
>> +}
>> diff --git a/gcc/opts.c b/gcc/opts.c
>> index 930ae43..be04cf5 100644
>> --- a/gcc/opts.c
>> +++ b/gcc/opts.c
>> @@ -523,6 +523,7 @@ static const struct default_options
>> default_options_table[] =
>>      { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1
>> },
>>      { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
>>      { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
>> +    { OPT_LEVELS_2_PLUS, OPT_fsplit_paths, NULL, 1 },
>>
>>      /* -O3 optimizations.  */
>>      { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
>> diff --git a/gcc/passes.def b/gcc/passes.def
>> index c0ab6b9..db822d3 100644
>> --- a/gcc/passes.def
>> +++ b/gcc/passes.def
>> @@ -274,6 +274,7 @@ along with GCC; see the file COPYING3.  If not see
>>        POP_INSERT_PASSES ()
>>        NEXT_PASS (pass_simduid_cleanup);
>>        NEXT_PASS (pass_lower_vector_ssa);
>> +      NEXT_PASS (pass_split_paths);
>>        NEXT_PASS (pass_cse_reciprocals);
>>        NEXT_PASS (pass_reassoc);
>>        NEXT_PASS (pass_strength_reduction);
>> diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
>> index 3301130..ee92aaf 100644
>> --- a/gcc/testsuite/ChangeLog
>> +++ b/gcc/testsuite/ChangeLog
>> @@ -1,3 +1,8 @@
>> +2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
>> +            Jeff Law  <law@redhat.com>
>> +
>> +       * gcc.dg/tree-ssa/split-path-1.c: New test.
>> +
>>  2015-11-13  Nathan Sidwell  <nathan@codesourcery.com>
>>
>>         * c-c++-common/goacc/loop-auto-1.c: New.
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
>> b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
>> new file mode 100644
>> index 0000000..1239892
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
>> @@ -0,0 +1,67 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O2 -fdump-tree-split-paths-details " } */
>> +
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +
>> +#define RGBMAX 255
>> +
>> +int
>> +test()
>> +{
>> +  int i, Pels;
>> +  unsigned char sum = 0;
>> +  unsigned char xr, xg, xb;
>> +  unsigned char xc, xm, xy, xk;
>> +  unsigned char *ReadPtr, *EritePtr;
>> +
>> +  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
>> +  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
>> +
>> +  for (i = 0; i < 100;i++)
>> +     {
>> +       ReadPtr[i] = 100 - i;
>> +     }
>> +
>> +  for (i = 0; i < 100; i++)
>> +     {
>> +       xr = *ReadPtr++;
>> +       xg = *ReadPtr++;
>> +       xb = *ReadPtr++;
>> +
>> +       xc = (unsigned char) (RGBMAX - xr);
>> +       xm = (unsigned char) (RGBMAX - xg);
>> +       xy = (unsigned char) (RGBMAX - xb);
>> +
>> +       if (xc < xm)
>> +         {
>> +           xk = (unsigned char) (xc < xy ? xc : xy);
>> +         }
>> +       else
>> +        {
>> +          xk = (unsigned char) (xm < xy ? xm : xy);
>> +        }
>> +
>> +       xc = (unsigned char) (xc - xk);
>> +       xm = (unsigned char) (xm - xk);
>> +       xy = (unsigned char) (xy - xk);
>> +
>> +       *EritePtr++ = xc;
>> +       *EritePtr++ = xm;
>> +       *EritePtr++ = xy;
>> +       *EritePtr++ = xk;
>> +       sum += *EritePtr;
>> +    }
>> +  return sum;
>> +}
>> +
>> +int
>> +main()
>> +{
>> +  if (test() != 33)
>> +    abort();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump "Duplicating join block" "split-paths" } }
>> */
>> diff --git a/gcc/timevar.def b/gcc/timevar.def
>> index b429faf..45e3b70 100644
>> --- a/gcc/timevar.def
>> +++ b/gcc/timevar.def
>> @@ -252,6 +252,7 @@ DEFTIMEVAR (TV_GCSE_AFTER_RELOAD     , "load CSE after
>> reload")
>>  DEFTIMEVAR (TV_REE                  , "ree")
>>  DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue")
>>  DEFTIMEVAR (TV_IFCVT2               , "if-conversion 2")
>> +DEFTIMEVAR (TV_SPLIT_PATHS          , "split paths")
>>  DEFTIMEVAR (TV_COMBINE_STACK_ADJUST  , "combine stack adjustments")
>>  DEFTIMEVAR (TV_PEEPHOLE2             , "peephole 2")
>>  DEFTIMEVAR (TV_RENAME_REGISTERS      , "rename registers")
>> diff --git a/gcc/tracer.c b/gcc/tracer.c
>> index 941dc20..c2dba4c 100644
>> --- a/gcc/tracer.c
>> +++ b/gcc/tracer.c
>> @@ -51,9 +51,9 @@
>>  #include "tree-inline.h"
>>  #include "cfgloop.h"
>>  #include "fibonacci_heap.h"
>> +#include "tracer.h"
>>
>>  static int count_insns (basic_block);
>> -static bool ignore_bb_p (const_basic_block);
>>  static bool better_p (const_edge, const_edge);
>>  static edge find_best_successor (basic_block);
>>  static edge find_best_predecessor (basic_block);
>> @@ -85,7 +85,7 @@ bb_seen_p (basic_block bb)
>>  }
>>
>>  /* Return true if we should ignore the basic block for purposes of tracing.
>> */
>> -static bool
>> +bool
>>  ignore_bb_p (const_basic_block bb)
>>  {
>>    if (bb->index < NUM_FIXED_BLOCKS)
>> @@ -226,6 +226,24 @@ find_trace (basic_block bb, basic_block *trace)
>>    return i;
>>  }
>>
>> +/* Duplicate block BB2, placing it after BB in the CFG.  Return the
>> +   newly created block.  */
>> +basic_block
>> +transform_duplicate (basic_block bb, basic_block bb2)
>> +{
>> +  edge e;
>> +  basic_block copy;
>> +
>> +  e = find_edge (bb, bb2);
>> +
>> +  copy = duplicate_block (bb2, e, bb);
>> +  flush_pending_stmts (e);
>> +
>> +  add_phi_args_after_copy (&copy, 1, NULL);
>> +
>> +  return (copy);
>> +}
>> +
>>  /* Look for basic blocks in frequency order, construct traces and tail
>> duplicate
>>     if profitable.  */
>>
>> @@ -321,17 +339,8 @@ tail_duplicate (void)
>>                  entries or at least rotate the loop.  */
>>               && bb2->loop_father->header != bb2)
>>             {
>> -             edge e;
>> -             basic_block copy;
>> -
>>               nduplicated += counts [bb2->index];
>> -
>> -             e = find_edge (bb, bb2);
>> -
>> -             copy = duplicate_block (bb2, e, bb);
>> -             flush_pending_stmts (e);
>> -
>> -             add_phi_args_after_copy (&copy, 1, NULL);
>> +             basic_block copy = transform_duplicate (bb, bb2);
>>
>>               /* Reconsider the original copy of block we've duplicated.
>>                  Removing the most common predecessor may make it to be
>> diff --git a/gcc/tracer.h b/gcc/tracer.h
>> new file mode 100644
>> index 0000000..cd1792a
>> --- /dev/null
>> +++ b/gcc/tracer.h
>> @@ -0,0 +1,26 @@
>> +/* Header file for Tracer.
>> +   Copyright (C) 2015 Free Software Foundation, Inc.
>> +
>> +This file is part of GCC.
>> +
>> +GCC is free software; you can redistribute it and/or modify it under
>> +the terms of the GNU General Public License as published by the Free
>> +Software Foundation; either version 3, or (at your option) any later
>> +version.
>> +
>> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
>> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>> + for more details.
>> +
>> +You should have received a copy of the GNU General Public License
>> +along with GCC; see the file COPYING3.  If not see
>> +<http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef GCC_TRACER_H
>> +#define GCC_TRACER_H
>> +
>> +extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
>> +extern bool ignore_bb_p (const_basic_block bb);
>> +
>> +#endif /* GCC_TRACER_H */
>> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
>> index 49e22a9..da67761 100644
>> --- a/gcc/tree-pass.h
>> +++ b/gcc/tree-pass.h
>> @@ -390,6 +390,7 @@ extern gimple_opt_pass *make_pass_tree_loop_done
>> (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
>> +extern gimple_opt_pass *make_pass_split_paths (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
>>  extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
>>
Jeff Law Dec. 3, 2015, 3:46 p.m. UTC | #5
On 12/03/2015 07:38 AM, Richard Biener wrote:
>
> This pass is now enabled by default with -Os but has no limits on the amount of
> stmts it copies.  It also will make all loops with this shape have at least two
> exits (if the resulting loop will be disambiguated the inner loop will
> have two exits).
> Having more than one exit will disable almost all loop optimizations after it.
[ ... ]
split-paths is in the queue -- behind addressing a couple of correctness 
issues that are on my plate (not split-paths related).  I'll respond 
fully.  FWIW, I wouldn't lose much sleep if this were disabled by 
default -- without the "sink-common-code-past-phi" stuff we've discussed 
in the past it's fairly hard to justify path-splitting this aggressively.

jeff
Jeff Law Dec. 10, 2015, 8:08 p.m. UTC | #6
On 12/03/2015 07:38 AM, Richard Biener wrote:
> This pass is now enabled by default with -Os but has no limits on the amount of
> stmts it copies.
The more statements it copies, the more likely it is that the path 
splitting will turn out to be useful!  It's counter-intuitive.

The primary benefit AFAICT with path splitting is that it exposes 
additional CSE, DCE, etc opportunities.

IIRC Ajit posited that it could help with live/conflict analysis.  I 
never saw that, and with the changes to push splitting deeper into the 
pipeline I'd doubt any live/conflict analysis benefit, since that work 
also involved preserving the single latch property.



> It also will make all loops with this shape have at least two
> exits (if the resulting loop will be disambiguated the inner loop will
> have two exits).
> Having more than one exit will disable almost all loop optimizations after it.
Hmmm, the updated code keeps the single latch property, but I'm pretty 
sure it won't keep a single exit policy.

To keep a single exit policy would require keeping an additional block 
around.  Each of the split paths would unconditionally transfer to this 
new block.  The new block would then either transfer to the latch block 
or out of the loop.


>
> The pass itself documents the transform it does but does zero to motivate it.
>
> What's the benefit of this pass (apart from disrupting further optimizations)?
It's essentially building superblocks in a special case to enable 
additional CSE, DCE and the like.
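The superblock effect is easiest to see at the source level.  As a hedged sketch (the loop and names below are invented for illustration, not taken from the patch): duplicating the join block into each arm lets each copy fold with the arm-specific value that was previously only visible through a PHI.

```c
/* Shape before splitting: the statement after the if/else join is
   shared, so optimizers see 't' as a PHI and cannot fold per-arm.  */
unsigned
before_split (const unsigned *a, const unsigned *b, int n)
{
  unsigned s = 0;
  for (int i = 0; i < n; i++)
    {
      unsigned t;
      if (a[i] < b[i])
        t = a[i];
      else
        t = b[i];
      /* Join block feeding the loop latch.  */
      s += a[i] - t;
    }
  return s;
}

/* Shape after path splitting: the join block is duplicated into each
   arm, so each copy folds with the arm-specific value of 't'.  */
unsigned
after_split (const unsigned *a, const unsigned *b, int n)
{
  unsigned s = 0;
  for (int i = 0; i < n; i++)
    {
      if (a[i] < b[i])
        s += 0;              /* duplicated join: a[i] - a[i] folds away */
      else
        s += a[i] - b[i];    /* duplicated join with t = b[i] */
    }
  return s;
}
```

In the then-arm the duplicated statement becomes dead code entirely; in the else-arm it folds to a single subtraction -- the kind of CSE/DCE exposure the pass is after.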

Unfortunately, what it is missing is heuristics and de-duplication: the 
former to reject cases where splitting is not useful, and the latter to 
reduce codesize for any statements that did not participate in 
optimizations after they were duplicated.

The de-duplication is the "sink-statements-through-phi" class of 
problems: cross jumping, tail merging and the like.

It was only after I approved this code after twiddling it for Ajit that 
I came across Honza's tracer implementation, which may in fact be 
retargetable to these loops and do a better job.  I haven't 
experimented with that.



>
> I can see a _single_ case where duplicating the latch will allow threading
> one of the paths through the loop header to eliminate the original exit.  Then
> disambiguation may create a nice nested loop out of this.  Of course that
> is only profitable again if you know the remaining single exit of the inner
> loop (exiting to the outer one) is executed infrequently (thus the inner loop
> actually loops).
It wasn't ever about threading.

>
> But no checks other than on the CFG shape exist (oh, it checks it will
> at _least_ copy two stmts!).
Again, the more statements it copies the more likely it is to be 
profitable.  Think superblocks to expose CSE, DCE and the like.

>
> Given the profitability constraints above (well, correct me if I am
> wrong on these)
> it looks like the whole transform should be done within the FSM threading
> code which might be able to compute whether there will be an inner loop
> with a single exit only.
While it shares some concepts with jump threading, I don't think the 
transformation belongs in jump threading.

>
> I'm inclined to request the pass to be removed again or at least disabled by
> default.
I wouldn't lose any sleep if we disabled by default or removed, 
particularly if we can repurpose Honza's code.  In fact, I might 
strongly support the former until we hear back from Ajit on performance 
data.

I also keep coming back to Click's paper on code motion -- in that 
context, copying statements would be a way to break dependencies and 
give the global code motion algorithm more freedom.  The advantage of 
doing it in a framework like Click's is it's got a built-in sinking step.


>
> What closed source benchmark was this transform invented for?
I think it was EEMBC or Coremark.  Ajit should know for sure.  I was 
actually still hoping to see benchmark results from Ajit to confirm the 
new placement in the pipeline didn't negate all the benefits he saw.

jeff
Jeff Law Dec. 10, 2015, 8:12 p.m. UTC | #7
On 12/03/2015 07:45 AM, Richard Biener wrote:

>
> Ah, some EEMBC one.
>
> Btw, the testcase that was added shows
>
>         if (xc < xm)
>           {
>             xk = (unsigned char) (xc < xy ? xc : xy);
>           }
>         else
>          {
>            xk = (unsigned char) (xm < xy ? xm : xy);
>          }
>
> which might be better handled by phiopt transforming it into
I don't think the included testcase is a particularly good one for this 
transformation -- I didn't see that the transformation made any 
significant difference on x86_64.  That's why I asked Ajit for more 
data on the benchmarking.


>
> xk = MIN (xc, MIN (xm, xy))
>
> phiopt1 sees (hooray to GENERIC folding)
>
>    xc_26 = ~xr_21;
>    xm_27 = ~xg_23;
>    xy_28 = ~xb_25;
>    if (xr_21 > xg_23)
>      goto <bb 5>;
>    else
>      goto <bb 6>;
>
>    <bb 5>:
>    xk_29 = MIN_EXPR <xc_26, xy_28>;
>    goto <bb 7>;
>
>    <bb 6>:
>    xk_30 = MIN_EXPR <xm_27, xy_28>;
>
>    <bb 7>:
>    # xk_4 = PHI <xk_29(5), xk_30(6)>
>
> btw, see PR67438 for a similar testcase and the above pattern.
That may be elsewhere in the BZ database as well.  I've seen stuff that 
looks awfully close to that when going through the bug lists in prior 
releases.

jeff
Ajit Kumar Agarwal Dec. 11, 2015, 9:11 a.m. UTC | #8
Hello Jeff:

Sorry for the delay in sending the benchmarks run with the Split-Paths change.

Here is the Summary of the results.

SPEC CPU 2000 INT benchmarks (Target i386):
(Geomean score without Split-Paths changes vs. with Split-Paths changes = 3740.789 vs. 3745.193)

SPEC CPU 2000 FP benchmarks (Target i386):
(Geomean score without Split-Paths changes vs. with Split-Paths changes = 4721.655 vs. 4741.825)

Mibench/EEMBC benchmarks (Target Microblaze):

Automotive_qsort1 (4.03%), Office_ispell (4.29%), Office_stringsearch1 (3.5%), Telecom_adpcm_d (1.37%), ospfv2_lite (1.35%).

We are seeing minor negative gains that are mainly noise (less than 0.5%).

Thanks & Regards
Ajit
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Friday, December 11, 2015 1:39 AM
To: Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 12/03/2015 07:38 AM, Richard Biener wrote:
> This pass is now enabled by default with -Os but has no limits on the
> amount of stmts it copies.

The more statements it copies, the more likely it is that the path splitting will turn out to be useful!  It's counter-intuitive.

The primary benefit AFAICT with path splitting is that it exposes additional CSE, DCE, etc opportunities.

IIRC Ajit posited that it could help with live/conflict analysis.  I never saw that, and with the changes to push splitting deeper into the pipeline I'd doubt any live/conflict analysis benefit, since that work also involved preserving the single-latch property.



>  It also will make all loops with this shape have at least two
> exits (if the resulting loop will be disambiguated the inner loop will
> have two exits).
> Having more than one exit will disable almost all loop optimizations after it.

Hmmm, the updated code keeps the single latch property, but I'm pretty sure it won't keep a single exit policy.

To keep a single exit policy would require keeping an additional block around.  Each of the split paths would unconditionally transfer to this new block.  The new block would then either transfer to the latch block or out of the loop.
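A C-level sketch of that shape as I read it (purely illustrative; the names and structure are hypothetical and nothing like this is implemented): both duplicated paths jump to one merge block, which holds the lone exit test, so the loop keeps a single exit edge:

```c
#include <assert.h>

/* Loop with two split paths that both funnel into a single merge block;
   the only exit test lives in the merge block, so the loop has exactly
   one exit edge.  The arithmetic is arbitrary filler for the paths.  */
static int
sum_with_merge_block (int n)
{
  int b = 0, i = 0;
  if (n <= 0)
    return 0;
  for (;;)
    {
      if (i & 1)
        {
          b += i * 3;        /* duplicated path A */
          goto merge;
        }
      else
        {
          b += i * 2;        /* duplicated path B */
          goto merge;
        }
    merge:
      i++;
      if (i >= n)            /* the single exit test */
        break;
    }
  return b;
}
```

The cost, as noted above, is the extra block and an unconditional jump from each split path into it.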


>
> The pass itself documents the transform it does but does zero to motivate it.
>
> What's the benefit of this pass (apart from disrupting further optimizations)?

It's essentially building superblocks in a special case to enable additional CSE, DCE and the like.

Unfortunately what it is missing is heuristics and de-duplication.  The former to drive cases when it's not useful and the latter to reduce codesize for any statements that did not participate in optimizations when they were duplicated.

The de-duplication is the "sink-statements-through-phi" problems, cross jumping, tail merging and the like class of problems.

It was only after I approved this code after twiddling it for Ajit that I came across Honza's tracer implementation, which may in fact be retargetable to these loops and do a better job.  I haven't experimented with that.



>
> I can see a _single_ case where duplicating the latch will allow
> threading one of the paths through the loop header to eliminate the
> original exit.  Then disambiguation may create a nice nested loop out
> of this.  Of course that is only profitable again if you know the
> remaining single exit of the inner loop (exiting to the outer one) is
> executed infrequently (thus the inner loop actually loops).

It wasn't ever about threading.

>
> But no checks other than on the CFG shape exist (oh, it checks it will
> at _least_ copy two stmts!).

Again, the more statements it copies the more likely it is to be profitable.  Think superblocks to expose CSE, DCE and the like.

>
> Given the profitability constraints above (well, correct me if I am
> wrong on these) it looks like the whole transform should be done
> within the FSM threading code which might be able to compute whether
> there will be an inner loop with a single exit only.

While it shares some concepts with jump threading, I don't think the transformation belongs in jump threading.

>
> I'm inclined to request the pass to be removed again or at least
> disabled by default.

I wouldn't lose any sleep if we disabled it by default or removed it, particularly if we can repurpose Honza's code.  In fact, I might strongly support the former until we hear back from Ajit on performance data.

I also keep coming back to Click's paper on code motion -- in that context, copying statements would be a way to break dependencies and give the global code motion algorithm more freedom.  The advantage of doing it in a framework like Click's is it's got a built-in sinking step.


>
> What closed source benchmark was this transform invented for?

I think it was EEMBC or Coremark.  Ajit should know for sure.  I was actually still hoping to see benchmark results from Ajit to confirm the new placement in the pipeline didn't negate all the benefits he saw.

jeff
Richard Biener Dec. 11, 2015, 10:05 a.m. UTC | #9
On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <law@redhat.com> wrote:
> On 12/03/2015 07:38 AM, Richard Biener wrote:
>>
>> This pass is now enabled by default with -Os but has no limits on the
>> amount of
>> stmts it copies.
>
> The more statements it copies, the more likely it is that the path splitting
> will turn out to be useful!  It's counter-intuitive.

Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer is enabled
with -fprofile-use (but it is also properly driven to only trace hot paths)
and otherwise not by default at any optimization level.

> The primary benefit AFAICT with path splitting is that it exposes additional
> CSE, DCE, etc opportunities.
>
> IIRC Ajit posited that it could help with live/conflict analysis.  I never
> saw that, and with the changes to push splitting deeper into the pipeline
> I'd doubt any live/conflict analysis benefit, since that work also
> involved preserving the single-latch property.
>
>
>
>  It also will make all loops with this shape have at least two
>>
>> exits (if the resulting loop will be disambiguated the inner loop will
>> have two exits).
>> Having more than one exit will disable almost all loop optimizations after
>> it.
>
> Hmmm, the updated code keeps the single latch property, but I'm pretty sure
> it won't keep a single exit policy.
>
> To keep a single exit policy would require keeping an additional block
> around.  Each of the split paths would unconditionally transfer to this new
> block.  The new block would then either transfer to the latch block or out
> of the loop.

Don't see how this would work for the CFG pattern it operates on unless you
duplicate the exit condition into that new block creating an even more
obfuscated CFG.

>
>>
>> The pass itself documents the transform it does but does zero to motivate
>> it.
>>
>> What's the benefit of this pass (apart from disrupting further
>> optimizations)?
>
> It's essentially building superblocks in a special case to enable additional
> CSE, DCE and the like.
>
> Unfortunately what it is missing is heuristics and de-duplication.  The
> former to drive cases when it's not useful and the latter to reduce codesize
> for any statements that did not participate in optimizations when they were
> duplicated.
>
> The de-duplication is the "sink-statements-through-phi" problems, cross
> jumping, tail merging and the like class of problems.
>
> It was only after I approved this code after twiddling it for Ajit that I
> came across Honza's tracer implementation, which may in fact be
> retargetable to these loops and do a better job.  I haven't experimented
> with that.

Well, I originally suggested to merge this with the tracer pass...

>> I can see a _single_ case where duplicating the latch will allow threading
>> one of the paths through the loop header to eliminate the original exit.
>> Then
>> disambiguation may create a nice nested loop out of this.  Of course that
>> is only profitable again if you know the remaining single exit of the
>> inner
>> loop (exiting to the outer one) is executed infrequently (thus the inner
>> loop
>> actually loops).
>
> It wasn't ever about threading.

I see.

>>
>> But no checks other than on the CFG shape exist (oh, it checks it will
>> at _least_ copy two stmts!).
>
> Again, the more statements it copies the more likely it is to be profitable.
> Think superblocks to expose CSE, DCE and the like.

Ok, so similar to tracer (where I think the main benefit is actually increasing
scheduling opportunities for architectures where it matters).

Note that both passes are placed quite late and thus won't see much
of the GIMPLE optimizations (DOM mainly).  I wonder why they were
not placed adjacent to each other.

>>
>> Given the profitability constraints above (well, correct me if I am
>> wrong on these)
>> it looks like the whole transform should be done within the FSM threading
>> code which might be able to compute whether there will be an inner loop
>> with a single exit only.
>
> While it shares some concepts with jump threading, I don't think the
> transformation belongs in jump threading.
>
>>
>> I'm inclined to request the pass to be removed again or at least disabled
>> by
>> default.
>
> I wouldn't lose any sleep if we disabled by default or removed, particularly
> if we can repurpose Honza's code.  In fact, I might strongly support the
> former until we hear back from Ajit on performance data.

See above for what we do with -ftracer.  path-splitting should at _least_
restrict itself to operate on optimize_loop_for_speed_p () loops.

It should also (even if counter-intuitive) limit the amount of stmt copying
it does - after all there is something like an instruction cache size which
exceeding for loops will never be a good idea (and even smaller special loop
caches on some archs).

Note that a better heuristic than "at least more than one stmt" would be
to have at least one PHI in the merger block.  Otherwise I don't see how
CSE opportunities could exist that we don't see without the duplication.
And yes, more PHIs -> more possible CSE.  I wouldn't say so for
the number of stmts.  So please limit the number of stmt copies!
(after all we do limit the number of stmts we copy during jump threading!)
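A minimal sketch of such a gate along these lines (the function name, parameters, and the cap are all hypothetical illustrations, not GCC's actual API or its chosen limit):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed cap on copied statements, analogous to a --param such as the
   jump-threading statement limit; the value 20 is purely illustrative.  */
#define MAX_COPIED_STMTS 20

/* Gate splitting on (a) at least one PHI in the merge block, since
   without a PHI the duplication cannot expose new CSE opportunities,
   and (b) a bound on the number of statements copied, to respect the
   instruction cache.  */
static bool
split_path_profitable_p (int phis_in_merge_block, int stmts_to_copy)
{
  if (phis_in_merge_block < 1)
    return false;
  if (stmts_to_copy > MAX_COPIED_STMTS)
    return false;
  return true;
}
```

The PHI requirement encodes the point above directly: more PHIs mean more values merged, and hence more chances for each duplicated path to simplify.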

Richard.

> I also keep coming back to Click's paper on code motion -- in that context,
> copying statements would be a way to break dependencies and give the global
> code motion algorithm more freedom.  The advantage of doing it in a
> framework like Click's is it's got a built-in sinking step.
>
>
>>
>> What closed source benchmark was this transform invented for?
>
> I think it was EEMBC or Coremark.  Ajit should know for sure.  I was
> actually still hoping to see benchmark results from Ajit to confirm the new
> placement in the pipeline didn't negate all the benefits he saw.
>
> jeff
>
Jeff Law Dec. 15, 2015, 11:50 p.m. UTC | #10
On 12/11/2015 03:05 AM, Richard Biener wrote:
> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <law@redhat.com> wrote:
>> On 12/03/2015 07:38 AM, Richard Biener wrote:
>>>
>>> This pass is now enabled by default with -Os but has no limits on the
>>> amount of
>>> stmts it copies.
>>
>> The more statements it copies, the more likely it is that the path splitting
>> will turn out to be useful!  It's counter-intuitive.
>
> Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer is enabled
> with -fprofile-use (but it is also properly driven to only trace hot paths)
> and otherwise not by default at any optimization level.
Definitely not appropriate for -Os.  But as I mentioned, I really want 
to look at the tracer code as it may totally subsume path splitting.

>
> Don't see how this would work for the CFG pattern it operates on unless you
> duplicate the exit condition into that new block creating an even more
> obfuscated CFG.
Agreed, I don't see any way to fix the multiple exit problem.  Then 
again, this all runs after the tree loop optimizer, so I'm not sure how 
big of an issue it is in practice.


>> It was only after I approved this code after twiddling it for Ajit that I
>> came across Honza's tracer implementation, which may in fact be
>> retargettable to these loops and do a better job.  I haven't experimented
>> with that.
>
> Well, I originally suggested to merge this with the tracer pass...
I missed that, or it didn't sink into my brain.

>> Again, the more statements it copies the more likely it is to be profitable.
>> Think superblocks to expose CSE, DCE and the like.
>
> Ok, so similar to tracer (where I think the main benefit is actually increasing
> scheduling opportunities for architectures where it matters).
Right.  They're both building superblocks, which has the effect of 
larger windows for scheduling, DCE, CSE, etc.
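To make the superblock effect concrete, here is a hedged C-level sketch of the shape path splitting targets (entirely illustrative source code, not the pass's internal representation): duplicating the join/latch statement into both arms lets each copy fold with its own arm.

```c
#include <assert.h>

/* Before splitting: the latch statement "b += a & 1" is shared by both
   arms of the if-then-else, so it cannot fold with either of them.  */
static int
loop_before (int n, int p)
{
  int b = 0;
  for (int i = 0; i < n; i++)
    {
      int a = p ? i * 2 : i * 3;
      b += a & 1;
    }
  return b;
}

/* After splitting: the latch statement is duplicated into each arm,
   forming two superblocks.  In the then-arm (i * 2) & 1 is always 0,
   so that whole path simplifies away -- the kind of CSE/DCE win the
   duplication is meant to expose.  */
static int
loop_after (int n, int p)
{
  int b = 0;
  for (int i = 0; i < n; i++)
    {
      if (p)
        b += 0;               /* (i * 2) & 1 folded away */
      else
        b += (i * 3) & 1;
    }
  return b;
}
```

The two functions compute the same result; only the second exposes the per-path simplification to later passes.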


>
> Note that both passes are placed quite late and thus won't see much
> of the GIMPLE optimizations (DOM mainly).  I wonder why they were
> not placed adjacent to each other.
Ajit had it fairly early, but that didn't play well with if-conversion.
I just pushed it past if-conversion and vectorization, but before the
last DOM pass.  That turns out to be where tracer lives too as you noted.

>>
>> I wouldn't lose any sleep if we disabled by default or removed, particularly
>> if we can repurpose Honza's code.  In fact, I might strongly support the
>> former until we hear back from Ajit on performance data.
>
> See above for what we do with -ftracer.  path-splitting should at _least_
> restrict itself to operate on optimize_loop_for_speed_p () loops.
I think we need to decide if we want the code at all, particularly given 
the multiple-exit problem.

The difficulty is I think Ajit posted some recent data that shows it's 
helping.  So maybe the thing to do is ask Ajit to try the tracer 
independent of path splitting and take the obvious actions based on 
Ajit's data.


>
> It should also (even if counter-intuitive) limit the amount of stmt copying
> it does - after all there is something like an instruction cache size which
> exceeding for loops will never be a good idea (and even smaller special loop
> caches on some archs).
Yup.

>
> Note that a better heuristic than "at least more than one stmt" would be
> to have at least one PHI in the merger block.  Otherwise I don't see how
> CSE opportunities could exist that we don't see without the duplication.
> And yes, more PHIs -> more possible CSE.  I wouldn't say so for
> the number of stmts.  So please limit the number of stmt copies!
> (after all we do limit the number of stmts we copy during jump threading!)
Let's get some more data before we try to tune path splitting.  In an 
ideal world, the tracer can handle this for us and we just remove path 
splitting completely.

Jeff
Ajit Kumar Agarwal Dec. 16, 2015, 7:43 a.m. UTC | #11
Hello Jeff:

Here is more of the data you asked for.

SPEC FP benchmarks.
a) No Path Splitting + tracer enabled 
    Geomean Score =  4749.726.
b) Path Splitting enabled + tracer enabled.
    Geomean Score =  4781.655.

Conclusion: With both Path Splitting and tracer enabled we got the maximum gains. I think we need to keep the Path Splitting pass.

SPEC INT benchmarks.
a) Path Splitting enabled + tracer not enabled.
    Geomean Score =  3745.193.
b) No Path Splitting + tracer enabled.
    Geomean Score = 3738.558.
c) Path Splitting enabled + tracer enabled.
    Geomean Score = 3742.833.

Conclusion: We are getting more gains with Path Splitting as compared to tracer. With both Path Splitting and tracer enabled we are also getting gains.
I think we should keep the Path Splitting pass.

One more observation: Richard's concern is the creation of multiple exits when splitting paths through duplication. My observation is that the tracer pass also
creates multiple exits through duplication. I don't think that's a practical issue, considering the gains we are getting from splitting paths with
more PRE, CSE and DCE.

Thanks & Regards
Ajit 




-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Wednesday, December 16, 2015 5:20 AM
To: Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

Richard Biener Dec. 16, 2015, 9:57 a.m. UTC | #12
On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal
<ajit.kumar.agarwal@xilinx.com> wrote:
> Hello Jeff:
>
> Here is more of the data you asked for.
>
> SPEC FP benchmarks.
> a) No Path Splitting + tracer enabled
>     Geomean Score =  4749.726.
> b) Path Splitting enabled + tracer enabled.
>     Geomean Score =  4781.655.
>
> Conclusion: With both Path Splitting and tracer enabled we got maximum gains. I think we need to have Path Splitting pass.
>
> SPEC INT benchmarks.
> a) Path Splitting enabled + tracer not enabled.
>     Geomean Score =  3745.193.
> b) No Path Splitting + tracer enabled.
>     Geomean Score = 3738.558.
> c) Path Splitting enabled + tracer enabled.
>     Geomean Score = 3742.833.

I suppose with SPEC you mean SPEC CPU 2006?

Can you disclose the architecture you did the measurements on and the
compile flags you used otherwise?

Note that tracer does a very good job only when paired with FDO so can
you re-run SPEC with FDO and
compare with path-splitting enabled on top of that?

Thanks,
Richard.

Ajit Kumar Agarwal Dec. 16, 2015, 10:13 a.m. UTC | #13
-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Richard Biener

Sent: Wednesday, December 16, 2015 3:27 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal <ajit.kumar.agarwal@xilinx.com> wrote:
> Hello Jeff:
>
> Here is more of the data you asked for.
>
> SPEC FP benchmarks.
> a) No Path Splitting + tracer enabled
>     Geomean Score =  4749.726.
> b) Path Splitting enabled + tracer enabled.
>     Geomean Score =  4781.655.
>
> Conclusion: With both Path Splitting and tracer enabled we got the maximum gains. I think we need to keep the Path Splitting pass.
>
> SPEC INT benchmarks.
> a) Path Splitting enabled + tracer not enabled.
>     Geomean Score =  3745.193.
> b) No Path Splitting + tracer enabled.
>     Geomean Score = 3738.558.
> c) Path Splitting enabled + tracer enabled.
>     Geomean Score = 3742.833.


>>I suppose with SPEC you mean SPEC CPU 2006?


The performance data is with respect to SPEC CPU 2000 benchmarks.

>>Can you disclose the architecture you did the measurements on and the compile flags you used otherwise?


Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz 
cpu cores       : 10
cache size      : 25600 KB

I have used -O3 and enabled the tracer with -ftracer.

Thanks & Regards
Ajit
>>Note that tracer does a very good job only when paired with FDO so can you re-run SPEC with FDO and compare with path-splitting enabled on top of that?



Thanks,
Richard.

Ajit Kumar Agarwal Dec. 17, 2015, 10:38 a.m. UTC | #14
Hello Jeff and Richard:

Here is the summary of the FDO (Feedback Directed Optimization) performance results.

SPEC CPU2000 INT benchmarks.
a) FDO + Splitting Paths enabled + tracer enabled
     Geomean Score = 3907.751673.
b) FDO + No Splitting Paths + tracer enabled
     Geomean Score = 3895.191536.

SPEC CPU2000 FP benchmarks.
a) FDO + Splitting Paths enabled + tracer enabled
     Geomean Score = 4793.321963
b) FDO + No Splitting Paths + tracer enabled
     Geomean Score = 4770.855467

The gains are largest with Split Paths enabled + tracer enabled as compared to No Split Paths + tracer enabled.  The 
Split Paths pass is very much needed.
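
For reference, the relative improvement implied by these geomean scores works out to a fraction of a percent; a quick sketch of the arithmetic (the scores are copied from above, the helper name is invented):

```python
# Percent improvement of the split-paths geomean score over the baseline.
# The scores are the SPEC CPU2000 numbers quoted above; geomean_gain_pct
# is an invented helper, not part of any benchmark harness.
def geomean_gain_pct(with_split_paths, without_split_paths):
    return (with_split_paths / without_split_paths - 1.0) * 100.0

int_gain = geomean_gain_pct(3907.751673, 3895.191536)  # SPEC CPU2000 INT
fp_gain = geomean_gain_pct(4793.321963, 4770.855467)   # SPEC CPU2000 FP
print(f"INT: {int_gain:.2f}%  FP: {fp_gain:.2f}%")     # INT: 0.32%  FP: 0.47%
```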

Thanks & Regards
Ajit

-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal

Sent: Wednesday, December 16, 2015 3:44 PM
To: Richard Biener
Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation



-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Richard Biener

Sent: Wednesday, December 16, 2015 3:27 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal <ajit.kumar.agarwal@xilinx.com> wrote:
> Hello Jeff:
>
> Here is more of the data you asked for.
>
> SPEC FP benchmarks.
> a) No Path Splitting + tracer enabled
>     Geomean Score =  4749.726.
> b) Path Splitting enabled + tracer enabled.
>     Geomean Score =  4781.655.
>
> Conclusion: With both Path Splitting and tracer enabled we got maximum gains. I think we need to have Path Splitting pass.
>
> SPEC INT benchmarks.
> a) Path Splitting enabled + tracer not enabled.
>     Geomean Score =  3745.193.
> b) No Path Splitting + tracer enabled.
>     Geomean Score = 3738.558.
> c) Path Splitting enabled + tracer enabled.
>     Geomean Score = 3742.833.

>>I suppose with SPEC you mean SPEC CPU 2006?

The performance data is with respect to SPEC CPU 2000 benchmarks.

>>Can you disclose the architecture you did the measurements on and the compile flags you used otherwise?

Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz 
cpu cores       : 10
cache size      : 25600 KB

I have used -O3 and enabled the tracer with -ftracer.

Thanks & Regards
Ajit

>>Note that tracer does a very good job only when paired with FDO so can you re-run SPEC with FDO and compare with path-splitting enabled on top of that?

Thanks,
Richard.

> Conclusion: We are getting more gains with Path Splitting as compared to tracer. With both Path Splitting and tracer enabled we are also getting gains.
> I think we should have Path Splitting pass.
>
> One more observation: Richard's concern is the creation of multiple 
> exits with Splitting paths through duplication. My observation is, in 
> the tracer pass there is also a creation of multiple exits through duplication. I don't think that's an issue in practice, considering the gains we are getting with Splitting paths with more PRE, CSE and DCE.
>
> Thanks & Regards
> Ajit
>

-----Original Message-----
From: Jeff Law [mailto:law@redhat.com]
Sent: Wednesday, December 16, 2015 5:20 AM
To: Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya 
Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on 
tree ssa representation

>
> On 12/11/2015 03:05 AM, Richard Biener wrote:
>> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <law@redhat.com> wrote:
>>> On 12/03/2015 07:38 AM, Richard Biener wrote:
>>>>
>>>> This pass is now enabled by default with -Os but has no limits on 
>>>> the amount of stmts it copies.
>>>
>>> The more statements it copies, the more likely it is that the path 
>>> splitting will turn out to be useful!  It's counter-intuitive.
>>
>> Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer 
>> is enabled with -fprofile-use (but it is also properly driven to only 
>> trace hot paths) and otherwise not by default at any optimization level.
> Definitely not appropriate for -Os.  But as I mentioned, I really want to look at the tracer code as it may totally subsume path splitting.
>
>>
>> Don't see how this would work for the CFG pattern it operates on 
>> unless you duplicate the exit condition into that new block creating 
>> an even more obfuscated CFG.
> Agreed, I don't see any way to fix the multiple exit problem.  Then again, this all runs after the tree loop optimizer, so I'm not sure how big of an issue it is in practice.
>
>
>>> It was only after I approved this code after twiddling it for Ajit 
>>> that I came across Honza's tracer implementation, which may in fact 
>>> be retargetable to these loops and do a better job.  I haven't 
>>> experimented with that.
>>
>> Well, I originally suggested to merge this with the tracer pass...
> I missed that, or it didn't sink into my brain.
>
>>> Again, the more statements it copies the more likely it is to be profitable.
>>> Think superblocks to expose CSE, DCE and the like.
>>
>> Ok, so similar to tracer (where I think the main benefit is actually 
>> increasing scheduling opportunities for architectures where it matters).
> Right.  They're both building superblocks, which has the effect of larger windows for scheduling, DCE, CSE, etc.
>
>
>>
>> Note that both passes are placed quite late and thus won't see much 
>> of the GIMPLE optimizations (DOM mainly).  I wonder why they were not 
>> placed adjacent to each other.
> Ajit had it fairly early, but that didn't play well with if-conversion.
>   I just pushed it past if-conversion and vectorization, but before 
> the last DOM pass.  That turns out to be where tracer lives too as you noted.
>
>>>
>>> I wouldn't lose any sleep if we disabled by default or removed, 
>>> particularly if we can repurpose Honza's code.  In fact, I might 
>>> strongly support the former until we hear back from Ajit on performance data.
>>
>> See above for what we do with -ftracer.  path-splitting should at 
>> _least_ restrict itself to operate on optimize_loop_for_speed_p () loops.
> I think we need to decide if we want the code at all, particularly 
> given the multiple-exit problem.
>
> The difficulty is I think Ajit posted some recent data that shows it's 
> helping.  So maybe the thing to do is ask Ajit to try the tracer 
> independent of path splitting and take the obvious actions based on 
> Ajit's data.
>
>
>>
>> It should also (even if counter-intuitive) limit the amount of stmt 
>> copying it does - after all there is sth like an instruction cache 
>> size which exceeding for loops will never be a good idea (and even 
>> smaller special loop caches on some archs).
> Yup.
>
>>
>> Note that a better heuristic than "at least more than one stmt" would 
>> be to have at least one PHI in the merger block.  Otherwise I don't 
>> see how CSE opportunities could exist we don't see without the duplication.
>> And yes, more PHIs -> more possible CSE.  I wouldn't say so for the 
>> number of stmts.  So please limit the number of stmt copies!
>> (after all we do limit the number of stmts we copy during jump
>> threading!)
> Let's get some more data before we try to tune path splitting.  In an 
> ideal world, the tracer can handle this for us and we just remove path 
> splitting completely.
>
> Jeff
Jeff Law Dec. 17, 2015, 11:41 p.m. UTC | #15
On 12/11/2015 03:05 AM, Richard Biener wrote:
> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <law@redhat.com> wrote:
>> On 12/03/2015 07:38 AM, Richard Biener wrote:
>>>
>>> This pass is now enabled by default with -Os but has no limits on the
>>> amount of
>>> stmts it copies.
>>
>> The more statements it copies, the more likely it is that the path splitting
>> will turn out to be useful!  It's counter-intuitive.
>
> Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer is enabled
> with -fprofile-use (but it is also properly driven to only trace hot paths)
> and otherwise not by default at any optimization level.
I've just committed a patch to limit it to loops we're optimizing for speed 
and moved the transformation from -O2 to -O3.

I put in some instrumentation to see when this was triggering and, as 
expected the vast majority of triggers are with very small blocks, 2-3 
statements.  But those are probably the least interesting.  There's 
limited instances where it triggers on large blocks (say > 10 
statements).  But those were with GCC sources.  I'm going to pull out 
SPEC and do some instrumented builds with that, obviously focusing on 
those benchmarks where Ajit saw improvements.

>> Hmmm, the updated code keeps the single latch property, but I'm pretty sure
>> it won't keep a single exit policy.
>>
>> To keep a single exit policy would require keeping an additional block
>> around.  Each of the split paths would unconditionally transfer to this new
>> block.  The new block would then either transfer to the latch block or out
>> of the loop.
>
> Don't see how this would work for the CFG pattern it operates on unless you
> duplicate the exit condition into that new block creating an even more
> obfuscated CFG.
Upon further reflection, I don't think this is important as the pass 
runs after the tree loop optimizers.

>
> Note that both passes are placed quite late and thus won't see much
> of the GIMPLE optimizations (DOM mainly).  I wonder why they were
> not placed adjacent to each other.
I'm going to move them to be adjacent.  If for no other reason than 
it'll make comparisons easier without having to worry about any passes 
between them.  I suspect that'll drop in tonight after I get the kids to 
sleep :-)

Jeff
Zamyatin, Igor Dec. 18, 2015, 3:42 p.m. UTC | #16
> On 12/11/2015 03:05 AM, Richard Biener wrote:
> > On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <law@redhat.com> wrote:
> >> On 12/03/2015 07:38 AM, Richard Biener wrote:
> >>>
> >>> This pass is now enabled by default with -Os but has no limits on
> >>> the amount of stmts it copies.
> >>
> >> The more statements it copies, the more likely it is that the path
> >> splitting will turn out to be useful!  It's counter-intuitive.
> >
> > Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer
> > is enabled with -fprofile-use (but it is also properly driven to only
> > trace hot paths) and otherwise not by default at any optimization level.
> I've just committed a patch to limit it to loops we're optimizing for speed and
> moved the transformation from -O2 to -O3.
>
> I put in some instrumentation to see when this was triggering and, as
> expected the vast majority of triggers are with very small blocks, 2-3
> statements.  But those are probably the least interesting.  There's limited
> instances where it triggers on large blocks (say > 10 statements).  But those
> were with GCC sources.  I'm going to pull out SPEC and do some
> instrumented builds with that, obviously focusing on those benchmarks
> where Ajit saw improvements.

I measured spec2000 and spec2006 on Haswell with recent GCC with the following options - -Ofast -funroll-loops -flto -static -m64 -march=core-avx2 - and saw no visible changes in performance across 2 runs, with and without -fno-split-paths.
Note also that there are a couple of existing issues with path splitting - PR68541 and PR68522 (the latter is currently being bisected).

Thanks,
Igor
Jeff Law Dec. 23, 2015, 6:36 a.m. UTC | #17
On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>
> Mibench/EEMBC benchmarks (Target Microblaze)
>
> Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%).
I'm having a real tough time reproducing any of these results.  In fact, 
I'm having a tough time seeing cases where path splitting even applies 
to the Mibench/EEMBC benchmarks mentioned above.

In the very few cases where split-paths might apply, the net resulting 
assembly code I get is the same with and without split-paths.

How consistent are these results?

What functions are being affected that in turn impact performance?

What options are you using to compile the benchmarks?  I'm trying with 
-O2 -fsplit-paths and -O3 in my attempts to trigger the transformation 
so that I can look more closely at possible heuristics.

Is this with the standard microblaze-elf target?  Or with some other target?

jeff
Ajit Kumar Agarwal Dec. 25, 2015, 8:40 a.m. UTC | #18
Hello Jeff:

I am out on vacation till 3rd Jan 2016.
Is it okay if I respond to the below once I am back in the office?

Thanks & Regards
Ajit

-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Wednesday, December 23, 2015 12:06 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>
> Mibench/EEMBC benchmarks (Target Microblaze)
>
> Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%).

I'm having a real tough time reproducing any of these results.  In fact, I'm having a tough time seeing cases where path splitting even applies to the Mibench/EEMBC benchmarks mentioned above.

In the very few cases where split-paths might apply, the net resulting assembly code I get is the same with and without split-paths.

How consistent are these results?

What functions are being affected that in turn impact performance?

What options are you using to compile the benchmarks?  I'm trying with
-O2 -fsplit-paths and -O3 in my attempts to trigger the transformation so that I can look more closely at possible heuristics.

Is this with the standard microblaze-elf target?  Or with some other target?

jeff
Jeff Law Jan. 2, 2016, 7:32 a.m. UTC | #19
On 12/25/2015 01:40 AM, Ajit Kumar Agarwal wrote:
> Hello Jeff:
>
> I am out on vacation till 3rd Jan 2016.
> Is it okay If I respond on the below once I am back in office.
Yes.  I'm on vacation until then as well.

Jeff
Ajit Kumar Agarwal Jan. 4, 2016, 2:32 p.m. UTC | #20
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Wednesday, December 23, 2015 12:06 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>
> Mibench/EEMBC benchmarks (Target Microblaze)
>
> Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%).

>>I'm having a real tough time reproducing any of these results.  In fact, I'm having a tough time seeing cases where path splitting even applies to the Mibench/EEMBC benchmarks mentioned above.

>>In the very few cases where split-paths might apply, the net resulting assembly code I get is the same with and without split-paths.

>>How consistent are these results?

I am consistently getting the gains for office_ispell, office_stringsearch1 and telecom_adpcm_d. I ran it again today and we see gains in the same benchmark tests 
with the split path changes.

>>What functions are being affected that in turn impact performance?

For office_ispell: The function is "linit (linit, funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2)" in the lookup.c file, and the functions
                                   "checkfile (checkfile, funcdef_no=1, decl_uid=2478, cgraph_uid=1, symbol_order=4)",
                                   "correct (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)" and
                                   "askmode (askmode, funcdef_no=24, decl_uid=2464, cgraph_uid=24, symbol_order=27)"
                                   in the correct.c file.

For office_stringsearch1: The function is "bmhi_search (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1, symbol_order=5)"
in the bmhisrch.c file.

>>What options are you using to compile the benchmarks?  I'm trying with
>>-O2 -fsplit-paths and -O3 in my attempts to trigger the transformation so that I can look more closely at possible heuristics.

I am using the following flags:

-O3 -mlittle-endian -mxl-barrel-shift -mno-xl-soft-div -mhard-float -mxl-float-convert -mxl-float-sqrt -mno-xl-soft-mul -mxl-multiply-high -mxl-pattern-compare.

To disable split paths, -fno-split-paths is used on top of the above flags.

>>Is this with the standard microblaze-elf target?  Or with some other target?

I am using --target=microblaze-xilinx-elf to build the microblaze target.

Thanks & Regards
Ajit

jeff
Jeff Law Jan. 13, 2016, 8:10 a.m. UTC | #21
On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>
> I am consistently getting the gains for office_ispell, office_stringsearch1 and telecom_adpcm_d. I ran it again today and we see gains in the same benchmark tests
> with the split path changes.
>
>>> What functions are being affected that in turn impact performance?
>
> For office_ispell: The function is "linit (linit, funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2)" in the lookup.c file, and the functions
>                                     "checkfile (checkfile, funcdef_no=1, decl_uid=2478, cgraph_uid=1, symbol_order=4)",
>                                     "correct (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)" and
>                                     "askmode (askmode, funcdef_no=24, decl_uid=2464, cgraph_uid=24, symbol_order=27)"
>                                     in the correct.c file.
>
> For office_stringsearch1: The function is "bmhi_search (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1, symbol_order=5)"
> in the bmhisrch.c file.
So I can see split-paths affecting adpcm & lookup.  I don't see it 
affecting correct.c or bmhisrch.c.

That's progress though.  It's likely one or more of the flags is 
critical, so thanks for passing those along.

I'm going to focus on adpcm for the moment, in particular adpcm_coder. 
It appears the key blocks are:


;;   basic block 14, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 13, next block 15, flags: (NEW, REACHABLE)
;;    pred:       12 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                13 [100.0%]  (FALLTHRU,EXECUTABLE)
   # valpred_12 = PHI <valpred_54(12), valpred_55(13)>
   _112 = MAX_EXPR <valpred_12, -32768>;
   valpred_18 = MIN_EXPR <_112, 32767>;
   delta_56 = delta_7 | iftmp.1_114;
   _57 = indexTable[delta_56];
   index_58 = _57 + index_107;
   _113 = MIN_EXPR <index_58, 88>;
   index_111 = MAX_EXPR <_113, 0>;
   step_59 = stepsizeTable[index_111];
   if (bufferstep_93 != 0)
     goto <bb 15>;
   else
     goto <bb 16>;
;;    succ:       15 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                16 [50.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 15, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 14, next block 16, flags: (NEW, REACHABLE)
;;    pred:       14 [50.0%]  (TRUE_VALUE,EXECUTABLE)
   _60 = delta_56 << 4;
   goto <bb 17>;
;;    succ:       17 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 16, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 15, next block 17, flags: (NEW, REACHABLE)
;;    pred:       14 [50.0%]  (FALSE_VALUE,EXECUTABLE)
   outp_62 = outp_83 + 1;
   _63 = (signed char) delta_56;
   _65 = (signed char) outputbuffer_90;
   _66 = _63 | _65;
   *outp_83 = _66;
;;    succ:       17 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 17, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 16, next block 18, flags: (NEW, REACHABLE)
;;    pred:       15 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                16 [100.0%]  (FALLTHRU,EXECUTABLE)
   # outp_3 = PHI <outp_83(15), outp_62(16)>
   # outputbuffer_21 = PHI <_60(15), outputbuffer_90(16)>
   _109 = bufferstep_93 ^ 1;
   _98 = _109 & 1;
   ivtmp.11_68 = ivtmp.11_105 + 2;
   if (ivtmp.11_68 != _116)
     goto <bb 4>;
   else
     goto <bb 18>;


Block #17 is the join point that we're going to effectively copy into 
blocks #15 and #16.  Doing so in turn exposes bufferstep_93 as the 
constant 0 in block #16, which in turn allows elimination of a couple 
statements in the extended version of block #16 and we propagate the 
constant 1 for bufferstep_93 to the top of the loop when reached via 
block #16.  So we save a few instructions.  However, I think we're 
actually doing a fairly poor job here.

bufferstep is a great example of a flip-flop variable and its value is 
statically computable based on the path from the prior loop iteration 
which, if exploited would allow the FSM threader to eliminate the 
conditional at the end of bb14.  I'm going to have to play with that.
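
To illustrate the flip-flop behavior (a standalone sketch, not GCC code): the toggle at the bottom of the loop means bufferstep's value in any iteration is fully determined by which path the previous iteration took.

```python
# Simulate the bufferstep toggle from block #17 (_109 = bufferstep_93 ^ 1;
# _98 = _109 & 1).  Starting from 1, the value simply alternates each
# iteration, so it is statically known on every path through the loop.
def bufferstep_sequence(iterations, start=1):
    values, b = [], start
    for _ in range(iterations):
        values.append(b)
        b = (b ^ 1) & 1  # the toggle performed at the bottom of the loop
    return values

print(bufferstep_sequence(6))  # [1, 0, 1, 0, 1, 0]
```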

Anyway, it's late and I want to rip this test apart a bit more and see 
how it interacts with the heuristic that I've cobbled together as well 
as see what it would take to have DOM or VRP get data on bufferstep_93 
on the true path out of BB14 after a path-split.

Jeff
Jeff Law Jan. 14, 2016, 8:55 a.m. UTC | #22
On 01/13/2016 01:10 AM, Jeff Law wrote:
>
> I'm going to focus on adpcm for the moment, in particular adpcm_coder.
> It appears the key blocks are:
Looking at adpcm_decoder we have the same idiom as in adpcm_coder:

   if (bufferstep_79 != 0)
     goto <bb 6>;
   else
     goto <bb 7>;
;;    succ:       6 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                7 [50.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 6, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 5, next block 7, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (TRUE_VALUE,EXECUTABLE)
   delta_31 = inputbuffer_65 & 15;
   goto <bb 8>;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 7, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 6, next block 8, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (FALSE_VALUE,EXECUTABLE)
   inp_32 = inp_70 + 1;
   _33 = *inp_70;
   inputbuffer_34 = (int) _33;
   _35 = inputbuffer_34 >> 4;
   delta_36 = _35 & 15;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 8, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 7, next block 9, flags: (NEW, REACHABLE)
;;    pred:       6 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                7 [100.0%]  (FALLTHRU,EXECUTABLE)
   # inp_2 = PHI <inp_70(6), inp_32(7)>
   # delta_5 = PHI <delta_31(6), delta_36(7)>
   # inputbuffer_16 = PHI <inputbuffer_65(6), inputbuffer_34(7)>
   _83 = bufferstep_79 ^ 1;
   _1 = _83 & 1;
   _39 = indexTable[delta_5];
   index_40 = _39 + index_81;
   _84 = MIN_EXPR <index_40, 88>;
   index_62 = MAX_EXPR <_84, 0>;
   sign_41 = delta_5 & 8;
   vpdiff_42 = step_71 >> 3;
   _43 = delta_5 & 4;
   if (_43 != 0)
     goto <bb 9>;
   else
     goto <bb 10>;



The difference is there's all kinds of code between BB8 and the latch. 
So it doesn't trigger path splitting, which in turn doesn't expose the 
CSE/DCE opportunity for adpcm_decoder.  We're likely leaving some 
performance on the table for this code.

This brings up a deeper question.   Are we approaching path splitting in 
a fundamentally backward way?

Right now path splitting is guided by finding the right shape in the CFG 
immediately preceding the latch block of a loop.  Plus some heuristics 
I'm working on to predict when the duplication is more likely to lead to 
CSE/DCE opportunities.  But really it's driven by finding the right CFG 
shape before the latch block.
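
Concretely, that structural trigger can be pictured like this (a hedged sketch with an invented representation; the real pass works on GCC basic_block structures, not dicts):

```python
# Invented mini-representation (not GCC's gimple-ssa-split-paths.c): the
# current trigger is purely structural -- a block that joins exactly two
# predecessors and whose only successor is the loop latch.
def is_split_candidate(bb, latch, preds, succs):
    return len(preds.get(bb, [])) == 2 and succs.get(bb) == [latch]

# Shape like adpcm_coder's blocks 15/16 joining in block 17 ahead of the
# latch (99 standing in for the latch block id):
preds = {17: [15, 16]}
succs = {17: [99]}
print(is_split_candidate(17, 99, preds, succs))  # True
```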

Should path splitting really be driven by partitions of PHI arguments 
and other incoming values (such as the implicit sets from conditionals)? 
I've been pondering this for a long time as it's a natural extension 
of what we do in the erroneous path isolation pass.

ie at any join point in the CFG, look for partitions of the arguments in 
any PHI nodes and split paths based on that.

So for example if we had a PHI like

x5 = phi (x4, 0, 0, 1, 1)

Look at how x5 is used.  If propagation of any of those PHI argument 
values would result in simplifications at the use points of x5, then we 
should consider splitting off all the paths with the beneficial PHI 
argument.

So in the case above, assume that the value 0 results in 
simplifications.  We'd create two paths.  One for the paths where x5 
would get the value zero, another for everything else.

This is *a lot* like the path isolation done for erroneous paths in terms of 
the CFG & SSA manipulations.  Essentially in both cases we want to 
isolate a path so that we can manipulate it based on known inputs 
without affecting paths with uninteresting inputs.  All that really 
changes is how we drive the selection of paths to isolate and whether or 
not we insert the __builtin_trap.
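
A minimal sketch of that selection step (invented names; nothing here is GCC code): partition the PHI arguments by whether propagating the value would simplify the uses of the PHI result, and isolate the corresponding paths together.

```python
# Partition incoming (edge, value) pairs of a PHI by whether propagating
# the value into the PHI result's uses would simplify them.  'simplifies'
# is a stand-in predicate for whatever the real analysis would compute.
def partition_phi_args(phi_args, simplifies):
    isolate = [(e, v) for e, v in phi_args if simplifies(v)]
    keep = [(e, v) for e, v in phi_args if not simplifies(v)]
    return isolate, keep

# The example above: x5 = phi(x4, 0, 0, 1, 1), assuming the value 0 enables
# simplification at x5's use points.
args = [("e1", "x4"), ("e2", 0), ("e3", 0), ("e4", 1), ("e5", 1)]
isolate, keep = partition_phi_args(args, lambda v: v == 0)
print(isolate)  # [('e2', 0), ('e3', 0)] -- the paths to split off together
```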


Anyway, going back to adpcm_decode, we do end up splitting this path:

  # vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
   if (sign_41 != 0)
     goto <bb 15>;
   else
     goto <bb 16>;
;;    succ:       15
;;                16

;;   basic block 15, loop depth 1
;;    pred:       14
   valpred_51 = valpred_76 - vpdiff_12;
   goto <bb 17>;
;;    succ:       17

;;   basic block 16, loop depth 1
;;    pred:       14
   valpred_52 = vpdiff_12 + valpred_76;
;;    succ:       17

;;   basic block 17, loop depth 1
;;    pred:       15
;;                16
   # valpred_7 = PHI <valpred_51(15), valpred_52(16)>
   _85 = MAX_EXPR <valpred_7, -32768>;
   valpred_13 = MIN_EXPR <_85, 32767>;
   step_53 = stepsizeTable[index_62];
   outp_54 = outp_69 + 2;
   _55 = (short int) valpred_13;
   MEM[base: outp_54, offset: -2B] = _55;
   if (outp_54 != _74)
     goto <bb 20>;
   else
     goto <bb 18>;

This doesn't result in anything particularly interesting/good AFAICT. 
We propagate valpred_51/52 into the use in the MAX_EXPR in the duplicate 
paths, but that doesn't allow any further simplification.

Ajit, can you confirm which of adpcm_code or adpcm_decode where path 
splitting is showing a gain?  I suspect it's the former but would like 
to make sure so that I can adjust the heuristics properly.


jeff
Jeff Law Jan. 15, 2016, 10:38 p.m. UTC | #23
On 01/13/2016 01:10 AM, Jeff Law wrote:
>
> I'm going to focus on adpcm for the moment, in particular adpcm_coder.
> It appears the key blocks are:
>
>
> ;;   basic block 14, loop depth 1, count 0, freq 9100, maybe hot
> ;;    prev block 13, next block 15, flags: (NEW, REACHABLE)
> ;;    pred:       12 [100.0%]  (FALLTHRU,EXECUTABLE)
> ;;                13 [100.0%]  (FALLTHRU,EXECUTABLE)
>    # valpred_12 = PHI <valpred_54(12), valpred_55(13)>
>    _112 = MAX_EXPR <valpred_12, -32768>;
>    valpred_18 = MIN_EXPR <_112, 32767>;
>    delta_56 = delta_7 | iftmp.1_114;
>    _57 = indexTable[delta_56];
>    index_58 = _57 + index_107;
>    _113 = MIN_EXPR <index_58, 88>;
>    index_111 = MAX_EXPR <_113, 0>;
>    step_59 = stepsizeTable[index_111];
>    if (bufferstep_93 != 0)
>      goto <bb 15>;
>    else
>      goto <bb 16>;
> ;;    succ:       15 [50.0%]  (TRUE_VALUE,EXECUTABLE)
> ;;                16 [50.0%]  (FALSE_VALUE,EXECUTABLE)
>
> ;;   basic block 15, loop depth 1, count 0, freq 4550, maybe hot
> ;;    prev block 14, next block 16, flags: (NEW, REACHABLE)
> ;;    pred:       14 [50.0%]  (TRUE_VALUE,EXECUTABLE)
>    _60 = delta_56 << 4;
>    goto <bb 17>;
> ;;    succ:       17 [100.0%]  (FALLTHRU,EXECUTABLE)
>
> ;;   basic block 16, loop depth 1, count 0, freq 4550, maybe hot
> ;;    prev block 15, next block 17, flags: (NEW, REACHABLE)
> ;;    pred:       14 [50.0%]  (FALSE_VALUE,EXECUTABLE)
>    outp_62 = outp_83 + 1;
>    _63 = (signed char) delta_56;
>    _65 = (signed char) outputbuffer_90;
>    _66 = _63 | _65;
>    *outp_83 = _66;
> ;;    succ:       17 [100.0%]  (FALLTHRU,EXECUTABLE)
>
> ;;   basic block 17, loop depth 1, count 0, freq 9100, maybe hot
> ;;    prev block 16, next block 18, flags: (NEW, REACHABLE)
> ;;    pred:       15 [100.0%]  (FALLTHRU,EXECUTABLE)
> ;;                16 [100.0%]  (FALLTHRU,EXECUTABLE)
>    # outp_3 = PHI <outp_83(15), outp_62(16)>
>    # outputbuffer_21 = PHI <_60(15), outputbuffer_90(16)>
>    _109 = bufferstep_93 ^ 1;
>    _98 = _109 & 1;
>    ivtmp.11_68 = ivtmp.11_105 + 2;
>    if (ivtmp.11_68 != _116)
>      goto <bb 4>;
>    else
>      goto <bb 18>;
>
>
> Block #17 is the join point that we're going to effectively copy into
> blocks #15 and #16.  Doing so in turn exposes bufferstep_93 as the
> constant 0 in block #16, which in turn allows elimination of a couple
> statements in the extended version of block #16 and we propagate the
> constant 1 for bufferstep_93 to the top of the loop when reached via
> block #16.  So we save a few instructions.  However, I think we're
> actually doing a fairly poor job here.
>
> bufferstep is a great example of a flip-flop variable and its value is
> statically computable based on the path from the prior loop iteration
> which, if exploited would allow the FSM threader to eliminate the
> conditional at the end of bb14.  I'm going to have to play with that.
So I've extended DOM & uncprop to pick up the missed propagation 
opportunity, which in turn allows DOM to simplify this function even 
further and hopefully set ourselves up for either unrolling the loop or 
using the FSM threader to eliminate the test on bufferstep completely. 
But those are gcc-7 items.


>
> Anyway, it's late and I want to rip this test apart a bit more and see
> how it interacts with the heuristic that I've cobbled together as well
> as see what it would take to have DOM or VRP get data on bufferstep_93
> on the true path out of BB14 after a path-split.
As I expected, this showed a need for a minor tweak to the heuristic I'm 
poking at for path splitting.  Nothing particularly hard, it needs 
further work (it's not compile-time efficient right now), but it's good 
enough to put away adpcm_code and continue looking more closely at 
adpcm_decode.

Jeff
Jeff Law Jan. 15, 2016, 11:02 p.m. UTC | #24
On 01/14/2016 01:55 AM, Jeff Law wrote:
[ Replying to myself again, mostly to make sure we've got these thoughts 
in the archives. ]
>
> Anyway, going back to adpcm_decode, we do end up splitting this path:
>
>   # vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
>    if (sign_41 != 0)
>      goto <bb 15>;
>    else
>      goto <bb 16>;
> ;;    succ:       15
> ;;                16
>
> ;;   basic block 15, loop depth 1
> ;;    pred:       14
>    valpred_51 = valpred_76 - vpdiff_12;
>    goto <bb 17>;
> ;;    succ:       17
>
> ;;   basic block 16, loop depth 1
> ;;    pred:       14
>    valpred_52 = vpdiff_12 + valpred_76;
> ;;    succ:       17
>
> ;;   basic block 17, loop depth 1
> ;;    pred:       15
> ;;                16
>    # valpred_7 = PHI <valpred_51(15), valpred_52(16)>
>    _85 = MAX_EXPR <valpred_7, -32768>;
>    valpred_13 = MIN_EXPR <_85, 32767>;
>    step_53 = stepsizeTable[index_62];
>    outp_54 = outp_69 + 2;
>    _55 = (short int) valpred_13;
>    MEM[base: outp_54, offset: -2B] = _55;
>    if (outp_54 != _74)
>      goto <bb 20>;
>    else
>      goto <bb 18>;
>
> This doesn't result in anything particularly interesting/good AFAICT. We
> propagate valpred_51/52 into the use in the MAX_EXPR in the duplicate
> paths, but that doesn't allow any further simplification.
So with the heuristic I'm poking at, this gets rejected.  Essentially it 
doesn't think it's likely to expose CSE/DCE opportunities (and it's 
correct).  The number of statements in predecessor blocks that feed 
operands in the to-be-copied-block is too small relative to the size of 
the to-be-copied-block.


>
> Ajit, can you confirm which of adpcm_code or adpcm_decode where path
> splitting is showing a gain?  I suspect it's the former but would like
> to make sure so that I can adjust the heuristics properly.
I'd still like to have this answered when you can Ajit, just to be 100% 
sure that it's the path splitting in adpcm_code that's responsible for 
the improvements you're seeing in adpcm.

jeff
Jeff Law Jan. 16, 2016, 6:32 a.m. UTC | #25
On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>
>
> -----Original Message----- From: Jeff Law [mailto:law@redhat.com]
> Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal;
> Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta;
> Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re:
> [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
> representation
>
> On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>>
>> Mibench/EEMBC benchmarks (Target Microblaze)
>>
>> Automotive_qsort1(4.03%), Office_ispell(4.29%),
>> Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%),
>> ospfv2_lite(1.35%).
>>> I'm having a real tough time reproducing any of these results.
>>> In fact, I'm having a tough time seeing cases where path
>>> splitting even applies to the Mibench/EEMBC benchmarks
>>> mentioned above.
>
>>> In the very few cases where split-paths might apply, the net
>>> resulting assembly code I get is the same with and without
>>> split-paths.
>
>>> How consistent are these results?
>
> I am consistently getting the gains for office_ispell and
> office_stringsearch1, telecom_adpcm_d. I ran it again today and we see
> gains in the same benchmark tests with the split path changes.
>
>>> What functions are being affected that in turn impact
>>> performance?
>
> For office_ispell: The functions are "linit (linit,
> funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2)" in
> lookup.c, "checkfile (checkfile, funcdef_no=1,
> decl_uid=2478, cgraph_uid=1, symbol_order=4)", "correct
> (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2,
> symbol_order=5)" and "askmode (askmode, funcdef_no=24,
> decl_uid=2464, cgraph_uid=24, symbol_order=27)" in correct.c.
>
> For office_stringsearch1: The function is "bmhi_search
> (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1,
> symbol_order=5)" in bmhisrch.c.
Can you send me the pre-processed lookup.c, correct.c and bmhi_search.c?

I generated mine using x86 and that may be affecting my ability to 
reproduce your results on the microblaze target.  Looking specifically 
at bmhi_search.c and correct.c, I see they are going to be sensitive to 
the target headers.  If (for example) they use FORTIFY_SOURCE or macros 
for toupper.

In the bmhi_search I'm looking at, I don't see any opportunities for the 
path splitter to do anything.  The CFG just doesn't have the right 
shape.  Again, that may be an artifact of how toupper is implemented in 
the system header files -- hence my request for the cpp output on each 
of the important files.

Jeff
Ajit Kumar Agarwal Jan. 18, 2016, 9:13 a.m. UTC | #26
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Saturday, January 16, 2016 12:03 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>
>
> -----Original Message----- From: Jeff Law [mailto:law@redhat.com]
> Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal;
> Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta;
> Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re:
> [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
> representation
>
> On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>>
>> Mibench/EEMBC benchmarks (Target Microblaze)
>>
>> Automotive_qsort1(4.03%), Office_ispell(4.29%),
>> Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%),
>> ospfv2_lite(1.35%).
>>> I'm having a real tough time reproducing any of these results.
>>> In fact, I'm having a tough time seeing cases where path splitting
>>> even applies to the Mibench/EEMBC benchmarks mentioned above.
>
>>> In the very few cases where split-paths might apply, the net
>>> resulting assembly code I get is the same with and without
>>> split-paths.
>
>>> How consistent are these results?
>
> I am consistently getting the gains for office_ispell and
> office_stringsearch1, telecom_adpcm_d. I ran it again today and we see
> gains in the same benchmark tests with the split path changes.
>
>>> What functions are being affected that in turn impact performance?
>
> For office_ispell: The functions are "linit (linit,
> funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2)" in
> lookup.c, "checkfile (checkfile, funcdef_no=1,
> decl_uid=2478, cgraph_uid=1, symbol_order=4)", "correct
> (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2,
> symbol_order=5)" and "askmode (askmode, funcdef_no=24,
> decl_uid=2464, cgraph_uid=24, symbol_order=27)" in correct.c.
>
> For office_stringsearch1: The function is "bmhi_search
> (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1,
> symbol_order=5)" in bmhisrch.c.
>> Can you send me the pre-processed lookup.c, correct.c and bmhi_search.c?
>>
>> I generated mine using x86 and that may be affecting my ability to
>> reproduce your results on the microblaze target.  Looking specifically
>> at bmhi_search.c and correct.c, I see they are going to be sensitive
>> to the target headers.  If (for example) they use FORTIFY_SOURCE or
>> macros for toupper.
>>
>> In the bmhi_search I'm looking at, I don't see any opportunities for
>> the path splitter to do anything.  The CFG just doesn't have the
>> right shape.  Again, that may be an artifact of how toupper is
>> implemented in the system header files -- hence my request for the
>> cpp output on each of the important files.


Would you like me to send the above files and functions pre-processed with the -E option flag?

Thanks & Regards
Ajit
Jeff
Ajit Kumar Agarwal Jan. 18, 2016, 6:27 p.m. UTC | #27
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Saturday, January 16, 2016 4:33 AM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/14/2016 01:55 AM, Jeff Law wrote:
[ Replying to myself again, mostly to make sure we've got these thoughts in the archives. ]
>
> Anyway, going back to adpcm_decode, we do end up splitting this path:
>
>   # vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
>    if (sign_41 != 0)
>      goto <bb 15>;
>    else
>      goto <bb 16>;
> ;;    succ:       15
> ;;                16
>
> ;;   basic block 15, loop depth 1
> ;;    pred:       14
>    valpred_51 = valpred_76 - vpdiff_12;
>    goto <bb 17>;
> ;;    succ:       17
>
> ;;   basic block 16, loop depth 1
> ;;    pred:       14
>    valpred_52 = vpdiff_12 + valpred_76;
> ;;    succ:       17
>
> ;;   basic block 17, loop depth 1
> ;;    pred:       15
> ;;                16
>    # valpred_7 = PHI <valpred_51(15), valpred_52(16)>
>    _85 = MAX_EXPR <valpred_7, -32768>;
>    valpred_13 = MIN_EXPR <_85, 32767>;
>    step_53 = stepsizeTable[index_62];
>    outp_54 = outp_69 + 2;
>    _55 = (short int) valpred_13;
>    MEM[base: outp_54, offset: -2B] = _55;
>    if (outp_54 != _74)
>      goto <bb 20>;
>    else
>      goto <bb 18>;
>
> This doesn't result in anything particularly interesting/good AFAICT.
> We propagate valpred_51/52 into the use in the MAX_EXPR in the
> duplicate paths, but that doesn't allow any further simplification.
>> So with the heuristic I'm poking at, this gets rejected.  Essentially
>> it doesn't think it's likely to expose CSE/DCE opportunities (and it's
>> correct).  The number of statements in predecessor blocks that feed
>> operands in the to-be-copied-block is too small relative to the size
>> of the to-be-copied-block.
>
> Ajit, can you confirm which of adpcm_code or adpcm_decode where path
> splitting is showing a gain?  I suspect it's the former but would like
> to make sure so that I can adjust the heuristics properly.
>> I'd still like to have this answered when you can Ajit, just to be
>> 100% sure that it's the path splitting in adpcm_code that's
>> responsible for the improvements you're seeing in adpcm.


The adpcm_coder gets optimized with path splitting, whereas adpcm_decoder 
is not optimized further with path splitting. In adpcm_decoder the join 
node is duplicated into its predecessors, but the duplication of the join 
node does not enable any further optimization.

In adpcm_coder, path splitting triggers the following optimization.

1. /* Output last step, if needed */
    if ( !bufferstep )
      *outp++ = outputbuffer;

Inside the loop, the IF-THEN that fills outputbuffer runs when bufferstep 
is 1, after which the flip makes bufferstep 0. On the exit branch of the 
loop, if bufferstep is 1 the flip converts it to 0, and the IF-THEN above 
(outside the loop) generates the store assigning outputbuffer to *outp.

With path splitting, when bufferstep is 1 the exit branch of the loop 
branches directly to the store above. This removes the flip of bufferstep 
(an XOR with immediate 1) on that path, leaving only one level of exit 
branch on the bufferstep==1 path, and lets the exit branch be scheduled 
with a meaningful instruction instead of the XOR with immediate 1.

Without path splitting, when bufferstep is 1 the exit branch of the loop 
first branches to the piece of code flipping it to zero, and the IF-THEN 
outside the loop then does the store assigning outputbuffer to *outp. 
That gives two levels of branching on the bufferstep==1 exit path, 
generating non-optimal code, and the two-level exit branch is scheduled 
with the XOR with immediate 1.

 Thanks & Regards
Ajit

jeff
Jeff Law Jan. 27, 2016, 7:13 a.m. UTC | #28
On 01/18/2016 02:13 AM, Ajit Kumar Agarwal wrote:
>
>
> -----Original Message-----
> From: Jeff Law [mailto:law@redhat.com]
> Sent: Saturday, January 16, 2016 12:03 PM
> To: Ajit Kumar Agarwal; Richard Biener
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
>
> On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>>
>>
>> -----Original Message----- From: Jeff Law [mailto:law@redhat.com]
>> Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal;
>> Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta;
>> Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re:
>> [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>> representation
>>
>> On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>>>
>>> Mibench/EEMBC benchmarks (Target Microblaze)
>>>
>>> Automotive_qsort1(4.03%), Office_ispell(4.29%),
>>> Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%),
>>> ospfv2_lite(1.35%).
>>>> I'm having a real tough time reproducing any of these results.
>>>> In fact, I'm having a tough time seeing cases where path splitting
>>>> even applies to the Mibench/EEMBC benchmarks
>>>>>> mentioned above.
>>
>>>> In the very few cases where split-paths might apply, the net
>>>> resulting assembly code I get is the same with and without
>>>> split-paths.
>>
>>>> How consistent are these results?
>>
>> I am consistently getting the gains for office_ispell and
>> office_stringsearch1, telcom_adpcm_d. I ran it again today and we see
>> gains in the same bench mark tests with the split path changes.
>>
>>>> What functions are being affected that in turn impact performance?
>>
>> For office_ispell: The function are Function "linit (linit,
>> funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2) for
>> lookup.c file". "Function checkfile (checkfile, funcdef_no=1,
>> decl_uid=2478, cgraph_uid=1, symbol_order=4)" " Function correct
>> (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)"
>> " Function askmode (askmode, funcdef_no=24, decl_uid=2464,
>> cgraph_uid=24, symbol_order=27)" for correct.c file.
>>
>> For office_stringsearch1: The function is Function "bmhi_search
>> (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1,
>> symbol_order=5)" for bmhisrch.c file.
>>> Can you send me the pre-processed lookup.c, correct.c and bmhi_search.c?
>
>>> I generated mine using x86 and that may be affecting my ability to reproduce your results on the microblaze target.  Looking specifically at bmhi_search.c and correct.c, I see they are going to be sensitive to the target headers.  If (for example) they use FORTIFY_SOURCE or macros for toupper.
>
>>> In the bmhi_search I'm looking at, I don't see any opportunities for the path splitter to do anything.  The CFG just doesn't have the right shape.  Again, that may be an artifact of how toupper is implemented in the system header files -- hence my request for the cpp output on each of the important files.
>
> Would you like me to send the above files and functions pre-processed with the -E option flag?
That would be perfect.  I'm on the road the latter half of the week into 
early next week -- the long flights might be a good time for me to stare 
at the dumps and tweak the heuristic a bit.

jeff
Jeff Law Jan. 27, 2016, 7:17 a.m. UTC | #29
On 01/18/2016 11:27 AM, Ajit Kumar Agarwal wrote:

>>
>> Ajit, can you confirm which of adpcm_code or adpcm_decode where
>> path splitting is showing a gain?  I suspect it's the former but
>> would like to make sure so that I can adjust the heuristics
>> properly.
>>> I'd still like to have this answered when you can Ajit, just to
>>> be 100% sure that it's the path splitting in adpcm_code that's
>>> responsible for the improvements you're seeing in adpcm.
>
> The adpcm_coder gets optimized with path splitting whereas the
> adpcm_decoder is not optimized further with path splitting. In
> adpcm_decoder the join node is duplicated into its predecessors and
> with the duplication of the join node the code is not optimized further.
Right.  Just wanted to make sure my analysis corresponded with what you 
were seeing in your benchmarking -- and it does.

I suspect that if we looked at this problem from the angle of isolating 
paths based on how constant PHI arguments feed into and allow 
simplifications in later blocks that we might get better long term 
results -- including improving adpcm_decoder which has the same idiom as 
adpcm_coder -- it's just in the wrong spot in the CFG.

But that's obviously gcc-7 material.

Jeff
Ajit Kumar Agarwal Jan. 27, 2016, 9:35 a.m. UTC | #30
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Wednesday, January 27, 2016 12:44 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/18/2016 02:13 AM, Ajit Kumar Agarwal wrote:
>
>
> -----Original Message-----
> From: Jeff Law [mailto:law@redhat.com]
> Sent: Saturday, January 16, 2016 12:03 PM
> To: Ajit Kumar Agarwal; Richard Biener
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli
> Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on
> tree ssa representation
>
> On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>>
>>
>> -----Original Message----- From: Jeff Law [mailto:law@redhat.com]
>> Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal;
>> Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta;
>> Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re:
>> [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>> representation
>>
>> On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>>>
>>> Mibench/EEMBC benchmarks (Target Microblaze)
>>>
>>> Automotive_qsort1(4.03%), Office_ispell(4.29%),
>>> Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%),
>>> ospfv2_lite(1.35%).
>>>> I'm having a real tough time reproducing any of these results.
>>>> In fact, I'm having a tough time seeing cases where path splitting
>>>> even applies to the Mibench/EEMBC benchmarks mentioned above.
>>
>>>> In the very few cases where split-paths might apply, the net
>>>> resulting assembly code I get is the same with and without
>>>> split-paths.
>>
>>>> How consistent are these results?
>>
>> I am consistently getting the gains for office_ispell and
>> office_stringsearch1, telecom_adpcm_d. I ran it again today and we see
>> gains in the same benchmark tests with the split path changes.
>>
>>>> What functions are being affected that in turn impact performance?
>>
>> For office_ispell: The functions are "linit (linit,
>> funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2)" in
>> lookup.c, "checkfile (checkfile, funcdef_no=1,
>> decl_uid=2478, cgraph_uid=1, symbol_order=4)", "correct
>> (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)"
>> and "askmode (askmode, funcdef_no=24, decl_uid=2464,
>> cgraph_uid=24, symbol_order=27)" in correct.c.
>>
>> For office_stringsearch1: The function is "bmhi_search
>> (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1,
>> symbol_order=5)" in bmhisrch.c.
>>> Can you send me the pre-processed lookup.c, correct.c and bmhi_search.c?
>
>>> I generated mine using x86 and that may be affecting my ability to
>>> reproduce your results on the microblaze target.  Looking
>>> specifically at bmhi_search.c and correct.c, I see they are going to
>>> be sensitive to the target headers.  If (for example) they use
>>> FORTIFY_SOURCE or macros for toupper.
>
>>> In the bmhi_search I'm looking at, I don't see any opportunities for
>>> the path splitter to do anything.  The CFG just doesn't have the
>>> right shape.  Again, that may be an artifact of how toupper is
>>> implemented in the system header files -- hence my request for the
>>> cpp output on each of the important files.
>
> Would you like me to send the above files and functions pre-processed
> with the -E option flag?
>> That would be perfect.  I'm on the road the latter half of the week
>> into early next week -- the long flights might be a good time for me
>> to stare at the dumps and tweak the heuristic a bit.


Please find the files pre-processed with the -E flag for office_ispell/src/lookup.c, office_ispell/src/correct.c and office_stringsearch1/src/bmhisrch.c.
The functions of interest to us are: correct, linit and bmhi_search.

Thanks & Regards
Ajit

jeff
Ajit Kumar Agarwal Jan. 27, 2016, 5:21 p.m. UTC | #31
-----Original Message-----
From: Jeff Law [mailto:law@redhat.com] 

Sent: Wednesday, January 27, 2016 12:48 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/18/2016 11:27 AM, Ajit Kumar Agarwal wrote:

>>
>> Ajit, can you confirm which of adpcm_code or adpcm_decode where path
>> splitting is showing a gain?  I suspect it's the former but would
>> like to make sure so that I can adjust the heuristics properly.
>>> I'd still like to have this answered when you can Ajit, just to be
>>> 100% sure that it's the path splitting in adpcm_code that's
>>> responsible for the improvements you're seeing in adpcm.
>
> The adpcm_coder gets optimized with path splitting whereas the
> adpcm_decoder is not optimized further with path splitting. In
> adpcm_decoder the join node is duplicated into its predecessors and
> with the duplication of the join node the code is not optimized further.
>> Right.  Just wanted to make sure my analysis corresponded with what
>> you were seeing in your benchmarking -- and it does.
>
>> I suspect that if we looked at this problem from the angle of
>> isolating paths based on how constant PHI arguments feed into and
>> allow simplifications in later blocks that we might get better long
>> term results -- including improving adpcm_decoder which has the same
>> idiom as adpcm_coder -- it's just in the wrong spot in the CFG.
>
>> But that's obviously gcc-7 material.


Can I look into it?

Thanks & Regards
Ajit

Jeff
Jeff Law Feb. 4, 2016, 8:57 a.m. UTC | #32
On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote:
>
>
> -----Original Message-----
> From: Jeff Law [mailto:law@redhat.com]
> Sent: Wednesday, December 23, 2015 12:06 PM
> To: Ajit Kumar Agarwal; Richard Biener
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
>
> On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote:
>>
>> Mibench/EEMBC benchmarks (Target Microblaze)
>>
>> Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%).
>>> I'm having a real tough time reproducing any of these results.  In fact, I'm having a tough time seeing cases where path splitting even applies to the Mibench/EEMBC benchmarks mentioned above.
>
>>> In the very few cases where split-paths might apply, the net resulting assembly code I get is the same with and without split-paths.
>
>>> How consistent are these results?
>
> I am consistently getting the gains for office_ispell and office_stringsearch1, telecom_adpcm_d. I ran it again today and we see gains in the same benchmark tests
> with the split path changes.
>
>>> What functions are being affected that in turn impact performance?
>
> For office_ispell: The functions are Function "linit (linit, funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2) for lookup.c file".
>                                     "Function checkfile (checkfile, funcdef_no=1, decl_uid=2478, cgraph_uid=1, symbol_order=4)"
>                                     " Function correct (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)"
>                                     " Function askmode (askmode, funcdef_no=24, decl_uid=2464, cgraph_uid=24, symbol_order=27)"
>                                     for correct.c file.
>
> For office_stringsearch1: The function is Function "bmhi_search (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1, symbol_order=5)"
> for bmhisrch.c file.
In linit there are two path splitting opportunities.  Neither of them 
are cases where path splitting exposes any CSE or DCE opportunities at 
the tree level.  In fact, in both cases there are no operands from the 
predecessors that feed into the join (the block we duplicate to split 
the path).

There's a path splitting opportunity in correct.c::givehelp which AFAICT 
is totally uninteresting from a performance standpoint.  However, these 
are one of the few cases where path splitting actually results in 
something that is better optimized at the tree level.  How ironic.  We'd 
easily get the same result by sinking a statement down through a PHI in 
a manner similar to what's been suggested for 64700.

In correct.c::checkfile and correct.c::correct and correct.c::askmode 
the path splitting opportunities do not lead to any further 
simplifications at the gimple level.

Similarly for bmhisrch.c::bmhi_search.


So when I look across all these examples, the only one that's really a 
CSE/DCE opportunity exposed by path splitting that is performance 
important is adpcm_code.

The rest, AFAICT, benefit at a much lower level -- a diamond in the CFG 
will require an unconditional branch from the end of one arm around the 
other arm to the join point.  With path splitting that unconditional 
branch is eliminated.  So there's a small gain for that.  That gain may 
be even larger on the microblaze because of its exposed delay slot 
architecture -- one less slot to fill.  It may also result in better 
code layouts which help simplistic branch predictors.

So I find myself wondering if the primary effect we're looking for most 
of the time is really elimination of that unconditional branch.  And if 
it is the case, then we're looking at a very different costing heuristic 
-- one that favors very small join blocks rather than larger ones (that 
supposedly help expose CSE/DCE, but in reality aren't for the benchmarks 
I've looked at).

And if that's the case, then we may really be looking at something that 
belongs at the RTL level rather than at the tree/gimple level.  Sadly, 
it's harder to do things like duplicate blocks at the RTL level.

Anyway, I'm going to ponder some more.

jeff

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index dde2695..a7abe37 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,21 @@ 
+2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
+	    Jeff Law  <law@redhat.com>
+
+	* Makefile.in (OBJS): Add gimple-ssa-split-paths.o
+	* common.opt (-fsplit-paths): New flag controlling path splitting.
+	* doc/invoke.texi (fsplit-paths): Document.
+	* opts.c (default_options_table): Add -fsplit-paths to -O2.
+	* passes.def: Add split_paths pass.
+	* timevar.def (TV_SPLIT_PATHS): New timevar.
+	* tracer.c: Include "tracer.h"
+	(ignore_bb_p): No longer static.
+	(transform_duplicate): New function, broken out of tail_duplicate.
+	(tail_duplicate): Use transform_duplicate.
+	* tracer.h (ignore_bb_p): Declare
+	(transform_duplicate): Likewise.
+	* tree-pass.h (make_pass_split_paths): Declare.
+	* gimple-ssa-split-paths.c: New file.
+
 2015-11-13  Kai Tietz  <ktietz70@googlemail.com>
 	    Marek Polacek  <polacek@redhat.com>
 	    Jason Merrill  <jason@redhat.com>
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index d3fd5e9..5c294df 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1277,6 +1277,7 @@  OBJS = \
 	gimple-pretty-print.o \
 	gimple-ssa-backprop.o \
 	gimple-ssa-isolate-paths.o \
+	gimple-ssa-split-paths.o \
 	gimple-ssa-strength-reduction.o \
 	gimple-streamer-in.o \
 	gimple-streamer-out.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 757ce85..3eb520e 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2403,6 +2403,10 @@  ftree-vrp
 Common Report Var(flag_tree_vrp) Init(0) Optimization
 Perform Value Range Propagation on trees.
 
+fsplit-paths
+Common Report Var(flag_split_paths) Init(0) Optimization
+Split paths leading to loop backedges.
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c18df98..eeb79e6 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -354,6 +354,7 @@  Objective-C and Objective-C++ Dialects}.
 -fdump-tree-fre@r{[}-@var{n}@r{]} @gol
 -fdump-tree-vtable-verify @gol
 -fdump-tree-vrp@r{[}-@var{n}@r{]} @gol
+-fdump-tree-split-paths@r{[}-@var{n}@r{]} @gol
 -fdump-tree-storeccp@r{[}-@var{n}@r{]} @gol
 -fdump-final-insns=@var{file} @gol
 -fcompare-debug@r{[}=@var{opts}@r{]}  -fcompare-debug-second @gol
@@ -448,6 +449,7 @@  Objective-C and Objective-C++ Dialects}.
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
 -fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
+-fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
 -fstack-protector -fstack-protector-all -fstack-protector-strong @gol
 -fstack-protector-explicit -fstdarg-opt -fstrict-aliasing @gol
@@ -7171,6 +7173,11 @@  output on to @file{stderr}. If two conflicting dump filenames are
 given for the same pass, then the latter option overrides the earlier
 one.
 
+@item split-paths
+@opindex fdump-tree-split-paths
+Dump each function after splitting paths to loop backedges.  The file
+name is made by appending @file{.split-paths} to the source file name.
+
 @item all
 Turn on all options, except @option{raw}, @option{slim}, @option{verbose}
 and @option{lineno}.
@@ -7808,6 +7815,7 @@  also turns on the following optimization flags:
 -frerun-cse-after-loop  @gol
 -fsched-interblock  -fsched-spec @gol
 -fschedule-insns  -fschedule-insns2 @gol
+-fsplit-paths @gol
 -fstrict-aliasing -fstrict-overflow @gol
 -ftree-builtin-call-dce @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
@@ -8821,7 +8829,7 @@  currently enabled, but may be enabled by @option{-O2} in the future.
 
 @item -ftree-sink
 @opindex ftree-sink
-Perform forward store motion  on trees.  This flag is
+Perform forward store motion on trees.  This flag is
 enabled by default at @option{-O} and higher.
 
 @item -ftree-bit-ccp
@@ -9127,6 +9135,12 @@  enabled by default at @option{-O2} and higher.  Null pointer check
 elimination is only done if @option{-fdelete-null-pointer-checks} is
 enabled.
 
+@item -fsplit-paths
+@opindex fsplit-paths
+Split paths leading to loop backedges.  This can improve dead code
+elimination and common subexpression elimination.  This is enabled by
+default at @option{-O2} and above.
+
 @item -fsplit-ivs-in-unroller
 @opindex fsplit-ivs-in-unroller
 Enables expression of values of induction variables in later iterations
diff --git a/gcc/gimple-ssa-split-paths.c b/gcc/gimple-ssa-split-paths.c
new file mode 100644
index 0000000..602e916
--- /dev/null
+++ b/gcc/gimple-ssa-split-paths.c
@@ -0,0 +1,270 @@ 
+/* Support routines for Splitting Paths to loop backedges
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal <ajitkum@xilinx.com>.
+
+ This file is part of GCC.
+
+ GCC is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "cfganal.h"
+#include "cfgloop.h"
+#include "gimple-iterator.h"
+#include "tracer.h"
+
+/* Given LATCH, the latch block in a loop, see if the shape of the
+   path reaching LATCH is suitable for being split by duplication.
+   If so, return the block that will be duplicated into its predecessor
+   paths.  Else return NULL.  */
+
+static basic_block
+find_block_to_duplicate_for_splitting_paths (basic_block latch)
+{
+  /* We should have simple latches at this point.  So the latch should
+     have a single successor.  This implies the predecessor of the latch
+     likely has the loop exit.  And it's that predecessor we're most
+     interested in. To keep things simple, we're going to require that
+     the latch have a single predecessor too.  */
+  if (single_succ_p (latch) && single_pred_p (latch))
+    {
+      basic_block bb = get_immediate_dominator (CDI_DOMINATORS, latch);
+      gcc_assert (single_pred_edge (latch)->src == bb);
+
+      /* If BB has been marked as not to be duplicated, then honor that
+	 request.  */
+      if (ignore_bb_p (bb))
+	return NULL;
+
+      gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb));
+      /* The immediate dominator of the latch must end in a conditional.  */
+      if (!last || gimple_code (last) != GIMPLE_COND)
+	return NULL;
+
+      /* We're hoping that BB is a join point for an IF-THEN-ELSE diamond
+	 region.  Verify that it is.
+
+	 First, verify that BB has two predecessors (each arm of the
+	 IF-THEN-ELSE) and two successors (the latch and exit).  */
+      if (EDGE_COUNT (bb->preds) == 2 && EDGE_COUNT (bb->succs) == 2)
+	{
+	  /* Now verify that BB's immediate dominator ends in a
+	     conditional as well.  */
+	  basic_block bb_idom = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  gimple *last = gsi_stmt (gsi_last_nondebug_bb (bb_idom));
+	  if (!last || gimple_code (last) != GIMPLE_COND)
+	    return NULL;
+
+	  /* And that BB's immediate dominator's successors are the
+	     predecessors of BB.  */
+	  if (!find_edge (bb_idom, EDGE_PRED (bb, 0)->src)
+	      || !find_edge (bb_idom, EDGE_PRED (bb, 1)->src))
+	    return NULL;
+
+	  /* So at this point we have a simple diamond for an IF-THEN-ELSE
+	     construct starting at BB_IDOM, with a join point at BB.  BB
+	     passes control outside the loop or to the loop latch.
+
+	     We're going to want to create two duplicates of BB, one for
+	     each successor of BB_IDOM.  */
+	  return bb;
+	}
+    }
+  return NULL;
+}
+
+/* Return TRUE if BB is a reasonable block to duplicate by examining
+   its size, false otherwise.  BB will always be the single predecessor
+   of a loop latch block.
+
+   Should this use the same tests as we do for jump threading?  */
+
+static bool
+is_feasible_trace (basic_block bb)
+{
+  int num_stmt = 0;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (!is_gimple_debug (stmt))
+	num_stmt++;
+    }
+
+  /* We may want to limit how many statements we copy.  */
+  if (num_stmt > 1)
+    return true;
+
+  return false;
+}
+
+/* If the immediate dominator of the loop latch ends in a conditional
+   branch, then that block is the join point of an IF-THEN-ELSE
+   diamond and is duplicated into each of its predecessor paths,
+   preserving the SSA semantics.
+
+   CFG before transformation.
+
+              2
+              |
+              |
+        +---->3
+        |    / \
+        |   /   \
+        |  4     5
+        |   \   /
+        |    \ /
+        |     6
+        |    / \
+        |   /   \
+        |  8     7
+        |  |     |
+        ---+     E
+
+
+
+    Block 8 is the latch.  We're going to make copies of block 6 (9 & 10)
+    and wire things up so they look like this:
+
+              2
+              |
+              |
+        +---->3
+        |    / \
+        |   /   \
+        |  4     5
+        |  |     |
+        |  |     |
+        |  9    10
+        |  |\   /|
+        |  | \ / |
+        |  |  7  |
+        |  |  |  |
+        |  |  E  |
+        |  |     |
+        |   \   /
+        |    \ /
+        +-----8
+
+
+    Blocks 9 and 10 will get merged into blocks 4 and 5 respectively,
+    which enables CSE, DCE and other optimizations to occur on a
+    larger block of code.  */
+
+static bool
+split_paths ()
+{
+  bool changed = false;
+  loop_p loop;
+
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+  initialize_original_copy_tables ();
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      /* See if there is a block that we can duplicate to split the
+	 path to the loop latch.  */
+      basic_block bb
+	= find_block_to_duplicate_for_splitting_paths (loop->latch);
+
+      /* BB is the merge point for an IF-THEN-ELSE we want to transform.
+
+	 Essentially we want to create two duplicates of BB and append
+	 a duplicate to the THEN and ELSE clauses.  This will split the
+	 path leading to the latch.  BB will be unreachable and removed.  */
+      if (bb && is_feasible_trace (bb))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Duplicating join block %d into predecessor paths\n",
+		     bb->index);
+	  basic_block pred0 = EDGE_PRED (bb, 0)->src;
+	  basic_block pred1 = EDGE_PRED (bb, 1)->src;
+	  transform_duplicate (pred0, bb);
+	  transform_duplicate (pred1, bb);
+	  changed = true;
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_original_copy_tables ();
+  return changed;
+}
+
+/* Main entry point for splitting paths.  Returns TODO_cleanup_cfg if any
+   paths were split, otherwise zero.  */
+
+static unsigned int
+execute_split_paths ()
+{
+  /* If we don't have at least two real blocks and at least one
+     backedge in the CFG, then there's no point in trying to perform
+     path splitting.  */
+  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+      || !mark_dfs_back_edges ())
+    return 0;
+
+  bool changed = split_paths ();
+  if (changed)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return changed ? TODO_cleanup_cfg : 0;
+}
+
+static bool
+gate_split_paths ()
+{
+  return flag_split_paths;
+}
+
+namespace {
+
+const pass_data pass_data_split_paths =
+{
+  GIMPLE_PASS, /* type */
+  "split-paths", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_SPLIT_PATHS, /* tv_id */
+  PROP_ssa, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_split_paths : public gimple_opt_pass
+{
+public:
+  pass_split_paths (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_split_paths, ctxt)
+  {}
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_split_paths (m_ctxt); }
+  virtual bool gate (function *) { return gate_split_paths (); }
+  virtual unsigned int execute (function *) { return execute_split_paths (); }
+
+}; // class pass_split_paths
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_split_paths (gcc::context *ctxt)
+{
+  return new pass_split_paths (ctxt);
+}
diff --git a/gcc/opts.c b/gcc/opts.c
index 930ae43..be04cf5 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -523,6 +523,7 @@  static const struct default_options default_options_table[] =
     { OPT_LEVELS_2_PLUS, OPT_fisolate_erroneous_paths_dereference, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_fsplit_paths, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..db822d3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -274,6 +274,7 @@  along with GCC; see the file COPYING3.  If not see
       POP_INSERT_PASSES ()
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
+      NEXT_PASS (pass_split_paths);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_reassoc);
       NEXT_PASS (pass_strength_reduction);
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 3301130..ee92aaf 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@ 
+2015-11-13  Ajit Agarwal  <ajitkum@xilinx.com>
+            Jeff Law  <law@redhat.com>
+
+	* gcc.dg/tree-ssa/split-path-1.c: New test.
+
 2015-11-13  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* c-c++-common/goacc/loop-auto-1.c: New.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
new file mode 100644
index 0000000..1239892
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/split-path-1.c
@@ -0,0 +1,67 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-split-paths-details" } */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+#define RGBMAX 255
+
+int
+test()
+{
+  int i, Pels;
+  unsigned char sum = 0;
+  unsigned char xr, xg, xb;
+  unsigned char xc, xm, xy, xk;
+  unsigned char *ReadPtr, *EritePtr;
+
+  ReadPtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+  EritePtr = ( unsigned char *) malloc (sizeof (unsigned char) * 100);
+
+  for (i = 0; i < 100; i++)
+     {
+       ReadPtr[i] = 100 - i;
+     }
+
+  for (i = 0; i < 100; i++)
+     {
+       xr = *ReadPtr++;
+       xg = *ReadPtr++;
+       xb = *ReadPtr++;
+
+       xc = (unsigned char) (RGBMAX - xr);
+       xm = (unsigned char) (RGBMAX - xg);
+       xy = (unsigned char) (RGBMAX - xb);
+
+       if (xc < xm)
+         {
+           xk = (unsigned char) (xc < xy ? xc : xy);
+         }
+       else
+        {
+          xk = (unsigned char) (xm < xy ? xm : xy);
+        }
+
+       xc = (unsigned char) (xc - xk);
+       xm = (unsigned char) (xm - xk);
+       xy = (unsigned char) (xy - xk);
+
+       *EritePtr++ = xc;
+       *EritePtr++ = xm;
+       *EritePtr++ = xy;
+       *EritePtr++ = xk;
+       sum += *EritePtr;
+    }
+  return sum;
+}
+
+int
+main()
+{
+  if (test() != 33)
+    abort();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "Duplicating join block" "split-paths" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index b429faf..45e3b70 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -252,6 +252,7 @@  DEFTIMEVAR (TV_GCSE_AFTER_RELOAD     , "load CSE after reload")
 DEFTIMEVAR (TV_REE		     , "ree")
 DEFTIMEVAR (TV_THREAD_PROLOGUE_AND_EPILOGUE, "thread pro- & epilogue")
 DEFTIMEVAR (TV_IFCVT2		     , "if-conversion 2")
+DEFTIMEVAR (TV_SPLIT_PATHS	     , "split paths")
 DEFTIMEVAR (TV_COMBINE_STACK_ADJUST  , "combine stack adjustments")
 DEFTIMEVAR (TV_PEEPHOLE2             , "peephole 2")
 DEFTIMEVAR (TV_RENAME_REGISTERS      , "rename registers")
diff --git a/gcc/tracer.c b/gcc/tracer.c
index 941dc20..c2dba4c 100644
--- a/gcc/tracer.c
+++ b/gcc/tracer.c
@@ -51,9 +51,9 @@ 
 #include "tree-inline.h"
 #include "cfgloop.h"
 #include "fibonacci_heap.h"
+#include "tracer.h"
 
 static int count_insns (basic_block);
-static bool ignore_bb_p (const_basic_block);
 static bool better_p (const_edge, const_edge);
 static edge find_best_successor (basic_block);
 static edge find_best_predecessor (basic_block);
@@ -85,7 +85,7 @@  bb_seen_p (basic_block bb)
 }
 
 /* Return true if we should ignore the basic block for purposes of tracing.  */
-static bool
+bool
 ignore_bb_p (const_basic_block bb)
 {
   if (bb->index < NUM_FIXED_BLOCKS)
@@ -226,6 +226,24 @@  find_trace (basic_block bb, basic_block *trace)
   return i;
 }
 
+/* Duplicate block BB2, placing it after BB in the CFG.  Return the
+   newly created block.  */
+basic_block
+transform_duplicate (basic_block bb, basic_block bb2)
+{
+  edge e;
+  basic_block copy;
+
+  e = find_edge (bb, bb2);
+
+  copy = duplicate_block (bb2, e, bb);
+  flush_pending_stmts (e);
+
+  add_phi_args_after_copy (&copy, 1, NULL);
+
+  return (copy);
+}
+
 /* Look for basic blocks in frequency order, construct traces and tail duplicate
    if profitable.  */
 
@@ -321,17 +339,8 @@  tail_duplicate (void)
 		 entries or at least rotate the loop.  */
 	      && bb2->loop_father->header != bb2)
 	    {
-	      edge e;
-	      basic_block copy;
-
 	      nduplicated += counts [bb2->index];
-
-	      e = find_edge (bb, bb2);
-
-	      copy = duplicate_block (bb2, e, bb);
-	      flush_pending_stmts (e);
-
-	      add_phi_args_after_copy (&copy, 1, NULL);
+	      basic_block copy = transform_duplicate (bb, bb2);
 
 	      /* Reconsider the original copy of block we've duplicated.
 	         Removing the most common predecessor may make it to be
diff --git a/gcc/tracer.h b/gcc/tracer.h
new file mode 100644
index 0000000..cd1792a
--- /dev/null
+++ b/gcc/tracer.h
@@ -0,0 +1,26 @@ 
+/* Header file for Tracer.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_TRACER_H
+#define GCC_TRACER_H
+
+extern basic_block transform_duplicate (basic_block bb, basic_block bb2);
+extern bool ignore_bb_p (const_basic_block bb);
+
+#endif /* GCC_TRACER_H */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..da67761 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -390,6 +390,7 @@  extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_split_paths (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);