diff mbox

[RFC] Adding unroller and DCE to early passes

Message ID 20150424045410.GA20804@kam.mff.cuni.cz
State New
Headers show

Commit Message

Jan Hubicka April 24, 2015, 4:54 a.m. UTC
Hi,
I was looking into reordering optimization queue of early passes. This is
motivated by PR57249 and fact that I run into some super sily loops while
looking into firefox dumps.  It  indeed makes a lot of sense for me as for code
dealing with short arrays this enables more SRA/FRE/DSE.

I added cunrolle pass that differ from cunrolli by not allowing code size
growth even at -O3 (because we do not know what loops are hot yet).
We currently unroll tiny loop with 2 calls that I think needs to be tammed
down, I can do that if the patch seems to make sense.

I tried several options and ended up adding cunrolle before FRE and reordering
FRE and SRA: SRA needs constant propagation to happen after unrolling to work
and I think value numbering does work pretty well on non-SRAed datastructures.
I also added DCE just before unrolling. This increases number of unrolls by
about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
done to make unroller metrics go resonably well)

On tramp3d there is not great code quality improvement (which is expected), but
we get stronger early opts. In particular the number of basic blocks at
release_ssa time is 8% down, the inline size estimate at IPA time about 2%.

We do 124 unrollings early, 136 at cunrolli time and 132 in cunroll
compared to 228 at cunrolli and 130 at cunroll without the patch.

New early DCE pass does 8005 deletions, early cddce does 9307, first late dce
does 2888.
Without patch early cddce does 9510 (so the patch basically doubles statement count
we get rid of), first late dce does 8587 (almost 3 times as much).

This seems like a significant decrease of garbage pushed through IPA pipeline
(which in turn confuses inline metrics).


On DealII we early unroll 477 loops (out of 11k), 421 at cunrolli and 122 at cunroll
without patch we unroll loops 1428 in cunrolli and 127 loop in cunroll

early dce1 removes 24133 and cddce 15599, late dce does 7698 statements.
without patch we cddce1 33007 statements and first late dce does 8717 statements.
20% increase of # of statements we get rid of early and 13% decrease in late DCE.

Number of basic blocks at release_ssa time drops from 270859 to 260485, by 21%
number of statements by 4%.

If this looks resonable, I would suggest doing one change at a time, that
is first adding extra dce pass, then reordering SRA and finally adding the
cunrolle pass (after implementing the logic to not unroll calls)

Honza

Comments

Richard Biener April 24, 2015, 8:30 a.m. UTC | #1
On Fri, 24 Apr 2015, Jan Hubicka wrote:

> Hi,
> I was looking into reordering optimization queue of early passes. This is
> motivated by PR57249 and fact that I run into some super sily loops while
> looking into firefox dumps.  It  indeed makes a lot of sense for me as for code
> dealing with short arrays this enables more SRA/FRE/DSE.
> 
> I added cunrolle pass that differ from cunrolli by not allowing code size
> growth even at -O3 (because we do not know what loops are hot yet).
> We currently unroll tiny loop with 2 calls that I think needs to be tammed
> down, I can do that if the patch seems to make sense.

Please.

> I tried several options and ended up adding cunrolle before FRE and reordering
> FRE and SRA: SRA needs constant propagation to happen after unrolling to work
> and I think value numbering does work pretty well on non-SRAed datastructures.
> I also added DCE just before unrolling. This increases number of unrolls by
> about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
> done to make unroller metrics go resonably well)

I've re-ordered SRA and ealias for GCC5 because ealias benefits from SRA
while SRA doesn't need PTA.  You undo that improvement.

We should improve the unroller instead of requiring DCE before it.

> On tramp3d there is not great code quality improvement (which is expected), but
> we get stronger early opts. In particular the number of basic blocks at
> release_ssa time is 8% down, the inline size estimate at IPA time about 2%.
> 
> We do 124 unrollings early, 136 at cunrolli time and 132 in cunroll
> compared to 228 at cunrolli and 130 at cunroll without the patch.
>
> New early DCE pass does 8005 deletions, early cddce does 9307, first late dce
> does 2888.
> Without patch early cddce does 9510 (so the patch basically doubles statement count
> we get rid of), first late dce does 8587 (almost 3 times as much ).
> 
> This seems like a significant decrease of garbage pushed through IPA pipeline
> (which in turn confuses inline metrics).
> 
> 
> On DealII we early unroll 477 loops (out of 11k), 421 at cunrolli and 122 at cunroll
> without patch we unroll loops 1428 in cunrolli and 127 loop in cunroll
> 
> early dce1 removes 24133 and cddce 15599, late dce does 7698 statements.
> without patch we cddce1 33007 statements and first late dce does 8717 statements.
> 20% increase of # of statements we get rid of early and 13% decrease in late DCE.
> 
> Number of basic blocks at release_ssa time drops from 270859 to 260485, by 21%
> number of statements by 4%.
> 
> If this looks resonable, I would suggest doing one change at a time, that
> is first adding extra dce pass, then reordering SRA and finally adding the
> cunrolle pass (after implementing the logic to not unroll calls)

I think that adding more passes to the early pipeline is bad.  Sure, it
will help weird C++ code-bases.  But it will slow down the compile for
all the rest.

As we want early unrolling to get the secondary effects by performing
better FRE/DCE/DSE I think trying to do a better job in value-numbering
for the kind of loops we are talking about would be better.  For tramp3d
we are talking about cases like

 for (i=0; i<3; ++i)
  dim[i] = i;

or similar code which ends up storing to a single array.  There is
the vn_reference_lookup_3 function in tree-ssa-sccvn.c which is
the canonical place for value-numbering "tricks".  It "simply"
needs to be taught how to "look through loops".

I'd like to net get into the game of huge pass order dependences in
early-opts - that just shows our scalar cleanups are too weak.
Ideally we'd just have a single pass in early-opts...

Richard.
 
> Honza
> 
> Index: tree-ssa-loop-ivcanon.c
> ===================================================================
> --- tree-ssa-loop-ivcanon.c	(revision 222391)
> +++ tree-ssa-loop-ivcanon.c	(working copy)
> @@ -1571,4 +1571,59 @@ make_pass_complete_unrolli (gcc::context
>    return new pass_complete_unrolli (ctxt);
>  }
>  
> +/* Early complete unrolling pass; do only those internal loops where code
> +   size gets reduced.  */
>  
> +namespace {
> +
> +const pass_data pass_data_complete_unrolle =
> +{
> +  GIMPLE_PASS, /* type */
> +  "cunrolle", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_COMPLETE_UNROLL, /* tv_id */
> +  ( PROP_cfg | PROP_ssa ), /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +class pass_complete_unrolle : public gimple_opt_pass
> +{
> +public:
> +  pass_complete_unrolle (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_complete_unrolle, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *) { return optimize >= 2; }
> +  virtual unsigned int execute (function *);
> +
> +}; // class pass_complete_unrolle
> +
> +unsigned int
> +pass_complete_unrolle::execute (function *fun)
> +{
> +  unsigned ret = 0;
> +
> +  loop_optimizer_init (LOOPS_NORMAL
> +		       | LOOPS_HAVE_RECORDED_EXITS);
> +  if (number_of_loops (fun) > 1)
> +    {
> +      scev_initialize ();
> +      ret = tree_unroll_loops_completely (false, false);
> +      free_numbers_of_iterations_estimates ();
> +      scev_finalize ();
> +    }
> +  loop_optimizer_finalize ();
> +
> +  return ret;
> +}
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_complete_unrolle (gcc::context *ctxt)
> +{
> +  return new pass_complete_unrolle (ctxt);
> +}
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h	(revision 222391)
> +++ tree-pass.h	(working copy)
> @@ -375,6 +375,7 @@ extern gimple_opt_pass *make_pass_vector
>  extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_complete_unrolle (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
> Index: passes.def
> ===================================================================
> --- passes.def	(revision 222391)
> +++ passes.def	(working copy)
> @@ -83,11 +83,17 @@ along with GCC; see the file COPYING3.
>  	  /* After CCP we rewrite no longer addressed locals into SSA
>  	     form if possible.  */
>  	  NEXT_PASS (pass_forwprop);
> -	  NEXT_PASS (pass_sra_early);
> +	  /* DCE enables considerably more complete unrolling.  */
> +          NEXT_PASS (pass_dce);
> +	  /* Unroll small loops to enable SRA of tiny arrays tripped by them.
> +	     It is important to propagate constants in FRE before sra_early
> +	     happens. */
> +          NEXT_PASS (pass_complete_unrolle);
>  	  /* pass_build_ealias is a dummy pass that ensures that we
>  	     execute TODO_rebuild_alias at this point.  */
>  	  NEXT_PASS (pass_build_ealias);
>  	  NEXT_PASS (pass_fre);
> +	  NEXT_PASS (pass_sra_early);
>  	  NEXT_PASS (pass_merge_phi);
>            NEXT_PASS (pass_dse);
>  	  NEXT_PASS (pass_cd_dce);
> 
>
Jan Hubicka April 24, 2015, 11:35 a.m. UTC | #2
> > I added cunrolle pass that differ from cunrolli by not allowing code size
> > growth even at -O3 (because we do not know what loops are hot yet).
> > We currently unroll tiny loop with 2 calls that I think needs to be tammed
> > down, I can do that if the patch seems to make sense.
> 
> Please.

OK, will do that.
> 
> > I tried several options and ended up adding cunrolle before FRE and reordering
> > FRE and SRA: SRA needs constant propagation to happen after unrolling to work
> > and I think value numbering does work pretty well on non-SRAed datastructures.
> > I also added DCE just before unrolling. This increases number of unrolls by
> > about 60% on both tramp3d and eon. (basically we want to have DCE and cprop
> > done to make unroller metrics go resonably well)
> 
> I've re-ordered SRA and ealias for GCC5 because ealias benefits from SRA
> while SRA doesn't need PTA.  You undo that improvement.

OK, I see. I assumed that the PTA solutions are updated when SRA introduce new scalars.
We could simply do limited CCP when unrolling happened lifting the need for FRE 
in between cunrolle and SRA.  I dimly remember we even have code for that?
> 
> We should improve the unroller instead of requiring DCE before it.

If I get loop with dead code in it, because of einline or gimple production or whatever,
what unroller should do short of  doing its own DCE pass on the whole function body (well the mark, ignoring sweep)?

Honza
> 
> I think that adding more passes to the early pipeline is bad.  Sure, it
> will help weird C++ code-bases.  But it will slow down the compile for
> all the rest.

I am not 100% convinced about this especially with LTO, where we need to pickle
all garbage we did not eliminated.
> 
> As we want early unrolling to get the secondary effects by performing
> better FRE/DCE/DSE I think trying to do a better job in value-numbering
> for the kind of loops we are talking about would be better.  For tramp3d
> we are talking about cases like
> 
>  for (i=0; i<3; ++i)
>   dim[i] = i;
> 
> or similar code which ends up storing to a single array.  There is
> the vn_reference_lookup_3 function in tree-ssa-sccvn.c which is
> the canonical place for value-numbering "tricks".  It "simply"
> needs to be taught how to "look through loops".
> 
> I'd like to net get into the game of huge pass order dependences in
> early-opts - that just shows our scalar cleanups are too weak.
> Ideally we'd just have a single pass in early-opts...

Other motivation for getting rid of loops is to make control&data flow more
explicit for IPA analysers.
If you have code o nvectors of size 3 that does series of operations, it is very
hard to reorder/fuse/merge loops as needed for the scalar optimizations...

Honza
diff mbox

Patch

Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 222391)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -1571,4 +1571,59 @@  make_pass_complete_unrolli (gcc::context
   return new pass_complete_unrolli (ctxt);
 }
 
+/* Early complete unrolling pass; do only those internal loops where code
+   size gets reduced.  */
 
+namespace {
+
+const pass_data pass_data_complete_unrolle =
+{
+  GIMPLE_PASS, /* type */
+  "cunrolle", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_COMPLETE_UNROLL, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+class pass_complete_unrolle : public gimple_opt_pass
+{
+public:
+  pass_complete_unrolle (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_complete_unrolle, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return optimize >= 2; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_complete_unrolle
+
+unsigned int
+pass_complete_unrolle::execute (function *fun)
+{
+  unsigned ret = 0;
+
+  loop_optimizer_init (LOOPS_NORMAL
+		       | LOOPS_HAVE_RECORDED_EXITS);
+  if (number_of_loops (fun) > 1)
+    {
+      scev_initialize ();
+      ret = tree_unroll_loops_completely (false, false);
+      free_numbers_of_iterations_estimates ();
+      scev_finalize ();
+    }
+  loop_optimizer_finalize ();
+
+  return ret;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_complete_unrolle (gcc::context *ctxt)
+{
+  return new pass_complete_unrolle (ctxt);
+}
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 222391)
+++ tree-pass.h	(working copy)
@@ -375,6 +375,7 @@  extern gimple_opt_pass *make_pass_vector
 extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_complete_unrolle (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
Index: passes.def
===================================================================
--- passes.def	(revision 222391)
+++ passes.def	(working copy)
@@ -83,11 +83,17 @@  along with GCC; see the file COPYING3.
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
-	  NEXT_PASS (pass_sra_early);
+	  /* DCE enables considerably more complete unrolling.  */
+          NEXT_PASS (pass_dce);
+	  /* Unroll small loops to enable SRA of tiny arrays tripped by them.
+	     It is important to propagate constants in FRE before sra_early
+	     happens. */
+          NEXT_PASS (pass_complete_unrolle);
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
 	  NEXT_PASS (pass_fre);
+	  NEXT_PASS (pass_sra_early);
 	  NEXT_PASS (pass_merge_phi);
           NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);