Patchwork [RFC] Make vectorizer to skip loops with small iteration estimate

login
register
mail settings
Submitter Jan Hubicka
Date Oct. 4, 2012, 2:39 p.m.
Message ID <20121004143910.GB7301@atrey.karlin.mff.cuni.cz>
Download mbox | patch
Permalink /patch/189168/
State New
Headers show

Comments

Jan Hubicka - Oct. 4, 2012, 2:39 p.m.
> > So SOC cancels out in the runtime check.
> > I still think we need two formulas - one determining if vectorization is
> > profitable, other specifying the threshold for scalar path at runtime (that
> > will generally give lower values).
> 
> True, we want two values.  But part of the scalar path right now
> is all the computation required for alias and alignment runtime checks
> (because the way all the conditions are combined).
> 
> I'm not much into the details of what we account for in SOC (I suppose
> it's everything we insert in the preheader of the vector loop).

Yes, it seems contain everything we insert prior the loop in unfolded form.
> 
> +      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +        fprintf (vect_dump, "not vectorized: estimated iteration count 
> too small.");
> +      if (vect_print_dump_info (REPORT_DETAILS))
> +        fprintf (vect_dump, "not vectorized: estimated iteration count 
> smaller than "
> +                 "user specified loop bound parameter or minimum "
> +                 "profitable iterations (whichever is more 
> conservative).");
> 
> this won't work anymore btw - dumping infrastructure changed.

Ah, will update that.
> 
> I suppose your patch is a step in the right direction, but to really
> make progress we need to re-organize the loop and predicate structure
> produced by the vectorizer.

This reminds me what I did for string functions on x86. It gets very hard
to get all the paths right when one starts to be really cureful to not
output too much cruft on the short paths + do not consume too many registers.

In fact I want to re-think this for the SSE string ops patch, so I may try to
look into that incrementally.
> 
> So, please update your patch, re-test and then it's ok.

Thanks.
> > I tested enabling loop_ch in early passes with -fprofile-feedback and it is SPEC
> > neutral.  Given that it improves loop count estimates, I would still like mainline
> > doing that.  I do not like these quite important estimates to be wrong most of time.
> 
> I agree.  It also helps getting rid of once rolling loops I think.

I am attaching the patch for early-ch.  Will commit it tomorrow.

Concerning jump threading, it would help to make some of it during early passes
so the profile estiamte do not get invalided.  I tried to move VRP early but now it
makes compiler to hang during bootstrap.  I will debug that.
> 
> > > 
> > > Btw, I added a "similar" check in vect_analyze_loop_operations:
> > > 
> > >   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >        && (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
> > >       || ((max_niter = max_stmt_executions_int (loop)) != -1
> > >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> > >     {
> > >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >                          "not vectorized: iteration count too small.");
> > >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >                          "not vectorized: iteration count smaller than "
> > >                          "vectorization factor.");
> > >       return false;
> > >     }
> > > 
> > > maybe you simply need to update that to also consider the profile?
> > 
> > Hmm, I am still getting familiar wth the code. Later we later have
> >   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >       && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
> >     {
> >       if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> >         fprintf (vect_dump, "not vectorized: vectorization not "
> >                  "profitable.");
> >       if (vect_print_dump_info (REPORT_DETAILS))
> >         fprintf (vect_dump, "not vectorized: iteration count smaller than "
> >                  "user specified loop bound parameter or minimum "
> >                  "profitable iterations (whichever is more conservative).");
> >       return false;
> >     }
> > 
> > where th is always greater or equal than vectorization_factor from the cost model.
> > So this test seems redundant if the max_stmt_executions_int was pushed down
> > to the second conditoinal?
> 
> Yes, sort of.  The new check was supposed to be crystal clear, and
> even with the cost model disabled we want to not vectorize in this
> case.  But yes, the whole cost-model stuff needs TLC.

Ah yes, without cost model we would skip it.  I suppose we do not need to
brother  witht he profile estiamte in the case anyway. They are kind of aprt of
the cost models.

	* passes.c (init_optimization_passes): Schedule early CH.
	* tree-pass.h (pass_early_ch): Declare it.
	* tree-ssa-loop-ch.c (gate_early_ch): New function.
	(pass_early_ch): New pass.
Richard Guenther - Oct. 4, 2012, 4:16 p.m.
On Thu, 4 Oct 2012, Jan Hubicka wrote:

> > > So SOC cancels out in the runtime check.
> > > I still think we need two formulas - one determining if vectorization is
> > > profitable, other specifying the threshold for scalar path at runtime (that
> > > will generally give lower values).
> > 
> > True, we want two values.  But part of the scalar path right now
> > is all the computation required for alias and alignment runtime checks
> > (because the way all the conditions are combined).
> > 
> > I'm not much into the details of what we account for in SOC (I suppose
> > it's everything we insert in the preheader of the vector loop).
> 
> Yes, it seems contain everything we insert prior the loop in unfolded form.
> > 
> > +      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> > +        fprintf (vect_dump, "not vectorized: estimated iteration count 
> > too small.");
> > +      if (vect_print_dump_info (REPORT_DETAILS))
> > +        fprintf (vect_dump, "not vectorized: estimated iteration count 
> > smaller than "
> > +                 "user specified loop bound parameter or minimum "
> > +                 "profitable iterations (whichever is more 
> > conservative).");
> > 
> > this won't work anymore btw - dumping infrastructure changed.
> 
> Ah, will update that.
> > 
> > I suppose your patch is a step in the right direction, but to really
> > make progress we need to re-organize the loop and predicate structure
> > produced by the vectorizer.
> 
> This reminds me what I did for string functions on x86. It gets very hard
> to get all the paths right when one starts to be really cureful to not
> output too much cruft on the short paths + do not consume too many registers.
> 
> In fact I want to re-think this for the SSE string ops patch, so I may try to
> look into that incrementally.
> > 
> > So, please update your patch, re-test and then it's ok.
> 
> Thanks.
> > > I tested enabling loop_ch in early passes with -fprofile-feedback and it is SPEC
> > > neutral.  Given that it improves loop count estimates, I would still like mainline
> > > doing that.  I do not like these quite important estimates to be wrong most of time.
> > 
> > I agree.  It also helps getting rid of once rolling loops I think.
> 
> I am attaching the patch for early-ch.  Will commit it tomorrow.
> 
> Concerning jump threading, it would help to make some of it during early passes
> so the profile estiamte do not get invalided.  I tried to move VRP early but now it
> makes compiler to hang during bootstrap.  I will debug that.
> > 
> > > > 
> > > > Btw, I added a "similar" check in vect_analyze_loop_operations:
> > > > 
> > > >   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >        && (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
> > > >       || ((max_niter = max_stmt_executions_int (loop)) != -1
> > > >           && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> > > >     {
> > > >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > >                          "not vectorized: iteration count too small.");
> > > >       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > >                          "not vectorized: iteration count smaller than "
> > > >                          "vectorization factor.");
> > > >       return false;
> > > >     }
> > > > 
> > > > maybe you simply need to update that to also consider the profile?
> > > 
> > > Hmm, I am still getting familiar wth the code. Later we later have
> > >   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >       && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
> > >     {
> > >       if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> > >         fprintf (vect_dump, "not vectorized: vectorization not "
> > >                  "profitable.");
> > >       if (vect_print_dump_info (REPORT_DETAILS))
> > >         fprintf (vect_dump, "not vectorized: iteration count smaller than "
> > >                  "user specified loop bound parameter or minimum "
> > >                  "profitable iterations (whichever is more conservative).");
> > >       return false;
> > >     }
> > > 
> > > where th is always greater or equal than vectorization_factor from the cost model.
> > > So this test seems redundant if the max_stmt_executions_int was pushed down
> > > to the second conditoinal?
> > 
> > Yes, sort of.  The new check was supposed to be crystal clear, and
> > even with the cost model disabled we want to not vectorize in this
> > case.  But yes, the whole cost-model stuff needs TLC.
> 
> Ah yes, without cost model we would skip it.  I suppose we do not need to
> brother  witht he profile estiamte in the case anyway. They are kind of aprt of
> the cost models.
> 
> 	* passes.c (init_optimization_passes): Schedule early CH.
> 	* tree-pass.h (pass_early_ch): Declare it.
> 	* tree-ssa-loop-ch.c (gate_early_ch): New function.
> 	(pass_early_ch): New pass.

Instead of an extra ch pass please move the existing one.  And
benchmark that, of course.

Richard.

> Index: passes.c
> ===================================================================
> --- passes.c	(revision 191852)
> +++ passes.c	(working copy)
> @@ -1335,6 +1336,7 @@ init_optimization_passes (void)
>            NEXT_PASS (pass_cleanup_eh);
>            NEXT_PASS (pass_profile);
>            NEXT_PASS (pass_local_pure_const);
> +          NEXT_PASS (pass_early_ch);
>  	  /* Split functions creates parts that are not run through
>  	     early optimizations again.  It is thus good idea to do this
>  	     late.  */
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h	(revision 191852)
> +++ tree-pass.h	(working copy)
> @@ -291,6 +291,7 @@ extern struct gimple_opt_pass pass_loop_
>  extern struct gimple_opt_pass pass_iv_optimize;
>  extern struct gimple_opt_pass pass_tree_loop_done;
>  extern struct gimple_opt_pass pass_ch;
> +extern struct gimple_opt_pass pass_early_ch;
>  extern struct gimple_opt_pass pass_ccp;
>  extern struct gimple_opt_pass pass_phi_only_cprop;
>  extern struct gimple_opt_pass pass_build_ssa;
> Index: tree-ssa-loop-ch.c
> ===================================================================
> --- tree-ssa-loop-ch.c	(revision 191852)
> +++ tree-ssa-loop-ch.c	(working copy)
> @@ -275,3 +275,34 @@ struct gimple_opt_pass pass_ch =
>      | TODO_verify_flow			/* todo_flags_finish */
>   }
>  };
> +
> +/* We duplicate loop headers early to get better profile feedback:
> +   the first iteration of loop is special, because the duplicated loop header
> +   test will usually pass.  */
> +
> +static bool
> +gate_early_ch (void)
> +{
> +  return flag_tree_ch != 0 && (flag_branch_probabilities || profile_arc_flag);
> +}
> +
> +struct gimple_opt_pass pass_early_ch =
> +{
> + {
> +  GIMPLE_PASS,
> +  "early_ch",				/* name */
> +  gate_early_ch,			/* gate */
> +  copy_loop_headers,			/* execute */
> +  NULL,					/* sub */
> +  NULL,					/* next */
> +  0,					/* static_pass_number */
> +  TV_TREE_CH,				/* tv_id */
> +  PROP_cfg | PROP_ssa,			/* properties_required */
> +  0,					/* properties_provided */
> +  0,					/* properties_destroyed */
> +  0,					/* todo_flags_start */
> +  TODO_cleanup_cfg
> +    | TODO_verify_ssa
> +    | TODO_verify_flow			/* todo_flags_finish */
> + }
> +};
> 
>
Jan Hubicka - Oct. 5, 2012, 1:52 p.m.
Hi,
this is the udpated patch I comitted after testing.  I suppose we will need to find way
to make SOC smaller for simple loops - it is way too overestimated currently.

	* tree-vectorizer.h (vect_estimate_min_profitable_iters): Remove.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Declare here.
	(vect_analyze_loop_operations): Use loop count estimate to rule out
	unprofitable vectorization.
	(vect_estimate_min_profitable_iters): Return ret_min_profitable_estimate.
Index: tree-vectorizer.h
===================================================================
*** tree-vectorizer.h	(revision 192114)
--- tree-vectorizer.h	(working copy)
*************** extern bool vectorizable_live_operation 
*** 976,982 ****
  extern bool vectorizable_reduction (gimple, gimple_stmt_iterator *, gimple *,
                                      slp_tree);
  extern bool vectorizable_induction (gimple, gimple_stmt_iterator *, gimple *);
- extern int vect_estimate_min_profitable_iters (loop_vec_info);
  extern tree get_initial_def_for_reduction (gimple, tree, tree *);
  extern int vect_min_worthwhile_factor (enum tree_code);
  extern int vect_get_known_peeling_cost (loop_vec_info, int, int *, int,
--- 976,981 ----
Index: tree-vect-loop.c
===================================================================
*** tree-vect-loop.c	(revision 192114)
--- tree-vect-loop.c	(working copy)
*************** along with GCC; see the file COPYING3.  
*** 140,145 ****
--- 140,147 ----
     http://gcc.gnu.org/projects/tree-ssa/vectorization.html
  */
  
+ static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+ 
  /* Function vect_determine_vectorization_factor
  
     Determine the vectorization factor (VF).  VF is the number of data elements
*************** vect_analyze_loop_operations (loop_vec_i
*** 1287,1292 ****
--- 1289,1296 ----
    unsigned int th;
    bool only_slp_in_loop = true, ok;
    HOST_WIDE_INT max_niter;
+   HOST_WIDE_INT estimated_niter;
+   int min_profitable_estimate;
  
    if (dump_kind_p (MSG_NOTE))
      dump_printf_loc (MSG_NOTE, vect_location,
*************** vect_analyze_loop_operations (loop_vec_i
*** 1490,1496 ****
       vector stmts depends on VF.  */
    vect_update_slp_costs_according_to_vf (loop_vinfo);
  
!   min_profitable_iters = vect_estimate_min_profitable_iters (loop_vinfo);
    LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters;
  
    if (min_profitable_iters < 0)
--- 1494,1501 ----
       vector stmts depends on VF.  */
    vect_update_slp_costs_according_to_vf (loop_vinfo);
  
!   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
! 				      &min_profitable_estimate);
    LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters;
  
    if (min_profitable_iters < 0)
*************** vect_analyze_loop_operations (loop_vec_i
*** 1531,1536 ****
--- 1537,1559 ----
        return false;
      }
  
+   if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
+       && ((unsigned HOST_WIDE_INT) estimated_niter
+           <= MAX (th, (unsigned)min_profitable_estimate)))
+     {
+       if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
+ 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ 			 "not vectorized: estimated iteration count too "
+                          "small.");
+       if (dump_kind_p (MSG_NOTE))
+         dump_printf_loc (MSG_NOTE, vect_location,
+ 			 "not vectorized: estimated iteration count smaller "
+                          "than specified loop bound parameter or minimum "
+                          "profitable iterations (whichever is more "
+                          "conservative).");
+       return false;
+     }
+ 
    if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
        || LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0
        || LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
*************** vect_get_known_peeling_cost (loop_vec_in
*** 2603,2617 ****
  
     Return the number of iterations required for the vector version of the
     loop to be profitable relative to the cost of the scalar version of the
!    loop.
  
!    TODO: Take profile info into account before making vectorization
!    decisions, if available.  */
! 
! int
! vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo)
  {
    int min_profitable_iters;
    int peel_iters_prologue;
    int peel_iters_epilogue;
    unsigned vec_inside_cost = 0;
--- 2626,2640 ----
  
     Return the number of iterations required for the vector version of the
     loop to be profitable relative to the cost of the scalar version of the
!    loop.  */
  
! static void
! vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
! 				    int *ret_min_profitable_niters,
! 				    int *ret_min_profitable_estimate)
  {
    int min_profitable_iters;
+   int min_profitable_estimate;
    int peel_iters_prologue;
    int peel_iters_epilogue;
    unsigned vec_inside_cost = 0;
*************** vect_estimate_min_profitable_iters (loop
*** 2628,2634 ****
    if (!flag_vect_cost_model)
      {
        dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.");
!       return 0;
      }
  
    /* Requires loop versioning tests to handle misalignment.  */
--- 2651,2659 ----
    if (!flag_vect_cost_model)
      {
        dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.");
!       *ret_min_profitable_niters = 0;
!       *ret_min_profitable_estimate = 0;
!       return;
      }
  
    /* Requires loop versioning tests to handle misalignment.  */
*************** vect_estimate_min_profitable_iters (loop
*** 2863,2869 ****
  			 "divided by the scalar iteration cost = %d "
  			 "is greater or equal to the vectorization factor = %d.",
  			 vec_inside_cost, scalar_single_iter_cost, vf);
!       return -1;
      }
  
    if (dump_kind_p (MSG_NOTE))
--- 2888,2896 ----
  			 "divided by the scalar iteration cost = %d "
  			 "is greater or equal to the vectorization factor = %d.",
  			 vec_inside_cost, scalar_single_iter_cost, vf);
!       *ret_min_profitable_niters = -1;
!       *ret_min_profitable_estimate = -1;
!       return;
      }
  
    if (dump_kind_p (MSG_NOTE))
*************** vect_estimate_min_profitable_iters (loop
*** 2879,2884 ****
--- 2906,2913 ----
                     scalar_single_iter_cost);
        dump_printf (MSG_NOTE, "  Scalar outside cost: %d\n",
                     scalar_outside_cost);
+       dump_printf (MSG_NOTE, "  Vector outside cost: %d\n",
+                    vec_outside_cost);
        dump_printf (MSG_NOTE, "  prologue iterations: %d\n",
                     peel_iters_prologue);
        dump_printf (MSG_NOTE, "  epilogue iterations: %d\n",
*************** vect_estimate_min_profitable_iters (loop
*** 2898,2906 ****
  
    if (dump_kind_p (MSG_NOTE))
      dump_printf_loc (MSG_NOTE, vect_location,
!                      "  Profitability threshold = %d\n", min_profitable_iters);
  
!   return min_profitable_iters;
  }
  
  
--- 2927,2961 ----
  
    if (dump_kind_p (MSG_NOTE))
      dump_printf_loc (MSG_NOTE, vect_location,
!                      "  Runtime profitability threshold = %d\n", min_profitable_iters);
! 
!   *ret_min_profitable_niters = min_profitable_iters;
! 
!   /* Calculate number of iterations required to make the vector version
!      profitable, relative to the loop bodies only.
! 
!      Non-vectorized variant is SIC * niters and it must win over vector
!      variant on the expected loop trip count.  The following condition must hold true:
!      SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC  */
! 
!   if (vec_outside_cost <= 0)
!     min_profitable_estimate = 1;
!   else
!     {
!       min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) * vf
! 				 - vec_inside_cost * peel_iters_prologue
! 				 - vec_inside_cost * peel_iters_epilogue)
! 				 / ((scalar_single_iter_cost * vf)
! 				   - vec_inside_cost);
!     }
!   min_profitable_estimate --;
!   min_profitable_estimate = MAX (min_profitable_estimate, min_profitable_iters);
!   if (dump_kind_p (MSG_NOTE))
!     dump_printf_loc (MSG_NOTE, vect_location,
!                      "  Static estimate profitability threshold = %d\n",
!                       min_profitable_iters);
  
!   *ret_min_profitable_estimate = min_profitable_estimate;
  }
Jan Hubicka - Oct. 6, 2012, 9:34 a.m.
Hi,
I benchmarked the patch moving loop header copying and it is quite noticeable win.

Some testsuite updating is needed. In many cases it is just because the
optimizations are now happening earlier.
There are however few testusite failures I have torubles to deal with:
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c scan-tree-dump-times vect "vectorized 1 loops" 2
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  scan-tree-dump-times vrp1 "if " 1
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  scan-tree-dump-times vrp1 "if " 1

This is mostly about VRP losing its ability to thread some jumps from the
duplicated loop header out of the loop across the loopback edge.  This seems to
be due to loop updating logic.  Do we care about these?

Honza

Index: tree-ssa-threadupdate.c
===================================================================
*** tree-ssa-threadupdate.c	(revision 192123)
--- tree-ssa-threadupdate.c	(working copy)
*************** static bool
*** 846,854 ****
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
    const_basic_block new_header = (const_basic_block) data;
!   return (bb != new_header
! 	  && (loop_depth (bb->loop_father)
! 	      >= loop_depth (new_header->loop_father)));
  }
  
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
--- 846,860 ----
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
    const_basic_block new_header = (const_basic_block) data;
!   const struct loop *l;
! 
!   if (bb == new_header
!       || loop_depth (bb->loop_father) < loop_depth (new_header->loop_father))
!     return false;
!   for (l = bb->loop_father; l; l = loop_outer (l))
!     if (l == new_header->loop_father)
!       return true;
!   return false;
  }
  
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
Index: testsuite/gcc.dg/unroll_2.c
===================================================================
*** testsuite/gcc.dg/unroll_2.c	(revision 192123)
--- testsuite/gcc.dg/unroll_2.c	(working copy)
***************
*** 1,5 ****
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll" } */
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 ----
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: testsuite/gcc.dg/unroll_3.c
===================================================================
*** testsuite/gcc.dg/unroll_3.c	(revision 192123)
--- testsuite/gcc.dg/unroll_3.c	(working copy)
***************
*** 1,5 ****
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" } */
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 ----
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo -fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: testsuite/gcc.dg/torture/pr23821.c
===================================================================
*** testsuite/gcc.dg/torture/pr23821.c	(revision 192123)
--- testsuite/gcc.dg/torture/pr23821.c	(working copy)
***************
*** 1,9 ****
  /* { dg-do compile } */
  /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
! /* At -O1 DOM threads a jump in a non-optimal way which leads to
     the bogus propagation.  */
! /* { dg-skip-if "" { *-*-* } { "-O1" } { "" } } */
! /* { dg-options "-fdump-tree-ivcanon-details" } */
  
  int a[199];
  
--- 1,8 ----
  /* { dg-do compile } */
  /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
! /* DOM threads a jump in a non-optimal way which leads to
     the bogus propagation.  */
! /* { dg-options "-fdump-tree-ivcanon-details -fno-tree-dominator-opts" } */
  
  int a[199];
  
Index: testsuite/gcc.dg/tree-ssa/ivopt_1.c
===================================================================
*** testsuite/gcc.dg/tree-ssa/ivopt_1.c	(revision 192123)
--- testsuite/gcc.dg/tree-ssa/ivopt_1.c	(working copy)
*************** void foo (int i_width, TYPE dst, TYPE sr
*** 14,18 ****
  }
  
  
! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
--- 14,18 ----
  }
  
  
! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: testsuite/gcc.dg/tree-ssa/ivopt_2.c
===================================================================
*** testsuite/gcc.dg/tree-ssa/ivopt_2.c	(revision 192123)
--- testsuite/gcc.dg/tree-ssa/ivopt_2.c	(working copy)
*************** void foo (int i_width, TYPE dst, TYPE sr
*** 13,17 ****
         }
  }
  
! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
--- 13,17 ----
         }
  }
  
! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: testsuite/gcc.dg/tree-ssa/ivopt_3.c
===================================================================
*** testsuite/gcc.dg/tree-ssa/ivopt_3.c	(revision 192123)
--- testsuite/gcc.dg/tree-ssa/ivopt_3.c	(working copy)
*************** void foo (int i_width, char* dst, char* 
*** 16,20 ****
         }
  } 
  
! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
--- 16,20 ----
         }
  } 
  
! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: testsuite/gcc.dg/tree-ssa/ivopt_4.c
===================================================================
*** testsuite/gcc.dg/tree-ssa/ivopt_4.c	(revision 192123)
--- testsuite/gcc.dg/tree-ssa/ivopt_4.c	(working copy)
*************** void foo (int i_width, TYPE dst, TYPE sr
*** 15,19 ****
         }
  }
  
! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
--- 15,19 ----
         }
  }
  
! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
  /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: testsuite/gcc.dg/unroll_4.c
===================================================================
*** testsuite/gcc.dg/unroll_4.c	(revision 192123)
--- testsuite/gcc.dg/unroll_4.c	(working copy)
***************
*** 1,5 ****
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo2" } */
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 ----
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo2 -fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: testsuite/gcc.dg/tree-prof/update-loopch.c
===================================================================
*** testsuite/gcc.dg/tree-prof/update-loopch.c	(revision 192123)
--- testsuite/gcc.dg/tree-prof/update-loopch.c	(working copy)
*************** main ()
*** 11,20 ****
      }
    return 0;
  }
! /* Loop header copying will peel away the initial conditional, so the loop body
!    is once reached directly from entry point of function, rest via loopback
!    edge.  */
! /* { dg-final-use { scan-ipa-dump "loop depth 0, count 33334" "profile"} } */
  /* { dg-final-use { scan-tree-dump "loop depth 1, count 33332" "optimized"} } */
  /* { dg-final-use { scan-tree-dump-not "Invalid sum" "optimized"} } */
  /* { dg-final-use { cleanup-ipa-dump "profile" } } */
--- 11,20 ----
      }
    return 0;
  }
! /* Loop header copying, now happening before profiling, will peel away the
!    initial conditional, so the loop body is once reached directly from entry
!    point of function, rest via loopback edge.  */
! /* { dg-final-use { scan-ipa-dump "loop depth 0, count 33332" "profile"} } */
  /* { dg-final-use { scan-tree-dump "loop depth 1, count 33332" "optimized"} } */
  /* { dg-final-use { scan-tree-dump-not "Invalid sum" "optimized"} } */
  /* { dg-final-use { cleanup-ipa-dump "profile" } } */
Index: testsuite/gcc.dg/unroll_1.c
===================================================================
*** testsuite/gcc.dg/unroll_1.c	(revision 192123)
--- testsuite/gcc.dg/unroll_1.c	(working copy)
***************
*** 1,5 ****
  /* { dg-do compile } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll" } */
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 ----
  /* { dg-do compile } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: passes.c
===================================================================
*** passes.c	(revision 192123)
--- passes.c	(working copy)
*************** init_optimization_passes (void)
*** 1330,1335 ****
--- 1330,1340 ----
  	  NEXT_PASS (pass_convert_switch);
            NEXT_PASS (pass_cleanup_eh);
            NEXT_PASS (pass_profile);
+ 	  /* Scheduling header copying before pass_ipa_tree_profile is important
+ 	     to get loop iteration counts estimated right.
+ 	     Scheduling it after pass_profile prevents it to copy loop headers
+ 	     in cold functions declares by the user.  */
+           NEXT_PASS (pass_ch);
            NEXT_PASS (pass_local_pure_const);
  	  /* Split functions creates parts that are not run through
  	     early optimizations again.  It is thus good idea to do this
*************** init_optimization_passes (void)
*** 1406,1412 ****
        NEXT_PASS (pass_tree_ifcombine);
        NEXT_PASS (pass_phiopt);
        NEXT_PASS (pass_tail_recursion);
-       NEXT_PASS (pass_ch);
        NEXT_PASS (pass_stdarg);
        NEXT_PASS (pass_lower_complex);
        NEXT_PASS (pass_sra);
--- 1411,1416 ----
Richard Guenther - Oct. 8, 2012, 10:02 a.m.
On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> I benchmarked the patch moving loop header copying and it is quite noticeable win.
>
> Some testsuite updating is needed. In many cases it is just because the
> optimizations are now happening earlier.
> There are however few testusite failures I have torubles to deal with:
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c scan-tree-dump-times vect "vectorized 1 loops" 2
> ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  scan-tree-dump-times vrp1 "if " 1
> ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  scan-tree-dump-times vrp1 "if " 1
>
> This is mostly about VRP losing its ability to thread some jumps from the
> duplicated loop header out of the loop across the loopback edge.  This seems to
> be due to loop updating logic.  Do we care about these?

Yes, I think so.  At least we care that the optimized result is the same.

Can you elaborate on "due to loop updating logic"?

Can you elaborate on the def_split_header_continue_p change?  Which probably
should be tested and installed separately?

Thanks,
Richard.

> Honza
>
> Index: tree-ssa-threadupdate.c
> ===================================================================
> *** tree-ssa-threadupdate.c     (revision 192123)
> --- tree-ssa-threadupdate.c     (working copy)
> *************** static bool
> *** 846,854 ****
>   def_split_header_continue_p (const_basic_block bb, const void *data)
>   {
>     const_basic_block new_header = (const_basic_block) data;
> !   return (bb != new_header
> !         && (loop_depth (bb->loop_father)
> !             >= loop_depth (new_header->loop_father)));
>   }
>
>   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
> --- 846,860 ----
>   def_split_header_continue_p (const_basic_block bb, const void *data)
>   {
>     const_basic_block new_header = (const_basic_block) data;
> !   const struct loop *l;
> !
> !   if (bb == new_header
> !       || loop_depth (bb->loop_father) < loop_depth (new_header->loop_father))
> !     return false;
> !   for (l = bb->loop_father; l; l = loop_outer (l))
> !     if (l == new_header->loop_father)
> !       return true;
> !   return false;
>   }
>
>   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
> Index: testsuite/gcc.dg/unroll_2.c
> ===================================================================
> *** testsuite/gcc.dg/unroll_2.c (revision 192123)
> --- testsuite/gcc.dg/unroll_2.c (working copy)
> ***************
> *** 1,5 ****
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 ----
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: testsuite/gcc.dg/unroll_3.c
> ===================================================================
> *** testsuite/gcc.dg/unroll_3.c (revision 192123)
> --- testsuite/gcc.dg/unroll_3.c (working copy)
> ***************
> *** 1,5 ****
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 ----
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: testsuite/gcc.dg/torture/pr23821.c
> ===================================================================
> *** testsuite/gcc.dg/torture/pr23821.c  (revision 192123)
> --- testsuite/gcc.dg/torture/pr23821.c  (working copy)
> ***************
> *** 1,9 ****
>   /* { dg-do compile } */
>   /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
> ! /* At -O1 DOM threads a jump in a non-optimal way which leads to
>      the bogus propagation.  */
> ! /* { dg-skip-if "" { *-*-* } { "-O1" } { "" } } */
> ! /* { dg-options "-fdump-tree-ivcanon-details" } */
>
>   int a[199];
>
> --- 1,8 ----
>   /* { dg-do compile } */
>   /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
> ! /* DOM threads a jump in a non-optimal way which leads to
>      the bogus propagation.  */
> ! /* { dg-options "-fdump-tree-ivcanon-details -fno-tree-dominator-opts" } */
>
>   int a[199];
>
> Index: testsuite/gcc.dg/tree-ssa/ivopt_1.c
> ===================================================================
> *** testsuite/gcc.dg/tree-ssa/ivopt_1.c (revision 192123)
> --- testsuite/gcc.dg/tree-ssa/ivopt_1.c (working copy)
> *************** void foo (int i_width, TYPE dst, TYPE sr
> *** 14,18 ****
>   }
>
>
> ! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> --- 14,18 ----
>   }
>
>
> ! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> Index: testsuite/gcc.dg/tree-ssa/ivopt_2.c
> ===================================================================
> *** testsuite/gcc.dg/tree-ssa/ivopt_2.c (revision 192123)
> --- testsuite/gcc.dg/tree-ssa/ivopt_2.c (working copy)
> *************** void foo (int i_width, TYPE dst, TYPE sr
> *** 13,17 ****
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> --- 13,17 ----
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> Index: testsuite/gcc.dg/tree-ssa/ivopt_3.c
> ===================================================================
> *** testsuite/gcc.dg/tree-ssa/ivopt_3.c (revision 192123)
> --- testsuite/gcc.dg/tree-ssa/ivopt_3.c (working copy)
> *************** void foo (int i_width, char* dst, char*
> *** 16,20 ****
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> --- 16,20 ----
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> Index: testsuite/gcc.dg/tree-ssa/ivopt_4.c
> ===================================================================
> *** testsuite/gcc.dg/tree-ssa/ivopt_4.c (revision 192123)
> --- testsuite/gcc.dg/tree-ssa/ivopt_4.c (working copy)
> *************** void foo (int i_width, TYPE dst, TYPE sr
> *** 15,19 ****
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI <ivtmp" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> --- 15,19 ----
>          }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts"} } */
>   /* { dg-final { cleanup-tree-dump "ivopts" } } */
> Index: testsuite/gcc.dg/unroll_4.c
> ===================================================================
> *** testsuite/gcc.dg/unroll_4.c (revision 192123)
> --- testsuite/gcc.dg/unroll_4.c (working copy)
> ***************
> *** 1,5 ****
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo2" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 ----
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo2 -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: testsuite/gcc.dg/tree-prof/update-loopch.c
> ===================================================================
> *** testsuite/gcc.dg/tree-prof/update-loopch.c  (revision 192123)
> --- testsuite/gcc.dg/tree-prof/update-loopch.c  (working copy)
> *************** main ()
> *** 11,20 ****
>       }
>     return 0;
>   }
> ! /* Loop header copying will peel away the initial conditional, so the loop body
> !    is once reached directly from entry point of function, rest via loopback
> !    edge.  */
> ! /* { dg-final-use { scan-ipa-dump "loop depth 0, count 33334" "profile"} } */
>   /* { dg-final-use { scan-tree-dump "loop depth 1, count 33332" "optimized"} } */
>   /* { dg-final-use { scan-tree-dump-not "Invalid sum" "optimized"} } */
>   /* { dg-final-use { cleanup-ipa-dump "profile" } } */
> --- 11,20 ----
>       }
>     return 0;
>   }
> ! /* Loop header copying, now happening before profiling, will peel away the
> !    initial conditional, so the loop body is once reached directly from entry
> !    point of function, rest via loopback edge.  */
> ! /* { dg-final-use { scan-ipa-dump "loop depth 0, count 33332" "profile"} } */
>   /* { dg-final-use { scan-tree-dump "loop depth 1, count 33332" "optimized"} } */
>   /* { dg-final-use { scan-tree-dump-not "Invalid sum" "optimized"} } */
>   /* { dg-final-use { cleanup-ipa-dump "profile" } } */
> Index: testsuite/gcc.dg/unroll_1.c
> ===================================================================
> *** testsuite/gcc.dg/unroll_1.c (revision 192123)
> --- testsuite/gcc.dg/unroll_1.c (working copy)
> ***************
> *** 1,5 ****
>   /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 ----
>   /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: passes.c
> ===================================================================
> *** passes.c    (revision 192123)
> --- passes.c    (working copy)
> *************** init_optimization_passes (void)
> *** 1330,1335 ****
> --- 1330,1340 ----
>           NEXT_PASS (pass_convert_switch);
>             NEXT_PASS (pass_cleanup_eh);
>             NEXT_PASS (pass_profile);
> +         /* Scheduling header copying before pass_ipa_tree_profile is important
> +            to get loop iteration counts estimated right.
> +            Scheduling it after pass_profile prevents it to copy loop headers
> +            in cold functions declares by the user.  */
> +           NEXT_PASS (pass_ch);
>             NEXT_PASS (pass_local_pure_const);
>           /* Split functions creates parts that are not run through
>              early optimizations again.  It is thus good idea to do this
> *************** init_optimization_passes (void)
> *** 1406,1412 ****
>         NEXT_PASS (pass_tree_ifcombine);
>         NEXT_PASS (pass_phiopt);
>         NEXT_PASS (pass_tail_recursion);
> -       NEXT_PASS (pass_ch);
>         NEXT_PASS (pass_stdarg);
>         NEXT_PASS (pass_lower_complex);
>         NEXT_PASS (pass_sra);
> --- 1411,1416 ----
Jan Hubicka - Oct. 8, 2012, 11:19 a.m.
> On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > I benchmarked the patch moving loop header copying and it is quite noticeable win.
> >
> > Some testsuite updating is needed. In many cases it is just because the
> > optimizations are now happening earlier.
> > There are however few testusite failures I have torubles to deal with:
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c scan-tree-dump-times vect "vectorized 1 loops" 2
> > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  scan-tree-dump-times vrp1 "if " 1
> > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  scan-tree-dump-times vrp1 "if " 1
> >
> > This is mostly about VRP losing its ability to thread some jumps from the
> > duplicated loop header out of the loop across the loopback edge.  This seems to
> > be due to loop updating logic.  Do we care about these?
> 
> Yes, I think so.  At least we care that the optimized result is the same.

it is not, we really lose optimization in those testcases.
The ones that are still optimized well I updated in the patch bellow.
> 
> Can you elaborate on "due to loop updating logic"?

The problem is:
  /* We do not allow VRP information to be used for jump threading
     across a back edge in the CFG.  Otherwise it becomes too
     difficult to avoid eliminating loop exit tests.  Of course
     EDGE_DFS_BACK is not accurate at this time so we have to
     recompute it.  */
  mark_dfs_back_edges ();

  /* Do not thread across edges we are about to remove.  Just marking
     them as EDGE_DFS_BACK will do.  */
  FOR_EACH_VEC_ELT (edge, to_remove_edges, i, e)
    e->flags |= EDGE_DFS_BACK;

Loop header copying puts some conditional before loop and we want to thread
up to exit out of the loop (that I think it rather important optimization).
But it no longer happens before back edge is in the way.  At least that was
the case in the tree-ssa failures I analyzed.
> 
> Can you elaborate on the def_split_header_continue_p change?  Which probably
> should be tested and installed separately?

Yes, that one is latent bug.  The code is expecting that loop exit is recognized
by loop depth decreasing that is not true.
It reproduces as ICE during bootstrap with the patch.
I will regtest/bootstrap and commit it today.

Honza

Patch

Index: passes.c
===================================================================
--- passes.c	(revision 191852)
+++ passes.c	(working copy)
@@ -1335,6 +1336,7 @@  init_optimization_passes (void)
           NEXT_PASS (pass_cleanup_eh);
           NEXT_PASS (pass_profile);
           NEXT_PASS (pass_local_pure_const);
+          NEXT_PASS (pass_early_ch);
 	  /* Split functions creates parts that are not run through
 	     early optimizations again.  It is thus good idea to do this
 	     late.  */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 191852)
+++ tree-pass.h	(working copy)
@@ -291,6 +291,7 @@  extern struct gimple_opt_pass pass_loop_
 extern struct gimple_opt_pass pass_iv_optimize;
 extern struct gimple_opt_pass pass_tree_loop_done;
 extern struct gimple_opt_pass pass_ch;
+extern struct gimple_opt_pass pass_early_ch;
 extern struct gimple_opt_pass pass_ccp;
 extern struct gimple_opt_pass pass_phi_only_cprop;
 extern struct gimple_opt_pass pass_build_ssa;
Index: tree-ssa-loop-ch.c
===================================================================
--- tree-ssa-loop-ch.c	(revision 191852)
+++ tree-ssa-loop-ch.c	(working copy)
@@ -275,3 +275,34 @@  struct gimple_opt_pass pass_ch =
     | TODO_verify_flow			/* todo_flags_finish */
  }
 };
+
+/* We duplicate loop headers early to get better profile feedback:
+   the first iteration of loop is special, because the duplicated loop header
+   test will usually pass.  */
+
+static bool
+gate_early_ch (void)
+{
+  return flag_tree_ch != 0 && (flag_branch_probabilities || profile_arc_flag);
+}
+
+struct gimple_opt_pass pass_early_ch =
+{
+ {
+  GIMPLE_PASS,
+  "early_ch",				/* name */
+  gate_early_ch,			/* gate */
+  copy_loop_headers,			/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_TREE_CH,				/* tv_id */
+  PROP_cfg | PROP_ssa,			/* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_cleanup_cfg
+    | TODO_verify_ssa
+    | TODO_verify_flow			/* todo_flags_finish */
+ }
+};