Patchwork Add vector cost model density heuristic

login
register
mail settings
Submitter William J. Schmidt
Date June 8, 2012, 3:23 p.m.
Message ID <1339169033.20419.2.camel@gnopaine>
Download mbox | patch
Permalink /patch/163789/
State New
Headers show

Comments

William J. Schmidt - June 8, 2012, 3:23 p.m.
This patch adds a heuristic to the vectorizer when estimating the
minimum profitable number of iterations.  The heuristic is
target-dependent, and is currently disabled for all targets except
PowerPC.  However, the intent is to make it general enough to be useful
for other targets that want to opt in.

A previous patch addressed some PowerPC SPEC degradations by modifying
the vector cost model values for vec_perm and vec_promote_demote.  The
values were set a little higher than their natural values because the
natural values were not sufficient to prevent a poor vectorization
choice.  However, this is not the right long-term solution, since it can
unnecessarily constrain other vectorization choices involving permute
instructions.

Analysis of the badly vectorized loop (in sphinx3) showed that the
problem was overcommitment of vector resources -- too many vector
instructions issued without enough non-vector instructions available to
cover the delays.  The vector cost model assumes that instructions
always have a constant cost, and doesn't have a way of judging this kind
of "density" of vector instructions.

The present patch adds a heuristic to recognize when a loop is likely to
overcommit resources, and adds a small penalty to the inside-loop cost
to account for the expected stalls.  The heuristic is parameterized with
three target-specific values:

 * Density threshold: The heuristic will apply only when the
   percentage of inside-loop cost attributable to vectorized
   instructions exceeds this value.

 * Size threshold: The heuristic will apply only when the
   inside-loop cost exceeds this value.

 * Penalty: The inside-loop cost will be increased by this
   percentage value when the heuristic applies.

Thus only reasonably large loop bodies that are mostly vectorized
instructions will be affected.

By applying only a small percentage bump to the inside-loop cost, the
heuristic will only turn off vectorization for loops that were
considered "barely profitable" to begin with (such as the sphinx3 loop).
So the heuristic is quite conservative and should not affect the vast
majority of vectorization decisions.

Together with the new heuristic, this patch reduces the vec_perm and
vec_promote_demote costs for PowerPC to their natural values.

I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
and verified that no performance regressions occur on SPEC cpu2006.  Is
this ok for trunk?

Thanks,
Bill


2012-06-08  Bill Schmidt  <wschmidt@linux.ibm.com>

	* doc/tm.texi.in: Add vectorization density hooks.
	* doc/tm.texi: Regenerate.
	* targhooks.c (default_density_pct_threshold): New.
	(default_density_size_threshold): New.
	(default_density_penalty): New.
	* targhooks.h: New decls for new targhooks.c functions.
	* target.def (density_pct_threshold): New DEF_HOOK.
	(density_size_threshold): Likewise.
	(density_penalty): Likewise.
	* tree-vect-loop.c (accum_stmt_cost): New.
	(vect_estimate_min_profitable_iters): Perform density test.
	* config/rs6000/rs6000.c (TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD):
	New macro definition.
	(TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD): Likewise.
	(TARGET_VECTORIZE_DENSITY_PENALTY): Likewise.
	(rs6000_builtin_vectorization_cost): Reduce costs of vec_perm and
	vec_promote_demote to correct values.
	(rs6000_density_pct_threshold): New.
	(rs6000_density_size_threshold): New.
	(rs6000_density_penalty): New.
Richard Guenther - June 11, 2012, 11:40 a.m.
On Fri, 8 Jun 2012, William J. Schmidt wrote:

> This patch adds a heuristic to the vectorizer when estimating the
> minimum profitable number of iterations.  The heuristic is
> target-dependent, and is currently disabled for all targets except
> PowerPC.  However, the intent is to make it general enough to be useful
> for other targets that want to opt in.
> 
> A previous patch addressed some PowerPC SPEC degradations by modifying
> the vector cost model values for vec_perm and vec_promote_demote.  The
> values were set a little higher than their natural values because the
> natural values were not sufficient to prevent a poor vectorization
> choice.  However, this is not the right long-term solution, since it can
> unnecessarily constrain other vectorization choices involving permute
> instructions.
> 
> Analysis of the badly vectorized loop (in sphinx3) showed that the
> problem was overcommitment of vector resources -- too many vector
> instructions issued without enough non-vector instructions available to
> cover the delays.  The vector cost model assumes that instructions
> always have a constant cost, and doesn't have a way of judging this kind
> of "density" of vector instructions.
> 
> The present patch adds a heuristic to recognize when a loop is likely to
> overcommit resources, and adds a small penalty to the inside-loop cost
> to account for the expected stalls.  The heuristic is parameterized with
> three target-specific values:
> 
>  * Density threshold: The heuristic will apply only when the
>    percentage of inside-loop cost attributable to vectorized
>    instructions exceeds this value.
> 
>  * Size threshold: The heuristic will apply only when the
>    inside-loop cost exceeds this value.
> 
>  * Penalty: The inside-loop cost will be increased by this
>    percentage value when the heuristic applies.
> 
> Thus only reasonably large loop bodies that are mostly vectorized
> instructions will be affected.
> 
> By applying only a small percentage bump to the inside-loop cost, the
> heuristic will only turn off vectorization for loops that were
> considered "barely profitable" to begin with (such as the sphinx3 loop).
> So the heuristic is quite conservative and should not affect the vast
> majority of vectorization decisions.
> 
> Together with the new heuristic, this patch reduces the vec_perm and
> vec_promote_demote costs for PowerPC to their natural values.
> 
> I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> and verified that no performance regressions occur on SPEC cpu2006.  Is
> this ok for trunk?

Hmm.  I don't like this patch or its general idea too much.  Instead
I'd like us to move more of the cost model detail to the target, giving
it a chance to look at the whole loop before deciding on a cost.  ISTR
posting the overall idea at some point, but let me repeat it here instead
of trying to find that e-mail.

The basic interface of the cost model should be, in targetm.vectorize

  /* Tell the target to start cost analysis of a loop or a basic-block
     (if the loop argument is NULL).  Returns an opaque pointer to
     target-private data.  */
  void *init_cost (struct loop *loop);

  /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
  void add_stmt_cost (void *data, unsigned n,
		      vectorized-stmt-kind,
                      enum machine_mode vector_mode);

  /* Tell the target to compute and return the cost of the accumulated
     statements and free any target-private data.  */
  unsigned finish_cost (void *data);

with eventually slightly different signatures for add_stmt_cost
(like pass in the original scalar stmt?).

It allows the target, at finish_cost time, to evaluate things like
register pressure and resource utilization.

Thanks,
Richard.

> Thanks,
> Bill
> 
> 
> 2012-06-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* doc/tm.texi.in: Add vectorization density hooks.
> 	* doc/tm.texi: Regenerate.
> 	* targhooks.c (default_density_pct_threshold): New.
> 	(default_density_size_threshold): New.
> 	(default_density_penalty): New.
> 	* targhooks.h: New decls for new targhooks.c functions.
> 	* target.def (density_pct_threshold): New DEF_HOOK.
> 	(density_size_threshold): Likewise.
> 	(density_penalty): Likewise.
> 	* tree-vect-loop.c (accum_stmt_cost): New.
> 	(vect_estimate_min_profitable_iters): Perform density test.
> 	* config/rs6000/rs6000.c (TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD):
> 	New macro definition.
> 	(TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD): Likewise.
> 	(TARGET_VECTORIZE_DENSITY_PENALTY): Likewise.
> 	(rs6000_builtin_vectorization_cost): Reduce costs of vec_perm and
> 	vec_promote_demote to correct values.
> 	(rs6000_density_pct_threshold): New.
> 	(rs6000_density_size_threshold): New.
> 	(rs6000_density_penalty): New.
> 
> 
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi	(revision 188305)
> +++ gcc/doc/tm.texi	(working copy)
> @@ -5798,6 +5798,27 @@ The default is @code{NULL_TREE} which means to not
>  loads.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD (void)
> +This hook should return the maximum density, expressed in percent, for
> +which autovectorization of loops with large bodies should be constrained.
> +See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
> +is to return 100, which disables the density test.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD (void)
> +This hook should return the minimum estimated size of a vectorized
> +loop body for which the density test should apply.  See also
> +@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
> +to the unreasonable value of 1000000, which effectively disables 
> +the density test.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PENALTY (void)
> +This hook should return the penalty, expressed in percent, to be applied
> +to the inside-of-loop vectorization costs for a loop failing the density
> +test.  The default is 10.
> +@end deftypefn
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses
> Index: gcc/doc/tm.texi.in
> ===================================================================
> --- gcc/doc/tm.texi.in	(revision 188305)
> +++ gcc/doc/tm.texi.in	(working copy)
> @@ -5730,6 +5730,27 @@ The default is @code{NULL_TREE} which means to not
>  loads.
>  @end deftypefn
>  
> +@hook TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
> +This hook should return the maximum density, expressed in percent, for
> +which autovectorization of loops with large bodies should be constrained.
> +See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
> +is to return 100, which disables the density test.
> +@end deftypefn
> +
> +@hook TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
> +This hook should return the minimum estimated size of a vectorized
> +loop body for which the density test should apply.  See also
> +@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
> +to the unreasonable value of 1000000, which effectively disables 
> +the density test.
> +@end deftypefn
> +
> +@hook TARGET_VECTORIZE_DENSITY_PENALTY
> +This hook should return the penalty, expressed in percent, to be applied
> +to the inside-of-loop vectorization costs for a loop failing the density
> +test.  The default is 10.
> +@end deftypefn
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses
> Index: gcc/targhooks.c
> ===================================================================
> --- gcc/targhooks.c	(revision 188305)
> +++ gcc/targhooks.c	(working copy)
> @@ -990,6 +990,33 @@ default_autovectorize_vector_sizes (void)
>    return 0;
>  }
>  
> +/* By default, the density test for autovectorization is disabled by
> +   setting the minimum percentage to 100.  */
> +
> +int
> +default_density_pct_threshold (void)
> +{
> +  return 100;
> +}
> +
> +/* By default, the density size threshold for autovectorization is
> +   meaningless since the density test is disabled.  An unreasonably
> +   large number is used to further inhibit the density test.  */
> +
> +int
> +default_density_size_threshold (void)
> +{
> +  return 1000000;
> +}
> +
> +/* By default, the density penalty for autovectorization is set to 10%.  */
> +
> +int
> +default_density_penalty (void)
> +{
> +  return 10;
> +}
> +
>  /* Determine whether or not a pointer mode is valid. Assume defaults
>     of ptr_mode or Pmode - can be overridden.  */
>  bool
> Index: gcc/targhooks.h
> ===================================================================
> --- gcc/targhooks.h	(revision 188305)
> +++ gcc/targhooks.h	(working copy)
> @@ -90,6 +90,9 @@ default_builtin_support_vector_misalignment (enum
>  					     int, bool);
>  extern enum machine_mode default_preferred_simd_mode (enum machine_mode mode);
>  extern unsigned int default_autovectorize_vector_sizes (void);
> +extern int default_density_pct_threshold (void);
> +extern int default_density_size_threshold (void);
> +extern int default_density_penalty (void);
>  
>  /* These are here, and not in hooks.[ch], because not all users of
>     hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
> Index: gcc/target.def
> ===================================================================
> --- gcc/target.def	(revision 188305)
> +++ gcc/target.def	(working copy)
> @@ -1054,6 +1054,32 @@ DEFHOOK
>   (const_tree mem_vectype, const_tree index_type, int scale),
>   NULL)
>  
> +/* Return the maximum density in percent for loop vectorization.  */
> +DEFHOOK
> +(density_pct_threshold,
> +"",
> +int,
> +(void),
> +default_density_pct_threshold)
> +
> +/* Return the minimum size of a loop iteration for applying the density
> +   test for loop vectorization.  */
> +DEFHOOK
> +(density_size_threshold,
> +"",
> +int,
> +(void),
> +default_density_size_threshold)
> +
> +/* Return the penalty in percent for vectorizing a loop failing the
> +   density test.  */
> +DEFHOOK
> +(density_penalty,
> +"",
> +int,
> +(void),
> +default_density_penalty)
> +
>  HOOK_VECTOR_END (vectorize)
>  
>  #undef HOOK_PREFIX
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c	(revision 188305)
> +++ gcc/tree-vect-loop.c	(working copy)
> @@ -2485,6 +2485,58 @@ vect_get_known_peeling_cost (loop_vec_info loop_vi
>             + peel_guard_costs;
>  }
>  
> +/* Add the inside-loop cost of STMT to either *REL_COST or *IRREL_COST,
> +   depending on whether or not STMT will be vectorized.  For vectorized
> +   statements, the inside-loop cost is as already computed.  For other
> +   statements, assume a cost of one.  */
> +
> +static void
> +accum_stmt_cost (gimple stmt, int *rel_cost, int *irrel_cost)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  gimple pattern_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> +  gimple_seq pattern_def_seq;
> +
> +  /* If the statement is irrelevant, but it has a related pattern
> +     statement that is relevant, process just the related statement.
> +     If the statement is relevant and it has a related pattern
> +     statement that is also relevant, process them both.  */
> +  if (!STMT_VINFO_RELEVANT_P (stmt_info)
> +      && !STMT_VINFO_LIVE_P (stmt_info))
> +    {
> +      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
> +	  && pattern_stmt
> +	  && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
> +	      || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
> +	accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
> +      else
> +	(*irrel_cost)++;
> +    }
> +  else if (STMT_VINFO_IN_PATTERN_P (stmt_info)
> +	   && pattern_stmt
> +	   && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
> +	       || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
> +    accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
> +
> +  /* If we're looking at a pattern that has additional statements,
> +     count them as well.  */
> +  if (is_pattern_stmt_p (stmt_info)
> +      && (pattern_def_seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_info)))
> +    {
> +      gimple_stmt_iterator gsi;
> +      for (gsi = gsi_start (pattern_def_seq); !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple pattern_def_stmt = gsi_stmt (gsi);
> +	  if (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_def_stmt))
> +	      || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_def_stmt)))
> +	    accum_stmt_cost (pattern_def_stmt, rel_cost, irrel_cost);
> +	}
> +    }
> +
> +  /* Accumulate the inside-loop cost of this vectorizable statement.  */
> +  *rel_cost += STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info);
> +}
> +
>  /* Function vect_estimate_min_profitable_iters
>  
>     Return the number of iterations required for the vector version of the
> @@ -2743,6 +2795,45 @@ vect_estimate_min_profitable_iters (loop_vec_info
>        vec_inside_cost += SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance);
>      }
>  
> +  /* Test for likely overcommitment of vector hardware resources.  If a
> +     loop iteration is relatively large, and too large a percentage of
> +     instructions in the loop are vectorized, the cost model may not
> +     adequately reflect delays from unavailable vector resources.
> +     Penalize vec_inside_cost for this case, using target-specific
> +     parameters.  */
> +  if (targetm.vectorize.density_pct_threshold () < 100)
> +    {
> +      int rel_cost = 0, irrel_cost = 0;
> +      int density_pct;
> +
> +      for (i = 0; i < nbbs; i++)
> +	{
> +	  basic_block bb = bbs[i];
> +	  gimple_stmt_iterator gsi;
> +
> +	  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +	    {
> +	      gimple stmt = gsi_stmt (gsi);
> +	      accum_stmt_cost (stmt, &rel_cost, &irrel_cost);
> +	    }
> +	}
> +
> +      density_pct = (rel_cost * 100) / (rel_cost + irrel_cost);
> +
> +      if (density_pct > targetm.vectorize.density_pct_threshold ()
> +	  && (rel_cost + irrel_cost
> +	      > targetm.vectorize.density_size_threshold ()))
> +	{
> +	  int penalty = targetm.vectorize.density_penalty ();
> +	  vec_inside_cost = vec_inside_cost * (100 + penalty) / 100;
> +	  if (vect_print_dump_info (REPORT_DETAILS))
> +	    fprintf (vect_dump,
> +		     "density %d%%, cost %d exceeds threshold"
> +		     ", penalizing inside-loop cost by %d%%.",
> +		     density_pct, rel_cost + irrel_cost, penalty);
> +	}
> +    }
> +
>    /* Calculate number of iterations required to make the vector version
>       profitable, relative to the loop bodies only.  The following condition
>       must hold true:
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c	(revision 188305)
> +++ gcc/config/rs6000/rs6000.c	(working copy)
> @@ -1289,6 +1289,15 @@ static const struct attribute_spec rs6000_attribut
>  #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
>  #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE \
>    rs6000_preferred_simd_mode
> +#undef TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
> +#define TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD \
> +  rs6000_density_pct_threshold
> +#undef TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
> +#define TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD \
> +  rs6000_density_size_threshold
> +#undef TARGET_VECTORIZE_DENSITY_PENALTY
> +#define TARGET_VECTORIZE_DENSITY_PENALTY \
> + rs6000_density_penalty
>  
>  #undef TARGET_INIT_BUILTINS
>  #define TARGET_INIT_BUILTINS rs6000_init_builtins
> @@ -3421,13 +3430,13 @@ rs6000_builtin_vectorization_cost (enum vect_cost_
>  
>        case vec_perm:
>  	if (TARGET_VSX)
> -	  return 4;
> +	  return 3;
>  	else
>  	  return 1;
>  
>        case vec_promote_demote:
>          if (TARGET_VSX)
> -          return 5;
> +          return 4;
>          else
>            return 1;
>  
> @@ -3551,6 +3560,30 @@ rs6000_preferred_simd_mode (enum machine_mode mode
>    return word_mode;
>  }
>  
> +/* Implement targetm.vectorize.density_pct_threshold.  */
> +
> +static int
> +rs6000_density_pct_threshold (void)
> +{
> +  return 85;
> +}
> +
> +/* Implement targetm.vectorize.density_size_threshold.  */
> +
> +static int
> +rs6000_density_size_threshold (void)
> +{
> +  return 70;
> +}
> +
> +/* Implement targetm.vectorize.density_penalty.  */
> +
> +static int
> +rs6000_density_penalty (void)
> +{
> +  return 10;
> +}
> +
>  /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
>     library with vectorized intrinsics.  */
>  
> 
> 
>
William J. Schmidt - June 11, 2012, 2:37 p.m.
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> On Fri, 8 Jun 2012, William J. Schmidt wrote:
> 
> > This patch adds a heuristic to the vectorizer when estimating the
> > minimum profitable number of iterations.  The heuristic is
> > target-dependent, and is currently disabled for all targets except
> > PowerPC.  However, the intent is to make it general enough to be useful
> > for other targets that want to opt in.
> > 
> > A previous patch addressed some PowerPC SPEC degradations by modifying
> > the vector cost model values for vec_perm and vec_promote_demote.  The
> > values were set a little higher than their natural values because the
> > natural values were not sufficient to prevent a poor vectorization
> > choice.  However, this is not the right long-term solution, since it can
> > unnecessarily constrain other vectorization choices involving permute
> > instructions.
> > 
> > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > problem was overcommitment of vector resources -- too many vector
> > instructions issued without enough non-vector instructions available to
> > cover the delays.  The vector cost model assumes that instructions
> > always have a constant cost, and doesn't have a way of judging this kind
> > of "density" of vector instructions.
> > 
> > The present patch adds a heuristic to recognize when a loop is likely to
> > overcommit resources, and adds a small penalty to the inside-loop cost
> > to account for the expected stalls.  The heuristic is parameterized with
> > three target-specific values:
> > 
> >  * Density threshold: The heuristic will apply only when the
> >    percentage of inside-loop cost attributable to vectorized
> >    instructions exceeds this value.
> > 
> >  * Size threshold: The heuristic will apply only when the
> >    inside-loop cost exceeds this value.
> > 
> >  * Penalty: The inside-loop cost will be increased by this
> >    percentage value when the heuristic applies.
> > 
> > Thus only reasonably large loop bodies that are mostly vectorized
> > instructions will be affected.
> > 
> > By applying only a small percentage bump to the inside-loop cost, the
> > heuristic will only turn off vectorization for loops that were
> > considered "barely profitable" to begin with (such as the sphinx3 loop).
> > So the heuristic is quite conservative and should not affect the vast
> > majority of vectorization decisions.
> > 
> > Together with the new heuristic, this patch reduces the vec_perm and
> > vec_promote_demote costs for PowerPC to their natural values.
> > 
> > I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> > and verified that no performance regressions occur on SPEC cpu2006.  Is
> > this ok for trunk?
> 
> Hmm.  I don't like this patch or its general idea too much.  Instead
> I'd like us to move more of the cost model detail to the target, giving
> it a chance to look at the whole loop before deciding on a cost.  ISTR
> posting the overall idea at some point, but let me repeat it here instead
> of trying to find that e-mail.
> 
> The basic interface of the cost model should be, in targetm.vectorize
> 
>   /* Tell the target to start cost analysis of a loop or a basic-block
>      (if the loop argument is NULL).  Returns an opaque pointer to
>      target-private data.  */
>   void *init_cost (struct loop *loop);
> 
>   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
>   void add_stmt_cost (void *data, unsigned n,
> 		      vectorized-stmt-kind,
>                       enum machine_mode vector_mode);
> 
>   /* Tell the target to compute and return the cost of the accumulated
>      statements and free any target-private data.  */
>   unsigned finish_cost (void *data);
> 
> with eventually slightly different signatures for add_stmt_cost
> (like pass in the original scalar stmt?).
> 
> It allows the target, at finish_cost time, to evaluate things like
> register pressure and resource utilization.

OK, I'm trying to understand how you would want this built into the
present structure.  Taking just the loop case for now:

Judging by your suggested API, we would have to call add_stmt_cost ()
everywhere that we now call stmt_vinfo_set_inside_of_loop_cost ().  For
now this would be an additional call, not a replacement, though maybe
the other goes away eventually.  This allows the target to save more
data about the vectorized instructions than just an accumulated cost
number (order and quantity of various kinds of instructions can be
maintained for better modeling).  Presumably the call to finish_cost
would be done within vect_estimate_min_profitable_iters () to produce
the final value of inside_cost for the loop.

The default target hook for add_stmt_cost would duplicate what we
currently do for calculating the inside_cost of a statement, and the
default target hook for finish_cost would just return the sum.

I'll have to go hunting where the similar code would fit for SLP in a
basic block.

If I read you correctly, you don't object to a density heuristic such as
the one I implemented here, but you want to see such heuristics confined
to a specific target rather than parameterized for all targets.

How'm I doing?  I want to be sure we're on the same page before I delve
into this.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill

<patch snipped>
David Edelsohn - June 11, 2012, 2:45 p.m.
On Mon, Jun 11, 2012 at 7:40 AM, Richard Guenther <rguenther@suse.de> wrote:

> Hmm.  I don't like this patch or its general idea too much.  Instead
> I'd like us to move more of the cost model detail to the target, giving
> it a chance to look at the whole loop before deciding on a cost.  ISTR
> posting the overall idea at some point, but let me repeat it here instead
> of trying to find that e-mail.

Richard,

Allowing the target to see the entire loop and have more control over
the cost model is a fine goal, but the current code is using incorrect
heuristics. Requiring a complete re-write of the auto-vectorizer cost
model should not be a requirement for fixing the current code.

Thanks, David
Richard Guenther - June 11, 2012, 2:47 p.m.
On Mon, 11 Jun 2012, William J. Schmidt wrote:

> On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > 
> > > This patch adds a heuristic to the vectorizer when estimating the
> > > minimum profitable number of iterations.  The heuristic is
> > > target-dependent, and is currently disabled for all targets except
> > > PowerPC.  However, the intent is to make it general enough to be useful
> > > for other targets that want to opt in.
> > > 
> > > A previous patch addressed some PowerPC SPEC degradations by modifying
> > > the vector cost model values for vec_perm and vec_promote_demote.  The
> > > values were set a little higher than their natural values because the
> > > natural values were not sufficient to prevent a poor vectorization
> > > choice.  However, this is not the right long-term solution, since it can
> > > unnecessarily constrain other vectorization choices involving permute
> > > instructions.
> > > 
> > > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > > problem was overcommitment of vector resources -- too many vector
> > > instructions issued without enough non-vector instructions available to
> > > cover the delays.  The vector cost model assumes that instructions
> > > always have a constant cost, and doesn't have a way of judging this kind
> > > of "density" of vector instructions.
> > > 
> > > The present patch adds a heuristic to recognize when a loop is likely to
> > > overcommit resources, and adds a small penalty to the inside-loop cost
> > > to account for the expected stalls.  The heuristic is parameterized with
> > > three target-specific values:
> > > 
> > >  * Density threshold: The heuristic will apply only when the
> > >    percentage of inside-loop cost attributable to vectorized
> > >    instructions exceeds this value.
> > > 
> > >  * Size threshold: The heuristic will apply only when the
> > >    inside-loop cost exceeds this value.
> > > 
> > >  * Penalty: The inside-loop cost will be increased by this
> > >    percentage value when the heuristic applies.
> > > 
> > > Thus only reasonably large loop bodies that are mostly vectorized
> > > instructions will be affected.
> > > 
> > > By applying only a small percentage bump to the inside-loop cost, the
> > > heuristic will only turn off vectorization for loops that were
> > > considered "barely profitable" to begin with (such as the sphinx3 loop).
> > > So the heuristic is quite conservative and should not affect the vast
> > > majority of vectorization decisions.
> > > 
> > > Together with the new heuristic, this patch reduces the vec_perm and
> > > vec_promote_demote costs for PowerPC to their natural values.
> > > 
> > > I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> > > and verified that no performance regressions occur on SPEC cpu2006.  Is
> > > this ok for trunk?
> > 
> > Hmm.  I don't like this patch or its general idea too much.  Instead
> > I'd like us to move more of the cost model detail to the target, giving
> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > posting the overall idea at some point, but let me repeat it here instead
> > of trying to find that e-mail.
> > 
> > The basic interface of the cost model should be, in targetm.vectorize
> > 
> >   /* Tell the target to start cost analysis of a loop or a basic-block
> >      (if the loop argument is NULL).  Returns an opaque pointer to
> >      target-private data.  */
> >   void *init_cost (struct loop *loop);
> > 
> >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> >   void add_stmt_cost (void *data, unsigned n,
> > 		      vectorized-stmt-kind,
> >                       enum machine_mode vector_mode);
> > 
> >   /* Tell the target to compute and return the cost of the accumulated
> >      statements and free any target-private data.  */
> >   unsigned finish_cost (void *data);
> > 
> > with eventually slightly different signatures for add_stmt_cost
> > (like pass in the original scalar stmt?).
> > 
> > It allows the target, at finish_cost time, to evaluate things like
> > register pressure and resource utilization.
> 
> OK, I'm trying to understand how you would want this built into the
> present structure.  Taking just the loop case for now:
> 
> Judging by your suggested API, we would have to call add_stmt_cost ()
> everywhere that we now call stmt_vinfo_set_inside_of_loop_cost ().  For
> now this would be an additional call, not a replacement, though maybe
> the other goes away eventually.  This allows the target to save more
> data about the vectorized instructions than just an accumulated cost
> number (order and quantity of various kinds of instructions can be
> maintained for better modeling).  Presumably the call to finish_cost
> would be done within vect_estimate_min_profitable_iters () to produce
> the final value of inside_cost for the loop.

Yes.  I didn't look in detail at the difference between inner/outer
costs though.  Maybe that complicates things, maybe not.

> The default target hook for add_stmt_cost would duplicate what we
> currently do for calculating the inside_cost of a statement, and the
> default target hook for finish_cost would just return the sum.

Right.  We should be able to remove STMT_VINFO_INSIDE_OF_LOOP_COST
and maybe STMT_VINFO_OUTSIDE_OF_LOOP_COST and the target would track
cost in whatever way it wants - either by summing on-the-fly or
queuing info from the add_stmt_cost calls to be able to look at the
"whole" thing (plus if it gets the scalar stmt as argument to
add_stmt_cost it knows some of the dependencies of the vector 
instructions, too).

> I'll have to go hunting where the similar code would fit for SLP in a
> basic block.
> 
> If I read you correctly, you don't object to a density heuristic such as
> the one I implemented here, but you want to see such heuristics confined
> to a specific target rather than parameterized for all targets.

Correct.

> How'm I doing?  I want to be sure we're on the same page before I delve
> into this.

Thanks,
Richard.
Richard Guenther - June 11, 2012, 2:48 p.m.
On Mon, 11 Jun 2012, David Edelsohn wrote:

> On Mon, Jun 11, 2012 at 7:40 AM, Richard Guenther <rguenther@suse.de> wrote:
> 
> > Hmm.  I don't like this patch or its general idea too much.  Instead
> > I'd like us to move more of the cost model detail to the target, giving
> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > posting the overall idea at some point, but let me repeat it here instead
> > of trying to find that e-mail.
> 
> Richard,
> 
> Allowing the target to see the entire loop and have more control over
> the cost model is a fine goal, but the current code is using incorrect
> heuristics. Requiring a complete re-write of the auto-vectorizer cost
> model should not be a requirement for fixing the current code.

If fixing the current code adds more "magic" hooks at the wrong spot
then yes, in stage1 I can make you think about a more proper solution
to the issue.

Richard.
David Edelsohn - June 11, 2012, 2:51 p.m.
On Mon, Jun 11, 2012 at 10:48 AM, Richard Guenther <rguenther@suse.de> wrote:
> On Mon, 11 Jun 2012, David Edelsohn wrote:
>
>> On Mon, Jun 11, 2012 at 7:40 AM, Richard Guenther <rguenther@suse.de> wrote:
>>
>> > Hmm.  I don't like this patch or its general idea too much.  Instead
>> > I'd like us to move more of the cost model detail to the target, giving
>> > it a chance to look at the whole loop before deciding on a cost.  ISTR
>> > posting the overall idea at some point, but let me repeat it here instead
>> > of trying to find that e-mail.
>>
>> Richard,
>>
>> Allowing the target to see the entire loop and have more control over
>> the cost model is a fine goal, but the current code is using incorrect
>> heuristics. Requiring a complete re-write of the auto-vectorizer cost
>> model should not be a requirement for fixing the current code.
>
> If fixing the current code adds more "magic" hooks at the wrong spot
> then yes, in stage1 I can make you think about a more proper solution
> to the issue.

First, these are not magic hooks.

Second, I suggest that you need to rephrase "I can make you" and
re-send your reply.

Thanks, David
Richard Guenther - June 11, 2012, 2:55 p.m.
On Mon, 11 Jun 2012, David Edelsohn wrote:

> On Mon, Jun 11, 2012 at 10:48 AM, Richard Guenther <rguenther@suse.de> wrote:
> > On Mon, 11 Jun 2012, David Edelsohn wrote:
> >
> >> On Mon, Jun 11, 2012 at 7:40 AM, Richard Guenther <rguenther@suse.de> wrote:
> >>
> >> > Hmm.  I don't like this patch or its general idea too much.  Instead
> >> > I'd like us to move more of the cost model detail to the target, giving
> >> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> >> > posting the overall idea at some point, but let me repeat it here instead
> >> > of trying to find that e-mail.
> >>
> >> Richard,
> >>
> >> Allowing the target to see the entire loop and have more control over
> >> the cost model is a fine goal, but the current code is using incorrect
> >> heuristics. Requiring a complete re-write of the auto-vectorizer cost
> >> model should not be a requirement for fixing the current code.
> >
> > If fixing the current code adds more "magic" hooks at the wrong spot
> > then yes, in stage1 I can make you think about a more proper solution
> > to the issue.
> 
> First, these are not magic hooks.

Well, they are at least magic numbers and heuristics that apply
generally and not only to the single issue in sphinx.  And in
fact how it works for sphinx _is_ magic.

> Second, I suggest that you need to rephrase "I can make you" and
> re-send your reply.

Sorry for my bad english.  Consider it meaning that I'd rather have
you think about a more proper solution.  That's what patch review
is about after all, no?  Sometimes a complete re-write (which gets
more difficult which each of the patches "enhancing" the not ideal
current state) is the best thing to do.

Richard.
Richard Guenther - June 11, 2012, 2:58 p.m.
On Mon, 11 Jun 2012, Richard Guenther wrote:

> On Mon, 11 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > 
> > > > This patch adds a heuristic to the vectorizer when estimating the
> > > > minimum profitable number of iterations.  The heuristic is
> > > > target-dependent, and is currently disabled for all targets except
> > > > PowerPC.  However, the intent is to make it general enough to be useful
> > > > for other targets that want to opt in.
> > > > 
> > > > A previous patch addressed some PowerPC SPEC degradations by modifying
> > > > the vector cost model values for vec_perm and vec_promote_demote.  The
> > > > values were set a little higher than their natural values because the
> > > > natural values were not sufficient to prevent a poor vectorization
> > > > choice.  However, this is not the right long-term solution, since it can
> > > > unnecessarily constrain other vectorization choices involving permute
> > > > instructions.
> > > > 
> > > > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > > > problem was overcommitment of vector resources -- too many vector
> > > > instructions issued without enough non-vector instructions available to
> > > > cover the delays.  The vector cost model assumes that instructions
> > > > always have a constant cost, and doesn't have a way of judging this kind
> > > > of "density" of vector instructions.
> > > > 
> > > > The present patch adds a heuristic to recognize when a loop is likely to
> > > > overcommit resources, and adds a small penalty to the inside-loop cost
> > > > to account for the expected stalls.  The heuristic is parameterized with
> > > > three target-specific values:
> > > > 
> > > >  * Density threshold: The heuristic will apply only when the
> > > >    percentage of inside-loop cost attributable to vectorized
> > > >    instructions exceeds this value.
> > > > 
> > > >  * Size threshold: The heuristic will apply only when the
> > > >    inside-loop cost exceeds this value.
> > > > 
> > > >  * Penalty: The inside-loop cost will be increased by this
> > > >    percentage value when the heuristic applies.
> > > > 
> > > > Thus only reasonably large loop bodies that are mostly vectorized
> > > > instructions will be affected.
> > > > 
> > > > By applying only a small percentage bump to the inside-loop cost, the
> > > > heuristic will only turn off vectorization for loops that were
> > > > considered "barely profitable" to begin with (such as the sphinx3 loop).
> > > > So the heuristic is quite conservative and should not affect the vast
> > > > majority of vectorization decisions.
> > > > 
> > > > Together with the new heuristic, this patch reduces the vec_perm and
> > > > vec_promote_demote costs for PowerPC to their natural values.
> > > > 
> > > > I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> > > > and verified that no performance regressions occur on SPEC cpu2006.  Is
> > > > this ok for trunk?
> > > 
> > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > I'd like us to move more of the cost model detail to the target, giving
> > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > posting the overall idea at some point, but let me repeat it here instead
> > > of trying to find that e-mail.
> > > 
> > > The basic interface of the cost model should be, in targetm.vectorize
> > > 
> > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > >      target-private data.  */
> > >   void *init_cost (struct loop *loop);
> > > 
> > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > >   void add_stmt_cost (void *data, unsigned n,
> > > 		      vectorized-stmt-kind,
> > >                       enum machine_mode vector_mode);
> > > 
> > >   /* Tell the target to compute and return the cost of the accumulated
> > >      statements and free any target-private data.  */
> > >   unsigned finish_cost (void *data);
> > > 
> > > with eventually slightly different signatures for add_stmt_cost
> > > (like pass in the original scalar stmt?).
> > > 
> > > It allows the target, at finish_cost time, to evaluate things like
> > > register pressure and resource utilization.
> > 
> > OK, I'm trying to understand how you would want this built into the
> > present structure.  Taking just the loop case for now:
> > 
> > Judging by your suggested API, we would have to call add_stmt_cost ()
> > everywhere that we now call stmt_vinfo_set_inside_of_loop_cost ().  For
> > now this would be an additional call, not a replacement, though maybe
> > the other goes away eventually.  This allows the target to save more
> > data about the vectorized instructions than just an accumulated cost
> > number (order and quantity of various kinds of instructions can be
> > maintained for better modeling).  Presumably the call to finish_cost
> > would be done within vect_estimate_min_profitable_iters () to produce
> > the final value of inside_cost for the loop.
> 
> Yes.  I didn't look in detail at the difference between inner/outer
> costs though.  Maybe that complicates things, maybe not.
> 
> > The default target hook for add_stmt_cost would duplicate what we
> > currently do for calculating the inside_cost of a statement, and the
> > default target hook for finish_cost would just return the sum.
> 
> Right.  We should be able to remove STMT_VINFO_INSIDE_OF_LOOP_COST
> and maybe STMT_VINFO_OUTSIDE_OF_LOOP_COST and the target would track
> cost in whatever way it wants - either by summing on-the-fly or
> queuing info from the add_stmt_cost calls to be able to look at the
> "whole" thing (plus if it gets the scalar stmt as argument to
> add_stmt_cost it knows some of the dependencies of the vector 
> instructions, too).
> 
> > I'll have to go hunting where the similar code would fit for SLP in a
> > basic block.
> > 
> > If I read you correctly, you don't object to a density heuristic such as
> > the one I implemented here, but you want to see such heuristics confined
> > to a specific target rather than parameterized for all targets.
> 
> Correct.
> 
> > How'm I doing?  I want to be sure we're on the same page before I delve
> > into this.

Btw, a more "incremental" variant would be to add the init_cost/fini_cost
hooks and only add the void * state parameter to the existing target hook.
The target can then initialize a counter in init_cost and increment
it in the stmt_cost hook and return artificially high costs after, say,
the fourth vector shift (I forgot which vector instruction was the
issue for PPC).  Basically the idea was to give the target a "state"
to work on at stmt_cost time (or ideally at fini_cost time).

Richard.
David Edelsohn - June 11, 2012, 3:09 p.m.
On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther <rguenther@suse.de> wrote:

> Well, they are at least magic numbers and heuristics that apply
> generally and not only to the single issue in sphinx.  And in
> fact how it works for sphinx _is_ magic.
>
>> Second, I suggest that you need to rephrase "I can make you" and
>> re-send your reply.
>
> Sorry for my bad english.  Consider it meaning that I'd rather have
> you think about a more proper solution.  That's what patch review
> is about after all, no?  Sometimes a complete re-write (which gets
> more difficult which each of the patches "enhancing" the not ideal
> current state) is the best thing to do.

Richard,

The values of the heuristics may be "magic", but Bill believes the
heuristics are testing the important characteristics.  The heuristics
themselves are controlled by hooks, so the target can set the correct
values for their own requirements.

The concern is that a general cost infrastructure is too general.
And, based on history, all ports simply will copy the boilerplate from
the first implementation. It also may cause more problems because the
target has relatively little information to be able to judge
heuristics at that point in the middle-end. If the targets start to
get too "cute" or too complicated, it may cause more problems or more
confusion about why more complicated heuristics are not effective and
not producing the expected results.

I worry about creating another machine dependent reorg catch-all pass.

Maybe an incremental pre- and/or post- cost hook would be more
effective. I will let Bill comment.

Thanks, David
William J. Schmidt - June 11, 2012, 3:17 p.m.
On Mon, 2012-06-11 at 16:58 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, Richard Guenther wrote:
> 
> > On Mon, 11 Jun 2012, William J. Schmidt wrote:
> > 
> > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > 
> > > > > This patch adds a heuristic to the vectorizer when estimating the
> > > > > minimum profitable number of iterations.  The heuristic is
> > > > > target-dependent, and is currently disabled for all targets except
> > > > > PowerPC.  However, the intent is to make it general enough to be useful
> > > > > for other targets that want to opt in.
> > > > > 
> > > > > A previous patch addressed some PowerPC SPEC degradations by modifying
> > > > > the vector cost model values for vec_perm and vec_promote_demote.  The
> > > > > values were set a little higher than their natural values because the
> > > > > natural values were not sufficient to prevent a poor vectorization
> > > > > choice.  However, this is not the right long-term solution, since it can
> > > > > unnecessarily constrain other vectorization choices involving permute
> > > > > instructions.
> > > > > 
> > > > > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > > > > problem was overcommitment of vector resources -- too many vector
> > > > > instructions issued without enough non-vector instructions available to
> > > > > cover the delays.  The vector cost model assumes that instructions
> > > > > always have a constant cost, and doesn't have a way of judging this kind
> > > > > of "density" of vector instructions.
> > > > > 
> > > > > The present patch adds a heuristic to recognize when a loop is likely to
> > > > > overcommit resources, and adds a small penalty to the inside-loop cost
> > > > > to account for the expected stalls.  The heuristic is parameterized with
> > > > > three target-specific values:
> > > > > 
> > > > >  * Density threshold: The heuristic will apply only when the
> > > > >    percentage of inside-loop cost attributable to vectorized
> > > > >    instructions exceeds this value.
> > > > > 
> > > > >  * Size threshold: The heuristic will apply only when the
> > > > >    inside-loop cost exceeds this value.
> > > > > 
> > > > >  * Penalty: The inside-loop cost will be increased by this
> > > > >    percentage value when the heuristic applies.
> > > > > 
> > > > > Thus only reasonably large loop bodies that are mostly vectorized
> > > > > instructions will be affected.
> > > > > 
> > > > > By applying only a small percentage bump to the inside-loop cost, the
> > > > > heuristic will only turn off vectorization for loops that were
> > > > > considered "barely profitable" to begin with (such as the sphinx3 loop).
> > > > > So the heuristic is quite conservative and should not affect the vast
> > > > > majority of vectorization decisions.
> > > > > 
> > > > > Together with the new heuristic, this patch reduces the vec_perm and
> > > > > vec_promote_demote costs for PowerPC to their natural values.
> > > > > 
> > > > > I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> > > > > and verified that no performance regressions occur on SPEC cpu2006.  Is
> > > > > this ok for trunk?
> > > > 
> > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > I'd like us to move more of the cost model detail to the target, giving
> > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > posting the overall idea at some point, but let me repeat it here instead
> > > > of trying to find that e-mail.
> > > > 
> > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > 
> > > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > > >      target-private data.  */
> > > >   void *init_cost (struct loop *loop);
> > > > 
> > > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > > >   void add_stmt_cost (void *data, unsigned n,
> > > > 		      vectorized-stmt-kind,
> > > >                       enum machine_mode vector_mode);
> > > > 
> > > >   /* Tell the target to compute and return the cost of the accumulated
> > > >      statements and free any target-private data.  */
> > > >   unsigned finish_cost (void *data);
> > > > 
> > > > with eventually slightly different signatures for add_stmt_cost
> > > > (like pass in the original scalar stmt?).
> > > > 
> > > > It allows the target, at finish_cost time, to evaluate things like
> > > > register pressure and resource utilization.
> > > 
> > > OK, I'm trying to understand how you would want this built into the
> > > present structure.  Taking just the loop case for now:
> > > 
> > > Judging by your suggested API, we would have to call add_stmt_cost ()
> > > everywhere that we now call stmt_vinfo_set_inside_of_loop_cost ().  For
> > > now this would be an additional call, not a replacement, though maybe
> > > the other goes away eventually.  This allows the target to save more
> > > data about the vectorized instructions than just an accumulated cost
> > > number (order and quantity of various kinds of instructions can be
> > > maintained for better modeling).  Presumably the call to finish_cost
> > > would be done within vect_estimate_min_profitable_iters () to produce
> > > the final value of inside_cost for the loop.
> > 
> > Yes.  I didn't look in detail at the difference between inner/outer
> > costs though.  Maybe that complicates things, maybe not.
> > 
> > > The default target hook for add_stmt_cost would duplicate what we
> > > currently do for calculating the inside_cost of a statement, and the
> > > default target hook for finish_cost would just return the sum.
> > 
> > Right.  We should be able to remove STMT_VINFO_INSIDE_OF_LOOP_COST
> > and maybe STMT_VINFO_OUTSIDE_OF_LOOP_COST and the target would track
> > cost in whatever way it wants - either by summing on-the-fly or
> > queuing info from the add_stmt_cost calls to be able to look at the
> > "whole" thing (plus if it gets the scalar stmt as argument to
> > add_stmt_cost it knows some of the dependencies of the vector 
> > instructions, too).
> > 
> > > I'll have to go hunting where the similar code would fit for SLP in a
> > > basic block.
> > > 
> > > If I read you correctly, you don't object to a density heuristic such as
> > > the one I implemented here, but you want to see such heuristics confined
> > > to a specific target rather than parameterized for all targets.
> > 
> > Correct.
> > 
> > > How'm I doing?  I want to be sure we're on the same page before I delve
> > > into this.
> 
> Btw, a more "incremental" variant would be to add the init_cost/fini_cost
> hooks and only add the void * state parameter to the existing target hook.
> The target can then initialize a counter in init_cost and increment
> it in the stmt_cost hook and return artificially high costs after, say,
> the fourth vector shift (I forgot which vector instruction was the
> issue for PPC).  Basically the idea was to give the target a "state"
> to work on at stmt_cost time (or ideally at fini_cost time).

Yeah, this was what we suspected the problem was (overuse of vector
permute), but when I delved into the hardware simulator it wasn't
correct.  The permutes are restricted to one pipe, so we thought
imbalances were occurring, but that wasn't the case.  It turned out that
the instruction issue unit was getting overwhelmed by too many
instructions for both vector pipelines, with nothing else available to
cover the delays.  Hence the motivation for this heuristic to tweak the
costs when that is likely to happen.  There's some magic here, but it's
reasonably well-understood magic. :)

So unfortunately I can't use a simple counter to solve this problem with
any accuracy.

Thanks,
Bill

> 
> Richard.
>
William J. Schmidt - June 11, 2012, 3:38 p.m.
On Mon, 2012-06-11 at 11:09 -0400, David Edelsohn wrote:
> On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther <rguenther@suse.de> wrote:
> 
> > Well, they are at least magic numbers and heuristics that apply
> > generally and not only to the single issue in sphinx.  And in
> > fact how it works for sphinx _is_ magic.
> >
> >> Second, I suggest that you need to rephrase "I can make you" and
> >> re-send your reply.
> >
> > Sorry for my bad english.  Consider it meaning that I'd rather have
> > you think about a more proper solution.  That's what patch review
> > is about after all, no?  Sometimes a complete re-write (which gets
> > more difficult which each of the patches "enhancing" the not ideal
> > current state) is the best thing to do.
> 
> Richard,
> 
> The values of the heuristics may be "magic", but Bill believes the
> heuristics are testing the important characteristics.  The heuristics
> themselves are controlled by hooks, so the target can set the correct
> values for their own requirements.
> 
> The concern is that a general cost infrastructure is too general.
> And, based on history, all ports simply will copy the boilerplate from
> the first implementation. It also may cause more problems because the
> target has relatively little information to be able to judge
> heuristics at that point in the middle-end. If the targets start to
> get too "cute" or too complicated, it may cause more problems or more
> confusion about why more complicated heuristics are not effective and
> not producing the expected results.
> 
> I worry about creating another machine dependent reorg catch-all pass.
> 
> Maybe an incremental pre- and/or post- cost hook would be more
> effective. I will let Bill comment.

Thanks David,

I can see both sides of this, and it's hard to judge the future from
where I stand.  My belief is that the number of heuristics targets will
implement will be fairly limited, since judgments about cycle-level
costs are not accurately predictable during the middle end.  All we can
do is come up with a few things that seem to make sense.  Doing too much
in the back end seems impractical.

The interesting question to me is whether cost model heuristics are
general enough to be reusable.  What I saw in this case was what I
considered to be a somewhat target-neutral problem:  overwhelming those
assets of the processor that implement vectorization.  It seemed
reasonable to provide hooks for others to use the idea if they encounter
similar issues.  If reusing the heuristic is useful, then having to copy
the logic from one target to another isn't the best approach.  If nobody
else will ever use it, then embedding it in the back end is reasonable.
Unfortunately my crystal ball has been on the fritz for several decades,
so I can't tell you for sure which is right...

Richard, my biggest question is whether you think other targets are
likely to take advantage of a more general back-end interface, or
whether this will end up just being a PowerPC wart.  If you know of ways
this will be useful for i386, that would be helpful to know.  Perhaps
this requires your crystal ball as well; not sure how well yours
works...

If we look at just this one issue in isolation, then changing all the
code in the vectorizer that calculates inside/outside loop costs and
moving it to targetm seems more invasive than adding the few hooks.  But
if this will really be a useful feature for the community as a whole I
am certainly willing to tackle it.

Thanks,
Bill

> 
> Thanks, David
>
Richard Guenther - June 11, 2012, 6:27 p.m.
On Mon, Jun 11, 2012 at 5:38 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
>
>
> On Mon, 2012-06-11 at 11:09 -0400, David Edelsohn wrote:
>> On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther <rguenther@suse.de> wrote:
>>
>> > Well, they are at least magic numbers and heuristics that apply
>> > generally and not only to the single issue in sphinx.  And in
>> > fact how it works for sphinx _is_ magic.
>> >
>> >> Second, I suggest that you need to rephrase "I can make you" and
>> >> re-send your reply.
>> >
>> > Sorry for my bad english.  Consider it meaning that I'd rather have
>> > you think about a more proper solution.  That's what patch review
>> > is about after all, no?  Sometimes a complete re-write (which gets
>> > more difficult which each of the patches "enhancing" the not ideal
>> > current state) is the best thing to do.
>>
>> Richard,
>>
>> The values of the heuristics may be "magic", but Bill believes the
>> heuristics are testing the important characteristics.  The heuristics
>> themselves are controlled by hooks, so the target can set the correct
>> values for their own requirements.
>>
>> The concern is that a general cost infrastructure is too general.
>> And, based on history, all ports simply will copy the boilerplate from
>> the first implementation. It also may cause more problems because the
>> target has relatively little information to be able to judge
>> heuristics at that point in the middle-end. If the targets start to
>> get too "cute" or too complicated, it may cause more problems or more
>> confusion about why more complicated heuristics are not effective and
>> not producing the expected results.
>>
>> I worry about creating another machine dependent reorg catch-all pass.
>>
>> Maybe an incremental pre- and/or post- cost hook would be more
>> effective. I will let Bill comment.
>
> Thanks David,
>
> I can see both sides of this, and it's hard to judge the future from
> where I stand.  My belief is that the number of heuristics targets will
> implement will be fairly limited, since judgments about cycle-level
> costs are not accurately predictable during the middle end.  All we can
> do is come up with a few things that seem to make sense.  Doing too much
> in the back end seems impractical.
>
> The interesting question to me is whether cost model heuristics are
> general enough to be reusable.  What I saw in this case was what I
> considered to be a somewhat target-neutral problem:  overwhelming those
> assets of the processor that implement vectorization.  It seemed
> reasonable to provide hooks for others to use the idea if they encounter
> similar issues.  If reusing the heuristic is useful, then having to copy
> the logic from one target to another isn't the best approach.  If nobody
> else will ever use it, then embedding it in the back end is reasonable.
> Unfortunately my crystal ball has been on the fritz for several decades,
> so I can't tell you for sure which is right...
>
> Richard, my biggest question is whether you think other targets are
> likely to take advantage of a more general back-end interface, or
> whether this will end up just being a PowerPC wart.  If you know of ways
> this will be useful for i386, that would be helpful to know.  Perhaps
> this requires your crystal ball as well; not sure how well yours
> works...
>
> If we look at just this one issue in isolation, then changing all the
> code in the vectorizer that calculates inside/outside loop costs and
> moving it to targetm seems more invasive than adding the few hooks.  But
> if this will really be a useful feature for the community as a whole I
> am certainly willing to tackle it.

Well.  At the moment the cost model is purely "local" in that the target
receives no context about the cost it has to return for (part of) a vectorized
operation.  In fact the vectorizer already creates multiple
instructions (and thus
calls to the cost target hook) for a single scalar statement in some cases.
There are two ways to increase the context the target can look at - invent
more kinds of "statements" we pass to the cost model hook, and thus
arrive at a point where for each group of stmts the vectorizer emits and
that possibly has a more efficient target implementation we have a different
statement kind (only for the cost model, we continue to generate multiple
GIMPLE stmts).  The other way is to give the target the ability to see context,
thus relation of the (sub-)statements the vectorizer asks the target to compute
costs for.  I've run into the situation where there was missing context on x86,
too (but I forgot the specific case), and that's where I came up with
the general idea.
Of course both work, but for your kind of heuristic only looking at
context works.
In the scheme I suggested the fini_cost hook would be able to do the counting
(as would a simple counter - I don't believe that this would not work - you do
counting, too, after all, simply make the fini hook able to reject the
vectorization).
If powerpc is issue-limited then you have the additional issue that you have no
idea of the amount of address calculations involved (as IVOPTs only
runs afterwards).
Not sure if those would fit for the intermediate work.

Thanks,
Richard.

> Thanks,
> Bill
>
>>
>> Thanks, David
>>
>
William J. Schmidt - June 18, 2012, 6:49 p.m.
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> On Fri, 8 Jun 2012, William J. Schmidt wrote:
> 
<snip>
> 
> Hmm.  I don't like this patch or its general idea too much.  Instead
> I'd like us to move more of the cost model detail to the target, giving
> it a chance to look at the whole loop before deciding on a cost.  ISTR
> posting the overall idea at some point, but let me repeat it here instead
> of trying to find that e-mail.
> 
> The basic interface of the cost model should be, in targetm.vectorize
> 
>   /* Tell the target to start cost analysis of a loop or a basic-block
>      (if the loop argument is NULL).  Returns an opaque pointer to
>      target-private data.  */
>   void *init_cost (struct loop *loop);
> 
>   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
>   void add_stmt_cost (void *data, unsigned n,
> 		      vectorized-stmt-kind,
>                       enum machine_mode vector_mode);
> 
>   /* Tell the target to compute and return the cost of the accumulated
>      statements and free any target-private data.  */
>   unsigned finish_cost (void *data);
> 
> with eventually slightly different signatures for add_stmt_cost
> (like pass in the original scalar stmt?).
> 
> It allows the target, at finish_cost time, to evaluate things like
> register pressure and resource utilization.
> 
> Thanks,
> Richard.

I've been looking at this in between other projects.  I wanted to be
sure I understood the SLP infrastructure and whether it would cause any
problems.  It looks to me like it will be mostly ok.  One issue I
noticed is a possible difference in the order in which SLP instructions
are analyzed and the order in which the instructions are "issued" during
transformation.

For both loop analysis and basic block analysis, SLP trees are
constructed and analyzed prior to examining other vectorizable
instructions.  Their costs are calculated and stored in the SLP trees at
this time.  Later, when transforming statements to their vector
equivalents, instructions in the block (or loop body) are processed in
order until the first instruction that's part of an SLP tree is
encountered.  At that point, every instruction that's part of any SLP
tree is transformed; then the vectorizer continues with the remaining
non-SLP vectorizable statements.

So if we do the natural and easy thing of placing calls to add_stmt_cost
everywhere that costs are calculated today, the order that those costs
are presented to the back end model will possibly be different than the
order they are actually "emitted."

For a first cut at this, I suggest ignoring the problem other than to
document it as an opportunity for improvement.  Later we could improve
it by using an add_stmt_slp_cost () interface (or adding an is_slp
flag), and another interface to be called at the time during analysis
when the SLP statements will be issued during transformation.  This
would allow the back end model to queue up the SLP costs in a separate
vector and later place them in its internal structures at the
appropriate place.

It should eventually be possible to remove these fields/accessors:

 * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
 * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
 * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST

However, I think this should be delayed until we have the basic
infrastructure in place for the new model and well-tested.

The other issue is that we should have the model track both the inside
and outside costs if we're going to get everything into the target
model.  For a first pass we can ignore this and keep the existing logic
for the outside costs.  Later we should add some interfaces analogous to
add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so
the model can track this stuff as carefully as it wants to.

So, I'd propose going at this in several phases:

(1) Add calls to the new interface without disturbing existing logic;
modify the profitability algorithms to query the new model for inside
costs.  Default algorithm for the model is to just sum costs as is done
today.
(x) Add heuristics to target models as desired.
(2) Handle the SLP ordering problem.
(3) Handle outside costs in the target model.
(4) Remove the now unnecessary cost fields and the calls that set them.

Item (x) can happen anytime after item (1).

I don't think this work is terribly difficult, just a bit tedious.  The
only really time-consuming aspect of it will be in very careful testing
to keep from changing existing behavior.

All comments welcome -- please let me know what you think.

Thanks,
Bill
William J. Schmidt - June 18, 2012, 7:41 p.m.
On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > 
> <snip>
> > 
> > Hmm.  I don't like this patch or its general idea too much.  Instead
> > I'd like us to move more of the cost model detail to the target, giving
> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > posting the overall idea at some point, but let me repeat it here instead
> > of trying to find that e-mail.
> > 
> > The basic interface of the cost model should be, in targetm.vectorize
> > 
> >   /* Tell the target to start cost analysis of a loop or a basic-block
> >      (if the loop argument is NULL).  Returns an opaque pointer to
> >      target-private data.  */
> >   void *init_cost (struct loop *loop);
> > 
> >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> >   void add_stmt_cost (void *data, unsigned n,
> > 		      vectorized-stmt-kind,
> >                       enum machine_mode vector_mode);
> > 
> >   /* Tell the target to compute and return the cost of the accumulated
> >      statements and free any target-private data.  */
> >   unsigned finish_cost (void *data);

By the way, I don't see much point in passing the void *data around
here.  Too many levels of interfaces that we'd have to pass it around in
the vectorizer, so it would just sit in a static variable.  Might as
well let the data be wholly private to the target.
> > 
> > with eventually slightly different signatures for add_stmt_cost
> > (like pass in the original scalar stmt?).
> > 
> > It allows the target, at finish_cost time, to evaluate things like
> > register pressure and resource utilization.
> > 
> > Thanks,
> > Richard.
Richard Guenther - June 19, 2012, 10:08 a.m.
On Mon, 18 Jun 2012, William J. Schmidt wrote:

> On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > 
> <snip>
> > 
> > Hmm.  I don't like this patch or its general idea too much.  Instead
> > I'd like us to move more of the cost model detail to the target, giving
> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > posting the overall idea at some point, but let me repeat it here instead
> > of trying to find that e-mail.
> > 
> > The basic interface of the cost model should be, in targetm.vectorize
> > 
> >   /* Tell the target to start cost analysis of a loop or a basic-block
> >      (if the loop argument is NULL).  Returns an opaque pointer to
> >      target-private data.  */
> >   void *init_cost (struct loop *loop);
> > 
> >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> >   void add_stmt_cost (void *data, unsigned n,
> > 		      vectorized-stmt-kind,
> >                       enum machine_mode vector_mode);
> > 
> >   /* Tell the target to compute and return the cost of the accumulated
> >      statements and free any target-private data.  */
> >   unsigned finish_cost (void *data);
> > 
> > with eventually slightly different signatures for add_stmt_cost
> > (like pass in the original scalar stmt?).
> > 
> > It allows the target, at finish_cost time, to evaluate things like
> > register pressure and resource utilization.
> > 
> > Thanks,
> > Richard.
> 
> I've been looking at this in between other projects.  I wanted to be
> sure I understood the SLP infrastructure and whether it would cause any
> problems.  It looks to me like it will be mostly ok.  One issue I
> noticed is a possible difference in the order in which SLP instructions
> are analyzed and the order in which the instructions are "issued" during
> transformation.
> 
> For both loop analysis and basic block analysis, SLP trees are
> constructed and analyzed prior to examining other vectorizable
> instructions.  Their costs are calculated and stored in the SLP trees at
> this time.  Later, when transforming statements to their vector
> equivalents, instructions in the block (or loop body) are processed in
> order until the first instruction that's part of an SLP tree is
> encountered.  At that point, every instruction that's part of any SLP
> tree is transformed; then the vectorizer continues with the remaining
> non-SLP vectorizable statements.
> 
> So if we do the natural and easy thing of placing calls to add_stmt_cost
> everywhere that costs are calculated today, the order that those costs
> are presented to the back end model will possibly be different than the
> order they are actually "emitted."

Interesting.  But I suppose this is similar to how pattern statements
are handled?  Thus, the whole pattern sequence is processed when
we encounter the "main" pattern statement?

> For a first cut at this, I suggest ignoring the problem other than to
> document it as an opportunity for improvement.  Later we could improve
> it by using an add_stmt_slp_cost () interface (or adding an is_slp
> flag), and another interface to be called at the time during analysis
> when the SLP statements will be issued during transformation.  This
> would allow the back end model to queue up the SLP costs in a separate
> vector and later place them in its internal structures at the
> appropriate place.
>
> It should eventually be possible to remove these fields/accessors:
> 
>  * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
>  * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
>  * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST
> 
> However, I think this should be delayed until we have the basic
> infrastructure in place for the new model and well-tested.

Indeed.
 
> The other issue is that we should have the model track both the inside
> and outside costs if we're going to get everything into the target
> model.  For a first pass we can ignore this and keep the existing logic
> for the outside costs.  Later we should add some interfaces analogous to
> add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so
> the model can track this stuff as carefully as it wants to.

Outside costs are merely added to the niter * inner-cost metric to
be compared with the scalar cost niter * scalar-cost, right?  Thus
they would be tracked completely separate - eventually similar to
how we compute the cost of the scalar loop.

> So, I'd propose going at this in several phases:
> 
> (1) Add calls to the new interface without disturbing existing logic;
> modify the profitability algorithms to query the new model for inside
> costs.  Default algorithm for the model is to just sum costs as is done
> today.

Right.

> (x) Add heuristics to target models as desired.
> (2) Handle the SLP ordering problem.
> (3) Handle outside costs in the target model.
> (4) Remove the now unnecessary cost fields and the calls that set them.
> 
> Item (x) can happen anytime after item (1).
> 
> I don't think this work is terribly difficult, just a bit tedious.  The
> only really time-consuming aspect of it will be in very careful testing
> to keep from changing existing behavior.
> 
> All comments welcome -- please let me know what you think.

I think this is a reasonable plan.

Thanks,
Richard.
Richard Guenther - June 19, 2012, 10:10 a.m.
On Mon, 18 Jun 2012, William J. Schmidt wrote:

> On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > 
> > <snip>
> > > 
> > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > I'd like us to move more of the cost model detail to the target, giving
> > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > posting the overall idea at some point, but let me repeat it here instead
> > > of trying to find that e-mail.
> > > 
> > > The basic interface of the cost model should be, in targetm.vectorize
> > > 
> > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > >      target-private data.  */
> > >   void *init_cost (struct loop *loop);
> > > 
> > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > >   void add_stmt_cost (void *data, unsigned n,
> > > 		      vectorized-stmt-kind,
> > >                       enum machine_mode vector_mode);
> > > 
> > >   /* Tell the target to compute and return the cost of the accumulated
> > >      statements and free any target-private data.  */
> > >   unsigned finish_cost (void *data);
> 
> By the way, I don't see much point in passing the void *data around
> here.  Too many levels of interfaces that we'd have to pass it around in
> the vectorizer, so it would just sit in a static variable.  Might as
> well let the data be wholly private to the target.

Ok, so you'd have void init_cost (struct loop *) and
unsigned finish_cost (void); then?  Static variables are of couse
not properly "abstracted" so we can't ever compute two set of costs
at the same time ... but that's true all-over-the-place in GCC ...

With previous discussion the add_stmt_cost hook would be split up
to also allow passing the operation code for example.

Richard.
William J. Schmidt - June 19, 2012, 12:20 p.m.
On Tue, 2012-06-19 at 12:08 +0200, Richard Guenther wrote:
> On Mon, 18 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > 
> > <snip>
> > > 
> > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > I'd like us to move more of the cost model detail to the target, giving
> > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > posting the overall idea at some point, but let me repeat it here instead
> > > of trying to find that e-mail.
> > > 
> > > The basic interface of the cost model should be, in targetm.vectorize
> > > 
> > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > >      target-private data.  */
> > >   void *init_cost (struct loop *loop);
> > > 
> > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > >   void add_stmt_cost (void *data, unsigned n,
> > > 		      vectorized-stmt-kind,
> > >                       enum machine_mode vector_mode);
> > > 
> > >   /* Tell the target to compute and return the cost of the accumulated
> > >      statements and free any target-private data.  */
> > >   unsigned finish_cost (void *data);
> > > 
> > > with eventually slightly different signatures for add_stmt_cost
> > > (like pass in the original scalar stmt?).
> > > 
> > > It allows the target, at finish_cost time, to evaluate things like
> > > register pressure and resource utilization.
> > > 
> > > Thanks,
> > > Richard.
> > 
> > I've been looking at this in between other projects.  I wanted to be
> > sure I understood the SLP infrastructure and whether it would cause any
> > problems.  It looks to me like it will be mostly ok.  One issue I
> > noticed is a possible difference in the order in which SLP instructions
> > are analyzed and the order in which the instructions are "issued" during
> > transformation.
> > 
> > For both loop analysis and basic block analysis, SLP trees are
> > constructed and analyzed prior to examining other vectorizable
> > instructions.  Their costs are calculated and stored in the SLP trees at
> > this time.  Later, when transforming statements to their vector
> > equivalents, instructions in the block (or loop body) are processed in
> > order until the first instruction that's part of an SLP tree is
> > encountered.  At that point, every instruction that's part of any SLP
> > tree is transformed; then the vectorizer continues with the remaining
> > non-SLP vectorizable statements.
> > 
> > So if we do the natural and easy thing of placing calls to add_stmt_cost
> > everywhere that costs are calculated today, the order that those costs
> > are presented to the back end model will possibly be different than the
> > order they are actually "emitted."
> 
> Interesting.  But I suppose this is similar to how pattern statements
> are handled?  Thus, the whole pattern sequence is processed when
> we encounter the "main" pattern statement?

Yes, but the difference is that both vect_analyze_stmt and
vect_transform_loop handle the pattern statements in the same order
(thankfully -- I would hate to have to deal with the pattern mess).
With SLP, all SLP statements are analyzed ahead of time, but they aren't
transformed until one of them is encountered in the statement walk.

> 
> > For a first cut at this, I suggest ignoring the problem other than to
> > document it as an opportunity for improvement.  Later we could improve
> > it by using an add_stmt_slp_cost () interface (or adding an is_slp
> > flag), and another interface to be called at the time during analysis
> > when the SLP statements will be issued during transformation.  This
> > would allow the back end model to queue up the SLP costs in a separate
> > vector and later place them in its internal structures at the
> > appropriate place.
> >
> > It should eventually be possible to remove these fields/accessors:
> > 
> >  * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
> >  * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
> >  * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST
> > 
> > However, I think this should be delayed until we have the basic
> > infrastructure in place for the new model and well-tested.
> 
> Indeed.
> 
> > The other issue is that we should have the model track both the inside
> > and outside costs if we're going to get everything into the target
> > model.  For a first pass we can ignore this and keep the existing logic
> > for the outside costs.  Later we should add some interfaces analogous to
> > add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so
> > the model can track this stuff as carefully as it wants to.
> 
> Outside costs are merely added to the niter * inner-cost metric to
> be compared with the scalar cost niter * scalar-cost, right?  Thus
> they would be tracked completely separate - eventually similar to
> how we compute the cost of the scalar loop.

Yes, that's the way they're used today, and probably nobody will ever
want to get fancier than that.  But as you say, the idea would be to let
them be tracked similarly, but in two more pieces: prolog and epilog.

> 
> > So, I'd propose going at this in several phases:
> > 
> > (1) Add calls to the new interface without disturbing existing logic;
> > modify the profitability algorithms to query the new model for inside
> > costs.  Default algorithm for the model is to just sum costs as is done
> > today.
> 
> Right.
> 
> > (x) Add heuristics to target models as desired.
> > (2) Handle the SLP ordering problem.
> > (3) Handle outside costs in the target model.
> > (4) Remove the now unnecessary cost fields and the calls that set them.
> > 
> > Item (x) can happen anytime after item (1).
> > 
> > I don't think this work is terribly difficult, just a bit tedious.  The
> > only really time-consuming aspect of it will be in very careful testing
> > to keep from changing existing behavior.
> > 
> > All comments welcome -- please let me know what you think.
> 
> I think this is a reasonable plan.
> 
> Thanks,
> Richard.
>
Richard Guenther - June 19, 2012, 12:22 p.m.
On Tue, 19 Jun 2012, William J. Schmidt wrote:

> On Tue, 2012-06-19 at 12:08 +0200, Richard Guenther wrote:
> > On Mon, 18 Jun 2012, William J. Schmidt wrote:
> > 
> > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > 
> > > <snip>
> > > > 
> > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > I'd like us to move more of the cost model detail to the target, giving
> > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > posting the overall idea at some point, but let me repeat it here instead
> > > > of trying to find that e-mail.
> > > > 
> > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > 
> > > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > > >      target-private data.  */
> > > >   void *init_cost (struct loop *loop);
> > > > 
> > > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > > >   void add_stmt_cost (void *data, unsigned n,
> > > > 		      vectorized-stmt-kind,
> > > >                       enum machine_mode vector_mode);
> > > > 
> > > >   /* Tell the target to compute and return the cost of the accumulated
> > > >      statements and free any target-private data.  */
> > > >   unsigned finish_cost (void *data);
> > > > 
> > > > with eventually slightly different signatures for add_stmt_cost
> > > > (like pass in the original scalar stmt?).
> > > > 
> > > > It allows the target, at finish_cost time, to evaluate things like
> > > > register pressure and resource utilization.
> > > > 
> > > > Thanks,
> > > > Richard.
> > > 
> > > I've been looking at this in between other projects.  I wanted to be
> > > sure I understood the SLP infrastructure and whether it would cause any
> > > problems.  It looks to me like it will be mostly ok.  One issue I
> > > noticed is a possible difference in the order in which SLP instructions
> > > are analyzed and the order in which the instructions are "issued" during
> > > transformation.
> > > 
> > > For both loop analysis and basic block analysis, SLP trees are
> > > constructed and analyzed prior to examining other vectorizable
> > > instructions.  Their costs are calculated and stored in the SLP trees at
> > > this time.  Later, when transforming statements to their vector
> > > equivalents, instructions in the block (or loop body) are processed in
> > > order until the first instruction that's part of an SLP tree is
> > > encountered.  At that point, every instruction that's part of any SLP
> > > tree is transformed; then the vectorizer continues with the remaining
> > > non-SLP vectorizable statements.
> > > 
> > > So if we do the natural and easy thing of placing calls to add_stmt_cost
> > > everywhere that costs are calculated today, the order that those costs
> > > are presented to the back end model will possibly be different than the
> > > order they are actually "emitted."
> > 
> > Interesting.  But I suppose this is similar to how pattern statements
> > are handled?  Thus, the whole pattern sequence is processed when
> > we encounter the "main" pattern statement?
> 
> Yes, but the difference is that both vect_analyze_stmt and
> vect_transform_loop handle the pattern statements in the same order
> (thankfully -- I would hate to have to deal with the pattern mess).
> With SLP, all SLP statements are analyzed ahead of time, but they aren't
> transformed until one of them is encountered in the statement walk.

Ah, ok.  I suppose we can simply declare that when we register
vectorized stmts with the backend they are in arbitrary oder.
After all this is not supposed to be another machine dependent reorg
phase (to quote David).

> > 
> > > For a first cut at this, I suggest ignoring the problem other than to
> > > document it as an opportunity for improvement.  Later we could improve
> > > it by using an add_stmt_slp_cost () interface (or adding an is_slp
> > > flag), and another interface to be called at the time during analysis
> > > when the SLP statements will be issued during transformation.  This
> > > would allow the back end model to queue up the SLP costs in a separate
> > > vector and later place them in its internal structures at the
> > > appropriate place.
> > >
> > > It should eventually be possible to remove these fields/accessors:
> > > 
> > >  * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
> > >  * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
> > >  * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST
> > > 
> > > However, I think this should be delayed until we have the basic
> > > infrastructure in place for the new model and well-tested.
> > 
> > Indeed.
> > 
> > > The other issue is that we should have the model track both the inside
> > > and outside costs if we're going to get everything into the target
> > > model.  For a first pass we can ignore this and keep the existing logic
> > > for the outside costs.  Later we should add some interfaces analogous to
> > > add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so
> > > the model can track this stuff as carefully as it wants to.
> > 
> > Outside costs are merely added to the niter * inner-cost metric to
> > be compared with the scalar cost niter * scalar-cost, right?  Thus
> > they would be tracked completely separate - eventually similar to
> > how we compute the cost of the scalar loop.
> 
> Yes, that's the way they're used today, and probably nobody will ever
> want to get fancier than that.  But as you say, the idea would be to let
> them be tracked similarly, but in two more pieces: prolog and epilog.

Ok.

Richard.
William J. Schmidt - June 19, 2012, 12:45 p.m.
On Tue, 2012-06-19 at 12:10 +0200, Richard Guenther wrote:
> On Mon, 18 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > 
> > > <snip>
> > > > 
> > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > I'd like us to move more of the cost model detail to the target, giving
> > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > posting the overall idea at some point, but let me repeat it here instead
> > > > of trying to find that e-mail.
> > > > 
> > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > 
> > > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > > >      target-private data.  */
> > > >   void *init_cost (struct loop *loop);
> > > > 
> > > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > > >   void add_stmt_cost (void *data, unsigned n,
> > > > 		      vectorized-stmt-kind,
> > > >                       enum machine_mode vector_mode);
> > > > 
> > > >   /* Tell the target to compute and return the cost of the accumulated
> > > >      statements and free any target-private data.  */
> > > >   unsigned finish_cost (void *data);
> > 
> > By the way, I don't see much point in passing the void *data around
> > here.  Too many levels of interfaces that we'd have to pass it around in
> > the vectorizer, so it would just sit in a static variable.  Might as
> > well let the data be wholly private to the target.
> 
> Ok, so you'd have void init_cost (struct loop *) and
> unsigned finish_cost (void); then?  Static variables are of couse
> not properly "abstracted" so we can't ever compute two set of costs
> at the same time ... but that's true all-over-the-place in GCC ...

It's a fair point, and perhaps I'll decide to pass the data pointer
around anyway to keep that option open.  We'll see which looks uglier.

> 
> With previous discussion the add_stmt_cost hook would be split up
> to also allow passing the operation code for example.

I remember having this discussion, and I was looking for it to check on
the details, but I can't seem to find it either in my inbox or in the
archives.  Can you please point me to that again?  Sorry for the bother.

Thanks,
Bill

> 
> Richard.
>
Richard Guenther - June 19, 2012, 12:48 p.m.
On Tue, 19 Jun 2012, William J. Schmidt wrote:

> On Tue, 2012-06-19 at 12:10 +0200, Richard Guenther wrote:
> > On Mon, 18 Jun 2012, William J. Schmidt wrote:
> > 
> > > On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> > > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > > 
> > > > <snip>
> > > > > 
> > > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > > I'd like us to move more of the cost model detail to the target, giving
> > > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > > posting the overall idea at some point, but let me repeat it here instead
> > > > > of trying to find that e-mail.
> > > > > 
> > > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > > 
> > > > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > > > >      (if the loop argument is NULL).  Returns an opaque pointer to
> > > > >      target-private data.  */
> > > > >   void *init_cost (struct loop *loop);
> > > > > 
> > > > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > > > >   void add_stmt_cost (void *data, unsigned n,
> > > > > 		      vectorized-stmt-kind,
> > > > >                       enum machine_mode vector_mode);
> > > > > 
> > > > >   /* Tell the target to compute and return the cost of the accumulated
> > > > >      statements and free any target-private data.  */
> > > > >   unsigned finish_cost (void *data);
> > > 
> > > By the way, I don't see much point in passing the void *data around
> > > here.  Too many levels of interfaces that we'd have to pass it around in
> > > the vectorizer, so it would just sit in a static variable.  Might as
> > > well let the data be wholly private to the target.
> > 
> > Ok, so you'd have void init_cost (struct loop *) and
> > unsigned finish_cost (void); then?  Static variables are of couse
> > not properly "abstracted" so we can't ever compute two set of costs
> > at the same time ... but that's true all-over-the-place in GCC ...
> 
> It's a fair point, and perhaps I'll decide to pass the data pointer
> around anyway to keep that option open.  We'll see which looks uglier.
> 
> > 
> > With previous discussion the add_stmt_cost hook would be split up
> > to also allow passing the operation code for example.
> 
> I remember having this discussion, and I was looking for it to check on
> the details, but I can't seem to find it either in my inbox or in the
> archives.  Can you please point me to that again?  Sorry for the bother.

It was in the "Correct cost model for strided loads" thread.

Richard.
William J. Schmidt - June 19, 2012, 2:17 p.m.
On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote:
> On Tue, 19 Jun 2012, William J. Schmidt wrote:
> 
> > I remember having this discussion, and I was looking for it to check on
> > the details, but I can't seem to find it either in my inbox or in the
> > archives.  Can you please point me to that again?  Sorry for the bother.
> 
> It was in the "Correct cost model for strided loads" thread.

Ah, right, thanks.  I think it will be best to make that a separate
patch in the series.  Like so:

(1) Add calls to the new interface without disturbing existing logic;
modify the profitability algorithms to query the new model for inside
costs.  Default algorithm for the model is to just sum costs as is done
today.
(1a) Split up the cost hooks (one for loads/stores with misalign parm,
one for vector_stmt with tree_code, etc.).
(x) Add heuristics to target models as desired.
(2) Handle the SLP ordering problem.
(3) Handle outside costs in the target model.
(4) Remove the now unnecessary cost fields and the calls that set them.

I'll start work on this series of patches as I have time between other
projects.

Thanks,
Bill

> 
> Richard.
>
Richard Guenther - June 19, 2012, 2:20 p.m.
On Tue, 19 Jun 2012, William J. Schmidt wrote:

> On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote:
> > On Tue, 19 Jun 2012, William J. Schmidt wrote:
> > 
> > > I remember having this discussion, and I was looking for it to check on
> > > the details, but I can't seem to find it either in my inbox or in the
> > > archives.  Can you please point me to that again?  Sorry for the bother.
> > 
> > It was in the "Correct cost model for strided loads" thread.
> 
> Ah, right, thanks.  I think it will be best to make that a separate
> patch in the series.  Like so:
> 
> (1) Add calls to the new interface without disturbing existing logic;
> modify the profitability algorithms to query the new model for inside
> costs.  Default algorithm for the model is to just sum costs as is done
> today.
> (1a) Split up the cost hooks (one for loads/stores with misalign parm,
> one for vector_stmt with tree_code, etc.).
> (x) Add heuristics to target models as desired.
> (2) Handle the SLP ordering problem.
> (3) Handle outside costs in the target model.
> (4) Remove the now unnecessary cost fields and the calls that set them.
> 
> I'll start work on this series of patches as I have time between other
> projects.

Thanks!
Richard.
William J. Schmidt - June 21, 2012, 12:29 p.m.
On Tue, 2012-06-19 at 16:20 +0200, Richard Guenther wrote:
> On Tue, 19 Jun 2012, William J. Schmidt wrote:
> 
> > On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote:
> > > On Tue, 19 Jun 2012, William J. Schmidt wrote:
> > > 
> > > > I remember having this discussion, and I was looking for it to check on
> > > > the details, but I can't seem to find it either in my inbox or in the
> > > > archives.  Can you please point me to that again?  Sorry for the bother.
> > > 
> > > It was in the "Correct cost model for strided loads" thread.
> > 
> > Ah, right, thanks.  I think it will be best to make that a separate
> > patch in the series.  Like so:
> > 
> > (1) Add calls to the new interface without disturbing existing logic;
> > modify the profitability algorithms to query the new model for inside
> > costs.  Default algorithm for the model is to just sum costs as is done
> > today.

Just FYI, this is not quite as straightforward as I thought.  There is
some code in tree-vect-data-refs.c that computes costs for various
peeling options and picks one of them.  In most other places we can just
pass the instructions to the back end at the same place that the costs
are currently calculated, but not here.  This will require some more
major surgery to save the instructions needed from each peeling option
and only pass along the ones that end up being chosen.

The upside is the same sort of "delayed emit" is needed for the SLP
ordering problem, so the infrastructure for this will be reusable for
that problem.

Grumble.

Bill

> > (1a) Split up the cost hooks (one for loads/stores with misalign parm,
> > one for vector_stmt with tree_code, etc.).
> > (x) Add heuristics to target models as desired.
> > (2) Handle the SLP ordering problem.
> > (3) Handle outside costs in the target model.
> > (4) Remove the now unnecessary cost fields and the calls that set them.
> > 
> > I'll start work on this series of patches as I have time between other
> > projects.
> 
> Thanks!
> Richard.
>

Patch

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 188305)
+++ gcc/doc/tm.texi	(working copy)
@@ -5798,6 +5798,27 @@  The default is @code{NULL_TREE} which means to not
 loads.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD (void)
+This hook should return the maximum density, expressed in percent, for
+which autovectorization of loops with large bodies should be constrained.
+See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
+is to return 100, which disables the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD (void)
+This hook should return the minimum estimated size of a vectorized
+loop body for which the density test should apply.  See also
+@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
+to the unreasonable value of 1000000, which effectively disables 
+the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PENALTY (void)
+This hook should return the penalty, expressed in percent, to be applied
+to the inside-of-loop vectorization costs for a loop failing the density
+test.  The default is 10.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 188305)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -5730,6 +5730,27 @@  The default is @code{NULL_TREE} which means to not
 loads.
 @end deftypefn
 
+@hook TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
+This hook should return the maximum density, expressed in percent, for
+which autovectorization of loops with large bodies should be constrained.
+See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
+is to return 100, which disables the density test.
+@end deftypefn
+
+@hook TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
+This hook should return the minimum estimated size of a vectorized
+loop body for which the density test should apply.  See also
+@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
+to the unreasonable value of 1000000, which effectively disables 
+the density test.
+@end deftypefn
+
+@hook TARGET_VECTORIZE_DENSITY_PENALTY
+This hook should return the penalty, expressed in percent, to be applied
+to the inside-of-loop vectorization costs for a loop failing the density
+test.  The default is 10.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	(revision 188305)
+++ gcc/targhooks.c	(working copy)
@@ -990,6 +990,33 @@  default_autovectorize_vector_sizes (void)
   return 0;
 }
 
+/* By default, the density test for autovectorization is disabled by
+   setting the minimum percentage to 100.  */
+
+int
+default_density_pct_threshold (void)
+{
+  return 100;
+}
+
+/* By default, the density size threshold for autovectorization is
+   meaningless since the density test is disabled.  An unreasonably
+   large number is used to further inhibit the density test.  */
+
+int
+default_density_size_threshold (void)
+{
+  return 1000000;
+}
+
+/* By default, the density penalty for autovectorization is set to 10%.  */
+
+int
+default_density_penalty (void)
+{
+  return 10;
+}
+
 /* Determine whether or not a pointer mode is valid. Assume defaults
    of ptr_mode or Pmode - can be overridden.  */
 bool
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 188305)
+++ gcc/targhooks.h	(working copy)
@@ -90,6 +90,9 @@  default_builtin_support_vector_misalignment (enum
 					     int, bool);
 extern enum machine_mode default_preferred_simd_mode (enum machine_mode mode);
 extern unsigned int default_autovectorize_vector_sizes (void);
+extern int default_density_pct_threshold (void);
+extern int default_density_size_threshold (void);
+extern int default_density_penalty (void);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 188305)
+++ gcc/target.def	(working copy)
@@ -1054,6 +1054,32 @@  DEFHOOK
  (const_tree mem_vectype, const_tree index_type, int scale),
  NULL)
 
+/* Return the maximum density in percent for loop vectorization.  */
+DEFHOOK
+(density_pct_threshold,
+"",
+int,
+(void),
+default_density_pct_threshold)
+
+/* Return the minimum size of a loop iteration for applying the density
+   test for loop vectorization.  */
+DEFHOOK
+(density_size_threshold,
+"",
+int,
+(void),
+default_density_size_threshold)
+
+/* Return the penalty in percent for vectorizing a loop failing the
+   density test.  */
+DEFHOOK
+(density_penalty,
+"",
+int,
+(void),
+default_density_penalty)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	(revision 188305)
+++ gcc/tree-vect-loop.c	(working copy)
@@ -2485,6 +2485,58 @@  vect_get_known_peeling_cost (loop_vec_info loop_vi
            + peel_guard_costs;
 }
 
+/* Add the inside-loop cost of STMT to either *REL_COST or *IRREL_COST,
+   depending on whether or not STMT will be vectorized.  For vectorized
+   statements, the inside-loop cost is as already computed.  For other
+   statements, assume a cost of one.  */
+
+static void
+accum_stmt_cost (gimple stmt, int *rel_cost, int *irrel_cost)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  gimple pattern_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+  gimple_seq pattern_def_seq;
+
+  /* If the statement is irrelevant, but it has a related pattern
+     statement that is relevant, process just the related statement.
+     If the statement is relevant and it has a related pattern
+     statement that is also relevant, process them both.  */
+  if (!STMT_VINFO_RELEVANT_P (stmt_info)
+      && !STMT_VINFO_LIVE_P (stmt_info))
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  && pattern_stmt
+	  && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
+	      || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
+	accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
+      else
+	(*irrel_cost)++;
+    }
+  else if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	   && pattern_stmt
+	   && (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_stmt))
+	       || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_stmt))))
+    accum_stmt_cost (pattern_stmt, rel_cost, irrel_cost);
+
+  /* If we're looking at a pattern that has additional statements,
+     count them as well.  */
+  if (is_pattern_stmt_p (stmt_info)
+      && (pattern_def_seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_info)))
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start (pattern_def_seq); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple pattern_def_stmt = gsi_stmt (gsi);
+	  if (STMT_VINFO_RELEVANT_P (vinfo_for_stmt (pattern_def_stmt))
+	      || STMT_VINFO_LIVE_P (vinfo_for_stmt (pattern_def_stmt)))
+	    accum_stmt_cost (pattern_def_stmt, rel_cost, irrel_cost);
+	}
+    }
+
+  /* Accumulate the inside-loop cost of this vectorizable statement.  */
+  *rel_cost += STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info);
+}
+
 /* Function vect_estimate_min_profitable_iters
 
    Return the number of iterations required for the vector version of the
@@ -2743,6 +2795,45 @@  vect_estimate_min_profitable_iters (loop_vec_info
       vec_inside_cost += SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance);
     }
 
+  /* Test for likely overcommitment of vector hardware resources.  If a
+     loop iteration is relatively large, and too large a percentage of
+     instructions in the loop are vectorized, the cost model may not
+     adequately reflect delays from unavailable vector resources.
+     Penalize vec_inside_cost for this case, using target-specific
+     parameters.  */
+  if (targetm.vectorize.density_pct_threshold () < 100)
+    {
+      int rel_cost = 0, irrel_cost = 0;
+      int density_pct;
+
+      for (i = 0; i < nbbs; i++)
+	{
+	  basic_block bb = bbs[i];
+	  gimple_stmt_iterator gsi;
+
+	  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple stmt = gsi_stmt (gsi);
+	      accum_stmt_cost (stmt, &rel_cost, &irrel_cost);
+	    }
+	}
+
+      density_pct = (rel_cost * 100) / (rel_cost + irrel_cost);
+
+      if (density_pct > targetm.vectorize.density_pct_threshold ()
+	  && (rel_cost + irrel_cost
+	      > targetm.vectorize.density_size_threshold ()))
+	{
+	  int penalty = targetm.vectorize.density_penalty ();
+	  vec_inside_cost = vec_inside_cost * (100 + penalty) / 100;
+	  if (vect_print_dump_info (REPORT_DETAILS))
+	    fprintf (vect_dump,
+		     "density %d%%, cost %d exceeds threshold"
+		     ", penalizing inside-loop cost by %d%%.",
+		     density_pct, rel_cost + irrel_cost, penalty);
+	}
+    }
+
   /* Calculate number of iterations required to make the vector version
      profitable, relative to the loop bodies only.  The following condition
      must hold true:
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 188305)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -1289,6 +1289,15 @@  static const struct attribute_spec rs6000_attribut
 #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
 #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE \
   rs6000_preferred_simd_mode
+#undef TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD
+#define TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD \
+  rs6000_density_pct_threshold
+#undef TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD
+#define TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD \
+  rs6000_density_size_threshold
+#undef TARGET_VECTORIZE_DENSITY_PENALTY
+#define TARGET_VECTORIZE_DENSITY_PENALTY \
+ rs6000_density_penalty
 
 #undef TARGET_INIT_BUILTINS
 #define TARGET_INIT_BUILTINS rs6000_init_builtins
@@ -3421,13 +3430,13 @@  rs6000_builtin_vectorization_cost (enum vect_cost_
 
       case vec_perm:
 	if (TARGET_VSX)
-	  return 4;
+	  return 3;
 	else
 	  return 1;
 
       case vec_promote_demote:
         if (TARGET_VSX)
-          return 5;
+          return 4;
         else
           return 1;
 
@@ -3551,6 +3560,30 @@  rs6000_preferred_simd_mode (enum machine_mode mode
   return word_mode;
 }
 
+/* Implement targetm.vectorize.density_pct_threshold.  */
+
+static int
+rs6000_density_pct_threshold (void)
+{
+  return 85;
+}
+
+/* Implement targetm.vectorize.density_size_threshold.  */
+
+static int
+rs6000_density_size_threshold (void)
+{
+  return 70;
+}
+
+/* Implement targetm.vectorize.density_penalty.  */
+
+static int
+rs6000_density_penalty (void)
+{
+  return 10;
+}
+
 /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
    library with vectorized intrinsics.  */