[RFC] Make vectorizer to skip loops with small iteration estimate

Hi,
the point of the following patch is to stop the vectorizer from vectorizing
the following testcase when profile feedback is available:

int a[10000];
int i=500000000;
int k=2;
int val;
__attribute__ ((noinline,noclone))
void
test (void)
{
  int j;
  for(j=0;j<k;j++)
    a[j]=val;
}
int
main (void)
{
  while (i)
    {
      test ();
      i--;
    }
}

Here the compiler should work out that the loop in test () iterates 2 times on
average and thus is not a good candidate for vectorization.

In my first attempt I added the following:
@@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i
       return false;
     }
 
+  if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
+      && (unsigned HOST_WIDE_INT) estimated_niter <= th)
+    {
+      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count too small.");
+      if (vect_print_dump_info (REPORT_DETAILS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count smaller than "
+                 "user specified loop bound parameter or minimum "
+                 "profitable iterations (whichever is more conservative).");
+      return false;
+    }
+

But to my surprise it does not help.  There are two reasons:

1) The value of TH is a bit low.  The way the cost model works is that
   it finds the minimal niters where the vectorized loop with all its setup
   costs is cheaper than the scalar loop with all its setup costs.  I.e.

  /* Calculate number of iterations required to make the vector version
     profitable, relative to the loop bodies only.  The following condition
     must hold true:
     SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC    (A)
     where
     SIC = scalar iteration cost, VIC = vector iteration cost,
     VOC = vector outside cost, VF = vectorization factor,
     PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
     SOC = scalar outside cost for run time cost model check.  */

    This value is used for both
    1) the decision whether the number of iterations is too low (when the
       maximum number of iterations is known), and
    2) the runtime decision whether to take the vectorized path or the
       scalar path.

    The vectorized loop looks like:
      k.1_10 = k;
      if (k.1_10 > 0)
	{
	  pretmp_2 = val;
	  niters.8_4 = (unsigned int) k.1_10;
	  bnd.9_13 = niters.8_4 >> 2;
	  ratio_mult_vf.10_1 = bnd.9_13 << 2;
	  _18 = niters.8_4 <= 3;
	  _19 = ratio_mult_vf.10_1 == 0;
	  _20 = _19 | _18;
	  if (_20 != 0)
	    scalar loop
	  else
	    vector prologue
	}

     So the unvectorized cost is
     SIC * niters

     The vectorized path is
     SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
     The scalar path of the vectorized loop is
     SIC * niters + SOC

   It makes sense to vectorize if
   SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (B)
   That is, in the optimal case where we actually take the vector path, the
   overall speed of the vectorized loop, including the runtime check, is better.

   It makes sense to take the vector loop if
   SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC         (C)
   because otherwise the scalar loop is taken, and SOC is paid either way.

   The attached patch implements the formula (C) and uses it to make the
   decision based on the estimated number of iterations (which is usually
   provided by the profile feedback).

   As a reality check, I tried my testcase.

   9: Cost model analysis:
     Vector inside of loop cost: 1
     Vector prologue cost: 7
     Vector epilogue cost: 2
     Scalar iteration cost: 1
     Scalar outside cost: 6
     Vector outside cost: 9
     prologue iterations: 0
     epilogue iterations: 2
     Calculated minimum iters for profitability: 4

   9:   Profitability threshold = 3

   9:   Profitability estimated iterations threshold = 20

   This is overestimated.  In reality the loop starts to be beneficial at
   about 4 iterations.  I guess the cost values are somewhat wrong.

   Vector inside of loop cost and Scalar iteration cost seem to ignore the
   fact that the loops do contain some control flow that should account for
   at least one extra cycle.

   Vector prologue cost seems a bit overestimated for one pack operation.

   Of course this is a very simple benchmark; in reality vectorization can be
   a lot more harmful by complicating more complex control flow.

   So I guess we have two options:
    1) go with the new formula and try to make the cost model a bit more
       realistic, or
    2) stay with the original formula, which is quite close to reality, but
       I think more by accident.

2) Even when the loop iterates 2 times, estimated_stmt_executions_int
   estimates it at 4 iterations with the profile feedback.
   The reason is the loop_ch pass.  Given a rolled loop with exit probability
   30%, it duplicates the header and keeps the original probabilities.
   With that, the loop is executed with 60% probability.  Because the
   loop body counts remain the same (and they should), the expected number
   of iterations increases as the count on the entry edge to the header
   decreases.

   I wonder what to do about this.  Obviously without path profiling
   loop_ch cannot really do a good job.  We can artificially make the
   header more likely to succeed, which matches reality, but that requires
   non-trivial loop profile updating.

   We can also simply record the iteration bound in the loop structure
   and ignore that the profile is not realistic.

   Finally we can duplicate loop headers before profiling.  I implemented
   that via an early_ch pass executed only with profile generation or feedback.
   I guess it makes sense to do, even if it breaks the assumption that
   we should do strictly -Os generation on paths where

