Tree unroll - Relaxing code size increase with O2

Message ID CAAs8HmzTLwD5DVR5HRXE+waH8mUCNpj_PKMtDA8aAKO1yJBHcA@mail.gmail.com
State New

Commit Message

Sriraman Tallam Nov. 21, 2013, 3:41 a.m. UTC
Hi,

    Currently, the tree unrolling pass (cunroll) does not allow any code
size growth at -O2.  Code size growth is permitted only if -O3 or
-funroll-loops/-fpeel-loops is used.  I have created a patch to allow
partial code size increase at -O2.  With -funroll-loops the maximum
allowed code growth is 100 unrolled insns.  For partial growth, I
experimented with various values of code growth and I have attached
SPEC 2006 performance numbers for code growth from 20 to 100 insns in
steps of 20.

   For this patch, I have set the partial code growth at -O2 to
40 insns (tunable via a param), where we get performance improvements
with minimal code size growth.  The perf. data shows good improvements in
a few benchmarks: h264, sjeng and bzip2 get >2% improvement.
calculix shows a big regression (4.5% on westmere), which I am
investigating along with the povray regression.
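
The selection described above can be sketched as a standalone model. This
is not the GCC code: the function name and the enum constants are
hypothetical stand-ins, the two limits mirror the values quoted in this
message (100 and 40), and a return value of 0 only stands for "no growth
allowed", which the real pass decides through a separate size comparison.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two limits discussed in this message:
   100 insns when full growth is allowed, 40 insns for plain -O2.  */
enum
{
  FULL_GROWTH_INSNS = 100,
  PARTIAL_GROWTH_INSNS = 40
};

/* Return the insn budget for a completely unrolled loop.
   OPTIMIZE models the -O level; UNROLL_LOOPS and PEEL_LOOPS model
   -funroll-loops and -fpeel-loops.  A result of 0 is this sketch's
   way of saying "no code growth allowed".  */
unsigned
unroll_insn_budget (int optimize, bool unroll_loops, bool peel_loops)
{
  if (unroll_loops || peel_loops || optimize >= 3)
    return FULL_GROWTH_INSNS;
  if (optimize == 2)
    return PARTIAL_GROWTH_INSNS;
  return 0;
}
```

Under these assumptions, plain -O2 gets the 40-insn budget, while -O2
combined with -funroll-loops still gets the full 100.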

   I also ran experiments with -ftree-vectorize turned on with -O2
both in baseline and with the partial unroll to study the effect of
unrolling on vectorization. Loop unrolling seems to benefit more
benchmarks when vectorization is turned on.

   I have attached the patch and pdfs of the perf. data and code size growth.

How to read the attached perf data:

There are two data files.

* spec_perf_O2_unroll.txt contains perf data using unrolling with
various code size growth on O2.
* spec_perf_O2_vectorize_unroll.txt contains perf data using
unrolling with various code size growth on O2 + ftree-vectorize.

Each file contains perf. improvements and code size growth data.
Experiments were done on Ibis-sandybridge and Ikaria-westmere.

Here is a sample from the file (All perf. numbers are in %):

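Unroll insns code growth        20      40      60      80     100
__________________________________________________________________
spec/2006/fp/C++/444.namd     -3.2   -0.13    -0.4   -0.57   -0.31

This data shows that namd regressed by 3.2% over baseline when code
size growth was set to 20 insns and regressed by 0.57% over baseline
when growth was 80 insns.

   Please let me know what you think.

Thanks
Sri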

Comments

Xinliang David Li Nov. 21, 2013, 6:05 a.m. UTC | #1
Would it be sufficient to

1) get rid of the 'may_increase_size' parameter in all the unroll
interfaces (basically make it true for O2); and
2) set the MAX_COMPLETELY_PEELED_INSNS parameter to a smaller value for
O2? This makes complete unrolling at O2 and O3 behave the same way,
just with different parameter values. Note that doing so is very
similar to loop vectorization at O2: O2 requires a cheap cost model,
which lowers the values of related parameters such as the number of
alias checks. See how this is done in opts.c.
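
A minimal sketch of that idea: one limit whose default depends on the
optimization level, unless the user set it explicitly. This is a
standalone model only. The struct and both function names are
hypothetical and are not the interfaces used in opts.c, and the values
100 and 40 are taken from the original message.

```c
#include <stdbool.h>

/* Hypothetical model of one tunable: its current value and whether
   the user set it explicitly (for example via --param).  */
struct param
{
  int value;
  bool set_by_user;
};

/* Install DEFAULT_VALUE unless the user already chose a value.  */
void
maybe_set_default (struct param *p, int default_value)
{
  if (!p->set_by_user)
    p->value = default_value;
}

/* Choose the default for the single peeled-insns limit.  FULL_GROWTH
   models -funroll-loops or -fpeel-loops being given.  Levels below
   -O3 without those flags get the smaller limit.  */
void
finish_unroll_defaults (struct param *max_peeled_insns, int optimize,
			bool full_growth)
{
  int limit = (full_growth || optimize >= 3) ? 100 : 40;
  maybe_set_default (max_peeled_insns, limit);
}
```

With this shape the unroller only ever reads one limit, and a value
given explicitly by the user is left alone at every optimization level.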

David

On Wed, Nov 20, 2013 at 7:41 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> Hi,
>
>     Currently, tree unrolling pass(cunroll) does not allow any code
> size growth in O2 mode.  Code size growth is permitted only if O3 or
> funroll-loops/fpeel-loops is used. I have created  a patch to allow
> partial code size increase in O2 mode. With funroll-loops the maximum
> allowed code growth is 100 unrolled insns. For partial growth, I
> experimented with various values of code growth and I have attached
> SPEC 2006 performance numbers for code growth from 20 to 100 insns in
> steps of 20.
>
>    For this patch, I have set the partial code growth in O2 mode to be
> 40 insns (tunable via param) where we get performance improvements
> with minimal code size growth.  Perf. data shows good improvements in
> a few benchmarks.  h264, sjeng and bzip2 get >2%  improvement.
> calculix shows a big regression(4.5% on westmere) which I am
> investigating along with the povray regression.
>
>    I also ran experiments with -ftree-vectorize turned on with -O2
> both in baseline and with the partial unroll to study the effect of
> unrolling on vectorization. Loop unrolling seems to benefit more
> benchmarks when vectorization is turned on.
>
>    I have attached the patch and pdfs of the perf. data. and code size growth.
>
> How to read the attached perf data:
>
> There are two data files.
>
> * spec_perf_O2_unroll.txt contains perf data using unrolling with
> various code size growth on O2.
> * spec_perf_O2_vectorize_unroll.txt contains perf data using
> unrolling with various code size growth on O2 + ftree-vectorize.
>
> Each file contains perf. improvements and code size growth data.
> Experiments were done on Ibis-sandybridge and Ikaria-westmere.
>
> Here is a sample from the file (All perf. numbers are in %):
>
> Unroll insns code growth           20      40     60       80        100
> _____________________________________________________
> spec/2006/fp/C++/444.namd     -3.2   -0.13   -0.4    -0.57      -0.31
>
> This data shows that namd regressed by 3.2% over baseline when code
> size growth was set to 20 insns and regressed by 0.57% over baseline
> when growth was 80 insns.
>
>    Please let me know what you think.
>
> Thanks
> Sri
Richard Biener Nov. 21, 2013, 1:29 p.m. UTC | #2
On Thu, Nov 21, 2013 at 7:05 AM, Xinliang David Li <davidxl@google.com> wrote:
> Would it be sufficient to
>
> 1) get rid of the 'may_increase_size' parameter in all the unroll
> interfaces (basically make it true for O2); and
> 2) set MAX_COMPLETELY_PEELED_INSNS parameter to be a smaller value for
> O2? -- this makes O2 and O3's complete unroll behave in the same way
> but with different parameter. Note that doing so is very similar to
> loop vectorization at O2 -- O2 requires a cheap cost model which
> lowers the value for related parameter such as # of alias checks.  See
> how this is done in opts.c

I agree that yet another param is bad.

> David
>
> On Wed, Nov 20, 2013 at 7:41 PM, Sriraman Tallam <tmsriram@google.com> wrote:
>> Hi,
>>
>>     Currently, tree unrolling pass(cunroll) does not allow any code
>> size growth in O2 mode.  Code size growth is permitted only if O3 or
>> funroll-loops/fpeel-loops is used. I have created  a patch to allow
>> partial code size increase in O2 mode. With funroll-loops the maximum
>> allowed code growth is 100 unrolled insns. For partial growth, I
>> experimented with various values of code growth and I have attached
>> SPEC 2006 performance numbers for code growth from 20 to 100 insns in
>> steps of 20.
>>
>>    For this patch, I have set the partial code growth in O2 mode to be
>> 40 insns (tunable via param) where we get performance improvements
>> with minimal code size growth.  Perf. data shows good improvements in
>> a few benchmarks.  h264, sjeng and bzip2 get >2%  improvement.
>> calculix shows a big regression(4.5% on westmere) which I am
>> investigating along with the povray regression.

Did you look at compile-time effects?  Note that you should avoid
complete peeling here (unrolling based on max_iter) as well I think.
40 instructions is a lot to allow given the optimistic unrolling.

See PRs we have where even with the current code we unroll way
too much for -O2.
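
To illustrate the max_iter concern, here is a hand-written sketch. It
is not compiler output, and the function names are hypothetical. The
trip count of the first loop is unknown at compile time, but writes
past buf[3] would be undefined, so an upper bound of 4 iterations can
be assumed. Peeling on that bound has to keep an exit test in every
copy, so the code grows even though the real trip count may be small.

```c
/* A 4-element buffer: the array size is the only bound on the loop.  */
int buf[4];

/* Zero the first N entries.  N is not a compile-time constant.  */
void
clear_prefix (int n)
{
  for (int i = 0; i < n; i++)
    buf[i] = 0;
}

/* Roughly what complete peeling on the upper bound of 4 amounts to:
   four copies of the body, each still guarded by its own exit test
   because the actual trip count is not known.  */
void
clear_prefix_peeled (int n)
{
  if (n <= 0)
    return;
  buf[0] = 0;
  if (n <= 1)
    return;
  buf[1] = 0;
  if (n <= 2)
    return;
  buf[2] = 0;
  if (n <= 3)
    return;
  buf[3] = 0;
}
```

Both functions behave the same for 0 <= n <= 4. The second one trades
the loop for four guarded stores, which is the kind of optimistic
growth being questioned here.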

Richard.

Patch

Index: params.def
===================================================================
--- params.def	(revision 205058)
+++ params.def	(working copy)
@@ -304,6 +304,11 @@  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
 	"max-completely-peeled-insns",
 	"The maximum number of insns of a completely peeled loop",
 	100, 0, 0)
+/* The maximum number of insns in a peeled loop for default unrolling.  */
+DEFPARAM(PARAM_MAX_DEFAULT_UNROLL_INSNS,
+	"max-default-unroll-insns",
+	"The maximum number of insns for the default tree unrolling",
+	40, 0, 0)
 /* The maximum number of peelings of a single loop that is peeled completely.  */
 DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
 	"max-completely-peel-times",
Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 205058)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -71,9 +71,18 @@  enum unroll_level
 			   iteration.  */
   UL_NO_GROWTH,		/* Only loops whose unrolling will not cause increase
 			   of code size.  */
+  UL_PARTIAL,		/* All suitable loops whose unrolled size does not
+			   exceed PARAM_MAX_DEFAULT_UNROLL_INSNS.  */
   UL_ALL		/* All suitable loops.  */
 };
 
+typedef enum _increase_code_size
+{
+  UNROLL_NO_INCREASE = 0,
+  UNROLL_PARTIAL_INCREASE = 1,
+  UNROLL_FULL_INCREASE = 2
+} increase_code_size;
+
 /* Adds a canonical induction variable to LOOP iterating NITER times.  EXIT
    is the exit edge whose condition is replaced.  */
 
@@ -651,6 +660,7 @@  try_unroll_loop_completely (struct loop *loop,
 			    location_t locus)
 {
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
+  unsigned HOST_WIDE_INT max_unroll_insns;
   gimple cond;
   struct loop_size size;
   bool n_unroll_found = false;
@@ -696,6 +706,10 @@  try_unroll_loop_completely (struct loop *loop,
     return false;
 
   max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
+  max_unroll_insns = (ul != UL_PARTIAL) ?
+		     PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS) :
+		     PARAM_VALUE (PARAM_MAX_DEFAULT_UNROLL_INSNS);
+
   if (n_unroll > max_unroll)
     return false;
 
@@ -805,8 +819,7 @@  try_unroll_loop_completely (struct loop *loop,
 		     loop->num);
 	  return false;
 	}
-      else if (unr_insns
-	       > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS))
+      else if (unr_insns > max_unroll_insns)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "Not unrolling loop %d: "
@@ -1100,7 +1113,8 @@  propagate_constants_for_unrolling (basic_block bb)
    loop we unrolled.  */
 
 static bool
-tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
+tree_unroll_loops_completely_1 (increase_code_size may_increase_size,
+				bool unroll_outer,
 				vec<loop_p, va_heap>& father_stack,
 				struct loop *loop)
 {
@@ -1135,7 +1149,7 @@  static bool
       /* Unroll outermost loops only if asked to do so or they do
 	 not cause code growth.  */
       && (unroll_outer || loop_outer (loop_father)))
-    ul = UL_ALL;
+    ul = (may_increase_size == UNROLL_PARTIAL_INCREASE) ? UL_PARTIAL : UL_ALL;
   else
     ul = UL_NO_GROWTH;
 
@@ -1163,7 +1177,8 @@  static bool
    size of the code does not increase.  */
 
 unsigned int
-tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
+tree_unroll_loops_completely (increase_code_size may_increase_size,
+			      bool unroll_outer)
 {
   stack_vec<loop_p, 16> father_stack;
   bool changed;
@@ -1308,12 +1323,19 @@  make_pass_iv_canon (gcc::context *ctxt)
 static unsigned int
 tree_complete_unroll (void)
 {
+  increase_code_size code_size;
+
   if (number_of_loops (cfun) <= 1)
     return 0;
 
-  return tree_unroll_loops_completely (flag_unroll_loops
-				       || flag_peel_loops
-				       || optimize >= 3, true);
+  if (flag_unroll_loops || flag_peel_loops || (optimize >= 3))
+    code_size = UNROLL_FULL_INCREASE;
+  else if (optimize == 2)
+    code_size = UNROLL_PARTIAL_INCREASE;
+  else
+    code_size = UNROLL_NO_INCREASE;
+
+  return tree_unroll_loops_completely (code_size, true);
 }
 
 static bool
@@ -1366,13 +1388,20 @@  static unsigned int
 tree_complete_unroll_inner (void)
 {
   unsigned ret = 0;
+  increase_code_size code_size;
 
   loop_optimizer_init (LOOPS_NORMAL
 		       | LOOPS_HAVE_RECORDED_EXITS);
   if (number_of_loops (cfun) > 1)
     {
       scev_initialize ();
-      ret = tree_unroll_loops_completely (optimize >= 3, false);
+
+      if (optimize >= 3)
+	code_size = UNROLL_FULL_INCREASE;
+      else
+	code_size = UNROLL_NO_INCREASE;
+
+      ret = tree_unroll_loops_completely (code_size, false);
       free_numbers_of_iterations_estimates ();
       scev_finalize ();
     }