
Refining autopar cost model for outer loops

Message ID OF34BF09C1.25EE1EE9-ONC22579F8.00429264-C22579FD.0058198C@il.ibm.com
State New

Commit Message

Razya Ladelsky May 13, 2012, 4:02 p.m. UTC
Hi,

This patch lowers the minimum number of iterations that the runtime check
requires for outer loops. The check tests whether it is worthwhile to run
the parallelized version of the loop.
The current minimum number of iterations for all loops is MIN_PER_THREAD *
number of threads, where MIN_PER_THREAD is arbitrarily set to 100.
This prevents some of the promising loops in SPEC2006 from being
parallelized.
I changed the minimum for outer loops to 2 * number of threads, under the
assumption that even when there are few iterations, an outer loop contains
inner loops, so each iteration carries enough work to make parallelization
worthwhile.
This allowed many more loops to be parallelized, resulting in substantial
performance improvements for SPEC2006 benchmarks, measured on a Power7 with
6 cores and 4-way SMT per core.
I compared trunk with -O3 + autopar (parallelizing with 6 threads) against
trunk with -O3 minus vectorization.
None of the benchmarks shows any significant degradation.
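
As a rough illustration of the bound discussed here, the following standalone
sketch models the threshold arithmetic only. It is not GCC code:
min_iterations and worth_parallelizing are hypothetical helper names, the
value 100 mirrors MIN_PER_THREAD, and the value 2 is the per-thread minimum
the patch uses for loops that contain an inner loop.

```c
#include <stdbool.h>

/* Mirrors MIN_PER_THREAD from tree-parloops.c (100, as stated above).  */
#define MIN_PER_THREAD 100

/* Hypothetical helper: minimum iteration count a loop must have before
   the parallel version is chosen at run time.  Outer loops (those with
   an inner loop) get a per-thread minimum of 2, as in the patch.  */
static unsigned long
min_iterations (bool has_inner_loop, unsigned n_threads)
{
  unsigned long per_thread = has_inner_loop ? 2 : MIN_PER_THREAD;
  return per_thread * n_threads;
}

/* Hypothetical helper: models the runtime check nit >= bound.  */
static bool
worth_parallelizing (unsigned long nit, bool has_inner_loop,
                     unsigned n_threads)
{
  return nit >= min_iterations (has_inner_loop, n_threads);
}
```

With 6 threads, an innermost loop still needs 600 iterations, whereas an
outer loop needs only 12, so an outer loop of 50 iterations now passes the
check.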

The libquantum speedup was already obtained with previous versions of
autopar and is unrelated to this patch, but the patch does not degrade it
either.

These are the speedups I collected:

462.libquantum  2.5 X
410.bwaves      3.3 X
436.cactusADM   4.5 X
459.GemsFDTD    1.27 X
481.wrf         1.25 X


Bootstrap and testsuite (with -ftree-parallelize-loops=4) pass 
successfully.
SPEC2006 showed no regressions.


OK for trunk?
Thanks,
razya

2012-05-08  Razya Ladelsky  <razya@il.ibm.com>
 
	* tree-parloops.c (gen_parallel_loop): Change many_iterations_cond
	for outer loops.

Comments

Richard Biener May 14, 2012, 9:39 a.m. UTC | #1
On Sun, May 13, 2012 at 6:02 PM, Razya Ladelsky <RAZYA@il.ibm.com> wrote:
> Hi,
>
> This patch lowers the minimum number of iterations that the runtime check
> requires for outer loops. The check tests whether it is worthwhile to run
> the parallelized version of the loop.
> The current minimum number of iterations for all loops is MIN_PER_THREAD *
> number of threads, where MIN_PER_THREAD is arbitrarily set to 100.
> This prevents some of the promising loops in SPEC2006 from being
> parallelized.
> I changed the minimum for outer loops to 2 * number of threads, under the
> assumption that even when there are few iterations, an outer loop contains
> inner loops, so each iteration carries enough work to make parallelization
> worthwhile.
> This allowed many more loops to be parallelized, resulting in substantial
> performance improvements for SPEC2006 benchmarks, measured on a Power7 with
> 6 cores and 4-way SMT per core.
> I compared trunk with -O3 + autopar (parallelizing with 6 threads) against
> trunk with -O3 minus vectorization.
> None of the benchmarks shows any significant degradation.
>
> The libquantum speedup was already obtained with previous versions of
> autopar and is unrelated to this patch, but the patch does not degrade it
> either.
>
> These are the speedups I collected:
>
> 462.libquantum  2.5 X
> 410.bwaves      3.3 X
> 436.cactusADM   4.5 X
> 459.GemsFDTD    1.27 X
> 481.wrf         1.25 X
>
>
> Bootstrap and testsuite (with -ftree-parallelize-loops=4) pass
> successfully.
> SPEC2006 showed no regressions.
>
>
> OK for trunk?

Can you add a comment that we should compute a better number-of-iterations
value here?  That is, if we have

  for (i = 0; i < n; ++i)
    for (j = 0; j < m; ++j)
      ...

we should compute nit = n * m, not nit = n.  Also may_be_zero handling
would need to be adjusted so we compute nit = (n-maybe-zero ? 0 : n) *
(m-maybe-zero ? 0 : m).  Thus, generally do a better job of computing
the work done per thread.
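
A minimal sketch of that computation, assuming a simplified model: struct
loop_niter, effective_niter and nest_iterations are hypothetical names, and
may_be_zero is modeled as a plain boolean rather than the condition tree
that GCC evaluates at run time. Overflow of the product is ignored.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a loop's iteration description: a count
   plus a flag saying the loop may execute zero times.  */
struct loop_niter
{
  unsigned long niter;
  bool may_be_zero;
};

/* Effective count of one loop: (maybe-zero ? 0 : niter).  */
static unsigned long
effective_niter (struct loop_niter l)
{
  return l.may_be_zero ? 0 : l.niter;
}

/* Total iterations of a two-level nest:
   nit = (n-maybe-zero ? 0 : n) * (m-maybe-zero ? 0 : m).  */
static unsigned long
nest_iterations (struct loop_niter outer, struct loop_niter inner)
{
  return effective_niter (outer) * effective_niter (inner);
}
```

For n = 10 and m = 20 this gives nit = 200 rather than 10, and nit = 0 as
soon as either loop's maybe-zero condition holds.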

The patch is ok with a suitable comment.

Thanks,
Richard.

> Thanks,
> razya
>
> 2012-05-08  Razya Ladelsky  <razya@il.ibm.com>
>
>	* tree-parloops.c (gen_parallel_loop): Change many_iterations_cond
>	for outer loops.
>

Patch

Index: tree-parloops.c
===================================================================
--- tree-parloops.c	(revision 186667)
+++ tree-parloops.c	(working copy)
@@ -1732,6 +1732,7 @@  gen_parallel_loop (struct loop *loop, htab_t reduc
   unsigned prob;
   location_t loc;
   gimple cond_stmt;
+  unsigned int m_p_thread = 2;
 
   /* From
 
@@ -1792,9 +1793,15 @@  gen_parallel_loop (struct loop *loop, htab_t reduc
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  many_iterations_cond =
-    fold_build2 (GE_EXPR, boolean_type_node,
-		 nit, build_int_cst (type, MIN_PER_THREAD * n_threads));
+  if (loop->inner)
+    m_p_thread = 2;
+  else
+    m_p_thread = MIN_PER_THREAD;
+
+  many_iterations_cond =
+    fold_build2 (GE_EXPR, boolean_type_node,
+		 nit, build_int_cst (type, m_p_thread * n_threads));
+
   many_iterations_cond
     = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
 		   invert_truthvalue (unshare_expr (niter->may_be_zero)),