Message ID | OF34BF09C1.25EE1EE9-ONC22579F8.00429264-C22579FD.0058198C@il.ibm.com |
---|---|
State | New |
On Sun, May 13, 2012 at 6:02 PM, Razya Ladelsky <RAZYA@il.ibm.com> wrote:
> Hi,
>
> This patch changes the minimum number of iterations of outer loops for the
> runtime check which tests whether it is worthwhile to parallelize the loop
> or not.
> The current minimum number of iterations for all loops is MIN_PER_THREAD *
> number of threads, where MIN_PER_THREAD is arbitrarily set to 100.
> This prevents some of the promising loops of SPEC2006 from getting
> parallelized.
> I changed the minimum bound for outer loops, under the assumption that
> even if there are not enough iterations, an outer loop that contains
> more loops provides enough work to be worth parallelizing.
> This indeed allowed for a lot more loops to get parallelized, resulting in
> substantial performance improvements for SPEC2006 benchmarks, measured on
> a Power7 6 core, 4 way SMT each.
> I compared the trunk with O3 + autopar (parallelizing with 6 threads) vs.
> the trunk with O3 minus vectorization.
> None of the benchmarks shows any significant degradation.
>
> The speedup shown for libquantum with autopar has been obtained with
> previous versions of autopar, having no relation to this patch, but surely
> not degraded by it either.
>
> These are the speedups I collected:
>
> 462.libquantum 2.5 X
> 410.bwaves 3.3 X
> 436.cactusADM 4.5 X
> 459.GemsFDTD 1.27 X
> 481.wrf 1.25 X
>
>
> Bootstrap and testsuite (with -ftree-parallelize-loops=4) pass
> successfully.
> spec-2006 showed no regressions.
>
>
> OK for trunk?

Can you add a comment that we should compute a better
number-of-iterations value here? That is, if we have

  for (i = 0; i < n; ++i)
    for (j = 0; j < m; ++j)
      ...

we should compute nit = n * m, not nit = n. Also may_be_zero handling
would need to be adjusted so we compute
nit = (n-maybe-zero ? 0 : n) * (m-maybe-zero ? 0 : m).

Thus, generally do a better job of computing the work done per thread.

The patch is ok with a suitable comment.

Thanks,
Richard.

> Thanks,
> razya
>
> 2012-05-08 Razya Ladelsky <razya@il.ibm.com>
>
> * tree-parloops.c (gen_parallel_loop): Change
> many_iterations_cond for outer loops.
>
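The nit = n * m suggestion in the review can be illustrated outside the compiler. The following is only a sketch in plain C, not GCC code: `struct loop_bound`, `effective_iters` and `nest_iterations` are hypothetical names invented for this illustration, and the overflow saturation is an added assumption that the review does not mention.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-loop iteration information.
   In GCC this would be symbolic (trees); here it is plain integers.  */
struct loop_bound
{
  uint64_t niter;     /* iteration count, valid when the loop is entered */
  bool may_be_zero;   /* true when the loop body never executes */
};

/* (x-maybe-zero ? 0 : x) from the review.  */
static uint64_t
effective_iters (struct loop_bound b)
{
  return b.may_be_zero ? 0 : b.niter;
}

/* Total work of a two-level nest: nit = n * m rather than nit = n.
   Saturates at UINT64_MAX on overflow (an assumption of this sketch).  */
static uint64_t
nest_iterations (struct loop_bound outer, struct loop_bound inner)
{
  uint64_t n = effective_iters (outer);
  uint64_t m = effective_iters (inner);

  if (m != 0 && n > UINT64_MAX / m)
    return UINT64_MAX;
  return n * m;
}
```

With this estimate, an outer loop of 6 iterations around an inner loop of 200 counts as 1200 units of work instead of 6, which is the kind of per-thread work measure the review asks for.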
Index: tree-parloops.c
===================================================================
--- tree-parloops.c (revision 186667)
+++ tree-parloops.c (working copy)
@@ -1732,6 +1732,7 @@ gen_parallel_loop (struct loop *loop, htab_t reduc
   unsigned prob;
   location_t loc;
   gimple cond_stmt;
+  unsigned int m_p_thread=2;
 
   /* From
 
@@ -1792,9 +1793,15 @@ gen_parallel_loop (struct loop *loop, htab_t reduc
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
 
-  many_iterations_cond =
-    fold_build2 (GE_EXPR, boolean_type_node,
-                 nit, build_int_cst (type, MIN_PER_THREAD * n_threads));
+  if (loop->inner)
+    m_p_thread=2;
+  else
+    m_p_thread=MIN_PER_THREAD;
+
+  many_iterations_cond =
+    fold_build2 (GE_EXPR, boolean_type_node,
+                 nit, build_int_cst (type, m_p_thread * n_threads));
+
   many_iterations_cond
     = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
                    invert_truthvalue (unshare_expr (niter->may_be_zero)),
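
For readers who do not work with GCC trees, the runtime condition that the patch builds can be sketched as ordinary C. This is an illustration under assumptions, not the GCC implementation: `many_iterations` and its parameters are hypothetical names, MIN_PER_THREAD is taken as 100 from the message, and the second operand of the TRUTH_AND_EXPR (cut off in the diff context) is assumed to be the GE comparison built just before it.

```c
#include <stdbool.h>

#define MIN_PER_THREAD 100   /* value cited in the message */

/* Hypothetical scalar model of many_iterations_cond after the patch.
   nit            - number of iterations of the loop being parallelized
   may_be_zero    - true when the loop may execute zero times
   has_inner_loop - models the loop->inner test from the patch
   n_threads      - number of threads requested  */
static bool
many_iterations (unsigned long nit, bool may_be_zero,
                 bool has_inner_loop, unsigned int n_threads)
{
  /* Outer loops need only 2 iterations per thread; others keep 100.  */
  unsigned int m_p_thread = has_inner_loop ? 2 : MIN_PER_THREAD;

  /* Assumed shape: !may_be_zero && nit >= m_p_thread * n_threads.  */
  return !may_be_zero && nit >= (unsigned long) m_p_thread * n_threads;
}
```

With 6 threads, an outer loop passes the check from 12 iterations upward, while a loop without inner loops still needs 600.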