From patchwork Tue Nov 6 16:24:42 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 197499 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) by ozlabs.org (Postfix) with SMTP id 014352C008C for ; Wed, 7 Nov 2012 03:25:13 +1100 (EST) Comment: DKIM? See http://www.dkim.org DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=gcc.gnu.org; s=default; x=1352823915; h=Comment: DomainKey-Signature:Received:Received:Received:Received:Date: From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To:User-Agent: Mailing-List:Precedence:List-Id:List-Unsubscribe:List-Archive: List-Post:List-Help:Sender:Delivered-To; bh=QaFZOn6N1VakXdssKfKd ydSDbU0=; b=nVZGtTLtNVQRgZik7kPjqPm6Ky1UjhPNsKoIyXin7NU9XywcgMCT Lx6F9OteZBzf8e3ZTky6EYFOiamplWVn5XFZpe2awUgekDvXtHGQEW+lu4OIyWtL 1syutA+Vbn0DPkI/lPGCs0/eVFCw6bVmOZl64dAs4lVoo/HmYtEYJy0= Comment: DomainKeys? See http://antispam.yahoo.com/domainkeys DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=default; d=gcc.gnu.org; h=Received:Received:X-SWARE-Spam-Status:X-Spam-Check-By:Received:Received:Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:Content-Type:Content-Disposition:In-Reply-To:User-Agent:Mailing-List:Precedence:List-Id:List-Unsubscribe:List-Archive:List-Post:List-Help:Sender:Delivered-To; b=DuA78ym8OM5cE0gfeuQtjFTlfhQAFpj8XACK9fkkRSW+/+IYSEDaftB1nVuvna 8qYQZpC7bkJw/5kaPNCZnOXrh1vK7CXJOh5+So16t7Gswa3rqtfKrjS2UPdXf8h8 s+hlpBL2DVXCeZ+yjAt2c98l33GoFsiFzjNmfBgtl4eSc=; Received: (qmail 23567 invoked by alias); 6 Nov 2012 16:25:01 -0000 Received: (qmail 23553 invoked by uid 22791); 6 Nov 2012 16:24:58 -0000 X-SWARE-Spam-Status: No, hits=-2.6 required=5.0 tests=AWL, BAYES_50, KHOP_RCVD_UNTRUST, RCVD_IN_DNSWL_LOW, RCVD_IN_HOSTKARMA_W, RCVD_IN_HOSTKARMA_WL, RP_MATCHES_RCVD, TW_IV, TW_TM, TW_VT X-Spam-Check-By: sourceware.org Received: from nikam.ms.mff.cuni.cz (HELO nikam.ms.mff.cuni.cz) (195.113.20.16) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 06 Nov 2012 16:24:43 +0000 Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 6B47C543D2C; Tue, 6 Nov 2012 17:24:42 +0100 (CET) Date: Tue, 6 Nov 2012 17:24:42 +0100 From: Jan Hubicka To: Richard Biener Cc: Jan Hubicka , gcc-patches@gcc.gnu.org Subject: Re: [RFC] Heuristics to throttle the complette unrolling Message-ID: <20121106162442.GF30424@kam.mff.cuni.cz> References: <20121030111504.GA19276@kam.mff.cuni.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org > On Tue, 30 Oct 2012, Jan Hubicka wrote: > > > Hi, > > for past week or two I was playing with ways to throttle down the complette > > unrolling heuristics. I made complette unroller to use the tree-ssa-loop-niter > > upper bound and unroll even in non-trivial cases and this has turned out to > > increase number of complettely unrolled loops by great amount, so one can > > see it as considerable code size growth at -O3 SPEC build. > > > > http://gcc.opensuse.org/SPEC/CFP/sb-vangelis-head-64/Total-size_big.png > > it is the largest jump on right hand side in both peak and base runs. > > There are also performance imrovements, most impotantly 11% on applu. > > > > The intuition is that complette unrolling is most profitable when IV tests > > are eliminated and single basic block is created. When condtionals stay > > in the code it is not that good idea and also functions containing calls > > are less interesting for unrolling since the calls are slow and optimization > > oppurtunities are not so great. > > > > This patch reduces unrolling on loops having many branches or calls on the > > hot patch. I found that for applu speedup the number of branches needs to be > > pretty high - about 32. > > > > The patch saves about half of the growth introduced (but on different benchmarks) > > and I think I can move all peeling to trees and reduce peeling limits a bit, too. > > > > Does this sound sane? Any ideas? > > Yes, this sounds ok (beware of unrelated PARAM_MAX_ONCE_PEELED_INSNS > remove in the patch below). Hi, this is somewhat polished version I comitted today. Main change is to test inexpensive_builtlin_p when deciding whether to count builtin call as a call. Bootstrapped/regtested x86_64-linux. Honza * cfgloopanal.c (get_loop_hot_path): New function. * tree-ssa-lop-ivcanon.c (struct loop_size): Add CONSTANT_IV, NUM_NON_PURE_CALLS_ON_HOT_PATH, NUM_PURE_CALLS_ON_HOT_PATH, NUM_BRANCHES_ON_HOT_PATH. (tree_estimate_loop_size): Compute the new values. (try_unroll_loop_completely): Disable unrolling of loops with only calls or too many branches. (tree_unroll_loops_completely): Deal also with outer loops of hot loops. * cfgloop.h (get_loop_hot_path): Declare. * params.def (PARAM_MAX_PEEL_BRANCHES): New parameters. * invoke.texi (max-peel-branches): Document. * gcc.dg/tree-ssa/loop-1.c: Make to look like a good unroling candidate still. * gcc.dg/tree-ssa/loop-23.c: Likewise. * gcc.dg/tree-ssa/cunroll-1.c: Unrolling now happens early. * gcc.dg/tree-prof/unroll-1.c: Remove confused dg-options. Index: cfgloopanal.c =================================================================== --- cfgloopanal.c (revision 193240) +++ cfgloopanal.c (working copy) @@ -483,3 +483,36 @@ single_likely_exit (struct loop *loop) VEC_free (edge, heap, exits); return found; } + + +/* Gets basic blocks of a LOOP. Header is the 0-th block, rest is in dfs + order against direction of edges from latch. Specially, if + header != latch, latch is the 1-st block. */ + +VEC (basic_block, heap) * +get_loop_hot_path (const struct loop *loop) +{ + basic_block bb = loop->header; + VEC (basic_block, heap) *path = NULL; + bitmap visited = BITMAP_ALLOC (NULL); + + while (true) + { + edge_iterator ei; + edge e; + edge best = NULL; + + VEC_safe_push (basic_block, heap, path, bb); + bitmap_set_bit (visited, bb->index); + FOR_EACH_EDGE (e, ei, bb->succs) + if ((!best || e->probability > best->probability) + && !loop_exit_edge_p (loop, e) + && !bitmap_bit_p (visited, e->dest->index)) + best = e; + if (!best || best->dest == loop->header) + break; + bb = best->dest; + } + BITMAP_FREE (visited); + return path; +} Index: testsuite/gcc.dg/tree-ssa/loop-1.c =================================================================== --- testsuite/gcc.dg/tree-ssa/loop-1.c (revision 193240) +++ testsuite/gcc.dg/tree-ssa/loop-1.c (working copy) @@ -17,13 +17,16 @@ to the load from the GOT this also contains the name of the funtion so for each call the function name would appear twice. */ /* { dg-options "-O1 -ftree-loop-ivcanon -funroll-loops -fdump-tree-ivcanon-details -fdump-tree-cunroll-details -fdump-tree-optimized -mno-relax-pic-calls" { target mips*-*-* } } */ - -void xxx(void) +__attribute__ ((pure)) +int foo (int x); +int xxx(void) { int x = 45; + int sum; while (x >>= 1) - foo (); + sum += foo (x) * 2; + return sum; } /* We should be able to find out that the loop iterates four times and unroll it completely. */ Index: testsuite/gcc.dg/tree-ssa/cunroll-1.c =================================================================== --- testsuite/gcc.dg/tree-ssa/cunroll-1.c (revision 193240) +++ testsuite/gcc.dg/tree-ssa/cunroll-1.c (working copy) @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O3 -fdump-tree-cunroll-details" } */ +/* { dg-options "-O3 -fdump-tree-cunrolli-details" } */ int a[2]; test(int c) { @@ -10,4 +10,4 @@ test(int c) /* Array bounds says the loop will not roll much. */ /* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 1 times.." "cunroll"} } */ /* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunroll"} } */ -/* { dg-final { cleanup-tree-dump "cunroll" } } */ +/* { dg-final { cleanup-tree-dump "cunrolli" } } */ Index: testsuite/gcc.dg/tree-ssa/loop-23.c =================================================================== --- testsuite/gcc.dg/tree-ssa/loop-23.c (revision 193240) +++ testsuite/gcc.dg/tree-ssa/loop-23.c (working copy) @@ -1,24 +1,27 @@ /* { dg-do compile } */ /* { dg-options "-O2 -funroll-loops -fdump-tree-cunroll-details" } */ -void bla(int); +__attribute__ ((pure)) +int bla(int); -void foo(void) +int foo(void) { int i; + int sum; /* This loop used to appear to be too large for unrolling. */ for (i = 0; i < 4; i++) { - bla (i); - bla (2*i); - bla (3*i); - bla (4*i); - bla (5*i); - bla (6*i); - bla (7*i); - bla (8*i); + sum += bla (i); + sum += bla (2*i); + sum += bla (3*i); + sum += bla (4*i); + sum += bla (5*i); + sum += bla (6*i); + sum += bla (7*i); + sum += bla (8*i); } + return sum; } /* { dg-final { scan-tree-dump-times "Unrolled loop 1 completely" 1 "cunroll" } } */ Index: testsuite/gcc.dg/tree-prof/unroll-1.c =================================================================== --- testsuite/gcc.dg/tree-prof/unroll-1.c (revision 193240) +++ testsuite/gcc.dg/tree-prof/unroll-1.c (working copy) @@ -21,4 +21,3 @@ main() } /* { dg-final-use { scan-rtl-dump "Considering unrolling loop with constant number of iterations" "loop2_unroll" } } */ /* { dg-final-use { cleanup-rtl-dump "Not unrolling loop, doesn't roll" } } */ -/* { dg-options "-O3 -fdump-rtl-loop2_unroll -funroll-loops -fno-peel-loops" } */ Index: tree-ssa-loop-ivcanon.c =================================================================== --- tree-ssa-loop-ivcanon.c (revision 193240) +++ tree-ssa-loop-ivcanon.c (working copy) @@ -140,6 +140,20 @@ struct loop_size instructions after exit are not executed. */ int last_iteration; int last_iteration_eliminated_by_peeling; + + /* If some IV computation will become constant. */ + bool constant_iv; + + /* Number of call stmts that are not a builtin and are pure or const + present on the hot path. */ + int num_pure_calls_on_hot_path; + /* Number of call stmts that are not a builtin and are not pure nor const + present on the hot path. */ + int num_non_pure_calls_on_hot_path; + /* Number of statements other than calls in the loop. */ + int non_call_stmts_on_hot_path; + /* Number of branches seen on the hot path. */ + int num_branches_on_hot_path; }; /* Return true if OP in STMT will be constant after peeling LOOP. */ @@ -188,7 +202,11 @@ constant_after_peeling (tree op, gimple return true; } -/* Computes an estimated number of insns in LOOP, weighted by WEIGHTS. +/* Computes an estimated number of insns in LOOP. + EXIT (if non-NULL) is an exite edge that will be eliminated in all but last + iteration of the loop. + EDGE_TO_CANCEL (if non-NULL) is an non-exit edge eliminated in the last iteration + of loop. Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT. */ static void @@ -198,11 +216,17 @@ tree_estimate_loop_size (struct loop *lo gimple_stmt_iterator gsi; unsigned int i; bool after_exit; + VEC (basic_block, heap) *path = get_loop_hot_path (loop); size->overall = 0; size->eliminated_by_peeling = 0; size->last_iteration = 0; size->last_iteration_eliminated_by_peeling = 0; + size->num_pure_calls_on_hot_path = 0; + size->num_non_pure_calls_on_hot_path = 0; + size->non_call_stmts_on_hot_path = 0; + size->num_branches_on_hot_path = 0; + size->constant_iv = 0; if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num); @@ -221,6 +245,8 @@ tree_estimate_loop_size (struct loop *lo gimple stmt = gsi_stmt (gsi); int num = estimate_num_insns (stmt, &eni_size_weights); bool likely_eliminated = false; + bool likely_eliminated_last = false; + bool likely_eliminated_peeled = false; if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -231,11 +257,21 @@ tree_estimate_loop_size (struct loop *lo /* Look for reasons why we might optimize this stmt away. */ /* Exit conditional. */ - if (exit && body[i] == exit->src && stmt == last_stmt (exit->src)) + if (exit && body[i] == exit->src + && stmt == last_stmt (exit->src)) { if (dump_file && (dump_flags & TDF_DETAILS)) - fprintf (dump_file, " Exit condition will be eliminated.\n"); - likely_eliminated = true; + fprintf (dump_file, " Exit condition will be eliminated " + "in peeled copies.\n"); + likely_eliminated_peeled = true; + } + else if (edge_to_cancel && body[i] == edge_to_cancel->src + && stmt == last_stmt (edge_to_cancel->src)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " Exit condition will be eliminated " + "in last copy.\n"); + likely_eliminated_last = true; } /* Sets of IV variables */ else if (gimple_code (stmt) == GIMPLE_ASSIGN @@ -249,19 +285,22 @@ tree_estimate_loop_size (struct loop *lo /* Assignments of IV variables. */ else if (gimple_code (stmt) == GIMPLE_ASSIGN && TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME - && constant_after_peeling (gimple_assign_rhs1 (stmt), stmt,loop) + && constant_after_peeling (gimple_assign_rhs1 (stmt), stmt, loop) && (gimple_assign_rhs_class (stmt) != GIMPLE_BINARY_RHS || constant_after_peeling (gimple_assign_rhs2 (stmt), stmt, loop))) { + size->constant_iv = true; if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, " Constant expression will be folded away.\n"); likely_eliminated = true; } /* Conditionals. */ - else if (gimple_code (stmt) == GIMPLE_COND - && constant_after_peeling (gimple_cond_lhs (stmt), stmt, loop) - && constant_after_peeling (gimple_cond_rhs (stmt), stmt, loop)) + else if ((gimple_code (stmt) == GIMPLE_COND + && constant_after_peeling (gimple_cond_lhs (stmt), stmt, loop) + && constant_after_peeling (gimple_cond_rhs (stmt), stmt, loop)) + || (gimple_code (stmt) == GIMPLE_SWITCH + && constant_after_peeling (gimple_switch_index (stmt), stmt, loop))) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, " Constant conditional.\n"); @@ -269,16 +308,49 @@ tree_estimate_loop_size (struct loop *lo } size->overall += num; - if (likely_eliminated) + if (likely_eliminated || likely_eliminated_peeled) size->eliminated_by_peeling += num; if (!after_exit) { size->last_iteration += num; - if (likely_eliminated) + if (likely_eliminated || likely_eliminated_last) size->last_iteration_eliminated_by_peeling += num; } } } + while (VEC_length (basic_block, path)) + { + basic_block bb = VEC_pop (basic_block, path); + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) + { + gimple stmt = gsi_stmt (gsi); + if (gimple_code (stmt) == GIMPLE_CALL) + { + int flags = gimple_call_flags (stmt); + tree decl = gimple_call_fndecl (stmt); + + if (decl && DECL_IS_BUILTIN (decl) + && is_inexpensive_builtin (decl)) + ; + else if (flags & (ECF_PURE | ECF_CONST)) + size->num_pure_calls_on_hot_path++; + else + size->num_non_pure_calls_on_hot_path++; + size->num_branches_on_hot_path ++; + } + else if (gimple_code (stmt) != GIMPLE_CALL + && gimple_code (stmt) != GIMPLE_DEBUG) + size->non_call_stmts_on_hot_path++; + if (((gimple_code (stmt) == GIMPLE_COND + && (!constant_after_peeling (gimple_cond_lhs (stmt), stmt, loop) + || constant_after_peeling (gimple_cond_rhs (stmt), stmt, loop))) + || (gimple_code (stmt) == GIMPLE_SWITCH + && !constant_after_peeling (gimple_switch_index (stmt), stmt, loop))) + && (!exit || bb != exit->src)) + size->num_branches_on_hot_path++; + } + } + VEC_free (basic_block, heap, path); if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "size: %i-%i, last_iteration: %i-%i\n", size->overall, size->eliminated_by_peeling, size->last_iteration, @@ -644,34 +716,85 @@ try_unroll_loop_completely (struct loop (int) unr_insns); } + /* If the code is going to shrink, we don't need to be extra cautious + on guessing if the unrolling is going to be profitable. */ + if (unr_insns + /* If there is IV variable that will become constant, we save + one instruction in the loop prologue we do not account + otherwise. */ + <= ninsns + (size.constant_iv != false)) + ; /* We unroll only inner loops, because we do not consider it profitable otheriwse. We still can cancel loopback edge of not rolling loop; this is always a good idea. */ - if (loop->inner && unr_insns > ninsns) + else if (ul == UL_NO_GROWTH) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Not unrolling loop %d: size would grow.\n", + loop->num); + return false; + } + /* Outer loops tend to be less interesting candidates for complette + unrolling unless we can do a lot of propagation into the inner loop + body. For now we disable outer loop unrolling when the code would + grow. */ + else if (loop->inner) { if (dump_file && (dump_flags & TDF_DETAILS)) - fprintf (dump_file, "Not unrolling loop %d:" + fprintf (dump_file, "Not unrolling loop %d: " "it is not innermost and code would grow.\n", loop->num); return false; } - - if (unr_insns > ninsns - && (unr_insns - > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS))) + /* If there is call on a hot path through the loop, then + there is most probably not much to optimize. */ + else if (size.num_non_pure_calls_on_hot_path) { if (dump_file && (dump_flags & TDF_DETAILS)) - fprintf (dump_file, "Not unrolling loop %d " - "(--param max-completely-peeled-insns limit reached).\n", + fprintf (dump_file, "Not unrolling loop %d: " + "contains call and code would grow.\n", loop->num); return false; } - - if (ul == UL_NO_GROWTH - && unr_insns > ninsns) + /* If there is pure/const call in the function, then we + can still optimize the unrolled loop body if it contains + some other interesting code than the calls and code + storing or cumulating the return value. */ + else if (size.num_pure_calls_on_hot_path + /* One IV increment, one test, one ivtmp store + and one usefull stmt. That is about minimal loop + doing pure call. */ + && (size.non_call_stmts_on_hot_path + <= 3 + size.num_pure_calls_on_hot_path)) { if (dump_file && (dump_flags & TDF_DETAILS)) - fprintf (dump_file, "Not unrolling loop %d: size would grow.\n", + fprintf (dump_file, "Not unrolling loop %d: " + "contains just pure calls and code would grow.\n", + loop->num); + return false; + } + /* Complette unrolling is major win when control flow is removed and + one big basic block is created. If the loop contains control flow + the optimization may still be a win because of eliminating the loop + overhead but it also may blow the branch predictor tables. + Limit number of branches on the hot path through the peeled + sequence. */ + else if (size.num_branches_on_hot_path * (int)n_unroll + > PARAM_VALUE (PARAM_MAX_PEEL_BRANCHES)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Not unrolling loop %d: " + " number of branches on hot path in the unrolled sequence" + " reach --param max-peel-branches limit.\n", + loop->num); + return false; + } + else if (unr_insns + > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Not unrolling loop %d: " + "(--param max-completely-peeled-insns limit reached).\n", loop->num); return false; } @@ -689,6 +812,8 @@ try_unroll_loop_completely (struct loop { free_original_copy_tables (); free (wont_exit); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Failed to duplicate the loop\n"); return false; } @@ -968,7 +1093,7 @@ tree_unroll_loops_completely (bool may_i { struct loop *loop_father = loop_outer (loop); - if (may_increase_size && optimize_loop_for_speed_p (loop) + if (may_increase_size && optimize_loop_nest_for_speed_p (loop) /* Unroll outermost loops only if asked to do so or they do not cause code growth. */ && (unroll_outer || loop_outer (loop_father))) Index: cfgloop.h =================================================================== --- cfgloop.h (revision 193240) +++ cfgloop.h (working copy) @@ -714,6 +714,7 @@ extern void doloop_optimize_loops (void) extern void move_loop_invariants (void); extern bool finite_loop_p (struct loop *); extern void scale_loop_profile (struct loop *loop, int scale, int iteration_bound); +extern VEC (basic_block, heap) * get_loop_hot_path (const struct loop *loop); /* Returns the outermost loop of the loop nest that contains LOOP.*/ static inline struct loop * Index: params.def =================================================================== --- params.def (revision 193240) +++ params.def (working copy) @@ -291,6 +291,11 @@ DEFPARAM(PARAM_MAX_PEEL_TIMES, "max-peel-times", "The maximum number of peelings of a single loop", 16, 0, 0) +/* The maximum number of peelings of a single loop that is peeled completely. */ +DEFPARAM(PARAM_MAX_PEEL_BRANCHES, + "max-peel-branches", + "The maximum number of branches on the path through the peeled sequence", + 32, 0, 0) /* The maximum number of insns of a peeled loop. */ DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS, "max-completely-peeled-insns", Index: doc/invoke.texi =================================================================== --- doc/invoke.texi (revision 193240) +++ doc/invoke.texi (working copy) @@ -9085,6 +9085,9 @@ the loop code is peeled. @item max-peel-times The maximum number of peelings of a single loop. +@item max-peel-branches +The maximum number of branches on the hot path through the peeled sequence. + @item max-completely-peeled-insns The maximum number of insns of a completely peeled loop.