From patchwork Sat Jun 19 03:59:00 2010 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sandra Loosemore X-Patchwork-Id: 56236 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) by ozlabs.org (Postfix) with SMTP id 2FCDA1007D4 for ; Sat, 19 Jun 2010 13:57:48 +1000 (EST) Received: (qmail 20180 invoked by alias); 19 Jun 2010 03:57:46 -0000 Received: (qmail 20169 invoked by uid 22791); 19 Jun 2010 03:57:44 -0000 X-SWARE-Spam-Status: No, hits=-1.8 required=5.0 tests=AWL, BAYES_00, T_RP_MATCHES_RCVD X-Spam-Check-By: sourceware.org Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (38.113.113.100) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Sat, 19 Jun 2010 03:57:39 +0000 Received: (qmail 2793 invoked from network); 19 Jun 2010 03:57:37 -0000 Received: from unknown (HELO ?192.168.2.3?) (sandra@127.0.0.2) by mail.codesourcery.com with ESMTPA; 19 Jun 2010 03:57:37 -0000 Message-ID: <4C1C4084.4060200@codesourcery.com> Date: Fri, 18 Jun 2010 23:59:00 -0400 From: Sandra Loosemore User-Agent: Thunderbird 2.0.0.23 (X11/20090817) MIME-Version: 1.0 To: gcc-patches@gcc.gnu.org Subject: PATCH: stop ivopts from pessimizing code (PR42505) Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org I spent a lot of time bumbling around in the ivopts cost model for PR42505 before I finally realized that, even using its existing cost functions, the supposedly "optimized" solution it was coming up with had a higher cost metric than the original code for the test case in the PR. Apparently its approach of using a heuristic to choose an initial solution and then jiggling it to try to find lower-cost alternatives is incapable of jiggling hard enough to find its way back to the original set of IV candidates. So, the simple solution is just to compare what it comes up with against the cost of the original set, but having done that much it's not much more work to try jiggling the original set too to see if it can be optimized further. That's what I've implemented in the attached patch. The one change to the cost model that I found to be worthwhile was using call_used_regs rather than fixed_regs to count the number of registers available for holding loop variables. E.g., on Thumb-1 there are really only 4 registers available, not 9, so ivopts was significantly underestimating register pressure in the test case filed with the PR. I tested this patch by doing full bootstrap and regression test on x86-64 native. I did some initial size benchmarking with CSiBE on both x86-64 and ARM. Then I did some more detailed benchmarking on ARM for both code size and speed. Here's a summary of the numbers. Size benchmarks (smaller is better): x86: CSiBE -0.22% ARMv5TE Thumb-1: CSiBE -0.15% CoreMark -2.7% eembc CINT -0.1% spec2000 -0.1% ARMv7-a Thumb-2: CSiBE -0.19% CoreMark -5.7% eembc CINT -0.4% spec2000 -0.3% Speed benchmarks (bigger is better): ARMv5TE: CoreMark -0.5% eembc CINT -0.3% spec2000 -1.2% ARMv7-a: CoreMark +1.3% eembc CINT +0.6% spec2000 -0.3% With only the change to check for pessimization and not the costs change, speed numbers went down on both ARM targets. The costs change helped a lot more on ARMv7-a than it did on ARMv5TE. So, perhaps there is still something wrong with the ivopts costs model, but perhaps this is just tripping over bugs somewhere else in the compiler. I noted that on both ARMv5TE and ARMv7-a, some of the individual spec2000 and eembc benchmarks had quite large regressions (> 5%). I spent a couple days on some further experiments with the cost model, but nothing I tried gave an obvious improvement. My gut feeling now is that there are unnecessary spills being generated somewhere else and that there's no point in further tinkering with ivopts costs without some specific evidence or test cases that indicate that's where the problem is. Anyway, I've about run out of time for working on this issue for now. I think both changes in my current patch are abstractly improvements over the existing code, and from an experimental point of view at least the code size improvements are good even if the speed results are mixed. So, OK to check in the patch as-is? -Sandra 2010-06-18 Sandra Loosemore PR middle-end/42505 gcc/ * tree-ssa-loop-ivopts.c (determine_set_costs): Delete obsolete comments about cost model. (try_add_cand_for): Add second strategy for choosing initial set based on original IVs, controlled by ORIGINALP argument. (get_initial_solution): Add ORIGINALP argument. (find_optimal_iv_set_1): New function, split from find_optimal_iv_set. (find_optimal_iv_set): Try two different strategies for choosing the IV set, and return the one with lower cost. * cfgloopanal.c (init_set_costs): Use call_used_regs rather than fixed_regs to count number of registers available for loop variables. gcc/testsuite/ * gcc.target/arm/pr42505.c: New test case. Index: gcc/tree-ssa-loop-ivopts.c =================================================================== --- gcc/tree-ssa-loop-ivopts.c (revision 160755) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -4444,26 +4444,6 @@ determine_set_costs (struct ivopts_data struct loop *loop = data->current_loop; bitmap_iterator bi; - /* We use the following model (definitely improvable, especially the - cost function -- TODO): - - We estimate the number of registers available (using MD data), name it A. - - We estimate the number of registers used by the loop, name it U. This - number is obtained as the number of loop phi nodes (not counting virtual - registers and bivs) + the number of variables from outside of the loop. - - We set a reserve R (free regs that are used for temporary computations, - etc.). For now the reserve is a constant 3. - - Let I be the number of induction variables. - - -- if U + I + R <= A, the cost is I * SMALL_COST (just not to encourage - make a lot of ivs without a reason). - -- if A - R < U + I <= A, the cost is I * PRES_COST - -- if U + I > A, the cost is I * PRES_COST and - number of uses * SPILL_COST * (U + I - A) / (U + I) is added. */ - if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "Global costs:\n"); @@ -5087,11 +5067,13 @@ iv_ca_prune (struct ivopts_data *data, s } /* Tries to extend the sets IVS in the best possible way in order - to express the USE. */ + to express the USE. If ORIGINALP is true, prefer candidates from + the original set of IVs, otherwise favor important candidates not + based on any memory object. */ static bool try_add_cand_for (struct ivopts_data *data, struct iv_ca *ivs, - struct iv_use *use) + struct iv_use *use, bool originalp) { comp_cost best_cost, act_cost; unsigned i; @@ -5110,7 +5092,8 @@ try_add_cand_for (struct ivopts_data *da iv_ca_set_no_cp (data, ivs, use); } - /* First try important candidates not based on any memory object. Only if + /* If ORIGINALP is true, try to find the original IV for the use. Otherwise + first try important candidates not based on any memory object. Only if this fails, try the specific ones. Rationale -- in loops with many variables the best choice often is to use just one generic biv. If we added here many ivs specific to the uses, the optimization algorithm later @@ -5122,7 +5105,10 @@ try_add_cand_for (struct ivopts_data *da { cand = iv_cand (data, i); - if (cand->iv->base_object != NULL_TREE) + if (originalp && cand->pos !=IP_ORIGINAL) + continue; + + if (!originalp && cand->iv->base_object != NULL_TREE) continue; if (iv_ca_cand_used_p (ivs, cand)) @@ -5158,8 +5144,13 @@ try_add_cand_for (struct ivopts_data *da continue; /* Already tried this. */ - if (cand->important && cand->iv->base_object == NULL_TREE) - continue; + if (cand->important) + { + if (originalp && cand->pos == IP_ORIGINAL) + continue; + if (!originalp && cand->iv->base_object == NULL_TREE) + continue; + } if (iv_ca_cand_used_p (ivs, cand)) continue; @@ -5193,13 +5184,13 @@ try_add_cand_for (struct ivopts_data *da /* Finds an initial assignment of candidates to uses. */ static struct iv_ca * -get_initial_solution (struct ivopts_data *data) +get_initial_solution (struct ivopts_data *data, bool originalp) { struct iv_ca *ivs = iv_ca_new (data); unsigned i; for (i = 0; i < n_iv_uses (data); i++) - if (!try_add_cand_for (data, ivs, iv_use (data, i))) + if (!try_add_cand_for (data, ivs, iv_use (data, i), originalp)) { iv_ca_free (&ivs); return NULL; @@ -5271,14 +5262,12 @@ try_improve_iv_set (struct ivopts_data * solution and remove the unused ivs while this improves the cost. */ static struct iv_ca * -find_optimal_iv_set (struct ivopts_data *data) +find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp) { - unsigned i; struct iv_ca *set; - struct iv_use *use; /* Get the initial solution. */ - set = get_initial_solution (data); + set = get_initial_solution (data, originalp); if (!set) { if (dump_file && (dump_flags & TDF_DETAILS)) @@ -5301,11 +5290,46 @@ find_optimal_iv_set (struct ivopts_data } } + return set; +} + +static struct iv_ca * +find_optimal_iv_set (struct ivopts_data *data) +{ + unsigned i; + struct iv_ca *set, *origset; + struct iv_use *use; + comp_cost cost, origcost; + + /* Determine the cost based on a strategy that starts with original IVs, + and try again using a strategy that prefers candidates not based + on any IVs. */ + origset = find_optimal_iv_set_1 (data, true); + set = find_optimal_iv_set_1 (data, false); + + if (!origset && !set) + return NULL; + + origcost = origset ? iv_ca_cost (origset) : infinite_cost; + cost = set ? iv_ca_cost (set) : infinite_cost; + if (dump_file && (dump_flags & TDF_DETAILS)) { - comp_cost cost = iv_ca_cost (set); - fprintf (dump_file, "Final cost %d (complexity %d)\n\n", cost.cost, cost.complexity); + fprintf (dump_file, "Original cost %d (complexity %d)\n\n", + origcost.cost, origcost.complexity); + fprintf (dump_file, "Final cost %d (complexity %d)\n\n", + cost.cost, cost.complexity); + } + + /* Choose the one with the best cost. */ + if (compare_costs (origcost, cost) <= 0) + { + if (set) + iv_ca_free (&set); + set = origset; } + else if (origset) + iv_ca_free (&origset); for (i = 0; i < n_iv_uses (data); i++) { Index: gcc/cfgloopanal.c =================================================================== --- gcc/cfgloopanal.c (revision 160755) +++ gcc/cfgloopanal.c (working copy) @@ -344,7 +344,7 @@ init_set_costs (void) target_avail_regs = 0; for (i = 0; i < FIRST_PSEUDO_REGISTER; i++) if (TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], i) - && !fixed_regs[i]) + && !call_used_regs[i]) target_avail_regs++; target_res_regs = 3; Index: gcc/testsuite/gcc.target/arm/pr42505.c =================================================================== --- gcc/testsuite/gcc.target/arm/pr42505.c (revision 0) +++ gcc/testsuite/gcc.target/arm/pr42505.c (revision 0) @@ -0,0 +1,23 @@ +/* { dg-options "-mthumb -Os -march=armv5te" } */ +/* { dg-require-effective-target arm_thumb1_ok } */ +/* { dg-final { scan-assembler-not "str\[\\t \]*r.,\[\\t \]*.sp," } } */ + +struct A { + int f1; + int f2; +}; + +int func(int c); + +/* This function should not need to spill anything to the stack. */ +int test(struct A* src, struct A* dst, int count) +{ + while (count--) { + if (!func(src->f2)) { + return 0; + } + *dst++ = *src++; + } + + return 1; +}