Patchwork PATCH: stop ivopts from pessimizing code (PR42505)

login
register
mail settings
Submitter Sandra Loosemore
Date June 19, 2010, 3:59 a.m.
Message ID <4C1C4084.4060200@codesourcery.com>
Download mbox | patch
Permalink /patch/56236/
State New
Headers show

Comments

Sandra Loosemore - June 19, 2010, 3:59 a.m.
I spent a lot of time bumbling around in the ivopts cost model for PR42505 
before I finally realized that, even using its existing cost functions, the 
supposedly "optimized" solution it was coming up with had a higher cost metric 
than the original code for the test case in the PR.  Apparently its approach of 
using a heuristic to choose an initial solution and then jiggling it to try to 
find lower-cost alternatives is incapable of jiggling hard enough to find its 
way back to the original set of IV candidates.  So, the simple solution is just 
to compare what it comes up with against the cost of the original set, but 
having done that much it's not much more work to try jiggling the original set 
too to see if it can be optimized further.  That's what I've implemented in the 
attached patch.

The one change to the cost model that I found to be worthwhile was using 
call_used_regs rather than fixed_regs to count the number of registers available 
for holding loop variables.  E.g., on Thumb-1 there are really only 4 registers 
available, not 9, so ivopts was significantly underestimating register pressure 
in the test case filed with the PR.

I tested this patch by doing full bootstrap and regression test on x86-64 
native.  I did some initial size benchmarking with CSiBE on both x86-64 and ARM. 
   Then I did some more detailed benchmarking on ARM for both code size and 
speed.  Here's a summary of the numbers.

Size benchmarks (smaller is better):
   x86:
     CSiBE	-0.22%
   ARMv5TE Thumb-1:
     CSiBE	-0.15%
     CoreMark	-2.7%
     eembc CINT	-0.1%
     spec2000	-0.1%
   ARMv7-a Thumb-2:
     CSiBE	-0.19%
     CoreMark	-5.7%
     eembc CINT	-0.4%
     spec2000	-0.3%

Speed benchmarks (bigger is better):
   ARMv5TE:
     CoreMark	-0.5%
     eembc CINT	-0.3%
     spec2000	-1.2%
   ARMv7-a:
     CoreMark	+1.3%
     eembc CINT	+0.6%
     spec2000	-0.3%

With only the change to check for pessimization and not the costs change, speed 
numbers went down on both ARM targets.  The costs change helped a lot more on 
ARMv7-a than it did on ARMv5TE.  So, perhaps there is still something wrong with 
the ivopts costs model, but perhaps this is just tripping over bugs somewhere 
else in the compiler.  I noted that on both ARMv5TE and ARMv7-a, some of the 
individual spec2000 and eembc benchmarks had quite large regressions (> 5%).  I 
spent a couple days on some further experiments with the cost model, but nothing 
I tried gave an obvious improvement.  My gut feeling now is that there are 
unnecessary spills being generated somewhere else and that there's no point in 
further tinkering with ivopts costs without some specific evidence or test cases 
that indicate that's where the problem is.

Anyway, I've about run out of time for working on this issue for now.  I think 
both changes in my current patch are abstractly improvements over the existing 
code, and from an experimental point of view at least the code size improvements 
are good even if the speed results are mixed.  So, OK to check in the patch as-is?

-Sandra


2010-06-18  Sandra Loosemore  <sandra@codesourcery.com>

	PR middle-end/42505

	gcc/
	* tree-ssa-loop-ivopts.c (determine_set_costs): Delete obsolete
	comments about cost model.
	(try_add_cand_for):  Add second strategy for choosing initial set
	based on original IVs, controlled by ORIGINALP argument.
	(get_initial_solution): Add ORIGINALP argument.
	(find_optimal_iv_set_1): New function, split from find_optimal_iv_set.
	(find_optimal_iv_set): Try two different strategies for choosing
	the IV set, and return the one with lower cost.

	* cfgloopanal.c (init_set_costs): Use call_used_regs rather than
	fixed_regs to count number of registers available for loop variables.

	gcc/testsuite/
	* gcc.target/arm/pr42505.c: New test case.
Sandra Loosemore - July 2, 2010, 12:08 p.m.
Ping?

http://gcc.gnu.org/ml/gcc-patches/2010-06/msg01920.html

-Sandra

> 2010-06-18  Sandra Loosemore  <sandra@codesourcery.com>
> 
>     PR middle-end/42505
> 
>     gcc/
>     * tree-ssa-loop-ivopts.c (determine_set_costs): Delete obsolete
>     comments about cost model.
>     (try_add_cand_for):  Add second strategy for choosing initial set
>     based on original IVs, controlled by ORIGINALP argument.
>     (get_initial_solution): Add ORIGINALP argument.
>     (find_optimal_iv_set_1): New function, split from find_optimal_iv_set.
>     (find_optimal_iv_set): Try two different strategies for choosing
>     the IV set, and return the one with lower cost.
> 
>     * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>     fixed_regs to count number of registers available for loop variables.
> 
>     gcc/testsuite/
>     * gcc.target/arm/pr42505.c: New test case.
>
Zdenek Dvorak - July 2, 2010, 2:07 p.m.
Hi,

> Ping?
>
> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg01920.html

what are the effects on the compile time? The tree-ssa-loop-ivopts.c part of the
patch is OK, assuming that they are negligible.

>> 2010-06-18  Sandra Loosemore  <sandra@codesourcery.com>
>>
>>     PR middle-end/42505
>>
>>     gcc/
>>     * tree-ssa-loop-ivopts.c (determine_set_costs): Delete obsolete
>>     comments about cost model.
>>     (try_add_cand_for):  Add second strategy for choosing initial set
>>     based on original IVs, controlled by ORIGINALP argument.
>>     (get_initial_solution): Add ORIGINALP argument.
>>     (find_optimal_iv_set_1): New function, split from find_optimal_iv_set.
>>     (find_optimal_iv_set): Try two different strategies for choosing
>>     the IV set, and return the one with lower cost.

As for this change:

>>     * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>>     fixed_regs to count number of registers available for loop variables.

Should not we make call_used_regs unavailable only if there is a function call in the
loop?

Zdenek
Sandra Loosemore - July 2, 2010, 9:42 p.m.
Zdenek Dvorak wrote:
> Hi,
> 
>> Ping?
>>
>> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg01920.html
> 
> what are the effects on the compile time? The tree-ssa-loop-ivopts.c part of the
> patch is OK, assuming that they are negligible.

I looked at the CSiBE compilation times on ARM, and yes, the changes are 
negligible.  If anything, it seems to be a little faster with my patch (smaller 
code, less work for subsequent passes).

> As for this change:
> 
>>>     * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>>>     fixed_regs to count number of registers available for loop variables.
> 
> Should not we make call_used_regs unavailable only if there is a function call in the
> loop?

I actually tried that before submitting the patch, and saw that empirically the 
results were better using call_used_regs all the time.  My theory is that, at 
least on ARM, a couple additional scratch registers are required for loading 
constants and performing some of the address computations that ivopts thinks are 
free.  It is better to be conservative in ivopts and underestimate register 
availability rather than risk introducing additional spills, which would totally 
overwhelm any savings from ivopts.

I realize that it's possible to tweak the costs model in other ways, but after a 
week or two poking at it I felt like it was a black hole.  The minimal change I 
proposed fixes a fairly obvious bug in the current code in a conservatively 
correct way, and actually helped both code size and speed, unlike other things I 
tried.  ;-)  Perhaps someone else would like to take a stab at it and/or can 
demonstrate that the more complicated version that tests for a function call in 
the loop wins on some other target, even if it produces worse code on ARM than 
my simple fix?

-Sandra
Zdenek Dvorak - July 5, 2010, 8:53 a.m.
Hi,

>> As for this change:
>>
>>>>     * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>>>>     fixed_regs to count number of registers available for loop variables.
>>
>> Should not we make call_used_regs unavailable only if there is a function call in the
>> loop?
>
> I actually tried that before submitting the patch, and saw that 
> empirically the results were better using call_used_regs all the time.  
> My theory is that, at least on ARM, a couple additional scratch registers 
> are required for loading constants and performing some of the address 
> computations that ivopts thinks are free.  It is better to be 
> conservative in ivopts and underestimate register availability rather 
> than risk introducing additional spills, which would totally overwhelm 
> any savings from ivopts.
>
> I realize that it's possible to tweak the costs model in other ways, but 
> after a week or two poking at it I felt like it was a black hole.  The 
> minimal change I proposed fixes a fairly obvious bug in the current code 
> in a conservatively correct way, and actually helped both code size and 
> speed, unlike other things I tried.  ;-)  Perhaps someone else would like 
> to take a stab at it and/or can demonstrate that the more complicated 
> version that tests for a function call in the loop wins on some other 
> target, even if it produces worse code on ARM than my simple fix?

my concern is that this change is not obviously correct, and did not get
enough testing to warrant it purely on its heuristic value.  So, I cannot
approve it unless someone picks it up and makes a more detailed analysis of the
problem,

Zdenek
Sandra Loosemore - July 5, 2010, 4:43 p.m.
Zdenek Dvorak wrote:

>>> As for this change:
>>>
>>>>>     * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>>>>>     fixed_regs to count number of registers available for loop variables.
>>> Should not we make call_used_regs unavailable only if there is a function call in the
>>> loop?
>> I actually tried that before submitting the patch, and saw that 
>> empirically the results were better using call_used_regs all the time.  
>> My theory is that, at least on ARM, a couple additional scratch registers 
>> are required for loading constants and performing some of the address 
>> computations that ivopts thinks are free.  It is better to be 
>> conservative in ivopts and underestimate register availability rather 
>> than risk introducing additional spills, which would totally overwhelm 
>> any savings from ivopts.
>>
>> I realize that it's possible to tweak the costs model in other ways, but 
>> after a week or two poking at it I felt like it was a black hole.  The 
>> minimal change I proposed fixes a fairly obvious bug in the current code 
>> in a conservatively correct way, and actually helped both code size and 
>> speed, unlike other things I tried.  ;-)  Perhaps someone else would like 
>> to take a stab at it and/or can demonstrate that the more complicated 
>> version that tests for a function call in the loop wins on some other 
>> target, even if it produces worse code on ARM than my simple fix?
> 
> my concern is that this change is not obviously correct, and did not get
> enough testing to warrant it purely on its heuristic value.  So, I cannot
> approve it unless someone picks it up and makes a more detailed analysis of the
> problem,

Hrmmmm.  Well, the current code is obviously incorrect, so I don't think it's 
right to do nothing, either.  I guess I'll check in the other uncontroversial 
part of the patch to tree-ssa-loop-ivopts.c first to get that out of the way, 
and then try resurrecting the more complicated hack for init_set_costs.

-Sandra
Zdenek Dvorak - July 5, 2010, 7:52 p.m.
Hi,

>> my concern is that this change is not obviously correct, and did not get
>> enough testing to warrant it purely on its heuristic value.  So, I cannot
>> approve it unless someone picks it up and makes a more detailed analysis of the
>> problem,
>
> Hrmmmm.  Well, the current code is obviously incorrect,

that is not so obvious to me :-)  It is rather plausible that most
performance-critical loops contain no calls, which would make it more
correct than your proposed change.  In particular, your test results
only indicate that we underestimate the register pressure, but do not
give much insight into the reasons of the underestimation,

Zdenek
H.J. Lu - July 6, 2010, 3:24 p.m.
On Mon, Jul 5, 2010 at 9:43 AM, Sandra Loosemore
<sandra@codesourcery.com> wrote:
> Zdenek Dvorak wrote:
>
>>>> As for this change:
>>>>
>>>>>>    * cfgloopanal.c (init_set_costs): Use call_used_regs rather than
>>>>>>    fixed_regs to count number of registers available for loop
>>>>>> variables.
>>>>
>>>> Should not we make call_used_regs unavailable only if there is a
>>>> function call in the
>>>> loop?
>>>
>>> I actually tried that before submitting the patch, and saw that
>>> empirically the results were better using call_used_regs all the time.  My
>>> theory is that, at least on ARM, a couple additional scratch registers are
>>> required for loading constants and performing some of the address
>>> computations that ivopts thinks are free.  It is better to be conservative
>>> in ivopts and underestimate register availability rather than risk
>>> introducing additional spills, which would totally overwhelm any savings
>>> from ivopts.
>>>
>>> I realize that it's possible to tweak the costs model in other ways, but
>>> after a week or two poking at it I felt like it was a black hole.  The
>>> minimal change I proposed fixes a fairly obvious bug in the current code in
>>> a conservatively correct way, and actually helped both code size and speed,
>>> unlike other things I tried.  ;-)  Perhaps someone else would like to take a
>>> stab at it and/or can demonstrate that the more complicated version that
>>> tests for a function call in the loop wins on some other target, even if it
>>> produces worse code on ARM than my simple fix?
>>
>> my concern is that this change is not obviously correct, and did not get
>> enough testing to warrant it purely on its heuristic value.  So, I cannot
>> approve it unless someone picks it up and makes a more detailed analysis
>> of the
>> problem,
>
> Hrmmmm.  Well, the current code is obviously incorrect, so I don't think
> it's right to do nothing, either.  I guess I'll check in the other
> uncontroversial part of the patch to tree-ssa-loop-ivopts.c first to get
> that out of the way, and then try resurrecting the more complicated hack for
> init_set_costs.
>

Your checkin:

http://gcc.gnu.org/ml/gcc-cvs/2010-07/msg00198.html

caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44838

Patch

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 160755)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -4444,26 +4444,6 @@  determine_set_costs (struct ivopts_data 
   struct loop *loop = data->current_loop;
   bitmap_iterator bi;
 
-  /* We use the following model (definitely improvable, especially the
-     cost function -- TODO):
-
-     We estimate the number of registers available (using MD data), name it A.
-
-     We estimate the number of registers used by the loop, name it U.  This
-     number is obtained as the number of loop phi nodes (not counting virtual
-     registers and bivs) + the number of variables from outside of the loop.
-
-     We set a reserve R (free regs that are used for temporary computations,
-     etc.).  For now the reserve is a constant 3.
-
-     Let I be the number of induction variables.
-
-     -- if U + I + R <= A, the cost is I * SMALL_COST (just not to encourage
-	make a lot of ivs without a reason).
-     -- if A - R < U + I <= A, the cost is I * PRES_COST
-     -- if U + I > A, the cost is I * PRES_COST and
-        number of uses * SPILL_COST * (U + I - A) / (U + I) is added.  */
-
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "Global costs:\n");
@@ -5087,11 +5067,13 @@  iv_ca_prune (struct ivopts_data *data, s
 }
 
 /* Tries to extend the sets IVS in the best possible way in order
-   to express the USE.  */
+   to express the USE.  If ORIGINALP is true, prefer candidates from
+   the original set of IVs, otherwise favor important candidates not
+   based on any memory object.  */
 
 static bool
 try_add_cand_for (struct ivopts_data *data, struct iv_ca *ivs,
-		  struct iv_use *use)
+		  struct iv_use *use, bool originalp)
 {
   comp_cost best_cost, act_cost;
   unsigned i;
@@ -5110,7 +5092,8 @@  try_add_cand_for (struct ivopts_data *da
       iv_ca_set_no_cp (data, ivs, use);
     }
 
-  /* First try important candidates not based on any memory object.  Only if
+  /* If ORIGINALP is true, try to find the original IV for the use.  Otherwise
+     first try important candidates not based on any memory object.  Only if
      this fails, try the specific ones.  Rationale -- in loops with many
      variables the best choice often is to use just one generic biv.  If we
      added here many ivs specific to the uses, the optimization algorithm later
@@ -5122,7 +5105,10 @@  try_add_cand_for (struct ivopts_data *da
     {
       cand = iv_cand (data, i);
 
-      if (cand->iv->base_object != NULL_TREE)
+      if (originalp && cand->pos !=IP_ORIGINAL)
+	continue;
+
+      if (!originalp && cand->iv->base_object != NULL_TREE)
 	continue;
 
       if (iv_ca_cand_used_p (ivs, cand))
@@ -5158,8 +5144,13 @@  try_add_cand_for (struct ivopts_data *da
 	    continue;
 
 	  /* Already tried this.  */
-	  if (cand->important && cand->iv->base_object == NULL_TREE)
-	    continue;
+	  if (cand->important)
+	    {
+	      if (originalp && cand->pos == IP_ORIGINAL)
+		continue;
+	      if (!originalp && cand->iv->base_object == NULL_TREE)
+		continue;
+	    }
 
 	  if (iv_ca_cand_used_p (ivs, cand))
 	    continue;
@@ -5193,13 +5184,13 @@  try_add_cand_for (struct ivopts_data *da
 /* Finds an initial assignment of candidates to uses.  */
 
 static struct iv_ca *
-get_initial_solution (struct ivopts_data *data)
+get_initial_solution (struct ivopts_data *data, bool originalp)
 {
   struct iv_ca *ivs = iv_ca_new (data);
   unsigned i;
 
   for (i = 0; i < n_iv_uses (data); i++)
-    if (!try_add_cand_for (data, ivs, iv_use (data, i)))
+    if (!try_add_cand_for (data, ivs, iv_use (data, i), originalp))
       {
 	iv_ca_free (&ivs);
 	return NULL;
@@ -5271,14 +5262,12 @@  try_improve_iv_set (struct ivopts_data *
    solution and remove the unused ivs while this improves the cost.  */
 
 static struct iv_ca *
-find_optimal_iv_set (struct ivopts_data *data)
+find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
 {
-  unsigned i;
   struct iv_ca *set;
-  struct iv_use *use;
 
   /* Get the initial solution.  */
-  set = get_initial_solution (data);
+  set = get_initial_solution (data, originalp);
   if (!set)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -5301,11 +5290,46 @@  find_optimal_iv_set (struct ivopts_data 
 	}
     }
 
+  return set;
+}
+
+static struct iv_ca *
+find_optimal_iv_set (struct ivopts_data *data)
+{
+  unsigned i;
+  struct iv_ca *set, *origset;
+  struct iv_use *use;
+  comp_cost cost, origcost;
+
+  /* Determine the cost based on a strategy that starts with original IVs,
+     and try again using a strategy that prefers candidates not based
+     on any IVs.  */
+  origset = find_optimal_iv_set_1 (data, true);
+  set = find_optimal_iv_set_1 (data, false);
+
+  if (!origset && !set)
+    return NULL;
+
+  origcost = origset ? iv_ca_cost (origset) : infinite_cost;
+  cost = set ? iv_ca_cost (set) : infinite_cost;
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      comp_cost cost = iv_ca_cost (set);
-      fprintf (dump_file, "Final cost %d (complexity %d)\n\n", cost.cost, cost.complexity);
+      fprintf (dump_file, "Original cost %d (complexity %d)\n\n",
+	       origcost.cost, origcost.complexity);
+      fprintf (dump_file, "Final cost %d (complexity %d)\n\n",
+	       cost.cost, cost.complexity);
+    }
+
+  /* Choose the one with the best cost.  */
+  if (compare_costs (origcost, cost) <= 0)
+    {
+      if (set)
+	iv_ca_free (&set);
+      set = origset;
     }
+  else if (origset)
+    iv_ca_free (&origset);
 
   for (i = 0; i < n_iv_uses (data); i++)
     {
Index: gcc/cfgloopanal.c
===================================================================
--- gcc/cfgloopanal.c	(revision 160755)
+++ gcc/cfgloopanal.c	(working copy)
@@ -344,7 +344,7 @@  init_set_costs (void)
   target_avail_regs = 0;
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     if (TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], i)
-	&& !fixed_regs[i])
+	&& !call_used_regs[i])
       target_avail_regs++;
 
   target_res_regs = 3;
Index: gcc/testsuite/gcc.target/arm/pr42505.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr42505.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr42505.c	(revision 0)
@@ -0,0 +1,23 @@ 
+/* { dg-options "-mthumb -Os -march=armv5te" }  */
+/* { dg-require-effective-target arm_thumb1_ok } */
+/* { dg-final { scan-assembler-not "str\[\\t \]*r.,\[\\t \]*.sp," } } */
+
+struct A {
+ int f1;
+ int f2;
+};
+
+int func(int c);
+
+/* This function should not need to spill anything to the stack.  */
+int test(struct A* src, struct A* dst, int count)
+{
+  while (count--) {
+    if (!func(src->f2)) {
+        return 0;
+      }
+      *dst++ = *src++;
+  }
+
+  return 1;
+}