Message ID | 4FCE3512.7070607@codesourcery.com |
---|---|
State | New |
Headers | show |
On Tue, Jun 5, 2012 at 6:34 PM, Sandra Loosemore <sandra@codesourcery.com> wrote: > My colleagues and I have been working on the GCC port for the Qualcomm > Hexagon. Along the way I noticed that we were getting poor results > from the ivopts pass no matter how we adjusted the target-specific RTX > costs. In many cases ivopts was coming up with candidate use costs > that seemed completely inconsistent with the target cost model. On > further inspection, I found what appears to be a whole bunch of bugs > in the way ivopts is computing address costs: > > (1) While the address cost computation is assuming in some situations > that pre/post increment/decrement addressing will be used if > supported by the target, it isn't actually using the target's address > cost for such forms -- instead, just the cost of the form that would > be used if autoinc weren't available/applicable. > > (2) The computation to determine which multiplier values are supported > by target addressing modes is constructing an address rtx of the form > (reg * ratio) to do the tests. This isn't a valid address RTX on > Hexagon, although both (reg + reg * ratio) and (sym + reg * ratio) > are. Because it's choosing the wrong address form to probe with, it > thinks that the target doesn't support multipliers at all and is > incorrectly tacking on an extra cost for them. I also note that it's > assuming that the same set of ratios are supported by all three > address forms that can potentially include them, and that all valid > ratios have the same cost. > > (3) The computation to determine the range of valid constant offsets > for address forms that can include them is probing the upper end of > the range using constants of the form ((1<<n) - 1). On Hexagon, the > offsets have to be aligned appropriately for the mode, so it's > incorrectly rejecting all positive offsets for non-char modes. And > again, it's assuming that the same range of offsets are supported by > all address forms that can legitimately include them, and that all > valid offsets have the same cost. The latter isn't true on Hexagon. > > (4) The cost adjustment for converting the symbol_present address to a > var_present address seems overly optimistic in assuming that the > symbol load will be hoisted outside the loop. I looked at a lot of > code where this was not happening no matter how expensive I made the > absolute addressing forms in the target-specific costs. Also, if > subsequent passes actually do hoist the symbol load, this requires an > additional register to be available over the entire loop, which ivopts > isn't accounting for in any way. It seems to me that this adjustment > shouldn't be attempted when the symbol_present form is a legitimate > address (because subsequent passes are not doing the anticipated > optimization in that case). There's also a bug present in the cost > accounting: it's adding the full cost of the symbol + var addition, > whereas it should be pro-rated across iterations (see the way this is > handled for the non-address cases in get_computation_cost_at). > > (5) The documentation of TARGET_ADDRESS_COST says that when two > (legitimate) address expressions have the same lowest cost, the one > with the higher complexity is used. But, the ivopts code is doing the > opposite of this and choosing the lower complexity as the tie-breaker. > (Actually, it's also computing the complexity without regard to > whether the address rtx is even legitimate on the target.) > > (6) The way get_address_costs is precomputing and caching a complete > set of cost data for each addressing mode seems incorrect to me, in > the general case. For example, consider MIPS where MIPS16 and MIPS32 > functions can be freely intermixed in the same compilation unit. On > Hexagon, as I've noted, the assumption that all valid multiplier and > offset values have the same cost is also invalid. The current code is > already precomputing 16 sets of address costs for each mode, plus > probing 128 addresses with multipliers for validity, and similarly > probing up to 32 or so constant offsets for validity, so I'm pretty > skeptical that precomputing and caching even more permutations would > be worthwhile. Pre-computing and caching things is to avoid creating RTXen over and over. As you have discarded this completely did you try to measure the cost of doing so in terms of produced garbage and compile-time cost? Did you consider changing the target interface of IVOPTs to a (bunch of) new target hooks that avoid the RTX generation (which in fact we are not sure that we'll end up producing anyways in exactly that form due to subsequent optimizations)? > (7) If the computed address cost turns out to be 0, the current code > (for some unknown reason) is turning that into 1, which can screw up > the relative costs of address computations vs other operations like > addition. > > I've come up with the attached patch to try to fix these things. The > biggest change is that I have discarded the code for precomputing and > caching costs and instead go straight to querying the target back end > for the cost for the specific address computation we're handed; this > makes the code a lot simpler. I would kind of like to get rid of > multiplier_allowed_in_address_p too, but it's being used in a couple > places other than the address computation and it seemed better not to > mess with that for now. The other fixes are pretty straightforward. Can you split the patch into pieces fixing the above bugs separately? Removing the pre-compute and caching is the most questionable change, the others look like real bugs (the symbol cost might be questionable as well). CC-ing Zdenek for his opinions (disclaimer: I didn't look at the actual patch). Thanks, Richard. > I bootstrapped and regression-tested the patch on x86_64. I haven't > tried to benchmark the performance effect of the patch on anything > other than Hexagon; there I found that, once ivopts actually started > paying attention to the target address costs function, it needed to be > re-tuned. So, it wouldn't surprise me if other back ends could > benefit from some tweaking as well, depending on the extent to which > they're affected by the bugs I listed above. > > Comments, complaints, proposals for alternate fixes, etc? Or OK to > commit? > > -Sandra > > > 2012-06-05 Sandra Loosemore <sandra@codesourcery.com> > > gcc/ > * tree-ssa-loop-ivopts.c (comp_cost): Make complexity field signed. > Update comments to indicate this is for addressing mode complexity. > (new_cost): Make signedness of parameters match comp_cost fields. > (compare_costs): Prefer higher complexity, not lower, per documentation > of TARGET_ADDRESS_COST. > (multiplier_allowed_in_address_p): Use (+ (* reg1 ratio) reg2) to > probe for valid ratios, rather than just (* reg1 ratio). > (get_address_cost): Rewrite to eliminate precomputation and caching. > Use target's address cost for autoinc forms if possible. Only attempt > sym_present -> var_present cost conversion if the sym_present form > is not legitimate; amortize setup cost over loop iterations. > Adjust complexity computation. > (get_computation_cost_at): Adjust call to get_address_cost. Do not > mess with complexity for non-address expressions. > (determine_use_iv_cost_address): Initialize can_autoinc. > (autoinc_possible_for_pair): Likewise. >
Hi, > > (7) If the computed address cost turns out to be 0, the current code > > (for some unknown reason) is turning that into 1, which can screw up > > the relative costs of address computations vs other operations like > > addition. > > > > I've come up with the attached patch to try to fix these things. The > > biggest change is that I have discarded the code for precomputing and > > caching costs and instead go straight to querying the target back end > > for the cost for the specific address computation we're handed; this > > makes the code a lot simpler. I would kind of like to get rid of > > multiplier_allowed_in_address_p too, but it's being used in a couple > > places other than the address computation and it seemed better not to > > mess with that for now. The other fixes are pretty straightforward. > > Can you split the patch into pieces fixing the above bugs separately? > Removing the pre-compute and caching is the most questionable change, > the others look like real bugs (the symbol cost might be questionable as > well). the changes seem reasonable to me (with the caveat about caching). On the other hand, an important thing to keep in mind is that all these are just heuristics that cannot model the actual costs very realistically. Furthermore, due to interference from further optimizations, there is no guarantee that say the choices of the addressing modes by ivopts will actually be respected. Consequently, improving the cost computation will not necessarily improve the resulting code quality, and beyond some point, any such improvements become essentially irrelevant. So, any changes complicating the model should be backed by benchmarking, as otherwise there is no way to say whether there are actually benefitial or not, Zdenek
On Tue, 5 Jun 2012, Sandra Loosemore wrote: > (1) While the address cost computation is assuming in some situations > that pre/post increment/decrement addressing will be used if > supported by the target, it isn't actually using the target's address > cost for such forms -- instead, just the cost of the form that would > be used if autoinc weren't available/applicable. There are lots of bugzilla entries complaining about bad autoinc/dec generation. Maybe your patch solves some of them? > (2) The computation to determine which multiplier values are supported > by target addressing modes is constructing an address rtx of the form > (reg * ratio) to do the tests. This isn't a valid address RTX on > Hexagon, although both (reg + reg * ratio) and (sym + reg * ratio) > are. Yeah, I've spotted this one and (7), funny in a bad way. It's not a sane addressing mode except as a corner-case of (reg*ratio + constant) (e.g. constant=sym). A value in a register, and just multiply that by a constant to use as an address? When would that be useful? Should a target include the corner-case as a special-case addressing-mode just to appease ivopts? Made me think less of ivopts. Dunno if I entered a PR, mea culpa ...doesn't seem so. > I bootstrapped and regression-tested the patch on x86_64. I haven't > tried to benchmark the performance effect of the patch on anything > other than Hexagon; there I found that, once ivopts actually started > paying attention to the target address costs function, it needed to be > re-tuned. So, it wouldn't surprise me if other back ends could > benefit from some tweaking as well, depending on the extent to which > they're affected by the bugs I listed above. Right, but the lesson learned is to just ignore effects on other targets... In all fairness, I don't think there's anything to do regarding this patch in the default cost function, but it'd nice with a heads-up before committing the final version of this patch for a change though, maybe even with rtx cost tweaking-examples from a target of your choice (in the tree) if I could wish. > Comments, complaints, proposals for alternate fixes, etc? Or OK to > commit? Thank you! Others mentioned benchmarking on some major target, so I'll just add a wish for some PR annotations, any target with ivopts-related PR's. brgds, H-P
On Fri, 2012-06-08 at 22:33 -0400, Hans-Peter Nilsson wrote: > On Tue, 5 Jun 2012, Sandra Loosemore wrote: > > > (1) While the address cost computation is assuming in some situations > > that pre/post increment/decrement addressing will be used if > > supported by the target, it isn't actually using the target's address > > cost for such forms -- instead, just the cost of the form that would > > be used if autoinc weren't available/applicable. > > There are lots of bugzilla entries complaining about bad > autoinc/dec generation. Maybe your patch solves some of them? > I've tried some of the cases mentioned in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 with Sandra's patch applied. Unfortunately it didn't help much. There seem to be other things going wrong with auto-inc-dec. BTW, auto-inc-dec uses 'set_src_cost' in 'attempt_change' to determine the address costs. At least the SH target will not respond to that properly. I was thinking of adding something to sh_rtx_costs to invoke sh_address_cost as a fix for that, but on the other hand I was wondering why the target's address cost function isn't used in auto-inc-dec directly ... Cheers, Oleg
On Sun, 10 Jun 2012, Oleg Endo wrote: > I've tried some of the cases mentioned in > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 > with Sandra's patch applied. Unfortunately it didn't help much. But thanks for checking! > There > seem to be other things going wrong with auto-inc-dec. Yeah, probably in more than one place. > BTW, auto-inc-dec uses 'set_src_cost' in 'attempt_change' to determine > the address costs. At least the SH target will not respond to that > properly. I was thinking of adding something to sh_rtx_costs to invoke > sh_address_cost as a fix for that, but on the other hand I was wondering > why the target's address cost function isn't used in auto-inc-dec > directly ... Sounds like a bug. TBH, I haven't dug into the real reason why auto-inc-dec-generation is still poor (or whether it by magic has improved dramatically recently), because every so often there's some effort to improve that, alas I don't remember seeing any improvement mentioned for any target I have interest in. brgds, H-P
On Sun, 2012-06-10 at 13:50 -0400, Hans-Peter Nilsson wrote: > On Sun, 10 Jun 2012, Oleg Endo wrote: > > I've tried some of the cases mentioned in > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 > > with Sandra's patch applied. Unfortunately it didn't help much. > > But thanks for checking! > Sure thing. I forgot to mention that when trying it, I also played a bit with the address cost function, but it didn't have any effect either. > > BTW, auto-inc-dec uses 'set_src_cost' in 'attempt_change' to determine > > the address costs. At least the SH target will not respond to that > > properly. I was thinking of adding something to sh_rtx_costs to invoke > > sh_address_cost as a fix for that, but on the other hand I was wondering > > why the target's address cost function isn't used in auto-inc-dec > > directly ... > > Sounds like a bug. Depends on the perspective ;) I don't know on/for which target it was developed originally. If this target's rtx cost function meets the expectations, then there's a bug in any other target ;) > TBH, I haven't dug into the real reason why > auto-inc-dec-generation is still poor (or whether it by magic > has improved dramatically recently), because every so often > there's some effort to improve that, alas I don't remember > seeing any improvement mentioned for any target I have interest > in. At the time when I filed the aforementioned PR it was at least able to find and generate the first post-inc addr (see the original description of the PR). Now it seems that it fails to do so. So I guess some of the pre-conditions for the auto-inc-dec pass have slightly changed since then... Cheers, Oleg
On 06/06/2012 02:29 AM, Richard Guenther wrote: > > Pre-computing and caching things is to avoid creating RTXen over and over. > As you have discarded this completely did you try to measure the cost > of doing so in terms of produced garbage and compile-time cost? Did you > consider changing the target interface of IVOPTs to a (bunch of) new > target hooks that avoid the RTX generation (which in fact we are not sure > that we'll end up producing anyways in exactly that form due to subsequent > optimizations)? Since there seemed to be resistance to removing the pre-computing of costs, I've spent much of the last week trying to glue fixes on the existing code while preserving the caching, and just am not happy with the result. It makes the code too complicated, adds additional overhead by precomputing more things, and still does not fix the lurking bugs WRT differing costs for different values of constant offsets and the like. Basically, I don't want to put my name on anything that ugly. :-P So, I went back and did some compile time benchmarking on my previously posted patch instead. I used the bzip2 and gcc test programs available here: http://people.csail.mit.edu/smcc/projects/single-file-programs/ These are respectively large and gigantic single-file programs, so you would expect the performance effects of caching to be particularly evident here as there would likely be many loops involving the same modes in each compilation unit. I compiled them using a native x86_64 build with "time gcc -c -O3", and ran each set of timings 3 times with an unmodified build and with my previously-posted patch. And, it turns out there is no obvious difference in the results. bzip2 base: 5.82, 5.82, 5.75 bzip2 patched: 5.73, 5.71, 5.85 gcc base: 4m44.390, 4m44.270, 4m44.060 gcc patched: 4m44.210, 4m44.530, 4m44.040 > Can you split the patch into pieces fixing the above bugs separately? > Removing the pre-compute and caching is the most questionable change, > the others look like real bugs (the symbol cost might be questionable as > well). Given that most of the bugs were in the same function and were fixed by rewriting it completely, trying to split up the patch seems kind of pointless, to me. > CC-ing Zdenek for his opinions (disclaimer: I didn't look at the actual patch). Might somebody be willing to review the patch as posted? FWIW, I was doing some digging around in the mail archives to see if there was any discussion that would help me understand the rationale for the current cost model better. What I found was that when the ivopts pass was originally added back in 2004, the costs computation was one of the things that was specifically mentioned as needing work at least to document it better, but the patch was rushed through and approved without any thorough review because the Stage 3 deadline was looming. Anyway, given that get_address_cost has had a big FIXME on it all these years, it seems to me like maybe it's time to try to fix it? -Sandra
On 06/05/2012 10:34 AM, Sandra Loosemore wrote: > 2012-06-05 Sandra Loosemore<sandra@codesourcery.com> > > gcc/ > * tree-ssa-loop-ivopts.c (comp_cost): Make complexity field signed. > Update comments to indicate this is for addressing mode complexity. > (new_cost): Make signedness of parameters match comp_cost fields. > (compare_costs): Prefer higher complexity, not lower, per documentation > of TARGET_ADDRESS_COST. > (multiplier_allowed_in_address_p): Use (+ (* reg1 ratio) reg2) to > probe for valid ratios, rather than just (* reg1 ratio). > (get_address_cost): Rewrite to eliminate precomputation and caching. > Use target's address cost for autoinc forms if possible. Only attempt > sym_present -> var_present cost conversion if the sym_present form > is not legitimate; amortize setup cost over loop iterations. > Adjust complexity computation. > (get_computation_cost_at): Adjust call to get_address_cost. Do not > mess with complexity for non-address expressions. > (determine_use_iv_cost_address): Initialize can_autoinc. > (autoinc_possible_for_pair): Likewise. On 06/13/2012 01:52 PM, Sandra Loosemore wrote: > > Might somebody be willing to review the patch as posted? Ping? Original post with patch is here: http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html -Sandra
Hi, For the following code change, @@ -4212,11 +4064,6 @@ get_computation_cost_at (struct ivopts_d cost.cost += adjust_setup_cost (data, add_cost (TYPE_MODE (ctype), speed)); - /* Having offset does not affect runtime cost in case it is added to - symbol, but it increases complexity. */ - if (offset) - cost.complexity++; - cost.cost += add_cost (TYPE_MODE (ctype), speed); aratio = ratio > 0 ? ratio : -ratio; I think this shouldn't be removed. The offset may be affected by the position of inserting reduction variable accumulation statement. There will be different cases between before and after reduction variable accumulation. The cost of replacing use point with reduction variable should be different accordingly. BTW, I personally think the current ivopt cost modelling basically works fine, although there might be some tunings needed. The most difficult part is the choice of reduction variable candidates has something to do with register pressure cost, while the register cost estimate is not accurate enough at this stage because we don't have back-end live range interference graph at all. we are always able to find holes on some particular cases or benchmarks, but we can't only want to find a optimal result for them, and the tuning needs to be backed by more comprehensive result. Thanks, -Jiangning 2012/6/6 Sandra Loosemore <sandra@codesourcery.com>: > My colleagues and I have been working on the GCC port for the Qualcomm > Hexagon. Along the way I noticed that we were getting poor results > from the ivopts pass no matter how we adjusted the target-specific RTX > costs. In many cases ivopts was coming up with candidate use costs > that seemed completely inconsistent with the target cost model. On > further inspection, I found what appears to be a whole bunch of bugs > in the way ivopts is computing address costs: > > (1) While the address cost computation is assuming in some situations > that pre/post increment/decrement addressing will be used if > supported by the target, it isn't actually using the target's address > cost for such forms -- instead, just the cost of the form that would > be used if autoinc weren't available/applicable. > > (2) The computation to determine which multiplier values are supported > by target addressing modes is constructing an address rtx of the form > (reg * ratio) to do the tests. This isn't a valid address RTX on > Hexagon, although both (reg + reg * ratio) and (sym + reg * ratio) > are. Because it's choosing the wrong address form to probe with, it > thinks that the target doesn't support multipliers at all and is > incorrectly tacking on an extra cost for them. I also note that it's > assuming that the same set of ratios are supported by all three > address forms that can potentially include them, and that all valid > ratios have the same cost. > > (3) The computation to determine the range of valid constant offsets > for address forms that can include them is probing the upper end of > the range using constants of the form ((1<<n) - 1). On Hexagon, the > offsets have to be aligned appropriately for the mode, so it's > incorrectly rejecting all positive offsets for non-char modes. And > again, it's assuming that the same range of offsets are supported by > all address forms that can legitimately include them, and that all > valid offsets have the same cost. The latter isn't true on Hexagon. > > (4) The cost adjustment for converting the symbol_present address to a > var_present address seems overly optimistic in assuming that the > symbol load will be hoisted outside the loop. I looked at a lot of > code where this was not happening no matter how expensive I made the > absolute addressing forms in the target-specific costs. Also, if > subsequent passes actually do hoist the symbol load, this requires an > additional register to be available over the entire loop, which ivopts > isn't accounting for in any way. It seems to me that this adjustment > shouldn't be attempted when the symbol_present form is a legitimate > address (because subsequent passes are not doing the anticipated > optimization in that case). There's also a bug present in the cost > accounting: it's adding the full cost of the symbol + var addition, > whereas it should be pro-rated across iterations (see the way this is > handled for the non-address cases in get_computation_cost_at). > > (5) The documentation of TARGET_ADDRESS_COST says that when two > (legitimate) address expressions have the same lowest cost, the one > with the higher complexity is used. But, the ivopts code is doing the > opposite of this and choosing the lower complexity as the tie-breaker. > (Actually, it's also computing the complexity without regard to > whether the address rtx is even legitimate on the target.) > > (6) The way get_address_costs is precomputing and caching a complete > set of cost data for each addressing mode seems incorrect to me, in > the general case. For example, consider MIPS where MIPS16 and MIPS32 > functions can be freely intermixed in the same compilation unit. On > Hexagon, as I've noted, the assumption that all valid multiplier and > offset values have the same cost is also invalid. The current code is > already precomputing 16 sets of address costs for each mode, plus > probing 128 addresses with multipliers for validity, and similarly > probing up to 32 or so constant offsets for validity, so I'm pretty > skeptical that precomputing and caching even more permutations would > be worthwhile. > > (7) If the computed address cost turns out to be 0, the current code > (for some unknown reason) is turning that into 1, which can screw up > the relative costs of address computations vs other operations like > addition. > > I've come up with the attached patch to try to fix these things. The > biggest change is that I have discarded the code for precomputing and > caching costs and instead go straight to querying the target back end > for the cost for the specific address computation we're handed; this > makes the code a lot simpler. I would kind of like to get rid of > multiplier_allowed_in_address_p too, but it's being used in a couple > places other than the address computation and it seemed better not to > mess with that for now. The other fixes are pretty straightforward. > > I bootstrapped and regression-tested the patch on x86_64. I haven't > tried to benchmark the performance effect of the patch on anything > other than Hexagon; there I found that, once ivopts actually started > paying attention to the target address costs function, it needed to be > re-tuned. So, it wouldn't surprise me if other back ends could > benefit from some tweaking as well, depending on the extent to which > they're affected by the bugs I listed above. > > Comments, complaints, proposals for alternate fixes, etc? Or OK to > commit? > > -Sandra > > > 2012-06-05 Sandra Loosemore <sandra@codesourcery.com> > > gcc/ > * tree-ssa-loop-ivopts.c (comp_cost): Make complexity field signed. > Update comments to indicate this is for addressing mode complexity. > (new_cost): Make signedness of parameters match comp_cost fields. > (compare_costs): Prefer higher complexity, not lower, per documentation > of TARGET_ADDRESS_COST. > (multiplier_allowed_in_address_p): Use (+ (* reg1 ratio) reg2) to > probe for valid ratios, rather than just (* reg1 ratio). > (get_address_cost): Rewrite to eliminate precomputation and caching. > Use target's address cost for autoinc forms if possible. Only attempt > sym_present -> var_present cost conversion if the sym_present form > is not legitimate; amortize setup cost over loop iterations. > Adjust complexity computation. > (get_computation_cost_at): Adjust call to get_address_cost. Do not > mess with complexity for non-address expressions. > (determine_use_iv_cost_address): Initialize can_autoinc. > (autoinc_possible_for_pair): Likewise. >
On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore <sandra@codesourcery.com> wrote: > On 06/05/2012 10:34 AM, Sandra Loosemore wrote: > >> 2012-06-05 Sandra Loosemore<sandra@codesourcery.com> >> >> gcc/ >> * tree-ssa-loop-ivopts.c (comp_cost): Make complexity field >> signed. >> Update comments to indicate this is for addressing mode >> complexity. >> (new_cost): Make signedness of parameters match comp_cost fields. >> (compare_costs): Prefer higher complexity, not lower, per >> documentation >> of TARGET_ADDRESS_COST. >> (multiplier_allowed_in_address_p): Use (+ (* reg1 ratio) reg2) to >> probe for valid ratios, rather than just (* reg1 ratio). >> (get_address_cost): Rewrite to eliminate precomputation and >> caching. >> Use target's address cost for autoinc forms if possible. Only >> attempt >> sym_present -> var_present cost conversion if the sym_present >> form >> is not legitimate; amortize setup cost over loop iterations. >> Adjust complexity computation. >> (get_computation_cost_at): Adjust call to get_address_cost. Do >> not >> mess with complexity for non-address expressions. >> (determine_use_iv_cost_address): Initialize can_autoinc. >> (autoinc_possible_for_pair): Likewise. > > > On 06/13/2012 01:52 PM, Sandra Loosemore wrote: >> >> >> Might somebody be willing to review the patch as posted? > > > Ping? Original post with patch is here: > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html Can you update the patch and numbers based on what Bill did for straight-line strength reduction which re-uses this analysis/caching part? Thanks, Richard. > -Sandra >
On 07/17/2012 05:22 AM, Richard Guenther wrote: > On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore > <sandra@codesourcery.com> wrote: >> >> Ping? Original post with patch is here: >> >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html > > Can you update the patch and numbers based on what Bill did for > straight-line strength reduction which re-uses this analysis/caching part? I will try to take another look at this once Bill has finished his work that touches on this; it's been hard for me to track a moving target. I was wondering if it might be more consistent with Bill's work to defer some of the address cost computation to new target hooks, after all. -Sandra
On Wed, 2012-07-25 at 13:39 -0600, Sandra Loosemore wrote: > On 07/17/2012 05:22 AM, Richard Guenther wrote: > > On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore > > <sandra@codesourcery.com> wrote: > >> > >> Ping? Original post with patch is here: > >> > >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html > > > > Can you update the patch and numbers based on what Bill did for > > straight-line strength reduction which re-uses this analysis/caching part? > > I will try to take another look at this once Bill has finished his work > that touches on this; it's been hard for me to track a moving target. I > was wondering if it might be more consistent with Bill's work to defer > some of the address cost computation to new target hooks, after all. > > -Sandra > Hi Sandra, I apologize for the mess. I should be done causing distress to this part of the code as soon as the patch I submitted today is committed. Sorry! Bill
Index: gcc/tree-ssa-loop-ivopts.c =================================================================== --- gcc/tree-ssa-loop-ivopts.c (revision 188110) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -157,10 +157,10 @@ enum use_type typedef struct { int cost; /* The runtime cost. */ - unsigned complexity; /* The estimate of the complexity of the code for - the computation (in no concrete units -- - complexity field should be larger for more - complex expressions and addressing modes). */ + int complexity; /* The estimate of the complexity of the code for + addressing modes in the computation (in no + concrete units -- complexity field should be + larger for more complex addressing modes). */ } comp_cost; static const comp_cost zero_cost = {0, 0}; @@ -2621,7 +2621,7 @@ alloc_use_cost_map (struct ivopts_data * cost is RUNTIME and complexity corresponds to COMPLEXITY. */ static comp_cost -new_cost (unsigned runtime, unsigned complexity) +new_cost (int runtime, int complexity) { comp_cost cost; @@ -2659,7 +2659,7 @@ static int compare_costs (comp_cost cost1, comp_cost cost2) { if (cost1.cost == cost2.cost) - return cost1.complexity - cost2.complexity; + return cost2.complexity - cost1.complexity; return cost1.cost - cost2.cost; } @@ -3182,15 +3182,23 @@ multiplier_allowed_in_address_p (HOST_WI { enum machine_mode address_mode = targetm.addr_space.address_mode (as); rtx reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1); - rtx addr; + rtx reg2 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 2); + rtx addr, mult; HOST_WIDE_INT i; valid_mult = sbitmap_alloc (2 * MAX_RATIO + 1); sbitmap_zero (valid_mult); - addr = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX); + /* Use an expression of the form (PLUS (MULT reg1 constant) reg2) + to test the validity of various constants. Plain (MULT reg1 constant) + is less likely to be a valid address RTX on many targets, and + probably (PLUS (MULT reg1 constant) symbol) likewise. Note that + we're assuming the set of valid multipliers is the same for any/all + of the three address RTX forms that allow them. */ + mult = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX); + addr = gen_rtx_fmt_ee (PLUS, address_mode, mult, reg2); for (i = -MAX_RATIO; i <= MAX_RATIO; i++) { - XEXP (addr, 1) = gen_int_mode (i, address_mode); + XEXP (mult, 1) = gen_int_mode (i, address_mode); if (memory_address_addr_space_p (mode, addr, as)) SET_BIT (valid_mult, i + MAX_RATIO); } @@ -3224,292 +3232,136 @@ multiplier_allowed_in_address_p (HOST_WI look at the size of the increment to be made, which is given in CSTEP. CSTEP may be zero if the step is unknown. STMT_AFTER_INC is true iff the statement we're looking at is after the - increment of the original biv. - - TODO -- there must be some better way. This all is quite crude. */ - -typedef struct -{ - HOST_WIDE_INT min_offset, max_offset; - unsigned costs[2][2][2][2]; -} *address_cost_data; - -DEF_VEC_P (address_cost_data); -DEF_VEC_ALLOC_P (address_cost_data, heap); + increment of the original biv. */ static comp_cost get_address_cost (bool symbol_present, bool var_present, unsigned HOST_WIDE_INT offset, HOST_WIDE_INT ratio, HOST_WIDE_INT cstep, enum machine_mode mem_mode, addr_space_t as, bool speed, - bool stmt_after_inc, bool *may_autoinc) + bool stmt_after_inc, bool *may_autoinc, + struct ivopts_data *data) { enum machine_mode address_mode = targetm.addr_space.address_mode (as); - static VEC(address_cost_data, heap) *address_cost_data_list; - unsigned int data_index = (int) as * MAX_MACHINE_MODE + (int) mem_mode; - address_cost_data data; - static bool has_preinc[MAX_MACHINE_MODE], has_postinc[MAX_MACHINE_MODE]; - static bool has_predec[MAX_MACHINE_MODE], has_postdec[MAX_MACHINE_MODE]; - unsigned cost, acost, complexity; - bool offset_p, ratio_p, autoinc; - HOST_WIDE_INT s_offset, autoinc_offset, msize; - unsigned HOST_WIDE_INT mask; - unsigned bits; - - if (data_index >= VEC_length (address_cost_data, address_cost_data_list)) - VEC_safe_grow_cleared (address_cost_data, heap, address_cost_data_list, - data_index + 1); - - data = VEC_index (address_cost_data, address_cost_data_list, data_index); - if (!data) - { - HOST_WIDE_INT i; - HOST_WIDE_INT rat, off = 0; - int old_cse_not_expected, width; - unsigned sym_p, var_p, off_p, rat_p, add_c; - rtx seq, addr, base; - rtx reg0, reg1; - - data = (address_cost_data) xcalloc (1, sizeof (*data)); - - reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1); - - width = GET_MODE_BITSIZE (address_mode) - 1; - if (width > (HOST_BITS_PER_WIDE_INT - 1)) - width = HOST_BITS_PER_WIDE_INT - 1; - addr = gen_rtx_fmt_ee (PLUS, address_mode, reg1, NULL_RTX); - - for (i = width; i >= 0; i--) - { - off = -((HOST_WIDE_INT) 1 << i); - XEXP (addr, 1) = gen_int_mode (off, address_mode); - if (memory_address_addr_space_p (mem_mode, addr, as)) - break; - } - data->min_offset = (i == -1? 0 : off); - - for (i = width; i >= 0; i--) - { - off = ((HOST_WIDE_INT) 1 << i) - 1; - XEXP (addr, 1) = gen_int_mode (off, address_mode); - if (memory_address_addr_space_p (mem_mode, addr, as)) - break; - } - if (i == -1) - off = 0; - data->max_offset = off; - - if (dump_file && (dump_flags & TDF_DETAILS)) - { - fprintf (dump_file, "get_address_cost:\n"); - fprintf (dump_file, " min offset %s " HOST_WIDE_INT_PRINT_DEC "\n", - GET_MODE_NAME (mem_mode), - data->min_offset); - fprintf (dump_file, " max offset %s " HOST_WIDE_INT_PRINT_DEC "\n", - GET_MODE_NAME (mem_mode), - data->max_offset); - } - - rat = 1; - for (i = 2; i <= MAX_RATIO; i++) - if (multiplier_allowed_in_address_p (i, mem_mode, as)) - { - rat = i; - break; - } - - /* Compute the cost of various addressing modes. */ - acost = 0; - reg0 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1); - reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 2); - - if (USE_LOAD_PRE_DECREMENT (mem_mode) - || USE_STORE_PRE_DECREMENT (mem_mode)) - { - addr = gen_rtx_PRE_DEC (address_mode, reg0); - has_predec[mem_mode] - = memory_address_addr_space_p (mem_mode, addr, as); - } - if (USE_LOAD_POST_DECREMENT (mem_mode) - || USE_STORE_POST_DECREMENT (mem_mode)) - { - addr = gen_rtx_POST_DEC (address_mode, reg0); - has_postdec[mem_mode] - = memory_address_addr_space_p (mem_mode, addr, as); - } - if (USE_LOAD_PRE_INCREMENT (mem_mode) - || USE_STORE_PRE_DECREMENT (mem_mode)) - { - addr = gen_rtx_PRE_INC (address_mode, reg0); - has_preinc[mem_mode] - = memory_address_addr_space_p (mem_mode, addr, as); - } - if (USE_LOAD_POST_INCREMENT (mem_mode) - || USE_STORE_POST_INCREMENT (mem_mode)) - { - addr = gen_rtx_POST_INC (address_mode, reg0); - has_postinc[mem_mode] - = memory_address_addr_space_p (mem_mode, addr, as); + rtx reg0 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1); + rtx addr = NULL_RTX; + rtx base = NULL_RTX; + int cost, complexity; + + if (may_autoinc && *may_autoinc + && !symbol_present && !var_present && ratio == 1) + { + HOST_WIDE_INT msize = GET_MODE_SIZE (mem_mode); + HOST_WIDE_INT autoinc_offset = stmt_after_inc ? offset + cstep : offset; + if ((USE_LOAD_POST_INCREMENT (mem_mode) + || USE_STORE_POST_INCREMENT (mem_mode)) + && autoinc_offset == 0 && msize == cstep) + addr = gen_rtx_POST_INC (address_mode, reg0); + else if ((USE_LOAD_POST_DECREMENT (mem_mode) + || USE_STORE_POST_DECREMENT (mem_mode)) + && autoinc_offset == 0 && msize == -cstep) + addr = gen_rtx_POST_DEC (address_mode, reg0); + else if ((USE_LOAD_PRE_INCREMENT (mem_mode) + || USE_STORE_PRE_INCREMENT (mem_mode)) + && autoinc_offset == msize && msize == cstep) + addr = gen_rtx_PRE_INC (address_mode, reg0); + else if ((USE_LOAD_PRE_DECREMENT (mem_mode) + || USE_STORE_PRE_DECREMENT (mem_mode)) + && autoinc_offset == -msize && msize == -cstep) + addr = gen_rtx_PRE_DEC (address_mode, reg0); + if (addr && memory_address_addr_space_p (mem_mode, addr, as)) + { + *may_autoinc = true; + /* Favor autoinc modes in case of a cost tie. */ + return new_cost (address_cost (addr, mem_mode, as, speed), 5); } - for (i = 0; i < 16; i++) - { - sym_p = i & 1; - var_p = (i >> 1) & 1; - off_p = (i >> 2) & 1; - rat_p = (i >> 3) & 1; + } + if (may_autoinc) + *may_autoinc = false; - addr = reg0; - if (rat_p) - addr = gen_rtx_fmt_ee (MULT, address_mode, addr, - gen_int_mode (rat, address_mode)); + addr = reg0; + if (ratio != 1) + addr = gen_rtx_fmt_ee (MULT, address_mode, addr, + gen_int_mode (ratio, address_mode)); + if (var_present) + addr = gen_rtx_fmt_ee (PLUS, address_mode, addr, + gen_raw_REG (address_mode, + LAST_VIRTUAL_REGISTER + 2)); + + if (offset != 0) + base = gen_int_mode (offset, address_mode); + + if (symbol_present) + { + rtx sym = gen_rtx_SYMBOL_REF (address_mode, ggc_strdup ("")); + /* ??? We can run into trouble with some backends by presenting + it with symbols which haven't been properly passed through + targetm.encode_section_info. By setting the local bit, we + enhance the probability of things working. */ + SYMBOL_REF_FLAGS (sym) = SYMBOL_FLAG_LOCAL; + + if (base) + base = gen_rtx_fmt_e (CONST, address_mode, + gen_rtx_fmt_ee (PLUS, address_mode, sym, base)); + else + base = sym; + } - if (var_p) - addr = gen_rtx_fmt_ee (PLUS, address_mode, addr, reg1); + if (base) + addr = gen_rtx_fmt_ee (PLUS, address_mode, addr, base); - if (sym_p) - { - base = gen_rtx_SYMBOL_REF (address_mode, ggc_strdup ("")); - /* ??? We can run into trouble with some backends by presenting - it with symbols which haven't been properly passed through - targetm.encode_section_info. By setting the local bit, we - enhance the probability of things working. */ - SYMBOL_REF_FLAGS (base) = SYMBOL_FLAG_LOCAL; - - if (off_p) - base = gen_rtx_fmt_e (CONST, address_mode, - gen_rtx_fmt_ee - (PLUS, address_mode, base, - gen_int_mode (off, address_mode))); - } - else if (off_p) - base = gen_int_mode (off, address_mode); + if (memory_address_addr_space_p (mem_mode, addr, as)) + { + cost = address_cost (addr, mem_mode, as, speed); + complexity = ((symbol_present != 0) + (var_present != 0) + + (offset != 0) + (ratio != 1)); + } + else + { + int old_cse_not_expected = cse_not_expected; + rtx seq, addr2; + + start_sequence (); + /* To avoid splitting addressing modes, pretend that no cse will + follow. */ + cse_not_expected = true; + addr2 = memory_address_addr_space (mem_mode, addr, as); + cse_not_expected = old_cse_not_expected; + seq = get_insns (); + end_sequence (); + cost = seq_cost (seq, speed) + address_cost (addr2, mem_mode, as, speed); + complexity = 0; + + /* If we ended up with a illegitimate address form involving a symbol, + see if we can do better by converting the sym to a var and + amortizing the load or addition cost over the iterations. + Only allow this cost substituion if the address form is not + legitimate as-is, because otherwise subsequent optimizations + always seem to prefer leaving the symbol in the address RTX no + matter how expensive it is. Note also that hoisting the symbol + load outside the loop requires an additional register that is live + throughout the whole loop, and this additional register pressure + cost is not otherwise accounted for in the ivopts cost model. */ + if (symbol_present) + { + comp_cost alt = get_address_cost (false, true, 0, + ratio, cstep, mem_mode, as, speed, + stmt_after_inc, NULL, data); + int altcost = alt.cost; + if (var_present) + altcost += adjust_setup_cost (data, add_cost (address_mode, speed)); else - base = NULL_RTX; - - if (base) - addr = gen_rtx_fmt_ee (PLUS, address_mode, addr, base); - - start_sequence (); - /* To avoid splitting addressing modes, pretend that no cse will - follow. */ - old_cse_not_expected = cse_not_expected; - cse_not_expected = true; - addr = memory_address_addr_space (mem_mode, addr, as); - cse_not_expected = old_cse_not_expected; - seq = get_insns (); - end_sequence (); - - acost = seq_cost (seq, speed); - acost += address_cost (addr, mem_mode, as, speed); - - if (!acost) - acost = 1; - data->costs[sym_p][var_p][off_p][rat_p] = acost; - } - - /* On some targets, it is quite expensive to load symbol to a register, - which makes addresses that contain symbols look much more expensive. - However, the symbol will have to be loaded in any case before the - loop (and quite likely we have it in register already), so it does not - make much sense to penalize them too heavily. So make some final - tweaks for the SYMBOL_PRESENT modes: - - If VAR_PRESENT is false, and the mode obtained by changing symbol to - var is cheaper, use this mode with small penalty. - If VAR_PRESENT is true, try whether the mode with - SYMBOL_PRESENT = false is cheaper even with cost of addition, and - if this is the case, use it. */ - add_c = add_cost (address_mode, speed); - for (i = 0; i < 8; i++) - { - var_p = i & 1; - off_p = (i >> 1) & 1; - rat_p = (i >> 2) & 1; - - acost = data->costs[0][1][off_p][rat_p] + 1; - if (var_p) - acost += add_c; - - if (acost < data->costs[1][var_p][off_p][rat_p]) - data->costs[1][var_p][off_p][rat_p] = acost; - } - - if (dump_file && (dump_flags & TDF_DETAILS)) - { - fprintf (dump_file, "Address costs:\n"); - - for (i = 0; i < 16; i++) + altcost += 1; /* Small penalty for additional register use. */ + if (altcost < cost + || (altcost == cost && alt.complexity > complexity)) { - sym_p = i & 1; - var_p = (i >> 1) & 1; - off_p = (i >> 2) & 1; - rat_p = (i >> 3) & 1; - - fprintf (dump_file, " "); - if (sym_p) - fprintf (dump_file, "sym + "); - if (var_p) - fprintf (dump_file, "var + "); - if (off_p) - fprintf (dump_file, "cst + "); - if (rat_p) - fprintf (dump_file, "rat * "); - - acost = data->costs[sym_p][var_p][off_p][rat_p]; - fprintf (dump_file, "index costs %d\n", acost); + cost = altcost; + complexity = alt.complexity; } - if (has_predec[mem_mode] || has_postdec[mem_mode] - || has_preinc[mem_mode] || has_postinc[mem_mode]) - fprintf (dump_file, " May include autoinc/dec\n"); - fprintf (dump_file, "\n"); } - - VEC_replace (address_cost_data, address_cost_data_list, - data_index, data); } - bits = GET_MODE_BITSIZE (address_mode); - mask = ~(~(unsigned HOST_WIDE_INT) 0 << (bits - 1) << 1); - offset &= mask; - if ((offset >> (bits - 1) & 1)) - offset |= ~mask; - s_offset = offset; - - autoinc = false; - msize = GET_MODE_SIZE (mem_mode); - autoinc_offset = offset; - if (stmt_after_inc) - autoinc_offset += ratio * cstep; - if (symbol_present || var_present || ratio != 1) - autoinc = false; - else if ((has_postinc[mem_mode] && autoinc_offset == 0 - && msize == cstep) - || (has_postdec[mem_mode] && autoinc_offset == 0 - && msize == -cstep) - || (has_preinc[mem_mode] && autoinc_offset == msize - && msize == cstep) - || (has_predec[mem_mode] && autoinc_offset == -msize - && msize == -cstep)) - autoinc = true; - - cost = 0; - offset_p = (s_offset != 0 - && data->min_offset <= s_offset - && s_offset <= data->max_offset); - ratio_p = (ratio != 1 - && multiplier_allowed_in_address_p (ratio, mem_mode, as)); - - if (ratio != 1 && !ratio_p) - cost += multiply_by_cost (ratio, address_mode, speed); - - if (s_offset && !offset_p && !symbol_present) - cost += add_cost (address_mode, speed); - - if (may_autoinc) - *may_autoinc = autoinc; - acost = data->costs[symbol_present][var_present][offset_p][ratio_p]; - complexity = (symbol_present != 0) + (var_present != 0) + offset_p + ratio_p; - return new_cost (cost + acost, complexity); + return new_cost (cost, complexity); } /* Calculate the SPEED or size cost of shiftadd EXPR in MODE. MULT is the @@ -4196,7 +4048,7 @@ get_computation_cost_at (struct ivopts_d TYPE_MODE (TREE_TYPE (utype)), TYPE_ADDR_SPACE (TREE_TYPE (utype)), speed, stmt_is_after_inc, - can_autoinc)); + can_autoinc, data)); /* Otherwise estimate the costs for computing the expression. */ if (!symbol_present && !var_present && !offset) @@ -4212,11 +4064,6 @@ get_computation_cost_at (struct ivopts_d cost.cost += adjust_setup_cost (data, add_cost (TYPE_MODE (ctype), speed)); - /* Having offset does not affect runtime cost in case it is added to - symbol, but it increases complexity. */ - if (offset) - cost.complexity++; - cost.cost += add_cost (TYPE_MODE (ctype), speed); aratio = ratio > 0 ? ratio : -ratio; @@ -4299,7 +4146,7 @@ determine_use_iv_cost_address (struct iv struct iv_use *use, struct iv_cand *cand) { bitmap depends_on; - bool can_autoinc; + bool can_autoinc = (cand->ainc_use == use); int inv_expr_id = -1; comp_cost cost = get_computation_cost (data, use, cand, true, &depends_on, &can_autoinc, &inv_expr_id); @@ -4899,7 +4746,7 @@ autoinc_possible_for_pair (struct ivopts struct iv_cand *cand) { bitmap depends_on; - bool can_autoinc; + bool can_autoinc = true; comp_cost cost; if (use->type != USE_ADDRESS)