
[RFC] Improve ivopts group costs

Message ID 4594f0d63038f76d8be3afaf2470e7b8@ispras.ru
State New

Commit Message

Evgeny Kudryashov Nov. 3, 2016, 1:35 p.m. UTC
Hello,

I'm facing a problem with ivopts: GCC generates a large number of 
induction variables, which leads to an unnecessary increase in stack 
usage and register pressure.

For instance, for the attached testcase (tc_ivopts.c) GCC generates 26 
induction variables (25, one for each of lhsX[{0-4}][{0-4}][k], and one 
for rhs[k][j][{0-2}]). You should be able to reproduce this issue on an 
arm target using GCC with "-O2 -mcpu=cortex-a9 -mtune=cortex-a9". For 
reference, I'm attaching the assembly I get on current trunk.
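In case the attachment doesn't come through, the loop has roughly this 
shape (a hand-written sketch, not the actual tc_ivopts.c):

double lhsX[5][5][64];
double rhs[64][64][3];

void
foo (int j, int n)
{
  for (int k = 0; k < n; k++)
    {
      lhsX[0][0][k] -= rhs[k][j][0];
      lhsX[0][1][k] -= rhs[k][j][1];
      lhsX[0][2][k] -= rhs[k][j][2];
      /* ... and so on for all 25 lhsX[a][b][k], a,b in 0..4.  */
    }
}

Each lhsX[a][b][k] has a distinct base, so ivopts creates a separate 
group (and, with the current costs, a separate candidate) for each of 
them.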

The cause may lie in how group costs are estimated. Currently, the cost 
of a (group, candidate) pair is the sum of the per-use costs in the 
group, so the cost of a group grows in proportion to the number of uses 
it contains. This has a negative effect on the algorithm that searches 
for the best set of induction variables: the part of the total cost 
related to adding a new candidate is almost washed out by the group 
cost. In addition, when there are many groups with many uses each and 
the target runs out of free registers, the spill cost is washed out 
too. As a result, GCC prefers the native candidate (the candidate 
created for a particular group) for each group of uses instead of 
considering the real cost of introducing a new variable into the set.
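To illustrate with made-up numbers: take a group of 25 uses where each 
use costs 4 with the group's native candidate and 5 with a shared biv, 
and registering one more candidate costs about 5. With summing, the 
native candidate gives 25*4 + 5 = 105 against 25*5 = 125 for the biv, 
so the native candidate wins for every such group; with the maximum, it 
is 4 + 5 = 9 against 5, and the shared biv wins.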

The summing approach was added as part of this patch 
(https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00641.html), and the 
motivation for taking the sum does not seem to have been specifically 
discussed.

I propose the following patch, which changes the group cost from the 
sum of the per-use costs to the largest cost within the group. For the 
given test case: the required stack size decreases almost 3x, and the 
number of ldr/str instructions decreases by a bit under 2x. I'm also 
attaching the assembly after applying the patch.

The essential change in the patch is just:

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index f9211ad..a149418 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5151,7 +5151,8 @@ determine_group_iv_cost_address (struct ivopts_data *data,
         offset and where the iv_use is.  */
        cost = get_computation_cost (data, next, cand, true,
                                     NULL, &can_autoinc, NULL);
-      sum_cost += cost;
+      if (sum_cost < cost)
+        sum_cost = cost;
     }
   set_group_iv_cost (data, group, cand, sum_cost, depends_on,
                     NULL_TREE, ERROR_MARK, inv_expr);

Any suggestions?


Thanks,
Evgeny.

Comments

Bin.Cheng Nov. 3, 2016, 4 p.m. UTC | #1
On Thu, Nov 3, 2016 at 1:35 PM, Evgeny Kudryashov <kudryashov@ispras.ru> wrote:
> [...]
> Any suggestions?
Hi Evgeny,

I tend to be practical here.  Given that cost is not modeled all that
well in IVOPT, any heuristic tuning of the cost is quite likely to
improve some cases while regressing others.  At the least, we need to
run a good number of benchmarks for such changes.  Specific to this
one, summing the costs within a group is a natural consequence of the
original code, in which the uses' costs were added up before candidate
selection.  Take your code as an example: choosing the loop's original
candidate for a group results in lots of memory accesses with the
[base + index << scale] addressing mode, which could have a higher
cost than the [base + offset] mode with respect to the
micro-architecture; IMHO, it makes sense to add up this cost.
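For illustration (made-up ARM, not from your attached assembly), the
difference per access is between:

    ldr   r0, [r1, r2, lsl #2]   @ [base + index << scale]
    ldr   r0, [r1, #8]           @ [base + offset]
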
OTOH, I totally agree that IVOPT tends to choose too many candidates
at the moment, especially for large loops.  Is it possible to achieve
the same effect by penalizing the register pressure cost?  Meanwhile,
I can collect benchmark data for your patch on AArch64 and
get back to you later.

Thanks,
bin
Bin.Cheng Nov. 9, 2016, 1:02 p.m. UTC | #2
On Thu, Nov 3, 2016 at 4:00 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> [...]
> Meanwhile, I can collect benchmark data for your patch on AArch64 and
> get back to you later.
I collected spec2k6 data on an AArch64 processor; it shows no
meaningful improvement or regression.  Looking at the test, it boils
down to how we choose the induction variable for loops with memory
references like the below:
for (biv)
  MEM[base + biv << scale + off1];
  MEM[base + biv << scale + off2];
  // ...
  MEM[base + biv << scale + offn];

On targets that support the [base + index << scale + offset]
addressing mode, the biv should be preferred (if the cost of the
addressing mode is not blocking), and the register pressure is then 2.
On targets that only support [base + index << scale], it is more
complicated: choosing the biv either increases the register pressure
by 1 compared to choosing {base + biv << scale, step} as the
candidate, or requires an additional computation outside of the
address expression for each memory reference.  Well, there is one
exception, when "base" is actually anticipated on the exit edge of
this loop.  So this is really a target-dependent cost model issue;
IMHO, decreasing the group cost in a target-independent way is not
that good.  What do you think?  I will look deeper into why choosing
the biv can get rid of spilling.

BTW, the case can be further extended:
for (biv)
  MEM[base_1 + biv << scale + off1];
  MEM[base_1 + biv << scale + off2];
  // ...
  MEM[base_1 + biv << scale + offn];
  //
  // ...
  //
  MEM[base_m + biv << scale + off1];
  MEM[base_m + biv << scale + off2];
  // ...
  MEM[base_m + biv << scale + offn];

IVOPT doesn't work well for this either.

Thanks,
bin
Bin.Cheng Nov. 10, 2016, 10:30 a.m. UTC | #3
On Wed, Nov 9, 2016 at 1:02 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> [...]

Hi,
I see the cost problem with your test now.  When computing an address
type iv_use with a candidate, the computation consists of two parts:
the part that can be represented by the addressing mode is done inside
the memory reference; the part that cannot is done outside of it.  The
final cost is the sum of the two parts.
For the address iv_use:
    MEM[base + biv << scale + offset]
when it is computed with the candidate below, on a target that only
supports the [base + biv << scale] addressing mode:
    biv
the computations would be:
    base' = base + offset
    MEM[base' + biv << scale]
Each computation has its own cost: the first is a normal RTX cost, the
second is an addressing mode cost.  The final cost is the sum of both
parts.
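
For example (illustrative ARM with scale 2 and offset 8, not from the
actual dumps), the two parts correspond to:

    add   r3, r1, #8             @ RTX computation: base' = base + offset
    ldr   r0, [r3, r2, lsl #2]   @ addressing mode: MEM[base' + biv << 2]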

Normally, all these costs should be added up in the cost model, but
your test shows one exception: if the iv_uses of a group have exactly
the same iv ({base, step}), the first (RTX) computation can be shared
among all the iv_uses, so its cost should only be counted once.  That
is, we should be able to model such CSE opportunities.  Apparently, we
can't CSE the second computation; of course, there won't be CSE
opportunities within the address expression anyway.

That said, the patch should differentiate between the cost of the RTX
computation and the cost of the address expression, and only add up
the RTX cost once if it can be CSEd.  Well, it might not be trivial to
check for CSE opportunities in the RTX computation, for example when
some iv_uses of the group are the same and others are not.

Thanks,
bin
Evgeny Kudryashov Nov. 12, 2016, 8:36 a.m. UTC | #4
On 2016-11-10 13:30, Bin.Cheng wrote:
> [...]
> Normally, all these costs should be added up in the cost model, but
> your test shows one exception: if the iv_uses of a group have exactly
> the same iv ({base, step}), the first (RTX) computation can be shared
> among all the iv_uses, so its cost should only be counted once.  That
> is, we should be able to model such CSE opportunities.  Apparently, we
> can't CSE the second computation; of course, there won't be CSE
> opportunities within the address expression anyway.

Hi Bin,
Yes, that is exactly what happens. And this computation might be
cheaper than initializing and incrementing a new iv, in which case it
is preferable.

> That said, the patch should differentiate between the cost of the RTX
> computation and the cost of the address expression, and only add up
> the RTX cost once if it can be CSEd.  Well, it might not be trivial to
> check for CSE opportunities in the RTX computation, for example when
> some iv_uses of the group are the same and others are not.
> 
> Thanks,
> bin

Since the uses in a given group have the same base and step, they can 
only differ by their offsets, and among those, equivalent offsets can 
be CSE'd. So perhaps it's possible to use a hash set of the unique 
offsets in this group cost estimation loop, and count the RTX 
computation cost only when adding a new entry to the set. What do you 
think of this approach?
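
A rough sketch of what I mean for determine_group_iv_cost_address 
(untested; using addr_offset as the key is my assumption, and 
rtx_cost_part stands for the RTX share of the cost, which isn't 
separated out yet):

  hash_set<unsigned HOST_WIDE_INT> offsets;
  for (i = 0; i < group->vuses.length (); i++)
    {
      struct iv_use *next = group->vuses[i];

      cost = get_computation_cost (data, next, cand, true,
                                   NULL, &can_autoinc, NULL);
      /* hash_set::add returns true if the key was already present,
         so the shared RTX part is charged only for the first use
         with a given offset.  */
      if (offsets.add (next->addr_offset))
        cost -= rtx_cost_part;
      sum_cost += cost;
    }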

While working on this issue, I've found another problem: costs may 
become negative. That looks unintended, so I have filed a new bug: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78332

Thanks,
Evgeny.
Bin.Cheng Nov. 14, 2016, 9:22 a.m. UTC | #5
On Sat, Nov 12, 2016 at 8:36 AM, Evgeny Kudryashov <kudryashov@ispras.ru> wrote:
> On 2016-11-10 13:30, Bin.Cheng wrote:
>> [...]
>
> Hi Bin,
> Yes, that is exactly what happens. And this computation might be
> cheaper than initializing and incrementing a new iv, in which case it
> is preferable.
>
>> [...]
>
> Since the uses in a given group have the same base and step, they can
> only differ by their offsets, and among those, equivalent offsets can
> be CSE'd. So perhaps it's possible to use a hash set of the unique
> offsets in this group cost estimation loop, and count the RTX
> computation cost only when adding a new entry to the set. What do you
> think of this approach?
We can start by handling groups whose uses are all exactly the same.
When constructing groups, record a flag indicating that the group only
has identical uses; when computing costs, accumulate the RTX
computation cost once for flagged groups.  The rationale is: use/cand
cost computation in IVOPT is complicated and inaccurate, and it
doesn't make much sense to try fine-tuning based on such costs.  It
often results in Brownian motion (I am not sure if that's the word).
Moreover, we may want to further restrict flagging to groups whose
iv_uses are in a single basic block.
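
A minimal sketch of the flagging idea (identical_uses_p and
rtx_cost_part are assumed names, not existing code):

  /* In struct iv_group: true if all iv_uses have the same address
     expression, so their RTX computation can be CSEd.  */
  bool identical_uses_p;

  /* In determine_group_iv_cost_address: charge the RTX part of the
     cost only for the first use of a flagged group.  */
  if (group->identical_uses_p && i > 0)
    cost -= rtx_cost_part;
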
BTW, it may be non-trivial to compute the costs of the RTX computation
and the address expression separately: such costs are computed and
accumulated together in get_address_cost.

>
> While working on this issue, I've found another problem: costs may
> become negative. That looks unintended, so I have filed a new bug:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78332
It could be improved, but it's not a functional bug IMHO.  I will comment on the PR.

Thanks,
bin

Patch

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index f9211ad..a149418 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5151,7 +5151,8 @@ determine_group_iv_cost_address (struct ivopts_data *data,
         offset and where the iv_use is.  */
        cost = get_computation_cost (data, next, cand, true,
                                     NULL, &can_autoinc, NULL);
-      sum_cost += cost;
+      if (sum_cost < cost)
+        sum_cost = cost;
     }
   set_group_iv_cost (data, group, cand, sum_cost, depends_on,
                     NULL_TREE, ERROR_MARK, inv_expr);