diff mbox series

[3/4] ivopts: Consider cost_step on different forms during unrolling

Message ID 16f8ae40-c0ae-dd57-346e-f9cacea55038@linux.ibm.com
State New
Headers show
Series IVOPTs consider step cost for different forms when unrolling | expand

Commit Message

Kewen.Lin May 28, 2020, 12:24 p.m. UTC
gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
	(struct iv_cand): New field reg_offset_p.
	(struct ivopts_data): New field consider_reg_offset_for_unroll_p.
	(dump_groups): Dump group with reg_offset_p.
	(record_group): Initialize reg_offset_p.
	(mark_reg_offset_groups): New function.
	(find_interesting_uses): Call mark_reg_offset_groups.
	(add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
	(set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(determine_iv_cost): Increase step cost with estimate_unroll_factor if
	consider_reg_offset_for_unroll_p.
	(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
	consider_reg_offset_for_unroll_p.

----

Comments

Richard Sandiford June 1, 2020, 5:59 p.m. UTC | #1
Could you go into more detail about this choice of cost calculation?
It looks like we first calculate per-group flags, which are true only if
the unrolled offsets are valid for all uses in the group.  Then we create
per-candidate flags when associating candidates with groups.

Instead, couldn't we take this into account in get_address_cost,
which calculates the cost of an address use for a given candidate?
E.g. after the main if-else at the start of the function,
perhaps it would make sense to add the worst-case offset to
the address in “parts”, check whether that too is a valid address,
and if not, increase var_cost by the cost of one add instruction.

I guess there are two main sources of inexactness if we do that:

(1) It might underestimate the cost because it assumes that vuse[0]
    stands for all vuses in the group.

(2) It might overestimates the cost because it treats all unrolled
    iterations as having the cost of the final unrolled iteration.

(1) could perhaps be avoided by adding a flag to the iv_use to say
whether it wants this treatment.  I think the flag approach suffers
from (2) too, and I'd be surprised if it makes a difference in practice.

Thanks,
Richard
Kewen.Lin June 2, 2020, 3:39 a.m. UTC | #2
Hi Richard,

Thanks for the comments!

on 2020/6/2 上午1:59, Richard Sandiford wrote:
> Could you go into more detail about this choice of cost calculation?
> It looks like we first calculate per-group flags, which are true only if
> the unrolled offsets are valid for all uses in the group.  Then we create
> per-candidate flags when associating candidates with groups.
> 

Sure.  It checks every address type IV group to determine whether this
group is valid to use reg offset addressing mode.  Here we only need to
check the first one and the last one, since the intermediates should 
have been handled by split_address_groups.  With unrolling the
displacement of the address can be offset-ed by (UF-1)*step, check the
address with this max offset whether still valid.  If the check finds
it's valid to use reg offset mode for the whole group, we flag this
group.  Later, when we create IV candidate for address group flagged,
we flag the candidate further.  This flag is mainly for iv cand
costing, we don't need to scale up iv cand's step cost for this kind
of candidate.

Imagining this loop is being unrolled, all the statements will be
duplicated by UF.  For the cost modeling against iv group, it's
scaling up the cost by UF (here I simply excluded the compare_type
since in most cases it for loop ending check).  For the cost modeling
against iv candidate, it's to focus on step costs, for an iv candidate
we flagged before, it's taken as one time step cost, for the others,
it's scaling up the step cost since the unrolling make step 
calculation become UF times.

This cost modeling is trying to simulate cost change after the
unrolling, scaling up the costs accordingly.  There are somethings
to be improved like distinguish the loop ending compare or else,
whether need to tweak the other costs somehow since the scaling up
probably cause existing cost framework imbalance, but during
benchmarking I didn't find these matter, so take it as simple as 
possible for now.


> Instead, couldn't we take this into account in get_address_cost,
> which calculates the cost of an address use for a given candidate?
> E.g. after the main if-else at the start of the function,
> perhaps it would make sense to add the worst-case offset to
> the address in “parts”, check whether that too is a valid address,
> and if not, increase var_cost by the cost of one add instruction.
> 

IIUC, what you suggest is to tweak the iv group cost, if we find
one address group is valid for reg offset mode, we price more on
the pairs between this group and other non address-based iv cands.
The question is how do we decide this add-on cost.  For the test
case I was working on initially, adding one cost (of add) doesn't
work, the normal iv still wined.  We can price it more like two
but what's the justification on this value, by heuristics?

> I guess there are two main sources of inexactness if we do that:
> 
> (1) It might underestimate the cost because it assumes that vuse[0]
>     stands for all vuses in the group.
> 

Do you mean we don't need one check function like mark_reg_offset_groups?
If without it, vuse[0] might be not enough since we can't ensure the
others are fine with additional displacement from unrolling.  If we still
have it, I think it's fine to just use vuse[0].

> (2) It might overestimates the cost because it treats all unrolled
>     iterations as having the cost of the final unrolled iteration.
>
> (1) could perhaps be avoided by adding a flag to the iv_use to say
> whether it wants this treatment.  I think the flag approach suffers
> from (2) too, and I'd be surprised if it makes a difference in practice.
> 

Sorry, I didn't have the whole picture how to deal with uf for your proposal.
But the flag approach considers uf in iv group cost calculation as well as
iv cand step cost calculation.

BR,
Kewen

> Thanks,
> Richard
>
Richard Sandiford June 2, 2020, 7:14 a.m. UTC | #3
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the comments!
>
> on 2020/6/2 上午1:59, Richard Sandiford wrote:
>> Could you go into more detail about this choice of cost calculation?
>> It looks like we first calculate per-group flags, which are true only if
>> the unrolled offsets are valid for all uses in the group.  Then we create
>> per-candidate flags when associating candidates with groups.
>> 
>
> Sure.  It checks every address type IV group to determine whether this
> group is valid to use reg offset addressing mode.  Here we only need to
> check the first one and the last one, since the intermediates should 
> have been handled by split_address_groups.  With unrolling the
> displacement of the address can be offset-ed by (UF-1)*step, check the
> address with this max offset whether still valid.  If the check finds
> it's valid to use reg offset mode for the whole group, we flag this
> group.  Later, when we create IV candidate for address group flagged,
> we flag the candidate further.  This flag is mainly for iv cand
> costing, we don't need to scale up iv cand's step cost for this kind
> of candidate.

But AIUI, this is calculating whether the uses in their original form
support all unrolled offsets.  For ivopts, I think the question is really
whether the uses support all unrolled offsets when based on a given IV
candidate (which might not be the original IV).

E.g. there might be another IV candidate at a constant offset
from the original one, and the offsets might all be in range
for that offset too.

> Imagining this loop is being unrolled, all the statements will be
> duplicated by UF.  For the cost modeling against iv group, it's
> scaling up the cost by UF (here I simply excluded the compare_type
> since in most cases it for loop ending check).  For the cost modeling
> against iv candidate, it's to focus on step costs, for an iv candidate
> we flagged before, it's taken as one time step cost, for the others,
> it's scaling up the step cost since the unrolling make step 
> calculation become UF times.
>
> This cost modeling is trying to simulate cost change after the
> unrolling, scaling up the costs accordingly.  There are somethings
> to be improved like distinguish the loop ending compare or else,
> whether need to tweak the other costs somehow since the scaling up
> probably cause existing cost framework imbalance, but during
> benchmarking I didn't find these matter, so take it as simple as 
> possible for now.
>
>
>> Instead, couldn't we take this into account in get_address_cost,
>> which calculates the cost of an address use for a given candidate?
>> E.g. after the main if-else at the start of the function,
>> perhaps it would make sense to add the worst-case offset to
>> the address in “parts”, check whether that too is a valid address,
>> and if not, increase var_cost by the cost of one add instruction.
>> 
>
> IIUC, what you suggest is to tweak the iv group cost, if we find
> one address group is valid for reg offset mode, we price more on
> the pairs between this group and other non address-based iv cands.
> The question is how do we decide this add-on cost.  For the test
> case I was working on initially, adding one cost (of add) doesn't
> work, the normal iv still wined.  We can price it more like two
> but what's the justification on this value, by heuristics?

Yeah, I was thinking of adding one instance of add_cost.  If that
doesn't work, it'd be interesting to know why in more detail.

>> I guess there are two main sources of inexactness if we do that:
>> 
>> (1) It might underestimate the cost because it assumes that vuse[0]
>>     stands for all vuses in the group.
>> 
>
> Do you mean we don't need one check function like mark_reg_offset_groups?
> If without it, vuse[0] might be not enough since we can't ensure the
> others are fine with additional displacement from unrolling.  If we still
> have it, I think it's fine to just use vuse[0].
>
>> (2) It might overestimates the cost because it treats all unrolled
>>     iterations as having the cost of the final unrolled iteration.
>>
>> (1) could perhaps be avoided by adding a flag to the iv_use to say
>> whether it wants this treatment.  I think the flag approach suffers
>> from (2) too, and I'd be surprised if it makes a difference in practice.
>> 
>
> Sorry, I didn't have the whole picture how to deal with uf for your proposal.
> But the flag approach considers uf in iv group cost calculation as well as
> iv cand step cost calculation.
>
> BR,
> Kewen
>
>> Thanks,
>> Richard
>>
Kewen.Lin June 3, 2020, 3:18 a.m. UTC | #4
Hi Richard,

on 2020/6/2 下午3:14, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> Thanks for the comments!
>>
>> on 2020/6/2 上午1:59, Richard Sandiford wrote:
>>> Could you go into more detail about this choice of cost calculation?
>>> It looks like we first calculate per-group flags, which are true only if
>>> the unrolled offsets are valid for all uses in the group.  Then we create
>>> per-candidate flags when associating candidates with groups.
>>>
>>
>> Sure.  It checks every address type IV group to determine whether this
>> group is valid to use reg offset addressing mode.  Here we only need to
>> check the first one and the last one, since the intermediates should 
>> have been handled by split_address_groups.  With unrolling the
>> displacement of the address can be offset-ed by (UF-1)*step, check the
>> address with this max offset whether still valid.  If the check finds
>> it's valid to use reg offset mode for the whole group, we flag this
>> group.  Later, when we create IV candidate for address group flagged,
>> we flag the candidate further.  This flag is mainly for iv cand
>> costing, we don't need to scale up iv cand's step cost for this kind
>> of candidate.
> 
> But AIUI, this is calculating whether the uses in their original form
> support all unrolled offsets.  For ivopts, I think the question is really
> whether the uses support all unrolled offsets when based on a given IV
> candidate (which might not be the original IV).
> 

Good point!  Indeed, the patch only flags the IV cands derived from the
address group flagged with reg_offset_p, it has the possibility that
we miss some other candidates with same basic but different offset
which can satisfy addr_offset_valid_p.

How about to update the current approach to: instead of flag the derived
iv cand, when we determine the cost for iv cand, we check whether this
iv cand has the same basic object as the one of any reg_offset_p vuse[0]
(both should be stripped their offsets), then further check the offset
can satisfy addr_offset_valid_p, if all checks expected, update the step
cost without uf consideration.

I would expect this kind of address based iv cand will mainly be used
for address type group and compare type group, for the address type 
group, it can be only applied for those with same basic object, in most
cases they are put in the same address group.  If it's used for generic
much, the step cost tweaking might not be fixed at the beginning but
varies according to the iv set members.  It looks too much for the
existing framework.

> E.g. there might be another IV candidate at a constant offset
> from the original one, and the offsets might all be in range
> for that offset too.
> 
>> Imagining this loop is being unrolled, all the statements will be
>> duplicated by UF.  For the cost modeling against iv group, it's
>> scaling up the cost by UF (here I simply excluded the compare_type
>> since in most cases it for loop ending check).  For the cost modeling
>> against iv candidate, it's to focus on step costs, for an iv candidate
>> we flagged before, it's taken as one time step cost, for the others,
>> it's scaling up the step cost since the unrolling make step 
>> calculation become UF times.
>>
>> This cost modeling is trying to simulate cost change after the
>> unrolling, scaling up the costs accordingly.  There are somethings
>> to be improved like distinguish the loop ending compare or else,
>> whether need to tweak the other costs somehow since the scaling up
>> probably cause existing cost framework imbalance, but during
>> benchmarking I didn't find these matter, so take it as simple as 
>> possible for now.
>>
>>
>>> Instead, couldn't we take this into account in get_address_cost,
>>> which calculates the cost of an address use for a given candidate?
>>> E.g. after the main if-else at the start of the function,
>>> perhaps it would make sense to add the worst-case offset to
>>> the address in “parts”, check whether that too is a valid address,
>>> and if not, increase var_cost by the cost of one add instruction.
>>>
>>
>> IIUC, what you suggest is to tweak the iv group cost, if we find
>> one address group is valid for reg offset mode, we price more on
>> the pairs between this group and other non address-based iv cands.
>> The question is how do we decide this add-on cost.  For the test
>> case I was working on initially, adding one cost (of add) doesn't
>> work, the normal iv still wined.  We can price it more like two
>> but what's the justification on this value, by heuristics?
> 
> Yeah, I was thinking of adding one instance of add_cost.  If that
> doesn't work, it'd be interesting to know why in more detail.
> 

The case is like:

            for (i = 0; i < SIZE; i++)
              y[i] = a * x[i] + z[i];

It has three array access in the loop body, after vectorization,
it looks like

  vect__1.7_15 = MEM <vector(2) double> [(double *)vectp_x.5_20];
  vect__3.8_13 = vect__1.7_15 * vect_cst__14;
  vect__4.11_19 = MEM <vector(2) double> [(double *)vectp_z.9_7];
  vect__5.12_23 = vect__3.8_13 + vect__4.11_19;
  MEM <vector(2) double> [(double *)vectp_y.13_24] = vect__5.12_23;

we expect to use reg_offset for those vector load/store when
unrolling factor is big, but without unrolling or unrolling factor
smaller like 2, the reg_index should outperform reg_offset.

IIUC, the proposed costing change would look like (the left is before
change, while the right is after change).  Those zero costs are for
those iv cands based on address objects.  Group 0,1,2 are address
groups, group 3 is compare group (omitted).  One insn add cost is
counted as 4 on ppc64le.


  <Group-candidate Costs>:                \  <Group-candidate Costs>:
  Group 0:                                \  Group 0:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       1;      NIL;    \    1     9       1       1;      NIL;
    4     1       1       1;      NIL;    \    4     5       1       1;      NIL;
    7     0       1       NIL;    NIL;    \    7     0       1       NIL;    NIL;
    8     0       0       NIL;    NIL;    \    8     0       0       NIL;    NIL;
    9     0       0       NIL;    NIL;    \    9     0       0       NIL;    NIL;
    13    5       1       1;      NIL;    \    13    9       1       1;      NIL;
    14    5       1       3;      NIL;    \    14    9       1       3;      NIL;
                                          \
  Group 1:                                \  Group 1:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       4;      NIL;    \    1     9       1       4;      NIL;
    3     0       1       NIL;    NIL;    \    3     0       1       NIL;    NIL;
    4     1       1       4;      NIL;    \    4     5       1       4;      NIL;
    5     0       0       NIL;    NIL;    \    5     0       0       NIL;    NIL;
    6     0       0       NIL;    NIL;    \    6     0       0       NIL;    NIL;
    13    5       1       4;      NIL;    \    13    9       1       4;      NIL;
    14    5       1       6;      NIL;    \    14    9       1       6;      NIL;
                                          \
  Group 2:                                \  Group 2:
    cand  cost    compl.  inv.expr.       \    cand  cost    compl.  inv.expr.       inv.vars
    1     5       1       7;      NIL;    \    1     9       1       7;      NIL;
    4     1       1       7;      NIL;    \    4     5       1       7;      NIL;
    10    0       0       NIL;    NIL;    \    10    4       0       NIL;    NIL;
    11    0       0       NIL;    NIL;    \    11    4       0       NIL;    NIL;
    12    0       1       NIL;    NIL;    \    12    4       1       NIL;    NIL;
    13    5       1       7;      NIL;    \    13    9       1       7;      NIL;
    14    5       1       9;      NIL;    \    14    9       1       9;      NIL;

  Initial set of candidates:              \  Initial set of candidates:
    cost: 13 (complexity 3)               \    cost: 25 (complexity 0)
    reg_cost: 5                           \    reg_cost: 6
    cand_cost: 5                          \    cand_cost: 15
    cand_group_cost: 3 (complexity 3)     \    cand_group_cost: 4 (complexity 0)
    candidates: 4                         \    candidates: 5, 8, 10
     group:0 --> iv_cand:4, cost=(1,1)    \     group:0 --> iv_cand:8, cost=(0,0)
     group:1 --> iv_cand:4, cost=(1,1)    \     group:1 --> iv_cand:5, cost=(0,0)
     group:2 --> iv_cand:4, cost=(1,1)    \     group:2 --> iv_cand:10, cost=(4,0)
     group:3 --> iv_cand:4, cost=(0,0)    \     group:3 --> iv_cand:5, cost=(0,0)
    invariant variables:                  \    invariant variables:
    invariant expressions: 1, 4, 7        \    invariant expressions:
                                          \
  Original cost 21 (complexity 0)         \  Original cost 25 (complexity 0)
                                          \
  Final cost 13 (complexity 3)            \  Final cost 25 (complexity 0)

When unrolling factor is 8, we expect to use reg offset modes for group 0, 1, 2,
this proposal works well.  But for unrolling factor is 2, we expect to still
use reg index modes for group 0, 1, 2 (with iv_cand 4 as before), this proposal
looks sticked to use iv_cand 8,5,10.  It looks we will over-blame the other 
non reg offset candidates.  sorry that I have no idea how to leverage uf in this
proposal, but it changes the focus to iv_group cost, seems not easy to scale up
or down this with good justification.

BR,
Kewen

>>> I guess there are two main sources of inexactness if we do that:
>>>
>>> (1) It might underestimate the cost because it assumes that vuse[0]
>>>     stands for all vuses in the group.
>>>
>>
>> Do you mean we don't need one check function like mark_reg_offset_groups?
>> If without it, vuse[0] might be not enough since we can't ensure the
>> others are fine with additional displacement from unrolling.  If we still
>> have it, I think it's fine to just use vuse[0].
>>
>>> (2) It might overestimates the cost because it treats all unrolled
>>>     iterations as having the cost of the final unrolled iteration.
>>>
>>> (1) could perhaps be avoided by adding a flag to the iv_use to say
>>> whether it wants this treatment.  I think the flag approach suffers
>>> from (2) too, and I'd be surprised if it makes a difference in practice.
>>>
>>
>> Sorry, I didn't have the whole picture how to deal with uf for your proposal.
>> But the flag approach considers uf in iv group cost calculation as well as
>> iv cand step cost calculation.
>>
>> BR,
>> Kewen
>>
>>> Thanks,
>>> Richard
>>>
Bin.Cheng Aug. 8, 2020, 8:01 a.m. UTC | #5
Hi Kewen,
Sorry for the late reply.
The patch's most important change is below cost computation:

> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>
> +  /* Consider additional step updates during unrolling.  */
> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
This is a bit strange, to me the add instructions are additional
computation caused by unrolling+addressing_mode, rather than a native
part in candidate itself.  Specifically, an additional cost is needed
if candidates (without reg_offset_p) are chosen for the address type
group/uses.
> +
>   /* Prefer the original ivs unless we may gain something by replacing it.
>      The reason is to make debugging simpler; so this is not relevant for
>      artificial ivs created by other optimization passes.  */
>

> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>       return;
>     }
>
> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> +     matter.  */
> +  if (data->consider_reg_offset_for_unroll_p
> +      && group->vuses[0]->type != USE_COMPARE)
> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
Not quite follow here, giving "pricing more on on-reg_offset IV cand"
doesn't make much sense to me.  Also why generic type uses are not
skipped?  We want to model the cost required for address computation,
however, for generic type uses there is no way to save the computation
in "address expression".  Once unrolled, the computation is always
there?

And what's the impact on targets supporting [base + index + offset]
addressing mode?

Given the patch is not likely to harm because rtl loop unrolling is
(or was?) by default disabled, so I am OK once the above comments are
addressed.

I wonder if it's possible to get 10% of (all which should be unrolled)
loops unrolled (conservatively) on gimple and enable it by default at
O3, rather than teaching ivopts to model a future pass which not
likely be used outside of benchmarks?

Thanks,
bin

> +
>   if (data->consider_all_candidates)
>     {
>       group->cost_map[cand->id].cand = cand;

On Thu, May 28, 2020 at 8:24 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
>
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
>         * tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p.
>         (struct iv_cand): New field reg_offset_p.
>         (struct ivopts_data): New field consider_reg_offset_for_unroll_p.
>         (dump_groups): Dump group with reg_offset_p.
>         (record_group): Initialize reg_offset_p.
>         (mark_reg_offset_groups): New function.
>         (find_interesting_uses): Call mark_reg_offset_groups.
>         (add_candidate_1): Update reg_offset_p if derived from reg_offset_p group.
>         (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (determine_iv_cost): Increase step cost with estimate_unroll_factor if
>         consider_reg_offset_for_unroll_p.
>         (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update
>         consider_reg_offset_for_unroll_p.
>
> ----
Kewen.Lin Aug. 10, 2020, 4:27 a.m. UTC | #6
Hi Bin,

Thanks for the review!!

on 2020/8/8 下午4:01, Bin.Cheng wrote:
> Hi Kewen,
> Sorry for the late reply.
> The patch's most important change is below cost computation:
> 
>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>>
>> +  /* Consider additional step updates during unrolling.  */
>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> This is a bit strange, to me the add instructions are additional
> computation caused by unrolling+addressing_mode, rather than a native
> part in candidate itself.  Specifically, an additional cost is needed
> if candidates (without reg_offset_p) are chosen for the address type
> group/uses.

Good point, ideally it should be one additional cost for each cand set,
when we select one cand for one group, we need to check this pair need
more (estimated_unroll - 1) step costs, we probably need to care about
this during remove/replace etc.  IIUC the current IVOPTs cost framework
doesn't support this and it could increase the selection complexity and
time.  I hesitated to do it and put it to cand step cost initially instead.

I was thinking those candidates with reg_offset_p should be only used for
those reg_offset_p groups in most cases (very limited) meanwhile the others
are simply scaled up like before.  But indeed this can cover some similar
cases like one cand is only used for the compare type group which is for
loop closing, then it doesn't need more step costs for unrolling.

Do you prefer me to improve the current cost framework?

>> +
>>   /* Prefer the original ivs unless we may gain something by replacing it.
>>      The reason is to make debugging simpler; so this is not relevant for
>>      artificial ivs created by other optimization passes.  */
>>
> 
>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>>       return;
>>     }
>>
>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
>> +     matter.  */
>> +  if (data->consider_reg_offset_for_unroll_p
>> +      && group->vuses[0]->type != USE_COMPARE)
>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> doesn't make much sense to me.  Also why generic type uses are not
> skipped?  We want to model the cost required for address computation,
> however, for generic type uses there is no way to save the computation
> in "address expression".  Once unrolled, the computation is always
> there?
> 

The main intention is to scale up the group/cand cost for unrolling since
we have scaled up the step costs.  The assumption is that the original
costing (without this patch) can be viewed as either for all unrolled
iterations or just one single iteration.  Since IVOPTs doesn't support
fractional costing, I interpreted it as single iterations, to emulate
unrolling scenario based on the previous step cost scaling, we need to 
scale up the cost for all computation.

In most cases, the compare type use is for loop closing, there is still
only one computation even unrolling happens, so I excluded it here.
As "FIXME", if we find some cases are off, we can further restrict it to
those USE_COMPARE uses which is exactly for loop closing.

> And what's the impact on targets supporting [base + index + offset]
> addressing mode?

Good question, I didn't notice it since power doesn't support it.
I noticed the comments of function addr_offset_valid_p only mentioning
[base + offset], I guess it excludes [base + index + offset]?
But I guess the address-based IV can work for this mode?

> 
> Given the patch is not likely to harm because rtl loop unrolling is
> (or was?) by default disabled, so I am OK once the above comments are
> addressed.
> 

Yes, it needs explicit unrolling options, excepting for some targets
wants to enable it for some cases with specific loop_unroll_adjust checks.

> I wonder if it's possible to get 10% of (all which should be unrolled)
> loops unrolled (conservatively) on gimple and enable it by default at
> O3, rather than teaching ivopts to model a future pass which not
> likely be used outside of benchmarks?
> 

Yeah, it would be nice if the unrolling happen before IVOPTs and won't
have any future unrollings to get it off.  PR[1] seems to have some
discussion on gimple unrolling.

Richi suggested driven-by-need gimple unrolling in the previous discussion[2]
on the RFC of this.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html

BR,
Kewen
Bin.Cheng Aug. 10, 2020, 12:38 p.m. UTC | #7
On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> Thanks for the review!!
>
> on 2020/8/8 下午4:01, Bin.Cheng wrote:
> > Hi Kewen,
> > Sorry for the late reply.
> > The patch's most important change is below cost computation:
> >
> >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
> >>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>
> >> +  /* Consider additional step updates during unrolling.  */
> >> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> > This is a bit strange, to me the add instructions are additional
> > computation caused by unrolling+addressing_mode, rather than a native
> > part in candidate itself.  Specifically, an additional cost is needed
> > if candidates (without reg_offset_p) are chosen for the address type
> > group/uses.
>
> Good point, ideally it should be one additional cost for each cand set,
> when we select one cand for one group, we need to check this pair need
> more (estimated_unroll - 1) step costs, we probably need to care about
> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> doesn't support this and it could increase the selection complexity and
> time.  I hesitated to do it and put it to cand step cost initially instead.
>
> I was thinking those candidates with reg_offset_p should be only used for
> those reg_offset_p groups in most cases (very limited) meanwhile the others
> are simply scaled up like before.  But indeed this can cover some similar
> cases like one cand is only used for the compare type group which is for
> loop closing, then it doesn't need more step costs for unrolling.
>
> Do you prefer me to improve the current cost framework?
No, I don't think it's relevant to the candidate selecting algorithm.
I was thinking about adjusting cost somehow in
determine_group_iv_cost_address. Given we don't expose selected
addressing mode in this function, you may need to do it in
get_address_cost, either way.

>
> >> +
> >>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>      The reason is to make debugging simpler; so this is not relevant for
> >>      artificial ivs created by other optimization passes.  */
> >>
> >
> >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>       return;
> >>     }
> >>
> >> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> >> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> >> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> >> +     matter.  */
> >> +  if (data->consider_reg_offset_for_unroll_p
> >> +      && group->vuses[0]->type != USE_COMPARE)
> >> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> > Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> > doesn't make much sense to me.  Also why generic type uses are not
> > skipped?  We want to model the cost required for address computation,
> > however, for generic type uses there is no way to save the computation
> > in "address expression".  Once unrolled, the computation is always
> > there?
> >
>
> The main intention is to scale up the group/cand cost for unrolling since
> we have scaled up the step costs.  The assumption is that the original
If we adjust cost appropriately in function *group_iv_cost_address,
this would become unnecessary, right?  And naturally.
> costing (without this patch) can be viewed as either for all unrolled
> iterations or just one single iteration.  Since IVOPTs doesn't support
> fractional costing, I interpreted it as single iterations, to emulate
> unrolling scenario based on the previous step cost scaling, we need to
> scale up the cost for all computation.
>
> In most cases, the compare type use is for loop closing, there is still
> only one computation even unrolling happens, so I excluded it here.
> As "FIXME", if we find some cases are off, we can further restrict it to
> those USE_COMPARE uses which is exactly for loop closing.
>
> > And what's the impact on targets supporting [base + index + offset]
> > addressing mode?
>
> Good question, I didn't notice it since power doesn't support it.
> I noticed the comments of function addr_offset_valid_p only mentioning
> [base + offset], I guess it excludes [base + index + offset]?
> But I guess the address-based IV can work for this mode?
No, addr_offset_valid_p is only used to split address use groups.  See
get_address_cost and struct mem_address.
I forgot to ask, what about target only supports [base + offset]
addressing mode like RISC-V?  I would expect it's not affected at all.

>
> >
> > Given the patch is not likely to harm because rtl loop unrolling is
> > (or was?) by default disabled, so I am OK once the above comments are
> > addressed.
> >
>
> Yes, it needs explicit unrolling options, excepting for some targets
> wants to enable it for some cases with specific loop_unroll_adjust checks.
>
> > I wonder if it's possible to get 10% of (all which should be unrolled)
> > loops unrolled (conservatively) on gimple and enable it by default at
> > O3, rather than teaching ivopts to model a future pass which not
> > likely be used outside of benchmarks?
> >
>
> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
> have any future unrollings to get it off.  PR[1] seems to have some
> discussion on gimple unrolling.
Thanks for directing me to the discussion.  I am on Wilco's side on
this problem, IMHO, It might be useful getting small loops unrolled
(at O3 by default) by simply saving induction variable stepping and
exit condition check, which can be modeled on gimple.  Especially for
RISC-V, it doesn't support index addressing mode, which means there
will be as many induction variables as distinct arrays.  Also
interleaving after unrolling is not that important, it's at high
chance that small loops eligible for interleaving are handled by
vectorizer already.

Thanks,
bin
>
> Richi suggested driven-by-need gimple unrolling in the previous discussion[2]
> on the RFC of this.
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html
>
> BR,
> Kewen
Kewen.Lin Aug. 10, 2020, 2:41 p.m. UTC | #8
Hi Bin,

on 2020/8/10 下午8:38, Bin.Cheng wrote:
> On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> Thanks for the review!!
>>
>> on 2020/8/8 下午4:01, Bin.Cheng wrote:
>>> Hi Kewen,
>>> Sorry for the late reply.
>>> The patch's most important change is below cost computation:
>>>
>>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
>>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
>>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
>>>>
>>>> +  /* Consider additional step updates during unrolling.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
>>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
>>> This is a bit strange, to me the add instructions are additional
>>> computation caused by unrolling+addressing_mode, rather than a native
>>> part in candidate itself.  Specifically, an additional cost is needed
>>> if candidates (without reg_offset_p) are chosen for the address type
>>> group/uses.
>>
>> Good point, ideally it should be one additional cost for each cand set,
>> when we select one cand for one group, we need to check this pair need
>> more (estimated_unroll - 1) step costs, we probably need to care about
>> this during remove/replace etc.  IIUC the current IVOPTs cost framework
>> doesn't support this and it could increase the selection complexity and
>> time.  I hesitated to do it and put it to cand step cost initially instead.
>>
>> I was thinking those candidates with reg_offset_p should be only used for
>> those reg_offset_p groups in most cases (very limited) meanwhile the others
>> are simply scaled up like before.  But indeed this can cover some similar
>> cases like one cand is only used for the compare type group which is for
>> loop closing, then it doesn't need more step costs for unrolling.
>>
>> Do you prefer me to improve the current cost framework?
> No, I don't think it's relevant to the candidate selecting algorithm.
> I was thinking about adjusting cost somehow in
> determine_group_iv_cost_address. Given we don't expose selected
> addressing mode in this function, you may need to do it in
> get_address_cost, either way.
> 

Thanks for your suggestion!

Sorry, I may miss something, but I still think the additional cost is
per candidate.  The justification is that we miss to model the iv
candidate step well in the context of unrolling, the step cost is part
of candidate cost, which is per candidate.

To initialize it in determine_iv_cost isn't perfect as you pointed out,
ideally we should check any uses of the candidate requires iv update
after each replicated iteration, and take extra step costs into account
if at least one needs, meanwhile scaling up all the computation cost to
reflect unrolling cost nature.

Besides, the reg_offset desirable pair already takes zero cost for
cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
suggesting increasing the cost for non reg_offset pairs?  If so and per
pair, the extra cost looks possible to be computed several times
unexpectedly.

>>
>>>> +
>>>>   /* Prefer the original ivs unless we may gain something by replacing it.
>>>>      The reason is to make debugging simpler; so this is not relevant for
>>>>      artificial ivs created by other optimization passes.  */
>>>>
>>>
>>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
>>>>       return;
>>>>     }
>>>>
>>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
>>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
>>>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
>>>> +     matter.  */
>>>> +  if (data->consider_reg_offset_for_unroll_p
>>>> +      && group->vuses[0]->type != USE_COMPARE)
>>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
>>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
>>> doesn't make much sense to me.  Also why generic type uses are not
>>> skipped?  We want to model the cost required for address computation,
>>> however, for generic type uses there is no way to save the computation
>>> in "address expression".  Once unrolled, the computation is always
>>> there?
>>>
>>
>> The main intention is to scale up the group/cand cost for unrolling since
>> we have scaled up the step costs.  The assumption is that the original
> If we adjust cost appropriately in function *group_iv_cost_address,
> this would become unnecessary, right?  And naturally.
>> costing (without this patch) can be viewed as either for all unrolled
>> iterations or just one single iteration.  Since IVOPTs doesn't support
>> fractional costing, I interpreted it as single iterations, to emulate
>> unrolling scenario based on the previous step cost scaling, we need to
>> scale up the cost for all computation.
>>
>> In most cases, the compare type use is for loop closing, there is still
>> only one computation even unrolling happens, so I excluded it here.
>> As "FIXME", if we find some cases are off, we can further restrict it to
>> those USE_COMPARE uses which is exactly for loop closing.
>>
>>> And what's the impact on targets supporting [base + index + offset]
>>> addressing mode?
>>
>> Good question, I didn't notice it since power doesn't support it.
>> I noticed the comments of function addr_offset_valid_p only mentioning
>> [base + offset], I guess it excludes [base + index + offset]?
>> But I guess the address-based IV can work for this mode?
> No, addr_offset_valid_p is only used to split address use groups.  See
> get_address_cost and struct mem_address.
> I forgot to ask, what about target only supports [base + offset]
> addressing mode like RISC-V?  I would expect it's not affected at all.
> 
addr_offset_valid_p is also used in this patch as Richard S. and Segher
suggested to check the offset after unrolling (like: offset+(uf-1)*step)
is still valid for the target.  If address-based IV can not work for
[base + index + offset], it won't affect [base + index + offset].

It can help all targets which supports [base + offset], so I'm afraid
it can affect RISC-V too, but I would expect it's positive.  Or do you
happen to see some potential issues? or have some concerns?

And as Richard S. suggested before, it has one parameter to control,
target can simply disable this if it dislikes.

>>
>>>
>>> Given the patch is not likely to harm because rtl loop unrolling is
>>> (or was?) by default disabled, so I am OK once the above comments are
>>> addressed.
>>>
>>
>> Yes, it needs explicit unrolling options, excepting for some targets
>> wants to enable it for some cases with specific loop_unroll_adjust checks.
>>
>>> I wonder if it's possible to get 10% of (all which should be unrolled)
>>> loops unrolled (conservatively) on gimple and enable it by default at
>>> O3, rather than teaching ivopts to model a future pass which not
>>> likely be used outside of benchmarks?
>>>
>>
>> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
>> have any future unrollings to get it off.  PR[1] seems to have some
>> discussion on gimple unrolling.
> Thanks for directing me to the discussion.  I am on Wilco's side on
> this problem, IMHO, It might be useful getting small loops unrolled
> (at O3 by default) by simply saving induction variable stepping and
> exit condition check, which can be modeled on gimple.  Especially for
> RISC-V, it doesn't support index addressing mode, which means there
> will be as many induction variables as distinct arrays.  Also
> interleaving after unrolling is not that important, it's at high
> chance that small loops eligible for interleaving are handled by
> vectorizer already.
> 

Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see
whether we can bring in one gimple unrolling pass.  Regardless we
have/don't have it, if the RTL unrolling is there, I guess this patch
set is still beneficial?  If so, I would take it separately.

BR,
Kewen
Bin.Cheng Aug. 16, 2020, 3:59 a.m. UTC | #9
On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2020/8/10 下午8:38, Bin.Cheng wrote:
> > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for the review!!
> >>
> >> on 2020/8/8 下午4:01, Bin.Cheng wrote:
> >>> Hi Kewen,
> >>> Sorry for the late reply.
> >>> The patch's most important change is below cost computation:
> >>>
> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
> >>>>     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
> >>>>   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
> >>>>
> >>>> +  /* Consider additional step updates during unrolling.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
> >>>> +    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
> >>> This is a bit strange, to me the add instructions are additional
> >>> computation caused by unrolling+addressing_mode, rather than a native
> >>> part in candidate itself.  Specifically, an additional cost is needed
> >>> if candidates (without reg_offset_p) are chosen for the address type
> >>> group/uses.
> >>
> >> Good point, ideally it should be one additional cost for each cand set,
> >> when we select one cand for one group, we need to check this pair need
> >> more (estimated_unroll - 1) step costs, we probably need to care about
> >> this during remove/replace etc.  IIUC the current IVOPTs cost framework
> >> doesn't support this and it could increase the selection complexity and
> >> time.  I hesitated to do it and put it to cand step cost initially instead.
> >>
> >> I was thinking those candidates with reg_offset_p should be only used for
> >> those reg_offset_p groups in most cases (very limited) meanwhile the others
> >> are simply scaled up like before.  But indeed this can cover some similar
> >> cases like one cand is only used for the compare type group which is for
> >> loop closing, then it doesn't need more step costs for unrolling.
> >>
> >> Do you prefer me to improve the current cost framework?
> > No, I don't think it's relevant to the candidate selecting algorithm.
> > I was thinking about adjusting cost somehow in
> > determine_group_iv_cost_address. Given we don't expose selected
> > addressing mode in this function, you may need to do it in
> > get_address_cost, either way.
> >
>
> Thanks for your suggestion!
>
> Sorry, I may miss something, but I still think the additional cost is
> per candidate.  The justification is that we miss to model the iv
> candidate step well in the context of unrolling, the step cost is part
> of candidate cost, which is per candidate.
>
> To initialize it in determine_iv_cost isn't perfect as you pointed out,
> ideally we should check any uses of the candidate requires iv update
> after each replicated iteration, and take extra step costs into account
> if at least one needs, meanwhile scaling up all the computation cost to
> reflect unrolling cost nature.
I see, it's similar to the auto-increment case where cost should be
recorded only once.  So this is okay given 1) fine predicting
rtl-unroll is likely impossible here; 2) the patch has very limited
impact.

Thanks,
bin
>
> Besides, the reg_offset desirable pair already takes zero cost for
> cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you
> suggesting increasing the cost for non reg_offset pairs?  If so and per
> pair, the extra cost looks possible to be computed several times
> unexpectedly.
>
> >>
> >>>> +
> >>>>   /* Prefer the original ivs unless we may gain something by replacing it.
> >>>>      The reason is to make debugging simpler; so this is not relevant for
> >>>>      artificial ivs created by other optimization passes.  */
> >>>>
> >>>
> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data,
> >>>>       return;
> >>>>     }
> >>>>
> >>>> +  /* Since we priced more on non reg_offset IV cand step cost, we should scale
> >>>> +     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
> >>>> +     loop exit, FIXME if multiple exits supported or no loop exit comparisons
> >>>> +     matter.  */
> >>>> +  if (data->consider_reg_offset_for_unroll_p
> >>>> +      && group->vuses[0]->type != USE_COMPARE)
> >>>> +    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
> >>> Not quite follow here, giving "pricing more on on-reg_offset IV cand"
> >>> doesn't make much sense to me.  Also why generic type uses are not
> >>> skipped?  We want to model the cost required for address computation,
> >>> however, for generic type uses there is no way to save the computation
> >>> in "address expression".  Once unrolled, the computation is always
> >>> there?
> >>>
> >>
> >> The main intention is to scale up the group/cand cost for unrolling since
> >> we have scaled up the step costs.  The assumption is that the original
> > If we adjust cost appropriately in function *group_iv_cost_address,
> > this would become unnecessary, right?  And naturally.
> >> costing (without this patch) can be viewed as either for all unrolled
> >> iterations or just one single iteration.  Since IVOPTs doesn't support
> >> fractional costing, I interpreted it as single iterations, to emulate
> >> unrolling scenario based on the previous step cost scaling, we need to
> >> scale up the cost for all computation.
> >>
> >> In most cases, the compare type use is for loop closing, there is still
> >> only one computation even unrolling happens, so I excluded it here.
> >> As "FIXME", if we find some cases are off, we can further restrict it to
> >> those USE_COMPARE uses which is exactly for loop closing.
> >>
> >>> And what's the impact on targets supporting [base + index + offset]
> >>> addressing mode?
> >>
> >> Good question, I didn't notice it since power doesn't support it.
> >> I noticed the comments of function addr_offset_valid_p only mentioning
> >> [base + offset], I guess it excludes [base + index + offset]?
> >> But I guess the address-based IV can work for this mode?
> > No, addr_offset_valid_p is only used to split address use groups.  See
> > get_address_cost and struct mem_address.
> > I forgot to ask, what about target only supports [base + offset]
> > addressing mode like RISC-V?  I would expect it's not affected at all.
> >
> addr_offset_valid_p is also used in this patch as Richard S. and Segher
> suggested to check the offset after unrolling (like: offset+(uf-1)*step)
> is still valid for the target.  If address-based IV can not work for
> [base + index + offset], it won't affect [base + index + offset].
>
> It can help all targets which supports [base + offset], so I'm afraid
> it can affect RISC-V too, but I would expect it's positive.  Or do you
> happen to see some potential issues? or have some concerns?
>
> And as Richard S. suggested before, it has one parameter to control,
> target can simply disable this if it dislikes.
>
> >>
> >>>
> >>> Given the patch is not likely to harm because rtl loop unrolling is
> >>> (or was?) by default disabled, so I am OK once the above comments are
> >>> addressed.
> >>>
> >>
> >> Yes, it needs explicit unrolling options, excepting for some targets
> >> wants to enable it for some cases with specific loop_unroll_adjust checks.
> >>
> >>> I wonder if it's possible to get 10% of (all which should be unrolled)
> >>> loops unrolled (conservatively) on gimple and enable it by default at
> >>> O3, rather than teaching ivopts to model a future pass which not
> >>> likely be used outside of benchmarks?
> >>>
> >>
> >> Yeah, it would be nice if the unrolling happen before IVOPTs and won't
> >> have any future unrollings to get it off.  PR[1] seems to have some
> >> discussion on gimple unrolling.
> > Thanks for directing me to the discussion.  I am on Wilco's side on
> > this problem, IMHO, It might be useful getting small loops unrolled
> > (at O3 by default) by simply saving induction variable stepping and
> > exit condition check, which can be modeled on gimple.  Especially for
> > RISC-V, it doesn't support index addressing mode, which means there
> > will be as many induction variables as distinct arrays.  Also
> > interleaving after unrolling is not that important, it's at high
> > chance that small loops eligible for interleaving are handled by
> > vectorizer already.
> >
>
> Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see
> whether we can bring in one gimple unrolling pass.  Regardless we
> have/don't have it, if the RTL unrolling is there, I guess this patch
> set is still beneficial?  If so, I would take it separately.
>
> BR,
> Kewen
diff mbox series

Patch

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1d2697ae1ba..1b7e4621f37 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -432,6 +432,8 @@  struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is reg_offset valid.  */
+  bool reg_offset_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -473,6 +475,7 @@  struct iv_cand
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
   bool doloop_p;	/* Whether this is a doloop candidate.  */
+  bool reg_offset_p;    /* Derived from one reg_offset valid group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -653,6 +656,10 @@  struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether need to consider register offset addressing mode for the loop with
+     upcoming unrolling by estimated unroll factor.  */
+  bool consider_reg_offset_for_unroll_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -840,6 +847,11 @@  dump_groups (FILE *file, struct ivopts_data *data)
 	  gcc_assert (group->type == USE_COMPARE);
 	  fprintf (file, "  Type:\tCOMPARE\n");
 	}
+      if (group->reg_offset_p)
+	{
+	  gcc_assert (address_p (group->type));
+	  fprintf (file, "  reg_offset_p: true\n");
+	}
       for (j = 0; j < group->vuses.length (); j++)
 	dump_use (file, group->vuses[j]);
     }
@@ -1582,6 +1594,7 @@  record_group (struct ivopts_data *data, enum use_type type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->reg_offset_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2731,6 +2744,60 @@  split_address_groups (struct ivopts_data *data)
     }
 }
 
+/* Go through all address type groups, check and mark reg_offset addressing mode
+   valid groups.  */
+
+static void
+mark_reg_offset_groups (struct ivopts_data *data)
+{
+  class loop *loop = data->current_loop;
+  gcc_assert (data->current_loop->estimated_unroll > 1);
+  bool any_reg_offset_p = false;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (address_p (group->type))
+	{
+	  struct iv_use *head_use = group->vuses[0];
+	  if (!tree_fits_poly_int64_p (head_use->iv->step))
+	    continue;
+
+	  bool found = true;
+	  poly_int64 step = tree_to_poly_int64 (head_use->iv->step);
+	  /* Max extra offset to fill for head of group.  */
+	  poly_int64 max_increase = (loop->estimated_unroll - 1) * step;
+	  /* Check whether this increment still valid.  */
+	  if (!addr_offset_valid_p (head_use, max_increase))
+	    found = false;
+
+	  unsigned group_size = group->vuses.length ();
+	  /* Check the whole group further.  */
+	  if (group_size > 1)
+	    {
+	      /* Only need to check the last one in the group, both the head and
+		the last is valid, the others should be fine.  */
+	      struct iv_use *last_use = group->vuses[group_size - 1];
+	      poly_int64 max_delta
+		= last_use->addr_offset - head_use->addr_offset;
+	      poly_int64 max_offset = max_delta + max_increase;
+	      if (maybe_ne (max_delta, 0)
+		  && !addr_offset_valid_p (head_use, max_offset))
+		found = false;
+	    }
+
+	  if (found)
+	    {
+	      group->reg_offset_p = true;
+	      any_reg_offset_p = true;
+	    }
+	}
+    }
+
+  if (!any_reg_offset_p)
+    data->consider_reg_offset_for_unroll_p = false;
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2762,6 +2829,9 @@  find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  if (data->consider_reg_offset_for_unroll_p)
+    mark_reg_offset_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3147,6 +3217,7 @@  add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
       cand->important = important;
       cand->incremented_at = incremented_at;
       cand->doloop_p = doloop;
+      cand->reg_offset_p = false;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3183,7 +3254,11 @@  add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-    bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+    {
+      bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i);
+      if (data->vgroups[use->group_id]->reg_offset_p)
+	cand->reg_offset_p = true;
+    }
 
   return cand;
 }
@@ -3654,6 +3729,14 @@  set_group_iv_cost (struct ivopts_data *data,
       return;
     }
 
+  /* Since we priced more on non reg_offset IV cand step cost, we should scale
+     up the appropriate IV group costs.  Simply consider USE_COMPARE at the
+     loop exit, FIXME if multiple exits supported or no loop exit comparisons
+     matter.  */
+  if (data->consider_reg_offset_for_unroll_p
+      && group->vuses[0]->type != USE_COMPARE)
+    cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll;
+
   if (data->consider_all_candidates)
     {
       group->cost_map[cand->id].cand = cand;
@@ -5890,6 +5973,10 @@  determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
     cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
+  /* Consider additional step updates during unrolling.  */
+  if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p)
+    cost += (data->current_loop->estimated_unroll - 1) * cost_step;
+
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
@@ -7976,6 +8063,7 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
+  data->consider_reg_offset_for_unroll_p = false;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -8008,6 +8096,16 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop,
   if (!find_induction_variables (data))
     goto finish;
 
+  if (param_iv_consider_reg_offset_for_unroll != 0 && exit)
+    {
+      tree_niter_desc *desc = niter_for_exit (data, exit);
+      estimate_unroll_factor (loop, desc);
+      data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1;
+      if (dump_file && (dump_flags & TDF_DETAILS)
+	  && data->consider_reg_offset_for_unroll_p)
+	fprintf (dump_file, "estimated_unroll:%u\n", loop->estimated_unroll);
+    }
+
   /* Finds interesting uses (item 1).  */
   find_interesting_uses (data);
   if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)