diff mbox series

[1/2] vect: Add costing_for_scalar parameter to init_cost hook

Message ID 5169a943-4a45-040a-60c6-91af7718dc6f@linux.ibm.com
State New
Headers show
Series [1/2] vect: Add costing_for_scalar parameter to init_cost hook | expand

Commit Message

Kewen.Lin May 8, 2021, 8:05 a.m. UTC
Hi Richi,

Thanks for the comments!

on 2021/5/7 5:43 PM, Richard Biener wrote:
> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> When I was investigating density_test heuristics, I noticed that
>> the current rs6000_density_test could be used for single scalar
>> iteration cost calculation, through the call trace:
>>   vect_compute_single_scalar_iteration_cost
>>     -> rs6000_finish_cost
>>          -> rs6000_density_test
>>
>> It looks unexpected given its descriptive function comments, and
>> Bill helped to confirm this needs to be fixed (thanks!).
>>
>> So this patch checks the passed data: if it's the same as the one
>> in loop_vinfo, it indicates we're costing the vector version;
>> otherwise, just return early.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Nothing remarkable was observed with SPEC2017 Power9 full run.
>>
>> Is it ok for trunk?
> 
> +  /* Only care about cost of vector version, so exclude scalar
> version here.  */
> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
> +    return;
> 
> Hmm, looks like a quite "random" test to me.  What about adding a
> parameter to finish_cost () (or init_cost?) indicating the cost kind?
> 

I originally wanted to change the hook interface, but noticed that
the finish_cost in function vect_estimate_min_profitable_iters is
the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
that looked enough to differentiate the scalar version costing from
the vector version costing for a loop.  Do you mean this observation/
assumption is easy to break sometime later?

The attached patch adds one new parameter to indicate the
costing kind explicitly, as you suggested.

Does it look better?

gcc/ChangeLog:

	* doc/tm.texi: Regenerated.
	* target.def (init_cost): Add new parameter costing_for_scalar.
	* targhooks.c (default_init_cost): Adjust for new parameter.
	* targhooks.h (default_init_cost): Likewise.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
	(vect_compute_single_scalar_iteration_cost): Likewise.
	(vect_analyze_loop_2): Likewise.
	* tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
	(vect_bb_vectorization_profitable_p): Likewise.
	* tree-vectorizer.h (init_cost): Likewise.
	* config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
	* config/i386/i386.c (ix86_init_cost): Likewise.
	* config/rs6000/rs6000.c (rs6000_init_cost): Likewise.  
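
For illustration, a minimal sketch (mine, not part of this patch; the
costing_for_scalar field and the guard are assumptions) of how a
target like rs6000 could record and use the new flag:

  /* Record the flag at init time; costing_for_scalar is an assumed
     new field of rs6000_cost_data, not added by this patch.  */
  static void *
  rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
  {
    rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
    data->loop_info = loop_info;
    data->costing_for_scalar = costing_for_scalar;
    data->cost[vect_prologue] = 0;
    data->cost[vect_body] = 0;
    data->cost[vect_epilogue] = 0;
    return data;
  }

  /* ... so finish_cost can skip the density test when costing the
     scalar version.  */
  static void
  rs6000_finish_cost (void *data, unsigned *prologue_cost,
                      unsigned *body_cost, unsigned *epilogue_cost)
  {
    rs6000_cost_data *cost_data = (rs6000_cost_data *) data;

    if (cost_data->loop_info && !cost_data->costing_for_scalar)
      rs6000_density_test (cost_data);

    *prologue_cost = cost_data->cost[vect_prologue];
    *body_cost = cost_data->cost[vect_body];
    *epilogue_cost = cost_data->cost[vect_epilogue];
  }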

> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> so not sure whether the context really matters.  That said,
> the density test looks "interesting" ... the intent was that finish_cost
> might look at gathered data from add_stmt, not that it looks at
> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> scalar_stmt entries in vect_body and using that for this metric?
> 

Good to know the intention behind finish_cost, thanks!

I'm afraid that the check on vector_stmt and scalar_stmt entries
from add_stmt_cost doesn't work for the density test here.  The
density test focuses on the vector version itself; there are some
stmts whose relevance is marked as vect_unused_in_scope, and IIUC
they won't be passed down when costing either version.  But the
existing density check would like to know the cost of the
non-vectorized part.  The current implementation does:

  vec_cost = data->cost[vect_body];

  if (!STMT_VINFO_RELEVANT_P (stmt_info)
      && !STMT_VINFO_IN_PATTERN_P (stmt_info))
    not_vec_cost++;

  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

It takes those irrelevant stmts into account; with both the cost
of the non-vectorized part (not_vec_cost) and that of the
vectorized part (cost[vect_body]), it can calculate the vectorized
code density ratio.
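
For example (made-up numbers): with vec_cost = 40 and not_vec_cost = 10,
density_pct = (40 * 100) / (40 + 10) = 80.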

BR,
Kewen
gcc/config/aarch64/aarch64.c | 2 +-
 gcc/config/i386/i386.c       | 2 +-
 gcc/config/rs6000/rs6000.c   | 2 +-
 gcc/doc/tm.texi              | 4 ++--
 gcc/target.def               | 6 ++++--
 gcc/targhooks.c              | 3 ++-
 gcc/targhooks.h              | 2 +-
 gcc/tree-vect-loop.c         | 6 +++---
 gcc/tree-vect-slp.c          | 8 +++++---
 gcc/tree-vectorizer.h        | 4 ++--
 10 files changed, 22 insertions(+), 17 deletions(-)

Comments

Richard Biener May 10, 2021, 1:55 p.m. UTC | #1
On Sat, May 8, 2021 at 10:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> Thanks for the comments!
>
> on 2021/5/7 5:43 PM, Richard Biener wrote:
> > On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> When I was investigating density_test heuristics, I noticed that
> >> the current rs6000_density_test could be used for single scalar
> >> iteration cost calculation, through the call trace:
> >>   vect_compute_single_scalar_iteration_cost
> >>     -> rs6000_finish_cost
> >>          -> rs6000_density_test
> >>
> >> It looks unexpected given its descriptive function comments, and
> >> Bill helped to confirm this needs to be fixed (thanks!).
> >>
> >> So this patch checks the passed data: if it's the same as the one
> >> in loop_vinfo, it indicates we're costing the vector version;
> >> otherwise, just return early.
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> >>
> >> Nothing remarkable was observed with SPEC2017 Power9 full run.
> >>
> >> Is it ok for trunk?
> >
> > +  /* Only care about cost of vector version, so exclude scalar
> > version here.  */
> > +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
> > +    return;
> >
> > Hmm, looks like a quite "random" test to me.  What about adding a
> > parameter to finish_cost () (or init_cost?) indicating the cost kind?
> >
>
> I originally wanted to change the hook interface, but noticed that
> the finish_cost in function vect_estimate_min_profitable_iters is
> the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> that looked enough to differentiate the scalar version costing from
> the vector version costing for a loop.  Do you mean this observation/
> assumption is easy to break sometime later?

Yes, this field is likely to become stale.

>
> The attached patch adds one new parameter to indicate the
> costing kind explicitly, as you suggested.
>
> Does it look better?
>
> gcc/ChangeLog:
>
>         * doc/tm.texi: Regenerated.
>         * target.def (init_cost): Add new parameter costing_for_scalar.
>         * targhooks.c (default_init_cost): Adjust for new parameter.
>         * targhooks.h (default_init_cost): Likewise.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>         (vect_compute_single_scalar_iteration_cost): Likewise.
>         (vect_analyze_loop_2): Likewise.
>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>         (vect_bb_vectorization_profitable_p): Likewise.
>         * tree-vectorizer.h (init_cost): Likewise.
>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>         * config/i386/i386.c (ix86_init_cost): Likewise.
>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
>
> > OTOH we already pass scalar_stmt to individual add_stmt_cost,
> > so not sure whether the context really matters.  That said,
> > the density test looks "interesting" ... the intent was that finish_cost
> > might look at gathered data from add_stmt, not that it looks at
> > the GIMPLE IL ... so why are you not counting vector_stmt vs.
> > scalar_stmt entries in vect_body and using that for this metric?
> >
>
> Good to know the intention behind finish_cost, thanks!
>
> I'm afraid that the check on vector_stmt and scalar_stmt entries
> from add_stmt_cost doesn't work for the density test here.  The
> density test focuses on the vector version itself; there are some
> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> they won't be passed down when costing either version.  But the
> existing density check would like to know the cost of the
> non-vectorized part.  The current implementation does:
>
>   vec_cost = data->cost[vect_body];
>
>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>     not_vec_cost++;
>
>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>
> It takes those irrelevant stmts into account; with both the cost
> of the non-vectorized part (not_vec_cost) and that of the
> vectorized part (cost[vect_body]), it can calculate the vectorized
> code density ratio.

Yes, but then what "relevant" stmts are actually needed and what
not is missed by your heuristics.  It's really some GIGO one
I fear - each vectorized data reference will add a pointer IV
(eventually commoned by IVOPTs later) and pointer value updates
that are not accounted for in costing (the IVs and updates in the
scalar code are marked as not relevant).  Are those the stmts
this heuristic wants to look at?

The patch looks OK btw.

Thanks,
Richard.

>
> BR,
> Kewen
Richard Sandiford May 10, 2021, 2:08 p.m. UTC | #2
"Kewen.Lin via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> on 2021/5/7 5:43 PM, Richard Biener wrote:
>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> When I was investigating density_test heuristics, I noticed that
>>> the current rs6000_density_test could be used for single scalar
>>> iteration cost calculation, through the call trace:
>>>   vect_compute_single_scalar_iteration_cost
>>>     -> rs6000_finish_cost
>>>          -> rs6000_density_test
>>>
>>> It looks unexpected given its descriptive function comments, and
>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>
>>> So this patch checks the passed data: if it's the same as the one
>>> in loop_vinfo, it indicates we're costing the vector version;
>>> otherwise, just return early.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>
>>> Nothing remarkable was observed with SPEC2017 Power9 full run.
>>>
>>> Is it ok for trunk?
>> 
>> +  /* Only care about cost of vector version, so exclude scalar
>> version here.  */
>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>> +    return;
>> 
>> Hmm, looks like a quite "random" test to me.  What about adding a
>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>> 
>
> I originally wanted to change the hook interface, but noticed that
> the finish_cost in function vect_estimate_min_profitable_iters is
> the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> that looked enough to differentiate the scalar version costing from
> the vector version costing for a loop.  Do you mean this observation/
> assumption is easy to break sometime later?
>
> The attached patch adds one new parameter to indicate the
> costing kind explicitly, as you suggested.
>
> Does it look better?
>
> gcc/ChangeLog:
>
> 	* doc/tm.texi: Regenerated.
> 	* target.def (init_cost): Add new parameter costing_for_scalar.
> 	* targhooks.c (default_init_cost): Adjust for new parameter.
> 	* targhooks.h (default_init_cost): Likewise.
> 	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
> 	(vect_compute_single_scalar_iteration_cost): Likewise.
> 	(vect_analyze_loop_2): Likewise.
> 	* tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
> 	(vect_bb_vectorization_profitable_p): Likewise.
> 	* tree-vectorizer.h (init_cost): Likewise.
> 	* config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
> 	* config/i386/i386.c (ix86_init_cost): Likewise.
> 	* config/rs6000/rs6000.c (rs6000_init_cost): Likewise.  

Just wanted to say thanks for doing this.  I hit the same problem
when doing the Neoverse V1 tuning near the end of stage 4.  Due to
the extreme lateness of the changes, I couldn't reasonably ask for
target-independent help at that time, but this patch will make
things simpler for AArch64. :-)

Richard
Kewen.Lin May 11, 2021, 7:10 a.m. UTC | #3
Hi Richi,

on 2021/5/10 9:55 PM, Richard Biener wrote:
> On Sat, May 8, 2021 at 10:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/5/7 5:43 PM, Richard Biener wrote:
>>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I was investigating density_test heuristics, I noticed that
>>>> the current rs6000_density_test could be used for single scalar
>>>> iteration cost calculation, through the call trace:
>>>>   vect_compute_single_scalar_iteration_cost
>>>>     -> rs6000_finish_cost
>>>>          -> rs6000_density_test
>>>>
>>>> It looks unexpected given its descriptive function comments, and
>>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>>
>>>> So this patch checks the passed data: if it's the same as the one
>>>> in loop_vinfo, it indicates we're costing the vector version;
>>>> otherwise, just return early.
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>>
>>>> Nothing remarkable was observed with SPEC2017 Power9 full run.
>>>>
>>>> Is it ok for trunk?
>>>
>>> +  /* Only care about cost of vector version, so exclude scalar
>>> version here.  */
>>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>>> +    return;
>>>
>>> Hmm, looks like a quite "random" test to me.  What about adding a
>>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>>
>>
>> I originally wanted to change the hook interface, but noticed that
>> the finish_cost in function vect_estimate_min_profitable_iters is
>> the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>> that looked enough to differentiate the scalar version costing from
>> the vector version costing for a loop.  Do you mean this observation/
>> assumption is easy to break sometime later?
> 
> Yes, this field is likely to become stale.
> 
>>
>> The attached patch adds one new parameter to indicate the
>> costing kind explicitly, as you suggested.
>>
>> Does it look better?
>>
>> gcc/ChangeLog:
>>
>>         * doc/tm.texi: Regenerated.
>>         * target.def (init_cost): Add new parameter costing_for_scalar.
>>         * targhooks.c (default_init_cost): Adjust for new parameter.
>>         * targhooks.h (default_init_cost): Likewise.
>>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>>         (vect_compute_single_scalar_iteration_cost): Likewise.
>>         (vect_analyze_loop_2): Likewise.
>>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>>         (vect_bb_vectorization_profitable_p): Likewise.
>>         * tree-vectorizer.h (init_cost): Likewise.
>>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>>         * config/i386/i386.c (ix86_init_cost): Likewise.
>>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
>>
>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
>>> so not sure whether the context really matters.  That said,
>>> the density test looks "interesting" ... the intent was that finish_cost
>>> might look at gathered data from add_stmt, not that it looks at
>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
>>> scalar_stmt entries in vect_body and using that for this metric?
>>>
>>
>> Good to know the intention behind finish_cost, thanks!
>>
>> I'm afraid that the check on vector_stmt and scalar_stmt entries
>> from add_stmt_cost doesn't work for the density test here.  The
>> density test focuses on the vector version itself; there are some
>> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
>> they won't be passed down when costing either version.  But the
>> existing density check would like to know the cost of the
>> non-vectorized part.  The current implementation does:
>>
>>   vec_cost = data->cost[vect_body];
>>
>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>>     not_vec_cost++;
>>
>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>>
>> It takes those irrelevant stmts into account; with both the cost
>> of the non-vectorized part (not_vec_cost) and that of the
>> vectorized part (cost[vect_body]), it can calculate the vectorized
>> code density ratio.
> 
> Yes, but then what "relevant" stmts are actually needed and what
> not is missed by your heuristics.  It's really some GIGO one
> I fear - each vectorized data reference will add a pointer IV
> (eventually commoned by IVOPTs later) and pointer value updates
> that are not accounted for in costing (the IVs and updates in the
> scalar code are marked as not relevant).  Are those the stmts
> this heuristic wants to look at?

Yes, the IVs and updates (even the comparison for exit) are what
the heuristic tries to count.  In most cases, the non-vectorized
part of the loop consists of IV updates.  And it's true that the
collected not_vec_cost may not be accurate, but it seems hard
to predict the cost exactly here?

Assuming this not_vec_cost is overpriced, it could result
in a lower density ratio than what it should be.  Also assuming
the density threshold is relatively conservative, in this case
if the ratio still exceeds the density threshold, we can say the
loop is really dense.  It could fail to catch some "dense" loops,
but I hope it won't take "non-dense" loops as "dense" unexpectedly.
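
For example (made-up numbers): with vec_cost = 40, a true not_vec_cost
of 10 gives density_pct = (40 * 100) / 50 = 80, while overpricing
not_vec_cost to 25 lowers it to (40 * 100) / 65 = 61; so overpricing
can only make the check more conservative.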


> 
> The patch looks OK btw.
> 

Thanks for the review!

The patch was bootstrapped/reg-tested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.  Committed in r12-704.


BR,
Kewen
Kewen.Lin May 11, 2021, 7:18 a.m. UTC | #4
Hi Richard,

on 2021/5/10 10:08 PM, Richard Sandiford wrote:
> "Kewen.Lin via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
>> on 2021/5/7 5:43 PM, Richard Biener wrote:
>>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I was investigating density_test heuristics, I noticed that
>>>> the current rs6000_density_test could be used for single scalar
>>>> iteration cost calculation, through the call trace:
>>>>   vect_compute_single_scalar_iteration_cost
>>>>     -> rs6000_finish_cost
>>>>          -> rs6000_density_test
>>>>
>>>> It looks unexpected given its descriptive function comments, and
>>>> Bill helped to confirm this needs to be fixed (thanks!).
>>>>
>>>> So this patch checks the passed data: if it's the same as the one
>>>> in loop_vinfo, it indicates we're costing the vector version;
>>>> otherwise, just return early.
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>>
>>>> Nothing remarkable was observed with SPEC2017 Power9 full run.
>>>>
>>>> Is it ok for trunk?
>>>
>>> +  /* Only care about cost of vector version, so exclude scalar
>>> version here.  */
>>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
>>> +    return;
>>>
>>> Hmm, looks like a quite "random" test to me.  What about adding a
>>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
>>>
>>
>> I originally wanted to change the hook interface, but noticed that
>> the finish_cost in function vect_estimate_min_profitable_iters is
>> the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>> that looked enough to differentiate the scalar version costing from
>> the vector version costing for a loop.  Do you mean this observation/
>> assumption is easy to break sometime later?
>>
>> The attached patch adds one new parameter to indicate the
>> costing kind explicitly, as you suggested.
>>
>> Does it look better?
>>
>> gcc/ChangeLog:
>>
>> 	* doc/tm.texi: Regenerated.
>> 	* target.def (init_cost): Add new parameter costing_for_scalar.
>> 	* targhooks.c (default_init_cost): Adjust for new parameter.
>> 	* targhooks.h (default_init_cost): Likewise.
>> 	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
>> 	(vect_compute_single_scalar_iteration_cost): Likewise.
>> 	(vect_analyze_loop_2): Likewise.
>> 	* tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
>> 	(vect_bb_vectorization_profitable_p): Likewise.
>> 	* tree-vectorizer.h (init_cost): Likewise.
>> 	* config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
>> 	* config/i386/i386.c (ix86_init_cost): Likewise.
>> 	* config/rs6000/rs6000.c (rs6000_init_cost): Likewise.  
> 
> Just wanted to say thanks for doing this.  I hit the same problem
> when doing the Neoverse V1 tuning near the end of stage 4.  Due to
> the extreme lateness of the changes, I couldn't reasonably ask for
> target-independent help at that time, but this patch will make
> things simpler for AArch64. :-)
> 


Glad to see that the rs6000 port isn't the only one requesting this.  :-)
Thanks for the information!

BR,
Kewen
Richard Biener May 11, 2021, 7:54 a.m. UTC | #5
On Tue, May 11, 2021 at 9:10 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> on 2021/5/10 9:55 PM, Richard Biener wrote:
> > On Sat, May 8, 2021 at 10:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Richi,
> >>
> >> Thanks for the comments!
> >>
> >> on 2021/5/7 5:43 PM, Richard Biener wrote:
> >>> On Fri, May 7, 2021 at 5:30 AM Kewen.Lin via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> When I was investigating density_test heuristics, I noticed that
> >>>> the current rs6000_density_test could be used for single scalar
> >>>> iteration cost calculation, through the call trace:
> >>>>   vect_compute_single_scalar_iteration_cost
> >>>>     -> rs6000_finish_cost
> >>>>          -> rs6000_density_test
> >>>>
> >>>> It looks unexpected given its descriptive function comments, and
> >>>> Bill helped to confirm this needs to be fixed (thanks!).
> >>>>
> >>>> So this patch checks the passed data: if it's the same as the one
> >>>> in loop_vinfo, it indicates we're costing the vector version;
> >>>> otherwise, just return early.
> >>>>
> >>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> >>>>
> >>>> Nothing remarkable was observed with SPEC2017 Power9 full run.
> >>>>
> >>>> Is it ok for trunk?
> >>>
> >>> +  /* Only care about cost of vector version, so exclude scalar
> >>> version here.  */
> >>> +  if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
> >>> +    return;
> >>>
> >>> Hmm, looks like a quite "random" test to me.  What about adding a
> >>> parameter to finish_cost () (or init_cost?) indicating the cost kind?
> >>>
> >>
> >> I originally wanted to change the hook interface, but noticed that
> >> the finish_cost in function vect_estimate_min_profitable_iters is
> >> the only invocation taking LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> >> that looked enough to differentiate the scalar version costing from
> >> the vector version costing for a loop.  Do you mean this observation/
> >> assumption is easy to break sometime later?
> >
> > Yes, this field is likely to become stale.
> >
> >>
> >> The attached patch adds one new parameter to indicate the
> >> costing kind explicitly, as you suggested.
> >>
> >> Does it look better?
> >>
> >> gcc/ChangeLog:
> >>
> >>         * doc/tm.texi: Regenerated.
> >>         * target.def (init_cost): Add new parameter costing_for_scalar.
> >>         * targhooks.c (default_init_cost): Adjust for new parameter.
> >>         * targhooks.h (default_init_cost): Likewise.
> >>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
> >>         (vect_compute_single_scalar_iteration_cost): Likewise.
> >>         (vect_analyze_loop_2): Likewise.
> >>         * tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
> >>         (vect_bb_vectorization_profitable_p): Likewise.
> >>         * tree-vectorizer.h (init_cost): Likewise.
> >>         * config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
> >>         * config/i386/i386.c (ix86_init_cost): Likewise.
> >>         * config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
> >>
> >>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> >>> so not sure whether the context really matters.  That said,
> >>> the density test looks "interesting" ... the intent was that finish_cost
> >>> might look at gathered data from add_stmt, not that it looks at
> >>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> >>> scalar_stmt entries in vect_body and using that for this metric?
> >>>
> >>
> >> Good to know the intention behind finish_cost, thanks!
> >>
> >> I'm afraid that the check on vector_stmt and scalar_stmt entries
> >> from add_stmt_cost doesn't work for the density test here.  The
> >> density test focuses on the vector version itself; there are some
> >> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> >> they won't be passed down when costing either version.  But the
> >> existing density check would like to know the cost of the
> >> non-vectorized part.  The current implementation does:
> >>
> >>   vec_cost = data->cost[vect_body];
> >>
> >>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
> >>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
> >>     not_vec_cost++;
> >>
> >>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> >>
> >> It takes those irrelevant stmts into account; with both the cost
> >> of the non-vectorized part (not_vec_cost) and that of the
> >> vectorized part (cost[vect_body]), it can calculate the vectorized
> >> code density ratio.
> >
> > Yes, but then what "relevant" stmts are actually needed and what
> > not is missed by your heuristics.  It's really some GIGO one
> > I fear - each vectorized data reference will add a pointer IV
> > (eventually commoned by IVOPTs later) and pointer value updates
> > that are not accounted for in costing (the IVs and updates in the
> > scalar code are marked as not relevant).  Are those the stmts
> > this heuristic wants to look at?
>
> Yes, the IVs and updates (even the comparison for exit) are what
> the heuristic tries to count.  In most cases, the non-vectorized
> part of the loop consists of IV updates.  And it's true that the
> collected not_vec_cost may not be accurate, but it seems hard
> to predict the cost exactly here?
>
> Assuming this not_vec_cost is overpriced, it could result
> in a lower density ratio than what it should be.  Also assuming
> the density threshold is relatively conservative, in this case
> if the ratio still exceeds the density threshold, we can say the
> loop is really dense.  It could fail to catch some "dense" loops,
> but I hope it won't take "non-dense" loops as "dense" unexpectedly.

So we could in principle include IVs and updates in the costing but
then the vectorizer isn't absolutely careful for doing scalar cleanups
and instead expects IVOPTs to create canonical IVs.  Note for
the scalar part those stmts are not costed either, we'd have to
change that as well.  What this would mean is that for a scalar
loop accessing a[i] and b[i] we'd have one original IV + update
and the vectorizer generates two pointer IVs + updates.
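
To illustrate with a rough C sketch (mine, using GNU vector extensions,
ignoring alignment and tail iterations):

  typedef float v4sf __attribute__ ((vector_size (16)));

  /* Scalar shape: a single counter IV 'i' serves both data refs.  */
  void
  foo (float *a, float *b, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] = b[i] + 1.0f;
  }

  /* Vectorized shape: one pointer IV + update per data reference,
     roughly as the vectorizer emits them (IVOPTs may common them
     later).  */
  void
  foo_vec (float *a, float *b, int n)
  {
    v4sf *pa = (v4sf *) a, *pb = (v4sf *) b;
    for (int i = 0; i < n; i += 4, pa++, pb++)
      *pa = *pb + (v4sf) { 1.0f, 1.0f, 1.0f, 1.0f };
  }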

But in the end the vector code shouldn't end up worse than the
scalar code with respect to IVs - the cases where it would should
be already costed.  So I wonder if you have specific examples
where things go worse enough for the heuristic to trigger?

Thanks,
Richard.

>
> >
> > The patch looks OK btw.
> >
>
> Thanks for the review!
>
> The patch was bootstrapped/reg-tested on powerpc64le-linux-gnu P9,
> x86_64-redhat-linux and aarch64-linux-gnu.  Committed in r12-704.
>
>
> BR,
> Kewen
Kewen.Lin May 11, 2021, 10:50 a.m. UTC | #6
Hi Richi,

>>>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
>>>>> so not sure whether the context really matters.  That said,
>>>>> the density test looks "interesting" ... the intent was that finish_cost
>>>>> might look at gathered data from add_stmt, not that it looks at
>>>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
>>>>> scalar_stmt entries in vect_body and using that for this metric?
>>>>>
>>>>
>>>> Good to know the intention behind finish_cost, thanks!
>>>>
>>>> I'm afraid that the check on vector_stmt and scalar_stmt entries
>>>> from add_stmt_cost doesn't work for the density test here.  The
>>>> density test focuses on the vector version itself; there are some
>>>> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
>>>> they won't be passed down when costing either version.  But the
>>>> existing density check would like to know the cost of the
>>>> non-vectorized part.  The current implementation does:
>>>>
>>>>   vec_cost = data->cost[vect_body];
>>>>
>>>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
>>>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
>>>>     not_vec_cost++;
>>>>
>>>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
>>>>
>>>> It takes those irrelevant stmts into account; with both the cost
>>>> of the non-vectorized part (not_vec_cost) and that of the
>>>> vectorized part (cost[vect_body]), it can calculate the vectorized
>>>> code density ratio.
>>>
>>> Yes, but then what "relevant" stmts are actually needed and what
>>> not is missed by your heuristics.  It's really some GIGO one
>>> I fear - each vectorized data reference will add a pointer IV
>>> (eventually commoned by IVOPTs later) and pointer value updates
>>> that are not accounted for in costing (the IVs and updates in the
>>> scalar code are marked as not relevant).  Are those the stmts
>>> this heuristic wants to look at?
>>
>> Yes, the IVs and updates (even the comparison for exit) are what
>> the heuristic tries to count.  In most cases, the non-vectorized
>> part of the loop consists of IV updates.  And it's true that the
>> collected not_vec_cost may not be accurate, but it seems hard
>> to predict the cost exactly here?
>>
>> Assuming this not_vec_cost is overpriced, it could result
>> in a lower density ratio than what it should be.  Also assuming
>> the density threshold is relatively conservative, in this case
>> if the ratio still exceeds the density threshold, we can say the
>> loop is really dense.  It could fail to catch some "dense" loops,
>> but I hope it won't take "non-dense" loops as "dense" unexpectedly.
> 
> So we could in principle include IVs and updates in the costing but
> then the vectorizer isn't absolutely careful for doing scalar cleanups
> and instead expects IVOPTs to create canonical IVs.  Note for
> the scalar part those stmts are not costed either, we'd have to
> change that as well.  What this would mean is that for a scalar
> loop accessing a[i] and b[i] we'd have one original IV + update
> and the vectorizer generates two pointer IVs + updates.
> 


I broke down my understanding a bit below to ensure it's correct.

  - We can pass down those "irrelevant" stmts into add_stmt_cost
    for both scalar and vector versions, then targets can check
    stmts accordingly instead of scanning the IL by themselves.
    For the scalar version, these are mainly the original IV + update
    + some address ref calculation; while for the vector version,
    these are mainly pointer IVs + updates.

  - What's the cost assigned for these "irrelevant" stmts?
    The comments seem to imply we want to cost them.  If so,
    I am worried that it can break some current costing
    heuristics which don't consider these costs.  Besides,
    these "irrelevant" stmts can be optimized later; if we
    consider them somewhere like calculating the profitable min
    iters, it could result in worse code?
    Can we pass them down but cost them as free?

> But in the end the vector code shouldn't end up worse than the
> scalar code with respect to IVs - the cases where it would should
> be already costed.  So I wonder if you have specific examples
> where things go worse enough for the heuristic to trigger?
> 

One typical case that I worked on to reuse this density check is the
function mat_times_vec of src file block_solver.fppized.f of SPEC2017
503.bwaves_r; the density with the existing heuristic is 83 (which,
unluckily, doesn't exceed the threshold).  The interesting loop is the
innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
-ftree-vectorize".  We have verified that this loop isn't profitable
to vectorize at O2 (without loop-interchange).
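
For reference, a C analogue of the loop shape (my rough approximation,
not the Fortran source): nb is 5, so the inner loop runs only 5
iterations, and the accesses to 'a' are strided:

  enum { NB = 5 };

  void
  mat_times_vec_like (double y[NB], double a[NB][NB], double x[NB])
  {
    for (int l = 0; l < NB; l++)
      {
        double s = 0.0;
        for (int m = 0; m < NB; m++)
          s += a[m][l] * x[m];  /* column access: stride of NB doubles */
        y[l] = s;
      }
  }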

Another function, shell, which also comes from 503.bwaves_r src file
shell_lam.fppized.f, does hit this threshold; the loop is the one
starting at line 228.

BR,
Kewen
Richard Biener May 11, 2021, 11:59 a.m. UTC | #7
On Tue, May 11, 2021 at 12:50 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> >>>>> OTOH we already pass scalar_stmt to individual add_stmt_cost,
> >>>>> so not sure whether the context really matters.  That said,
> >>>>> the density test looks "interesting" ... the intent was that finish_cost
> >>>>> might look at gathered data from add_stmt, not that it looks at
> >>>>> the GIMPLE IL ... so why are you not counting vector_stmt vs.
> >>>>> scalar_stmt entries in vect_body and using that for this metric?
> >>>>>
> >>>>
> >>>> Good to know the intention behind finish_cost, thanks!
> >>>>
> >>>> I'm afraid that the check on vector_stmt and scalar_stmt entries
> >>>> from add_stmt_cost doesn't work for the density test here.  The
> >>>> density test focuses on the vector version itself; there are some
> >>>> stmts whose relevance is marked as vect_unused_in_scope, and IIUC
> >>>> they won't be passed down when costing either version.  But the
> >>>> existing density check would like to know the cost of the
> >>>> non-vectorized part.  The current implementation does:
> >>>>
> >>>>   vec_cost = data->cost[vect_body];
> >>>>
> >>>>   if (!STMT_VINFO_RELEVANT_P (stmt_info)
> >>>>       && !STMT_VINFO_IN_PATTERN_P (stmt_info))
> >>>>     not_vec_cost++;
> >>>>
> >>>>   density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> >>>>
> >>>> It takes those irrelevant stmts into account; with both the cost
> >>>> of the non-vectorized part (not_vec_cost) and that of the
> >>>> vectorized part (cost[vect_body]), it can calculate the vectorized
> >>>> code density ratio.
> >>>
> >>> Yes, but then what "relevant" stmts are actually needed and what
> >>> not is missed by your heuristics.  It's really some GIGO one
> >>> I fear - each vectorized data reference will add a pointer IV
> >>> (eventually commoned by IVOPTs later) and pointer value updates
> >>> that are not accounted for in costing (the IVs and updates in the
> >>> scalar code are marked as not relevant).  Are those the stmts
> >>> this heuristic wants to look at?
> >>
> >> Yes, the IVs and updates (even the comparison for exit) are what
> >> the heuristic tries to count.  In most cases, the non-vectorized
> >> part of the loop consists of IV updates.  And it's true that the
> >> collected not_vec_cost may not be accurate, but it seems hard
> >> to predict the cost exactly here?
> >>
> >> Assuming this not_vec_cost is overpriced, it could result
> >> in a lower density ratio than what it should be.  Also assuming
> >> the density threshold is relatively conservative, in this case
> >> if the ratio still exceeds the density threshold, we can say the
> >> loop is really dense.  It could fail to catch some "dense" loops,
> >> but I hope it won't take "non-dense" loops as "dense" unexpectedly.
> >
> > So we could in principle include IVs and updates in the costing but
> > then the vectorizer isn't absolutely careful for doing scalar cleanups
> > and instead expects IVOPTs to create canonical IVs.  Note for
> > the scalar part those stmts are not costed either, we'd have to
> > change that as well.  What this would mean is that for a scalar
> > loop accessing a[i] and b[i] we'd have one original IV + update
> > and the vectorizer generates two pointer IVs + updates.
> >
>
>
> I broke down my understanding a bit below to ensure it's correct.
>
>   - We can pass down those "irrelevant" stmts into add_stmt_cost
>     for both scalar and vector versions, then targets can check
>     stmts accordingly instead of scanning the IL by themselves.
>     For the scalar version, these are mainly the original IV + update
>     + some address ref calculation; while for the vector version,
>     these are mainly pointer IVs + updates.
>
>   - What's the cost assigned for these "irrelevant" stmts?
>     The comments seem to imply we want to cost them.  If so,
>     I am worried that it can break some current costing
>     heuristics which don't consider these costs.  Besides,
>     these "irrelevant" stmts can be optimized later; if we
>     consider them somewhere like calculating the profitable min
>     iters, it could result in worse code?
>     Can we pass them down but cost them as free?
>
> > But in the end the vector code shouldn't end up worse than the
> > scalar code with respect to IVs - the cases where it would should
> > be already costed.  So I wonder if you have specific examples
> > where things go worse enough for the heuristic to trigger?
> >
>
> One typical case that I worked on to reuse this density check is the
> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
> 503.bwaves_r; the density with the existing heuristic is 83 (which,
> unluckily, doesn't exceed the threshold).  The interesting loop is the
> innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
> -ftree-vectorize".  We have verified that this loop isn't profitable
> to vectorize at O2 (without loop-interchange).

Yeah, but that's because the loop only runs 5 iterations, not because
of some "density" (which suggests AGU overloading or some such)?
Because if you modify it so it iterates more, then with the
"density" measurement kept constant, you suddenly become profitable?

The loop does have quite many memory streams so optimizing
the (few) arithmetic ops by vectorizign them might not be worth
the trouble, esp. since most of the loads are "strided" (composed
from scalars) when no interchange is performed.  So it's probably
more a "density" of # memory streams vs. # arithmetic ops, and
esp. with any non-consecutive vector loads this balance being
worse in the vector case?

The x86 add_stmt_cost has

  /* If we do elementwise loads into a vector then we are bound by
     latency and execution resources for the many scalar loads
     (AGU and load ports).  Try to account for this by scaling the
     construction cost by the number of elements involved.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
    {
      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
    }

so it penalizes VMAT_ELEMENTWISE for variable step for both loads and stores.
The above materialized over PRs 84037, 85491 and 87561, so not specifically
for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
as well.
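
(As a concrete reading of the scaling: for a 4-element vector type the
construction cost gets multiplied by TYPE_VECTOR_SUBPARTS + 1 = 5.)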

Oh, and yes - the info the vectorizer presents the target with for
vector loads/stores
leaves a lot to be desired ...

Richard.

> Another function, shell, which also comes from 503.bwaves_r src file
> shell_lam.fppized.f, does hit this threshold; the loop is the one
> starting at line 228.
>
> BR,
> Kewen
Kewen.Lin May 13, 2021, 7:04 a.m. UTC | #8
Hi!

>>> But in the end the vector code shouldn't end up worse than the
>>> scalar code with respect to IVs - the cases where it would should
>>> be already costed.  So I wonder if you have specific examples
>>> where things go worse enough for the heuristic to trigger?
>>>
>>
>> One typical case that I worked on to reuse this density check is the
>> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
>> 503.bwaves_r; the density with the existing heuristic is 83 (which,
>> unluckily, doesn't exceed the threshold).  The interesting loop is the
>> innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
>> -ftree-vectorize".  We have verified that this loop isn't profitable
>> to vectorize at O2 (without loop-interchange).
> 
> Yeah, but that's because the loop only runs 5 iterations, not because
> of some "density" (which suggests AGU overloading or some such)?
> Because if you modify it so it iterates more, then with the
> "density" measurement kept constant, you suddenly become profitable?
> 

Yes, I agree this isn't a perfect one showing how the density check
matters, though it led me to find this check.  I tried to run the SPEC2017
bmks w/ and w/o this density heuristic to catch some "expected" case,
but unluckily failed to.  It may be worth trying with some more
option sets or even testing with the previous SPECs later.

I hacked the innermost loop iteration count from 5 to 20, but the
baseline run didn't stop (after more than 7 hrs I killed it); it was
suspected to become endless because of some garbage (out-of-bound) data.

But the current cost modeling for this loop on Power is still bad: the
min profitable iterations (both static and runtime) are evaluated as 2,
while in reality 5 iterations, at least, isn't profitable.


> The loop does have quite many memory streams so optimizing
> the (few) arithmetic ops by vectorizign them might not be worth
> the trouble, esp. since most of the loads are "strided" (composed
> from scalars) when no interchange is performed.  So it's probably
> more a "density" of # memory streams vs. # arithmetic ops, and
> esp. with any non-consecutive vector loads this balance being
> worse in the vector case?
> 

Yeah, these many scalar "strided" loads make things worse.  The fed
vector CTORs have to wait until all of their required loads are ready,
and these vector CTORs are required by further multiplications.

I posted one patch[1] on this, which tries to model it with
some counts: nload (total load number), nload_ctor (strided
load number fed into CTORs) and nctor_strided (CTOR number fed
by strided loads).

The penalization is restricted by considering some factors:
  1) vect density ratio: if there are many vector instructions,
     the stalls from loads easily impact the subsequent
     computation.
  2) total load number: if nload is small, it's unlikely to
     bother the load/store units much.
  3) strided loads fed into CTORs pct.: if a high portion of
     strided loads are fed into CTORs, it's very likely to block
     the CTORs and their subsequent chain.

btw, per your previous comments on add_stmt_cost, the load/strided/ctor
statistics should be gathered there instead, like:

  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
    {
      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
          || kind == vector_gather_load)
        data->nload += count;
      if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
        {
          if (kind == scalar_load || kind == unaligned_load)
            data->nload_ctor += count;
          else if (kind == vec_construct)
            data->nctor_strided += count;
        }
    }
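
(Here data is the target cost structure created by init_cost, and the
costing_for_scalar flag added by this series keeps the scalar costing
run from polluting these counters.)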

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html

> The x86 add_stmt_cost has
> 
>   /* If we do elementwise loads into a vector then we are bound by
>      latency and execution resources for the many scalar loads
>      (AGU and load ports).  Try to account for this by scaling the
>      construction cost by the number of elements involved.  */
>   if ((kind == vec_construct || kind == vec_to_scalar)
>       && stmt_info
>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
>     {
>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>     }
> 
> so it penalizes VMAT_ELEMENTWISE for variable step for both loads and stores.
> The above materialized over PRs 84037, 85491 and 87561, so not specifically
> for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
> as well.
> 

Thanks for the pointer!  rs6000 probably can follow this way instead.
IIUC, this cost adjustment is for each individual vec_construct/vec_to_scalar;
is it better to first collect the data in add_stmt_cost
and then consider it from the view of the whole loop?  This individual way
seems easy to overkill: if there are just a few VMAT_ELEMENTWISE loads, the
resource hazard isn't so bad.  And as you mentioned above, bwaves_r mainly
suffers from strided loads; should this tuning be applied for
VMAT_STRIDED_SLP as well?  Is the reason it only considers the variable
step that constant step loads can be grouped on x86 (like: paired load/store
fusion)?

BR,
Kewen

> Oh, and yes - the info the vectorizer presents the target with for
> vector loads/stores
> leaves a lot to be desired ...
> 
> Richard.
>
Richard Biener May 17, 2021, 8:55 a.m. UTC | #9
On Thu, May 13, 2021 at 9:04 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi!
>
> >>> But in the end the vector code shouldn't end up worse than the
> >>> scalar code with respect to IVs - the cases where it would should
> >>> be already costed.  So I wonder if you have specific examples
> >>> where things go worse enough for the heuristic to trigger?
> >>>
> >>
> >> One typical case that I worked on to reuse this density check is the
> >> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
> >> 503.bwaves_r; the density with the existing heuristic is 83 (which,
> >> unluckily, doesn't exceed the threshold).  The interesting loop is the
> >> innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
> >> -ftree-vectorize".  We have verified that this loop isn't profitable
> >> to vectorize at O2 (without loop-interchange).
> >
> > Yeah, but that's because the loop only runs 5 iterations, not because
> > of some "density" (which suggests AGU overloading or some such)?
> > Because if you modify it so it iterates more, then with the
> > "density" measurement kept constant, you suddenly become profitable?
> >
>
> Yes, I agree this isn't a perfect one showing how the density check
> matters, though it led me to find this check.  I tried to run the SPEC2017
> bmks w/ and w/o this density heuristic to catch some "expected" case,
> but unluckily failed to.  It may be worth trying with some more
> option sets or even testing with the previous SPECs later.
>
> I hacked the innermost loop iteration count from 5 to 20, but the
> baseline run didn't stop (after more than 7 hrs I killed it); it was
> suspected to become endless because of some garbage (out-of-bound) data.
>
> But the current cost modeling for this loop on Power is still bad: the
> min profitable iterations (both static and runtime) are evaluated as 2,
> while in reality 5 iterations, at least, isn't profitable.
>
>
> > The loop does have quite many memory streams so optimizing
> > the (few) arithmetic ops by vectorizign them might not be worth
> > the trouble, esp. since most of the loads are "strided" (composed
> > from scalars) when no interchange is performed.  So it's probably
> > more a "density" of # memory streams vs. # arithmetic ops, and
> > esp. with any non-consecutive vector loads this balance being
> > worse in the vector case?
> >
>
> Yeah, these many scalar "strided" loads make things worse.  The fed
> vector CTORs have to wait until all of their required loads are ready,
> and these vector CTORs are required by further multiplications.
>
> I posted one patch[1] on this, which tries to model it with
> some counts: nload (total load number), nload_ctor (strided
> load number fed into CTORs) and nctor_strided (CTOR number fed
> by strided loads).
>
> The penalization is restricted by considering some factors:
>   1) vect density ratio: if there are many vector instructions,
>      the stalls from loads easily impact the subsequent
>      computation.
>   2) total load number: if nload is small, it's unlikely to
>      bother the load/store units much.
>   3) strided loads fed into CTORs pct.: if a high portion of
>      strided loads are fed into CTORs, it's very likely to block
>      the CTORs and their subsequent chain.
>
> btw, per your previous comments on add_stmt_cost, the load/strided/ctor
> statistics should be gathered there instead, like:
>
>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>     {
>       if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>           || kind == vector_gather_load)
>         data->nload += count;
>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>         {
>           if (kind == scalar_load || kind == unaligned_load)
>             data->nload_ctor += count;
>           else if (kind == vec_construct)
>             data->nctor_strided += count;
>         }
>     }
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>
> > The x86 add_stmt_cost has
> >
> >   /* If we do elementwise loads into a vector then we are bound by
> >      latency and execution resources for the many scalar loads
> >      (AGU and load ports).  Try to account for this by scaling the
> >      construction cost by the number of elements involved.  */
> >   if ((kind == vec_construct || kind == vec_to_scalar)
> >       && stmt_info
> >       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> >           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
> >       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> >       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
> >     {
> >       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >     }
> >
> > so it penalizes VMAT_ELEMENTWISE for variable step for both loads and stores.
> > The above materialized over PRs 84037, 85491 and 87561, so not specifically
> > for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
> > as well.
> >
>
> Thanks for the pointer!  rs6000 probably can follow this way instead.
> IIUC, this cost adjustment is for each individual vec_construct/vec_to_scalar;
> is it better to first collect the data in add_stmt_cost
> and then consider it from the view of the whole loop?

It depends - when you look at the whole loop you'd like to see data dependences
which are not obvious for the vector code.  So for the PRs mentioned
the simplest
thing was to do it per-stmt (since x86 doesn't yet have any special code in
finish_cost).

> This individual way
> seems easy to overkill: if there are just a few VMAT_ELEMENTWISE loads, the
> resource hazard isn't so bad.

True, but then the pessimization is cancelled easily by the
vectorization benefit of
other vectorized stmts.

>  And as you mentioned above, bwaves_r mainly
> suffers from strided loads; should this tuning be applied for
> VMAT_STRIDED_SLP as well?  Is the reason it only considers the variable
> step that constant step loads can be grouped on x86 (like: paired load/store
> fusion)?

Possibly a speciality of x86 and its complex addressing modes where those
need extra AGU resources (and those proved the bottlenecks for the PRs
IIRC).  So it's just a heuristic and you shouldn't read too much into
the details
of the implementation ;)  (just realize I do the same for the rs6000 one...)

Richard.

>
> BR,
> Kewen
>
> > Oh, and yes - the info the vectorizer presents the target with for
> > vector loads/stores
> > leaves a lot to be desired ...
> >
> > Richard.
> >
Kewen.Lin May 17, 2021, 9:24 a.m. UTC | #10
on 2021/5/17 4:55 PM, Richard Biener wrote:
> On Thu, May 13, 2021 at 9:04 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi!
>>
>>>>> But in the end the vector code shouldn't end up worse than the
>>>>> scalar code with respect to IVs - the cases where it would should
>>>>> be already costed.  So I wonder if you have specific examples
>>>>> where things go worse enough for the heuristic to trigger?
>>>>>
>>>>
>>>> One typical case that I worked on to reuse this density check is the
>>>> function mat_times_vec of src file block_solver.fppized.f of SPEC2017
>>>> 503.bwaves_r; the density with the existing heuristic is 83 (which,
>>>> unluckily, doesn't exceed the threshold).  The interesting loop is the
>>>> innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
>>>> -ftree-vectorize".  We have verified that this loop isn't profitable
>>>> to vectorize at O2 (without loop-interchange).
>>>
>>> Yeah, but that's because the loop only runs 5 iterations, not because
>>> of some "density" (which suggests AGU overloading or some such)?
>>> Because if you modify it so it iterates more, then with the
>>> "density" measurement kept constant, you suddenly become profitable?
>>>
>>
>> Yes, I agree this isn't a perfect one showing how the density check
>> matters, though it led me to find this check.  I tried to run the SPEC2017
>> bmks w/ and w/o this density heuristic to catch some "expected" case,
>> but unluckily failed to.  It may be worth trying with some more
>> option sets or even testing with the previous SPECs later.
>>
>> I hacked the innermost loop iteration count from 5 to 20, but the
>> baseline run didn't stop (after more than 7 hrs I killed it); it was
>> suspected to become endless because of some garbage (out-of-bound) data.
>>
>> But the current cost modeling for this loop on Power is still bad: the
>> min profitable iterations (both static and runtime) are evaluated as 2,
>> while in reality 5 iterations, at least, isn't profitable.
>>
>>
>>> The loop does have quite many memory streams so optimizing
>>> the (few) arithmetic ops by vectorizign them might not be worth
>>> the trouble, esp. since most of the loads are "strided" (composed
>>> from scalars) when no interchange is performed.  So it's probably
>>> more a "density" of # memory streams vs. # arithmetic ops, and
>>> esp. with any non-consecutive vector loads this balance being
>>> worse in the vector case?
>>>
>>
>> Yeah, these many scalar "strided" loads make things worse.  The fed
>> vector CTORs have to wait until all of their required loads are ready,
>> and these vector CTORs are required by further multiplications.
>>
>> I posted one patch[1] on this, which tries to model it with
>> some counts: nload (total load number), nload_ctor (strided
>> load number fed into CTORs) and nctor_strided (CTOR number fed
>> by strided loads).
>>
>> The penalization is restricted by considering some factors:
>>   1) vect density ratio: if there are many vector instructions,
>>      the stalls from loads easily impact the subsequent
>>      computation.
>>   2) total load number: if nload is small, it's unlikely to
>>      bother the load/store units much.
>>   3) strided loads fed into CTORs pct.: if a high portion of
>>      strided loads are fed into CTORs, it's very likely to block
>>      the CTORs and their subsequent chain.
>>
>> btw, per your previous comments on add_stmt_cost, the load/strided/ctor
>> statistics should be gathered there instead, like:
>>
>>   if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>     {
>>       if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>>           || kind == vector_gather_load)
>>         data->nload += count;
>>       if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
>>         {
>>           if (kind == scalar_load || kind == unaligned_load)
>>             data->nload_ctor += count;
>>           else if (kind == vec_construct)
>>             data->nctor_strided += count;
>>         }
>>     }
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html
>>
>>> The x86 add_stmt_cost has
>>>
>>>   /* If we do elementwise loads into a vector then we are bound by
>>>      latency and execution resources for the many scalar loads
>>>      (AGU and load ports).  Try to account for this by scaling the
>>>      construction cost by the number of elements involved.  */
>>>   if ((kind == vec_construct || kind == vec_to_scalar)
>>>       && stmt_info
>>>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>>>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>>>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
>>>     {
>>>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>>>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>>>     }
>>>
>>> so it penalizes VMAT_ELEMENTWISE for variable step for both loads and stores.
>>> The above materialized over PRs 84037, 85491 and 87561, so not specifically
>>> for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
>>> as well.
>>>
>>
>> Thanks for the pointer!  rs6000 can probably follow this way instead.
>> IIUC, this cost adjustment is applied to each individual
>> vec_construct/vec_to_scalar; would it be better to first collect the
>> data in add_stmt_cost and then consider it from the view of the whole
>> loop?
> 
> It depends - when you look at the whole loop you'd like to see data dependences
> which are not obvious for the vector code.  So for the PRs mentioned
> the simplest
> thing was to do it per-stmt (since x86 doesn't yet have any special code in
> finish_cost).
> 
>> This individual way seems prone to overkill: if there are just a few
>> VMAT_ELEMENTWISE loads, the resource hazard isn't so bad.
> 
> True, but then the pessimization is cancelled easily by the
> vectorization benefit of
> other vectorized stmts.
> 
>>  And as you mentioned above, bwaves_r mainly
>> suffers from strided loads, so it looks like this tuning should be
>> applied to VMAT_STRIDED_SLP as well?  Is the reason it only considers
>> a variable step that constant-step loads can be grouped on x86 (like
>> paired load/store fusion)?
> 
> Possibly a speciality of x86 and its complex addressing modes, where those
> need extra AGU resources (and those proved to be the bottlenecks for the
> PRs IIRC).  So it's just a heuristic and you shouldn't read too much into
> the details of the implementation ;)  (just realized I do the same for
> the rs6000 one...)
> 

Thanks for all the comments above.  I'll do some experiments following
the i386 way, to see if it's fine without considering the density ratio.
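
For reference, below is a rough sketch of what such an i386-style
per-stmt adjustment could look like on the rs6000 side (purely an
illustrative starting point for the experiments, not a tested change;
whether VMAT_STRIDED_SLP should be covered too and what scaling to use
are exactly the things to try):

  /* Sketch: as i386 does for VMAT_ELEMENTWISE, scale the CTOR cost by
     the number of scalar loads feeding it, optionally covering
     VMAT_STRIDED_SLP as well.  */
  if (kind == vec_construct
      && stmt_info
      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
      && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
    stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);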

btw, I guess you meant "do the same for the i386 one"?  ;)

BR,
Kewen

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12625a4bee3..de3c86d85fb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14390,7 +14390,7 @@  struct aarch64_vector_costs
 
 /* Implement TARGET_VECTORIZE_INIT_COST.  */
 void *
-aarch64_init_cost (class loop *)
+aarch64_init_cost (class loop *, bool)
 {
   return new aarch64_vector_costs;
 }
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7c41302c75b..5078d94970a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -22223,7 +22223,7 @@  ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
 /* Implement targetm.vectorize.init_cost.  */
 
 static void *
-ix86_init_cost (class loop *)
+ix86_init_cost (class loop *, bool)
 {
   unsigned *cost = XNEWVEC (unsigned, 3);
   cost[vect_prologue] = cost[vect_body] = cost[vect_epilogue] = 0;
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e2ff00145c5..88f16b909a3 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5292,7 +5292,7 @@  rs6000_density_test (rs6000_cost_data *data)
 /* Implement targetm.vectorize.init_cost.  */
 
 static void *
-rs6000_init_cost (struct loop *loop_info)
+rs6000_init_cost (struct loop *loop_info, bool)
 {
   rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
   data->loop_info = loop_info;
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 823f85ba9ab..8b3af67782e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6111,8 +6111,8 @@  type @code{internal_fn}) should be considered expensive when the mask is
 all zeros.  GCC can then try to branch around the instruction instead.
 @end deftypefn
 
-@deftypefn {Target Hook} {void *} TARGET_VECTORIZE_INIT_COST (class loop *@var{loop_info})
-This hook should initialize target-specific data structures in preparation for modeling the costs of vectorizing a loop or basic block.  The default allocates three unsigned integers for accumulating costs for the prologue, body, and epilogue of the loop or basic block.  If @var{loop_info} is non-NULL, it identifies the loop being vectorized; otherwise a single block is being vectorized.
+@deftypefn {Target Hook} {void *} TARGET_VECTORIZE_INIT_COST (class loop *@var{loop_info}, bool @var{costing_for_scalar})
+This hook should initialize target-specific data structures in preparation for modeling the costs of vectorizing a loop or basic block.  The default allocates three unsigned integers for accumulating costs for the prologue, body, and epilogue of the loop or basic block.  If @var{loop_info} is non-NULL, it identifies the loop being vectorized; otherwise a single block is being vectorized.  If @var{costing_for_scalar} is true, it indicates the current cost model is for the scalar version of a loop or block; otherwise it is for the vector version.
 @end deftypefn
 
 @deftypefn {Target Hook} unsigned TARGET_VECTORIZE_ADD_STMT_COST (class vec_info *@var{}, void *@var{data}, int @var{count}, enum vect_cost_for_stmt @var{kind}, class _stmt_vec_info *@var{stmt_info}, tree @var{vectype}, int @var{misalign}, enum vect_cost_model_location @var{where})
diff --git a/gcc/target.def b/gcc/target.def
index d7b94bd8e5d..9407ed17f2d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2004,9 +2004,11 @@  DEFHOOK
  "allocates three unsigned integers for accumulating costs for the prologue, "
  "body, and epilogue of the loop or basic block.  If @var{loop_info} is "
  "non-NULL, it identifies the loop being vectorized; otherwise a single block "
- "is being vectorized.",
+ "is being vectorized.  If @var{costing_for_scalar} is true, it indicates the "
+ "current cost model is for the scalar version of a loop or block; otherwise "
+ "it is for the vector version.",
  void *,
- (class loop *loop_info),
+ (class loop *loop_info, bool costing_for_scalar),
  default_init_cost)
 
 /* Target function to record N statements of the given kind using the
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index 952fad422eb..2e0fdb797e0 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1373,7 +1373,8 @@  default_empty_mask_is_expensive (unsigned ifn)
    array of three unsigned ints, set it to zero, and return its address.  */
 
 void *
-default_init_cost (class loop *loop_info ATTRIBUTE_UNUSED)
+default_init_cost (class loop *loop_info ATTRIBUTE_UNUSED,
+		   bool costing_for_scalar ATTRIBUTE_UNUSED)
 {
   unsigned *cost = XNEWVEC (unsigned, 3);
   cost[vect_prologue] = cost[vect_body] = cost[vect_epilogue] = 0;
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 9928d064abd..b537038c0aa 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -117,7 +117,7 @@  extern opt_machine_mode default_vectorize_related_mode (machine_mode,
 							poly_uint64);
 extern opt_machine_mode default_get_mask_mode (machine_mode);
 extern bool default_empty_mask_is_expensive (unsigned);
-extern void *default_init_cost (class loop *);
+extern void *default_init_cost (class loop *, bool);
 extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_for_stmt,
 				       class _stmt_vec_info *, tree, int,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2aba503fef7..f10e66a2465 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -813,7 +813,7 @@  bb_in_loop_p (const_basic_block bb, const void *data)
    stmt_vec_info structs for all the stmts in LOOP_IN.  */
 
 _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
-  : vec_info (vec_info::loop, init_cost (loop_in), shared),
+  : vec_info (vec_info::loop, init_cost (loop_in, false), shared),
     loop (loop_in),
     bbs (XCNEWVEC (basic_block, loop->num_nodes)),
     num_itersm1 (NULL_TREE),
@@ -1284,7 +1284,7 @@  vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
     }
 
   /* Now accumulate cost.  */
-  void *target_cost_data = init_cost (loop);
+  void *target_cost_data = init_cost (loop, true);
   stmt_info_for_cost *si;
   int j;
   FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
@@ -2723,7 +2723,7 @@  again:
   /* Reset target cost data.  */
   destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
   LOOP_VINFO_TARGET_COST_DATA (loop_vinfo)
-    = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
+    = init_cost (LOOP_VINFO_LOOP (loop_vinfo), false);
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
   release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1c5b7ae84e2..0ec92b0f0ca 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -3690,7 +3690,9 @@  vect_detect_hybrid_slp (loop_vec_info loop_vinfo)
 /* Initialize a bb_vec_info struct for the statements in BBS basic blocks.  */
 
 _bb_vec_info::_bb_vec_info (vec<basic_block> _bbs, vec_info_shared *shared)
-  : vec_info (vec_info::bb, init_cost (NULL), shared), bbs (_bbs), roots (vNULL)
+  : vec_info (vec_info::bb, init_cost (NULL, false), shared),
+    bbs (_bbs),
+    roots (vNULL)
 {
   for (unsigned i = 0; i < bbs.length (); ++i)
     {
@@ -4530,7 +4532,7 @@  vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
 	  continue;
 	}
 
-      void *scalar_target_cost_data = init_cost (NULL);
+      void *scalar_target_cost_data = init_cost (NULL, true);
       do
 	{
 	  add_stmt_cost (bb_vinfo, scalar_target_cost_data,
@@ -4544,7 +4546,7 @@  vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
       destroy_cost_data (scalar_target_cost_data);
 
       /* Complete the target-specific vector cost calculation.  */
-      void *vect_target_cost_data = init_cost (NULL);
+      void *vect_target_cost_data = init_cost (NULL, false);
       do
 	{
 	  add_stmt_cost (bb_vinfo, vect_target_cost_data,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 9861d9e8810..8d1ffafdbf0 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1455,9 +1455,9 @@  int vect_get_stmt_cost (enum vect_cost_for_stmt type_of_cost)
 /* Alias targetm.vectorize.init_cost.  */
 
 static inline void *
-init_cost (class loop *loop_info)
+init_cost (class loop *loop_info, bool costing_for_scalar)
 {
-  return targetm.vectorize.init_cost (loop_info);
+  return targetm.vectorize.init_cost (loop_info, costing_for_scalar);
 }
 
 extern void dump_stmt_cost (FILE *, void *, int, enum vect_cost_for_stmt,