mbox series

[0/4] IVOPTs consider step cost for different forms when unrolling

Message ID a2a9189d-d114-7d07-3b3e-1a4d13613ef1@linux.ibm.com
Headers show
Series IVOPTs consider step cost for different forms when unrolling | expand

Message

Kewen.Lin May 28, 2020, 12:17 p.m. UTC
Hi,

This is one repost and you can refer to the original series 
via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.

As we discussed in the thread
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
I'm working to teach IVOPTs to consider D-form group access during unrolling.
The difference on D-form and other forms during unrolling is we can put the
stride into displacement field to avoid additional step increment. eg:

With X-form (uf step increment):
  ...
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  ...

With D-form (one step increment for each base):
  ...
  LD A = baseA, OFF
  LD B = baseB, OFF
  ST C = baseC, OFF
  LD A = baseA, OFF+stride
  LD B = baseB, OFF+stride
  ST C = baseC, OFF+stride
  LD A = baseA, OFF+2*stride
  LD B = baseB, OFF+2*stride
  ST C = baseC, OFF+2*stride
  ...
  baseA += stride * uf
  baseB += stride * uf
  baseC += stride * uf

Imagining that if the loop get unrolled by 8 times, then 3 step updates with
D-form vs. 8 step updates with X-form. Here we only need to check stride
meet D-form field requirement, since if OFF doesn't meet, we can construct
baseA' with baseA + OFF.

This patch set consists four parts:
     
  [PATCH 1/4] unroll: Add middle-end unroll factor estimation

     Add unroll factor estimation in middle-end. It mainly refers to current
     RTL unroll factor determination in function decide_unrolling and its
     sub calls.  As Richi suggested, we probably can force unroll factor
     with this and avoid duplicate unroll factor calculation, but I think it
     need more benchmarking work and should be handled separately.

  [PATCH 2/4] param: Introduce one param to control unroll factor 

     As Richard and Segher's suggestion, I used addr_offset_valid_p for the
     addressing mode, rather than one target hook.  As Richard's suggestion,     
     it introduces one parameter to control this IVOPTs consideration and
     further tweaking [3/4] on top of unroll factor estimation [1/4].
     
  [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling

     Teach IVOPTs to mark the IV cand as reg_offset_p which is derived from
     one address IV type group where the whole group is valid to use reg_offset
     mode.  Then scaling up the IV cand step cost by (uf - 1) for no
     reg_offset_p IV cands, here the uf is one estimated unroll factor [1/4].
     
  [PATCH 4/4] rs6000: P9 D-form test cases

     Add some test cases, mainly copied from Kelvin's patch.  This is approved
     by Segher if the whole series is fine.


Many thanks to Richard and Segher on previous version reviews.

Bootstrapped and regress tested on powerpc64le-linux-gnu.

Any comments are highly appreciated!  Thanks in advance!


BR,
Kewen

-------

 gcc/cfgloop.h                  |   3 ++
 gcc/config/i386/i386-options.c |   6 +++
 gcc/config/s390/s390.c         |   6 +++
 gcc/doc/invoke.texi            |   9 +++++
 gcc/params.opt                 |   4 ++
 gcc/tree-ssa-loop-ivopts.c     | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
 gcc/tree-ssa-loop-manip.c      | 253 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h      |   3 +-
 gcc/tree-ssa-loop.c            |  33 ++++++++++++++++
 gcc/tree-ssa-loop.h            |   2 +
 10 files changed, 416 insertions(+), 3 deletions(-)

Comments

Richard Biener June 2, 2020, 11:38 a.m. UTC | #1
On Thu, 28 May 2020, Kewen.Lin wrote:

> Hi,
> 
> This is one repost and you can refer to the original series 
> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> 
> As we discussed in the thread
> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> The difference on D-form and other forms during unrolling is we can put the
> stride into displacement field to avoid additional step increment. eg:
> 
> With X-form (uf step increment):
>   ...
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   LD A = baseA, X
>   LD B = baseB, X
>   ST C = baseC, X
>   X = X + stride
>   ...
> 
> With D-form (one step increment for each base):
>   ...
>   LD A = baseA, OFF
>   LD B = baseB, OFF
>   ST C = baseC, OFF
>   LD A = baseA, OFF+stride
>   LD B = baseB, OFF+stride
>   ST C = baseC, OFF+stride
>   LD A = baseA, OFF+2*stride
>   LD B = baseB, OFF+2*stride
>   ST C = baseC, OFF+2*stride
>   ...
>   baseA += stride * uf
>   baseB += stride * uf
>   baseC += stride * uf
> 
> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> D-form vs. 8 step updates with X-form. Here we only need to check stride
> meet D-form field requirement, since if OFF doesn't meet, we can construct
> baseA' with baseA + OFF.

I'd just mention there are other targets that have the choice between
the above forms.  Since IVOPTs itself does not perform the unrolling
the IL it produces is the same, correct?

Richard.

> This patch set consists four parts:
>      
>   [PATCH 1/4] unroll: Add middle-end unroll factor estimation
> 
>      Add unroll factor estimation in middle-end. It mainly refers to current
>      RTL unroll factor determination in function decide_unrolling and its
>      sub calls.  As Richi suggested, we probably can force unroll factor
>      with this and avoid duplicate unroll factor calculation, but I think it
>      need more benchmarking work and should be handled separately.
> 
>   [PATCH 2/4] param: Introduce one param to control unroll factor 
> 
>      As Richard and Segher's suggestion, I used addr_offset_valid_p for the
>      addressing mode, rather than one target hook.  As Richard's suggestion,     
>      it introduces one parameter to control this IVOPTs consideration and
>      further tweaking [3/4] on top of unroll factor estimation [1/4].
>      
>   [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling
> 
>      Teach IVOPTs to mark the IV cand as reg_offset_p which is derived from
>      one address IV type group where the whole group is valid to use reg_offset
>      mode.  Then scaling up the IV cand step cost by (uf - 1) for no
>      reg_offset_p IV cands, here the uf is one estimated unroll factor [1/4].
>      
>   [PATCH 4/4] rs6000: P9 D-form test cases
> 
>      Add some test cases, mainly copied from Kelvin's patch.  This is approved
>      by Segher if the whole series is fine.
> 
> 
> Many thanks to Richard and Segher on previous version reviews.
> 
> Bootstrapped and regress tested on powerpc64le-linux-gnu.
> 
> Any comments are highly appreciated!  Thanks in advance!
> 
> 
> BR,
> Kewen
> 
> -------
> 
>  gcc/cfgloop.h                  |   3 ++
>  gcc/config/i386/i386-options.c |   6 +++
>  gcc/config/s390/s390.c         |   6 +++
>  gcc/doc/invoke.texi            |   9 +++++
>  gcc/params.opt                 |   4 ++
>  gcc/tree-ssa-loop-ivopts.c     | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-ssa-loop-manip.c      | 253 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-ssa-loop-manip.h      |   3 +-
>  gcc/tree-ssa-loop.c            |  33 ++++++++++++++++
>  gcc/tree-ssa-loop.h            |   2 +
>  10 files changed, 416 insertions(+), 3 deletions(-)
> 
>
Kewen.Lin June 3, 2020, 3:46 a.m. UTC | #2
Hi Richi,

on 2020/6/2 下午7:38, Richard Biener wrote:
> On Thu, 28 May 2020, Kewen.Lin wrote:
> 
>> Hi,
>>
>> This is one repost and you can refer to the original series 
>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
>>
>> As we discussed in the thread
>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>> The difference on D-form and other forms during unrolling is we can put the
>> stride into displacement field to avoid additional step increment. eg:
>>
>> With X-form (uf step increment):
>>   ...
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   LD A = baseA, X
>>   LD B = baseB, X
>>   ST C = baseC, X
>>   X = X + stride
>>   ...
>>
>> With D-form (one step increment for each base):
>>   ...
>>   LD A = baseA, OFF
>>   LD B = baseB, OFF
>>   ST C = baseC, OFF
>>   LD A = baseA, OFF+stride
>>   LD B = baseB, OFF+stride
>>   ST C = baseC, OFF+stride
>>   LD A = baseA, OFF+2*stride
>>   LD B = baseB, OFF+2*stride
>>   ST C = baseC, OFF+2*stride
>>   ...
>>   baseA += stride * uf
>>   baseB += stride * uf
>>   baseC += stride * uf
>>
>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>> baseA' with baseA + OFF.
> 
> I'd just mention there are other targets that have the choice between
> the above forms.  Since IVOPTs itself does not perform the unrolling
> the IL it produces is the same, correct?
> 
Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
it only models things based on what it sees.  We can assume it thinks
later RTL unrolling won't perform.

With this patch, since the IV choice probably changes, the IL can probably
change.  The typical difference with this patch is:

  vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
vs.
  vect__1.7_15 = MEM[base: _29, offset: 0B];

BR,
Kewen

> Richard.
>
Richard Biener June 3, 2020, 7:07 a.m. UTC | #3
On Wed, 3 Jun 2020, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2020/6/2 下午7:38, Richard Biener wrote:
> > On Thu, 28 May 2020, Kewen.Lin wrote:
> > 
> >> Hi,
> >>
> >> This is one repost and you can refer to the original series 
> >> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> >>
> >> As we discussed in the thread
> >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >> The difference on D-form and other forms during unrolling is we can put the
> >> stride into displacement field to avoid additional step increment. eg:
> >>
> >> With X-form (uf step increment):
> >>   ...
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   LD A = baseA, X
> >>   LD B = baseB, X
> >>   ST C = baseC, X
> >>   X = X + stride
> >>   ...
> >>
> >> With D-form (one step increment for each base):
> >>   ...
> >>   LD A = baseA, OFF
> >>   LD B = baseB, OFF
> >>   ST C = baseC, OFF
> >>   LD A = baseA, OFF+stride
> >>   LD B = baseB, OFF+stride
> >>   ST C = baseC, OFF+stride
> >>   LD A = baseA, OFF+2*stride
> >>   LD B = baseB, OFF+2*stride
> >>   ST C = baseC, OFF+2*stride
> >>   ...
> >>   baseA += stride * uf
> >>   baseB += stride * uf
> >>   baseC += stride * uf
> >>
> >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >> baseA' with baseA + OFF.
> > 
> > I'd just mention there are other targets that have the choice between
> > the above forms.  Since IVOPTs itself does not perform the unrolling
> > the IL it produces is the same, correct?
> > 
> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
> it only models things based on what it sees.  We can assume it thinks
> later RTL unrolling won't perform.
> 
> With this patch, since the IV choice probably changes, the IL can probably
> change.  The typical difference with this patch is:
> 
>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
> vs.
>   vect__1.7_15 = MEM[base: _29, offset: 0B];

So we're asking IVOPTS "if we were unrolling this loop would you make
a different IV choice?" thus I wonder why we need so much complexity
here?  That is, if we can classify the loop as being possibly unrolled
we could evaluate IVOPTs IV choice (and overall cost) on the original
loop and in a second run on the original loop with fake IV uses
added with extra offset.  If the overall IV cost is similar we'll
take the unroll friendly choice if the costs are way different
(I wouldn't expect this to be the case ever?) I'd side with the
IV choice when not unrolling (and mark the loop as to be not unrolled).

Thus I'd err on the side of not unrolling but leave the ultimate choice
of whether to unroll to RTL unless IV cost makes that prohibitive.

Even without X- or D- form addressing modes the IV choice may differ
and I think we don't need extra knobs for the unroller but instead
can decide to set the existing n_unroll to zero (force not unroll)
when costs say it would be bad?

Richard.

> BR,
> Kewen
> 
> > Richard.
> > 
>
Kewen.Lin June 3, 2020, 7:58 a.m. UTC | #4
on 2020/6/3 下午3:07, Richard Biener wrote:
> On Wed, 3 Jun 2020, Kewen.Lin wrote:
> 
>> Hi Richi,
>>
>> on 2020/6/2 下午7:38, Richard Biener wrote:
>>> On Thu, 28 May 2020, Kewen.Lin wrote:
>>>
>>>> Hi,
>>>>
>>>> This is one repost and you can refer to the original series 
>>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
>>>>
>>>> As we discussed in the thread
>>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>>>> The difference on D-form and other forms during unrolling is we can put the
>>>> stride into displacement field to avoid additional step increment. eg:
>>>>
>>>> With X-form (uf step increment):
>>>>   ...
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   LD A = baseA, X
>>>>   LD B = baseB, X
>>>>   ST C = baseC, X
>>>>   X = X + stride
>>>>   ...
>>>>
>>>> With D-form (one step increment for each base):
>>>>   ...
>>>>   LD A = baseA, OFF
>>>>   LD B = baseB, OFF
>>>>   ST C = baseC, OFF
>>>>   LD A = baseA, OFF+stride
>>>>   LD B = baseB, OFF+stride
>>>>   ST C = baseC, OFF+stride
>>>>   LD A = baseA, OFF+2*stride
>>>>   LD B = baseB, OFF+2*stride
>>>>   ST C = baseC, OFF+2*stride
>>>>   ...
>>>>   baseA += stride * uf
>>>>   baseB += stride * uf
>>>>   baseC += stride * uf
>>>>
>>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>>>> baseA' with baseA + OFF.
>>>
>>> I'd just mention there are other targets that have the choice between
>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>> the IL it produces is the same, correct?
>>>
>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>> it only models things based on what it sees.  We can assume it thinks
>> later RTL unrolling won't perform.
>>
>> With this patch, since the IV choice probably changes, the IL can probably
>> change.  The typical difference with this patch is:
>>
>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>> vs.
>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
> 
> So we're asking IVOPTS "if we were unrolling this loop would you make
> a different IV choice?" thus I wonder why we need so much complexity
> here?  

I would describe it more like "we are going to unroll this loop with
unroll factor uf in RTL, would you consider this variable when modeling?"

In most cases, one single iteration is representative for the unrolled
body, so it doesn't matter considering unrolling or not.  But for the
case here, it's not true, expected reg_offset iv cand can make iv cand
step cost reduced, it leads the difference.

> That is, if we can classify the loop as being possibly unrolled
> we could evaluate IVOPTs IV choice (and overall cost) on the original
> loop and in a second run on the original loop with fake IV uses
> added with extra offset.  If the overall IV cost is similar we'll
> take the unroll friendly choice if the costs are way different
> (I wouldn't expect this to be the case ever?) I'd side with the
> IV choice when not unrolling (and mark the loop as to be not unrolled).
> 

Could you elaborate it a bit?  I guess it won't estimate the unroll
factor here, just guess it's to be unrolled or not?  The second run
with fake IV uses added with extra offset sounds like scaling up the 
iv group cost by uf.

> Thus I'd err on the side of not unrolling but leave the ultimate choice
> of whether to unroll to RTL unless IV cost makes that prohibitive.
> 
> Even without X- or D- form addressing modes the IV choice may differ
> and I think we don't need extra knobs for the unroller but instead
> can decide to set the existing n_unroll to zero (force not unroll)
> when costs say it would be bad?

Yes, even without x- or d- form addressing, the difference probably comes 
from compare type IV use for loop ending, maybe more cases which I am not
aware of.  But I don't see people care about it, probably the impact is
small.

IIUC what you stated here looks like to use ivopts information for unrolling
factor decision, I think this is a separate direction, do we have this
kind of case where ivopts costs can foresee the unrolling?

Now the unroll factor estimation can be used for other optimization passes
if they are wondering future unrolling factor decision, as discussed it
sounds a good idea to override the n_unroll with some benchmarking.

BR,
Kewen

> 
> Richard.
> 
>> BR,
>> Kewen
>>
>>> Richard.
>>>
>>
>
Richard Biener June 3, 2020, 9:27 a.m. UTC | #5
On Wed, 3 Jun 2020, Kewen.Lin wrote:

> on 2020/6/3 下午3:07, Richard Biener wrote:
> > On Wed, 3 Jun 2020, Kewen.Lin wrote:
> > 
> >> Hi Richi,
> >>
> >> on 2020/6/2 下午7:38, Richard Biener wrote:
> >>> On Thu, 28 May 2020, Kewen.Lin wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> This is one repost and you can refer to the original series 
> >>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html.
> >>>>
> >>>> As we discussed in the thread
> >>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >>>> The difference on D-form and other forms during unrolling is we can put the
> >>>> stride into displacement field to avoid additional step increment. eg:
> >>>>
> >>>> With X-form (uf step increment):
> >>>>   ...
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   LD A = baseA, X
> >>>>   LD B = baseB, X
> >>>>   ST C = baseC, X
> >>>>   X = X + stride
> >>>>   ...
> >>>>
> >>>> With D-form (one step increment for each base):
> >>>>   ...
> >>>>   LD A = baseA, OFF
> >>>>   LD B = baseB, OFF
> >>>>   ST C = baseC, OFF
> >>>>   LD A = baseA, OFF+stride
> >>>>   LD B = baseB, OFF+stride
> >>>>   ST C = baseC, OFF+stride
> >>>>   LD A = baseA, OFF+2*stride
> >>>>   LD B = baseB, OFF+2*stride
> >>>>   ST C = baseC, OFF+2*stride
> >>>>   ...
> >>>>   baseA += stride * uf
> >>>>   baseB += stride * uf
> >>>>   baseC += stride * uf
> >>>>
> >>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >>>> baseA' with baseA + OFF.
> >>>
> >>> I'd just mention there are other targets that have the choice between
> >>> the above forms.  Since IVOPTs itself does not perform the unrolling
> >>> the IL it produces is the same, correct?
> >>>
> >> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
> >> it only models things based on what it sees.  We can assume it thinks
> >> later RTL unrolling won't perform.
> >>
> >> With this patch, since the IV choice probably changes, the IL can probably
> >> change.  The typical difference with this patch is:
> >>
> >>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
> >> vs.
> >>   vect__1.7_15 = MEM[base: _29, offset: 0B];
> > 
> > So we're asking IVOPTS "if we were unrolling this loop would you make
> > a different IV choice?" thus I wonder why we need so much complexity
> > here?  
> 
> I would describe it more like "we are going to unroll this loop with
> unroll factor uf in RTL, would you consider this variable when modeling?"
> 
> In most cases, one single iteration is representative for the unrolled
> body, so it doesn't matter considering unrolling or not.  But for the
> case here, it's not true, expected reg_offset iv cand can make iv cand
> step cost reduced, it leads the difference.
> 
> > That is, if we can classify the loop as being possibly unrolled
> > we could evaluate IVOPTs IV choice (and overall cost) on the original
> > loop and in a second run on the original loop with fake IV uses
> > added with extra offset.  If the overall IV cost is similar we'll
> > take the unroll friendly choice if the costs are way different
> > (I wouldn't expect this to be the case ever?) I'd side with the
> > IV choice when not unrolling (and mark the loop as to be not unrolled).
> > 
> 
> Could you elaborate it a bit?  I guess it won't estimate the unroll
> factor here, just guess it's to be unrolled or not?  The second run
> with fake IV uses added with extra offset sounds like scaling up the 
> iv group cost by uf.

From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, 
offset: 0B]) is preferable since in the unrolled variant we have
the same addres but with a different constant offset for the unroll
copies while the second form would have to update the 'base' IV.

Thus I think the difference in IV cost and decision should already
show up if we, for each USE add a USE with an added constant offset.
This might be what your patch does with that extra flag on the USEs,
I was suggesting to model the USEs more explicitely, simulating a
2-way unroll.  I think in the end I'll defer to Bin here who knows
the code best.

> > Thus I'd err on the side of not unrolling but leave the ultimate choice
> > of whether to unroll to RTL unless IV cost makes that prohibitive.
> > 
> > Even without X- or D- form addressing modes the IV choice may differ
> > and I think we don't need extra knobs for the unroller but instead
> > can decide to set the existing n_unroll to zero (force not unroll)
> > when costs say it would be bad?
> 
> Yes, even without x- or d- form addressing, the difference probably comes 
> from compare type IV use for loop ending, maybe more cases which I am not
> aware of.  But I don't see people care about it, probably the impact is
> small.
> 
> IIUC what you stated here looks like to use ivopts information for unrolling
> factor decision, I think this is a separate direction, do we have this
> kind of case where ivopts costs can foresee the unrolling?
> 
> Now the unroll factor estimation can be used for other optimization passes
> if they are wondering future unrolling factor decision, as discussed it
> sounds a good idea to override the n_unroll with some benchmarking.

I didnt' suggest to use IVOPTs to determine the unroll factor.  In
fact your patch looks like it does this?  Instead I wanted to make
IVOPTs choose a set of IVs that is best for a blend of both worlds - use
D-form when it doesn't hurt the not unrolled code [much], and X-form
when the D-form is way worse (for whatever reason) and signal that
to the unroller (but we could chose to not do that).

The real issue is of course we're applying IV decision to a not final
loop.

> BR,
> Kewen
> 
> > 
> > Richard.
> > 
> >> BR,
> >> Kewen
> >>
> >>> Richard.
> >>>
> >>
> > 
>
Kewen.Lin June 3, 2020, 10:47 a.m. UTC | #6
on 2020/6/3 下午5:27, Richard Biener wrote:
> On Wed, 3 Jun 2020, Kewen.Lin wrote:
> 
>> on 2020/6/3 下午3:07, Richard Biener wrote:
>>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>>>
>>>> Hi Richi,
>>>>

snip ...

>>>>>
>>>>> I'd just mention there are other targets that have the choice between
>>>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>>>> the IL it produces is the same, correct?
>>>>>
>>>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>>>> it only models things based on what it sees.  We can assume it thinks
>>>> later RTL unrolling won't perform.
>>>>
>>>> With this patch, since the IV choice probably changes, the IL can probably
>>>> change.  The typical difference with this patch is:
>>>>
>>>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>>>> vs.
>>>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
>>>
>>> So we're asking IVOPTS "if we were unrolling this loop would you make
>>> a different IV choice?" thus I wonder why we need so much complexity
>>> here?  
>>
>> I would describe it more like "we are going to unroll this loop with
>> unroll factor uf in RTL, would you consider this variable when modeling?"
>>
>> In most cases, one single iteration is representative for the unrolled
>> body, so it doesn't matter considering unrolling or not.  But for the
>> case here, it's not true, expected reg_offset iv cand can make iv cand
>> step cost reduced, it leads the difference.
>>
>>> That is, if we can classify the loop as being possibly unrolled
>>> we could evaluate IVOPTs IV choice (and overall cost) on the original
>>> loop and in a second run on the original loop with fake IV uses
>>> added with extra offset.  If the overall IV cost is similar we'll
>>> take the unroll friendly choice if the costs are way different
>>> (I wouldn't expect this to be the case ever?) I'd side with the
>>> IV choice when not unrolling (and mark the loop as to be not unrolled).
>>>
>>
>> Could you elaborate it a bit?  I guess it won't estimate the unroll
>> factor here, just guess it's to be unrolled or not?  The second run
>> with fake IV uses added with extra offset sounds like scaling up the 
>> iv group cost by uf.
> 
> From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, 
> offset: 0B]) is preferable since in the unrolled variant we have
> the same addres but with a different constant offset for the unroll
> copies while the second form would have to update the 'base' IV.
> 
> Thus I think the difference in IV cost and decision should already
> show up if we, for each USE add a USE with an added constant offset.
> This might be what your patch does with that extra flag on the USEs,
> I was suggesting to model the USEs more explicitely, simulating a
> 2-way unroll.  I think in the end I'll defer to Bin here who knows
> the code best.
> 

Thanks for your further explanation!  As your proposal we introduce more
iv use groups with step added.  Take the example here
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html
Imagining initially the cand iv 4 leading to x-form wins, it's the
original iv, has the iv-group cost 1 against the address group.
Although we introduce one more group (2-way unrolling), the iv still
wins since pulling the address iv in takes 5 (15 for three).  Probably
we can introduce more groups according to uf here.

OK.  Looking forward to Bin's comments.

>>> Thus I'd err on the side of not unrolling but leave the ultimate choice
>>> of whether to unroll to RTL unless IV cost makes that prohibitive.
>>>
>>> Even without X- or D- form addressing modes the IV choice may differ
>>> and I think we don't need extra knobs for the unroller but instead
>>> can decide to set the existing n_unroll to zero (force not unroll)
>>> when costs say it would be bad?
>>
>> Yes, even without x- or d- form addressing, the difference probably comes 
>> from compare type IV use for loop ending, maybe more cases which I am not
>> aware of.  But I don't see people care about it, probably the impact is
>> small.
>>
>> IIUC what you stated here looks like to use ivopts information for unrolling
>> factor decision, I think this is a separate direction, do we have this
>> kind of case where ivopts costs can foresee the unrolling?
>>
>> Now the unroll factor estimation can be used for other optimization passes
>> if they are wondering future unrolling factor decision, as discussed it
>> sounds a good idea to override the n_unroll with some benchmarking.
> 
> I didnt' suggest to use IVOPTs to determine the unroll factor.  In
> fact your patch looks like it does this?  Instead I wanted to make
> IVOPTs choose a set of IVs that is best for a blend of both worlds - use
> D-form when it doesn't hurt the not unrolled code [much], and X-form
> when the D-form is way worse (for whatever reason) and signal that
> to the unroller (but we could chose to not do that).
> 

Sorry for my weak comprehension!  Nice, we are on the same direction.  :)

> The real issue is of course we're applying IV decision to a not final
> loop.
> 

Exactly.

BR,
Kewen
Richard Sandiford June 3, 2020, 11:08 a.m. UTC | #7
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/6/3 下午5:27, Richard Biener wrote:
>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>> 
>>> on 2020/6/3 下午3:07, Richard Biener wrote:
>>>> On Wed, 3 Jun 2020, Kewen.Lin wrote:
>>>>
>>>>> Hi Richi,
>>>>>
>
> snip ...
>
>>>>>>
>>>>>> I'd just mention there are other targets that have the choice between
>>>>>> the above forms.  Since IVOPTs itself does not perform the unrolling
>>>>>> the IL it produces is the same, correct?
>>>>>>
>>>>> Yes.  Before this patch, IVOPTs doesn't consider the unrolling impacts,
>>>>> it only models things based on what it sees.  We can assume it thinks
>>>>> later RTL unrolling won't perform.
>>>>>
>>>>> With this patch, since the IV choice probably changes, the IL can probably
>>>>> change.  The typical difference with this patch is:
>>>>>
>>>>>   vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B];
>>>>> vs.
>>>>>   vect__1.7_15 = MEM[base: _29, offset: 0B];
>>>>
>>>> So we're asking IVOPTS "if we were unrolling this loop would you make
>>>> a different IV choice?" thus I wonder why we need so much complexity
>>>> here?  
>>>
>>> I would describe it more like "we are going to unroll this loop with
>>> unroll factor uf in RTL, would you consider this variable when modeling?"
>>>
>>> In most cases, one single iteration is representative for the unrolled
>>> body, so it doesn't matter considering unrolling or not.  But for the
>>> case here, it's not true, expected reg_offset iv cand can make iv cand
>>> step cost reduced, it leads the difference.
>>>
>>>> That is, if we can classify the loop as being possibly unrolled
>>>> we could evaluate IVOPTs IV choice (and overall cost) on the original
>>>> loop and in a second run on the original loop with fake IV uses
>>>> added with extra offset.  If the overall IV cost is similar we'll
>>>> take the unroll friendly choice if the costs are way different
>>>> (I wouldn't expect this to be the case ever?) I'd side with the
>>>> IV choice when not unrolling (and mark the loop as to be not unrolled).
>>>>
>>>
>>> Could you elaborate it a bit?  I guess it won't estimate the unroll
>>> factor here, just guess it's to be unrolled or not?  The second run
>>> with fake IV uses added with extra offset sounds like scaling up the 
>>> iv group cost by uf.
>> 
>> From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, 
>> offset: 0B]) is preferable since in the unrolled variant we have
>> the same addres but with a different constant offset for the unroll
>> copies while the second form would have to update the 'base' IV.
>> 
>> Thus I think the difference in IV cost and decision should already
>> show up if we, for each USE add a USE with an added constant offset.
>> This might be what your patch does with that extra flag on the USEs,
>> I was suggesting to model the USEs more explicitely, simulating a
>> 2-way unroll.  I think in the end I'll defer to Bin here who knows
>> the code best.
>> 
>
> Thanks for your further explanation!  As your proposal we introduce more
> iv use groups with step added.  Take the example here
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html
> Imagining initially the cand iv 4 leading to x-form wins, it's the
> original iv, has the iv-group cost 1 against the address group.
> Although we introduce one more group (2-way unrolling), the iv still
> wins since pulling the address iv in takes 5 (15 for three).  Probably
> we can introduce more groups according to uf here.

Yeah, to summarise that thread: the idea there was that we would
continue to cost each use once, but base the cost on the kind of address
seen in the unrolled iterations.  I guess this tends to over-estimate the
cost of index IVs to some extent, but I too was aiming for something
simple that doesn't depend on a specific unroll factor.

Kewen's point there was that that approach works for high unroll factors,
but not for small unroll factors like 2.  For:

  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride

using X as an IV is still preferred.  It's only once the unroll
factor exceeds the number of pointer IVs that using pointer IVs
becomes better.

So like Kewen says, using 2 USEs (the original one and an unrolled one)
would have the opposite problem: it would still prefer index IVs and not
consider the benefit of pointer IVs at higher unroll factors.

But I agree that trying to guess what a much later pass will do doesn't
feel very clean either...

Thanks,
Richard