
[RFC] Using main loop's updated IV as base_address for epilogue vectorization

Message ID ea077462-00c3-f2dc-e1ec-1f95130e918c@arm.com
State New

Commit Message

Andre Vieira (lists) April 30, 2021, 9:07 a.m. UTC
Hi,

The aim of this RFC is to explore a way of cleaning up the codegen 
around data_references.  To be specific, I'd like to reuse the 
main-loop's updated data_reference as the base_address for the 
epilogue's corresponding data_reference, rather than use the niters.  We 
have found this leads to better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map of iv_updates, each of which 
always contains an updated pointer that is captured in 
vectorizable_{load,store}; an iv_update may also contain a skip_edge 
in case we decide the vectorization can be skipped in 
'vect_do_peeling'. During the epilogue update this map of iv_updates 
is checked for an entry for each data_reference; if one is found it is 
used, and if not we revert to the old behavior of using the niters to 
advance the data_reference.
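To make the difference concrete, here is a rough scalar sketch of the two 
schemes (function names, the fixed vectorization factor, and the inner 
lane loops standing in for vector ops are all illustrative, not from the 
patch):

```c
#include <stddef.h>

/* Old scheme: the epilogue re-derives its base address from niters,
   which needs a fresh (possibly widening/converting) computation.  */
int sum_niters (short *a, size_t n)
{
  int sum = 0;
  size_t vf = 8;                       /* pretend vectorization factor */
  size_t niters_main = n / vf * vf;
  for (size_t i = 0; i < niters_main; i += vf)
    for (size_t l = 0; l < vf; ++l)    /* stands in for one vector op */
      sum += a[i + l];
  short *epi_base = a + niters_main;   /* base recomputed from niters */
  for (size_t j = 0; j < n - niters_main; ++j)
    sum += epi_base[j];
  return sum;
}

/* Proposed scheme: the epilogue reuses the main loop's updated
   pointer IV directly, so no recomputation is needed.  */
int sum_reused_iv (short *a, size_t n)
{
  int sum = 0;
  size_t vf = 8;
  short *p = a;
  short *end_main = a + n / vf * vf;
  for (; p < end_main; p += vf)
    for (size_t l = 0; l < vf; ++l)
      sum += p[l];
  for (short *q = p; q < a + n; ++q)   /* base = main loop's final IV */
    sum += *q;
  return sum;
}
```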

The motivation for this work is to improve codegen for the option 
`--param vect-partial-vector-usage=1` for SVE. We found that one of the 
main problems for the codegen here was coming from unnecessary 
conversions caused by the way we update the data_references in the epilogue.

This patch passes regression tests on aarch64-linux-gnu, but the 
codegen is still not optimal in some cases, specifically those where 
we have a scalar epilogue: the scalar epilogue does not use the 
data_references and relies on the gimple scalar code, thus 
constructing a memory access from the niters again.  This is a 
limitation for which I haven't quite worked out a solution yet, and it 
does cause some minor regressions due to unfortunate spills.

Let me know what you think and if you have ideas of how we can better 
achieve this.

Kind regards,
Andre Vieira

Comments

Richard Biener May 4, 2021, 9:56 a.m. UTC | #1
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> The aim of this RFC is to explore a way of cleaning up the codegen around
> data_references.  To be specific, I'd like to reuse the main-loop's updated
> data_reference as the base_address for the epilogue's corresponding
> data_reference, rather than use the niters.  We have found this leads to
> better codegen in the vectorized epilogue loops.
> 
> The approach in this RFC creates a map if iv_updates which always contain an
> updated pointer that is caputed in vectorizable_{load,store}, an iv_update may
> also contain a skip_edge in case we decide the vectorization can be skipped in
> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then
> checked to see if it contains an entry for a data_reference and it is used
> accordingly and if not it reverts back to the old behavior of using the niters
> to advance the data_reference.
> 
> The motivation for this work is to improve codegen for the option `--param
> vect-partial-vector-usage=1` for SVE. We found that one of the main problems
> for the codegen here was coming from unnecessary conversions caused by the way
> we update the data_references in the epilogue.
> 
> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
> still not optimal in some cases. Specifically those where we have a scalar
> epilogue, as this does not use the data_reference's and will rely on the
> gimple scalar code, thus constructing again a memory access using the niters. 
> This is a limitation for which I haven't quite worked out a solution yet and
> does cause some minor regressions due to unfortunate spills.
> 
> Let me know what you think and if you have ideas of how we can better achieve
> this.

Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).

That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as
initial value.  For each DR group we'd create a single IV (we can
even do better in case we have load + store of the "same" group).

We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the appropriate
initial value & adjusted update).

Richard.

> Kind regards,
> Andre Vieira
> 
> 
>
Andre Vieira (lists) May 5, 2021, 11:34 a.m. UTC | #2
Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be 
able to help us. Looking at these two code examples:
#include <stddef.h>
#if 0
int foo(short * a, short * b, unsigned int n)
{
     int sum = 0;
     for (unsigned int i = 0; i < n; ++i)
         sum += a[i] + b[i];

     return sum;
}


#else

int bar (short * a, short *b, unsigned int n)
{
     int sum = 0;
     unsigned int i = 0;
     for (; i < (n / 16); i += 1)
     {
        // Accesses indices [0, 16, ..., ((n/16) - 1) * 16]
        // Example n = 127: n/16 = 7, so i = 0..6 and the
        // accessed indices are [0, 16, 32, 48, 64, 80, 96]
         sum += a[i*16] + b[i*16];
     }
     for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)
     {
        // Iterates [(n/16) * 16, (n/16) * 16 + 1, ..., n - 1]
        // Example n = 127:
        // j starts at (127/16) * 16 = 7 * 16 = 112,
        // so iterates over [112, 113, 114, ..., 126]
         sum += a[j] + b[j];
     }
     return sum;
}
#endif

I compiled the bottom one (bar, compiled when the guard is '#if 0') for 
aarch64-linux-gnu with the following options: '-O3 -march=armv8-a 
-fno-tree-vectorize -fdump-tree-ivopts-all -fno-unroll-loops'. See 
godbolt link here: 
https://godbolt.org/z/MEf6j6ebM

I tried to see what IVOPTs would make of this and it is able to analyze 
the IVs, but it doesn't realize (not even sure it tries) that one IV's 
end (loop 1) could be used as the base for the other (loop 2). I don't 
know if this is where you'd want such optimizations to be made; on one 
hand I think it would be great, as it would also help with 
non-vectorized loops, as you alluded to.
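For reference, this is the shape one would hope such a pass could reach: 
bar rewritten with a single pointer IV per array, where the second loop 
simply continues from the first loop's final pointer instead of 
recomputing (n / 16) * 16 (the rewrite and its name are my illustration, 
not compiler output):

```c
#include <stddef.h>

int bar_merged (short *a, short *b, unsigned int n)
{
  int sum = 0;
  short *pa = a, *pb = b;
  short *pa_end1 = a + (size_t)(n / 16) * 16;
  for (; pa < pa_end1; pa += 16, pb += 16)  /* stride-16 loop */
    sum += *pa + *pb;
  for (; pa < a + n; ++pa, ++pb)  /* tail reuses pa/pb where they ended */
    sum += *pa + *pb;
  return sum;
}
```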

However, if you compile the top test case (foo, compiled when the guard 
is '#if 1') and let the tree-vectorizer have a go, you will see 
different behaviours for different vectorization approaches. With 
'-O3 -march=armv8-a', using NEON and epilogue vectorization, it seems 
IVOPTs only picks up on one loop.
If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' 
it will detect two loops. This may well be because epilogue 
vectorization in fact 'un-loops' it, since it knows it will only have 
to do one iteration of the vectorized epilogue. 
vect-partial-vector-usage=1 could have done the same, but because we 
are dealing with polymorphic vector modes it fails to. I have a hack 
that works for vect-partial-vector-usage to avoid this, but I think we 
can probably do better and try to reason about boundaries in poly_ints 
rather than integers (TBC).

Anyway, I diverge. Back to the main question of this patch: how do you 
suggest I go about this? Is there a way to make IVOPTs aware of the 
'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then 
teach it to merge IVs if one ends where the other begins?

On 04/05/2021 10:56, Richard Biener wrote:
> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>> The aim of this RFC is to explore a way of cleaning up the codegen around
>> data_references.  To be specific, I'd like to reuse the main-loop's updated
>> data_reference as the base_address for the epilogue's corresponding
>> data_reference, rather than use the niters.  We have found this leads to
>> better codegen in the vectorized epilogue loops.
>>
>> The approach in this RFC creates a map if iv_updates which always contain an
>> updated pointer that is caputed in vectorizable_{load,store}, an iv_update may
>> also contain a skip_edge in case we decide the vectorization can be skipped in
>> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then
>> checked to see if it contains an entry for a data_reference and it is used
>> accordingly and if not it reverts back to the old behavior of using the niters
>> to advance the data_reference.
>>
>> The motivation for this work is to improve codegen for the option `--param
>> vect-partial-vector-usage=1` for SVE. We found that one of the main problems
>> for the codegen here was coming from unnecessary conversions caused by the way
>> we update the data_references in the epilogue.
>>
>> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
>> still not optimal in some cases. Specifically those where we have a scalar
>> epilogue, as this does not use the data_reference's and will rely on the
>> gimple scalar code, thus constructing again a memory access using the niters.
>> This is a limitation for which I haven't quite worked out a solution yet and
>> does cause some minor regressions due to unfortunate spills.
>>
>> Let me know what you think and if you have ideas of how we can better achieve
>> this.
> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
>
> I think it might be interesting to create a C testcase mimicing the
> update problem without involving the vectorizer.  That way we can
> see how the various components involved behave (FRE + ivopts most
> specifically).
>
> That said, a cleaner approach to dealing with this would be to
> explicitely track the IVs we generate for vectorized DRs, eventually
> factoring that out from vectorizable_{store,load} so we can simply
> carry over the actual pointer IV final value to the epilogue as
> initial value.  For each DR group we'd create a single IV (we can
> even do better in case we have load + store of the "same" group).
>
> We already kind-of track things via the ivexpr_map, but I'm not sure
> if this lazly populated map can be reliably re-used to "re-populate"
> the epilogue one (walk the map, create epilogue IVs with the appropriate
> initial value & adjustd upate).
>
> Richard.
>
>> Kind regards,
>> Andre Vieira
>>
>>
>>
Richard Biener May 5, 2021, 12:34 p.m. UTC | #3
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> So I'm trying to look at what IVOPTs does right now and how it might be able
> to help us. Looking at these two code examples:
> #include <stddef.h>
> #if 0
> int foo(short * a, short * b, unsigned int n)
> {
>     int sum = 0;
>     for (unsigned int i = 0; i < n; ++i)
>         sum += a[i] + b[i];
> 
>     return sum;
> }
> 
> 
> #else
> 
> int bar (short * a, short *b, unsigned int n)
> {
>     int sum = 0;
>     unsigned int i = 0;
>     for (; i < (n / 16); i += 1)
>     {
>         // Iterates [0, 16, .., (n/16 * 16) * 16]
>         // Example n = 127,
>         // iterates [0, 16, 32, 48, 64, 80, 96, 112]
>         sum += a[i*16] + b[i*16];
>     }
>     for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)
>     {
>         // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]
>         // Example n = 127,
>         // j starts at (127/16) * 16 = 7 * 16 = 112,
>         // So iterates over [112, 113, 114, 115, ..., 127]
>         sum += a[j] + b[j];
>     }
>     return sum;
> }
> #endif
> 
> Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the following
> options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all
> -fno-unroll-loops'. See godbolt link here: https://godbolt.org/z/MEf6j6ebM
> 
> I tried to see what IVOPTs would make of this and it is able to analyze the
> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
> could be used as the base for the other (loop 2). I don't know if this is
> where you'd want such optimizations to be made, on one side I think it would
> be great as it would also help with non-vectorized loops as you allured to.

Hmm, OK.  So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop final value of the
IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
And for a followup transform to catch this it would need to compute
the final value of the IV and then match this up with the initial
value computation.  I suppose FRE could be taught to do this, at
least for very simple cases.
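The redundancy FRE would need to see can be written out as straight-line 
C: once final-value replacement expresses the first loop's exit value as 
n / 16, the second loop's start value is built from the very same 
expression, which value numbering could then reuse (a contrived sketch, 
with an illustrative function name):

```c
#include <assert.h>

unsigned int redundant_starts (unsigned int n)
{
  unsigned int i = 0;
  while (i < n / 16)                    /* first loop's counter IV */
    ++i;
  /* Final-value replacement rewrites i's exit value as n / 16 ...  */
  unsigned int i_final = n / 16;
  /* ... so the second loop's start value is the same expression,
     which FRE/CSE could unify with it.  */
  unsigned int j_start = (n / 16) * 16;
  assert (i == i_final && j_start == i_final * 16);
  return j_start;
}
```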

> However, if you compile the top test case (#if 1) and let the tree-vectorizer
> have a go you will see different behaviours for different vectorization
> approaches, so for:
> '-O3 -march=armv8-a', using NEON and epilogue vectorization it seems IVOPTs
> only picks up on one loop.

Yep, the "others" are likely fully peeled because they have just a single
iteration.  Again some kind-of final-value replacement "CSE" could
eventually help - but then we have jump-arounds here as well thus
we'd need final-value replacement "PRE".

> If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it
> will detect two loops. This may well be because in fact epilogue vectorization
> 'un-loops' it because it knows it will only have to do one iteration of the
> vectorized epilogue. vect-partial-vector-usage=1 could have done the same, but
> because we are dealing with polymorphic vector modes it fails to, I have a
> hack that works for vect-partial-vector-usage to avoid it, but I think we can
> probably do better and try to reason about boundaries in poly_int's rather
> than integers (TBC).
> 
> Anyway I diverge. Back to the main question of this patch. How do you suggest
> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
> in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
> if one ends where the other begins?

I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.  I'll note there's also
the issue of epilogue vectorization and reductions where we seem
to not re-use partially reduced reduction vectors but instead
reduce to a scalar in each step.  That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the
main vector loop to the epilogue loops.  Like for

double foo (double *a, int n)
{
  double sum = 0.;
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

with AVX512 we get three reductions to scalars instead of
a partial reduction from zmm to ymm before the first vectorized
epilogue followed by a reduction from ymm to xmm before the second
(the jump around for the epilogues need to jump to the further
reduction piece obviously).
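A scalar model of the desired reduction chain, with lane arrays standing 
in for zmm/ymm/xmm registers (purely illustrative; the real codegen 
would of course keep the accumulators in vector registers): fold 8 lanes 
to 4 before the first vectorized epilogue, then 4 to 2 before the 
second, and reduce to a scalar only once at the end.

```c
/* Fold the upper half of the live lanes into the lower half,
   modeling a zmm->ymm or ymm->xmm partial reduction.  */
static void fold_halves (double *acc, int lanes)
{
  for (int l = 0; l < lanes / 2; ++l)
    acc[l] += acc[l + lanes / 2];
}

double reduce_chain (const double *a, int n)
{
  double acc[8] = { 0 };
  int i = 0;
  for (; i + 8 <= n; i += 8)       /* main loop, 8 lanes ("zmm") */
    for (int l = 0; l < 8; ++l)
      acc[l] += a[i + l];
  fold_halves (acc, 8);            /* partial reduction zmm -> ymm */
  for (; i + 4 <= n; i += 4)       /* first epilogue, 4 lanes */
    for (int l = 0; l < 4; ++l)
      acc[l] += a[i + l];
  fold_halves (acc, 4);            /* partial reduction ymm -> xmm */
  for (; i + 2 <= n; i += 2)       /* second epilogue, 2 lanes */
    for (int l = 0; l < 2; ++l)
      acc[l] += a[i + l];
  double sum = acc[0] + acc[1];    /* single final scalar reduction */
  for (; i < n; ++i)               /* scalar tail */
    sum += a[i];
  return sum;
}
```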

So I think we want to record IVs we generate (the reduction IVs
are already nicely associated with the stmt-infos), one might
consider to refer to them from the dr_vec_info for example.

It's just going to be "interesting" to wire everything up
correctly with all the jump-arounds we have ...

> On 04/05/2021 10:56, Richard Biener wrote:
> > On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> The aim of this RFC is to explore a way of cleaning up the codegen around
> >> data_references.  To be specific, I'd like to reuse the main-loop's updated
> >> data_reference as the base_address for the epilogue's corresponding
> >> data_reference, rather than use the niters.  We have found this leads to
> >> better codegen in the vectorized epilogue loops.
> >>
> >> The approach in this RFC creates a map if iv_updates which always contain
> >> an
> >> updated pointer that is caputed in vectorizable_{load,store}, an iv_update
> >> may
> >> also contain a skip_edge in case we decide the vectorization can be skipped
> >> in
> >> 'vect_do_peeling'. During the epilogue update this map of iv_updates is
> >> then
> >> checked to see if it contains an entry for a data_reference and it is used
> >> accordingly and if not it reverts back to the old behavior of using the
> >> niters
> >> to advance the data_reference.
> >>
> >> The motivation for this work is to improve codegen for the option `--param
> >> vect-partial-vector-usage=1` for SVE. We found that one of the main
> >> problems
> >> for the codegen here was coming from unnecessary conversions caused by the
> >> way
> >> we update the data_references in the epilogue.
> >>
> >> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
> >> still not optimal in some cases. Specifically those where we have a scalar
> >> epilogue, as this does not use the data_reference's and will rely on the
> >> gimple scalar code, thus constructing again a memory access using the
> >> niters.
> >> This is a limitation for which I haven't quite worked out a solution yet
> >> and
> >> does cause some minor regressions due to unfortunate spills.
> >>
> >> Let me know what you think and if you have ideas of how we can better
> >> achieve
> >> this.
> > Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
> >
> > I think it might be interesting to create a C testcase mimicing the
> > update problem without involving the vectorizer.  That way we can
> > see how the various components involved behave (FRE + ivopts most
> > specifically).
> >
> > That said, a cleaner approach to dealing with this would be to
> > explicitely track the IVs we generate for vectorized DRs, eventually
> > factoring that out from vectorizable_{store,load} so we can simply
> > carry over the actual pointer IV final value to the epilogue as
> > initial value.  For each DR group we'd create a single IV (we can
> > even do better in case we have load + store of the "same" group).
> >
> > We already kind-of track things via the ivexpr_map, but I'm not sure
> > if this lazly populated map can be reliably re-used to "re-populate"
> > the epilogue one (walk the map, create epilogue IVs with the appropriate
> > initial value & adjustd upate).
> >
> > Richard.
> >
> >> Kind regards,
> >> Andre Vieira
> >>
> >>
> >>
> 
>
Andre Vieira (lists) May 5, 2021, 4:58 p.m. UTC | #4
On 05/05/2021 13:34, Richard Biener wrote:
> On Wed, 5 May 2021, Andre Vieira (lists) wrote:
>
>> I tried to see what IVOPTs would make of this and it is able to analyze the
>> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
>> could be used as the base for the other (loop 2). I don't know if this is
>> where you'd want such optimizations to be made, on one side I think it would
>> be great as it would also help with non-vectorized loops as you allured to.
> Hmm, OK.  So there's the first loop that has a looparound jump and thus
> we do not always enter the 2nd loop with the first loop final value of the
> IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
> And for a followup transform to catch this it would need to compute
> the final value of the IV and then match this up with the initial
> value computation.  I suppose FRE could be teached to do this, at
> least for very simple cases.
I will admit I am not at all familiar with how FRE works; I know it 
exists because omitting it often breaks my vector patches :P But 
that's about all I know.
I will have a look and see if it makes sense from my perspective to 
address it there, because ...
>
>> Anyway I diverge. Back to the main question of this patch. How do you suggest
>> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
>> in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
>> if one ends where the other begins?
> I don't think we will make that work easily.  So indeed attacking this
> in the vectorizer sounds most promising.

The problem I found with my approach is that it only tackles the 
vectorized epilogues, and that leads to regressions. I don't have the 
example at hand, but what I saw happening was that increased register 
pressure led to a spill in the hot path. I believe this was caused by 
the epilogue loop using the updated pointers as the base for its DRs 
(in this case there were three DRs: two loads and one store), while 
the scalar epilogue still used the original base + niters, since this 
data_reference approach only changes the vectorized epilogues.


>   I'll note there's also
> the issue of epilogue vectorization and reductions where we seem
> to not re-use partially reduced reduction vectors but instead
> reduce to a scalar in each step.  That's a related issue - we're
> not able to carry forward a (reduction) IV we generated for the
> main vector loop to the epilogue loops.  Like for
>
> double foo (double *a, int n)
> {
>    double sum = 0.;
>    for (int i = 0; i < n; ++i)
>      sum += a[i];
>    return sum;
> }
>
> with AVX512 we get three reductions to scalars instead of
> a partial reduction from zmm to ymm before the first vectorized
> epilogue followed by a reduction from ymm to xmm before the second
> (the jump around for the epilogues need to jump to the further
> reduction piece obviously).
>
> So I think we want to record IVs we generate (the reduction IVs
> are already nicely associated with the stmt-infos), one might
> consider to refer to them from the dr_vec_info for example.
>
> It's just going to be "interesting" to wire everything up
> correctly with all the jump-arounds we have ...
I have a downstream hack for the reductions, but it only worked for 
partial-vector-usage, as there you have the guarantee it's the same 
vector mode, so you don't need to faff around with half and full 
vectors. Obviously what you are suggesting has much wider applications, 
and not surprisingly I think Richard Sandiford also pointed out to me 
that these are somewhat related and we might be able to reuse the 
IV-creation to manage it all. But I feel like I am currently light 
years away from that.

I had started to look at removing the data_reference updating we have 
now and dealing with this in the 'create_iv' calls from 
'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but 
then I thought it would be good to discuss it with you first. This 
will require keeping track of the 'end-value' of the IV, which for 
loops where we can skip the previous loop means we will need to 
construct a phi-node containing the updated pointer and the initial 
base. But I'm not entirely sure where to keep track of all this. Also, 
I don't know if I can replace the base address of the data_reference 
right there at the 'create_iv' call; can a data_reference be used 
multiple times in the same loop?
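The phi-node described above can be modeled in plain C: when the main 
vector loop may be skipped, the epilogue's base must select between the 
original base (skip edge) and the main loop's final pointer (loop exit 
edge). All names here are illustrative:

```c
/* Models: epi_base = PHI <orig_base (skip edge),
                           updated_ptr (main-loop exit edge)>  */
const short *epilogue_base (const short *orig_base,
                            const short *updated_ptr,
                            int main_loop_taken)
{
  return main_loop_taken ? updated_ptr : orig_base;
}
```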

I'll go do a bit more nosing around this idea and the ivmap you 
mentioned before. Let me know if you have any ideas on how this all 
should look like, even if its a 'in an ideal world'.

Andre
>
>> On 04/05/2021 10:56, Richard Biener wrote:
>>> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
>>>
>>>> Hi,
>>>>
>>>> The aim of this RFC is to explore a way of cleaning up the codegen around
>>>> data_references.  To be specific, I'd like to reuse the main-loop's updated
>>>> data_reference as the base_address for the epilogue's corresponding
>>>> data_reference, rather than use the niters.  We have found this leads to
>>>> better codegen in the vectorized epilogue loops.
>>>>
>>>> The approach in this RFC creates a map if iv_updates which always contain
>>>> an
>>>> updated pointer that is caputed in vectorizable_{load,store}, an iv_update
>>>> may
>>>> also contain a skip_edge in case we decide the vectorization can be skipped
>>>> in
>>>> 'vect_do_peeling'. During the epilogue update this map of iv_updates is
>>>> then
>>>> checked to see if it contains an entry for a data_reference and it is used
>>>> accordingly and if not it reverts back to the old behavior of using the
>>>> niters
>>>> to advance the data_reference.
>>>>
>>>> The motivation for this work is to improve codegen for the option `--param
>>>> vect-partial-vector-usage=1` for SVE. We found that one of the main
>>>> problems
>>>> for the codegen here was coming from unnecessary conversions caused by the
>>>> way
>>>> we update the data_references in the epilogue.
>>>>
>>>> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
>>>> still not optimal in some cases. Specifically those where we have a scalar
>>>> epilogue, as this does not use the data_reference's and will rely on the
>>>> gimple scalar code, thus constructing again a memory access using the
>>>> niters.
>>>> This is a limitation for which I haven't quite worked out a solution yet
>>>> and
>>>> does cause some minor regressions due to unfortunate spills.
>>>>
>>>> Let me know what you think and if you have ideas of how we can better
>>>> achieve
>>>> this.
>>> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
>>>
>>> I think it might be interesting to create a C testcase mimicing the
>>> update problem without involving the vectorizer.  That way we can
>>> see how the various components involved behave (FRE + ivopts most
>>> specifically).
>>>
>>> That said, a cleaner approach to dealing with this would be to
>>> explicitely track the IVs we generate for vectorized DRs, eventually
>>> factoring that out from vectorizable_{store,load} so we can simply
>>> carry over the actual pointer IV final value to the epilogue as
>>> initial value.  For each DR group we'd create a single IV (we can
>>> even do better in case we have load + store of the "same" group).
>>>
>>> We already kind-of track things via the ivexpr_map, but I'm not sure
>>> if this lazly populated map can be reliably re-used to "re-populate"
>>> the epilogue one (walk the map, create epilogue IVs with the appropriate
>>> initial value & adjustd upate).
>>>
>>> Richard.
>>>
>>>> Kind regards,
>>>> Andre Vieira
>>>>
>>>>
>>>>
>>
Richard Biener May 7, 2021, 12:05 p.m. UTC | #5
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> 
> On 05/05/2021 13:34, Richard Biener wrote:
> > On Wed, 5 May 2021, Andre Vieira (lists) wrote:
> >
> >> I tried to see what IVOPTs would make of this and it is able to analyze the
> >> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop
> >> 1)
> >> could be used as the base for the other (loop 2). I don't know if this is
> >> where you'd want such optimizations to be made, on one side I think it
> >> would
> >> be great as it would also help with non-vectorized loops as you allured to.
> > Hmm, OK.  So there's the first loop that has a looparound jump and thus
> > we do not always enter the 2nd loop with the first loop final value of the
> > IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
> > And for a followup transform to catch this it would need to compute
> > the final value of the IV and then match this up with the initial
> > value computation.  I suppose FRE could be teached to do this, at
> > least for very simple cases.
> I will admit I am not at all familiar with how FRE works, I know it exists as
> the occlusion of running it often breaks my vector patches :P But that's about
> all I know.
> I will have a look and see if it makes sense from my perspective to address it
> there, because ...
> >
> >> Anyway I diverge. Back to the main question of this patch. How do you
> >> suggest
> >> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once'
> >> IVs
> >> in the epilogue(s) (both vector and scalar!) and then teach it to merge
> >> IV's
> >> if one ends where the other begins?
> > I don't think we will make that work easily.  So indeed attacking this
> > in the vectorizer sounds most promising.
> 
> The problem with this that I found with my approach is that it only tackles
> the vectorized epilogues and that leads to regressions, I don't have the
> example at hand, but what I saw was happening was that increased register
> pressure lead to a spill in the hot path. I believe this was caused by the
> epilogue loop using the update pointers as the base for their DR's, in this
> case there were three DR's (2 loads one store), but the scalar epilogue still
> using the original base + niters, since this data_reference approach only
> changes the vectorized epilogues.

Yeah, this issue obviously extends to the scalar pro and epilogue loops...

So ideally we'd produce IL (mainly for the IV setup code I guess)
that will be handled well by the following passes but then IVOPTs
is not multi-loop aware ...

That said, in the end we should be able to code-generate the scalar
loops as well (my plan is to add that, at least for the vector
loop, to be able to support partly vectorized loops with unvectorizable
stmts simply replicated as scalar ops).  In that case we can use
the same IVs again.

> >   I'll note there's also
> > the issue of epilogue vectorization and reductions where we seem
> > to not re-use partially reduced reduction vectors but instead
> > reduce to a scalar in each step.  That's a related issue - we're
> > not able to carry forward a (reduction) IV we generated for the
> > main vector loop to the epilogue loops.  Like for
> >
> > double foo (double *a, int n)
> > {
> >    double sum = 0.;
> >    for (int i = 0; i < n; ++i)
> >      sum += a[i];
> >    return sum;
> > }
> >
> > with AVX512 we get three reductions to scalars instead of
> > a partial reduction from zmm to ymm before the first vectorized
> > epilogue followed by a reduction from ymm to xmm before the second
> > (the jump around for the epilogues need to jump to the further
> > reduction piece obviously).
> >
> > So I think we want to record IVs we generate (the reduction IVs
> > are already nicely associated with the stmt-infos), one might
> > consider to refer to them from the dr_vec_info for example.
> >
> > It's just going to be "interesting" to wire everything up
> > correctly with all the jump-arounds we have ...
> I have a downstream hack for the reductions, but it only worked for
> partial-vector-usage as there you have the guarantee it's the same
> vector-mode, so you don't need to pfaff around with half and full vectors.
> Obviously what you are suggesting has much wider applications and not
> surprisingly I think Richard Sandiford also pointed out to me that these are
> somewhat related and we might be able to reuse the IV-creation to manage it
> all. But I feel like I am currently light years away from that.
> 
> I had started to look at removing the data_reference updating we have now and
> dealing with this in the 'create_iv' calls from 'vect_create_data_ref_ptr'
> inside 'vectorizable_{load,store}' but then I thought it would be good to
> discuss it with you first. This will require keeping track of the 'end-value'
> of the IV, which for loops where we can skip the previous loop means we will
> need to construct a phi-node containing the updated pointer and the initial
> base. But I'm not entirely sure where to keep track of all this. Also I don't
> know if I can replace the base address of the data_reference right there at
> the 'create_iv' call, can a data_reference be used multiple times in the same
> loop?

Yes, it can.  I think we should see the data-ref IVs more as first-class.

But then we can also forgo completely those IVs and compute the
actual IV values based on the/a canonical IV we have/add.  We'd
then make sure to connect the canonical IV through all loops and
hope for IVOPTs to make sense of the mess we code-generate for
each dataref access ;)  Of course the datarefs in the scalar code
are not based on the canonical IV, but hey, IVOPTs eventually
manages to express them in terms of it.  And for cross-loop IVOPT
the important part would be that it's easy to connect the final/initial
values of adjacent loops.
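The canonical-IV alternative sketched above, in scalar C (illustrative 
name; the inner lane loop stands in for one vector op): every dataref 
address is recomputed from a single counter that is carried through the 
main loop into the epilogue, leaving IVOPTs to strength-reduce the 
address arithmetic.

```c
#include <stddef.h>

int sum_canonical (const short *a, const short *b, size_t n)
{
  int sum = 0;
  size_t iv = 0;                      /* one canonical IV for all loops */
  for (; iv + 8 <= n; iv += 8)        /* "vector" main loop */
    for (size_t l = 0; l < 8; ++l)
      sum += *(a + iv + l) + *(b + iv + l);  /* DRs derived from iv */
  for (; iv < n; ++iv)                /* epilogue continues the same iv */
    sum += *(a + iv) + *(b + iv);
  return sum;
}
```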

Just in case you're out of experimenting ideas.

> I'll go do a bit more nosing around this idea and the ivmap you mentioned
> before. Let me know if you have any ideas on how this all should look like,
> even if its a 'in an ideal world'.

So in an ideal world we'd have a more explicit representation of those
IVs, just like we know about the reductions and inductions.  So
for example for each vector store/load (SLP node or stmt_info) we'd
store the IV we use (maybe in some extra struct so we can track how
many uses we have, whether it's the "main" DR or whether we re-used it
via iv-map).  Then for epilogues we could replicate that meta-data by
creating the IVs in the epilogue.  That would leave the issue of the
scalar pro- and epilogues.

> 
> Andre
> >
> >> On 04/05/2021 10:56, Richard Biener wrote:
> >>> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> The aim of this RFC is to explore a way of cleaning up the codegen
> >>>> around data_references.  To be specific, I'd like to reuse the
> >>>> main-loop's updated data_reference as the base_address for the
> >>>> epilogue's corresponding data_reference, rather than use the niters.
> >>>> We have found this leads to better codegen in the vectorized
> >>>> epilogue loops.
> >>>>
> >>>> The approach in this RFC creates a map of iv_updates which always
> >>>> contains an updated pointer that is captured in
> >>>> vectorizable_{load,store}; an iv_update may also contain a skip_edge
> >>>> in case we decide the vectorization can be skipped in
> >>>> 'vect_do_peeling'. During the epilogue update this map of iv_updates
> >>>> is then checked to see if it contains an entry for a data_reference,
> >>>> which is used accordingly; if not, it reverts to the old behavior of
> >>>> using the niters to advance the data_reference.
> >>>>
> >>>> The motivation for this work is to improve codegen for the option
> >>>> `--param vect-partial-vector-usage=1` for SVE. We found that one of
> >>>> the main problems for the codegen here was coming from unnecessary
> >>>> conversions caused by the way we update the data_references in the
> >>>> epilogue.
> >>>>
> >>>> This patch passes regression tests on aarch64-linux-gnu, but the
> >>>> codegen is still not optimal in some cases, specifically those
> >>>> where we have a scalar epilogue, as this does not use the
> >>>> data_references and will rely on the gimple scalar code, thus
> >>>> constructing again a memory access using the niters.  This is a
> >>>> limitation for which I haven't quite worked out a solution yet,
> >>>> and it does cause some minor regressions due to unfortunate
> >>>> spills.
> >>>>
> >>>> Let me know what you think and if you have ideas of how we can
> >>>> better achieve this.
> >>> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
> >>>
> >>> I think it might be interesting to create a C testcase mimicking the
> >>> update problem without involving the vectorizer.  That way we can
> >>> see how the various components involved behave (FRE + ivopts most
> >>> specifically).
> >>>
> >>> That said, a cleaner approach to dealing with this would be to
> >>> explicitely track the IVs we generate for vectorized DRs, eventually
> >>> factoring that out from vectorizable_{store,load} so we can simply
> >>> carry over the actual pointer IV final value to the epilogue as
> >>> initial value.  For each DR group we'd create a single IV (we can
> >>> even do better in case we have load + store of the "same" group).
> >>>
> >>> We already kind-of track things via the ivexpr_map, but I'm not sure
> >>> if this lazily populated map can be reliably re-used to "re-populate"
> >>> the epilogue one (walk the map, create epilogue IVs with the
> >>> appropriate initial value & adjusted update).
> >>>
> >>> Richard.
> >>>
> >>>> Kind regards,
> >>>> Andre Vieira
> >>>>
> >>>>
> >>>>
> >>
>
Andre Vieira (lists) May 17, 2021, 8:14 a.m. UTC | #6
Hi,

So this is my second attempt at finding a way to improve how we
generate the vector IVs and teach the vectorizer to share them between
the main loop and epilogues. On IRC we discussed my idea to use the
loop's control_iv, but that was a terrible idea and I quickly threw it
in the bin. The main problem, which for some reason I failed to see,
was that the control_iv increases by 's' while the datarefs increase
by 's' * NELEMENTS, where 's' is usually 1 and NELEMENTS is the number
of elements we handle per iteration. That means the epilogue loops
would have to start from the last loop's IV * the last loop's
NELEMENTS, and that would just cause a mess.

Instead I started to think about creating IVs for the datarefs, and
what I thought worked best was to create these in scalar form before
peeling. That way the peeling mechanism takes care of duplicating them
for the vector and scalar epilogues, and it also takes care of adding
phi-nodes for the skip_vector paths.
These new IVs have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
  a) if it's the main loop: replace the 'initial' value of the main
loop's IV and the initial values in the skip_vector phi-nodes;
  b) update the skip_vector phi-nodes' argument for the non-skip path
with the updated vector ptr.

2) They are used for the scalar epilogue, ensuring it shares the same
datareference ptr.

There is still a variety of 'hacky' elements here and a lot of testing
to be done, but I hope to be able to clean them up. One of the main
issues I had was that I had to skip a couple of checks for the added
phi-nodes and update statements, as these do not have a stmt_vec_info
representation.  Though I'm not sure adding this representation at
their creation would have been much cleaner... It is something I could
play around with, but I thought this was a good moment to ask you for
input. For instance, maybe we could do this transformation before
analysis?

Also be aware that because I create an IV for each dataref, this leads
to regressions with SVE codegen, for instance. NEON is able to use the
post-index addressing mode to increment each dr IV at access time, but
SVE can't do this.  To address this, maybe we could try to be smart
and create shared IVs: rather than basing them on the actual vector
ptr, use a shared sizetype IV among dr IVs with the same step. Or
maybe this is something for IVOPTs?

Let me know what ya think!

Kind regards,
Andre
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 8001cc54f518d9d9d1a0fcfe5790d22dae109fb2..939c0a7fefd4355dd75d7646ac2ae63ce23a0e14 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -174,6 +174,8 @@ struct data_reference
 
   /* Alias information for the data reference.  */
   struct dr_alias alias;
+
+  hash_map<loop_p, tree> *iv_bases;
 };
 
 #define DR_STMT(DR)                (DR)->stmt
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 124a7bea6a94161556a6622fa7b113b3cef98bcf..f638bb3e0aa007e0bf7ad8f75fb767d3484b02ce 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1475,6 +1475,7 @@ void
 free_data_ref (data_reference_p dr)
 {
   DR_ACCESS_FNS (dr).release ();
+  delete dr->iv_bases;
   free (dr);
 }
 
@@ -1506,6 +1507,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, gimple *stmt,
   DR_REF (dr) = memref;
   DR_IS_READ (dr) = is_read;
   DR_IS_CONDITIONAL_IN_STMT (dr) = is_conditional_in_stmt;
+  dr->iv_bases = new hash_map<loop_p, tree> ();
 
   dr_analyze_innermost (&DR_INNERMOST (dr), memref,
 			nest != NULL ? loop : NULL, stmt);
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 86fc118b6befb06233e5e86a01454fd7075075e1..93e14d09763da5034ba97d09b07c94c20fe25a28 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -24,6 +24,8 @@ typedef void (*transform_callback)(class loop *, void *);
 
 extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
 		       bool, tree *, tree *);
+extern void create_or_update_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
+				  bool, tree *, tree *, gphi *, bool);
 extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
 					    class loop *);
 extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 28ae1316fa0eb6939a45d15e893b7386622ba60c..1709e175c382ef5d74c2f628a61c9fffe26f726d 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -57,9 +57,10 @@ static bitmap_obstack loop_renamer_obstack;
    VAR_AFTER (unless they are NULL).  */
 
 void
-create_iv (tree base, tree step, tree var, class loop *loop,
-	   gimple_stmt_iterator *incr_pos, bool after,
-	   tree *var_before, tree *var_after)
+create_or_update_iv (tree base, tree step, tree var, class loop *loop,
+		     gimple_stmt_iterator *incr_pos, bool after,
+		     tree *var_before, tree *var_after, gphi *update_phi,
+		     bool epilogue)
 {
   gassign *stmt;
   gphi *phi;
@@ -153,11 +154,64 @@ create_iv (tree base, tree step, tree var, class loop *loop,
   if (stmts)
     gsi_insert_seq_on_edge_immediate (pe, stmts);
 
+  if (update_phi) {
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi;
+      imm_use_iterator imm_iter;
+      gimple *stmt;
+      tree old_va = PHI_ARG_DEF_FROM_EDGE (update_phi, loop_latch_edge (loop));
+      tree old_init = PHI_ARG_DEF_FROM_EDGE (update_phi, loop_preheader_edge (loop));
+      tree type = TREE_TYPE (old_va);
+      tree va_conv = fold_convert (type, va);
+
+      gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (va));
+      va_conv = force_gimple_operand (va_conv, &stmts, true, NULL_TREE);
+
+      if (stmts)
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+
+
+      FOR_EACH_IMM_USE_STMT (stmt, imm_iter, old_va) {
+	  if (!flow_bb_inside_loop_p (loop, stmt->bb))
+	    {
+	      /* We added this IV, so the outside uses should all be in PHI's
+		 added for skip_vector handling.  */
+	      gcc_assert (gimple_code (stmt) == GIMPLE_PHI);
+	      use_operand_p use_p;
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
+		SET_USE (use_p, va_conv);
+
+	    }
+      }
+      if (epilogue)
+	initial = old_init;
+      else {
+	  tree initial_conv = fold_convert (type, initial);
+	  stmts = NULL;
+	  initial_conv = force_gimple_operand (initial_conv, &stmts, true,
+					       NULL_TREE);
+	  if (stmts)
+	    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
+					      stmts);
+	  replace_uses_by (old_init, initial_conv);
+	  release_ssa_name_fn (cfun, old_init);
+      }
+  }
   phi = create_phi_node (vb, loop->header);
   add_phi_arg (phi, initial, loop_preheader_edge (loop), UNKNOWN_LOCATION);
   add_phi_arg (phi, va, loop_latch_edge (loop), UNKNOWN_LOCATION);
 }
 
+void
+create_iv (tree base, tree step, tree var, class loop *loop,
+	   gimple_stmt_iterator *incr_pos, bool after,
+	   tree *var_before, tree *var_after)
+{
+  create_or_update_iv (base, step, var, loop, incr_pos, after, var_before,
+		       var_after, NULL, false);
+}
+
+
 /* Return the innermost superloop LOOP of USE_LOOP that is a superloop of
    both DEF_LOOP and USE_LOOP.  */
 
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 97c8577ebe720cf857d811c70eccf60c7655509c..87ca4891df792af73f226ebfb1ce6753818ec87e 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4941,11 +4941,14 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
 	}
 
       standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+      gphi *update_phi
+	= as_a<gphi *> (SSA_NAME_DEF_STMT (*dr->iv_bases->get (loop)));
 
-      create_iv (aggr_ptr_init,
+      create_or_update_iv (aggr_ptr_init,
 		 fold_convert (aggr_ptr_type, iv_step),
 		 aggr_ptr, loop, &incr_gsi, insert_after,
-		 &indx_before_incr, &indx_after_incr);
+		 &indx_before_incr, &indx_after_incr, update_phi,
+		 LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) != NULL);
       incr = gsi_stmt (incr_gsi);
 
       /* Copy the points-to information if it exists. */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 012f48bd4870125c820049b4fc70db0ef0759bdf..89e0db5d8adcd9afa709dd37a1842ec36dfa13e6 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1402,6 +1402,10 @@ find_loop_location (class loop *loop)
 static bool
 iv_phi_p (stmt_vec_info stmt_info)
 {
+  /* This is to make sure we don't try to vectorize the IV's added by
+     the vectorizer for datarefs.  */
+  if (stmt_info == NULL)
+    return false;
   gphi *phi = as_a <gphi *> (stmt_info->stmt);
   if (virtual_operand_p (PHI_RESULT (phi)))
     return false;
@@ -1439,6 +1443,10 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
 
       gphi *phi = gsi.phi ();
       stmt_vec_info phi_info = loop_vinfo->lookup_stmt (phi);
+
+      if (!phi_info)
+	continue;
+
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "Analyze phi: %G",
 			 phi_info->stmt);
@@ -2280,7 +2288,8 @@ slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
 static void
 slpeel_update_phi_nodes_for_guard1 (class loop *skip_loop,
 				    class loop *update_loop,
-				    edge guard_edge, edge merge_edge)
+				    edge guard_edge, edge merge_edge,
+				    vec<data_reference_p> *datarefs)
 {
   location_t merge_loc, guard_loc;
   edge orig_e = loop_preheader_edge (skip_loop);
@@ -2310,6 +2319,25 @@ slpeel_update_phi_nodes_for_guard1 (class loop *skip_loop,
 
       /* Update phi in UPDATE_PHI.  */
       adjust_phi_and_debug_stmts (update_phi, update_e, new_res);
+
+      if (datarefs) {
+	  struct data_reference *dr;
+	  unsigned int i;
+
+	  FOR_EACH_VEC_ELT (*datarefs, i, dr) {
+	      if (*(dr->iv_bases->get (skip_loop)) == PHI_RESULT (orig_phi)) {
+		  tree &iv_base = dr->iv_bases->get_or_insert (update_loop);
+		  iv_base = PHI_RESULT (update_phi);
+		  /*
+		  tree new_dummy_var = make_ssa_name (TREE_TYPE (iv_base));
+		  SET_PHI_ARG_DEF (update_phi,
+				   loop_latch_edge (update_loop)->dest_idx,
+				   new_dummy_var);
+				   */
+	      }
+	  }
+	}
+
     }
 }
 
@@ -2831,7 +2859,10 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 					   irred_flag);
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
-	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
+	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e,
+					      NULL);
+	  /* TODO: create dummy phi-nodes and iv-bases for prolog.  Update
+	     'loop's iv-bases.  */
 
 	  scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
 	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
@@ -2931,7 +2962,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  skip_e = guard_e;
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
-	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
+	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e,
+					      &LOOP_VINFO_DATAREFS (loop_vinfo));
 
 	  /* Simply propagate profile info from guard_bb to guard_to which is
 	     a merge point of control flow.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3e973e774af8f9205be893e01ad9263281116885..638f620db96438b67512aa0d364a4dde57e1ea42 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -9264,11 +9264,6 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
   free (LOOP_VINFO_BBS (epilogue_vinfo));
   LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
 
-  /* Advance data_reference's with the number of iterations of the previous
-     loop and its prologue.  */
-  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
-
-
   /* The EPILOGUE loop is a copy of the original loop so they share the same
      gimple UIDs.  In this loop we update the loop_vec_info of the EPILOGUE to
      point to the copied statements.  We also create a mapping of all LHS' in
@@ -9281,7 +9276,9 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
 	{
 	  new_stmt = epilogue_phi_gsi.phi ();
 
-	  gcc_assert (gimple_uid (new_stmt) > 0);
+	  if (gimple_uid (new_stmt) == 0)
+	    continue;
+
 	  stmt_vinfo
 	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
 
@@ -9299,10 +9296,9 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
 	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
 	{
 	  new_stmt = gsi_stmt (epilogue_gsi);
-	  if (is_gimple_debug (new_stmt))
+	  if (is_gimple_debug (new_stmt) || gimple_uid (new_stmt) == 0)
 	    continue;
 
-	  gcc_assert (gimple_uid (new_stmt) > 0);
 	  stmt_vinfo
 	    = epilogue_vinfo->stmt_vec_infos[gimple_uid (new_stmt) - 1];
 
@@ -9484,6 +9480,51 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   tree advance;
   drs_init_vec orig_drs_init;
 
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) == NULL)
+    {
+      struct data_reference *dr;
+      unsigned int i;
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi = gsi_for_stmt (get_loop_exit_condition (loop));
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_DATAREFS (loop_vinfo), i, dr) {
+	  tree &iv_base = dr->iv_bases->get_or_insert (loop);
+	  tree type = TREE_TYPE (DR_BASE_ADDRESS (dr));
+	  gphi *phi = create_phi_node (type, loop->header);
+
+	  /* Create an update.  */
+	  int step_i = int_cst_value (DR_STEP (dr));
+	  step_i = step_i / int_cst_value (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr))));
+	  tree step = build_int_cst (sizetype, step_i);
+	  tree update = fold_build2 (POINTER_PLUS_EXPR, type,
+				     PHI_RESULT (phi),
+				     step);
+	  stmts = NULL;
+	  update = force_gimple_operand (update, &stmts, true, NULL_TREE);
+	  gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+	  /* Replace IV in ref. */
+	  tree mem = fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)),
+				  PHI_RESULT (phi),
+				  build_zero_cst (type));
+	  for (int j = 0; j < gimple_num_ops (DR_STMT (dr)); ++j) {
+	      tree op = gimple_op (DR_STMT (dr), j);
+	     if (op == DR_REF (dr))
+		gimple_set_op (DR_STMT (dr), j, mem);
+	      else
+		simplify_replace_tree (op, DR_REF (dr), mem, NULL, NULL, false);
+	  }
+	  update_stmt (DR_STMT (dr));
+
+	  tree dummy_base = make_ssa_name (type);
+	  add_phi_arg (phi, update, single_succ_edge (loop->latch),
+		       UNKNOWN_LOCATION);
+	  add_phi_arg (phi, dummy_base, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+	  iv_base = PHI_RESULT (phi);
+      }
+    }
+
+
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
 			      check_profitability, niters_no_overflow,
Richard Biener May 20, 2021, 10:22 a.m. UTC | #7
On Mon, 17 May 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> So this is my second attempt at finding a way to improve how we generate the
> vector IV's and teach the vectorizer to share them between main loop and
> epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
> was a terrible idea and I quickly threw it in the bin. The main problem, that
> for some reason I failed to see, was that the control_iv increases by 's' and
> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> amount of elements we handle per iteration. That means the epilogue loops
> would have to start from the last loop's IV * the last loop's NELEMENT's and
> that would just cause a mess.
> 
> Instead I started to think about creating IV's for the datarefs and what I
> thought worked best was to create these in scalar before peeling. That way the
> peeling mechanisms takes care of the duplication of these for the vector and
> scalar epilogues and it also takes care of adding phi-nodes for the
> skip_vector paths.

How does this work for if-converted loops where we use the 
non-if-converted scalar loop for (scalar) peeling but the
if-converted scalar loop for vectorized epilogues?  I suppose
you're only adjusting the if-converted copy.

> These new IV's have two functions:
> 1) 'vect_create_data_ref_ptr' can use them to:
>  a) if it's the main loop: replace the values of the 'initial' value of the
> main loop's IV and the initial values in the skip_vector phi-nodes
>  b) Update the the skip_vector phi-nodes argument for the non-skip path with
> the updated vector ptr.

b) means the prologue IV will not be dead there so we actually need
to compute it?  I suppose IVOPTs could be taught to replace an
IV with its final value (based on some other IV) when it's unused?
Or does it already magically do good?

> 2) They are used for the scalar epilogue ensuring they share the same
> datareference ptr.
> 
> There are still a variety of 'hacky' elements here and a lot of testing to be
> done, but I hope to be able to clean them away. One of the main issues I had
> was that I had to skip a couple of checks and things for the added phi-nodes
> and update statements as these do not have stmt_vec_info representation. 
> Though I'm not sure adding this representation at their creation was much
> cleaner... It is something I could play around with but I thought this was a
> good moment to ask you for input. For instance, maybe we could do this
> transformation before analysis?
> 
> Also be aware that because I create a IV for each dataref this leads to
> regressions with SVE codegen for instance. NEON is able to use the post-index
> addressing mode to increase each dr IV at access time, but SVE can't do this. 
> For this I don't know if maybe we could try to be smart and create shared
> IV's. So rather than make them based on the actual vector ptr, use a shared
> sizetype IV that can be shared among dr IV's with the same step. Or maybe this
> is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well.  But since IVOPTs only
considers a single loop at a time it will probably not pay too
much attention and is only influenced by the out-of-loop uses of
the final values of the IVs.

My gut feeling tells me that whatever we do we'll have to look
into improving IVOPTs to consider multiple loops.

> Let me know what ya think!

Now as to the patch itself.  We shouldn't amend struct data_reference,
try to use dr_vec_info instead.

--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4941,11 +4941,14 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
        }

       standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+      gphi *update_phi
+       = as_a<gphi *> (SSA_NAME_DEF_STMT (*dr->iv_bases->get (loop)));


and avoid this by having some explicit representation of IVs.

iv_bases is a map that exists for each DR and maps the loop to the
IV def it seems.  I'd have added the map to loop_vinfo, mapping
the (vect-)DR to the IV [def].

That probably makes the convenience of transforming the scalar
loop before peeling go away, but I wonder whether that's a good
idea anyway.

@@ -9484,6 +9480,51 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   tree advance;
   drs_init_vec orig_drs_init;

+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) == NULL)
+    {
+      struct data_reference *dr;
+      unsigned int i;
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi = gsi_for_stmt (get_loop_exit_condition (loop));
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_DATAREFS (loop_vinfo), i, dr) {
+         tree &iv_base = dr->iv_bases->get_or_insert (loop);

there's a comment missing - does this try to replace the
IVs used by the scalar DRs in the non-if-converted loop by
the new artificial IVs?

I notice that even for

void foo (int * __restrict x, int *y, int n1, int n2)
{
  int i;
  for (i = 0; i < n1; ++i)
    x[i] += y[i];
  for (; i < n2; ++i)
    x[i] -= y[i];
}

on x86 we're unable to re-use the IV from the first loop but
end up with

.L3:
        movl    (%rsi,%rax,4), %r8d
        addl    %r8d, (%rdi,%rax,4)
        addq    $1, %rax
        cmpq    %rax, %r9
        jne     .L3
.L2:
        cmpl    %edx, %ecx
        jle     .L1
        movslq  %edx, %rdx  <--- init from n1
        .p2align 4,,10
        .p2align 3
.L5:
        movl    (%rsi,%rdx,4), %eax
        subl    %eax, (%rdi,%rdx,4)
        addq    $1, %rdx
        cmpl    %edx, %ecx
        jg      .L5

we're also making it difficult for IVOPTs to eventually
see the connection since we've propagated the final value
of the IV from the first loop.  Disabling SCCP makes it
even worse though.

I wonder how other compilers deal with this situation...

And how we prevent IVOPTs from messing things up again.

Richard.
Andre Vieira (lists) June 14, 2021, 8:10 a.m. UTC | #8
Hi,


On 20/05/2021 11:22, Richard Biener wrote:
> On Mon, 17 May 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>> So this is my second attempt at finding a way to improve how we generate the
>> vector IV's and teach the vectorizer to share them between main loop and
>> epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
>> was a terrible idea and I quickly threw it in the bin. The main problem, that
>> for some reason I failed to see, was that the control_iv increases by 's' and
>> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
>> amount of elements we handle per iteration. That means the epilogue loops
>> would have to start from the last loop's IV * the last loop's NELEMENT's and
>> that would just cause a mess.
>>
>> Instead I started to think about creating IV's for the datarefs and what I
>> thought worked best was to create these in scalar before peeling. That way the
>> peeling mechanisms takes care of the duplication of these for the vector and
>> scalar epilogues and it also takes care of adding phi-nodes for the
>> skip_vector paths.
> How does this work for if-converted loops where we use the
> non-if-converted scalar loop for (scalar) peeling but the
> if-converted scalar loop for vectorized epilogues?  I suppose
> you're only adjusting the if-converted copy.
True, I hadn't thought about this :(
>
>> These new IV's have two functions:
>> 1) 'vect_create_data_ref_ptr' can use them to:
>>   a) if it's the main loop: replace the values of the 'initial' value of the
>> main loop's IV and the initial values in the skip_vector phi-nodes
>>   b) Update the the skip_vector phi-nodes argument for the non-skip path with
>> the updated vector ptr.
> b) means the prologue IV will not be dead there so we actually need
> to compute it?  I suppose IVOPTs could be teached to replace an
> IV with its final value (based on some other IV) when it's unused?
> Or does it already magically do good?
It does not and ...
>
>> 2) They are used for the scalar epilogue ensuring they share the same
>> datareference ptr.
>>
>> There are still a variety of 'hacky' elements here and a lot of testing to be
>> done, but I hope to be able to clean them away. One of the main issues I had
>> was that I had to skip a couple of checks and things for the added phi-nodes
>> and update statements as these do not have stmt_vec_info representation.
>> Though I'm not sure adding this representation at their creation was much
>> cleaner... It is something I could play around with but I thought this was a
>> good moment to ask you for input. For instance, maybe we could do this
>> transformation before analysis?
>>
>> Also be aware that because I create a IV for each dataref this leads to
>> regressions with SVE codegen for instance. NEON is able to use the post-index
>> addressing mode to increase each dr IV at access time, but SVE can't do this.
>> For this I don't know if maybe we could try to be smart and create shared
>> IV's. So rather than make them based on the actual vector ptr, use a shared
>> sizetype IV that can be shared among dr IV's with the same step. Or maybe this
>> is something for IVOPTs?
> Certainly IVOPTs could decide to use the newly created IVs in the
> scalar loops for the DRs therein as well.  But since IVOPTs only
> considers a single loop at a time it will probably not pay too
> much attention and is only influenced by the out-of-loop uses of
> the final values of the IVs.
>
> My gut feeling tells me that whatever we do we'll have to look
> into improving IVOPTs to consider multiple loops.

So I redid the IV-sharing and it's looking a lot simpler and neater;
however, it only shares IVs between vectorized loops and not scalar
pro- or epilogues. I am not certain IVOPTs will be able to deal with
these, as it has no knowledge of the number of iterations of each
different loop. Take for instance a prologue loop peeled for alignment
followed by the first main vectorization loop: to be able to reuse the
IVs from the prologue in the main vectorization loop, IVOPTs would
need to know that the initial start address + PEELING_NITERS == the
base address for the main vectorization loop.

I'll start testing this approach for correctness if there are no major
concerns. Though I suspect we will only want to turn this into a patch
once we have the IVOPTs work done to a point where it at least doesn't
regress codegen because of shared IVs, and eventually we can look at
how to solve the sharing between vectorized and scalar loops.

A small nitpick on my own RFC: I will probably move the 'skip_e' out
of the map, as we only need one per loop_vinfo and not one per DR.
Initially I didn't even have this skip_e in, but was using the
creation of a dummy PHI node and then replacing it with the real thing
later. That approach did make the code simpler, especially when
inserting the 'init's stmt_list.

Kind regards,
Andre
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b317df532a9a92a619de9572378437d09c632ab0..e7d0f1e657b1a0c9bec75799817242e0bc1d8282 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4909,11 +4909,27 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
 						   offset, byte_offset);
   if (new_stmt_list)
     {
-      if (pe)
-        {
-          new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmt_list);
-          gcc_assert (!new_bb);
-        }
+      if (loop_vinfo)
+	{
+	  iv_struct *iv = loop_vinfo->dr_ivs->get (dr);
+	  if (iv && iv->skip_e)
+	    {
+	      /* The initial value of the IV we are inserting here will be used
+		 in a PHI-node that will be used as the initial IV for the next
+		 vectorized epilogue, so we have to make sure we insert this
+		 NEW_STMT_LIST before the SKIP_E.  */
+	      gimple_stmt_iterator g = gsi_last_bb (iv->skip_e->src);
+	      if (gimple_code (gsi_stmt (g)) == GIMPLE_COND)
+		gsi_insert_seq_before (&g, new_stmt_list, GSI_NEW_STMT);
+	      else
+		gsi_insert_seq_after (&g, new_stmt_list, GSI_NEW_STMT);
+	    }
+	  else
+	    {
+	      new_bb = gsi_insert_seq_on_edge_immediate (pe, new_stmt_list);
+	      gcc_assert (!new_bb);
+	    }
+	}
       else
         gsi_insert_seq_before (gsi, new_stmt_list, GSI_SAME_STMT);
     }
@@ -4946,12 +4962,55 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
 
       standard_iv_increment_position (loop, &incr_gsi, &insert_after);
 
+      /* If this is an epilogue, make sure to use the previous loop's IV for
+	  the same DR as the init of this DR's IV.  */
+      if (loop_vinfo
+	  && LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+	{
+	  iv_struct *iv
+	    = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)->dr_ivs->get (dr);
+	  gcc_assert (iv);
+	  aggr_ptr_init = iv->def;
+	}
+
       create_iv (aggr_ptr_init,
 		 fold_convert (aggr_ptr_type, iv_step),
 		 aggr_ptr, loop, &incr_gsi, insert_after,
 		 &indx_before_incr, &indx_after_incr);
       incr = gsi_stmt (incr_gsi);
 
+      if (loop_vinfo)
+	{
+	  iv_struct &iv = loop_vinfo->dr_ivs->get_or_insert (dr);
+	  tree incr_lhs = gimple_get_lhs (incr);
+	  if (iv.skip_e)
+	    {
+	      edge e;
+	      edge_iterator ei;
+	      tree val;
+	      /* Create a skip PHI, use INCR_LHS as the value for the not-skip
+		 path(s) and AGGR_PTR_INIT as the value for the skip path.  */
+	      iv.def = copy_ssa_name (incr_lhs);
+	      gphi *skip_phi = create_phi_node (iv.def,
+						iv.skip_e->dest);
+
+	      FOR_EACH_EDGE (e, ei, skip_phi->bb->preds)
+		{
+		  if (e == iv.skip_e)
+		    val = aggr_ptr_init;
+		  else
+		    val = incr_lhs;
+		  add_phi_arg (skip_phi, val, e, UNKNOWN_LOCATION);
+		}
+	    }
+	  else
+	    {
+	      /* No skip PHI has been created, so we can use INCR_LHS as the
+		 IV.DEF.  */
+	      iv.def = incr_lhs;
+	    }
+	}
+
       /* Copy the points-to information if it exists. */
       if (DR_PTR_INFO (dr))
 	{
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 012f48bd4870125c820049b4fc70db0ef0759bdf..2475ab8b8263499f3471b64aa10b956523fe56df 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2554,8 +2554,7 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow,
-		 tree *advance)
+		 bool check_profitability, bool niters_no_overflow)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -3059,10 +3058,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  niters = PHI_RESULT (new_phi);
 	}
 
-      /* Set ADVANCE to the number of iterations performed by the previous
-	 loop and its prologue.  */
-      *advance = niters;
-
       if (!vect_epilogues_updated_niters)
 	{
 	  /* Subtract the number of iterations performed by the vectorized loop
@@ -3090,6 +3085,21 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   adjust_vec.release ();
   free_original_copy_tables ();
 
+  if (vect_epilogues && skip_e)
+    {
+	/* Keep track of the SKIP_E to later construct a phi-node per
+	   data reference.  */
+	unsigned int i;
+	vec<data_reference_p> datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
+	struct data_reference *dr;
+
+	FOR_EACH_VEC_ELT (datarefs, i, dr)
+	  {
+	    iv_struct &iv = loop_vinfo->dr_ivs->get_or_insert (dr);
+	    iv.skip_e = skip_e;
+	  }
+    }
+
   return vect_epilogues ? epilog : NULL;
 }
 
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ba36348b835c25bc556da71a133f81f8a6fc3745..fc2468ebc001935d86fee9b06855a02832b444ad 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -901,6 +901,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     }
 
   epilogue_vinfos.create (6);
+  dr_ivs = new hash_map<data_reference *, iv_struct> ();
 }
 
 /* Free all levels of rgroup CONTROLS.  */
@@ -927,6 +928,7 @@ _loop_vec_info::~_loop_vec_info ()
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
+  delete dr_ivs;
 
   /* When we release an epiloge vinfo that we do not intend to use
      avoid clearing AUX of the main loop which should continue to
@@ -9252,7 +9254,7 @@ find_in_mapping (tree t, void *context)
    prologue of the main loop.  */
 
 static void
-update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
+update_epilogue_loop_vinfo (class loop *epilogue)
 {
   loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
   auto_vec<gimple *> stmt_worklist;
@@ -9267,11 +9269,6 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
   free (LOOP_VINFO_BBS (epilogue_vinfo));
   LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
 
-  /* Advance data_reference's with the number of iterations of the previous
-     loop and its prologue.  */
-  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
-
-
   /* The EPILOGUE loop is a copy of the original loop so they share the same
      gimple UIDs.  In this loop we update the loop_vec_info of the EPILOGUE to
      point to the copied statements.  We also create a mapping of all LHS' in
@@ -9489,8 +9486,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow,
-			      &advance);
+			      check_profitability, niters_no_overflow);
 
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
@@ -9818,7 +9814,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   if (epilogue)
     {
-      update_epilogue_loop_vinfo (epilogue, advance);
+      update_epilogue_loop_vinfo (epilogue);
 
       epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 7dcb4cd0b46b03eef90705eed776d9c3dd797101..da7e2fe72a549634d3cb197e7136de995e20c2b0 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -545,6 +545,12 @@ typedef auto_vec<rgroup_controls> vec_loop_lens;
 
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
+struct iv_struct
+{
+  tree def;
+  edge skip_e;
+};
+
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
 /*-----------------------------------------------------------------*/
@@ -756,6 +762,8 @@ public:
      analysis.  */
   vec<_loop_vec_info *> epilogue_vinfos;
 
+  hash_map<data_reference *, iv_struct> *dr_ivs;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -1791,8 +1799,7 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info, gimple *);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				    tree *, tree *, tree *, int, bool, bool,
-				    tree *);
+				    tree *, tree *, tree *, int, bool, bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
Richard Biener June 14, 2021, 8:37 a.m. UTC | #9
On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> 
> On 20/05/2021 11:22, Richard Biener wrote:
> > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> So this is my second attempt at finding a way to improve how we generate
> >> the
> >> vector IV's and teach the vectorizer to share them between main loop and
> >> epilogues. On IRC we discussed my idea to use the loop's control_iv, but
> >> that
> >> was a terrible idea and I quickly threw it in the bin. The main problem,
> >> that
> >> for some reason I failed to see, was that the control_iv increases by 's'
> >> and
> >> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> >> amount of elements we handle per iteration. That means the epilogue loops
> >> would have to start from the last loop's IV * the last loop's NELEMENT's
> >> and
> >> that would just cause a mess.
> >>
> >> Instead I started to think about creating IV's for the datarefs and what I
> >> thought worked best was to create these in scalar before peeling. That way
> >> the
> >> peeling mechanisms takes care of the duplication of these for the vector
> >> and
> >> scalar epilogues and it also takes care of adding phi-nodes for the
> >> skip_vector paths.
> > How does this work for if-converted loops where we use the
> > non-if-converted scalar loop for (scalar) peeling but the
> > if-converted scalar loop for vectorized epilogues?  I suppose
> > you're only adjusting the if-converted copy.
> True hadn't thought about this :(
> >
> >> These new IV's have two functions:
> >> 1) 'vect_create_data_ref_ptr' can use them to:
> >>  a) if it's the main loop: replace the values of the 'initial' value of the
> >> main loop's IV and the initial values in the skip_vector phi-nodes
> >> b) Update the skip_vector phi-nodes argument for the non-skip path
> >> with
> >> the updated vector ptr.
> > b) means the prologue IV will not be dead there so we actually need
> > to compute it?  I suppose IVOPTs could be teached to replace an
> > IV with its final value (based on some other IV) when it's unused?
> > Or does it already magically do good?
> It does not and ...
> >
> >> 2) They are used for the scalar epilogue ensuring they share the same
> >> datareference ptr.
> >>
> >> There are still a variety of 'hacky' elements here and a lot of testing to
> >> be
> >> done, but I hope to be able to clean them away. One of the main issues I
> >> had
> >> was that I had to skip a couple of checks and things for the added
> >> phi-nodes
> >> and update statements as these do not have stmt_vec_info representation.
> >> Though I'm not sure adding this representation at their creation was much
> >> cleaner... It is something I could play around with but I thought this was
> >> a
> >> good moment to ask you for input. For instance, maybe we could do this
> >> transformation before analysis?
> >>
> >> Also be aware that because I create a IV for each dataref this leads to
> >> regressions with SVE codegen for instance. NEON is able to use the
> >> post-index
> >> addressing mode to increase each dr IV at access time, but SVE can't do
> >> this.
> >> For this I don't know if maybe we could try to be smart and create shared
> >> IV's. So rather than make them based on the actual vector ptr, use a shared
> >> sizetype IV that can be shared among dr IV's with the same step. Or maybe
> >> this
> >> is something for IVOPTs?
> > Certainly IVOPTs could decide to use the newly created IVs in the
> > scalar loops for the DRs therein as well.  But since IVOPTs only
> > considers a single loop at a time it will probably not pay too
> > much attention and is only influenced by the out-of-loop uses of
> > the final values of the IVs.
> >
> > My gut feeling tells me that whatever we do we'll have to look
> > into improving IVOPTs to consider multiple loops.
> 
> So I redid the IV-sharing and it's looking a lot simpler and neater, however
> it only shares IVs between vectorized loops and not scalar pro- or epilogues.
> I am not certain IVOPTs will be able to deal with these, as it has no
> knowledge of the number of iterations of each different loop. So take for
> instance a prologue peeling for alignment loop and a first main vectorization
> loop. To be able to reuse the IV's from the prologue in the main vectorization
> loop it would need to know that the initial start address + PEELING_NITERS ==
> base address for main vectorization loop.

Yes, indeed.  I think what we may want to do is to make this CSE/PRE
"easy" when IVOPTs manages to choose compatible IVs.  Since we generally
have bypass jumps around the various loops this boils down to PRE
since the redundancies only appear on the loop-to-loop fallthru edges
which means we won't see those opportunities on GIMPLE (unless we
realize the PRE opportunities in the actual vectorizer generated code).
But eventually RTL PRE does all the magic.

Iff we'd manage to present IVOPTs with out-of-loop IV uses that are
actual initial values of IVs in other loops that might be a key trick
to bias its cost model.  But then in the end I think it's much more
important to choose the best IV for the loop bodies rather than
paying attention to IV base setup costs for adjacent loops
(in the vectorizer case we of course know the vectorized epilogue
loops usually do not iterate and the final scalar epilogue iterates
only a few times...).

> I'll starting testing this approach for correctness if there are no major
> concerns. Though I suspect we will only want to turn this into a patch once we
> have the IVOPTs work done to a point where it at least doesn't regress codegen
> because of shared IVs and eventually we can look at how to solve the sharing
> between vectorized and scalar loops.

Indeed.  For example a simple

int a[1024], b[1024], c[1024];

void foo(int n)
{
  for (int i = 0; i < n; ++i)
    a[i+1] += c[i+i] ? b[i+1] : 0;
}

should usually see peeling for alignment (though on x86 you need
exotic -march= since cost models generally have equal aligned and
unaligned access costs).  For example with -mavx2 -mtune=atom
we'll see an alignment peeling prologue, a AVX2 vector loop,
a SSE2 vectorized epilogue and a scalar epilogue.  It also
shows the original scalar loop being used in the scalar prologue
and epilogue.

We're not even trying to make the counting IV easily used
across loops (we're not counting scalar iterations in the
vector loops).

> A small nitpick on my own RFC. I will probably move the 'skip_e' to outside of
> the map, as we only need one per loop_vinfo and not one per DR. Initially I
> didn't even have this skip_e in, but was using the creation of a dummy PHI node
> and then replacing it with the real thing later. Though this made the code
> simpler, especially when inserting the 'init's stmt_list.
> 
> Kind regards,
> Andre
> 
>
Richard Biener June 14, 2021, 10:57 a.m. UTC | #10
On Mon, 14 Jun 2021, Richard Biener wrote:

> On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:
> 
> > Hi,
> > 
> > 
> > On 20/05/2021 11:22, Richard Biener wrote:
> > > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> > >
> > >> Hi,
> > >>
> > >> So this is my second attempt at finding a way to improve how we generate
> > >> the
> > >> vector IV's and teach the vectorizer to share them between main loop and
> > >> epilogues. On IRC we discussed my idea to use the loop's control_iv, but
> > >> that
> > >> was a terrible idea and I quickly threw it in the bin. The main problem,
> > >> that
> > >> for some reason I failed to see, was that the control_iv increases by 's'
> > >> and
> > >> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> > >> amount of elements we handle per iteration. That means the epilogue loops
> > >> would have to start from the last loop's IV * the last loop's NELEMENT's
> > >> and
> > >> that would just cause a mess.
> > >>
> > >> Instead I started to think about creating IV's for the datarefs and what I
> > >> thought worked best was to create these in scalar before peeling. That way
> > >> the
> > >> peeling mechanisms takes care of the duplication of these for the vector
> > >> and
> > >> scalar epilogues and it also takes care of adding phi-nodes for the
> > >> skip_vector paths.
> > > How does this work for if-converted loops where we use the
> > > non-if-converted scalar loop for (scalar) peeling but the
> > > if-converted scalar loop for vectorized epilogues?  I suppose
> > > you're only adjusting the if-converted copy.
> > True hadn't thought about this :(
> > >
> > >> These new IV's have two functions:
> > >> 1) 'vect_create_data_ref_ptr' can use them to:
> > >>  a) if it's the main loop: replace the values of the 'initial' value of the
> > >> main loop's IV and the initial values in the skip_vector phi-nodes
> > >> b) Update the skip_vector phi-nodes argument for the non-skip path
> > >> with
> > >> the updated vector ptr.
> > > b) means the prologue IV will not be dead there so we actually need
> > > to compute it?  I suppose IVOPTs could be teached to replace an
> > > IV with its final value (based on some other IV) when it's unused?
> > > Or does it already magically do good?
> > It does not and ...
> > >
> > >> 2) They are used for the scalar epilogue ensuring they share the same
> > >> datareference ptr.
> > >>
> > >> There are still a variety of 'hacky' elements here and a lot of testing to
> > >> be
> > >> done, but I hope to be able to clean them away. One of the main issues I
> > >> had
> > >> was that I had to skip a couple of checks and things for the added
> > >> phi-nodes
> > >> and update statements as these do not have stmt_vec_info representation.
> > >> Though I'm not sure adding this representation at their creation was much
> > >> cleaner... It is something I could play around with but I thought this was
> > >> a
> > >> good moment to ask you for input. For instance, maybe we could do this
> > >> transformation before analysis?
> > >>
> > >> Also be aware that because I create a IV for each dataref this leads to
> > >> regressions with SVE codegen for instance. NEON is able to use the
> > >> post-index
> > >> addressing mode to increase each dr IV at access time, but SVE can't do
> > >> this.
> > >> For this I don't know if maybe we could try to be smart and create shared
> > >> IV's. So rather than make them based on the actual vector ptr, use a shared
> > >> sizetype IV that can be shared among dr IV's with the same step. Or maybe
> > >> this
> > >> is something for IVOPTs?
> > > Certainly IVOPTs could decide to use the newly created IVs in the
> > > scalar loops for the DRs therein as well.  But since IVOPTs only
> > > considers a single loop at a time it will probably not pay too
> > > much attention and is only influenced by the out-of-loop uses of
> > > the final values of the IVs.
> > >
> > > My gut feeling tells me that whatever we do we'll have to look
> > > into improving IVOPTs to consider multiple loops.
> > 
> > So I redid the IV-sharing and it's looking a lot simpler and neater, however
> > it only shares IVs between vectorized loops and not scalar pro- or epilogues.
> > I am not certain IVOPTs will be able to deal with these, as it has no
> > knowledge of the number of iterations of each different loop. So take for
> > instance a prologue peeling for alignment loop and a first main vectorization
> > loop. To be able to reuse the IV's from the prologue in the main vectorization
> > loop it would need to know that the initial start address + PEELING_NITERS ==
> > base address for main vectorization loop.
> 
> Yes, indeed.  I think what we may want to do is to make this CSE/PRE
> "easy" when IVOPTs manages to choose compatible IVs.  Since we generally
> have bypass jumps around the various loops this boils down to PRE
> since the redundancies only appear on the loop-to-loop fallthru edges
> which means we won't see those opportunities on GIMPLE (unless we
> realize the PRE opportunities in the actual vectorizer generated code).
> But eventually RTL PRE does all the magic.
> 
> Iff we'd manage to present IVOPTs with out-of-loop IV uses that are
> actual initial values of IVs in other loops that might be a key trick
> to bias its cost model.  But then in the end I think it's much more
> important to choose the best IV for the loop bodies rather than
> paying attention to IV base setup costs for adjacent loops
> (in the vectorizer case we of course know the vectorized epilogue
> loops usually do not iterate and the final scalar epilogue iterates
> only a few times...).
> 
> > I'll starting testing this approach for correctness if there are no major
> > concerns. Though I suspect we will only want to turn this into a patch once we
> > have the IVOPTs work done to a point where it at least doesn't regress codegen
> > because of shared IVs and eventually we can look at how to solve the sharing
> > between vectorized and scalar loops.
> 
> Indeed.  For example a simple
> 
> int a[1024], b[1024], c[1024];
> 
> void foo(int n)
> {
>   for (int i = 0; i < n; ++i)
>     a[i+1] += c[i+i] ? b[i+1] : 0;
> }
> 
> should usually see peeling for alignment (though on x86 you need
> exotic -march= since cost models generally have equal aligned and
> unaligned access costs).  For example with -mavx2 -mtune=atom
> we'll see an alignment peeling prologue, a AVX2 vector loop,
> a SSE2 vectorized epilogue and a scalar epilogue.  It also
> shows the original scalar loop being used in the scalar prologue
> and epilogue.
> 
> We're not even trying to make the counting IV easily used
> across loops (we're not counting scalar iterations in the
> vector loops).

Specifically we see

<bb 33> [local count: 94607391]:
niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
_67 = niters_vector_mult_vf.10_62 + 7;
_64 = (int) niters_vector_mult_vf.10_62;
tmp.11_63 = i_43 + _64;
if (niters.8_45 == niters_vector_mult_vf.10_62)
  goto <bb 37>; [12.50%]
else
  goto <bb 36>; [87.50%]

after the main vect loop, recomputing the original IV (i) rather
than using the inserted canonical IV.  And then the vectorized
epilogue header check doing

<bb 36> [local count: 93293400]:
# i_59 = PHI <tmp.11_63(33), 0(18)>
# _66 = PHI <_67(33), 0(18)>
_96 = (unsigned int) n_10(D);
niters.26_95 = _96 - _66;
_108 = (unsigned int) n_10(D);
_109 = _108 - _66;
_110 = _109 + 4294967295;
if (_110 <= 3)
  goto <bb 47>; [10.00%]
else
  goto <bb 40>; [90.00%]

re-computing everything from scratch again (also notice how
the main vect loop guard jumps around the alignment prologue
as well and lands here - and the vectorized epilogue using
unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to
track the number of performed scalar iterations (or the
number of scalar iterations remaining) using the canonical
IV we add to all loops across all of the involved loops.

Richard.
Andre Vieira (lists) June 16, 2021, 10:24 a.m. UTC | #11
On 14/06/2021 11:57, Richard Biener wrote:
> On Mon, 14 Jun 2021, Richard Biener wrote:
>
>> Indeed. For example a simple
>> int a[1024], b[1024], c[1024];
>>
>> void foo(int n)
>> {
>>    for (int i = 0; i < n; ++i)
>>      a[i+1] += c[i+i] ? b[i+1] : 0;
>> }
>>
>> should usually see peeling for alignment (though on x86 you need
>> exotic -march= since cost models generally have equal aligned and
>> unaligned access costs).  For example with -mavx2 -mtune=atom
>> we'll see an alignment peeling prologue, a AVX2 vector loop,
>> a SSE2 vectorized epilogue and a scalar epilogue.  It also
>> shows the original scalar loop being used in the scalar prologue
>> and epilogue.
>>
>> We're not even trying to make the counting IV easily used
>> across loops (we're not counting scalar iterations in the
>> vector loops).
> Specifically we see
>
> <bb 33> [local count: 94607391]:
> niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> _67 = niters_vector_mult_vf.10_62 + 7;
> _64 = (int) niters_vector_mult_vf.10_62;
> tmp.11_63 = i_43 + _64;
> if (niters.8_45 == niters_vector_mult_vf.10_62)
>    goto <bb 37>; [12.50%]
> else
>    goto <bb 36>; [87.50%]
>
> after the main vect loop, recomputing the original IV (i) rather
> than using the inserted canonical IV.  And then the vectorized
> epilogue header check doing
>
> <bb 36> [local count: 93293400]:
> # i_59 = PHI <tmp.11_63(33), 0(18)>
> # _66 = PHI <_67(33), 0(18)>
> _96 = (unsigned int) n_10(D);
> niters.26_95 = _96 - _66;
> _108 = (unsigned int) n_10(D);
> _109 = _108 - _66;
> _110 = _109 + 4294967295;
> if (_110 <= 3)
>    goto <bb 47>; [10.00%]
> else
>    goto <bb 40>; [90.00%]
>
> re-computing everything from scratch again (also notice how
> the main vect loop guard jumps around the alignment prologue
> as well and lands here - and the vectorized epilogue using
> unaligned accesses - good!).
>
> That is, I'd expect _much_ easier jobs if we'd manage to
> track the number of performed scalar iterations (or the
> number of scalar iterations remaining) using the canonical
> IV we add to all loops across all of the involved loops.
>
> Richard.


So I am now looking at using an IV that counts scalar iterations rather 
than vector iterations and reusing that through all loops (prologue, 
main loop, vect_epilogue and scalar epilogue). The first is easy, since 
that's what we already do for partial vectors or non-constant VFs. The 
latter requires some plumbing and removing a lot of the code in there 
that creates new IV's going from [0, niters - previous iterations]. I 
don't yet have a clear cut view of how to do this, I first thought of 
keeping track of the 'control' IV in the loop_vinfo, but the prologue 
and scalar epilogues won't have one. 'loop' keeps a control_ivs struct, 
but that is used for overflow detection and only keeps track of what 
looks like a constant 'base' and 'step'. Not quite sure how all that 
works, but intuitively doesn't seem like the right thing to reuse.

I'll go hack around and keep you posted on progress.

Regards,
Andre
Richard Biener June 16, 2021, 11:13 a.m. UTC | #12
On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

> 
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed. For example a simple
> >> int a[1024], b[1024], c[1024];
> >>
> >> void foo(int n)
> >> {
> >>    for (int i = 0; i < n; ++i)
> >>      a[i+1] += c[i+i] ? b[i+1] : 0;
> >> }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs).  For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, a AVX2 vector loop,
> >> a SSE2 vectorized epilogue and a scalar epilogue.  It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> > <bb 33> [local count: 94607391]:
> > niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> > _67 = niters_vector_mult_vf.10_62 + 7;
> > _64 = (int) niters_vector_mult_vf.10_62;
> > tmp.11_63 = i_43 + _64;
> > if (niters.8_45 == niters_vector_mult_vf.10_62)
> >    goto <bb 37>; [12.50%]
> > else
> >    goto <bb 36>; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV.  And then the vectorized
> > epilogue header check doing
> >
> > <bb 36> [local count: 93293400]:
> > # i_59 = PHI <tmp.11_63(33), 0(18)>
> > # _66 = PHI <_67(33), 0(18)>
> > _96 = (unsigned int) n_10(D);
> > niters.26_95 = _96 - _66;
> > _108 = (unsigned int) n_10(D);
> > _109 = _108 - _66;
> > _110 = _109 + 4294967295;
> > if (_110 <= 3)
> >    goto <bb 47>; [10.00%]
> > else
> >    goto <bb 40>; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
> 
> 
> So I am now looking at using an IV that counts scalar iterations rather than
> vector iterations and reusing that through all loops, (prologue, main loop,
> vect_epilogue and scalar epilogue). The first is easy, since that's what we
> already do for partial vectors or non-constant VFs. The latter requires some
> plumbing and removing a lot of the code in there that creates new IV's going
> from [0, niters - previous iterations]. I don't yet have a clear cut view of
> how to do this, I first thought of keeping track of the 'control' IV in the
> loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps
> a control_ivs struct, but that is used for overflow detection and only keeps
> track of what looks like a constant 'base' and 'step'. Not quite sure how all
> that works, but intuitively doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc., this should make it possible to do some first order cleanups,
avoiding some repeated computations.  It does involve placing
additional PHIs for this remain_scalar_iter var of course (I'd be
hesitant to rely on the SSA renamer for this due to its expense).

I think that for all later jump-around tests tracking remaining
scalar iters is more convenient than tracking performed scalar iters.

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...
Richard.

Patch

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index c1d6e02194b251f7c940784c291d58c754f07454..ebb71948abe4ca27d495a2707254beb27e385a0d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1928,6 +1928,15 @@  vect_gen_prolog_loop_niters (loop_vec_info loop_vinfo,
   return iters_name;
 }
 
+static bool
+maybe_not_zero (tree t)
+{
+  if (!t)
+    return false;
+  if (TREE_CODE (t) != INTEGER_CST)
+    return true;
+  return !tree_int_cst_equal (t, build_zero_cst (TREE_TYPE (t)));
+}
 
 /* Function vect_update_init_of_dr
 
@@ -1954,6 +1963,76 @@  vect_update_init_of_dr (dr_vec_info *dr_info, tree niters, tree_code code)
   dr_info->offset = offset;
 }
 
+static void
+vect_update_base_of_dr (struct data_reference * dr,
+			loop_vec_info epilogue_vinfo, iv_update *iv_update)
+{
+  tree new_base_addr = iv_update->new_base_addr;
+  edge skip_e = iv_update->skip_edge;
+  if (skip_e)
+    {
+      /* If we have SKIP_E we need to use the phi-node that joins the IV coming
+	 from the main loop and the initial IV.  */
+      gimple_seq stmts;
+      tree base_addr = DR_BASE_ADDRESS (dr);
+      tree type = TREE_TYPE (base_addr);
+      gphi *new_phi;
+
+      edge e = EDGE_PRED (skip_e->dest, 0);
+      e = e != skip_e ? e : EDGE_PRED (skip_e->dest, 1);
+
+      base_addr = force_gimple_operand (base_addr, &stmts, true,
+					NULL_TREE);
+      gimple_stmt_iterator gsi = gsi_last_bb (skip_e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      /* Make sure NEW_BASE_ADDR and the initial base address use the same
+	 type.  Not sure why I chose to use DR_BASE_ADDR's type here, probably
+	 makes more sense to use the NEW_BASE_ADDR's type.  */
+      stmts = NULL;
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true, NULL_TREE);
+      gsi = gsi_last_bb (e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      new_phi = create_phi_node (make_ssa_name (type), skip_e->dest);
+      add_phi_arg (new_phi, new_base_addr, e, UNKNOWN_LOCATION);
+      add_phi_arg (new_phi, base_addr, skip_e, UNKNOWN_LOCATION);
+
+      new_base_addr = gimple_phi_result (new_phi);
+    }
+  else
+    {
+      gimple_seq stmts;
+      class loop *loop = LOOP_VINFO_LOOP (epilogue_vinfo);
+      tree type = TREE_TYPE (DR_BASE_ADDRESS (dr));
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true,
+					    NULL_TREE);
+      gimple_stmt_iterator gsi
+	= gsi_last_bb (loop_preheader_edge (loop)->src);
+      if (!gsi_stmt (gsi)
+	  || is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+    }
+  DR_BASE_ADDRESS (dr) = new_base_addr;
+  /* TODO: This will need a different approach for vector_skip path with non-zero init or offset.  These are currently disabled.  */
+  DR_INIT (dr) = build_zero_cst (sizetype);
+  DR_OFFSET (dr) = build_zero_cst (sizetype);
+}
+
+
 
 /* Function vect_update_inits_of_drs
 
@@ -1966,6 +2045,7 @@  vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 {
   unsigned int i;
   vec<data_reference_p> datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
+  loop_vec_info orig_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
   struct data_reference *dr;
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
@@ -1981,7 +2061,23 @@  vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
     {
       dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
       if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
-	vect_update_init_of_dr (dr_info, niters, code);
+	{
+	  iv_update *iv_update = NULL;
+	  if (orig_vinfo)
+	    iv_update
+	      = LOOP_VINFO_IV_UPDATES (orig_vinfo)->get (dr);
+	  /* Don't use IV_UPDATE if we are using skip_vector but DR_INIT is not
+	     zero.  */
+	  if (iv_update && iv_update->new_base_addr
+	      && !(iv_update->skip_edge
+		   && (maybe_not_zero (DR_INIT (dr))
+		       || maybe_not_zero (DR_OFFSET (dr)))))
+	    vect_update_base_of_dr (dr, loop_vinfo, iv_update);
+	  else
+	    /* Advance the data_references with the number of iterations of
+	       the previous loop and its prologue.  */
+	    vect_update_init_of_dr (dr_info, niters, code);
+	}
     }
 }
 
@@ -2857,6 +2953,9 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     {
       epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
       loop_vinfo->epilogue_vinfos.ordered_remove (0);
+      if (!LOOP_VINFO_IV_UPDATES (loop_vinfo))
+	LOOP_VINFO_IV_UPDATES (loop_vinfo)
+	  = new hash_map<struct data_reference *, iv_update> ();
     }
 
   tree niters_vector_mult_vf = NULL_TREE;
@@ -3091,6 +3190,18 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
 
+	  if (epilogue_vinfo)
+	    {
+	      struct data_reference *dr;
+	      int i;
+	      vec<data_reference_p> datarefs
+		= LOOP_VINFO_DATAREFS (epilogue_vinfo);
+	      FOR_EACH_VEC_ELT (datarefs, i, dr)
+		{
+		  iv_update &iv_update
+		    = LOOP_VINFO_IV_UPDATES (loop_vinfo)->get_or_insert (dr);
+		  iv_update.skip_edge = guard_e;
+		}
+	    }
+
 	  /* Simply propagate profile info from guard_bb to guard_to which is
 	     a merge point of control flow.  */
 	  guard_to->count = guard_bb->count;
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3e973e774af8f9205be893e01ad9263281116885..500b0efbf34afa0e9202f6b0fa61e5e291482321 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -846,7 +846,8 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     has_mask_store (false),
     scalar_loop_scaling (profile_probability::uninitialized ()),
     scalar_loop (NULL),
-    orig_loop_info (NULL)
+    orig_loop_info (NULL),
+    iv_updates_map (NULL)
 {
   /* CHECKME: We want to visit all BBs before their successors (except for
      latch blocks, for which this assertion wouldn't hold).  In the simple
@@ -926,6 +927,7 @@  _loop_vec_info::~_loop_vec_info ()
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
+  delete iv_updates_map;
 
   /* When we release an epiloge vinfo that we do not intend to use
      avoid clearing AUX of the main loop which should continue to
@@ -9264,8 +9266,8 @@  update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
   free (LOOP_VINFO_BBS (epilogue_vinfo));
   LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
 
-  /* Advance data_reference's with the number of iterations of the previous
-     loop and its prologue.  */
+  /* Either advance the data_references with the number of iterations of the
+     previous loop and its prologue, or reuse the update captured from the
+     previous loop's data_references.  */
   vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
 
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 85d3161fe3b2fbe289396457361c7af7519bd2b3..65650d71149abd22bd320135d01411edd38d9cf9 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -7123,7 +7123,39 @@  vectorizable_scan_store (vec_info *vinfo,
   return true;
 }
 
+/* Record in LOOP_VINFO's iv_update map the updated base address for DR, so
+   that a vectorized epilogue can reuse it.  AFTER_INCR and BEFORE_INCR are
+   the data-reference pointers after and before the last increment.  */
+
+static void
+vect_capture_iv_update (struct data_reference *dr, tree after_incr,
+			tree before_incr, loop_vec_info loop_vinfo)
+{
+  if (!loop_vinfo || !LOOP_VINFO_IV_UPDATES (loop_vinfo))
+    return;
+
+  /* We aren't updating the data_reference, so there is nothing to update the
+     base address with; drop any stale entry.  */
+  if (after_incr == NULL_TREE)
+    {
+      LOOP_VINFO_IV_UPDATES (loop_vinfo)->remove (dr);
+      return;
+    }
+
+  /* Use AFTER_INCR for positive steps as that will contain the next dataref
+     to be accessed.  For negative steps we access the data at step * datasize,
+     so use BEFORE_INCR as otherwise the epilogue's data access will be off by
+     the main loop's datasize * step.  */
+  bool neg_step
+    = TREE_CODE (DR_STEP (dr)) == INTEGER_CST
+      && tree_int_cst_sgn (DR_STEP (dr)) == -1;
+
+  tree new_base_addr;
+  if (neg_step)
+    new_base_addr
+      = fold_build2 (POINTER_PLUS_EXPR,
+		     TREE_TYPE (before_incr),
+		     before_incr, build_int_cst (sizetype, -1));
+  else
+    new_base_addr = after_incr;
 
+  iv_update &iv_update
+    = LOOP_VINFO_IV_UPDATES (loop_vinfo)->get_or_insert (dr);
+  iv_update.new_base_addr = new_base_addr;
+}
+
 /* Function vectorizable_store.
 
    Check if STMT_INFO defines a non scalar data-ref (array/pointer/structure)
@@ -7151,6 +7183,7 @@  vectorizable_store (vec_info *vinfo,
   tree dataref_ptr = NULL_TREE;
   tree dataref_offset = NULL_TREE;
   gimple *ptr_incr = NULL;
+  tree after_incr = NULL_TREE;
   int ncopies;
   int j;
   stmt_vec_info first_stmt_info;
@@ -8279,6 +8312,11 @@  vectorizable_store (vec_info *vinfo,
   result_chain.release ();
   vec_oprnds.release ();
 
+  if (ptr_incr)
+    after_incr = gimple_get_lhs (ptr_incr);
+
+  vect_capture_iv_update (first_dr_info->dr, after_incr, dataref_ptr,
+			  loop_vinfo);
+
   return true;
 }
 
@@ -8420,6 +8458,7 @@  vectorizable_load (vec_info *vinfo,
   tree dataref_ptr = NULL_TREE;
   tree dataref_offset = NULL_TREE;
   gimple *ptr_incr = NULL;
+  tree after_incr = NULL_TREE;
   int ncopies;
   int i, j;
   unsigned int group_size;
@@ -9792,6 +9831,11 @@  vectorizable_load (vec_info *vinfo,
         }
       dr_chain.release ();
     }
+
+  if (ptr_incr)
+    after_incr = gimple_get_lhs (ptr_incr);
+  vect_capture_iv_update (first_dr_info->dr, after_incr, dataref_ptr,
+			  loop_vinfo);
+
   if (!slp)
     *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index b861c97ab3aef179ba9b2900701cf09e75a847a5..95cb806f715d3f5b1389b2e56a026e07709da990 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -545,6 +545,12 @@  typedef auto_vec<rgroup_controls> vec_loop_lens;
 
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
+/* Describes how to update an epilogue loop's data-reference IVs from the
+   main loop: NEW_BASE_ADDR is the data-reference pointer as left by the main
+   loop, and SKIP_EDGE, if non-NULL, is the edge taken when the main vector
+   loop is skipped.  */
+struct iv_update
+{
+  edge skip_edge;
+  tree new_base_addr;
+};
+
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
 /*-----------------------------------------------------------------*/
@@ -752,6 +758,8 @@  public:
      analysis.  */
   vec<_loop_vec_info *> epilogue_vinfos;
 
+  /* Map recording, for each of this loop's data_references, the update to
+     apply to the corresponding epilogue data_reference.  */
+  hash_map<struct data_reference*, iv_update> *iv_updates_map;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -807,6 +815,7 @@  public:
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 #define LOOP_VINFO_SIMD_IF_COND(L)         (L)->simd_if_cond
+#define LOOP_VINFO_IV_UPDATES(L)	   (L)->iv_updates_map
 
 #define LOOP_VINFO_FULLY_MASKED_P(L)		\
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\