diff mbox

[PR68030/PR69710,RFC] Introduce a simple local CSE interface and use it in vectorizer

Message ID c73aa592-404b-cc52-2926-10cdaa707087@imgtec.com
State New
Headers show

Commit Message

Doug Gilmore Nov. 21, 2016, 9:34 p.m. UTC
I haven't seen any followups to this discussion of Bin's patch to
PR68303 and PR69710, the patch submission:
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html

Discussion:
http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html


so I did some investigation to get a better understanding of the
issues involved.

On 07/13/2016 01:59 PM, Jeff Law wrote:
> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>> duplicated computations in loop's pre-header basic block when
>> creating base address for vector reference to the same memory object.
> Not a huge surprise.  Loop optimizations generally have a tendency
> to create and/or expose CSE opportunities.  Unrolling is a common
> culprit, there's certainly the possibility for header duplication,
> code motions and IV rewriting to also expose/create redundant code.
>
> ...
> 
>  But, 1) It
>> doesn't fix all the problem on x86_64.  Root cause is computation for
>> base address of the first reference is somehow moved outside of
>> loop's pre-header, local CSE can't help in this case.
> That's a bid odd -- have you investigated why this is outside the loop header?
> ...
I didn't look at this issue per se, but I did try running DOM between
autovectorization and IVS.  Just running DOM had little effect, what
was crucial was adding the change Bin mentioned in his original
message:

    Besides CSE issue, this patch also re-associates address
    expressions in vect_create_addr_base_for_vector_ref, specifically,
    it splits constant offset and adds it back near the expression
    root in IR.  This is necessary because GCC only handles
    re-association for commutative operators in CSE.

I attached a patch for these changes only.  These are the important
modifications that address the some of the IVS related issues exposed
by PR68303. I found that adding the CSE change (or calling DOM between
autovectorization and IVOPTS) is not needed, and from what I have
seen, actually makes the code worse.

Applying only the modifications to
vect_create_addr_base_for_vector_ref, additional simplifications will
be done when induction variables are found (function
find_induction_variables).  These simplications are indicated by the
appearance of lines:

Applying pattern match.pd:1056, generic-match.c:11865

in the IVOPS dump file.  Now IVOPTs transforms the code so that
constants now appear in the computation of the effective addresses for
the memory OPs.  However the code generated by IVOPTS still uses a
separate base register for each memory reference.  Later DOM3
transforms the code to use just one base register, which is the form
the code needs to be in for the preliminary phase of IVOPTs where
"IV uses" associated with memory OPs are placed into groups.  At the
time of this grouping, checks are done to ensure that for each member
of a group the constant offsets don't overflow the immediate fields in
actual machine instructions (more in this see * below).

Currently it appears that an IV is generated for each memory
reference.  Instead of generating a new IV for each memory reference,
we could try to detect that value of the new IV is just a constant
offset of an existing IV and just generate a new temp reflecting that.
I haven't worked through what needs to be done to implement that, but
for the issue in PR69710 (saxpy example where the same IV should be
used for a load and store) is straightforward to implement so since
work has already been done in during data dependence analysis to
detect this situation.  I attached a patch for PR69710 that was
bootstrapped and tested on X86_64 without errors.  It does appear that
it needs more testing, since I did notice SPEC 2006 h264ref produces
different results with the patch applied, which I still need to
investigate.

Doug

* Note that when IV uses are grouped, only positive constant offsets
constraints are considered.  That negative offsets can be used are
reflected in the costs of using a different IV than the IV associated
with a particular group.  Thus once the optimal IV set is found, a
different IV may chosen, which causes negative constant offsets to be
used.

Comments

Bin.Cheng Nov. 22, 2016, 4:07 p.m. UTC | #1
On Mon, Nov 21, 2016 at 9:34 PM, Doug Gilmore <Doug.Gilmore@imgtec.com> wrote:
> I haven't seen any followups to this discussion of Bin's patch to
> PR68303 and PR69710, the patch submission:
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html
>
> Discussion:
> http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html
>
>
> so I did some investigation to get a better understanding of the
> issues involved.
Hi Doug,
Thanks for looking into this problem.
>
> On 07/13/2016 01:59 PM, Jeff Law wrote:
>> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>>> duplicated computations in loop's pre-header basic block when
>>> creating base address for vector reference to the same memory object.
>> Not a huge surprise.  Loop optimizations generally have a tendency
>> to create and/or expose CSE opportunities.  Unrolling is a common
>> culprit, there's certainly the possibility for header duplication,
>> code motions and IV rewriting to also expose/create redundant code.
>>
>> ...
>>
>>  But, 1) It
>>> doesn't fix all the problem on x86_64.  Root cause is computation for
>>> base address of the first reference is somehow moved outside of
>>> loop's pre-header, local CSE can't help in this case.
>> That's a bid odd -- have you investigated why this is outside the loop header?
>> ...
> I didn't look at this issue per se, but I did try running DOM between
> autovectorization and IVS.  Just running DOM had little effect, what
> was crucial was adding the change Bin mentioned in his original
> message:
>
>     Besides CSE issue, this patch also re-associates address
>     expressions in vect_create_addr_base_for_vector_ref, specifically,
>     it splits constant offset and adds it back near the expression
>     root in IR.  This is necessary because GCC only handles
>     re-association for commutative operators in CSE.
>
> I attached a patch for these changes only.  These are the important
> modifications that address the some of the IVS related issues exposed
> by PR68303. I found that adding the CSE change (or calling DOM between
> autovectorization and IVOPTS) is not needed, and from what I have
I checked the code again.  As you said, re-association part is important
to enable CSE opportunities, no matter when and which pass handles it.
After re-association, the computation of base addresses are like:

    //preheader
    b_1 = g_Input + var_offset_1;
    vectp_1 = b_1 + cst_offset_1;
    b_2 = g_Input + var_offset_2;
    vectp_2 = b_2 + cst_offset_2;
    ...
    b_n = g_input + var_offset_n;
    vectp_n = b_n + cst_offset_n;

    //loop
    MEM[vectp_1];
    MEM[vectp_2];
    ...
    MEM[vectp_n];

In fact, var_offset_1, var_offset_2, ..., var_offset_n are equal to others.  So
the addresses are in the form of "g_Input + var_offset + cst_offset_x" differing
to each other wrto constant offset.  The purpose of CSE is to propagate all
parts of this address to IVOPTs, otherwise IVOPTS only knows IVs as below:

    iv_use_1: {b_1 + cst_offset_1, step}_loop
    iv_use_1: {b_2 + cst_offset_2, step}_loop
    ...
    iv_use_n: {b_n + cst_offset_n, step}_loop

> seen, actually makes the code worse.
>
> Applying only the modifications to
> vect_create_addr_base_for_vector_ref, additional simplifications will
> be done when induction variables are found (function
> find_induction_variables).  These simplications are indicated by the
> appearance of lines:
>
> Applying pattern match.pd:1056, generic-match.c:11865
This doesn't look related to this problem to me.  The simplification of this
problem is CSE, it's not what match.pd does.

>
> in the IVOPS dump file.  Now IVOPTs transforms the code so that
> constants now appear in the computation of the effective addresses for
> the memory OPs.  However the code generated by IVOPTS still uses a
> separate base register for each memory reference.  Later DOM3
> transforms the code to use just one base register, which is the form

Indeed CSE now looks like unnecessary fixing the problem, we can relying on
DOM pass to explore the equality among new bases (b_1, b_2, ..., b_n).  This
actually echoes my humble opinion: we shouldn't rely on IVOPTs to fix all bad
code issues.  On the other handle, for cases in which these bases
(b_1, b_2, ..., b_n)
are not equal to each other, there is not much to lose in this way either.

> the code needs to be in for the preliminary phase of IVOPTs where
> "IV uses" associated with memory OPs are placed into groups.  At the
> time of this grouping, checks are done to ensure that for each member
> of a group the constant offsets don't overflow the immediate fields in
> actual machine instructions (more in this see * below).
>
> Currently it appears that an IV is generated for each memory
> reference.  Instead of generating a new IV for each memory reference,
> we could try to detect that value of the new IV is just a constant
> offset of an existing IV and just generate a new temp reflecting that.
> I haven't worked through what needs to be done to implement that, but
> for the issue in PR69710 (saxpy example where the same IV should be
Basic idea is to re-associate generated base addresses for vectors in order
to enable more CSE opportunities.  The original patch re-associates all
vector base addresses, no matter if they share common-sub-expression
with others or not.  This is not good, and could result in sub-optimal code.
That's one reason why the patch is not updated.  The old idea here is to
introduce a vectorizer local fix, for example using hash-table storing existing
base address, and generating new ones (which differ in const offset) based
on it.

> used for a load and store) is straightforward to implement so since
> work has already been done in during data dependence analysis to
> detect this situation.  I attached a patch for PR69710 that was
> bootstrapped and tested on X86_64 without errors.  It does appear that
> it needs more testing, since I did notice SPEC 2006 h264ref produces
> different results with the patch applied, which I still need to
> investigate.
Your patch reminds me of another possible method.  Like tree-predcom.c
does, a light-weight merge&find structure, it records groups of data-refs,
and data-refs in a group having the same base address (except constant
offset).  We can build such group information during analyzing each ddr.
This is lighter than has-table with optimal time complexity.

Above comments are made based on x86, there is one more problem on
AArch64.  Given address iv_use with the form MEM[base + IV + cst_offset],
when cst_offset is out the range of [base + offset] addressing mode, IVOPTs
rewrites it into below form:

    temp_1 = base + IV;
    temp_2 = temp_1 + cst_offset;
    MEM[temp_2];

What we want is:

    temp_1 = base + cst_offset;
    MEM[temp-1 + IV];

Thus temp_1 can be hoisted out of loop when register pressure allows (as
in this problem).  In cases register pressure is high, very likely the addition
is kept in loop, which is still better than now in which register pressure is
high because different IVs are chosen.  Without this change, the vectorization
change causes obvious regression in loop body on AArch64.

But both changes look like stage 1 work.

Thanks,
bin

> Doug
>
> * Note that when IV uses are grouped, only positive constant offsets
> constraints are considered.  That negative offsets can be used are
> reflected in the costs of using a different IV than the IV associated
> with a particular group.  Thus once the optimal IV set is found, a
> different IV may chosen, which causes negative constant offsets to be
> used.
>
Doug Gilmore Nov. 23, 2016, 12:35 a.m. UTC | #2
On 11/22/2016 08:07 AM, Bin.Cheng wrote:
> On Mon, Nov 21, 2016 at 9:34 PM, Doug Gilmore <Doug.Gilmore@imgtec.com> wrote:
>> I haven't seen any followups to this discussion of Bin's patch to
>> PR68303 and PR69710, the patch submission:
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html
>>
>> Discussion:
>> http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html
>> http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html
>>
>>
>> so I did some investigation to get a better understanding of the
>> issues involved.
> Hi Doug,
> Thanks for looking into this problem.
>>
>> On 07/13/2016 01:59 PM, Jeff Law wrote:
>>> On 05/25/2016 05:22 AM, Bin Cheng wrote:
>>>> Hi, As analyzed in PR68303 and PR69710, vectorizer generates
>>>> duplicated computations in loop's pre-header basic block when
>>>> creating base address for vector reference to the same memory object.
>>> Not a huge surprise.  Loop optimizations generally have a tendency
>>> to create and/or expose CSE opportunities.  Unrolling is a common
>>> culprit, there's certainly the possibility for header duplication,
>>> code motions and IV rewriting to also expose/create redundant code.
>>>
>>> ...
>>>
>>>  But, 1) It
>>>> doesn't fix all the problem on x86_64.  Root cause is computation for
>>>> base address of the first reference is somehow moved outside of
>>>> loop's pre-header, local CSE can't help in this case.
>>> That's a bid odd -- have you investigated why this is outside the loop header?
>>> ...
>> I didn't look at this issue per se, but I did try running DOM between
>> autovectorization and IVS.  Just running DOM had little effect, what
>> was crucial was adding the change Bin mentioned in his original
>> message:
>>
>>     Besides CSE issue, this patch also re-associates address
>>     expressions in vect_create_addr_base_for_vector_ref, specifically,
>>     it splits constant offset and adds it back near the expression
>>     root in IR.  This is necessary because GCC only handles
>>     re-association for commutative operators in CSE.
>>
>> I attached a patch for these changes only.  These are the important
>> modifications that address the some of the IVS related issues exposed
>> by PR68303. I found that adding the CSE change (or calling DOM between
>> autovectorization and IVOPTS) is not needed, and from what I have
> I checked the code again.  As you said, re-association part is important
> to enable CSE opportunities, no matter when and which pass handles it.
> After re-association, the computation of base addresses are like:
> 
>     //preheader
>     b_1 = g_Input + var_offset_1;
>     vectp_1 = b_1 + cst_offset_1;
>     b_2 = g_Input + var_offset_2;
>     vectp_2 = b_2 + cst_offset_2;
>     ...
>     b_n = g_input + var_offset_n;
>     vectp_n = b_n + cst_offset_n;
> 
>     //loop
>     MEM[vectp_1];
>     MEM[vectp_2];
>     ...
>     MEM[vectp_n];
> 
> In fact, var_offset_1, var_offset_2, ..., var_offset_n are equal to others.  So
> the addresses are in the form of "g_Input + var_offset + cst_offset_x" differing
> to each other wrto constant offset.  The purpose of CSE is to propagate all
> parts of this address to IVOPTs, otherwise IVOPTS only knows IVs as below:
> 
>     iv_use_1: {b_1 + cst_offset_1, step}_loop
>     iv_use_1: {b_2 + cst_offset_2, step}_loop
>     ...
>     iv_use_n: {b_n + cst_offset_n, step}_loop
> 
>> seen, actually makes the code worse.
>>
>> Applying only the modifications to
>> vect_create_addr_base_for_vector_ref, additional simplifications will
>> be done when induction variables are found (function
>> find_induction_variables).  These simplications are indicated by the
>> appearance of lines:
>>
>> Applying pattern match.pd:1056, generic-match.c:11865
> This doesn't look related to this problem to me.  The simplification of this
> problem is CSE, it's not what match.pd does.
> 
>>
>> in the IVOPS dump file.  Now IVOPTs transforms the code so that
>> constants now appear in the computation of the effective addresses for
>> the memory OPs.  However the code generated by IVOPTS still uses a
>> separate base register for each memory reference.  Later DOM3
>> transforms the code to use just one base register, which is the form
> 
> Indeed CSE now looks like unnecessary fixing the problem, we can relying on
> DOM pass to explore the equality among new bases (b_1, b_2, ..., b_n).  This
> actually echoes my humble opinion: we shouldn't rely on IVOPTs to fix all bad
> code issues.  On the other handle, for cases in which these bases
> (b_1, b_2, ..., b_n)
> are not equal to each other, there is not much to lose in this way either.
> 
>> the code needs to be in for the preliminary phase of IVOPTs where
>> "IV uses" associated with memory OPs are placed into groups.  At the
>> time of this grouping, checks are done to ensure that for each member
>> of a group the constant offsets don't overflow the immediate fields in
>> actual machine instructions (more in this see * below).
>>
>> Currently it appears that an IV is generated for each memory
>> reference.  Instead of generating a new IV for each memory reference,
>> we could try to detect that value of the new IV is just a constant
>> offset of an existing IV and just generate a new temp reflecting that.
>> I haven't worked through what needs to be done to implement that, but
>> for the issue in PR69710 (saxpy example where the same IV should be
> Basic idea is to re-associate generated base addresses for vectors in order
> to enable more CSE opportunities.  The original patch re-associates all
> vector base addresses, no matter if they share common-sub-expression
> with others or not.  This is not good, and could result in sub-optimal code.
> That's one reason why the patch is not updated.  The old idea here is to
> introduce a vectorizer local fix, for example using hash-table storing existing
> base address, and generating new ones (which differ in const offset) based
> on it.
> 
>> used for a load and store) is straightforward to implement so since
>> work has already been done in during data dependence analysis to
>> detect this situation.  I attached a patch for PR69710 that was
>> bootstrapped and tested on X86_64 without errors.  It does appear that
>> it needs more testing, since I did notice SPEC 2006 h264ref produces
>> different results with the patch applied, which I still need to
>> investigate.
> Your patch reminds me of another possible method.  Like tree-predcom.c
> does, a light-weight merge&find structure, it records groups of data-refs,
> and data-refs in a group having the same base address (except constant
> offset).  We can build such group information during analyzing each ddr.
> This is lighter than has-table with optimal time complexity.
Hi Bin,

Thank you for the quick response.

Isn't this work ready done for us?

See trace output in the "vect" details dump for DR structures:

...
	base_object: *global_Input.0_145 + (sizetype) ((unsigned int) iy_179 * 2064)
	Access function 0: {4144B, +, 4}_2
...
> 
> Above comments are made based on x86, there is one more problem on
> AArch64.  Given address iv_use with the form MEM[base + IV + cst_offset],
> when cst_offset is out the range of [base + offset] addressing mode, IVOPTs
> rewrites it into below form:
> 
>     temp_1 = base + IV;
>     temp_2 = temp_1 + cst_offset;
>     MEM[temp_2];
> 
> What we want is:
> 
>     temp_1 = base + cst_offset;
>     MEM[temp-1 + IV];
> 
> Thus temp_1 can be hoisted out of loop when register pressure allows (as
> in this problem).  In cases register pressure is high, very likely the addition
> is kept in loop, which is still better than now in which register pressure is
> high because different IVs are chosen.  Without this change, the vectorization
> change causes obvious regression in loop body on AArch64.
> 
> But both changes look like stage 1 work.
Right -- Thanks!

Doug
> 
> Thanks,
> bin
> 
>> Doug
>>
>> * Note that when IV uses are grouped, only positive constant offsets
>> constraints are considered.  That negative offsets can be used are
>> reflected in the costs of using a different IV than the IV associated
>> with a particular group.  Thus once the optimal IV set is found, a
>> different IV may chosen, which causes negative constant offsets to be
>> used.
>>
diff mbox

Patch

From 68d783d5b8e3a6238a7e7b25d76fc6e457bfc484 Mon Sep 17 00:00:00 2001
From: Doug Gilmore <doug.gilmore@imgtec.com>
Date: Fri, 18 Nov 2016 13:29:47 -0800
Subject: [PATCH] Another prototype fix to PR69710.

Would need to be extended to handle PR68030.

Tested on r240609 several weeks ago: bootstrapped and no make check
regressions on x86_64-unknown-linux-gnu, though a run-time error on
SPEC 2006 h264ref at -O3 with --size=test was observed.
---
 gcc/tree-data-ref.c       |  1 +
 gcc/tree-data-ref.h       |  4 ++++
 gcc/tree-vect-data-refs.c | 50 ++++++++++++++++++++++++++++++++++++++++++++---
 gcc/tree-vect-loop.c      |  2 ++
 gcc/tree-vectorizer.h     |  5 +++++
 5 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 8152da3..7a4850c 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1492,6 +1492,7 @@  initialize_data_dependence_relation (struct data_reference *a,
   DDR_SUBSCRIPTS (res).create (0);
   DDR_DIR_VECTS (res).create (0);
   DDR_DIST_VECTS (res).create (0);
+  DDR_PREHEADER_IV (res) = NULL_TREE;
 
   if (a == NULL || b == NULL)
     {
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 856dd58..ca67d93 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -264,6 +264,9 @@  struct data_dependence_relation
   /* Set to true when the dependence relation is on the same data
      access.  */
   bool self_reference_p;
+
+  /* For zero distance dependencies, use the same IV in the preheader.  */
+  tree preheader_iv;
 };
 
 typedef struct data_dependence_relation *ddr_p;
@@ -294,6 +297,7 @@  typedef struct data_dependence_relation *ddr_p;
 #define DDR_DIST_VECT(DDR, I) \
   DDR_DIST_VECTS (DDR)[I]
 #define DDR_REVERSED_P(DDR) DDR->reversed_p
+#define DDR_PREHEADER_IV(DDR) DDR->preheader_iv
 
 
 bool dr_analyze_innermost (struct data_reference *, struct loop *);
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 5a30314..12cfbec 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -394,6 +394,7 @@  vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 		}
 	    }
 
+	  LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).safe_push (ddr);
 	  continue;
 	}
 
@@ -4492,10 +4493,53 @@  vect_create_data_ref_ptr (gimple *stmt, tree aggr_type, struct loop *at_loop,
   /* (2) Calculate the initial address of the aggregate-pointer, and set
      the aggregate-pointer to point to it before the loop.  */
 
-  /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader.  */
+  /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader,
+     taking care to use reuse IVs.  */
 
-  new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
-						   offset, loop, byte_offset);
+  bool need_new_temp = true;
+  ddr_p zd_ddr = (ddr_p) NULL;
+  if (loop_vinfo)
+    {
+      int i;
+      ddr_p ddr;
+      
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo), i, ddr)
+	{
+	  if (DDR_A(ddr) == dr || DDR_B(ddr) == dr)
+	    {
+	      zd_ddr = ddr;
+	      break;
+	    }
+	}
+
+      if (zd_ddr && DDR_PREHEADER_IV (zd_ddr))
+	{
+	  new_temp = DDR_PREHEADER_IV (zd_ddr);
+	  need_new_temp = false;
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location, "reuse ");
+	      dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp);
+	      dump_printf (MSG_NOTE, "\n");
+	    }
+	}
+    }
+
+  if (need_new_temp)
+      new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
+						       offset, loop, byte_offset);
+  if (zd_ddr)
+    {
+      DDR_PREHEADER_IV (zd_ddr) = new_temp;
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location, "save ");
+	  dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp);
+	  dump_printf (MSG_NOTE, "\n");
+	}
+    }
+  
   if (new_stmt_list)
     {
       if (pe)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 4150b0d..0bf67e9 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1173,6 +1173,7 @@  new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_FOR_NITER (res) = false;
   LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
   LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL;
+  LOOP_VINFO_ZERO_DIST_DEPS (res) = vNULL;
 
   return res;
 }
@@ -1271,6 +1272,7 @@  destroy_loop_vec_info (loop_vec_info loop_vinfo, bool clean_stmts)
   LOOP_VINFO_GROUPED_STORES (loop_vinfo).release ();
   LOOP_VINFO_REDUCTIONS (loop_vinfo).release ();
   LOOP_VINFO_REDUCTION_CHAINS (loop_vinfo).release ();
+  LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).release ();
 
   destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
   loop_vinfo->scalar_cost_vec.release ();
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 2a7fa0a..7fcc824 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -339,6 +339,10 @@  typedef struct _loop_vec_info : public vec_info {
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  /* Dependency relations that have zero distance, thus the associated
+     memory references can share the same IV.  */
+  vec<ddr_p> zero_dist_deps;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -379,6 +383,7 @@  typedef struct _loop_vec_info : public vec_info {
 #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
+#define LOOP_VINFO_ZERO_DIST_DEPS(L)       (L)->zero_dist_deps
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
-- 
1.9.1