
[PR66388] Add sizetype cand for BIV of smaller type if it's used as index of memory ref

Message ID 000401d0e52f$2c1917d0$844b4770$@arm.com
State New

Commit Message

Bin Cheng Sept. 2, 2015, 3:26 a.m. UTC
Hi,
This patch is a new approach to fixing PR66388.  IVO today computes an
iv_use with an iv_cand that has at least the same type precision as the
use.  On 64-bit platforms like AArch64, this results in a different
iv_cand being created for each address type iv_use, and increased
register pressure.  As a matter of fact, the BIV should be used for all
iv_uses in some of these cases.  It is a latent bug that has recently
been getting worse because of the overflow changes.
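For illustration, a hypothetical reduced loop of the affected kind
(invented here, not the actual PR66388 testcase):

/* The BIV 'i' is 32-bit and only feeds address computations.  On a
   64-bit target, IVO used to create a separate 64-bit candidate for
   each of the three address uses instead of reusing the BIV, raising
   register pressure.  */
void
foo (short *a, short *b, short *c, unsigned int n)
{
  unsigned int i;

  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}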

The original approach at
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the
issue, except that it conflicts with IV elimination.  It seems to me
impossible to reconcile the two.

This new approach fixes the issue by adding a sizetype iv_cand for BIVs
directly.  In cases where the original BIV is preferred, the sizetype
iv_cand will be chosen.  For code generation, the sizetype iv_cand has
the same effect as the original BIV.  Actually, it's better, because on
most targets the BIV needs to be explicitly extended to sizetype to be
used in an address expression.
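To illustrate (hypothetical code with invented names, taking unsigned
long as a stand-in for sizetype on an LP64 target): both loops below
perform the same accesses, and the rewrite is valid because the BIV
does not wrap, but only the narrow form needs a widening at every use.

/* Original 32-bit BIV: each address use first widens the index,
   i.e. the access is a + 4 * (sizetype) i.  */
void
clear_biv (int *a, unsigned int n)
{
  unsigned int i;

  for (i = 0; i != n; i++)
    a[i] = 0;
}

/* sizetype candidate: the extension disappears from the loop body and
   the access is a + 4 * i_sz directly.  */
void
clear_sizetype (int *a, unsigned int n)
{
  unsigned long i_sz;

  for (i_sz = 0; i_sz != n; i_sz++)
    a[i_sz] = 0;
}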

One shortcoming of this approach is that it may introduce more iv
candidates.  To minimize the impact, this patch does sophisticated code
analysis and adds a sizetype candidate for a BIV only if it is used as
an index.  Moreover, it avoids adding a candidate of the original type
if the BIV is only used as an index.  Statistics from compiling spec2k6
show that the increase in candidate number is modest and can be
ignored.

There are two more patches following to fix corner cases revealed by
this one.  Together they bring an obvious performance improvement for
spec2k6/int on aarch64.
Spec2k6/int
400.perlbench	3.44%
445.gobmk	-0.86%
456.hmmer	14.83%
458.sjeng	2.49%
462.libquantum	-0.79%
GEOMEAN         1.68%

There is also about a 0.36% improvement for spec2k6/fp, mostly because
of 436.cactusADM.  I believe it can be further improved, but that
should be another patch.

I also collected benchmark data for x86_64.  Spec2k6/fp is not
affected.  As for spec2k6/int, though the geomean is improved slightly,
400.perlbench is regressed by ~3%.  I can see that BIVs are chosen for
some loops instead of address candidates.  Generally, the loop header
is simplified because iv elimination with a BIV is simpler; the number
of instructions in the loop body isn't changed.  I suspect the
regression comes from different addressing modes.  With the BIV, a
complex addressing mode like [base + index << scale + disp] is used,
rather than [base + disp].  I guess the former takes more micro-ops and
is thus more expensive.  This guess can be confirmed by manually
suppressing the complex addressing mode with a higher address cost.

Now the problem becomes why the overall cost of the BIV is computed as
lower while the actual cost is higher.  I noticed that for most
affected loops, the loop header is bloated by iv elimination using the
old address candidate.  The bloated loop header results in a much
higher cost than the BIV, so the BIV is preferred.  I also noticed the
bloated loop header can generally be simplified (I have a following
patch for this).  After applying that local patch, the old address
candidate is chosen, and most of the regression is recovered.  My
conclusion is that the loop-header bloat should be blamed for the
regression, and it can be resolved.
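For illustration, a hypothetical pair of loops (invented, not taken
from 400.perlbench) showing the two code shapes on x86_64:

/* BIV retained: the load uses a scaled-index mode such as
   movq (%rdi,%rcx,8), %rdx, i.e. [base + index << scale + disp].  */
long
sum_biv (long *a, long n)
{
  long i, s = 0;

  for (i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Address candidate chosen: the pointer itself is incremented and a
   simple [base + disp] mode such as movq (%rax), %rdx is used.  */
long
sum_ptr (long *a, long n)
{
  long *p, *end = a + n;
  long s = 0;

  for (p = a; p != end; p++)
    s += *p;
  return s;
}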

Bootstrapped and tested on x86_64 and aarch64.  It fixes the failure of
gcc.target/i386/pr49781-1.c, without new breakage.

So what do you think?

Thanks,
bin

2015-08-31  Bin Cheng  <bin.cheng@arm.com>

	* tree-affine.c (aff_combination_expand): New parameters.
	(tree_to_aff_combination_expand): Ditto.
	* tree-affine.h (aff_combination_expand): New declaration.
	(tree_to_aff_combination_expand): Ditto.
	* tree-ssa-loop-ivopts.c (struct iv, iv_cand): New fields.
	(dump_iv): Dump no_overflow information.
	(alloc_iv): Initialize new field for struct iv.
	(struct expand_data): New struct for affine combination expanding.
	(stop_expand): New callback func for affine combination expanding.
	(find_deriving_biv_for_iv, record_biv_for_address_use): New
functions.
	(idx_find_step): Call new functions above.
	(find_depends, add_candidate): New parameter.
	(add_iv_candidate_for_biv): Add sizetype cand for BIV.
	(get_computation_aff): Simplify conversion of cand for BIV.
	(get_computation_cost_at): Step cand's base if necessary.

Comments

Richard Biener Sept. 2, 2015, 2:12 p.m. UTC | #1
On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.cheng@arm.com> wrote:
> Hi,
> This patch is a new approach to fixing PR66388.  IVO today computes an
> iv_use with an iv_cand that has at least the same type precision as the
> use.  On 64-bit platforms like AArch64, this results in a different
> iv_cand being created for each address type iv_use, and increased
> register pressure.  As a matter of fact, the BIV should be used for all
> iv_uses in some of these cases.  It is a latent bug that has recently
> been getting worse because of the overflow changes.
>
> The original approach at
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the
> issue, except that it conflicts with IV elimination.  It seems to me
> impossible to reconcile the two.
>
> This new approach fixes the issue by adding a sizetype iv_cand for BIVs
> directly.  In cases where the original BIV is preferred, the sizetype
> iv_cand will be chosen.  For code generation, the sizetype iv_cand has
> the same effect as the original BIV.  Actually, it's better, because on
> most targets the BIV needs to be explicitly extended to sizetype to be
> used in an address expression.
>
> One shortcoming of this approach is that it may introduce more iv
> candidates.  To minimize the impact, this patch does sophisticated code
> analysis and adds a sizetype candidate for a BIV only if it is used as
> an index.  Moreover, it avoids adding a candidate of the original type
> if the BIV is only used as an index.  Statistics from compiling spec2k6
> show that the increase in candidate number is modest and can be
> ignored.
>
> There are two more patches following to fix corner cases revealed by
> this one.  Together they bring an obvious performance improvement for
> spec2k6/int on aarch64.
> Spec2k6/int
> 400.perlbench   3.44%
> 445.gobmk       -0.86%
> 456.hmmer       14.83%
> 458.sjeng       2.49%
> 462.libquantum  -0.79%
> GEOMEAN         1.68%
>
> There is also about a 0.36% improvement for spec2k6/fp, mostly because
> of 436.cactusADM.  I believe it can be further improved, but that
> should be another patch.
>
> I also collected benchmark data for x86_64.  Spec2k6/fp is not
> affected.  As for spec2k6/int, though the geomean is improved slightly,
> 400.perlbench is regressed by ~3%.  I can see that BIVs are chosen for
> some loops instead of address candidates.  Generally, the loop header
> is simplified because iv elimination with a BIV is simpler; the number
> of instructions in the loop body isn't changed.  I suspect the
> regression comes from different addressing modes.  With the BIV, a
> complex addressing mode like [base + index << scale + disp] is used,
> rather than [base + disp].  I guess the former takes more micro-ops and
> is thus more expensive.  This guess can be confirmed by manually
> suppressing the complex addressing mode with a higher address cost.
>
> Now the problem becomes why the overall cost of the BIV is computed as
> lower while the actual cost is higher.  I noticed that for most
> affected loops, the loop header is bloated by iv elimination using the
> old address candidate.  The bloated loop header results in a much
> higher cost than the BIV, so the BIV is preferred.  I also noticed the
> bloated loop header can generally be simplified (I have a following
> patch for this).  After applying that local patch, the old address
> candidate is chosen, and most of the regression is recovered.  My
> conclusion is that the loop-header bloat should be blamed for the
> regression, and it can be resolved.
>
> Bootstrapped and tested on x86_64 and aarch64.  It fixes the failure of
> gcc.target/i386/pr49781-1.c, without new breakage.
>
> So what do you think?

The data above looks ok to me.

+static struct iv *
+find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
+{
+  aff_tree aff;
+  struct expand_data exp_data;
+
+  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
+    return iv;
+
+  /* Expand IV's ssa_name till the deriving biv is found.  */
+  exp_data.data = data;
+  exp_data.biv = NULL;
+  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
+                                 &aff, &data->name_expansion_cache,
+                                 stop_expand, &exp_data);
+  return exp_data.biv;

that's actually "abusing" tree_to_aff_combination_expand for simply walking
SSA uses and their defs' uses recursively until you hit "stop".  ISTR past
discussion about adding a generic walk_ssa_use interface for that.  Not sure
if it materialized under a name I can't remember or whether it didn't.
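For illustration, such an interface might look roughly like the sketch
below (hypothetical: walk_ssa_use names no existing function, and the
sketch assumes the usual GCC tree-ssa internal environment):

/* Sketch of a generic def-chain walker.  Walk upward from NAME through
   defining assignments, calling VISIT on every SSA name seen, and stop
   as soon as VISIT returns true.  PHI definitions end the walk, which
   also guarantees termination.  */

static bool
walk_ssa_defs (tree name, bool (*visit) (tree, void *), void *data)
{
  gimple stmt;
  ssa_op_iter iter;
  tree op;

  if (TREE_CODE (name) != SSA_NAME)
    return false;
  if (visit (name, data))
    return true;

  stmt = SSA_NAME_DEF_STMT (name);
  if (!is_gimple_assign (stmt))
    return false;

  /* Recurse into the SSA uses of the defining statement.  */
  FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
    if (walk_ssa_defs (op, visit, data))
      return true;

  return false;
}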

-  add_candidate (data, iv->base, iv->step, true, NULL);
+  /* Check if this biv is used in address type use.  */
+  if (iv->no_overflow  && iv->have_address_use
+      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
+      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
+    {
+      tree type = unsigned_type_for (sizetype);

sizetype is unsigned.

the rest looks ok to me but I really don't like the abuse of
tree_to_aff_combination_expand...

Thanks,
Richard.

> Thanks,
> bin
>
> 2015-08-31  Bin Cheng  <bin.cheng@arm.com>
>
>         * tree-affine.c (aff_combination_expand): New parameters.
>         (tree_to_aff_combination_expand): Ditto.
>         * tree-affine.h (aff_combination_expand): New declaration.
>         (tree_to_aff_combination_expand): Ditto.
>         * tree-ssa-loop-ivopts.c (struct iv, iv_cand): New fields.
>         (dump_iv): Dump no_overflow information.
>         (alloc_iv): Initialize new field for struct iv.
>         (struct expand_data): New struct for affine combination expanding.
>         (stop_expand): New callback func for affine combination expanding.
>         (find_deriving_biv_for_iv, record_biv_for_address_use): New
> functions.
>         (idx_find_step): Call new functions above.
>         (find_depends, add_candidate): New parameter.
>         (add_iv_candidate_for_biv): Add sizetype cand for BIV.
>         (get_computation_aff): Simplify conversion of cand for BIV.
>         (get_computation_cost_at): Step cand's base if necessary.
>
Bin.Cheng Sept. 8, 2015, 10:06 a.m. UTC | #2
On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>> Hi,
>> This patch is a new approach to fixing PR66388.  IVO today computes an
>> iv_use with an iv_cand that has at least the same type precision as the
>> use.  On 64-bit platforms like AArch64, this results in a different
>> iv_cand being created for each address type iv_use, and increased
>> register pressure.  As a matter of fact, the BIV should be used for all
>> iv_uses in some of these cases.  It is a latent bug that has recently
>> been getting worse because of the overflow changes.
>>
>> The original approach at
>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the
>> issue, except that it conflicts with IV elimination.  It seems to me
>> impossible to reconcile the two.
>>
>> This new approach fixes the issue by adding a sizetype iv_cand for BIVs
>> directly.  In cases where the original BIV is preferred, the sizetype
>> iv_cand will be chosen.  For code generation, the sizetype iv_cand has
>> the same effect as the original BIV.  Actually, it's better, because on
>> most targets the BIV needs to be explicitly extended to sizetype to be
>> used in an address expression.
>>
>> One shortcoming of this approach is that it may introduce more iv
>> candidates.  To minimize the impact, this patch does sophisticated code
>> analysis and adds a sizetype candidate for a BIV only if it is used as
>> an index.  Moreover, it avoids adding a candidate of the original type
>> if the BIV is only used as an index.  Statistics from compiling spec2k6
>> show that the increase in candidate number is modest and can be
>> ignored.
>>
>> There are two more patches following to fix corner cases revealed by
>> this one.  Together they bring an obvious performance improvement for
>> spec2k6/int on aarch64.
>> Spec2k6/int
>> 400.perlbench   3.44%
>> 445.gobmk       -0.86%
>> 456.hmmer       14.83%
>> 458.sjeng       2.49%
>> 462.libquantum  -0.79%
>> GEOMEAN         1.68%
>>
>> There is also about a 0.36% improvement for spec2k6/fp, mostly because
>> of 436.cactusADM.  I believe it can be further improved, but that
>> should be another patch.
>>
>> I also collected benchmark data for x86_64.  Spec2k6/fp is not
>> affected.  As for spec2k6/int, though the geomean is improved slightly,
>> 400.perlbench is regressed by ~3%.  I can see that BIVs are chosen for
>> some loops instead of address candidates.  Generally, the loop header
>> is simplified because iv elimination with a BIV is simpler; the number
>> of instructions in the loop body isn't changed.  I suspect the
>> regression comes from different addressing modes.  With the BIV, a
>> complex addressing mode like [base + index << scale + disp] is used,
>> rather than [base + disp].  I guess the former takes more micro-ops and
>> is thus more expensive.  This guess can be confirmed by manually
>> suppressing the complex addressing mode with a higher address cost.
>>
>> Now the problem becomes why the overall cost of the BIV is computed as
>> lower while the actual cost is higher.  I noticed that for most
>> affected loops, the loop header is bloated by iv elimination using the
>> old address candidate.  The bloated loop header results in a much
>> higher cost than the BIV, so the BIV is preferred.  I also noticed the
>> bloated loop header can generally be simplified (I have a following
>> patch for this).  After applying that local patch, the old address
>> candidate is chosen, and most of the regression is recovered.  My
>> conclusion is that the loop-header bloat should be blamed for the
>> regression, and it can be resolved.
>>
>> Bootstrapped and tested on x86_64 and aarch64.  It fixes the failure of
>> gcc.target/i386/pr49781-1.c, without new breakage.
>>
>> So what do you think?
>
> The data above looks ok to me.
>
> +static struct iv *
> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
> +{
> +  aff_tree aff;
> +  struct expand_data exp_data;
> +
> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
> +    return iv;
> +
> +  /* Expand IV's ssa_name till the deriving biv is found.  */
> +  exp_data.data = data;
> +  exp_data.biv = NULL;
> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
> +                                 &aff, &data->name_expansion_cache,
> +                                 stop_expand, &exp_data);
> +  return exp_data.biv;
>
> that's actually "abusing" tree_to_aff_combination_expand for simply walking
> SSA uses and their defs' uses recursively until you hit "stop".  ISTR past
> discussion about adding a generic walk_ssa_use interface for that.  Not sure
> if it materialized under a name I can't remember or whether it didn't.
Thanks for reviewing.  I didn't find an existing interface to walk up
the definition chains of SSA vars.  In this updated patch, I
implemented a simple function which meets the minimal requirement of
walking up the definition chains of BIV variables.  I also counted the
number of no_overflow BIVs that are not used in address type uses.
Since generally there are only two BIVs in a loop, this can prevent us
from visiting definition chains most of the time.  Statistics show that
the number of calls to find_deriving_biv_for_expr plummets.
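A minimal sketch of the shape such a walker can take (hypothetical;
the real find_deriving_biv_for_expr in the updated patch differs in
detail and assumes the surrounding context of tree-ssa-loop-ivopts.c):

/* Walk up the definition chain of EXPR through conversions and simple
   arithmetic until a biv is found; return NULL if there is none.  */

static struct iv *
find_deriving_biv_sketch (struct ivopts_data *data, tree expr)
{
  gimple stmt;
  struct iv *iv, *biv;

  if (TREE_CODE (expr) != SSA_NAME)
    return NULL;

  iv = get_iv (data, expr);
  if (iv == NULL || integer_zerop (iv->step))
    return NULL;
  if (iv->biv_p)
    return iv;

  stmt = SSA_NAME_DEF_STMT (expr);
  if (!is_gimple_assign (stmt))
    return NULL;

  switch (gimple_assign_rhs_code (stmt))
    {
    CASE_CONVERT:
    case PLUS_EXPR:
    case MINUS_EXPR:
    case POINTER_PLUS_EXPR:
      /* Try the first operand; for binary codes also the second.  */
      biv = find_deriving_biv_sketch (data, gimple_assign_rhs1 (stmt));
      if (biv == NULL && gimple_num_ops (stmt) > 2)
	biv = find_deriving_biv_sketch (data, gimple_assign_rhs2 (stmt));
      return biv;

    default:
      return NULL;
    }
}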

>
> -  add_candidate (data, iv->base, iv->step, true, NULL);
> +  /* Check if this biv is used in address type use.  */
> +  if (iv->no_overflow  && iv->have_address_use
> +      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
> +      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
> +    {
> +      tree type = unsigned_type_for (sizetype);
>
> sizetype is unsigned.
Fixed.

Bootstrapped and tested on x86_64; is this OK?

Thanks,
bin
>
> the rest looks ok to me but I really don't like the abuse of
> tree_to_aff_combination_expand...
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>>
>> 2015-08-31  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * tree-affine.c (aff_combination_expand): New parameters.
>>         (tree_to_aff_combination_expand): Ditto.
>>         * tree-affine.h (aff_combination_expand): New declaration.
>>         (tree_to_aff_combination_expand): Ditto.
>>         * tree-ssa-loop-ivopts.c (struct iv, iv_cand): New fields.
>>         (dump_iv): Dump no_overflow information.
>>         (alloc_iv): Initialize new field for struct iv.
>>         (struct expand_data): New struct for affine combination expanding.
>>         (stop_expand): New callback func for affine combination expanding.
>>         (find_deriving_biv_for_iv, record_biv_for_address_use): New
>> functions.
>>         (idx_find_step): Call new functions above.
>>         (find_depends, add_candidate): New parameter.
>>         (add_iv_candidate_for_biv): Add sizetype cand for BIV.
>>         (get_computation_aff): Simplify conversion of cand for BIV.
>>         (get_computation_cost_at): Step cand's base if necessary.
>>
Bin.Cheng Sept. 8, 2015, 10:07 a.m. UTC | #3
On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Hi,
>>> This patch is a new approach to fixing PR66388.  IVO today computes an
>>> iv_use with an iv_cand that has at least the same type precision as the
>>> use.  On 64-bit platforms like AArch64, this results in a different
>>> iv_cand being created for each address type iv_use, and increased
>>> register pressure.  As a matter of fact, the BIV should be used for all
>>> iv_uses in some of these cases.  It is a latent bug that has recently
>>> been getting worse because of the overflow changes.
>>>
>>> The original approach at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the
>>> issue, except that it conflicts with IV elimination.  It seems to me
>>> impossible to reconcile the two.
>>>
>>> This new approach fixes the issue by adding a sizetype iv_cand for BIVs
>>> directly.  In cases where the original BIV is preferred, the sizetype
>>> iv_cand will be chosen.  For code generation, the sizetype iv_cand has
>>> the same effect as the original BIV.  Actually, it's better, because on
>>> most targets the BIV needs to be explicitly extended to sizetype to be
>>> used in an address expression.
>>>
>>> One shortcoming of this approach is that it may introduce more iv
>>> candidates.  To minimize the impact, this patch does sophisticated code
>>> analysis and adds a sizetype candidate for a BIV only if it is used as
>>> an index.  Moreover, it avoids adding a candidate of the original type
>>> if the BIV is only used as an index.  Statistics from compiling spec2k6
>>> show that the increase in candidate number is modest and can be
>>> ignored.
>>>
>>> There are two more patches following to fix corner cases revealed by
>>> this one.  Together they bring an obvious performance improvement for
>>> spec2k6/int on aarch64.
>>> Spec2k6/int
>>> 400.perlbench   3.44%
>>> 445.gobmk       -0.86%
>>> 456.hmmer       14.83%
>>> 458.sjeng       2.49%
>>> 462.libquantum  -0.79%
>>> GEOMEAN         1.68%
>>>
>>> There is also about a 0.36% improvement for spec2k6/fp, mostly because
>>> of 436.cactusADM.  I believe it can be further improved, but that
>>> should be another patch.
>>>
>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not
>>> affected.  As for spec2k6/int, though the geomean is improved slightly,
>>> 400.perlbench is regressed by ~3%.  I can see that BIVs are chosen for
>>> some loops instead of address candidates.  Generally, the loop header
>>> is simplified because iv elimination with a BIV is simpler; the number
>>> of instructions in the loop body isn't changed.  I suspect the
>>> regression comes from different addressing modes.  With the BIV, a
>>> complex addressing mode like [base + index << scale + disp] is used,
>>> rather than [base + disp].  I guess the former takes more micro-ops and
>>> is thus more expensive.  This guess can be confirmed by manually
>>> suppressing the complex addressing mode with a higher address cost.
>>>
>>> Now the problem becomes why the overall cost of the BIV is computed as
>>> lower while the actual cost is higher.  I noticed that for most
>>> affected loops, the loop header is bloated by iv elimination using the
>>> old address candidate.  The bloated loop header results in a much
>>> higher cost than the BIV, so the BIV is preferred.  I also noticed the
>>> bloated loop header can generally be simplified (I have a following
>>> patch for this).  After applying that local patch, the old address
>>> candidate is chosen, and most of the regression is recovered.  My
>>> conclusion is that the loop-header bloat should be blamed for the
>>> regression, and it can be resolved.
>>>
>>> Bootstrapped and tested on x86_64 and aarch64.  It fixes the failure of
>>> gcc.target/i386/pr49781-1.c, without new breakage.
>>>
>>> So what do you think?
>>
>> The data above looks ok to me.
>>
>> +static struct iv *
>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
>> +{
>> +  aff_tree aff;
>> +  struct expand_data exp_data;
>> +
>> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
>> +    return iv;
>> +
>> +  /* Expand IV's ssa_name till the deriving biv is found.  */
>> +  exp_data.data = data;
>> +  exp_data.biv = NULL;
>> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
>> +                                 &aff, &data->name_expansion_cache,
>> +                                 stop_expand, &exp_data);
>> +  return exp_data.biv;
>>
>> that's actually "abusing" tree_to_aff_combination_expand for simply walking
>> SSA uses and their defs' uses recursively until you hit "stop".  ISTR past
>> discussion about adding a generic walk_ssa_use interface for that.  Not sure
>> if it materialized under a name I can't remember or whether it didn't.
> Thanks for reviewing.  I didn't find an existing interface to walk up
> the definition chains of SSA vars.  In this updated patch, I
> implemented a simple function which meets the minimal requirement of
> walking up the definition chains of BIV variables.  I also counted the
> number of no_overflow BIVs that are not used in address type uses.
> Since generally there are only two BIVs in a loop, this can prevent us
> from visiting definition chains most of the time.  Statistics show that
> the number of calls to find_deriving_biv_for_expr plummets.
>
>>
>> -  add_candidate (data, iv->base, iv->step, true, NULL);
>> +  /* Check if this biv is used in address type use.  */
>> +  if (iv->no_overflow  && iv->have_address_use
>> +      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
>> +      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
>> +    {
>> +      tree type = unsigned_type_for (sizetype);
>>
>> sizetype is unsigned.
> Fixed.
>
> Bootstrapped and tested on x86_64; is this OK?
>
> Thanks,
> bin

And here is the updated ChangeLog entry.


2015-09-08  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/66388
    * tree-ssa-loop-ivopts.c (struct iv, iv_cand, ivopts_data): New
    fields.
    (dump_iv): Dump no_overflow information.
    (alloc_iv): Initialize new field for struct iv.
    (mark_bivs): Count number of no_overflow bivs.
    (find_deriving_biv_for_expr, record_biv_for_address_use): New
    functions.
    (idx_find_step): Call new functions above.
    (add_candidate_1, add_candidate): New parameter.
    (add_iv_candidate_for_biv): Add sizetype cand for BIV.
    (get_computation_aff): Simplify conversion of cand for BIV.
    (get_computation_cost_at): Step cand's base if necessary.

Patch

Index: gcc/tree-affine.c
===================================================================
--- gcc/tree-affine.c	(revision 227163)
+++ gcc/tree-affine.c	(working copy)
@@ -625,11 +656,14 @@  struct name_expansion
 };
 
 /* Expands SSA names in COMB recursively.  CACHE is used to cache the
-   results.  */
+   results.  If callback function STOP is specified, this function stops
+   expanding when it returns TRUE.  DATA points to private structure
+   used by the callback function.  */
 
 void
 aff_combination_expand (aff_tree *comb ATTRIBUTE_UNUSED,
-			hash_map<tree, name_expansion *> **cache)
+			hash_map<tree, name_expansion *> **cache,
+			bool (*stop) (void *, void *), void *data)
 {
   unsigned i;
   aff_tree to_add, current, curre;
@@ -654,6 +688,11 @@  aff_combination_expand (aff_tree *comb ATTRIBUTE_U
 	name = TREE_OPERAND (e, 0);
       if (TREE_CODE (name) != SSA_NAME)
 	continue;
+
+      /* Don't expand further if STOP returns TRUE.  */
+      if (stop != NULL && (*stop) (data, name))
+	continue;
+
       def = SSA_NAME_DEF_STMT (name);
       if (!is_gimple_assign (def) || gimple_assign_lhs (def) != name)
 	continue;
@@ -702,7 +741,8 @@  aff_combination_expand (aff_tree *comb ATTRIBUTE_U
 	      if (e != name)
 		rhs = fold_convert (type, rhs);
 	    }
-	  tree_to_aff_combination_expand (rhs, comb->type, &current, cache);
+	  tree_to_aff_combination_expand (rhs, comb->type,
+					  &current, cache, stop, data);
 	  exp->expansion = current;
 	  exp->in_progress = 0;
 	}
@@ -735,14 +775,19 @@  aff_combination_expand (aff_tree *comb ATTRIBUTE_U
    a1 = a0 + a0;
    a2 = a1 + a1;
    a3 = a2 + a2;
-   ...  */
+   ...
+
+   If callback function STOP is specified, this function stops expanding
+   when it returns TRUE.  DATA points to private structure used by the
+   callback function.  */
 
 void
 tree_to_aff_combination_expand (tree expr, tree type, aff_tree *comb,
-				hash_map<tree, name_expansion *> **cache)
+				hash_map<tree, name_expansion *> **cache,
+				bool (*stop) (void *, void *), void *data)
 {
   tree_to_aff_combination (expr, type, comb);
-  aff_combination_expand (comb, cache);
+  aff_combination_expand (comb, cache, stop, data);
 }
 
 /* Frees memory occupied by struct name_expansion in *VALUE.  Callback for
Index: gcc/tree-affine.h
===================================================================
--- gcc/tree-affine.h	(revision 227163)
+++ gcc/tree-affine.h	(working copy)
@@ -77,9 +77,12 @@  void tree_to_aff_combination (tree, tree, aff_tree
 tree aff_combination_to_tree (aff_tree *);
 void unshare_aff_combination (aff_tree *);
 bool aff_combination_constant_multiple_p (aff_tree *, aff_tree *, widest_int *);
-void aff_combination_expand (aff_tree *, hash_map<tree, name_expansion *> **);
+void aff_combination_expand (aff_tree *, hash_map<tree, name_expansion *> **,
+			     bool (*)(void *, void *) = NULL, void * = NULL);
 void tree_to_aff_combination_expand (tree, tree, aff_tree *,
-				     hash_map<tree, name_expansion *> **);
+				     hash_map<tree, name_expansion *> **,
+				     bool (*)(void *, void *) = NULL,
+				     void * = NULL);
 tree get_inner_reference_aff (tree, aff_tree *, widest_int *);
 void free_affine_expand_cache (hash_map<tree, name_expansion *> **);
 bool aff_comb_cannot_overlap_p (aff_tree *, const widest_int &,
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 227163)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -147,6 +147,8 @@  struct iv
   bool biv_p;		/* Is it a biv?  */
   bool have_use_for;	/* Do we already have a use for it?  */
   bool no_overflow;	/* True if the iv doesn't overflow.  */
+  bool have_address_use;/* For biv, indicates if it's used in any address
+			   type use.  */
 };
 
 /* Per-ssa version information (induction variable descriptions, etc.).  */
@@ -251,6 +253,8 @@  struct iv_cand
 			      where it is incremented.  */
   bitmap depends_on;	/* The list of invariants that are used in step of the
 			   biv.  */
+  struct iv *orig_iv;	/* The original iv if this cand is added from biv with
+			   smaller type.  */
 };
 
 /* Loop invariant expression hashtable entry.  */
@@ -529,6 +533,9 @@  dump_iv (FILE *file, struct iv *iv, bool dump_name
 
   if (iv->biv_p)
     fprintf (file, "  is a biv\n");
+
+  if (iv->no_overflow)
+    fprintf (file, "  iv doesn't overflow w.r.t. loop niter\n");
 }
 
 /* Dumps information about the USE to FILE.  */
@@ -1013,6 +1020,7 @@  alloc_iv (struct ivopts_data *data, tree base, tre
   iv->use_id = 0;
   iv->ssa_name = NULL_TREE;
   iv->no_overflow = no_overflow;
+  iv->have_address_use = false;
 
   return iv;
 }
@@ -1621,6 +1629,106 @@  expr_invariant_in_loop_p (struct loop *loop, tree
   return true;
 }
 
+/* Local data for affine combination expanding.  */
+
+struct expand_data
+{
+  struct ivopts_data *data;
+  struct iv *biv;
+};
+
+/* Callback function to tree_to_aff_combination_expand.
+
+   Return true if a biv or loop invariant is encountered, false otherwise.  */
+
+static bool
+stop_expand (void *d1, void *d2)
+{
+  struct expand_data *exp_data = (struct expand_data *)d1;
+  tree ssa_name = (tree)d2;
+  struct ivopts_data *data;
+  struct iv *iv;
+
+  if (!exp_data || !ssa_name || TREE_CODE (ssa_name) != SSA_NAME)
+    return false;
+
+  data = exp_data->data;
+  gcc_assert (data != NULL);
+  iv = get_iv (data, ssa_name);
+  if (!iv || integer_zerop (iv->step))
+    return true;
+
+  if (iv->biv_p)
+    {
+      exp_data->biv = iv;
+      return true;
+    }
+
+  return false;
+}
+
+/* Return biv from which the IV is derived.  */
+
+static struct iv *
+find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
+{
+  aff_tree aff;
+  struct expand_data exp_data;
+
+  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
+    return iv;
+
+  /* Expand IV's ssa_name till the deriving biv is found.  */
+  exp_data.data = data;
+  exp_data.biv = NULL;
+  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
+				  &aff, &data->name_expansion_cache,
+				  stop_expand, &exp_data);
+  return exp_data.biv;
+}
+
+/* Record that BIV and any biv that is its predecessor or successor are
+   used in address type uses.  */
+
+static void
+record_biv_for_address_use (struct ivopts_data *data, struct iv *biv)
+{
+  unsigned i;
+  tree type, base_1, base_2;
+  bitmap_iterator bi;
+
+  if (!biv || !biv->biv_p || integer_zerop (biv->step)
+      || biv->have_address_use || !biv->no_overflow)
+    return;
+
+  type = TREE_TYPE (biv->base);
+  if (!INTEGRAL_TYPE_P (type))
+    return;
+
+  biv->have_address_use = true;
+  base_1 = fold_build2 (PLUS_EXPR, type, biv->base, biv->step);
+  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
+    {
+      struct iv *iv = ver_info (data, i)->iv;
+
+      if (!iv || !iv->biv_p || integer_zerop (iv->step)
+	  || iv->have_address_use || !iv->no_overflow)
+	continue;
+
+      if (type != TREE_TYPE (iv->base)
+	  || !INTEGRAL_TYPE_P (TREE_TYPE (iv->base)))
+	continue;
+
+      if (!operand_equal_p (biv->step, iv->step, 0))
+	continue;
+
+      base_2 = fold_build2 (PLUS_EXPR, type, iv->base, iv->step);
+      if (operand_equal_p (base_1, iv->base, 0)
+	  || operand_equal_p (base_2, biv->base, 0))
+	iv->have_address_use = true;
+    }
+}
+
 /* Cumulates the steps of indices into DATA and replaces their values with the
    initial ones.  Returns false when the value of the index cannot be determined.
    Callback for for_each_index.  */
@@ -1711,6 +1819,10 @@  idx_find_step (tree base, tree *idx, void *data)
   step = fold_build2 (MULT_EXPR, sizetype, step, iv_step);
   dta->step = fold_build2 (PLUS_EXPR, sizetype, dta->step, step);
 
+  if (!iv->biv_p)
+    iv = find_deriving_biv_for_iv (dta->ivopts_data, iv);
+
+  record_biv_for_address_use (dta->ivopts_data, iv);
   return true;
 }
 
@@ -2603,7 +2715,8 @@  find_depends (tree *expr_p, int *ws ATTRIBUTE_UNUS
 static struct iv_cand *
 add_candidate_1 (struct ivopts_data *data,
 		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple incremented_at)
+		 struct iv_use *use, gimple incremented_at,
+		 struct iv *orig_iv = NULL)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
@@ -2685,6 +2798,7 @@  add_candidate_1 (struct ivopts_data *data,
       else
 	cand->ainc_use = NULL;
 
+      cand->orig_iv = orig_iv;
       if (dump_file && (dump_flags & TDF_DETAILS))
 	dump_cand (dump_file, cand);
     }
@@ -2793,15 +2907,17 @@  add_autoinc_candidates (struct ivopts_data *data,
 
 static void
 add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use)
+	       tree base, tree step, bool important, struct iv_use *use,
+	       struct iv *orig_iv = NULL)
 {
   gcc_assert (use == NULL || use->sub_id == 0);
 
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL);
+    add_candidate_1 (data, base, step, important,
+		     IP_NORMAL, use, NULL, orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -2836,8 +2952,24 @@  add_iv_candidate_for_biv (struct ivopts_data *data
   tree def;
   struct iv_cand *cand;
 
-  add_candidate (data, iv->base, iv->step, true, NULL);
+  /* Check if this biv is used in address type use.  */
+  if (iv->no_overflow  && iv->have_address_use
+      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
+      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
+    {
+      tree type = unsigned_type_for (sizetype);
+      tree base = fold_convert (type, iv->base);
+      tree step = fold_convert (type, iv->step);
 
+      /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
+      add_candidate (data, base, step, true, NULL, iv);
+      /* Add iv cand of the original type only if it has nonlinear use.  */
+      if (iv->have_use_for)
+	add_candidate (data, iv->base, iv->step, true, NULL);
+    }
+  else
+    add_candidate (data, iv->base, iv->step, true, NULL);
+
   /* The same, but with initial value zero.  */
   if (POINTER_TYPE_P (TREE_TYPE (iv->base)))
     add_candidate (data, size_int (0), iv->step, true, NULL);
@@ -3358,6 +3490,28 @@  get_computation_aff (struct loop *loop,
   /* If the conversion is not noop, perform it.  */
   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
     {
+      if (cand->orig_iv != NULL && CONVERT_EXPR_P (cbase)
+	  && (CONVERT_EXPR_P (cstep) || TREE_CODE (cstep) == INTEGER_CST))
+	{
+	  tree inner_base, inner_step, inner_type;
+	  inner_base = TREE_OPERAND (cbase, 0);
+	  if (CONVERT_EXPR_P (cstep))
+	    inner_step = TREE_OPERAND (cstep, 0);
+	  else
+	    inner_step = cstep;
+
+	  inner_type = TREE_TYPE (inner_base);
+	  /* If candidate is added from a biv whose type is smaller than
+	     ctype, we know both candidate and the biv won't overflow.
+	     In this case, it's safe to skip the conversion in candidate.
+	     As an example, (unsigned short)((unsigned long)A) equals
+	     (unsigned short)A, if A has a type no larger than short.  */
+	  if (TYPE_PRECISION (inner_type) <= TYPE_PRECISION (uutype))
+	    {
+	      cbase = inner_base;
+	      cstep = inner_step;
+	    }
+	}
       cstep = fold_convert (uutype, cstep);
       cbase = fold_convert (uutype, cbase);
       var = fold_convert (uutype, var);
@@ -4525,6 +4681,13 @@  get_computation_cost_at (struct ivopts_data *data,
 		(ratio, mem_mode,
 			TYPE_ADDR_SPACE (TREE_TYPE (utype))))
     {
+      if (cstepi == 0 && stmt_is_after_inc)
+	{
+	  if (POINTER_TYPE_P (ctype))
+	    cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+	  else
+	    cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+	}
       cbase
 	= fold_build2 (MULT_EXPR, ctype, cbase, build_int_cst (ctype, ratio));
       cost = difference_cost (data,