
[PR66388] Add sizetype cand for BIV of smaller type if it's used as index of memory ref

Message ID CAHFci2-C=VDu0Cow5o35sjsQ_0uLZB3di0h-JUbkgUYXA5UYgw@mail.gmail.com
State New

Commit Message

Bin.Cheng Sept. 15, 2015, 5:56 a.m. UTC
Just realized that I missed the updated patch before.  Here it is...

Thanks,
bin

On Tue, Sep 8, 2015 at 6:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>> Hi,
>>>> This patch is a new approach to fixing PR66388.  IVO today computes an
>>>> iv_use with an iv_cand which has at least the same type precision as
>>>> the use.  On 64-bit platforms like AArch64, this results in a different
>>>> iv_cand being created for each address type iv_use, and increased
>>>> register pressure.  In fact, the BIV should be used for all iv_uses in
>>>> some of these cases.  It is a latent bug, but it recently got worse
>>>> because of the overflow changes.
>>>>
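>>>> As a concrete (hypothetical) illustration, a loop of the following
>>>> shape runs into the problem: the narrow BIV 'i' has to be widened
>>>> before it can be used as an index on a 64-bit target.
>>>>
>>>>   void
>>>>   foo (int *a, unsigned short n)
>>>>   {
>>>>     unsigned short i;
>>>>
>>>>     /* 'i' is a BIV narrower than sizetype; each address use a[i]
>>>>        needs it extended, so IVO ends up creating a separate 64-bit
>>>>        candidate for the address iv_use instead of reusing 'i'.  */
>>>>     for (i = 0; i < n; i++)
>>>>       a[i] = 0;
>>>>   }
>>>>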
>>>> The original approach at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the
>>>> issue, except that it conflicts with IV elimination.  It seems to me
>>>> impossible to reconcile the two.
>>>>
>>>> This new approach fixes the issue by adding a sizetype iv_cand for
>>>> BIVs directly.  In cases where the original BIV is preferred, the
>>>> sizetype iv_cand will be chosen.  As for code generation, the sizetype
>>>> iv_cand has the same effect as the original BIV.  Actually, it's
>>>> better, because the BIV needs to be explicitly extended to sizetype
>>>> before it can be used in address expressions on most targets.
>>>>
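>>>> A rough C-level sketch of the two shapes (made-up code, not actual
>>>> IVO output; 'iw' is an invented name for the sizetype candidate):
>>>>
>>>>   /* Original narrow BIV: extended at every address use.  */
>>>>   void
>>>>   use_biv (int *a, unsigned short n)
>>>>   {
>>>>     for (unsigned short i = 0; i < n; i++)
>>>>       a[(unsigned long) i] = 0;
>>>>   }
>>>>
>>>>   /* sizetype candidate: the widened value is itself the induction
>>>>      variable, so no extension is needed in the address.  Legal
>>>>      precisely because the BIV was proven no_overflow.  */
>>>>   void
>>>>   use_sizetype_cand (int *a, unsigned short n)
>>>>   {
>>>>     for (unsigned long iw = 0; iw < n; iw++)
>>>>       a[iw] = 0;
>>>>   }
>>>>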
>>>> One shortcoming of this approach is that it may introduce more iv
>>>> candidates.  To minimize the impact, this patch does sophisticated
>>>> code analysis and adds a sizetype candidate for a BIV only if it is
>>>> used as the index of a memory reference.  Moreover, it avoids adding a
>>>> candidate of the original type if the BIV is only used as an index.
>>>> Statistics from compiling spec2k6 show that the increase in the number
>>>> of candidates is modest and can be ignored.
>>>>
>>>> There are two more patches following, fixing corner cases revealed by
>>>> this one.  Together they bring an obvious performance improvement for
>>>> spec2k6/int on aarch64.
>>>> Spec2k6/int
>>>> 400.perlbench   3.44%
>>>> 445.gobmk       -0.86%
>>>> 456.hmmer       14.83%
>>>> 458.sjeng       2.49%
>>>> 462.libquantum  -0.79%
>>>> GEOMEAN         1.68%
>>>>
>>>> There is also an improvement of about 0.36% for spec2k6/fp, mostly
>>>> from 436.cactusADM.  I believe it can be improved further, but that
>>>> should be another patch.
>>>>
>>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not
>>>> affected.  As for spec2k6/int, though the geomean is improved
>>>> slightly, 400.perlbench regresses by ~3%.  I can see BIVs being chosen
>>>> for some loops instead of address candidates.  Generally, the loop
>>>> header is simplified because iv elimination with the BIV is simpler;
>>>> the number of instructions in the loop body isn't changed.  I suspect
>>>> the regression comes from different addressing modes.  With the BIV, a
>>>> complex addressing mode like [base + index << scale + disp] is used,
>>>> rather than [base + disp].  I guess the former decodes to more
>>>> micro-ops and is thus more expensive.  This guess can be confirmed by
>>>> manually suppressing the complex addressing mode with a higher address
>>>> cost.
>>>>
>>>> The question then becomes why the overall cost of the BIV is computed
>>>> as lower while its actual cost is higher.  I noticed that for most
>>>> affected loops, the loop header is bloated by iv elimination using the
>>>> old address candidate.  The bloated loop header results in a much
>>>> higher cost than the BIV's; as a result, the BIV is preferred.  I also
>>>> noticed that the bloated loop header can generally be simplified (I
>>>> have a follow-up patch for this).  After applying that local patch,
>>>> the old address candidate is chosen and most of the regression is
>>>> recovered.  My conclusion is that the loop header bloat is to blame
>>>> for the regression, and it can be resolved.
>>>>
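>>>> To make the bloat concrete, here is a made-up C-level sketch (not a
>>>> dump from the affected loops) of the two exit-test shapes:
>>>>
>>>>   /* Exit test eliminated using the BIV: nothing extra is needed in
>>>>      the loop header.  */
>>>>   int
>>>>   biv_test (unsigned short i, unsigned short n)
>>>>   {
>>>>     return i < n;
>>>>   }
>>>>
>>>>   /* Exit test eliminated using the old address candidate: the bound
>>>>      must first be recomputed from the candidate, which adds
>>>>      statements to the loop header.  */
>>>>   int
>>>>   addr_test (int *p, int *a, unsigned short n)
>>>>   {
>>>>     int *end = a + n;   /* extra header computation */
>>>>     return p != end;
>>>>   }
>>>>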
>>>> Bootstrapped and tested on x86_64 and aarch64.  It fixes the failure
>>>> of gcc.target/i386/pr49781-1.c, with no new breakage.
>>>>
>>>> So what do you think?
>>>
>>> The data above looks ok to me.
>>>
>>> +static struct iv *
>>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
>>> +{
>>> +  aff_tree aff;
>>> +  struct expand_data exp_data;
>>> +
>>> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
>>> +    return iv;
>>> +
>>> +  /* Expand IV's ssa_name till the deriving biv is found.  */
>>> +  exp_data.data = data;
>>> +  exp_data.biv = NULL;
>>> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
>>> +                                 &aff, &data->name_expansion_cache,
>>> +                                 stop_expand, &exp_data);
>>> +  return exp_data.biv;
>>>
>>> that's actually "abusing" tree_to_aff_combination_expand for simply
>>> walking SSA uses and their defs' uses recursively until you hit "stop".
>>> ISTR past discussion about adding a generic walk_ssa_use interface for
>>> that.  Not sure if it materialized under a name I can't remember or
>>> whether it didn't.
>> Thanks for reviewing.  I didn't find an existing interface for walking
>> up the definition chains of ssa vars.  In this updated patch, I
>> implemented a simple function which meets the minimal requirement of
>> walking up the definition chains of BIV variables.  I also count the
>> number of no_overflow BIVs that are not used in address type uses.
>> Since there are generally only two BIVs in a loop, this prevents us
>> from visiting definition chains most of the time.  Statistics show that
>> the number of calls to find_deriving_biv_for_expr plummets.
>>
>>>
>>> -  add_candidate (data, iv->base, iv->step, true, NULL);
>>> +  /* Check if this biv is used in address type use.  */
>>> +  if (iv->no_overflow  && iv->have_address_use
>>> +      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
>>> +      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
>>> +    {
>>> +      tree type = unsigned_type_for (sizetype);
>>>
>>> sizetype is unsigned.
>> Fixed.
>>
>> Bootstrapped and tested on x86_64; is this OK?
>>
>> Thanks,
>> bin
>
> And here is the updated ChangeLog entry.
>
>
> 2015-09-08  Bin Cheng  <bin.cheng@arm.com>
>
>     PR tree-optimization/66388
>     * tree-ssa-loop-ivopts.c (struct iv, iv_cand, ivopts_data): New
>     fields.
>     (dump_iv): Dump no_overflow information.
>     (alloc_iv): Initialize new field for struct iv.
>     (mark_bivs): Count number of no_overflow bivs.
>     (find_deriving_biv_for_expr, record_biv_for_address_use): New
>     functions.
>     (idx_find_step): Call new functions above.
>     (add_candidate_1, add_candidate): New parameter.
>     (add_iv_candidate_for_biv): Add sizetype cand for BIV.
>     (get_computation_aff): Simplify conversion of cand for BIV.
>     (get_computation_cost_at): Step cand's base if necessary.

Comments

Richard Biener Sept. 16, 2015, 11:46 a.m. UTC | #1
Ok.

Thanks,
Richard.

On Tue, Sep 15, 2015 at 7:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> Just realized that I missed the updated patch before.  Here it is...
>
> Thanks,
> bin
>
> On Tue, Sep 8, 2015 at 6:07 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>> Hi,
>>>>> This patch is a new approach to fix PR66388.  IVO today computes iv_use with
>>>>> iv_cand which has at least same type precision as the use.  On 64bit
>>>>> platforms like AArch64, this results in different iv_cand created for each
>>>>> address type iv_use, and register pressure increased.  As a matter of fact,
>>>>> the BIV should be used for all iv_uses in some of these cases.  It is a
>>>>> latent bug but recently getting worse because of overflow changes.
>>>>>
>>>>> The original approach at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue
>>>>> except it conflict with IV elimination.  Seems to me it is impossible to
>>>>> mitigate the contradiction.
>>>>>
>>>>> This new approach fixes the issue by adding sizetype iv_cand for BIVs
>>>>> directly.  In cases if the original BIV is preferred, the sizetype iv_cand
>>>>> will be chosen.  As for code generation, the sizetype iv_cand has the same
>>>>> effect as the original BIV.  Actually, it's better because BIV needs to be
>>>>> explicitly extended to sizetype to be used in address expression on most
>>>>> targets.
>>>>>
>>>>> One shortage of this approach is it may introduce more iv candidates.  To
>>>>> minimize the impact, this patch does sophisticated code analysis and adds
>>>>> sizetype candidate for BIV only if it is used as index.  Moreover, it avoids
>>>>> to add candidate of the original type if the BIV is only used as index.
>>>>> Statistics for compiling spec2k6 shows increase of candidate number is
>>>>> modest and can be ignored.
>>>>>
>>>>> There are two more patches following to fix corner cases revealed by this
>>>>> one.  In together they bring obvious perf improvement for spec26k/int on
>>>>> aarch64.
>>>>> Spec2k6/int
>>>>> 400.perlbench   3.44%
>>>>> 445.gobmk       -0.86%
>>>>> 456.hmmer       14.83%
>>>>> 458.sjeng       2.49%
>>>>> 462.libquantum  -0.79%
>>>>> GEOMEAN         1.68%
>>>>>
>>>>> There is also about 0.36% improvement for spec2k6/fp, mostly because of case
>>>>> 436.cactusADM.  I believe it can be further improved, but that should be
>>>>> another patch.
>>>>>
>>>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not affected.  As
>>>>> for spec2k6/int, though the geomean is improved slightly, 400.perlbench is
>>>>> regressed by ~3%.  I can see BIVs are chosen for some loops instead of
>>>>> address candidates.  Generally, the loop header will be simplified because
>>>>> iv elimination with BIV is simpler; the number of instructions in loop body
>>>>> isn't changed.  I suspect the regression comes from different addressing
>>>>> modes.  With BIV, complex addressing mode like [base + index << scale +
>>>>> disp] is used, rather than [base + disp].  I guess the former has more
>>>>> micro-ops, thus more expensive.  This guess can be confirmed by manually
>>>>> suppressing the complex addressing mode with higher address cost.
>>>>> Now the problem becomes why overall cost of BIV is computed lower while the
>>>>> actual cost is higher.  I noticed for most affected loops, loop header is
>>>>> bloated because of iv elimination using the old address candidate.  The
>>>>> bloated loop header results in much higher cost than BIV.  As a result, BIV
>>>>> is preferred.  I also noticed the bloated loop header generally can be
>>>>> simplified (I have a following patch for this).  After applying the local
>>>>> patch, the old address candidate is chosen, and most of regression is
>>>>> recovered.
>>>>> Conclusion is I think loop header bloated issue should be blamed for the
>>>>> regression, and it can be resolved.
>>>>>
>>>>> Bootstrap and test on x64_64 and aarch64.  It fixes failure of
>>>>> gcc.target/i386/pr49781-1.c, without new breakage.
>>>>>
>>>>> So what do you think?
>>>>
>>>> The data above looks ok to me.
>>>>
>>>> +static struct iv *
>>>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
>>>> +{
>>>> +  aff_tree aff;
>>>> +  struct expand_data exp_data;
>>>> +
>>>> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
>>>> +    return iv;
>>>> +
>>>> +  /* Expand IV's ssa_name till the deriving biv is found.  */
>>>> +  exp_data.data = data;
>>>> +  exp_data.biv = NULL;
>>>> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
>>>> +                                 &aff, &data->name_expansion_cache,
>>>> +                                 stop_expand, &exp_data);
>>>> +  return exp_data.biv;
>>>>
>>>> that's actually "abusing" tree_to_aff_combination_expand for simply walking
>>>> SSA uses and their defs uses recursively until you hit "stop".  ISTR past
>>>> discussion to add a generic walk_ssa_use interface for that.  Not sure if it
>>>> materialized with a name I can't remember or whether it didn't.
>>> Thanks for reviewing.  I didn't found existing interface to walk up
>>> definition chains of ssa vars.  In this updated patch, I implemented a
>>> simple function which meets the minimal requirement of walking up
>>> definition chains of BIV variables.  I also counted number of
>>> no_overflow BIVs that are not used in address type use.  Since
>>> generally there are only two BIVs in a loop, this can prevent us from
>>> visiting definition chains most of time.  Statistics shows that number
>>> of call to find_deriving_biv_for_expr plummet.
>>>
>>>>
>>>> -  add_candidate (data, iv->base, iv->step, true, NULL);
>>>> +  /* Check if this biv is used in address type use.  */
>>>> +  if (iv->no_overflow  && iv->have_address_use
>>>> +      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
>>>> +      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
>>>> +    {
>>>> +      tree type = unsigned_type_for (sizetype);
>>>>
>>>> sizetype is unsigned.
>>> Fixed.
>>>
>>> Bootstrap and test on x86_64, is this OK?
>>>
>>> Thanks,
>>> bin
>>
>> And here is the updated Changelog entry.
>>
>>
>> 2015-09-08  Bin Cheng  <bin.cheng@arm.com>
>>
>>     PR tree-optimization/66388
>>     * tree-ssa-loop-ivopts.c (struct iv, iv_cand, ivopts_data): New
>>     fields.
>>     (dump_iv): Dump no_overflow information.
>>     (alloc_iv): Initialize new field for struct iv.
>>     (mark_bivs): Count number of no_overflow bivs.
>>     (find_deriving_biv_for_expr, record_biv_for_address_use): New
>>     functions.
>>     (idx_find_step): Call new functions above.
>>     (add_candidate_1, add_candidate): New paramter.
>>     (add_iv_candidate_for_biv): Add sizetype cand for BIV.
>>     (get_computation_aff): Simplify convertion of cand for BIV.
>>     (get_computation_cost_at): Step cand's base if necessary.
diff mbox

Patch

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 227163)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -147,6 +147,8 @@  struct iv
   bool biv_p;		/* Is it a biv?  */
   bool have_use_for;	/* Do we already have a use for it?  */
   bool no_overflow;	/* True if the iv doesn't overflow.  */
+  bool have_address_use;/* For biv, indicate if it's used in any address
+			   type use.  */
 };
 
 /* Per-ssa version information (induction variable descriptions, etc.).  */
@@ -251,6 +253,8 @@  struct iv_cand
 			      where it is incremented.  */
   bitmap depends_on;	/* The list of invariants that are used in step of the
 			   biv.  */
+  struct iv *orig_iv;	/* The original iv if this cand is added from biv with
+			   smaller type.  */
 };
 
 /* Loop invariant expression hashtable entry.  */
@@ -336,6 +340,9 @@  struct ivopts_data
   /* The maximum invariant id.  */
   unsigned max_inv_id;
 
+  /* Number of no_overflow BIVs which are not used in memory address.  */
+  unsigned bivs_not_used_in_addr;
+
   /* Obstack for iv structure.  */
   struct obstack iv_obstack;
 
@@ -529,6 +536,9 @@  dump_iv (FILE *file, struct iv *iv, bool dump_name
 
   if (iv->biv_p)
     fprintf (file, "  is a biv\n");
+
+  if (iv->no_overflow)
+    fprintf (file, "  iv doesn't overflow wrto loop niter\n");
 }
 
 /* Dumps information about the USE to FILE.  */
@@ -1013,6 +1023,7 @@  alloc_iv (struct ivopts_data *data, tree base, tre
   iv->use_id = 0;
   iv->ssa_name = NULL_TREE;
   iv->no_overflow = no_overflow;
+  iv->have_address_use = false;
 
   return iv;
 }
@@ -1155,6 +1166,7 @@  mark_bivs (struct ivopts_data *data)
   basic_block incr_bb;
   gphi_iterator psi;
 
+  data->bivs_not_used_in_addr = 0;
   for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
     {
       phi = psi.phi ();
@@ -1183,6 +1195,10 @@  mark_bivs (struct ivopts_data *data)
 
       iv->biv_p = true;
       incr_iv->biv_p = true;
+      if (iv->no_overflow)
+	data->bivs_not_used_in_addr++;
+      if (incr_iv->no_overflow)
+	data->bivs_not_used_in_addr++;
     }
 }
 
@@ -1621,6 +1637,143 @@  expr_invariant_in_loop_p (struct loop *loop, tree
   return true;
 }
 
+/* Given expression EXPR which computes an inductive value with respect
+   to the loop recorded in DATA, return the biv from which EXPR is
+   derived, by tracing definition chains of ssa variables in EXPR.  */
+
+static struct iv*
+find_deriving_biv_for_expr (struct ivopts_data *data, tree expr)
+{
+  struct iv *iv;
+  unsigned i, n;
+  tree e2, e1;
+  enum tree_code code;
+  gimple stmt;
+
+  if (expr == NULL_TREE)
+    return NULL;
+
+  if (is_gimple_min_invariant (expr))
+    return NULL;
+
+  code = TREE_CODE (expr);
+  if (IS_EXPR_CODE_CLASS (TREE_CODE_CLASS (code)))
+    {
+      n = TREE_OPERAND_LENGTH (expr);
+      for (i = 0; i < n; i++)
+	{
+	  iv = find_deriving_biv_for_expr (data, TREE_OPERAND (expr, i));
+	  if (iv)
+	    return iv;
+	}
+    }
+
+  /* Stop if it's not ssa name.  */
+  if (code != SSA_NAME)
+    return NULL;
+
+  iv = get_iv (data, expr);
+  if (!iv || integer_zerop (iv->step))
+    return NULL;
+  else if (iv->biv_p)
+    return iv;
+
+  stmt = SSA_NAME_DEF_STMT (expr);
+  if (gphi *phi = dyn_cast <gphi *> (stmt))
+    {
+      ssa_op_iter iter;
+      use_operand_p use_p;
+
+      if (virtual_operand_p (gimple_phi_result (phi)))
+	return NULL;
+
+      FOR_EACH_PHI_ARG (use_p, phi, iter, SSA_OP_USE)
+	{
+	  tree use = USE_FROM_PTR (use_p);
+	  iv = find_deriving_biv_for_expr (data, use);
+	  if (iv)
+	    return iv;
+	}
+      return NULL;
+    }
+  if (gimple_code (stmt) != GIMPLE_ASSIGN)
+    return NULL;
+
+  e1 = gimple_assign_rhs1 (stmt);
+  code = gimple_assign_rhs_code (stmt);
+  if (get_gimple_rhs_class (code) == GIMPLE_SINGLE_RHS)
+    return find_deriving_biv_for_expr (data, e1);
+
+  switch (code)
+    {
+    case MULT_EXPR:
+    case PLUS_EXPR:
+    case MINUS_EXPR:
+    case POINTER_PLUS_EXPR:
+      /* And increments and decrements by a constant are simple.  */
+      e2 = gimple_assign_rhs2 (stmt);
+      iv = find_deriving_biv_for_expr (data, e2);
+      if (iv)
+	return iv;
+
+      /* Fallthru.  */
+    CASE_CONVERT:
+      /* Casts are simple.  */
+      return find_deriving_biv_for_expr (data, e1);
+
+    default:
+      break;
+    }
+
+  return NULL;
+}
+
+/* Record that BIV and its predecessor and successor are all used in
+   address type uses.  */
+
+static void
+record_biv_for_address_use (struct ivopts_data *data, struct iv *biv)
+{
+  unsigned i;
+  tree type, base_1, base_2;
+  bitmap_iterator bi;
+
+  if (!biv || !biv->biv_p || integer_zerop (biv->step)
+      || biv->have_address_use || !biv->no_overflow)
+    return;
+
+  type = TREE_TYPE (biv->base);
+  if (!INTEGRAL_TYPE_P (type))
+    return;
+
+  biv->have_address_use = true;
+  data->bivs_not_used_in_addr--;
+  base_1 = fold_build2 (PLUS_EXPR, type, biv->base, biv->step);
+  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
+    {
+      struct iv *iv = ver_info (data, i)->iv;
+
+      if (!iv || !iv->biv_p || integer_zerop (iv->step)
+	  || iv->have_address_use || !iv->no_overflow)
+	continue;
+
+      if (type != TREE_TYPE (iv->base)
+	  || !INTEGRAL_TYPE_P (TREE_TYPE (iv->base)))
+	continue;
+
+      if (!operand_equal_p (biv->step, iv->step, 0))
+	continue;
+
+      base_2 = fold_build2 (PLUS_EXPR, type, iv->base, iv->step);
+      if (operand_equal_p (base_1, iv->base, 0)
+	  || operand_equal_p (base_2, biv->base, 0))
+	{
+	  iv->have_address_use = true;
+	  data->bivs_not_used_in_addr--;
+	}
+    }
+}
+
 /* Cumulates the steps of indices into DATA and replaces their values with the
    initial ones.  Returns false when the value of the index cannot be determined.
    Callback for for_each_index.  */
@@ -1711,6 +1864,13 @@  idx_find_step (tree base, tree *idx, void *data)
   step = fold_build2 (MULT_EXPR, sizetype, step, iv_step);
   dta->step = fold_build2 (PLUS_EXPR, sizetype, dta->step, step);
 
+  if (dta->ivopts_data->bivs_not_used_in_addr)
+    {
+      if (!iv->biv_p)
+	iv = find_deriving_biv_for_expr (dta->ivopts_data, iv->ssa_name);
+
+      record_biv_for_address_use (dta->ivopts_data, iv);
+    }
   return true;
 }
 
@@ -2603,7 +2763,8 @@  find_depends (tree *expr_p, int *ws ATTRIBUTE_UNUS
 static struct iv_cand *
 add_candidate_1 (struct ivopts_data *data,
 		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple incremented_at)
+		 struct iv_use *use, gimple incremented_at,
+		 struct iv *orig_iv = NULL)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
@@ -2685,6 +2846,7 @@  add_candidate_1 (struct ivopts_data *data,
       else
 	cand->ainc_use = NULL;
 
+      cand->orig_iv = orig_iv;
       if (dump_file && (dump_flags & TDF_DETAILS))
 	dump_cand (dump_file, cand);
     }
@@ -2793,15 +2955,17 @@  add_autoinc_candidates (struct ivopts_data *data,
 
 static void
 add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use)
+	       tree base, tree step, bool important, struct iv_use *use,
+	       struct iv *orig_iv = NULL)
 {
   gcc_assert (use == NULL || use->sub_id == 0);
 
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL);
+    add_candidate_1 (data, base, step, important,
+		     IP_NORMAL, use, NULL, orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -2836,8 +3000,23 @@  add_iv_candidate_for_biv (struct ivopts_data *data
   tree def;
   struct iv_cand *cand;
 
-  add_candidate (data, iv->base, iv->step, true, NULL);
+  /* Check if this biv is used in address type use.  */
+  if (iv->no_overflow  && iv->have_address_use
+      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
+      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
+    {
+      tree base = fold_convert (sizetype, iv->base);
+      tree step = fold_convert (sizetype, iv->step);
 
+      /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
+      add_candidate (data, base, step, true, NULL, iv);
+      /* Add iv cand of the original type only if it has nonlinear use.  */
+      if (iv->have_use_for)
+	add_candidate (data, iv->base, iv->step, true, NULL);
+    }
+  else
+    add_candidate (data, iv->base, iv->step, true, NULL);
+
   /* The same, but with initial value zero.  */
   if (POINTER_TYPE_P (TREE_TYPE (iv->base)))
     add_candidate (data, size_int (0), iv->step, true, NULL);
@@ -3358,6 +3537,28 @@  get_computation_aff (struct loop *loop,
   /* If the conversion is not noop, perform it.  */
   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
     {
+      if (cand->orig_iv != NULL && CONVERT_EXPR_P (cbase)
+	  && (CONVERT_EXPR_P (cstep) || TREE_CODE (cstep) == INTEGER_CST))
+	{
+	  tree inner_base, inner_step, inner_type;
+	  inner_base = TREE_OPERAND (cbase, 0);
+	  if (CONVERT_EXPR_P (cstep))
+	    inner_step = TREE_OPERAND (cstep, 0);
+	  else
+	    inner_step = cstep;
+
+	  inner_type = TREE_TYPE (inner_base);
+	  /* If the candidate is added from a biv whose type is smaller
+	     than ctype, we know both the candidate and the biv won't
+	     overflow.  In this case, it's safe to skip the conversion in
+	     the candidate.  As an example, (unsigned short)((unsigned long)A)
+	     equals (unsigned short)A, if A has a type no larger than short.  */
+	  if (TYPE_PRECISION (inner_type) <= TYPE_PRECISION (uutype))
+	    {
+	      cbase = inner_base;
+	      cstep = inner_step;
+	    }
+	}
       cstep = fold_convert (uutype, cstep);
       cbase = fold_convert (uutype, cbase);
       var = fold_convert (uutype, var);
@@ -4525,6 +4726,13 @@  get_computation_cost_at (struct ivopts_data *data,
 		(ratio, mem_mode,
 			TYPE_ADDR_SPACE (TREE_TYPE (utype))))
     {
+      if (cstepi == 0 && stmt_is_after_inc)
+	{
+	  if (POINTER_TYPE_P (ctype))
+	    cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+	  else
+	    cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+	}
       cbase
 	= fold_build2 (MULT_EXPR, ctype, cbase, build_int_cst (ctype, ratio));
       cost = difference_cost (data,