diff mbox

[PR66388] Compute use with cand of smaller precision by further exercising scev overflow info.

Message ID 000e01d0c059$a9c162f0$fd4428d0$@arm.com
State New
Headers show

Commit Message

Bin Cheng July 17, 2015, 6:27 a.m. UTC
Hi,
This patch is to fix PR66388.  It's an old issue but recently became worse
after my scev overflow change.  IVOPT now can only compute iv use with
candidate which has at least same type precision.  See below code:

  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    {
      /* We do not have a precision to express the values of use.  */
      return infinite_cost;
    }

This is not always true.  It's possible to compute with a candidate of
smaller precision if it has enough stepping periods to express the iv use.
Just as code in iv_elimination.  Well, since now we have iv no_overflow
information, we can use that to prove it's safe.  Actually I am thinking
about improving iv elimination with overflow information too.  So this patch
relaxes the constraint to allow computation of uses with smaller precision
candidates.

Benchmark data shows several cases in spec2k6 are obviously improved on
aarch64:
400.perlbench                2.32%
445.gobmk                    0.86%
456.hmmer                    11.72%
464.h264ref                  1.93%
473.astar	             0.75%
433.milc                     -1.49%
436.cactusADM                6.61%
444.namd                     -0.76%

I looked into assembly code of 456.hmmer&436.cactusADM, and can confirm hot
loops are reduced.  Also perf data could confirm the improvement in
456.hmmer.  
I looked into 433.milc and found most hot functions are not affected by this
patch.  But I do observe two kinds of regressions described as below:
A)  For some loops, auto-increment addressing mode is generated before this
patch, but "base + index<<scale" is generated after. I don't worry about
this too much because auto-increment support in IVO hasn't been enabled on
AArch64 yet. On the contrary, we should worry that auto-increment support is
too aggressive in IVO, resulting in auto-increment addressing mode generated
where it shouldn't. I suspect the regression we monitored before is caused
by such kind of reason.
B) This patch enables computation of 64 bits address iv use with 32 bits biv
candidate.  So there will be a sign extension before the candidate can be
used in memory reference as an index. I already increased the cost by 2 for
such biv candidates but there still be some peculiar cases... Decreasing
cost in determine_iv_cost for biv candidates makes this worse.  It does that
to make debugging simpler, nothing to do with performance.

Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
Unfortunately, it introduces new failure of
g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
this patch, the loop is transformed into a shape that can be later
eliminated(because it can be proved never loop back?).  We can further
discuss if it's this patch's problem or the case should be tuned.
Also bootstrap and test on aarch64.

So what's your opinion?

Thanks,
bin


2015-07-16  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/66388
	* tree-ssa-loop-ivopts.c (dump_iv): Dump no_overflow info.
	(add_candidate_1): New parameter.  Use unsigned type when iv
	overflows.  Pass no_overflow to alloc_iv.
	(add_autoinc_candidates, add_candidate): New parameter.
	Pass no_overflow to add_candidate_1.
	(add_candidate): Ditto.
	(add_iv_candidate_for_biv, add_iv_candidate_for_use): Pass iv's
	no_overflow info to add_candidate and add_candidate_1.
	(get_computation_aff, get_computation_cost_at): Handle candidate
	with smaller precision than iv use.

gcc/testsuite/ChangeLog
2015-07-16  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/66388
	* gcc.dg/tree-ssa/pr66388.c: New test.

Comments

Richard Biener July 23, 2015, 2:06 p.m. UTC | #1
On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng <bin.cheng@arm.com> wrote:
> Hi,
> This patch is to fix PR66388.  It's an old issue but recently became worse
> after my scev overflow change.  IVOPT now can only compute iv use with
> candidate which has at least same type precision.  See below code:
>
>   if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
>     {
>       /* We do not have a precision to express the values of use.  */
>       return infinite_cost;
>     }
>
> This is not always true.  It's possible to compute with a candidate of
> smaller precision if it has enough stepping periods to express the iv use.
> Just as code in iv_elimination.  Well, since now we have iv no_overflow
> information, we can use that to prove it's safe.  Actually I am thinking
> about improving iv elimination with overflow information too.  So this patch
> relaxes the constraint to allow computation of uses with smaller precision
> candidates.
>
> Benchmark data shows several cases in spec2k6 are obviously improved on
> aarch64:
> 400.perlbench                2.32%
> 445.gobmk                    0.86%
> 456.hmmer                    11.72%
> 464.h264ref                  1.93%
> 473.astar                    0.75%
> 433.milc                     -1.49%
> 436.cactusADM                6.61%
> 444.namd                     -0.76%
>
> I looked into assembly code of 456.hmmer&436.cactusADM, and can confirm hot
> loops are reduced.  Also perf data could confirm the improvement in
> 456.hmmer.
> I looked into 433.milc and found most hot functions are not affected by this
> patch.  But I do observe two kinds of regressions described as below:
> A)  For some loops, auto-increment addressing mode is generated before this
> patch, but "base + index<<scale" is generated after. I don't worry about
> this too much because auto-increment support in IVO hasn't been enabled on
> AArch64 yet. On the contrary, we should worry that auto-increment support is
> too aggressive in IVO, resulting in auto-increment addressing mode generated
> where it shouldn't. I suspect the regression we monitored before is caused
> by such kind of reason.
> B) This patch enables computation of 64 bits address iv use with 32 bits biv
> candidate.  So there will be a sign extension before the candidate can be
> used in memory reference as an index. I already increased the cost by 2 for
> such biv candidates but there still be some peculiar cases... Decreasing
> cost in determine_iv_cost for biv candidates makes this worse.  It does that
> to make debugging simpler, nothing to do with performance.
>
> Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
> Unfortunately, it introduces new failure of
> g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
> this patch, the loop is transformed into a shape that can be later
> eliminated(because it can be proved never loop back?).  We can further
> discuss if it's this patch's problem or the case should be tuned.
> Also bootstrap and test on aarch64.
>
> So what's your opinion?

Looks sensible, but the deallocator.C fail looks odd.  I presume that
i + j is simplified in a way that either the first or the second iteration
must exit the loop via the return and thus the scan for deallocator.C:34
fails?  How does this happen - I can only see this happen if we unroll
the loop and then run into VRP.  So does IVOPTs now affect non-loop
code as well?  Ah, at the moment we use an IV that runs backward.

Still curious if this isn't a wrong-code issue...

Richard.

> Thanks,
> bin
>
>
> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/66388
>         * tree-ssa-loop-ivopts.c (dump_iv): Dump no_overflow info.
>         (add_candidate_1): New parameter.  Use unsigned type when iv
>         overflows.  Pass no_overflow to alloc_iv.
>         (add_autoinc_candidates, add_candidate): New parameter.
>         Pass no_overflow to add_candidate_1.
>         (add_candidate): Ditto.
>         (add_iv_candidate_for_biv, add_iv_candidate_for_use): Pass iv's
>         no_overflow info to add_candidate and add_candidate_1.
>         (get_computation_aff, get_computation_cost_at): Handle candidate
>         with smaller precision than iv use.
>
> gcc/testsuite/ChangeLog
> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/66388
>         * gcc.dg/tree-ssa/pr66388.c: New test.
>
Bin.Cheng July 24, 2015, 11:09 a.m. UTC | #2
On Thu, Jul 23, 2015 at 10:06 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>> Hi,
>> This patch is to fix PR66388.  It's an old issue but recently became worse
>> after my scev overflow change.  IVOPT now can only compute iv use with
>> candidate which has at least same type precision.  See below code:
>>
>>   if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
>>     {
>>       /* We do not have a precision to express the values of use.  */
>>       return infinite_cost;
>>     }
>>
>> This is not always true.  It's possible to compute with a candidate of
>> smaller precision if it has enough stepping periods to express the iv use.
>> Just as code in iv_elimination.  Well, since now we have iv no_overflow
>> information, we can use that to prove it's safe.  Actually I am thinking
>> about improving iv elimination with overflow information too.  So this patch
>> relaxes the constraint to allow computation of uses with smaller precision
>> candidates.
>>
>> Benchmark data shows several cases in spec2k6 are obviously improved on
>> aarch64:
>> 400.perlbench                2.32%
>> 445.gobmk                    0.86%
>> 456.hmmer                    11.72%
>> 464.h264ref                  1.93%
>> 473.astar                    0.75%
>> 433.milc                     -1.49%
>> 436.cactusADM                6.61%
>> 444.namd                     -0.76%
>>
>> I looked into assembly code of 456.hmmer&436.cactusADM, and can confirm hot
>> loops are reduced.  Also perf data could confirm the improvement in
>> 456.hmmer.
>> I looked into 433.milc and found most hot functions are not affected by this
>> patch.  But I do observe two kinds of regressions described as below:
>> A)  For some loops, auto-increment addressing mode is generated before this
>> patch, but "base + index<<scale" is generated after. I don't worry about
>> this too much because auto-increment support in IVO hasn't been enabled on
>> AArch64 yet. On the contrary, we should worry that auto-increment support is
>> too aggressive in IVO, resulting in auto-increment addressing mode generated
>> where it shouldn't. I suspect the regression we monitored before is caused
>> by such kind of reason.
>> B) This patch enables computation of 64 bits address iv use with 32 bits biv
>> candidate.  So there will be a sign extension before the candidate can be
>> used in memory reference as an index. I already increased the cost by 2 for
>> such biv candidates but there still be some peculiar cases... Decreasing
>> cost in determine_iv_cost for biv candidates makes this worse.  It does that
>> to make debugging simpler, nothing to do with performance.
>>
>> Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
>> Unfortunately, it introduces new failure of
>> g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
>> this patch, the loop is transformed into a shape that can be later
>> eliminated(because it can be proved never loop back?).  We can further
>> discuss if it's this patch's problem or the case should be tuned.
>> Also bootstrap and test on aarch64.
>>
>> So what's your opinion?
>
> Looks sensible, but the deallocator.C fail looks odd.  I presume that
> i + j is simplified in a way that either the first or the second iteration
> must exit the loop via the return and thus the scan for deallocator.C:34
> fails?  How does this happen - I can only see this happen if we unroll
> the loop and then run into VRP.  So does IVOPTs now affect non-loop
> code as well?  Ah, at the moment we use an IV that runs backward.
>
> Still curious if this isn't a wrong-code issue...
>

Tree dump just before ivopts is as below:
{
  struct t test;
  struct t test;
  int j;
  struct t test_outside;
  unsigned int ivtmp_1;
  int _11;
  unsigned int ivtmp_26;

  <bb 2>:
  t::t (&test_outside);
  # DEBUG j => 0
  # DEBUG j => 0

  <bb 3>:
  # j_27 = PHI <j_14(6), 0(2)>
  # ivtmp_1 = PHI <ivtmp_26(6), 10(2)>
  # DEBUG j => j_27
  t::t (&test);
  t::foo (&test);
  _11 = i_10(D) + j_27;
  if (_11 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  t::bar (&test);
  t::~t (&test);
  test ={v} {CLOBBER};
  goto <bb 12>;

  <bb 5>:
  t::~t (&test);
  test ={v} {CLOBBER};
  j_14 = j_27 + 1;
  # DEBUG j => j_14
  # DEBUG j => j_14
  ivtmp_26 = ivtmp_1 - 1;
  if (ivtmp_26 == 0)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 6>:
  goto <bb 3>;

  <bb 7>:
  if (i_10(D) != 0)
    goto <bb 8>;
  else
    goto <bb 11>;

  <bb 8>:
  t::t (&test);
  if (i_10(D) == 10)
    goto <bb 9>;
  else
    goto <bb 10>;

  <bb 9>:
  t::bar (&test);

  <bb 10>:
  t::~t (&test);
  test ={v} {CLOBBER};

  <bb 11>:
  t::foo (&test_outside);

  <bb 12>:
  t::~t (&test_outside);
  test_outside ={v} {CLOBBER};
  return;

}

The only difference the patch made, is candidate 8 is not converted
into unsigned type any more because _11 is now considered not
overflow.

Before patch:
candidate 8
  var_before ivtmp.9
  var_after ivtmp.9
  incremented before exit test
  type unsigned int
  base (unsigned int) i_10(D)
  step 1
After patch:
candidate 8
  var_before ivtmp.9
  var_after ivtmp.9
  incremented before exit test
  type int
  base i_10(D)
  step 1
  iv doesn't overflow wrto loop niter

Mark {i_10(D), 1}_loop not overflow seems reasonable.  But I am not
sure because use 0 is before candidate stepping, thus we can't assume
it will not overflow at use 1?

After IVO, the code is transformed into:
{
  int ivtmp.9;
  struct t test;
  struct t test;
  int j;
  struct t test_outside;
  unsigned int _29;
  unsigned int _30;
  int _31;

  <bb 2>:
  t::t (&test_outside);
  # DEBUG j => 0
  # DEBUG j => 0
  _29 = (unsigned int) i_10(D);
  _30 = _29 + 10;
  _31 = (int) _30;

  <bb 3>:
  # ivtmp.9_25 = PHI <ivtmp.9_2(6), i_10(D)(2)>
  # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
  t::t (&test);
  t::foo (&test);
  if (ivtmp.9_25 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  t::bar (&test);
  t::~t (&test);
  test ={v} {CLOBBER};
  goto <bb 12>;

  <bb 5>:
  t::~t (&test);
  test ={v} {CLOBBER};
  # DEBUG D#1 => (int) (((unsigned int) ivtmp.9_25 - (unsigned int)
i_10(D)) + 1)
  # DEBUG j => D#1
  # DEBUG j => D#1
  ivtmp.9_2 = ivtmp.9_25 + 1;
  if (ivtmp.9_2 == _31)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 6>:
  goto <bb 3>;

  <bb 7>:
  if (i_10(D) != 0)
    goto <bb 8>;
  else
    goto <bb 11>;

  <bb 8>:
  t::t (&test);
  if (i_10(D) == 10)
    goto <bb 9>;
  else
    goto <bb 10>;

  <bb 9>:
  t::bar (&test);

  <bb 10>:
  t::~t (&test);
  test ={v} {CLOBBER};

  <bb 11>:
  t::foo (&test_outside);

  <bb 12>:
  t::~t (&test_outside);
  test_outside ={v} {CLOBBER};
  return;

}

Now, the second condition in bb5 is guarded by condition in bb3 and we
know for sure that ivtmp.9_2 is ONE.  Then jump threading in following
dom pass can understand that condition in bb3 will be false and thread
it.  The threaded code is as below, and we can see the loop is
basically removed.
{
  int ivtmp.9;
  struct t test;
  struct t test;
  int j;
  struct t test_outside;
  unsigned int _29;
  unsigned int _30;
  int _31;

  <bb 2>:
  t::t (&test_outside);
  # DEBUG j => 0
  # DEBUG j => 0
  _29 = (unsigned int) i_10(D);
  _30 = _29 + 10;
  _31 = (int) _30;

  <bb 3>:
  # ivtmp.9_25 = PHI <i_10(D)(2)>
  # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
  t::t (&test);
  t::foo (&test);
  if (ivtmp.9_25 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
  t::bar (&test);
  t::~t (&test);
  test ={v} {CLOBBER};
  goto <bb 11>;

  <bb 5>:
  t::~t (&test);
  test ={v} {CLOBBER};
  # DEBUG D#1 => (int) (((unsigned int) 0 - (unsigned int) i_10(D)) + 1)
  # DEBUG j => D#1
  # DEBUG j => D#1
  ivtmp.9_2 = 1;
  if (_31 == 1)
    goto <bb 6>;
  else
    goto <bb 12>;

  <bb 6>:
  if (i_10(D) != 0)
    goto <bb 7>;
  else
    goto <bb 10>;

  <bb 7>:
  t::t (&test);
  if (i_10(D) == 10)
    goto <bb 8>;
  else
    goto <bb 9>;

  <bb 8>:
  t::bar (&test);

  <bb 9>:
  t::~t (&test);
  test ={v} {CLOBBER};

  <bb 10>:
  t::foo (&test_outside);

  <bb 11>:
  t::~t (&test_outside);
  test_outside ={v} {CLOBBER};
  return;

  <bb 12>:
  # ivtmp.9_27 = PHI <1(5)>
  # DEBUG j => (int) ((unsigned int) ivtmp.9_27 - (unsigned int) i_10(D))
  t::t (&test);
  t::foo (&test);
  goto <bb 4>;

}

It is a weird case.

Thanks,
bin

> Richard.
>
>> Thanks,
>> bin
>>
>>
>> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/66388
>>         * tree-ssa-loop-ivopts.c (dump_iv): Dump no_overflow info.
>>         (add_candidate_1): New parameter.  Use unsigned type when iv
>>         overflows.  Pass no_overflow to alloc_iv.
>>         (add_autoinc_candidates, add_candidate): New parameter.
>>         Pass no_overflow to add_candidate_1.
>>         (add_candidate): Ditto.
>>         (add_iv_candidate_for_biv, add_iv_candidate_for_use): Pass iv's
>>         no_overflow info to add_candidate and add_candidate_1.
>>         (get_computation_aff, get_computation_cost_at): Handle candidate
>>         with smaller precision than iv use.
>>
>> gcc/testsuite/ChangeLog
>> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR tree-optimization/66388
>>         * gcc.dg/tree-ssa/pr66388.c: New test.
>>
Richard Biener July 24, 2015, 11:23 a.m. UTC | #3
On Fri, Jul 24, 2015 at 1:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Jul 23, 2015 at 10:06 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Hi,
>>> This patch is to fix PR66388.  It's an old issue but recently became worse
>>> after my scev overflow change.  IVOPT now can only compute iv use with
>>> candidate which has at least same type precision.  See below code:
>>>
>>>   if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
>>>     {
>>>       /* We do not have a precision to express the values of use.  */
>>>       return infinite_cost;
>>>     }
>>>
>>> This is not always true.  It's possible to compute with a candidate of
>>> smaller precision if it has enough stepping periods to express the iv use.
>>> Just as code in iv_elimination.  Well, since now we have iv no_overflow
>>> information, we can use that to prove it's safe.  Actually I am thinking
>>> about improving iv elimination with overflow information too.  So this patch
>>> relaxes the constraint to allow computation of uses with smaller precision
>>> candidates.
>>>
>>> Benchmark data shows several cases in spec2k6 are obviously improved on
>>> aarch64:
>>> 400.perlbench                2.32%
>>> 445.gobmk                    0.86%
>>> 456.hmmer                    11.72%
>>> 464.h264ref                  1.93%
>>> 473.astar                    0.75%
>>> 433.milc                     -1.49%
>>> 436.cactusADM                6.61%
>>> 444.namd                     -0.76%
>>>
>>> I looked into assembly code of 456.hmmer&436.cactusADM, and can confirm hot
>>> loops are reduced.  Also perf data could confirm the improvement in
>>> 456.hmmer.
>>> I looked into 433.milc and found most hot functions are not affected by this
>>> patch.  But I do observe two kinds of regressions described as below:
>>> A)  For some loops, auto-increment addressing mode is generated before this
>>> patch, but "base + index<<scale" is generated after. I don't worry about
>>> this too much because auto-increment support in IVO hasn't been enabled on
>>> AArch64 yet. On the contrary, we should worry that auto-increment support is
>>> too aggressive in IVO, resulting in auto-increment addressing mode generated
>>> where it shouldn't. I suspect the regression we monitored before is caused
>>> by such kind of reason.
>>> B) This patch enables computation of 64 bits address iv use with 32 bits biv
>>> candidate.  So there will be a sign extension before the candidate can be
>>> used in memory reference as an index. I already increased the cost by 2 for
>>> such biv candidates but there still be some peculiar cases... Decreasing
>>> cost in determine_iv_cost for biv candidates makes this worse.  It does that
>>> to make debugging simpler, nothing to do with performance.
>>>
>>> Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
>>> Unfortunately, it introduces new failure of
>>> g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
>>> this patch, the loop is transformed into a shape that can be later
>>> eliminated(because it can be proved never loop back?).  We can further
>>> discuss if it's this patch's problem or the case should be tuned.
>>> Also bootstrap and test on aarch64.
>>>
>>> So what's your opinion?
>>
>> Looks sensible, but the deallocator.C fail looks odd.  I presume that
>> i + j is simplified in a way that either the first or the second iteration
>> must exit the loop via the return and thus the scan for deallocator.C:34
>> fails?  How does this happen - I can only see this happen if we unroll
>> the loop and then run into VRP.  So does IVOPTs now affect non-loop
>> code as well?  Ah, at the moment we use an IV that runs backward.
>>
>> Still curious if this isn't a wrong-code issue...
>>
>
> Tree dump just before ivopts is as below:
> {
>   struct t test;
>   struct t test;
>   int j;
>   struct t test_outside;
>   unsigned int ivtmp_1;
>   int _11;
>   unsigned int ivtmp_26;
>
>   <bb 2>:
>   t::t (&test_outside);
>   # DEBUG j => 0
>   # DEBUG j => 0
>
>   <bb 3>:
>   # j_27 = PHI <j_14(6), 0(2)>
>   # ivtmp_1 = PHI <ivtmp_26(6), 10(2)>
>   # DEBUG j => j_27
>   t::t (&test);
>   t::foo (&test);
>   _11 = i_10(D) + j_27;
>   if (_11 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
>   <bb 4>:
>   t::bar (&test);
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   goto <bb 12>;
>
>   <bb 5>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   j_14 = j_27 + 1;
>   # DEBUG j => j_14
>   # DEBUG j => j_14
>   ivtmp_26 = ivtmp_1 - 1;
>   if (ivtmp_26 == 0)
>     goto <bb 7>;
>   else
>     goto <bb 6>;
>
>   <bb 6>:
>   goto <bb 3>;
>
>   <bb 7>:
>   if (i_10(D) != 0)
>     goto <bb 8>;
>   else
>     goto <bb 11>;
>
>   <bb 8>:
>   t::t (&test);
>   if (i_10(D) == 10)
>     goto <bb 9>;
>   else
>     goto <bb 10>;
>
>   <bb 9>:
>   t::bar (&test);
>
>   <bb 10>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>
>   <bb 11>:
>   t::foo (&test_outside);
>
>   <bb 12>:
>   t::~t (&test_outside);
>   test_outside ={v} {CLOBBER};
>   return;
>
> }
>
> The only difference the patch made, is candidate 8 is not converted
> into unsigned type any more because _11 is now considered not
> overflow.
>
> Before patch:
> candidate 8
>   var_before ivtmp.9
>   var_after ivtmp.9
>   incremented before exit test
>   type unsigned int
>   base (unsigned int) i_10(D)
>   step 1
> After patch:
> candidate 8
>   var_before ivtmp.9
>   var_after ivtmp.9
>   incremented before exit test
>   type int
>   base i_10(D)
>   step 1
>   iv doesn't overflow wrto loop niter
>
> Mark {i_10(D), 1}_loop not overflow seems reasonable.  But I am not
> sure because use 0 is before candidate stepping, thus we can't assume
> it will not overflow at use 1?
>
> After IVO, the code is transformed into:
> {
>   int ivtmp.9;
>   struct t test;
>   struct t test;
>   int j;
>   struct t test_outside;
>   unsigned int _29;
>   unsigned int _30;
>   int _31;
>
>   <bb 2>:
>   t::t (&test_outside);
>   # DEBUG j => 0
>   # DEBUG j => 0
>   _29 = (unsigned int) i_10(D);
>   _30 = _29 + 10;

From lookign at the source:

  for (int j = 0; j < 10; j++)
    {
      t test;
      test.foo();
      if (i + j)
        {
          test.bar();
          return;
        }
    }

we know that i + 9 does not overflow but we don't know
that i + 10 does not overflow.   Compute of _31 is done
in unsigned, so that's ok.

>   _31 = (int) _30;
>
>   <bb 3>:
>   # ivtmp.9_25 = PHI <ivtmp.9_2(6), i_10(D)(2)>
>   # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
>   t::t (&test);
>   t::foo (&test);
>   if (ivtmp.9_25 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
>   <bb 4>:
>   t::bar (&test);
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   goto <bb 12>;
>
>   <bb 5>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   # DEBUG D#1 => (int) (((unsigned int) ivtmp.9_25 - (unsigned int)
> i_10(D)) + 1)
>   # DEBUG j => D#1
>   # DEBUG j => D#1
>   ivtmp.9_2 = ivtmp.9_25 + 1;

But here we actually perform i_10 + 10 in the last iteration which might
overflow and thus this stmt generated by IVO possibly invokes undefined behavior
in the last iteration.

Yes, we later figure out we at most iterate twice but I suppose
behavior would be
the same if we'd change the loop to

  for (int j = 0; j < 10; j++)
    {
      t test;
      test.foo();
      if (i + j)
        {
          test.bar();
        }
    }

so in the end the transform looks wrong to me (using 'int' for the IV and not
'unsigned int').  We could write the exit test as ivtmp.9_2 > _31 and compute
_31 as i_10 + 9 to fix that for example.

Richard.

>   if (ivtmp.9_2 == _31)
>     goto <bb 7>;
>   else
>     goto <bb 6>;
>
>   <bb 6>:
>   goto <bb 3>;
>
>   <bb 7>:
>   if (i_10(D) != 0)
>     goto <bb 8>;
>   else
>     goto <bb 11>;
>
>   <bb 8>:
>   t::t (&test);
>   if (i_10(D) == 10)
>     goto <bb 9>;
>   else
>     goto <bb 10>;
>
>   <bb 9>:
>   t::bar (&test);
>
>   <bb 10>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>
>   <bb 11>:
>   t::foo (&test_outside);
>
>   <bb 12>:
>   t::~t (&test_outside);
>   test_outside ={v} {CLOBBER};
>   return;
>
> }
>
> Now, the second condition in bb5 is guarded by condition in bb3 and we
> know for sure that ivtmp.9_2 is ONE.  Then jump threading in following
> dom pass can understand that condition in bb3 will be false and thread
> it.  The threaded code is as below, and we can see the loop is
> basically removed.
> {
>   int ivtmp.9;
>   struct t test;
>   struct t test;
>   int j;
>   struct t test_outside;
>   unsigned int _29;
>   unsigned int _30;
>   int _31;
>
>   <bb 2>:
>   t::t (&test_outside);
>   # DEBUG j => 0
>   # DEBUG j => 0
>   _29 = (unsigned int) i_10(D);
>   _30 = _29 + 10;
>   _31 = (int) _30;
>
>   <bb 3>:
>   # ivtmp.9_25 = PHI <i_10(D)(2)>
>   # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
>   t::t (&test);
>   t::foo (&test);
>   if (ivtmp.9_25 != 0)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
>   <bb 4>:
>   # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
>   t::bar (&test);
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   goto <bb 11>;
>
>   <bb 5>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>   # DEBUG D#1 => (int) (((unsigned int) 0 - (unsigned int) i_10(D)) + 1)
>   # DEBUG j => D#1
>   # DEBUG j => D#1
>   ivtmp.9_2 = 1;
>   if (_31 == 1)
>     goto <bb 6>;
>   else
>     goto <bb 12>;
>
>   <bb 6>:
>   if (i_10(D) != 0)
>     goto <bb 7>;
>   else
>     goto <bb 10>;
>
>   <bb 7>:
>   t::t (&test);
>   if (i_10(D) == 10)
>     goto <bb 8>;
>   else
>     goto <bb 9>;
>
>   <bb 8>:
>   t::bar (&test);
>
>   <bb 9>:
>   t::~t (&test);
>   test ={v} {CLOBBER};
>
>   <bb 10>:
>   t::foo (&test_outside);
>
>   <bb 11>:
>   t::~t (&test_outside);
>   test_outside ={v} {CLOBBER};
>   return;
>
>   <bb 12>:
>   # ivtmp.9_27 = PHI <1(5)>
>   # DEBUG j => (int) ((unsigned int) ivtmp.9_27 - (unsigned int) i_10(D))
>   t::t (&test);
>   t::foo (&test);
>   goto <bb 4>;
>
> }
>
> It is a weird case.
>
> Thanks,
> bin
>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>>
>>> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/66388
>>>         * tree-ssa-loop-ivopts.c (dump_iv): Dump no_overflow info.
>>>         (add_candidate_1): New parameter.  Use unsigned type when iv
>>>         overflows.  Pass no_overflow to alloc_iv.
>>>         (add_autoinc_candidates, add_candidate): New parameter.
>>>         Pass no_overflow to add_candidate_1.
>>>         (add_candidate): Ditto.
>>>         (add_iv_candidate_for_biv, add_iv_candidate_for_use): Pass iv's
>>>         no_overflow info to add_candidate and add_candidate_1.
>>>         (get_computation_aff, get_computation_cost_at): Handle candidate
>>>         with smaller precision than iv use.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2015-07-16  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR tree-optimization/66388
>>>         * gcc.dg/tree-ssa/pr66388.c: New test.
>>>
Bin.Cheng July 24, 2015, 11:54 a.m. UTC | #4
On Fri, Jul 24, 2015 at 7:23 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jul 24, 2015 at 1:09 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Thu, Jul 23, 2015 at 10:06 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>> Hi,
>>>> This patch is to fix PR66388.  It's an old issue but recently became worse
>>>> after my scev overflow change.  IVOPT now can only compute iv use with
>>>> candidate which has at least same type precision.  See below code:
>>>>
>>>>   if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
>>>>     {
>>>>       /* We do not have a precision to express the values of use.  */
>>>>       return infinite_cost;
>>>>     }
>>>>
>>>> This is not always true.  It's possible to compute with a candidate of
>>>> smaller precision if it has enough stepping periods to express the iv use.
>>>> Just as code in iv_elimination.  Well, since now we have iv no_overflow
>>>> information, we can use that to prove it's safe.  Actually I am thinking
>>>> about improving iv elimination with overflow information too.  So this patch
>>>> relaxes the constraint to allow computation of uses with smaller precision
>>>> candidates.
>>>>
>>>> Benchmark data shows several cases in spec2k6 are obviously improved on
>>>> aarch64:
>>>> 400.perlbench                2.32%
>>>> 445.gobmk                    0.86%
>>>> 456.hmmer                    11.72%
>>>> 464.h264ref                  1.93%
>>>> 473.astar                    0.75%
>>>> 433.milc                     -1.49%
>>>> 436.cactusADM                6.61%
>>>> 444.namd                     -0.76%
>>>>
>>>> I looked into assembly code of 456.hmmer&436.cactusADM, and can confirm hot
>>>> loops are reduced.  Also perf data could confirm the improvement in
>>>> 456.hmmer.
>>>> I looked into 433.milc and found most hot functions are not affected by this
>>>> patch.  But I do observe two kinds of regressions described as below:
>>>> A)  For some loops, auto-increment addressing mode is generated before this
>>>> patch, but "base + index<<scale" is generated after. I don't worry about
>>>> this too much because auto-increment support in IVO hasn't been enabled on
>>>> AArch64 yet. On the contrary, we should worry that auto-increment support is
>>>> too aggressive in IVO, resulting in auto-increment addressing mode generated
>>>> where it shouldn't. I suspect the regression we monitored before is caused
>>>> by such kind of reason.
>>>> B) This patch enables computation of 64 bits address iv use with 32 bits biv
>>>> candidate.  So there will be a sign extension before the candidate can be
>>>> used in memory reference as an index. I already increased the cost by 2 for
>>>> such biv candidates but there still be some peculiar cases... Decreasing
>>>> cost in determine_iv_cost for biv candidates makes this worse.  It does that
>>>> to make debugging simpler, nothing to do with performance.
>>>>
>>>> Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
>>>> Unfortunately, it introduces new failure of
>>>> g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
>>>> this patch, the loop is transformed into a shape that can be later
>>>> eliminated(because it can be proved never loop back?).  We can further
>>>> discuss if it's this patch's problem or the case should be tuned.
>>>> Also bootstrap and test on aarch64.
>>>>
>>>> So what's your opinion?
>>>
>>> Looks sensible, but the deallocator.C fail looks odd.  I presume that
>>> i + j is simplified in a way that either the first or the second iteration
>>> must exit the loop via the return and thus the scan for deallocator.C:34
>>> fails?  How does this happen - I can only see this happen if we unroll
>>> the loop and then run into VRP.  So does IVOPTs now affect non-loop
>>> code as well?  Ah, at the moment we use an IV that runs backward.
>>>
>>> Still curious if this isn't a wrong-code issue...
>>>
>>
>> Tree dump just before ivopts is as below:
>> {
>>   struct t test;
>>   struct t test;
>>   int j;
>>   struct t test_outside;
>>   unsigned int ivtmp_1;
>>   int _11;
>>   unsigned int ivtmp_26;
>>
>>   <bb 2>:
>>   t::t (&test_outside);
>>   # DEBUG j => 0
>>   # DEBUG j => 0
>>
>>   <bb 3>:
>>   # j_27 = PHI <j_14(6), 0(2)>
>>   # ivtmp_1 = PHI <ivtmp_26(6), 10(2)>
>>   # DEBUG j => j_27
>>   t::t (&test);
>>   t::foo (&test);
>>   _11 = i_10(D) + j_27;
>>   if (_11 != 0)
>>     goto <bb 4>;
>>   else
>>     goto <bb 5>;
>>
>>   <bb 4>:
>>   t::bar (&test);
>>   t::~t (&test);
>>   test ={v} {CLOBBER};
>>   goto <bb 12>;
>>
>>   <bb 5>:
>>   t::~t (&test);
>>   test ={v} {CLOBBER};
>>   j_14 = j_27 + 1;
>>   # DEBUG j => j_14
>>   # DEBUG j => j_14
>>   ivtmp_26 = ivtmp_1 - 1;
>>   if (ivtmp_26 == 0)
>>     goto <bb 7>;
>>   else
>>     goto <bb 6>;
>>
>>   <bb 6>:
>>   goto <bb 3>;
>>
>>   <bb 7>:
>>   if (i_10(D) != 0)
>>     goto <bb 8>;
>>   else
>>     goto <bb 11>;
>>
>>   <bb 8>:
>>   t::t (&test);
>>   if (i_10(D) == 10)
>>     goto <bb 9>;
>>   else
>>     goto <bb 10>;
>>
>>   <bb 9>:
>>   t::bar (&test);
>>
>>   <bb 10>:
>>   t::~t (&test);
>>   test ={v} {CLOBBER};
>>
>>   <bb 11>:
>>   t::foo (&test_outside);
>>
>>   <bb 12>:
>>   t::~t (&test_outside);
>>   test_outside ={v} {CLOBBER};
>>   return;
>>
>> }
>>
>> The only difference the patch made, is candidate 8 is not converted
>> into unsigned type any more because _11 is now considered not
>> overflow.
>>
>> Before patch:
>> candidate 8
>>   var_before ivtmp.9
>>   var_after ivtmp.9
>>   incremented before exit test
>>   type unsigned int
>>   base (unsigned int) i_10(D)
>>   step 1
>> After patch:
>> candidate 8
>>   var_before ivtmp.9
>>   var_after ivtmp.9
>>   incremented before exit test
>>   type int
>>   base i_10(D)
>>   step 1
>>   iv doesn't overflow wrto loop niter
>>
>> Mark {i_10(D), 1}_loop not overflow seems reasonable.  But I am not
>> sure because use 0 is before candidate stepping, thus we can't assume
>> it will not overflow at use 1?
>>
>> After IVO, the code is transformed into:
>> {
>>   int ivtmp.9;
>>   struct t test;
>>   struct t test;
>>   int j;
>>   struct t test_outside;
>>   unsigned int _29;
>>   unsigned int _30;
>>   int _31;
>>
>>   <bb 2>:
>>   t::t (&test_outside);
>>   # DEBUG j => 0
>>   # DEBUG j => 0
>>   _29 = (unsigned int) i_10(D);
>>   _30 = _29 + 10;
>
> From lookign at the source:
>
>   for (int j = 0; j < 10; j++)
>     {
>       t test;
>       test.foo();
>       if (i + j)
>         {
>           test.bar();
>           return;
>         }
>     }
>
> we know that i + 9 does not overflow but we don't know
> that i + 10 does not overflow.   Compute of _31 is done
> in unsigned, so that's ok.
>
>>   _31 = (int) _30;
>>
>>   <bb 3>:
>>   # ivtmp.9_25 = PHI <ivtmp.9_2(6), i_10(D)(2)>
>>   # DEBUG j => (int) ((unsigned int) ivtmp.9_25 - (unsigned int) i_10(D))
>>   t::t (&test);
>>   t::foo (&test);
>>   if (ivtmp.9_25 != 0)
>>     goto <bb 4>;
>>   else
>>     goto <bb 5>;
>>
>>   <bb 4>:
>>   t::bar (&test);
>>   t::~t (&test);
>>   test ={v} {CLOBBER};
>>   goto <bb 12>;
>>
>>   <bb 5>:
>>   t::~t (&test);
>>   test ={v} {CLOBBER};
>>   # DEBUG D#1 => (int) (((unsigned int) ivtmp.9_25 - (unsigned int)
>> i_10(D)) + 1)
>>   # DEBUG j => D#1
>>   # DEBUG j => D#1
>>   ivtmp.9_2 = ivtmp.9_25 + 1;
>
> But here we actually perform i_10 + 10 in the last iteration which might
> overflow and thus this stmt generated by IVO possibly invokes undefined behavior
> in the last iteration.
>
> Yes, we later figure out we at most iterate twice but I suppose
> behavior would be
> the same if we'd change the loop to
>
>   for (int j = 0; j < 10; j++)
>     {
>       t test;
>       test.foo();
>       if (i + j)
>         {
>           test.bar();
>         }
>     }
>
> so in the end the transform looks wrong to me (using 'int' for the IV and not
> 'unsigned int').  We could write the exit test as ivtmp.9_2 > _31 and compute
> _31 as i_10 + 9 to fix that for example.
Right.
The no overflow information is computed with respect to loop niter.
The patch lacks change in iv elimination that makes sure the
elimination happens within niter too.  The check is unnecessary before
because every candidate is computed in unsigned type.  I need to make
sure no overflow information is only used within niter range when
rewriting uses.  So far seems iv elimination is the only case that may
compute iv one step more than loop niter.

Thanks,
bin
>
> Richard.
>
diff mbox

Patch

Index: gcc/testsuite/gcc.dg/tree-ssa/pr66388.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr66388.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr66388.c	(revision 0)
@@ -0,0 +1,25 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts" } */
+
+int flag;
+int arr[144];
+
+void bar(int a, int b);
+int foo (int t)
+{
+  int step = t;
+
+  do
+    {
+      if (!flag)
+	bar(t, 0);
+
+      t += step;
+    }
+  while (arr[t] != 0);
+
+  return t;
+}
+
+/* { dg-final { scan-tree-dump-times "t.\[0-9_\]* = PHI <.*, " 1 "ivopts"} } */
+/* { dg-final { scan-tree-dump-not "ivtmp.\[0-9_\]* = PHI <" "ivopts"} } */
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 225859)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -529,6 +529,9 @@  dump_iv (FILE *file, struct iv *iv, bool dump_name
 
   if (iv->biv_p)
     fprintf (file, "  is a biv\n");
+
+  if (iv->no_overflow)
+    fprintf (file, "  iv doesn't overflow wrto loop niter\n");
 }
 
 /* Dumps information about the USE to FILE.  */
@@ -2598,21 +2601,23 @@  find_depends (tree *expr_p, int *ws ATTRIBUTE_UNUS
 /* Adds a candidate BASE + STEP * i.  Important field is set to IMPORTANT and
    position to POS.  If USE is not NULL, the candidate is set as related to
    it.  If both BASE and STEP are NULL, we add a pseudocandidate for the
-   replacement of the final value of the iv by a direct computation.  */
+   replacement of the final value of the iv by a direct computation.
+   NO_OVERFLOW is TRUE means the iv doesn't overflow with respect to loop's
+   niter information.  */
 
 static struct iv_cand *
 add_candidate_1 (struct ivopts_data *data,
 		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple incremented_at)
+		 struct iv_use *use, gimple incremented_at,
+		 bool no_overflow = false)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
   tree type, orig_type;
 
-  /* For non-original variables, make sure their values are computed in a type
-     that does not invoke undefined behavior on overflows (since in general,
-     we cannot prove that these induction variables are non-wrapping).  */
-  if (pos != IP_ORIGINAL)
+  /* For non-original variables, compute their values in a type that does
+     not invoke undefined behavior on overflows if the iv might overflow.  */
+  if (pos != IP_ORIGINAL && !no_overflow)
     {
       orig_type = TREE_TYPE (base);
       type = generic_type_for (orig_type);
@@ -2661,7 +2666,7 @@  add_candidate_1 (struct ivopts_data *data,
       if (!base && !step)
 	cand->iv = NULL;
       else
-	cand->iv = alloc_iv (data, base, step);
+	cand->iv = alloc_iv (data, base, step, no_overflow);
 
       cand->pos = pos;
       if (pos != IP_ORIGINAL && cand->iv)
@@ -2729,11 +2734,13 @@  allow_ip_end_pos_p (struct loop *loop)
 }
 
 /* If possible, adds autoincrement candidates BASE + STEP * i based on use USE.
-   Important field is set to IMPORTANT.  */
+   Important field is set to IMPORTANT.  NO_OVERFLOW is TRUE means the iv
+   doesn't overflow with respect to loop's niter information.  */
 
 static void
 add_autoinc_candidates (struct ivopts_data *data, tree base, tree step,
-			bool important, struct iv_use *use)
+			bool important, struct iv_use *use,
+			bool no_overflow = false)
 {
   basic_block use_bb = gimple_bb (use->stmt);
   machine_mode mem_mode;
@@ -2782,26 +2789,30 @@  add_autoinc_candidates (struct ivopts_data *data,
 	  && GET_MODE_SIZE (mem_mode) == -cstepi))
     {
       add_candidate_1 (data, base, step, important, IP_AFTER_USE, use,
-		       use->stmt);
+		       use->stmt, no_overflow);
     }
 }
 
 /* Adds a candidate BASE + STEP * i.  Important field is set to IMPORTANT and
    position to POS.  If USE is not NULL, the candidate is set as related to
    it.  The candidate computation is scheduled before exit condition and at
-   the end of loop.  */
+   the end of loop.  NO_OVERFLOW is TRUE means the iv doesn't overflow with
+   respect to loop's niter information.  */
 
 static void
 add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use)
+	       tree base, tree step, bool important, struct iv_use *use,
+	       bool no_overflow = false)
 {
   gcc_assert (use == NULL || use->sub_id == 0);
 
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL);
+    add_candidate_1 (data, base, step, important,
+		     IP_NORMAL, use, NULL, no_overflow);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL);
+    add_candidate_1 (data, base, step, important,
+		     IP_END, use, NULL, no_overflow);
 }
 
 /* Adds standard iv candidates.  */
@@ -2836,7 +2847,7 @@  add_iv_candidate_for_biv (struct ivopts_data *data
   tree def;
   struct iv_cand *cand;
 
-  add_candidate (data, iv->base, iv->step, true, NULL);
+  add_candidate (data, iv->base, iv->step, true, NULL, iv->no_overflow);
 
   /* The same, but with initial value zero.  */
   if (POINTER_TYPE_P (TREE_TYPE (iv->base)))
@@ -2858,7 +2869,7 @@  add_iv_candidate_for_biv (struct ivopts_data *data
 	{
 	  cand = add_candidate_1 (data,
 				  iv->base, iv->step, true, IP_ORIGINAL, NULL,
-				  SSA_NAME_DEF_STMT (def));
+				  SSA_NAME_DEF_STMT (def), iv->no_overflow);
 	  cand->var_before = iv->ssa_name;
 	  cand->var_after = def;
 	}
@@ -2894,7 +2905,7 @@  add_iv_candidate_for_use (struct ivopts_data *data
   tree basetype;
   struct iv *iv = use->iv;
 
-  add_candidate (data, iv->base, iv->step, false, use);
+  add_candidate (data, iv->base, iv->step, false, use, iv->no_overflow);
 
   /* The same, but with initial value zero.  Make such variable important,
      since it is generic enough so that possibly many uses may be based
@@ -3339,33 +3350,54 @@  get_computation_aff (struct loop *loop,
   tree ubase = use->iv->base;
   tree ustep = use->iv->step;
   tree cbase = cand->iv->base;
-  tree cstep = cand->iv->step, cstep_common;
+  tree cstep = cand->iv->step;
   tree utype = TREE_TYPE (ubase), ctype = TREE_TYPE (cbase);
   tree common_type, var;
   tree uutype;
   aff_tree cbase_aff, var_aff;
   widest_int rat;
 
-  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype)
+      /* If cand doesn't overflow wrto loop niter, it's safe to express
+	 iv use with the cand even if iv use has higher type precision.  */
+      && !cand->iv->no_overflow)
     {
       /* We do not have a precision to express the values of use.  */
       return false;
     }
 
+  /* We need to shift the value if we are after the increment.  */
+  if (stmt_after_increment (loop, cand, at))
+    {
+      if (POINTER_TYPE_P (ctype))
+	cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+      else
+	cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+    }
+
+  if (!constant_multiple_of (ustep, cstep, &rat))
+    {
+      tree tmp = NULL;
+
+      /* If iv use has higher type precision, try compute ratio
+	 again with cand's step converted to use's type.  */
+      if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+	tmp = fold_convert (TREE_TYPE (ustep), cstep);
+      if (!tmp || !constant_multiple_of (ustep, tmp, &rat))
+	return false;
+    }
+
   var = var_at_stmt (loop, cand, at);
   uutype = unsigned_type_for (utype);
 
   /* If the conversion is not noop, perform it.  */
-  if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
+  if (TYPE_PRECISION (utype) != TYPE_PRECISION (ctype))
     {
       cstep = fold_convert (uutype, cstep);
       cbase = fold_convert (uutype, cbase);
       var = fold_convert (uutype, var);
     }
 
-  if (!constant_multiple_of (ustep, cstep, &rat))
-    return false;
-
   /* In case both UBASE and CBASE are shortened to UUTYPE from some common
      type, we achieve better folding by computing their difference in this
      wider type, and cast the result to UUTYPE.  We do not need to worry about
@@ -3378,20 +3410,6 @@  get_computation_aff (struct loop *loop,
   tree_to_aff_combination (cbase, common_type, &cbase_aff);
   tree_to_aff_combination (var, uutype, &var_aff);
 
-  /* We need to shift the value if we are after the increment.  */
-  if (stmt_after_increment (loop, cand, at))
-    {
-      aff_tree cstep_aff;
-
-      if (common_type != uutype)
-	cstep_common = fold_convert (common_type, cstep);
-      else
-	cstep_common = cstep;
-
-      tree_to_aff_combination (cstep_common, common_type, &cstep_aff);
-      aff_combination_add (&cbase_aff, &cstep_aff);
-    }
-
   aff_combination_scale (&cbase_aff, -rat);
   aff_combination_add (aff, &cbase_aff);
   if (common_type != uutype)
@@ -4426,7 +4444,10 @@  get_computation_cost_at (struct ivopts_data *data,
   cstep = cand->iv->step;
   ctype = TREE_TYPE (cbase);
 
-  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype)
+      /* If cand doesn't overflow wrto loop niter, it's safe to express
+	 iv use with the cand even if iv use has higher type precision.  */
+      && !cand->iv->no_overflow)
     {
       /* We do not have a precision to express the values of use.  */
       return infinite_cost;
@@ -4467,8 +4488,17 @@  get_computation_cost_at (struct ivopts_data *data,
     cstepi = 0;
 
   if (!constant_multiple_of (ustep, cstep, &rat))
-    return infinite_cost;
+    {
+      tree tmp = NULL;
 
+      /* If iv use has higher type precision, try compute ratio
+	 again with cand's step converted to use's type.  */
+      if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+	tmp = fold_convert (TREE_TYPE (ustep), cstep);
+      if (!tmp || !constant_multiple_of (ustep, tmp, &rat))
+	return infinite_cost;
+    }
+
   if (wi::fits_shwi_p (rat))
     ratio = rat.to_shwi ();
   else
@@ -4525,8 +4555,22 @@  get_computation_cost_at (struct ivopts_data *data,
 		(ratio, mem_mode,
 			TYPE_ADDR_SPACE (TREE_TYPE (utype))))
     {
-      cbase
-	= fold_build2 (MULT_EXPR, ctype, cbase, build_int_cst (ctype, ratio));
+      if (cstepi == 0 && stmt_is_after_inc)
+	{
+	  if (POINTER_TYPE_P (ctype))
+	    cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+	  else
+	    cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+	}
+      if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+	{
+	  cbase = fold_convert (TREE_TYPE (ustep), cbase);
+	  cbase = fold_build2 (MULT_EXPR, TREE_TYPE (cbase), cbase,
+			       build_int_cst (TREE_TYPE (cbase), ratio));
+	}
+      else
+	cbase = fold_build2 (MULT_EXPR, ctype, cbase,
+			     build_int_cst (ctype, ratio));
       cost = difference_cost (data,
 			      ubase, cbase,
 			      &symbol_present, &var_present, &offset,
@@ -4567,6 +4611,10 @@  get_computation_cost_at (struct ivopts_data *data,
   if (stmt_is_after_inc)
     offset -= ratio * cstepi;
 
+  /* Increase cost if iv use has higher type precision.  */
+  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
+    cost.cost += 2;
+
   /* Now the computation is in shape symbol + var1 + const + ratio * var2.
      (symbol/var1/const parts may be omitted).  If we are looking for an
      address, find the cost of addressing this.  */