diff mbox

[PR69042/01] Add IV candidate for use with constant offset stripped in base.

Message ID AM4PR08MB1140756CF7792FCDB26F23A1E78A0@AM4PR08MB1140.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Bin Cheng March 16, 2016, 9:48 a.m. UTC
Hi,
When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.

On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.

Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.

Thanks,
bin

2016-03-10  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/69042
	* tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Add IV cand
	for use with constant offset stripped in base.

Comments

Richard Biener March 16, 2016, 10:06 a.m. UTC | #1
On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
>
> On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
>
> Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.

Ok.

Richard.

> Thanks,
> bin
>
> 2016-03-10  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/69042
>         * tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Add IV cand
>         for use with constant offset stripped in base.
Bin.Cheng March 22, 2016, 10:22 a.m. UTC | #2
On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> > Hi,
> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
> >
> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
> >
> > Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.
>
> Ok.
Hi Richard,
Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing
param iv-consider-all-candidates-bound) causes 1% regression for
436.cactusADM in my run.  I looked into the code, for function
bench_staggeredleapfrog2_ (takes 99% running time after patching),
IVOPT chooses one fewer candidates for outer loop, but it does result
in couple of more instructions there.  For this case, register
pressure is a more interesting issue (36 candidates chosen in outer
loop, many stack accesses), not sure if this 1% regression blocks the
patch at this stage, or not?

Thanks,
bin
Richard Biener March 22, 2016, 11:01 a.m. UTC | #3
On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> > Hi,
>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
>> >
>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
>> >
>> > Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.
>>
>> Ok.
> Hi Richard,
> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing
> param iv-consider-all-candidates-bound) causes 1% regression for
> 436.cactusADM in my run.  I looked into the code, for function
> bench_staggeredleapfrog2_ (takes 99% running time after patching),
> IVOPT chooses one fewer candidates for outer loop, but it does result
> in couple of more instructions there.

You mean IVOPTs chooses one fewer IVs for the outer loop?

>  For this case, register
> pressure is a more interesting issue (36 candidates chosen in outer
> loop, many stack accesses), not sure if this 1% regression blocks the
> patch at this stage, or not?

Is this with or without the increase of the param?  What compiler options and
on what sub-architecture was this?

I think if the IVO choice looks optimal before the patch and not optimal after
then it's worth blocking but it sounds like the IVO choice is a mess anyway?
[can you maybe check IV choice by ICC?]

Thanks,
Richard.

> Thanks,
> bin
Bin.Cheng March 23, 2016, 1:58 p.m. UTC | #4
On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> > Hi,
>>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
>>> >
>>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
>>> >
>>> > Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.
>>>
>>> Ok.
>> Hi Richard,
>> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing
>> param iv-consider-all-candidates-bound) causes 1% regression for
>> 436.cactusADM in my run.  I looked into the code, for function
>> bench_staggeredleapfrog2_ (takes 99% running time after patching),
>> IVOPT chooses one fewer candidates for outer loop, but it does result
>> in couple of more instructions there.
>
> You mean IVOPTs chooses one fewer IVs for the outer loop?
Yes.
>
>>  For this case, register
>> pressure is a more interesting issue (36 candidates chosen in outer
>> loop, many stack accesses), not sure if this 1% regression blocks the
>> patch at this stage, or not?
>
> Is this with or without the increase of the param?  What compiler options and
> on what sub-architecture was this?
It's not the param, we have both address and generic iv_uses in the
form of {base + 4, step}, and there is another address iv_use {base +
8, step}.  With the patch we now add candidate {base, step}, which
will be used for all these uses.  Before the patch, candidates {base +
4, step} and {base + 8, step} are chosen for these iv_uses
respectively.
As a matter of fact, {base + 4, step} should be the best choice, but
we can't do that because candidate added from one iv_use is not
related to other iv_use and we exceed the bound of param
iv-consider-all-candidates-bound = 30.
>
> I think if the IVO choice looks optimal before the patch and not optimal after
> then it's worth blocking but it sounds like the IVO choice is a mess anyway?
Yes, it's a mess here.  For three outermost loops in 436, IVO chooses
>32 candidates for each loop.  I found at least one bug in register
pressure calculation is responsible for this.  By fixing this (plus
one another local patch), # of candidates of these loops are reduced
from 36/36/32 to 12/12/7.  And # of lines of assembly is reduced from
~7150 to ~6850.  Interesting thing is, the performance isn't
improved...  That may be because they are outermost loops.

> [can you maybe check IV choice by ICC?]
I don't have ICC at hand, but from past experiments, ICC tends to
reuse more common IVs than GCC.

Thanks,
bin
Richard Biener March 23, 2016, 2:18 p.m. UTC | #5
On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>>
>>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> > Hi,
>>>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
>>>> >
>>>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
>>>> >
>>>> > Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.
>>>>
>>>> Ok.
>>> Hi Richard,
>>> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing
>>> param iv-consider-all-candidates-bound) causes 1% regression for
>>> 436.cactusADM in my run.  I looked into the code, for function
>>> bench_staggeredleapfrog2_ (takes 99% running time after patching),
>>> IVOPT chooses one fewer candidates for outer loop, but it does result
>>> in couple of more instructions there.
>>
>> You mean IVOPTs chooses one fewer IVs for the outer loop?
> Yes.
>>
>>>  For this case, register
>>> pressure is a more interesting issue (36 candidates chosen in outer
>>> loop, many stack accesses), not sure if this 1% regression blocks the
>>> patch at this stage, or not?
>>
>> Is this with or without the increase of the param?  What compiler options and
>> on what sub-architecture was this?
> It's not the param, we have both address and generic iv_uses in the
> form of {base + 4, step}, and there is another address iv_use {base +
> 8, step}.  With the patch we now add candidate {base, step}, which
> will be used for all these uses.  Before the patch, candidates {base +
> 4, step} and {base + 8, step} are chosen for these iv_uses
> respectively.
> As a matter of fact, {base + 4, step} should be the best choice, but
> we can't do that because candidate added from one iv_use is not
> related to other iv_use and we exceed the bound of param
> iv-consider-all-candidates-bound = 30.

I suppose upping that to 40 doesn't help here either (AFAIR the loops are quite
big).

>>
>> I think if the IVO choice looks optimal before the patch and not optimal after
>> then it's worth blocking but it sounds like the IVO choice is a mess anyway?
> Yes, it's a mess here.  For three outermost loops in 436, IVO chooses
>>32 candidates for each loop.  I found at least one bug in register
> pressure calculation is responsible for this.  By fixing this (plus
> one another local patch), # of candidates of these loops are reduced
> from 36/36/32 to 12/12/7.  And # of lines of assembly is reduced from
> ~7150 to ~6850.  Interesting thing is, the performance isn't
> improved...  That may be because they are outermost loops.

Yeah, maybe a good default heuristic would be to keep existing BIVs
and add no related candidates at all for outer loops.  Or to adjust
heuristic to prefer the least number of IVs for those.

>> [can you maybe check IV choice by ICC?]
> I don't have ICC at hand, but from past experiments, ICC tends to
> reuse more common IVs than GCC.

Ok.

I think the patch is ok and eventually we need to revisit the outer loop
handling with this as an excuse.

Richard.

> Thanks,
> bin
Bin.Cheng March 23, 2016, 3:29 p.m. UTC | #6
On Wed, Mar 23, 2016 at 2:18 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>>
>>>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>>> > Hi,
>>>>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base.  This is kind of too aggressive and triggers PR69042.  So here is a patch adding back the missing candidates.  Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think).  The issue still depends on PIC_OFFSET register used on x86 target.  As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html.  Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT.
>>>>> >
>>>>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30".  For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%.  To address the regression, I will send another patch increasing the parameter bound.
>>>>> >
>>>>> > Bootstrap&test on x86_64 and AArch64, is it OK?  In the meantime, I will collect spec2k6 data on x86_64.
>>>>>
>>>>> Ok.
>>>> Hi Richard,
>>>> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing
>>>> param iv-consider-all-candidates-bound) causes 1% regression for
>>>> 436.cactusADM in my run.  I looked into the code, for function
>>>> bench_staggeredleapfrog2_ (takes 99% running time after patching),
>>>> IVOPT chooses one fewer candidates for outer loop, but it does result
>>>> in couple of more instructions there.
>>>
>>> You mean IVOPTs chooses one fewer IVs for the outer loop?
>> Yes.
>>>
>>>>  For this case, register
>>>> pressure is a more interesting issue (36 candidates chosen in outer
>>>> loop, many stack accesses), not sure if this 1% regression blocks the
>>>> patch at this stage, or not?
>>>
>>> Is this with or without the increase of the param?  What compiler options and
>>> on what sub-architecture was this?
>> It's not the param, we have both address and generic iv_uses in the
>> form of {base + 4, step}, and there is another address iv_use {base +
>> 8, step}.  With the patch we now add candidate {base, step}, which
>> will be used for all these uses.  Before the patch, candidates {base +
>> 4, step} and {base + 8, step} are chosen for these iv_uses
>> respectively.
>> As a matter of fact, {base + 4, step} should be the best choice, but
>> we can't do that because candidate added from one iv_use is not
>> related to other iv_use and we exceed the bound of param
>> iv-consider-all-candidates-bound = 30.
>
> I suppose upping that to 40 doesn't help here either (AFAIR the loops are quite
> big).
Yes, there are ~100 candidates in total for each loop.
>
>>>
>>> I think if the IVO choice looks optimal before the patch and not optimal after
>>> then it's worth blocking but it sounds like the IVO choice is a mess anyway?
>> Yes, it's a mess here.  For three outermost loops in 436, IVO chooses
>>>32 candidates for each loop.  I found at least one bug in register
>> pressure calculation is responsible for this.  By fixing this (plus
>> one another local patch), # of candidates of these loops are reduced
>> from 36/36/32 to 12/12/7.  And # of lines of assembly is reduced from
>> ~7150 to ~6850.  Interesting thing is, the performance isn't
>> improved...  That may be because they are outermost loops.
>
> Yeah, maybe a good default heuristic would be to keep existing BIVs
> and add no related candidates at all for outer loops.  Or to adjust
> heuristic to prefer the least number of IVs for those.
>
>>> [can you maybe check IV choice by ICC?]
>> I don't have ICC at hand, but from past experiments, ICC tends to
>> reuse more common IVs than GCC.
>
> Ok.
>
> I think the patch is ok and eventually we need to revisit the outer loop
> handling with this as an excuse.
Yes, currently GCC doesn't handle loop nest at all.  Will revisit this
part in coming stage1.
Patch applied as revision 234429.

Thanks,
bin
Jeff Law March 23, 2016, 3:34 p.m. UTC | #7
On 03/23/2016 09:29 AM, Bin.Cheng wrote:

>>
>> I think the patch is ok and eventually we need to revisit the outer loop
>> handling with this as an excuse.
> Yes, currently GCC doesn't handle loop nest at all.  Will revisit this
> part in coming stage1.
> Patch applied as revision 234429.
So given there's follow-up items to do, it's not clear the state of 69042.

Are the regressions for 69042 resolved?  If so, we should consider 
closing 69042 and opening a new BZ for the additional items we want to 
fix in the next stage1 cycle.

If not, let's make sure the 69042 is clear on what is still regressing.

Jeff
diff mbox

Patch

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 4026d28..32594df 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -3232,11 +3232,13 @@  add_iv_candidate_for_use (struct ivopts_data *data, struct iv_use *use)
     basetype = sizetype;
   record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
 
-  /* Record common candidate with constant offset stripped in base.  */
+  /* Record common candidate with constant offset stripped in base.
+     Like the use itself, we also add candidate directly for it.  */
+  base = strip_offset (iv->base, &offset);
+  if (offset || base != iv->base)
     {
-      base = strip_offset (iv->base, &offset);
-      if (offset || base != iv->base)
-	record_common_cand (data, base, iv->step, use);
+      record_common_cand (data, base, iv->step, use);
+      add_candidate (data, base, iv->step, false, use);
     }
 
   /* Record common candidate with base_object removed in base.  */