Message ID | AM4PR08MB1140756CF7792FCDB26F23A1E78A0@AM4PR08MB1140.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: > Hi, > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. > > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. > > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. Ok. Richard. > Thanks, > bin > > 2016-03-10 Bin Cheng <bin.cheng@arm.com> > > PR tree-optimization/69042 > * tree-ssa-loop-ivopts.c (add_iv_candidate_for_use): Add IV cand > for use with constant offset stripped in base.
On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener <richard.guenther@gmail.com> wrote: > > On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: > > Hi, > > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. > > > > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. > > > > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. > > Ok. Hi Richard, Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing param iv-consider-all-candidates-bound) causes 1% regression for 436.cactusADM in my run. I looked into the code, for function bench_staggeredleapfrog2_ (takes 99% running time after patching), IVOPT chooses one fewer candidates for outer loop, but it does result in couple of more instructions there. For this case, register pressure is a more interesting issue (36 candidates chosen in outer loop, many stack accesses), not sure if this 1% regression blocks the patch at this stage, or not? Thanks, bin
On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote: > On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener > <richard.guenther@gmail.com> wrote: >> >> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: >> > Hi, >> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. >> > >> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. >> > >> > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. >> >> Ok. > Hi Richard, > Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing > param iv-consider-all-candidates-bound) causes 1% regression for > 436.cactusADM in my run. I looked into the code, for function > bench_staggeredleapfrog2_ (takes 99% running time after patching), > IVOPT chooses one fewer candidates for outer loop, but it does result > in couple of more instructions there. You mean IVOPTs chooses one fewer IVs for the outer loop? > For this case, register > pressure is a more interesting issue (36 candidates chosen in outer > loop, many stack accesses), not sure if this 1% regression blocks the > patch at this stage, or not? Is this with or without the increase of the param? What compiler options and on what sub-architecture was this? I think if the IVO choice looks optimal before the patch and not optimal after then it's worth blocking but it sounds like the IVO choice is a mess anyway? [can you maybe check IV choice by ICC?] Thanks, Richard. > Thanks, > bin
On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener <richard.guenther@gmail.com> wrote: > On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote: >> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener >> <richard.guenther@gmail.com> wrote: >>> >>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: >>> > Hi, >>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. >>> > >>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. >>> > >>> > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. >>> >>> Ok. >> Hi Richard, >> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing >> param iv-consider-all-candidates-bound) causes 1% regression for >> 436.cactusADM in my run. I looked into the code, for function >> bench_staggeredleapfrog2_ (takes 99% running time after patching), >> IVOPT chooses one fewer candidates for outer loop, but it does result >> in couple of more instructions there. > > You mean IVOPTs chooses one fewer IVs for the outer loop? Yes. > >> For this case, register >> pressure is a more interesting issue (36 candidates chosen in outer >> loop, many stack accesses), not sure if this 1% regression blocks the >> patch at this stage, or not? > > Is this with or without the increase of the param? What compiler options and > on what sub-architecture was this? It's not the param, we have both address and generic iv_uses in the form of {base + 4, step}, and there is another address iv_use {base + 8, step}. With the patch we now add candidate {base, step}, which will be used for all these uses. Before the patch, candidates {base + 4, step} and {base + 8, step} are chosen for these iv_uses respectively. As a matter of fact, {base + 4, step} should be the best choice, but we can't do that because candidate added from one iv_use is not related to other iv_use and we exceed the bound of param iv-consider-all-candidates-bound = 30. > > I think if the IVO choice looks optimal before the patch and not optimal after > then it's worth blocking but it sounds like the IVO choice is a mess anyway? Yes, it's a mess here. For three outermost loops in 436, IVO chooses >32 candidates for each loop. I found at least one bug in register pressure calculation is responsible for this. By fixing this (plus one another local patch), # of candidates of these loops are reduced from 36/36/32 to 12/12/7. And # of lines of assembly is reduced from ~7150 to ~6850. Interesting thing is, the performance isn't improved... That may be because they are outermost loops. > [can you maybe check IV choice by ICC?] I don't have ICC at hand, but from past experiments, ICC tends to reuse more common IVs than GCC. Thanks, bin
On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote: > On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener > <richard.guenther@gmail.com> wrote: >> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote: >>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener >>> <richard.guenther@gmail.com> wrote: >>>> >>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: >>>> > Hi, >>>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. >>>> > >>>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. >>>> > >>>> > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. >>>> >>>> Ok. >>> Hi Richard, >>> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing >>> param iv-consider-all-candidates-bound) causes 1% regression for >>> 436.cactusADM in my run. I looked into the code, for function >>> bench_staggeredleapfrog2_ (takes 99% running time after patching), >>> IVOPT chooses one fewer candidates for outer loop, but it does result >>> in couple of more instructions there. >> >> You mean IVOPTs chooses one fewer IVs for the outer loop? > Yes. >> >>> For this case, register >>> pressure is a more interesting issue (36 candidates chosen in outer >>> loop, many stack accesses), not sure if this 1% regression blocks the >>> patch at this stage, or not? >> >> Is this with or without the increase of the param? What compiler options and >> on what sub-architecture was this? > It's not the param, we have both address and generic iv_uses in the > form of {base + 4, step}, and there is another address iv_use {base + > 8, step}. With the patch we now add candidate {base, step}, which > will be used for all these uses. Before the patch, candidates {base + > 4, step} and {base + 8, step} are chosen for these iv_uses > respectively. > As a matter of fact, {base + 4, step} should be the best choice, but > we can't do that because candidate added from one iv_use is not > related to other iv_use and we exceed the bound of param > iv-consider-all-candidates-bound = 30. I suppose upping that to 40 doesn't help here either (AFAIR the loops are quite big). >> >> I think if the IVO choice looks optimal before the patch and not optimal after >> then it's worth blocking but it sounds like the IVO choice is a mess anyway? > Yes, it's a mess here. For three outermost loops in 436, IVO chooses >>32 candidates for each loop. I found at least one bug in register > pressure calculation is responsible for this. By fixing this (plus > one another local patch), # of candidates of these loops are reduced > from 36/36/32 to 12/12/7. And # of lines of assembly is reduced from > ~7150 to ~6850. Interesting thing is, the performance isn't > improved... That may be because they are outermost loops. Yeah, maybe a good default heuristic would be to keep existing BIVs and add no related candidates at all for outer loops. Or to adjust heuristic to prefer the least number of IVs for those. >> [can you maybe check IV choice by ICC?] > I don't have ICC at hand, but from past experiments, ICC tends to > reuse more common IVs than GCC. Ok. I think the patch is ok and eventually we need to revisit the outer loop handling with this as an excuse. Richard. > Thanks, > bin
On Wed, Mar 23, 2016 at 2:18 PM, Richard Biener <richard.guenther@gmail.com> wrote: > On Wed, Mar 23, 2016 at 2:58 PM, Bin.Cheng <amker.cheng@gmail.com> wrote: >> On Tue, Mar 22, 2016 at 11:01 AM, Richard Biener >> <richard.guenther@gmail.com> wrote: >>> On Tue, Mar 22, 2016 at 11:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote: >>>> On Wed, Mar 16, 2016 at 10:06 AM, Richard Biener >>>> <richard.guenther@gmail.com> wrote: >>>>> >>>>> On Wed, Mar 16, 2016 at 10:48 AM, Bin Cheng <Bin.Cheng@arm.com> wrote: >>>>> > Hi, >>>>> > When I tried to decrease # of IV candidates, I removed code that adds IV candidates for use with constant offset stripped in use->base. This is kind of too aggressive and triggers PR69042. So here is a patch adding back the missing candidates. Honestly, this patch doesn't truly fix the issue, it just brings back the original behavior in IVOPT part (Which is still a right thing to do I think). The issue still depends on PIC_OFFSET register used on x86 target. As discussed in https://gcc.gnu.org/ml/gcc/2016-02/msg00040.html. Furthermore, the real problem could be in register pressure modeling about PIC_OFFSET symbol in IVOPT. >>>>> > >>>>> > On AArch64, overall spec2k number isn't changed, though 173.applu is regressed by ~4% because couple of loops' # of candidates now hits "--param iv-consider-all-candidates-bound=30". For spec2k6 data on AArch64, INT is not affected; FP overall is not changed, as for specific case: 459.GemsFDTD is regressed by 2%, 433.milc is improved by 2%. To address the regression, I will send another patch increasing the parameter bound. >>>>> > >>>>> > Bootstrap&test on x86_64 and AArch64, is it OK? In the meantime, I will collect spec2k6 data on x86_64. >>>>> >>>>> Ok. >>>> Hi Richard, >>>> Hmm, I got spec2k6 data on my x86_64, it (along with patch increasing >>>> param iv-consider-all-candidates-bound) causes 1% regression for >>>> 436.cactusADM in my run. I looked into the code, for function >>>> bench_staggeredleapfrog2_ (takes 99% running time after patching), >>>> IVOPT chooses one fewer candidates for outer loop, but it does result >>>> in couple of more instructions there. >>> >>> You mean IVOPTs chooses one fewer IVs for the outer loop? >> Yes. >>> >>>> For this case, register >>>> pressure is a more interesting issue (36 candidates chosen in outer >>>> loop, many stack accesses), not sure if this 1% regression blocks the >>>> patch at this stage, or not? >>> >>> Is this with or without the increase of the param? What compiler options and >>> on what sub-architecture was this? >> It's not the param, we have both address and generic iv_uses in the >> form of {base + 4, step}, and there is another address iv_use {base + >> 8, step}. With the patch we now add candidate {base, step}, which >> will be used for all these uses. Before the patch, candidates {base + >> 4, step} and {base + 8, step} are chosen for these iv_uses >> respectively. >> As a matter of fact, {base + 4, step} should be the best choice, but >> we can't do that because candidate added from one iv_use is not >> related to other iv_use and we exceed the bound of param >> iv-consider-all-candidates-bound = 30. > > I suppose upping that to 40 doesn't help here either (AFAIR the loops are quite > big). Yes, there are ~100 candidates in total for each loop. > >>> >>> I think if the IVO choice looks optimal before the patch and not optimal after >>> then it's worth blocking but it sounds like the IVO choice is a mess anyway? >> Yes, it's a mess here. For three outermost loops in 436, IVO chooses >>>32 candidates for each loop. I found at least one bug in register >> pressure calculation is responsible for this. By fixing this (plus >> one another local patch), # of candidates of these loops are reduced >> from 36/36/32 to 12/12/7. And # of lines of assembly is reduced from >> ~7150 to ~6850. Interesting thing is, the performance isn't >> improved... That may be because they are outermost loops. > > Yeah, maybe a good default heuristic would be to keep existing BIVs > and add no related candidates at all for outer loops. Or to adjust > heuristic to prefer the least number of IVs for those. > >>> [can you maybe check IV choice by ICC?] >> I don't have ICC at hand, but from past experiments, ICC tends to >> reuse more common IVs than GCC. > > Ok. > > I think the patch is ok and eventually we need to revisit the outer loop > handling with this as an excuse. Yes, currently GCC doesn't handle loop nest at all. Will revisit this part in coming stage1. Patch applied as revision 234429. Thanks, bin
On 03/23/2016 09:29 AM, Bin.Cheng wrote: >> >> I think the patch is ok and eventually we need to revisit the outer loop >> handling with this as an excuse. > Yes, currently GCC doesn't handle loop nest at all. Will revisit this > part in coming stage1. > Patch applied as revision 234429. So given there's follow-up items to do, it's not clear the state of 69042. Are the regressions for 69042 resolved? If so, we should consider closing 69042 and opening a new BZ for the additional items we want to fix in the next stage1 cycle. If not, let's make sure the 69042 is clear on what is still regressing. Jeff
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index 4026d28..32594df 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -3232,11 +3232,13 @@ add_iv_candidate_for_use (struct ivopts_data *data, struct iv_use *use) basetype = sizetype; record_common_cand (data, build_int_cst (basetype, 0), iv->step, use); - /* Record common candidate with constant offset stripped in base. */ + /* Record common candidate with constant offset stripped in base. + Like the use itself, we also add candidate directly for it. */ + base = strip_offset (iv->base, &offset); + if (offset || base != iv->base) { - base = strip_offset (iv->base, &offset); - if (offset || base != iv->base) - record_common_cand (data, base, iv->step, use); + record_common_cand (data, base, iv->step, use); + add_candidate (data, base, iv->step, false, use); } /* Record common candidate with base_object removed in base. */