Message ID: VI1PR0802MB217687F9BECF2A603577497AE7BE0@VI1PR0802MB2176.eurprd08.prod.outlook.com
State: New
On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
> predictive commoning opportunities for all possible loops.  This makes it
> fragile, because the vectorizer may peel the loop differently, as well as
> choose different vector factors.  For example, on x86-solaris the
> vectorizer doesn't peel for a prologue loop; for -march=haswell, the case
> has been failing for a long time because the vector factor is 4, while
> the iteration distance of the predictive commoning opportunity is smaller
> than 4.  This patch refines the test by only checking that a predictive
> commoning variable is created when the vector factor is 2, or that a
> vectorization variable is created when the factor is 4.  This works since
> we have only one main loop, and only one vector factor can be used.
> Test result checked for various x64 targets.  Is it OK?

I think that, as you write, the test is somewhat fragile.  But rather than
adjusting the scanning like you do, I'd add --param
vect-max-peeling-for-alignment=0 and -mprefer-avx128 as additional options
on x86_64-*-* i?86-*-*.

Your new pattern would fail with avx512 if vector(8) real were used.

What's the actual change that made the testcase fail, btw?

Richard.

> Thanks,
> bin
>
> gcc/testsuite/ChangeLog
> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>
>	PR testsuite/78114
>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>	checking predictive commoning variables in vectorized loop
>	wrt vector factor.
On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
>> predictive commoning opportunities for all possible loops.  This makes it
>> fragile, because the vectorizer may peel the loop differently, as well as
>> choose different vector factors.  For example, on x86-solaris the
>> vectorizer doesn't peel for a prologue loop; for -march=haswell, the case
>> has been failing for a long time because the vector factor is 4, while
>> the iteration distance of the predictive commoning opportunity is smaller
>> than 4.  This patch refines the test by only checking that a predictive
>> commoning variable is created when the vector factor is 2, or that a
>> vectorization variable is created when the factor is 4.  This works since
>> we have only one main loop, and only one vector factor can be used.
>> Test result checked for various x64 targets.  Is it OK?
>
> I think that, as you write, the test is somewhat fragile.  But rather
> than adjusting the scanning like you do,
> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128

In this way, is it better to add "--param vect-max-peeling-for-alignment=0"
for all targets?  Otherwise we still need to differentiate the test strings
to handle different targets.  But I have another question here: what if a
target can't handle unaligned accesses and the vectorizer has to peel for
alignment for it?

Also, do you think it's OK to check the predictive commoning PHI node as
below?

  # vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>

In this way, we don't need to take possible prologue/epilogue loops into
consideration.

> as additional options on x86_64-*-* i?86-*-*.
>
> Your new pattern would fail with avx512 if vector(8) real were used.
>
> What's the actual change that made the testcase fail, btw?

There are two cases.

A) After the vect_do_peeling change, the vectorizer may peel only one
iteration for the prologue loop (if vf == 2); the test string below was
added for this reason:

! { dg-final { scan-tree-dump-times "Loop iterates only 1 time, nothing to do" 1 "pcom" } }

This fails on x86_64 Solaris because the prologue loop is not peeled at
all.

B) Depending on ilp, I think the test strings below have been failing for a
long time with haswell:

! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 1 "pcom" { target lp64 } } }
! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 2 "pcom" { target ia32 } } }

because the vectorizer chooses vf == 4 in this case, and there are no
predictive commoning opportunities at all.  The newly added test string
also fails in this case, because the peeled prologue iterates more than
once.

Thanks,
bin

>
> Richard.
>
>> Thanks,
>> bin
>>
>> gcc/testsuite/ChangeLog
>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>
>>	PR testsuite/78114
>>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>	checking predictive commoning variables in vectorized loop
>>	wrt vector factor.
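[Editorial note: case A above turns on how many scalar iterations the
prologue loop peels before the vector loop starts.  The following Python
sketch models that decision; the function, addresses, and element sizes are
illustrative and not GCC's actual implementation.]

```python
def peel_counts(addr, elem_size, vf):
    """Model of prologue peeling for alignment: peel scalar iterations
    until `addr` reaches a full-vector boundary (vf * elem_size bytes).
    Purely illustrative; real vectorizers have more cases than this."""
    vec_bytes = vf * elem_size
    misalign = addr % vec_bytes
    if misalign == 0:
        return 0  # already aligned: no prologue loop at all
    # number of scalar elements to peel to reach alignment
    return (vec_bytes - misalign) // elem_size

# double (8-byte) elements, vf == 2: a 16-byte-aligned base needs no
# peeling, while an 8-byte-offset base needs exactly one peeled iteration
assert peel_counts(addr=0x1000, elem_size=8, vf=2) == 0
assert peel_counts(addr=0x1008, elem_size=8, vf=2) == 1
# with vf == 4 the same 8-byte offset needs three peeled iterations, so a
# scan string expecting "iterates only 1 time" no longer matches
assert peel_counts(addr=0x1008, elem_size=8, vf=4) == 3
```

This is why a scan string tied to one specific prologue trip count is
fragile: the count depends on both the base alignment a target guarantees
and the vector factor it picks.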
On Thu, Nov 17, 2016 at 11:26 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
>>> predictive commoning opportunities for all possible loops.  This makes
>>> it fragile, because the vectorizer may peel the loop differently, as
>>> well as choose different vector factors.  For example, on x86-solaris
>>> the vectorizer doesn't peel for a prologue loop; for -march=haswell,
>>> the case has been failing for a long time because the vector factor is
>>> 4, while the iteration distance of the predictive commoning opportunity
>>> is smaller than 4.  This patch refines the test by only checking that a
>>> predictive commoning variable is created when the vector factor is 2,
>>> or that a vectorization variable is created when the factor is 4.  This
>>> works since we have only one main loop, and only one vector factor can
>>> be used.
>>> Test result checked for various x64 targets.  Is it OK?
>>
>> I think that, as you write, the test is somewhat fragile.  But rather
>> than adjusting the scanning like you do,
>> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128
> In this way, is it better to add "--param
> vect-max-peeling-for-alignment=0" for all targets?  Otherwise we still
> need to differentiate the test strings to handle different targets.  But
> I have another question here: what if a target can't handle unaligned
> accesses and the vectorizer has to peel for alignment for it?

You'd get versioning for alignment instead.

> Also, do you think it's OK to check the predictive commoning PHI node as
> below?
>   # vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>
> In this way, we don't need to take possible prologue/epilogue loops into
> consideration.

I had hoped that without peeling we could simply scan for "Executing
predictive commoning".  But with versioning for alignment you'd still get
two loops.

So maybe checking for both "Executing predictive commoning" and looking for
a vect_lsm PHI node is ok...

>> as additional options on x86_64-*-* i?86-*-*.
>>
>> Your new pattern would fail with avx512 if vector(8) real were used.
>>
>> What's the actual change that made the testcase fail, btw?
> There are two cases.
> A) After the vect_do_peeling change, the vectorizer may peel only one
> iteration for the prologue loop (if vf == 2); the test string below was
> added for this reason:
> ! { dg-final { scan-tree-dump-times "Loop iterates only 1 time,
> nothing to do" 1 "pcom" } }
> This fails on x86_64 Solaris because the prologue loop is not peeled at
> all.
> B) Depending on ilp, I think the test strings below have been failing for
> a long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> because the vectorizer chooses vf == 4 in this case, and there are no
> predictive commoning opportunities at all.

Yes.  I suggest -mprefer-avx128 for that.

> The newly added test string also fails in this case, because the peeled
> prologue iterates more than once.
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>	PR testsuite/78114
>>>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>>	checking predictive commoning variables in vectorized loop
>>>	wrt vector factor.
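[Editorial note: "versioning for alignment" above means the compiler keeps
two copies of the loop behind a runtime alignment check, which is why a
scan can still see two loops even without peeling.  A minimal Python
sketch, with invented addresses and a toy reduction standing in for the
real loop body:]

```python
def sum_versioned(u, addr, vf=2):
    """Versioning for alignment, sketched: a runtime check selects between
    an 'aligned vector' copy of the loop and the original scalar copy, so
    the function body contains two versions of the same loop."""
    n = len(u)
    if addr % (vf * 8) == 0 and n % vf == 0:   # runtime alignment guard
        total = 0.0
        for i in range(0, n, vf):              # "vectorized" copy
            total += sum(u[i:i + vf])          # one vf-wide step
        return total
    total = 0.0
    for x in u:                                # original scalar copy
        total += x
    return total

data = [1.0, 2.0, 3.0, 4.0]
# both versions compute the same result; only the loop taken differs
assert sum_versioned(data, addr=0x1000) == 10.0   # aligned: vector copy
assert sum_versioned(data, addr=0x1008) == 10.0   # misaligned: scalar copy
```

Either copy can be the one a dump-scanning pattern matches, which is the
hazard Richard points out.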
Hi,

On Thu, 17 Nov 2016, Bin.Cheng wrote:

> B) Depending on ilp, I think the test strings below have been failing for
> a long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> because the vectorizer chooses vf == 4 in this case, and there are no
> predictive commoning opportunities at all.
> The newly added test string also fails in this case, because the peeled
> prologue iterates more than once.

Btw, this probably means that on haswell (or other archs with vf==4) mgrid
is slower than necessary.  On mgrid you really really want predictive
commoning to happen.  Vectorization isn't that interesting there.

Ciao,
Michael.
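[Editorial note: the optimization Michael is referring to, predictive
commoning, carries values loaded in one iteration over to the next so they
need not be reloaded.  A minimal Python sketch on a 1-D three-point
stencil, which only stands in for the real mgrid resid kernel:]

```python
def resid_naive(u, n):
    """Each iteration loads all three neighbours from the array."""
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = u[i - 1] + u[i] + u[i + 1]   # three loads per iteration
    return r

def resid_pcom(u, n):
    """Same computation after manual predictive commoning: values loaded
    in one iteration are rotated into the next, leaving one load per
    iteration."""
    r = [0.0] * n
    a, b = u[0], u[1]          # reuse chain carried across iterations
    for i in range(1, n - 1):
        c = u[i + 1]           # the only load in the loop body
        r[i] = a + b + c
        a, b = b, c            # rotate the chain
    return r

u = [float(i * i % 7) for i in range(10)]
assert resid_naive(u, 10) == resid_pcom(u, 10)
```

The scalar reuse distance here is one iteration; once the loop is
vectorized by a factor vf, a reuse distance smaller than vf no longer
crosses a vector iteration, which is why vf == 4 leaves pcom nothing to do
on this testcase.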
On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>
>> B) Depending on ilp, I think the test strings below have been failing
>> for a long time with haswell:
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 1 "pcom" { target lp64 } } }
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 2 "pcom" { target ia32 } } }
>> because the vectorizer chooses vf == 4 in this case, and there are no
>> predictive commoning opportunities at all.
>> The newly added test string also fails in this case, because the peeled
>> prologue iterates more than once.
>
> Btw, this probably means that on haswell (or other archs with vf==4)
> mgrid is slower than necessary.  On mgrid you really really want
> predictive commoning to happen.  Vectorization isn't that interesting
> there.

Interesting; I will check if there is a difference between vf 2 and 4.  We
do have cases where a smaller vf is better and should be chosen, though
for different reasons.

Thanks,
bin

>
>
> Ciao,
> Michael.
On Fri, Nov 18, 2016 at 5:57 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>>
>>> B) Depending on ilp, I think the test strings below have been failing
>>> for a long time with haswell:
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 1 "pcom" { target lp64 } } }
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 2 "pcom" { target ia32 } } }
>>> because the vectorizer chooses vf == 4 in this case, and there are no
>>> predictive commoning opportunities at all.
>>> The newly added test string also fails in this case, because the
>>> peeled prologue iterates more than once.
>>
>> Btw, this probably means that on haswell (or other archs with vf==4)
>> mgrid is slower than necessary.  On mgrid you really really want
>> predictive commoning to happen.  Vectorization isn't that interesting
>> there.
> Interesting; I will check if there is a difference between vf 2 and 4.
> We do have cases where a smaller vf is better and should be chosen,
> though for different reasons.

At some time in the past we had predictive commoning done before
vectorization (GCC 4.3 at least).

The patch is OK meanwhile.

Richard.

> Thanks,
> bin
>>
>>
>> Ciao,
>> Michael.
Hi,

On Mon, 21 Nov 2016, Richard Biener wrote:

> >> Btw, this probably means that on haswell (or other archs with vf==4)
> >> mgrid is slower than necessary.  On mgrid you really really want
> >> predictive commoning to happen.  Vectorization isn't that interesting
> >> there.
> > Interesting; I will check if there is a difference between vf 2 and 4.
> > We do have cases where a smaller vf is better and should be chosen,
> > though for different reasons.
>
> At some time in the past we had predictive commoning done before
> vectorization (GCC 4.3 at least).

Not relevant here.  The testcase and the patch associated with it came in
4.5, where predcom was already after vectorization and the latter disabled
the former (and mgrid took a 40% performance hit).  PR41783 and
https://gcc.gnu.org/ml/gcc-patches/2010-01/msg00855.html FWIW.

Ciao,
Michael.
diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
index 88238f9..3e5c4a4 100644
--- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
+++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
@@ -1,6 +1,6 @@
 ! { dg-do compile }
 ! { dg-require-effective-target vect_double }
-! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom-details" }
+! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom" }
 
 ******* RESID COMPUTES THE RESIDUAL:  R = V - AU
 
@@ -38,8 +38,9 @@ C
       RETURN
       END
 ! we want to check that predictive commoning did something on the
-! vectorized loop.
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 1 "pcom" { target lp64 } } }
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 2 "pcom" { target ia32 } } }
-! { dg-final { scan-tree-dump-times "Predictive commoning failed: no suitable chains" 0 "pcom" } }
-! { dg-final { scan-tree-dump-times "Loop iterates only 1 time, nothing to do" 1 "pcom" } }
+! vectorized loop.  If the vector factor is 2, the vectorized loop
+! can be predictive commoned; we check that a predictive commoning
+! variable is created with vector(2) type.  If the vector factor is
+! 4, there is no predictive commoning opportunity; we check that a
+! vector(4) variable is created.  Only one vector factor can be used.
+! { dg-final { scan-tree-dump-times "vector\\(2\\) real\\(.*\\) vectp_u.*__lsm|vector\\(4\\) real\\(.*\\)" 1 "pcom" } }
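[Editorial note: the combined regular expression in the new dg-final
directive can be exercised outside the testsuite.  In this Python check
the sample dump lines are invented for illustration (only the
vectp_u.122__lsm0.158 name quoted earlier in the thread is real), and
Tcl's doubled backslashes are reduced to single ones:]

```python
import re

# the pattern from the new dg-final directive, with Tcl escaping removed
pattern = re.compile(r"vector\(2\) real\(.*\) vectp_u.*__lsm"
                     r"|vector\(4\) real\(.*\)")

# hypothetical pcom-dump declaration lines
vf2_line = "vector(2) real(kind=8) vectp_u.122__lsm0.158;"
vf4_line = "vector(4) real(kind=8) vect__94;"
scalar_line = "real(kind=8) u__lsm0.158;"

assert pattern.search(vf2_line)         # vf == 2: pcom variable matches
assert pattern.search(vf4_line)         # vf == 4: plain vector temp matches
assert not pattern.search(scalar_line)  # scalar declarations don't match
```

The two alternatives make the directive accept exactly one of the two
legitimate outcomes, which is how a single scan-tree-dump-times count of 1
can hold for both vector factors.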