
[PR78114] Refine gfortran.dg/vect/fast-math-mgrid-resid.f

Message ID VI1PR0802MB217687F9BECF2A603577497AE7BE0@VI1PR0802MB2176.eurprd08.prod.outlook.com
State New

Commit Message

Bin Cheng Nov. 16, 2016, 5:20 p.m. UTC
Hi,
Currently the test gfortran.dg/vect/fast-math-mgrid-resid.f checks all predictive commoning opportunities for all possible loops.  This makes it fragile, because the vectorizer may peel the loop differently and may choose different vector factors.  For example, on x86-solaris the vectorizer doesn't peel a prologue loop; with -march=haswell the case has been failing for a long time because the vector factor is 4, while the iteration distance of the predictive commoning opportunity is smaller than 4.  This patch refines the test to only check that a predictive commoning variable is created when the vector factor is 2, or that a vectorization variable is created when the factor is 4.  This works because there is only one main loop, and only one vector factor can be used.
Test result checked for various x64 targets.  Is it OK?

Thanks,
bin

gcc/testsuite/ChangeLog
2016-11-16  Bin Cheng  <bin.cheng@arm.com>

	PR testsuite/78114
	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
	checking predictive commoning variables in vectorized loop
	with respect to vector factor.

Comments

Richard Biener Nov. 17, 2016, 8:32 a.m. UTC | #1
On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all predictive commoning opportunities for all possible loops.  This makes it fragile because vectorizer may peel the loop differently, as well as may choose different vector factors.  For example, on x86-solaris, vectorizer doesn't peel for prologue loop; for -march=haswell, the case is long time failed because vector factor is 4, while iteration distance of predictive commoning opportunity is smaller than 4.  This patch refines it by only checking if predictive commoning variable is created when vector factor is 2; or vectorization variable is created when factor is 4.  This works since we have only one main loop, and only one vector factor can be used.
> Test result checked for various x64 targets.  Is it OK?

I think that as you write the test is somewhat fragile.  But rather
than adjusting the scanning like you do
I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128
as additional option on x86_64-*-* i?86-*-*.

Your new pattern would fail with avx512 if vector (8) real would be used.

What's the actual change that made the testcase fail btw?

Richard.

> Thanks,
> bin
>
> gcc/testsuite/ChangeLog
> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>
>         PR testsuite/78114
>         * gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>         checking predictive commoning variables in vectorized loop
>         with respect to vector factor.
Bin.Cheng Nov. 17, 2016, 10:26 a.m. UTC | #2
On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all predictive commoning opportunities for all possible loops.  This makes it fragile because vectorizer may peel the loop differently, as well as may choose different vector factors.  For example, on x86-solaris, vectorizer doesn't peel for prologue loop; for -march=haswell, the case is long time failed because vector factor is 4, while iteration distance of predictive commoning opportunity is smaller than 4.  This patch refines it by only checking if predictive commoning variable is created when vector factor is 2; or vectorization variable is created when factor is 4.  This works since we have only one main loop, and only one vector factor can be used.
>> Test result checked for various x64 targets.  Is it OK?
>
> I think that as you write the test is somewhat fragile.  But rather
> than adjusting the scanning like you do
> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128
In this way, is it better to add "--param
vect-max-peeling-for-alignment=0" for all targets?  Otherwise we still
need to differentiate the test strings to handle different targets.  But I
have another question here: what if a target can't handle unaligned
access and the vectorizer has to peel for alignment for it?
Also, do you think it's OK to check the predictive commoning PHI node as below?
# vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>
That way, we don't need to take possible prologue/epilogue loops
into consideration.

> as additional option on x86_64-*-* i?86-*-*.
>
> Your new pattern would fail with avx512 if vector (8) real would be used.
>
> What's the actual change that made the testcase fail btw?
There are two cases.
A) After the vect_do_peeling change, the vectorizer may peel only one
iteration for the prologue loop (if vf == 2); the test string below was
added for this reason:
! { dg-final { scan-tree-dump-times "Loop iterates only 1 time,
nothing to do" 1 "pcom" } }
This fails on x86_64 solaris because the prologue loop is not peeled at all.
B) Depending on the ILP model, I think the test strings below have been
failing for a long time with haswell:
! { dg-final { scan-tree-dump-times "Executing predictive commoning
without unrolling" 1 "pcom" { target lp64 } } }
! { dg-final { scan-tree-dump-times "Executing predictive commoning
without unrolling" 2 "pcom" { target ia32 } } }
because the vectorizer chooses vf == 4 in this case and there are no
predictive commoning opportunities at all.
The newly added test string also fails in this case, because the
peeled prologue loop iterates more than once.

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>
>> gcc/testsuite/ChangeLog
>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>
>>         PR testsuite/78114
>>         * gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>         checking predictive commoning variables in vectorized loop
>>         with respect to vector factor.
Richard Biener Nov. 17, 2016, 10:53 a.m. UTC | #3
On Thu, Nov 17, 2016 at 11:26 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all predictive commoning opportunities for all possible loops.  This makes it fragile because vectorizer may peel the loop differently, as well as may choose different vector factors.  For example, on x86-solaris, vectorizer doesn't peel for prologue loop; for -march=haswell, the case is long time failed because vector factor is 4, while iteration distance of predictive commoning opportunity is smaller than 4.  This patch refines it by only checking if predictive commoning variable is created when vector factor is 2; or vectorization variable is created when factor is 4.  This works since we have only one main loop, and only one vector factor can be used.
>>> Test result checked for various x64 targets.  Is it OK?
>>
>> I think that as you write the test is somewhat fragile.  But rather
>> than adjusting the scanning like you do
>> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128
> In this way, is it better to add "--param
> vect-max-peeling-for-alignment=0" for all targets?  Otherwise we still
> need to differentiate test string to handle different targets.  But I
> have another question here: what if a target can't handle unaligned
> access and vectorizer have to peel for alignment for it?

You'd get versioning for alignment instead.

> Also do you think it's ok to check predictive commoning PHI node as below?
> # vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>
> In this way, we don't need to take possible prologue/epilogue loops
> into consideration.

I had hoped that without peeling we could simply scan for "Executing predictive commoning".
But with versioning for alignment you'd still get two loops.

So maybe checking for both "Executing predictive commoning" and looking
for a vect_lsm PHI node is ok...

>> as additional option on x86_64-*-* i?86-*-*.
>>
>> Your new pattern would fail with avx512 if vector (8) real would be used.
>>
>> What's the actual change that made the testcase fail btw?
> There are two cases.
> A) After vect_do_peeling change, vectorizer may only peel one
> iteration for prologue loop (if vf == 2), below test string was added
> for this reason:
> ! { dg-final { scan-tree-dump-times "Loop iterates only 1 time,
> nothing to do" 1 "pcom" } }
> This fails on x86_64 solaris because prologue loop is not peeled at all.
> B) Depending on ilp, I think below test strings fail for long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> Because vectorizer choose vf==4 in this case, and there is no
> predictive commoning opportunities at all.

Yes.  I suggest -mprefer-avx128 for that.

> Also the newly added test string fails in this case too because the
> prolog peeled iterates more than 1 times.
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         PR testsuite/78114
>>>         * gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>>         checking predictive commoning variables in vectorized loop
>>>         with respect to vector factor.
Michael Matz Nov. 18, 2016, 4:52 p.m. UTC | #4
Hi,

On Thu, 17 Nov 2016, Bin.Cheng wrote:

> B) Depending on ilp, I think below test strings fail for long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> Because vectorizer choose vf==4 in this case, and there is no
> predictive commoning opportunities at all.
> Also the newly added test string fails in this case too because the
> prolog peeled iterates more than 1 times.

Btw, this probably means that on haswell (or other archs with vf==4) mgrid 
is slower than necessary.  On mgrid you really really want predictive 
commoning to happen.  Vectorization isn't that interesting there.


Ciao,
Michael.
Bin.Cheng Nov. 18, 2016, 4:57 p.m. UTC | #5
On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>
>> B) Depending on ilp, I think below test strings fail for long time with haswell:
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 1 "pcom" { target lp64 } } }
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 2 "pcom" { target ia32 } } }
>> Because vectorizer choose vf==4 in this case, and there is no
>> predictive commoning opportunities at all.
>> Also the newly added test string fails in this case too because the
>> prolog peeled iterates more than 1 times.
>
> Btw, this probably means that on haswell (or other archs with vf==4) mgrid
> is slower than necessary.  On mgrid you really really want predictive
> commoning to happen.  Vectorization isn't that interesting there.
Interesting, I will check if there is a difference between vf 2 and 4.  We
do have cases where a smaller vf is better and should be chosen, though
for different reasons.

Thanks,
bin
>
>
> Ciao,
> Michael.
Richard Biener Nov. 21, 2016, 9:55 a.m. UTC | #6
On Fri, Nov 18, 2016 at 5:57 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>>
>>> B) Depending on ilp, I think below test strings fail for long time with haswell:
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 1 "pcom" { target lp64 } } }
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 2 "pcom" { target ia32 } } }
>>> Because vectorizer choose vf==4 in this case, and there is no
>>> predictive commoning opportunities at all.
>>> Also the newly added test string fails in this case too because the
>>> prolog peeled iterates more than 1 times.
>>
>> Btw, this probably means that on haswell (or other archs with vf==4) mgrid
>> is slower than necessary.  On mgrid you really really want predictive
>> commoning to happen.  Vectorization isn't that interesting there.
> Interesting, I will check if there is difference between 2/4 vf.  we
> do have cases that smaller vf is better and should be chosen, though
> for different reasons.

At some time in the past we had predictive commoning done before
vectorization (GCC 4.3 at least).

Patch is ok meanwhile.

Richard.

> Thanks,
> bin
>>
>>
>> Ciao,
>> Michael.
Michael Matz Nov. 22, 2016, 11:56 a.m. UTC | #7
Hi,

On Mon, 21 Nov 2016, Richard Biener wrote:

> >> Btw, this probably means that on haswell (or other archs with vf==4) mgrid
> >> is slower than necessary.  On mgrid you really really want predictive
> >> commoning to happen.  Vectorization isn't that interesting there.
> > Interesting, I will check if there is difference between 2/4 vf.  we
> > do have cases that smaller vf is better and should be chosen, though
> > for different reasons.
> 
> At some time in the past we had predictive commoning done before 
> vectorization (GCC 4.3 at least).

Not relevant here.  The testcase and the patch associated with it came in 4.5,
when predcom was already after vectorization and the latter disabled the
former (and mgrid took a 40% performance hit).  PR41783 and
https://gcc.gnu.org/ml/gcc-patches/2010-01/msg00855.html FWIW.


Ciao,
Michael.

Patch

diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
index 88238f9..3e5c4a4 100644
--- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
+++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
@@ -1,6 +1,6 @@ 
 ! { dg-do compile }
 ! { dg-require-effective-target vect_double }
-! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom-details" }
+! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom" }
 
 
 ******* RESID COMPUTES THE RESIDUAL:  R = V - AU
@@ -38,8 +38,9 @@  C
       RETURN
       END
 ! we want to check that predictive commoning did something on the
-! vectorized loop.
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 1 "pcom" { target lp64 } } }
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 2 "pcom" { target ia32 } } }
-! { dg-final { scan-tree-dump-times "Predictive commoning failed: no suitable chains" 0 "pcom" } }
-! { dg-final { scan-tree-dump-times "Loop iterates only 1 time, nothing to do" 1 "pcom" } }
+! vectorized loop.  If the vector factor is 2, the vectorized loop
+! can be predictive commoned and we check that a predictive commoning
+! variable is created with vector(2) type; if the vector factor is 4,
+! there is no predictive commoning opportunity and we check that a
+! vector(4) variable is created.  Only one vector factor can be used.
+! { dg-final { scan-tree-dump-times "vector\\(2\\) real\\(.*\\) vectp_u.*__lsm|vector\\(4\\) real\\(.*\\)" 1 "pcom" } }