Message ID: VI1PR0802MB217687F9BECF2A603577497AE7BE0@VI1PR0802MB2176.eurprd08.prod.outlook.com
State: New
On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
> predictive commoning opportunities for all possible loops.  This makes it
> fragile, because the vectorizer may peel the loop differently, as well as
> choose different vector factors.  For example, on x86-solaris the
> vectorizer doesn't peel for a prologue loop; for -march=haswell, the case
> has been failing for a long time because the vector factor is 4, while
> the iteration distance of the predictive commoning opportunity is smaller
> than 4.  This patch refines the test by only checking that a predictive
> commoning variable is created when the vector factor is 2, or that a
> vectorization variable is created when the factor is 4.  This works since
> we have only one main loop, and only one vector factor can be used.
> Test result checked for various x64 targets.  Is it OK?

I think that, as you write, the test is somewhat fragile.  But rather than
adjusting the scanning like you do, I'd add --param
vect-max-peeling-for-alignment=0 and -mprefer-avx128 as additional options
on x86_64-*-* i?86-*-*.

Your new pattern would fail with avx512 if vector(8) real were used.

What's the actual change that made the testcase fail, btw?

Richard.

> Thanks,
> bin
>
> gcc/testsuite/ChangeLog
> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>
>	PR testsuite/78114
>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>	checking predictive commoning variables in vectorized loop
>	wrt vector factor.
On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
>> predictive commoning opportunities for all possible loops.  This makes it
>> fragile, because the vectorizer may peel the loop differently, as well as
>> choose different vector factors.  For example, on x86-solaris the
>> vectorizer doesn't peel for a prologue loop; for -march=haswell, the case
>> has been failing for a long time because the vector factor is 4, while
>> the iteration distance of the predictive commoning opportunity is smaller
>> than 4.  This patch refines the test by only checking that a predictive
>> commoning variable is created when the vector factor is 2, or that a
>> vectorization variable is created when the factor is 4.  This works since
>> we have only one main loop, and only one vector factor can be used.
>> Test result checked for various x64 targets.  Is it OK?
>
> I think that, as you write, the test is somewhat fragile.  But rather
> than adjusting the scanning like you do,
> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128

In this way, is it better to add "--param vect-max-peeling-for-alignment=0"
for all targets?  Otherwise we still need to differentiate the test strings
to handle different targets.  But I have another question here: what if a
target can't handle unaligned accesses and the vectorizer has to peel for
alignment for it?

Also, do you think it's OK to check the predictive commoning PHI node as
below?

  # vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>

In this way, we don't need to take possible prologue/epilogue loops into
consideration.

> as additional options on x86_64-*-* i?86-*-*.
>
> Your new pattern would fail with avx512 if vector(8) real were used.
>
> What's the actual change that made the testcase fail, btw?

There are two cases.

A) After the vect_do_peeling change, the vectorizer may peel only one
iteration for the prologue loop (if vf == 2); the test string below was
added for this reason:

! { dg-final { scan-tree-dump-times "Loop iterates only 1 time, nothing to do" 1 "pcom" } }

This fails on x86_64 Solaris because the prologue loop is not peeled at
all.

B) Depending on ilp, I think the test strings below have been failing for a
long time with haswell:

! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 1 "pcom" { target lp64 } } }
! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 2 "pcom" { target ia32 } } }

because the vectorizer chooses vf == 4 in this case, and there are no
predictive commoning opportunities at all.  The newly added test string
also fails in this case, because the peeled prologue iterates more than
once.

Thanks,
bin

>
> Richard.
>
>> Thanks,
>> bin
>>
>> gcc/testsuite/ChangeLog
>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>
>>	PR testsuite/78114
>>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>	checking predictive commoning variables in vectorized loop
>>	wrt vector factor.
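[Editorial note: case A above turns on how many scalar iterations the
prologue loop peels before the vector loop starts.  The following Python
sketch models that decision; the function, addresses, and element sizes are
illustrative and not GCC's actual implementation.]

```python
def peel_counts(addr, elem_size, vf):
    """Model of prologue peeling for alignment: peel scalar iterations
    until `addr` reaches a full-vector boundary (vf * elem_size bytes).
    Purely illustrative; real vectorizers have more cases than this."""
    vec_bytes = vf * elem_size
    misalign = addr % vec_bytes
    if misalign == 0:
        return 0  # already aligned: no prologue loop at all
    # number of scalar elements to peel to reach alignment
    return (vec_bytes - misalign) // elem_size

# double (8-byte) elements, vf == 2: a 16-byte-aligned base needs no
# peeling, while an 8-byte-offset base needs exactly one peeled iteration
assert peel_counts(addr=0x1000, elem_size=8, vf=2) == 0
assert peel_counts(addr=0x1008, elem_size=8, vf=2) == 1
# with vf == 4 the same 8-byte offset needs three peeled iterations, so a
# scan string expecting "iterates only 1 time" no longer matches
assert peel_counts(addr=0x1008, elem_size=8, vf=4) == 3
```

This is why a scan string tied to one specific prologue trip count is
fragile: the count depends on both the base alignment a target guarantees
and the vector factor it picks.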
On Thu, Nov 17, 2016 at 11:26 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 8:32 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Nov 16, 2016 at 6:20 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> Currently test gfortran.dg/vect/fast-math-mgrid-resid.f checks all
>>> predictive commoning opportunities for all possible loops.  This makes
>>> it fragile, because the vectorizer may peel the loop differently, as
>>> well as choose different vector factors.  For example, on x86-solaris
>>> the vectorizer doesn't peel for a prologue loop; for -march=haswell,
>>> the case has been failing for a long time because the vector factor is
>>> 4, while the iteration distance of the predictive commoning opportunity
>>> is smaller than 4.  This patch refines the test by only checking that a
>>> predictive commoning variable is created when the vector factor is 2,
>>> or that a vectorization variable is created when the factor is 4.  This
>>> works since we have only one main loop, and only one vector factor can
>>> be used.
>>> Test result checked for various x64 targets.  Is it OK?
>>
>> I think that, as you write, the test is somewhat fragile.  But rather
>> than adjusting the scanning like you do,
>> I'd add --param vect-max-peeling-for-alignment=0 and -mprefer-avx128
> In this way, is it better to add "--param
> vect-max-peeling-for-alignment=0" for all targets?  Otherwise we still
> need to differentiate the test strings to handle different targets.  But
> I have another question here: what if a target can't handle unaligned
> accesses and the vectorizer has to peel for alignment for it?

You'd get versioning for alignment instead.

> Also, do you think it's OK to check the predictive commoning PHI node as
> below?
>   # vectp_u.122__lsm0.158_94 = PHI <vectp_u.122__lsm0.158_95(8), _96(6)>
> In this way, we don't need to take possible prologue/epilogue loops into
> consideration.

I had hoped that without peeling we could simply scan for "Executing
predictive commoning".  But with versioning for alignment you'd still get
two loops.

So maybe checking for both "Executing predictive commoning" and looking for
a vect_lsm PHI node is ok...

>> as additional options on x86_64-*-* i?86-*-*.
>>
>> Your new pattern would fail with avx512 if vector(8) real were used.
>>
>> What's the actual change that made the testcase fail, btw?
> There are two cases.
> A) After the vect_do_peeling change, the vectorizer may peel only one
> iteration for the prologue loop (if vf == 2); the test string below was
> added for this reason:
> ! { dg-final { scan-tree-dump-times "Loop iterates only 1 time,
> nothing to do" 1 "pcom" } }
> This fails on x86_64 Solaris because the prologue loop is not peeled at
> all.
> B) Depending on ilp, I think the test strings below have been failing for
> a long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> because the vectorizer chooses vf == 4 in this case, and there are no
> predictive commoning opportunities at all.

Yes.  I suggest -mprefer-avx128 for that.

> The newly added test string also fails in this case, because the peeled
> prologue iterates more than once.
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>
>>> gcc/testsuite/ChangeLog
>>> 2016-11-16  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>	PR testsuite/78114
>>>	* gfortran.dg/vect/fast-math-mgrid-resid.f: Refine test by
>>>	checking predictive commoning variables in vectorized loop
>>>	wrt vector factor.
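[Editorial note: "versioning for alignment" above means the compiler keeps
two copies of the loop behind a runtime alignment check, which is why a
scan can still see two loops even without peeling.  A minimal Python
sketch, with invented addresses and a toy reduction standing in for the
real loop body:]

```python
def sum_versioned(u, addr, vf=2):
    """Versioning for alignment, sketched: a runtime check selects between
    an 'aligned vector' copy of the loop and the original scalar copy, so
    the function body contains two versions of the same loop."""
    n = len(u)
    if addr % (vf * 8) == 0 and n % vf == 0:   # runtime alignment guard
        total = 0.0
        for i in range(0, n, vf):              # "vectorized" copy
            total += sum(u[i:i + vf])          # one vf-wide step
        return total
    total = 0.0
    for x in u:                                # original scalar copy
        total += x
    return total

data = [1.0, 2.0, 3.0, 4.0]
# both versions compute the same result; only the loop taken differs
assert sum_versioned(data, addr=0x1000) == 10.0   # aligned: vector copy
assert sum_versioned(data, addr=0x1008) == 10.0   # misaligned: scalar copy
```

Either copy can be the one a dump-scanning pattern matches, which is the
hazard Richard points out.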
Hi,

On Thu, 17 Nov 2016, Bin.Cheng wrote:

> B) Depending on ilp, I think the test strings below have been failing for
> a long time with haswell:
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 1 "pcom" { target lp64 } } }
> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
> without unrolling" 2 "pcom" { target ia32 } } }
> because the vectorizer chooses vf == 4 in this case, and there are no
> predictive commoning opportunities at all.
> The newly added test string also fails in this case, because the peeled
> prologue iterates more than once.

Btw, this probably means that on haswell (or other archs with vf==4) mgrid
is slower than necessary.  On mgrid you really really want predictive
commoning to happen.  Vectorization isn't that interesting there.

Ciao,
Michael.
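[Editorial note: the optimization Michael is referring to, predictive
commoning, carries values loaded in one iteration over to the next so they
need not be reloaded.  A minimal Python sketch on a 1-D three-point
stencil, which only stands in for the real mgrid resid kernel:]

```python
def resid_naive(u, n):
    """Each iteration loads all three neighbours from the array."""
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = u[i - 1] + u[i] + u[i + 1]   # three loads per iteration
    return r

def resid_pcom(u, n):
    """Same computation after manual predictive commoning: values loaded
    in one iteration are rotated into the next, leaving one load per
    iteration."""
    r = [0.0] * n
    a, b = u[0], u[1]          # reuse chain carried across iterations
    for i in range(1, n - 1):
        c = u[i + 1]           # the only load in the loop body
        r[i] = a + b + c
        a, b = b, c            # rotate the chain
    return r

u = [float(i * i % 7) for i in range(10)]
assert resid_naive(u, 10) == resid_pcom(u, 10)
```

The scalar reuse distance here is one iteration; once the loop is
vectorized by a factor vf, a reuse distance smaller than vf no longer
crosses a vector iteration, which is why vf == 4 leaves pcom nothing to do
on this testcase.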
On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>
>> B) Depending on ilp, I think the test strings below have been failing
>> for a long time with haswell:
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 1 "pcom" { target lp64 } } }
>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>> without unrolling" 2 "pcom" { target ia32 } } }
>> because the vectorizer chooses vf == 4 in this case, and there are no
>> predictive commoning opportunities at all.
>> The newly added test string also fails in this case, because the peeled
>> prologue iterates more than once.
>
> Btw, this probably means that on haswell (or other archs with vf==4)
> mgrid is slower than necessary.  On mgrid you really really want
> predictive commoning to happen.  Vectorization isn't that interesting
> there.

Interesting; I will check if there is a difference between vf 2 and 4.  We
do have cases where a smaller vf is better and should be chosen, though
for different reasons.

Thanks,
bin

>
>
> Ciao,
> Michael.
On Fri, Nov 18, 2016 at 5:57 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 4:52 PM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Thu, 17 Nov 2016, Bin.Cheng wrote:
>>
>>> B) Depending on ilp, I think the test strings below have been failing
>>> for a long time with haswell:
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 1 "pcom" { target lp64 } } }
>>> ! { dg-final { scan-tree-dump-times "Executing predictive commoning
>>> without unrolling" 2 "pcom" { target ia32 } } }
>>> because the vectorizer chooses vf == 4 in this case, and there are no
>>> predictive commoning opportunities at all.
>>> The newly added test string also fails in this case, because the
>>> peeled prologue iterates more than once.
>>
>> Btw, this probably means that on haswell (or other archs with vf==4)
>> mgrid is slower than necessary.  On mgrid you really really want
>> predictive commoning to happen.  Vectorization isn't that interesting
>> there.
> Interesting; I will check if there is a difference between vf 2 and 4.
> We do have cases where a smaller vf is better and should be chosen,
> though for different reasons.

At some time in the past we had predictive commoning done before
vectorization (GCC 4.3 at least).

The patch is OK meanwhile.

Richard.

> Thanks,
> bin
>>
>>
>> Ciao,
>> Michael.
Hi,

On Mon, 21 Nov 2016, Richard Biener wrote:

> >> Btw, this probably means that on haswell (or other archs with vf==4)
> >> mgrid is slower than necessary.  On mgrid you really really want
> >> predictive commoning to happen.  Vectorization isn't that interesting
> >> there.
> > Interesting; I will check if there is a difference between vf 2 and 4.
> > We do have cases where a smaller vf is better and should be chosen,
> > though for different reasons.
>
> At some time in the past we had predictive commoning done before
> vectorization (GCC 4.3 at least).

Not relevant here.  The testcase and the patch associated with it came in
4.5, where predcom was already after vectorization and the latter disabled
the former (and mgrid took a 40% performance hit).  PR41783 and
https://gcc.gnu.org/ml/gcc-patches/2010-01/msg00855.html FWIW.

Ciao,
Michael.
diff --git a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
index 88238f9..3e5c4a4 100644
--- a/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
+++ b/gcc/testsuite/gfortran.dg/vect/fast-math-mgrid-resid.f
@@ -1,6 +1,6 @@
 ! { dg-do compile }
 ! { dg-require-effective-target vect_double }
-! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom-details" }
+! { dg-options "-O3 -fpredictive-commoning -fdump-tree-pcom" }
 
 ******* RESID COMPUTES THE RESIDUAL:  R = V - AU
 
@@ -38,8 +38,9 @@ C
       RETURN
       END
 ! we want to check that predictive commoning did something on the
-! vectorized loop.
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 1 "pcom" { target lp64 } } }
-! { dg-final { scan-tree-dump-times "Executing predictive commoning without unrolling" 2 "pcom" { target ia32 } } }
-! { dg-final { scan-tree-dump-times "Predictive commoning failed: no suitable chains" 0 "pcom" } }
-! { dg-final { scan-tree-dump-times "Loop iterates only 1 time, nothing to do" 1 "pcom" } }
+! vectorized loop.  If the vector factor is 2, the vectorized loop
+! can be predictive commoned; we check that a predictive commoning
+! variable is created with vector(2) type.  If the vector factor is
+! 4, there is no predictive commoning opportunity; we check that a
+! vector(4) variable is created.  Only one vector factor can be used.
+! { dg-final { scan-tree-dump-times "vector\\(2\\) real\\(.*\\) vectp_u.*__lsm|vector\\(4\\) real\\(.*\\)" 1 "pcom" } }
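[Editorial note: the combined regular expression in the new dg-final
directive can be exercised outside the testsuite.  In this Python check
the sample dump lines are invented for illustration (only the
vectp_u.122__lsm0.158 name quoted earlier in the thread is real), and
Tcl's doubled backslashes are reduced to single ones:]

```python
import re

# the pattern from the new dg-final directive, with Tcl escaping removed
pattern = re.compile(r"vector\(2\) real\(.*\) vectp_u.*__lsm"
                     r"|vector\(4\) real\(.*\)")

# hypothetical pcom-dump declaration lines
vf2_line = "vector(2) real(kind=8) vectp_u.122__lsm0.158;"
vf4_line = "vector(4) real(kind=8) vect__94;"
scalar_line = "real(kind=8) u__lsm0.158;"

assert pattern.search(vf2_line)         # vf == 2: pcom variable matches
assert pattern.search(vf4_line)         # vf == 4: plain vector temp matches
assert not pattern.search(scalar_line)  # scalar declarations don't match
```

The two alternatives make the directive accept exactly one of the two
legitimate outcomes, which is how a single scan-tree-dump-times count of 1
can hold for both vector factors.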