
Tiny predcom improvement (PR tree-optimization/59643)

Message ID 20131231190433.GH892@tucnak.redhat.com
State New

Commit Message

Jakub Jelinek Dec. 31, 2013, 7:04 p.m. UTC
Hi!

As written in the PR, I've been looking into why llvm 3.[34] is so much
faster on the Scimark2 SOR benchmark, and the reason is that its
predictive commoning (or whatever it uses) doesn't give up on the inner
loop, while our predcom gives up unnecessarily, because there are reads
that could alias the write.

This simple patch improves the benchmark by 42%.  We already ignore
unsuitable dependences for read/read pairs; the patch extends that to
unsuitable read/write dependences, by putting the read (and anything in
its component) into the bad component, which is ignored.  pcom doesn't
optimize away the writes and will keep the potentially aliasing reads
unmodified as well.  Without the patch we'd merge the two components,
and as determine_offset fails between the two DRs, the whole merged
component would always be unsuitable and thus ignored.  With the patch
we'll hopefully have some other reads with a known offset to the write
and can optimize those, so the patch should always handle at least what
it did before, and perhaps some more cases.

The inner loop from the (public domain) benchmark is added in two tests:
one runtime test and one test checking whether pcom actually optimized it.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2013-12-31  Jakub Jelinek  <jakub@redhat.com>

	PR tree-optimization/59643
	* tree-predcom.c (split_data_refs_to_components): If one dr is
	read and one write, determine_offset fails and the write isn't
	in the bad component, just put the read into the bad component.

	* gcc.dg/pr59643.c: New test.
	* gcc.c-torture/execute/pr59643.c: New test.



	Jakub

Comments

Vladimir Makarov Dec. 31, 2013, 8:22 p.m. UTC | #1
On 12/31/2013, 2:04 PM, Jakub Jelinek wrote:
> Hi!
>
> As written in the PR, I've been looking why is llvm 3.[34] so much faster
> on Scimark2 SOR benchmark and the reason is that it's predictive commoning
> [...]

Congratulations, Jakub!

Scimark2 is always used by Phoronix to show how bad GCC is in comparison
with LLVM.  It is understandable: for some reason Phoronix is very
biased towards LLVM and is, I'd say, a marketing machine for LLVM.

They use very small, selected benchmarks.  Benchmarking is evil, but I
trust SPEC2000/SPEC2006 more, as they use bigger, real programs.  With
their approach they could have used Whetstone instead of Scimark, but the
results would be very bad for LLVM (e.g. 64-bit code at -O3 on Haswell):

dfa:MWIPS      3705.311  -- LLVM3.3
nor:MWIPS      4587.555  -- GCC4.8

It would also be nice to fix the Scimark2 LU degradation in GCC 4.8, to
finally stop this nonsense from Phoronix.
Jakub Jelinek Jan. 1, 2014, 5:50 p.m. UTC | #2
On Tue, Dec 31, 2013 at 03:22:07PM -0500, Vladimir Makarov wrote:
> Scimark2 is always used by Phoronix to show how bad GCC in
> comparison with LLVM.  It is understandable. For some reasons
> phoronix is very biased to LLVM and, I'd say, a marketing machine
> for LLVM.
> 
> They use very small selected benchmarks.  Benchmarking is evil but I
> believe more SPEC2000/SPEC2006 which use bigger real programs.  With
> their approach, they could use whetstone instead of Scimark but the
> results would be very bad for LLVM (e.g. 64-bit code for -O3 on
> Haswell):

I'm aware of all that; that said, putting in some minimal effort to raise
obstacles for their marketing is worth it.

> It would be nice to fix also Scimark2 LU GCC-4.8 degradation to
> finally stop this nonsense from Phoronix.

But strangely I don't see any LU GCC-4.8 degradation.  This is
-O3 -march=native on an i7-2600.  The Phoronix numbers were from an
i7-4960X, but that is just a SandyBridge vs. IvyBridge-E difference, so
at least for GCC I don't think the generated code differs significantly
(verified for the patched trunk: -O3 -march=native and -O3 -march=ivybridge
generated bitwise identical code); the CPU is just faster and has bigger
caches.  Unfortunately I don't have an i7-4960X to test on.  The Phoronix
posted numbers show an 80% improvement on LU from clang 3.3 to clang 3.4.

LU_factor has several inner loops.  Only the last one
                for (jj=j+1; jj<N; jj++)
                  Aii[jj] -= AiiJ * Aj[jj];
(the only one at loop depth 3) is vectorized (with versioning for alias
due to the badly chosen data structure) by both gcc and clang.  Vectorizing
            for (k=j+1; k<M; k++)
                A[k][j] *= recp;
would require gather+scatter (i.e. Skylake+) plus some guarantee that it
doesn't alias (#pragma omp simd?), but even then it is questionable
whether vectorization would be beneficial, and
        double t = fabs(A[j][j]);
        for (i=j+1; i<M; i++)
        {
            double ab = fabs(A[i][j]);
            if ( ab > t)
            {
                jp = i;
                t = ab;
            }
        }
is perhaps vectorizable with a gather load (i.e. AVX2+) and -Ofast; I
haven't tried that though.

GCC 4.7.4 20130609:

Composite Score:         1323.20
FFT             Mflops:   183.61    (N=1048576)
SOR             Mflops:   618.12    (1000 x 1000)
MonteCarlo:     Mflops:   412.98
Sparse matmult  Mflops:  1773.16    (N=100000, nz=1000000)
LU              Mflops:  3628.12    (M=1000, N=1000)

GCC 4.8.2 20131212 (Red Hat 4.8.2-7):

Composite Score:         1437.05
FFT             Mflops:   213.91    (N=1048576)
SOR             Mflops:  1139.72    (1000 x 1000)
MonteCarlo:     Mflops:   526.34
Sparse matmult  Mflops:  1713.81    (N=100000, nz=1000000)
LU              Mflops:  3591.47    (M=1000, N=1000)

GCC 4.9.0 20131230:

Composite Score:         1569.78
FFT             Mflops:   254.65    (N=1048576)
SOR             Mflops:  1131.31    (1000 x 1000)
MonteCarlo:     Mflops:   563.64
Sparse matmult  Mflops:  1780.87    (N=100000, nz=1000000)
LU              Mflops:  4118.40    (M=1000, N=1000)

GCC 4.9.0 20140101 plus the predcom patch:

Composite Score:         1692.05
FFT             Mflops:   253.90    (N=1048576)
SOR             Mflops:  1605.16    (1000 x 1000)
MonteCarlo:     Mflops:   556.34
Sparse matmult  Mflops:  1861.82    (N=100000, nz=1000000)
LU              Mflops:  4183.01    (M=1000, N=1000)

clang 3.3:

Composite Score:         1576.91
FFT             Mflops:   255.41    (N=1048576)
SOR             Mflops:  1613.61    (1000 x 1000)
MonteCarlo:     Mflops:   557.79
Sparse matmult  Mflops:  1914.02    (N=100000, nz=1000000)
LU              Mflops:  3543.74    (M=1000, N=1000)

clang 3.4 20131230:

Composite Score:         1566.95
FFT             Mflops:   255.41    (N=1048576)
SOR             Mflops:  1617.87    (1000 x 1000)
MonteCarlo:     Mflops:   528.94
Sparse matmult  Mflops:  1804.41    (N=100000, nz=1000000)
LU              Mflops:  3628.12    (M=1000, N=1000)

	Jakub
Vladimir Makarov Jan. 1, 2014, 8:21 p.m. UTC | #3
On 1/1/2014, 12:50 PM, Jakub Jelinek wrote:
> On Tue, Dec 31, 2013 at 03:22:07PM -0500, Vladimir Makarov wrote:
>> Scimark2 is always used by Phoronix to show how bad GCC in
>> comparison with LLVM.  It is understandable. For some reasons
>> phoronix is very biased to LLVM and, I'd say, a marketing machine
>> for LLVM.
>>
>> They use very small selected benchmarks.  Benchmarking is evil but I
>> believe more SPEC2000/SPEC2006 which use bigger real programs.  With
>> their approach, they could use whetstone instead of Scimark but the
>> results would be very bad for LLVM (e.g. 64-bit code for -O3 on
>> Haswell):
>
> I'm aware of all that, that said putting some minimal effort to increase
> obstackles for their marketing is worth it.
>

I believe that Michael Larabel has no idea what he is measuring.

I don't trust his data either.  Recently I tried to reproduce the Scimark
results for GCC 4.8 and LLVM 3.3 from his article "LLVM Clang 3.4 Already
Has Some Performance Changes":

http://www.phoronix.com/scan.php?page=article&item=llvm_clang34_first&num=2

He used an i7-4770K for this.  I used the closest machine I could find,
an i5-4670 (with turbo mode switched off).  The important difference is
0.1 GHz in frequency (3.5 GHz vs. 3.4 GHz).  I got a close GCC Scimark
(-large) composite score when I used -O, and still on my machine the
composite score was 20% higher than the article reports, although the
article says that -O3 -march=core-avx were used.

His articles about LLVM and GCC usually contain a lot of negative
emotions about GCC and positive ones about LLVM.  I don't know, maybe he
tried to communicate with the GCC community and had a frustrating
experience.  That is my only explanation right now for why he is so
biased towards LLVM.

So far I have seen several pitfalls in his testing methodology:

o Micro-benchmarking.  You just gave an example: LU factorization in
Scimark2, where most of the benchmark time is spent in a two-line loop.

o Comparing LLVM and GCC on Fortran benchmarks.  LLVM has no Fortran FE
and just quietly calls the system GCC.  So comparing LLVM and GCC on
Fortran benchmarks means comparing the system GCC with a given GCC.

o IMO the data in the articles lack credibility, maybe because of a
wrong setup.
Jeff Law Jan. 7, 2014, 5:48 a.m. UTC | #4
On 12/31/13 12:04, Jakub Jelinek wrote:
> Hi!
>
> [...]
>
> 2013-12-31  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR tree-optimization/59643
> 	* tree-predcom.c (split_data_refs_to_components): If one dr is
> 	read and one write, determine_offset fails and the write isn't
> 	in the bad component, just put the read into the bad component.
>
> 	* gcc.dg/pr59643.c: New test.
> 	* gcc.c-torture/execute/pr59643.c: New test.
OK for the trunk.

jeff
H.J. Lu Jan. 10, 2014, 1:01 a.m. UTC | #5
On Tue, Dec 31, 2013 at 11:04 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> [...]
>
>         PR tree-optimization/59643
>         * tree-predcom.c (split_data_refs_to_components): If one dr is
>         read and one write, determine_offset fails and the write isn't
>         in the bad component, just put the read into the bad component.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59745

Patch

--- gcc/tree-predcom.c.jj	2013-12-31 12:50:47.169748385 +0100
+++ gcc/tree-predcom.c	2013-12-31 15:39:44.422297404 +0100
@@ -772,10 +772,37 @@  split_data_refs_to_components (struct lo
       bad = component_of (comp_father, n);
 
       /* If both A and B are reads, we may ignore unsuitable dependences.  */
-      if (DR_IS_READ (dra) && DR_IS_READ (drb)
-	  && (ia == bad || ib == bad
-	      || !determine_offset (dra, drb, &dummy_off)))
-	continue;
+      if (DR_IS_READ (dra) && DR_IS_READ (drb))
+	{
+	  if (ia == bad || ib == bad
+	      || !determine_offset (dra, drb, &dummy_off))
+	    continue;
+	}
+      /* If A is read and B write or vice versa and there is unsuitable
+	 dependence, instead of merging both components into a component
+	 that will certainly not pass suitable_component_p, just put the
+	 read into bad component, perhaps at least the write together with
+	 all the other data refs in its component will be optimizable.  */
+      else if (DR_IS_READ (dra) && ib != bad)
+	{
+	  if (ia == bad)
+	    continue;
+	  else if (!determine_offset (dra, drb, &dummy_off))
+	    {
+	      merge_comps (comp_father, comp_size, bad, ia);
+	      continue;
+	    }
+	}
+      else if (DR_IS_READ (drb) && ia != bad)
+	{
+	  if (ib == bad)
+	    continue;
+	  else if (!determine_offset (dra, drb, &dummy_off))
+	    {
+	      merge_comps (comp_father, comp_size, bad, ib);
+	      continue;
+	    }
+	}
 
       merge_comps (comp_father, comp_size, ia, ib);
     }
--- gcc/testsuite/gcc.dg/pr59643.c.jj	2013-12-31 15:34:48.584810067 +0100
+++ gcc/testsuite/gcc.dg/pr59643.c	2013-12-31 15:34:48.584810067 +0100
@@ -0,0 +1,15 @@ 
+/* PR tree-optimization/59643 */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-pcom-details" } */
+
+void
+foo (double *a, double *b, double *c, double d, double e, int n)
+{
+  int i;
+  for (i = 1; i < n - 1; i++)
+    a[i] = d * (b[i] + c[i] + a[i - 1] + a[i + 1]) + e * a[i];
+}
+
+/* { dg-final { scan-tree-dump-times "Before commoning:" 1 "pcom" } } */
+/* { dg-final { scan-tree-dump-times "Unrolling 2 times" 1 "pcom" } } */
+/* { dg-final { cleanup-tree-dump "pcom" } } */
--- gcc/testsuite/gcc.c-torture/execute/pr59643.c.jj	2013-12-31 15:34:48.584810067 +0100
+++ gcc/testsuite/gcc.c-torture/execute/pr59643.c	2013-12-31 15:34:48.584810067 +0100
@@ -0,0 +1,39 @@ 
+/* PR tree-optimization/59643 */
+
+#define N 32
+
+__attribute__((noinline, noclone)) void
+foo (double *a, double *b, double *c, double d, double e, int n)
+{
+  int i;
+  for (i = 1; i < n - 1; i++)
+    a[i] = d * (b[i] + c[i] + a[i - 1] + a[i + 1]) + e * a[i];
+}
+
+double expected[] = {
+  0.0, 10.0, 44.0, 110.0, 232.0, 490.0, 1020.0, 2078.0, 4152.0, 8314.0,
+  16652.0, 33326.0, 66664.0, 133354.0, 266748.0, 533534.0, 1067064.0,
+  2134138.0, 4268300.0, 8536622.0, 17073256.0, 34146538.0, 68293116.0,
+  136586270.0, 273172536.0, 546345082.0, 1092690188.0, 2185380398.0,
+  4370760808.0, 8741521642.0, 17483043324.0, 6.0
+};
+
+int
+main ()
+{
+  int i;
+  double a[N], b[N], c[N];
+  if (__DBL_MANT_DIG__ <= 35)
+    return 0;
+  for (i = 0; i < N; i++)
+    {
+      a[i] = (i & 3) * 2.0;
+      b[i] = (i & 7) - 4;
+      c[i] = i & 7;
+    }
+  foo (a, b, c, 2.0, 3.0, N);
+  for (i = 0; i < N; i++)
+    if (a[i] != expected[i])
+      __builtin_abort ();
+  return 0;
+}