Fix PR69274, 435.gromacs performance regression due to RA

Message ID	alpine.LSU.2.11.1602051016440.31122@t29.fhfr.qr
State	New
Headers	show Return-Path: <gcc-patches-return-420841-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:mime-version:content-type; q=dns; s=default; b=g01BDI7hAejmNTtZ1NnuKX8nHMYUJfb/Go0iKpGm5v/arEKzs5 dvEoR3CN/FiLtn1eTQasdPqTT0wEMonmvFDzYkuDhp1RmIRzHAhrN5S19ow5NPI1 32SM0J8myNo3l390wc3Jrk7LzFEUnUxAB7h2MtPtg+OpOY/QbIygJZ+0U= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Date: Fri, 5 Feb 2016 10:25:29 +0100 (CET) From: Richard Biener <rguenther@suse.de> To: gcc-patches@gcc.gnu.org cc: vmakarov@redhat.com Subject: [PATCH] Fix PR69274, 435.gromacs performance regression due to RA Message-ID: <alpine.LSU.2.11.1602051016440.31122@t29.fhfr.qr> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII

Richard Biener Feb. 5, 2016, 9:25 a.m. UTC

The following patch fixes the performance regression for 435.gromacs
on x86_64 with AVX2 (Haswell or bdver2) caused by

2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* ira.c (ira_setup_alts): Move the scan for commutative modifier
	to the first loop to make it work even with disabled alternatives.

which in itself is a desirable change giving the RA more freedom.

It turns out the fix makes an existing issue more severe in detecting
more swappable alternatives and thus exiting ira_setup_alts with
operands swapped in recog_data.  This seems to give a slight preference
to choose alternatives with the operands swapped (I didn't try to
investigate how IRA handles the "merged" alternative mask and
operand swapping in its further processing).

Of course previous RTL optimizers and canonicalization rules as well
as backend patterns are tuned towards the not swapped variant and thus
it happens doing more swaps ends up in slower code (I didn't closely
investigate).

So I tested the following patch which simply makes sure that
ira_setup_alts does not alter recog_data.

On a Intel Haswell machine I get (base is with the patch, peak is with
the above change reverted):

                                  Estimated                       
Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
435.gromacs      7140        264       27.1 S    7140        270       
26.5 S
435.gromacs      7140        264       27.1 *    7140        269       
26.6 S
435.gromacs      7140        263       27.1 S    7140        269       
26.5 *

Richard Biener Feb. 5, 2016, 12:10 p.m. UTC | #1

On Fri, 5 Feb 2016, Richard Biener wrote:

> 
> The following patch fixes the performance regression for 435.gromacs
> on x86_64 with AVX2 (Haswell or bdver2) caused by
> 
> 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
> 
> 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> 	to the first loop to make it work even with disabled alternatives.
> 
> which in itself is a desirable change giving the RA more freedom.
> 
> It turns out the fix makes an existing issue more severe in detecting
> more swappable alternatives and thus exiting ira_setup_alts with
> operands swapped in recog_data.  This seems to give a slight preference
> to choose alternatives with the operands swapped (I didn't try to
> investigate how IRA handles the "merged" alternative mask and
> operand swapping in its further processing).
> 
> Of course previous RTL optimizers and canonicalization rules as well
> as backend patterns are tuned towards the not swapped variant and thus
> it happens doing more swaps ends up in slower code (I didn't closely
> investigate).
> 
> So I tested the following patch which simply makes sure that
> ira_setup_alts does not alter recog_data.
> 
> On a Intel Haswell machine I get (base is with the patch, peak is with
> the above change reverted):
> 
>                                   Estimated                       
> Estimated
>                 Base     Base       Base        Peak     Peak       Peak
> Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
> -------------- ------  ---------  ---------    ------  ---------  
> ---------
> 435.gromacs      7140        264       27.1 S    7140        270       
> 26.5 S
> 435.gromacs      7140        264       27.1 *    7140        269       
> 26.6 S
> 435.gromacs      7140        263       27.1 S    7140        269       
> 26.5 *
> ==============================================================================
> 435.gromacs      7140        264       27.1 *    7140        269       
> 26.5 *
> 
> which means the patched result is even better than before Andreas
> change.  Current trunk homes in at a Run Time of 321s (which is
> the regression to fix).
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?

It fails

FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1

on i?86 (or x86_64 -m32) though, generating

f:
.LFB0:
        .cfi_startproc
        movl    4(%esp), %eax
        leal    1(%eax), %edx
        movsbl  a+1(%eax), %eax
        movsbl  b(%edx), %edx
        addl    %edx, %eax
        ret

before IRA we have

(insn 6 3 8 2 (parallel [
            (set (reg:SI 87 [ _2 ])
                (plus:SI (reg/v:SI 93 [ i ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
218 {*addsi_1}
     (expr_list:REG_DEAD (reg/v:SI 93 [ i ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 8 6 10 2 (set (reg:SI 96)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("a") [flags 0x2]  <var_decl 
0x7fe25bdf0cf0 a>)) [0 a S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (nil))
(insn 10 8 11 2 (set (reg:SI 98)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("b") [flags 0x2]  <var_decl 
0x7fe25bdf0d80 b>)) [0 b S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (expr_list:REG_DEAD (reg:SI 87 [ _2 ])

where I wonder if we can use a single insn why we didn't combine/forwprop
to this before RA.  Probably non-single-use issue (though I thought
forwprop doesn't have this limitation).

Without the patch it is postreload that manages to combine them.
After the patch we allocate dx so that

(insn 10 8 11 2 (set (reg:SI 1 dx [98])
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] 
[87])
                    (symbol_ref:SI ("b") [flags 0x2]  <var_decl 
0x7f02a1bd9d80 b>)) [0 b S1 A8]))) 
/space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 
151 {extendqisi2}
     (nil))

which seems to confuse that transform enough to not carry it out
(reload_combine).  Hopefully somebody else can help here in a more
efficient way than I could do.

IMHO it shouldn't block the patch if that looks good itself (I think
it's sound).  I'll have a look at the reload_combine issue early
next week if nobody beats me to it.  PR69689.

Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
show any regressions.

Richard.

> Thanks,
> Richard.
> 
> 2016-02-05   Richard Biener  <rguenther@suse.de>
> 
> 	PR rtl-optimization/69274
> 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> 	order.
> 
> Index: gcc/ira.c
> ===================================================================
> --- gcc/ira.c	(revision 231814)
> +++ gcc/ira.c	(working copy)
> @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
>  	}
>        if (commutative < 0)
>  	break;
> -      if (curr_swapped)
> -	break;
> +      /* Swap forth and back to avoid changing recog_data.  */
>        std::swap (recog_data.operand[commutative],
>  		 recog_data.operand[commutative + 1]);
> +      if (curr_swapped)
> +	break;
>      }
>  }
>  
>

Jakub Jelinek Feb. 5, 2016, 12:11 p.m. UTC | #2

On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> show any regressions.

Any improvements there?

	Jakub

Bernd Schmidt Feb. 5, 2016, 12:32 p.m. UTC | #3

On 02/05/2016 01:10 PM, Richard Biener wrote:
> It fails
>
> FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1
>
> on i?86 (or x86_64 -m32) though, generating
>
> f:
> .LFB0:
>          .cfi_startproc
>          movl    4(%esp), %eax
>          leal    1(%eax), %edx
>          movsbl  a+1(%eax), %eax
>          movsbl  b(%edx), %edx
>          addl    %edx, %eax
>          ret

Well, it looks like the first movsbl load clobbers the potentially 
better base register, so trivial propagation doesn't work.

It might be another case where allowing 2->2 in combine would help. Or 
enabling -frename-registers and rerunning reload_combine afterwards.

Bernd

Richard Biener Feb. 5, 2016, 12:35 p.m. UTC | #4

On Fri, 5 Feb 2016, Jakub Jelinek wrote:

> On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > show any regressions.
> 
> Any improvements there?

Noise, but I only did 1 run (I did 3 only for 435.gromacs to confirm
the progress over pre-Andreas-patch state which I would otherwise
classified as noise as well).

I can do a 3-run over the weekend if you think that's useful.  OTOH
checking the patch in will get more benchmark/flag/HW covering
from our auto-testers (which also only do 1-runs of course and will
see unrelated changes as well).

Richard.

Jakub Jelinek Feb. 5, 2016, 12:42 p.m. UTC | #5

On Fri, Feb 05, 2016 at 01:35:03PM +0100, Richard Biener wrote:
> On Fri, 5 Feb 2016, Jakub Jelinek wrote:
> 
> > On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > > show any regressions.
> > 
> > Any improvements there?
> 
> Noise, but I only did 1 run (I did 3 only for 435.gromacs to confirm
> the progress over pre-Andreas-patch state which I would otherwise
> classified as noise as well).

And on that single run gromacs was non-noise?

> I can do a 3-run over the weekend if you think that's useful.  OTOH
> checking the patch in will get more benchmark/flag/HW covering
> from our auto-testers (which also only do 1-runs of course and will
> see unrelated changes as well).

Maybe that is good enough.  Anyway, the patch is Vlad's area of expertise...

	Jakub

Richard Biener Feb. 5, 2016, 1:06 p.m. UTC | #6

On Fri, 5 Feb 2016, Jakub Jelinek wrote:

> On Fri, Feb 05, 2016 at 01:35:03PM +0100, Richard Biener wrote:
> > On Fri, 5 Feb 2016, Jakub Jelinek wrote:
> > 
> > > On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > > > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > > > show any regressions.
> > > 
> > > Any improvements there?
> > 
> > Noise, but I only did 1 run (I did 3 only for 435.gromacs to confirm
> > the progress over pre-Andreas-patch state which I would otherwise
> > classified as noise as well).
> 
> And on that single run gromacs was non-noise?

Well, it was like 264s vs. 269s but it looked good so I tried to
confirm progress independent of Andreas change ;)  You can look
at the usual variance of the machine here:
http://gcc.opensuse.org/SPEC/CFP/sb-czerny-head-64-2006/recent.html
(that's of course including source changes for each dot).

So yes, in the full 1-run I account +- 5s (out of ~260) as noise.

Note that all SPEC tests ran ontop of r23181[34].

> > I can do a 3-run over the weekend if you think that's useful.  OTOH
> > checking the patch in will get more benchmark/flag/HW covering
> > from our auto-testers (which also only do 1-runs of course and will
> > see unrelated changes as well).
> 
> Maybe that is good enough.  Anyway, the patch is Vlad's area of expertise...

Yes, and we need to decide on that fallout and if we should go for the
more ugly variant (it also fixes the gromacs regression for me).

Richard.

Vladimir Makarov Feb. 5, 2016, 8:14 p.m. UTC | #7

On 02/05/2016 04:25 AM, Richard Biener wrote:
> The following patch fixes the performance regression for 435.gromacs
> on x86_64 with AVX2 (Haswell or bdver2) caused by
>
> 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
>
> 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> 	to the first loop to make it work even with disabled alternatives.
>
> which in itself is a desirable change giving the RA more freedom.
>
> It turns out the fix makes an existing issue more severe in detecting
> more swappable alternatives and thus exiting ira_setup_alts with
> operands swapped in recog_data.  This seems to give a slight preference
> to choose alternatives with the operands swapped (I didn't try to
> investigate how IRA handles the "merged" alternative mask and
> operand swapping in its further processing).
Alternative mask excludes alternatives which will be definitely rejected 
in LRA.  This approach is to speed up LRA (a lot was done to speed up RA 
but still it consumes a big chunk of compiler time which is unusual for 
all compilers).

LRA and reload prefer insns without commutative operands swap when all 
other costs are the same.
> Of course previous RTL optimizers and canonicalization rules as well
> as backend patterns are tuned towards the not swapped variant and thus
> it happens doing more swaps ends up in slower code (I didn't closely
> investigate).
>
> So I tested the following patch which simply makes sure that
> ira_setup_alts does not alter recog_data.
>
> On a Intel Haswell machine I get (base is with the patch, peak is with
> the above change reverted):
>
>                                    Estimated
> Estimated
>                  Base     Base       Base        Peak     Peak       Peak
> Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
> -------------- ------  ---------  ---------    ------  ---------
> ---------
> 435.gromacs      7140        264       27.1 S    7140        270
> 26.5 S
> 435.gromacs      7140        264       27.1 *    7140        269
> 26.6 S
> 435.gromacs      7140        263       27.1 S    7140        269
> 26.5 *
> ==============================================================================
> 435.gromacs      7140        264       27.1 *    7140        269
> 26.5 *
>
> which means the patched result is even better than before Andreas
> change.  Current trunk homes in at a Run Time of 321s (which is
> the regression to fix).
   Thanks for working on this, Richard.  It is not easy to find reasons 
for worse code on modern processors after such small change.  As RA is 
based on heuristics it hard to predict the change for a specific 
benchmark.  I remember I checked  Andreas patch on SPEC2000 in a hope 
that it also improves x86-64 code but I did not see a difference.

It is even hard to say sometimes how a specific (non-heuristic) 
optimization will affect a specific benchmark performance when a lot of 
unknown (from alignments to CPU internals are involved).  An year ago I 
tried to use ML to choose best options.  I used a set of about 100 C 
benchmarks (and even more functions).  For practically every benchmark, 
I had an option modification to -Ofast resulting in faster code but ML 
prediction did not work at all.
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
>
OK.  Thanks again.
> 2016-02-05   Richard Biener  <rguenther@suse.de>
>
> 	PR rtl-optimization/69274
> 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> 	order.
>
> Index: gcc/ira.c
> ===================================================================
> --- gcc/ira.c	(revision 231814)
> +++ gcc/ira.c	(working copy)
> @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
>   	}
>         if (commutative < 0)
>   	break;
> -      if (curr_swapped)
> -	break;
> +      /* Swap forth and back to avoid changing recog_data.  */
>         std::swap (recog_data.operand[commutative],
>   		 recog_data.operand[commutative + 1]);
> +      if (curr_swapped)
> +	break;
>       }
>   }
>

Richard Biener Feb. 8, 2016, 9:09 a.m. UTC | #8

On Fri, 5 Feb 2016, Vladimir Makarov wrote:

> On 02/05/2016 04:25 AM, Richard Biener wrote:
> > The following patch fixes the performance regression for 435.gromacs
> > on x86_64 with AVX2 (Haswell or bdver2) caused by
> > 
> > 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
> > 
> > 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> > 	to the first loop to make it work even with disabled alternatives.
> > 
> > which in itself is a desirable change giving the RA more freedom.
> > 
> > It turns out the fix makes an existing issue more severe in detecting
> > more swappable alternatives and thus exiting ira_setup_alts with
> > operands swapped in recog_data.  This seems to give a slight preference
> > to choose alternatives with the operands swapped (I didn't try to
> > investigate how IRA handles the "merged" alternative mask and
> > operand swapping in its further processing).
> Alternative mask excludes alternatives which will be definitely rejected in
> LRA.  This approach is to speed up LRA (a lot was done to speed up RA but
> still it consumes a big chunk of compiler time which is unusual for all
> compilers).
> 
> LRA and reload prefer insns without commutative operands swap when all other
> costs are the same.

Ok, so leaving operands swapped in ira_setup_alts will prefer the
swapped variants in case other costs are the same.

> > Of course previous RTL optimizers and canonicalization rules as well
> > as backend patterns are tuned towards the not swapped variant and thus
> > it happens doing more swaps ends up in slower code (I didn't closely
> > investigate).
> > 
> > So I tested the following patch which simply makes sure that
> > ira_setup_alts does not alter recog_data.
> > 
> > On a Intel Haswell machine I get (base is with the patch, peak is with
> > the above change reverted):
> > 
> >                                    Estimated
> > Estimated
> >                  Base     Base       Base        Peak     Peak       Peak
> > Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
> > -------------- ------  ---------  ---------    ------  ---------
> > ---------
> > 435.gromacs      7140        264       27.1 S    7140        270
> > 26.5 S
> > 435.gromacs      7140        264       27.1 *    7140        269
> > 26.6 S
> > 435.gromacs      7140        263       27.1 S    7140        269
> > 26.5 *
> > ==============================================================================
> > 435.gromacs      7140        264       27.1 *    7140        269
> > 26.5 *
> > 
> > which means the patched result is even better than before Andreas
> > change.  Current trunk homes in at a Run Time of 321s (which is
> > the regression to fix).
>   Thanks for working on this, Richard.  It is not easy to find reasons for
> worse code on modern processors after such small change.  As RA is based on
> heuristics it hard to predict the change for a specific benchmark.  I remember
> I checked  Andreas patch on SPEC2000 in a hope that it also improves x86-64
> code but I did not see a difference.
> 
> It is even hard to say sometimes how a specific (non-heuristic) optimization
> will affect a specific benchmark performance when a lot of unknown (from
> alignments to CPU internals are involved).  An year ago I tried to use ML to
> choose best options.  I used a set of about 100 C benchmarks (and even more
> functions).  For practically every benchmark, I had an option modification to
> -Ofast resulting in faster code but ML prediction did not work at all.
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
> > 
> OK.  Thanks again.

Thanks.  Over the weekend I did a full 3-run SPEC 2k6 with the following
result.  Base is r231814 while peak is r231814 patched, flags are
-Ofast -march=native (on a Haswell machine).

                                  Estimated                       
Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
400.perlbench    9770        255       38.4 *    9770        251       
39.0 S
400.perlbench    9770        258       37.8 S    9770        250       
39.0 *
400.perlbench    9770        253       38.6 S    9770        250       
39.0 S
401.bzip2        9650        407       23.7 *    9650        400       
24.1 *
401.bzip2        9650        412       23.4 S    9650        417       
23.1 S
401.bzip2        9650        396       24.4 S    9650        398       
24.3 S
403.gcc          8050        252       31.9 S    8050        245       
32.9 S
403.gcc          8050        240       33.6 S    8050        244       
32.9 *
403.gcc          8050        242       33.2 *    8050        241       
33.4 S
429.mcf          9120        243       37.6 S    9120        245       
37.3 S
429.mcf          9120        224       40.7 S    9120        241       
37.8 *
429.mcf          9120        225       40.5 *    9120        229       
39.9 S
445.gobmk       10490        394       26.6 S   10490        393       
26.7 S
445.gobmk       10490        394       26.6 *   10490        392       
26.8 *
445.gobmk       10490        393       26.7 S   10490        392       
26.8 S
456.hmmer        9330        340       27.4 S    9330        340       
27.5 *
456.hmmer        9330        340       27.5 S    9330        340       
27.5 S
456.hmmer        9330        340       27.5 *    9330        340       
27.5 S
458.sjeng       12100        407       29.7 *   12100        407       
29.8 *
458.sjeng       12100        406       29.8 S   12100        406       
29.8 S
458.sjeng       12100        408       29.6 S   12100        407       
29.8 S
462.libquantum  20720        300       69.0 S   20720        300       
69.0 *
462.libquantum  20720        300       69.1 *   20720        300       
69.2 S
462.libquantum  20720        300       69.1 S   20720        301       
68.9 S
464.h264ref     22130        444       49.8 S   22130        444       
49.9 S
464.h264ref     22130        443       50.0 S   22130        442       
50.1 S
464.h264ref     22130        444       49.9 *   22130        443       
50.0 *
471.omnetpp      6250        326       19.1 S    6250        328       
19.1 S
471.omnetpp      6250        305       20.5 *    6250        324       
19.3 *
471.omnetpp      6250        296       21.1 S    6250        305       
20.5 S
473.astar        7020        313       22.4 *    7020        316       
22.2 S
473.astar        7020        314       22.3 S    7020        309       
22.7 S
473.astar        7020        308       22.8 S    7020        310       
22.7 *
483.xalancbmk    6900        193       35.7 S    6900        193       
35.7 S
483.xalancbmk    6900        189       36.5 *    6900        189       
36.5 *
483.xalancbmk    6900        185       37.4 S    6900        189       
36.6 S
==============================================================================
400.perlbench    9770        255       38.4 *    9770        250       
39.0 *
401.bzip2        9650        407       23.7 *    9650        400       
24.1 *
403.gcc          8050        242       33.2 *    8050        244       
32.9 *
429.mcf          9120        225       40.5 *    9120        241       
37.8 *
445.gobmk       10490        394       26.6 *   10490        392       
26.8 *
456.hmmer        9330        340       27.5 *    9330        340       
27.5 *
458.sjeng       12100        407       29.7 *   12100        407       
29.8 *
462.libquantum  20720        300       69.1 *   20720        300       
69.0 *
464.h264ref     22130        444       49.9 *   22130        443       
50.0 *
471.omnetpp      6250        305       20.5 *    6250        324       
19.3 *
473.astar        7020        313       22.4 *    7020        310       
22.7 *
483.xalancbmk    6900        189       36.5 *    6900        189       
36.5 *
 Est. SPECint(R)_base2006              32.8
 Est. SPECint2006                                                      
32.5

                                  Estimated                       
Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
410.bwaves      13590        196       69.4 S   13590        202       
67.1 S
410.bwaves      13590        197       69.0 *   13590        197       
69.1 S
410.bwaves      13590        199       68.3 S   13590        197       
69.0 *
416.gamess      19580        592       33.1 *   19580        589       
33.2 S
416.gamess      19580        592       33.1 S   19580        588       
33.3 S
416.gamess      19580        593       33.0 S   19580        588       
33.3 *
433.milc         9180        350       26.2 S    9180        347       
26.5 S
433.milc         9180        335       27.4 *    9180        334       
27.5 S
433.milc         9180        334       27.5 S    9180        337       
27.2 *
434.zeusmp       9100        231       39.3 S    9100        232       
39.2 S
434.zeusmp       9100        233       39.1 S    9100        233       
39.1 *
434.zeusmp       9100        232       39.2 *    9100        235       
38.8 S
435.gromacs      7140        316       22.6 S    7140        264       
27.1 S
435.gromacs      7140        314       22.7 S    7140        263       
27.2 S
435.gromacs      7140        316       22.6 *    7140        263       
27.1 *
436.cactusADM   11950        201       59.3 *   11950        211       
56.6 S
436.cactusADM   11950        214       55.7 S   11950        205       
58.3 *
436.cactusADM   11950        198       60.4 S   11950        201       
59.5 S
437.leslie3d     9400        218       43.1 S    9400        219       
42.9 S
437.leslie3d     9400        219       42.9 *    9400        219       
42.9 *
437.leslie3d     9400        220       42.8 S    9400        220       
42.7 S
444.namd         8020        301       26.7 S    8020        302       
26.5 S
444.namd         8020        300       26.7 *    8020        302       
26.5 *
444.namd         8020        300       26.7 S    8020        302       
26.6 S
447.dealII      11440        248       46.2 S   11440        245       
46.7 S
447.dealII      11440        248       46.0 S   11440        246       
46.4 S
447.dealII      11440        248       46.2 *   11440        246       
46.5 *
450.soplex       8340        216       38.6 S    8340        223       
37.5 S
450.soplex       8340        215       38.8 *    8340        215       
38.7 S
450.soplex       8340        214       38.9 S    8340        216       
38.6 *
453.povray       5320        121       44.1 S    5320        119       
44.8 *
453.povray       5320        120       44.2 *    5320        117       
45.3 S
453.povray       5320        120       44.3 S    5320        121       
44.0 S
454.calculix     8250        277       29.8 S    8250        277       
29.8 S
454.calculix     8250        277       29.8 *    8250        276       
29.9 *
454.calculix     8250        277       29.7 S    8250        276       
29.9 S
459.GemsFDTD    10610        326       32.5 S   10610        333       
31.8 S
459.GemsFDTD    10610        316       33.6 S   10610        318       
33.4 S
459.GemsFDTD    10610        317       33.5 *   10610        320       
33.2 *
465.tonto        9840        421       23.4 S    9840        419       
23.5 S
465.tonto        9840        419       23.5 S    9840        419       
23.5 *
465.tonto        9840        420       23.5 *    9840        418       
23.5 S
470.lbm         13740        253       54.2 *   13740        254       
54.1 S
470.lbm         13740        251       54.6 S   13740        251       
54.8 S
470.lbm         13740        254       54.0 S   13740        251       
54.7 *
481.wrf         11170        291       38.4 S   11170        293       
38.2 S
481.wrf         11170        288       38.8 S   11170        288       
38.8 S
481.wrf         11170        289       38.6 *   11170        289       
38.6 *
482.sphinx3     19490        398       49.0 S   19490        406       
48.0 S
482.sphinx3     19490        406       48.1 S   19490        401       
48.6 S
482.sphinx3     19490        399       48.9 *   19490        401       
48.6 *
==============================================================================
410.bwaves      13590        197       69.0 *   13590        197       
69.0 *
416.gamess      19580        592       33.1 *   19580        588       
33.3 *
433.milc         9180        335       27.4 *    9180        337       
27.2 *
434.zeusmp       9100        232       39.2 *    9100        233       
39.1 *
435.gromacs      7140        316       22.6 *    7140        263       
27.1 *
436.cactusADM   11950        201       59.3 *   11950        205       
58.3 *
437.leslie3d     9400        219       42.9 *    9400        219       
42.9 *
444.namd         8020        300       26.7 *    8020        302       
26.5 *
447.dealII      11440        248       46.2 *   11440        246       
46.5 *
450.soplex       8340        215       38.8 *    8340        216       
38.6 *
453.povray       5320        120       44.2 *    5320        119       
44.8 *
454.calculix     8250        277       29.8 *    8250        276       
29.9 *
459.GemsFDTD    10610        317       33.5 *   10610        320       
33.2 *
465.tonto        9840        420       23.5 *    9840        419       
23.5 *
470.lbm         13740        253       54.2 *   13740        251       
54.7 *
481.wrf         11170        289       38.6 *   11170        289       
38.6 *
482.sphinx3     19490        399       48.9 *   19490        401       
48.6 *
 Est. SPECfp(R)_base2006               38.0
 Est. SPECfp2006                                                       
38.4

So overall the patch is a loss for SPEC CPU 2006 INT due to
the 429.mcf and 471.omnetpp regressions and a win on SPEC FP.
(I didn't test SPEC INT previously, only FP)

But as you noted the patch only changes allocation preference in case
of equal costs.

I've also looked at stage2 binary sizes and patched we see a growth
of cc1 from 35450264 bytes to 35459552 (executable size).  There are
both ups and downs for individual .o files though.

The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
working at some point past in time and thus it was added and the
bug closed.  You could say RA does a better job after the patch
as it uses 1 less register but that restricts the followup
postreload combine attempts.  Though I wonder about what's "better"
RA here - isn't the best allocation one that avoids spills but
uses as many registers as possible (at least when targeting a CPU
that cannot to register renaming)?  regrename doesn't help this
testcase either (it runs too late and does a renaming that doesn't help).

With all of the above I'm not sure what to do for GCC 6 (even though
you just approved the patch).  Going with the patch alternative
(just revert swapping parts of the commutative operands) looks
like completely bogus though it works for fixing the regression as well.

So I have applied the patch now, giving us a few days to get other
peoples (and my own) auto-testers the chance to pick up results with
it and after that we can consider reverting or going with the (IMHO
bogus) half-way variant.

I've XFAILed half of gcc.target/i386/addr-sel-1.c.

Thanks,
Richard.


> > 2016-02-05   Richard Biener  <rguenther@suse.de>
> > 
> > 	PR rtl-optimization/69274
> > 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> > 	order.
> > 
> > Index: gcc/ira.c
> > ===================================================================
> > --- gcc/ira.c	(revision 231814)
> > +++ gcc/ira.c	(working copy)
> > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
> >   	}
> >         if (commutative < 0)
> >   	break;
> > -      if (curr_swapped)
> > -	break;
> > +      /* Swap forth and back to avoid changing recog_data.  */
> >         std::swap (recog_data.operand[commutative],
> >   		 recog_data.operand[commutative + 1]);
> > +      if (curr_swapped)
> > +	break;
> >       }
> >   }
> >   
> 
>

Bernd Schmidt Feb. 8, 2016, 12:06 p.m. UTC | #9

On 02/08/2016 10:09 AM, Richard Biener wrote:
> The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
> working at some point past in time and thus it was added and the
> bug closed.  You could say RA does a better job after the patch
> as it uses 1 less register but that restricts the followup
> postreload combine attempts.  Though I wonder about what's "better"
> RA here - isn't the best allocation one that avoids spills but
> uses as many registers as possible (at least when targeting a CPU
> that cannot to register renaming)?  regrename doesn't help this
> testcase either (it runs too late and does a renaming that doesn't help).

I don't think regrename's place in the pass pipeline is set in stone. If 
we enable it for gcc-7. we could also experiment with putting it just 
before postreload, or schedule another reload-combine afterwards.


Bernd

Vladimir Makarov Feb. 8, 2016, 3:47 p.m. UTC | #10

On 02/08/2016 04:09 AM, Richard Biener wrote:
> With all of the above I'm not sure what to do for GCC 6 (even though 
> you just approved the patch). Going with the patch alternative (just 
> revert swapping parts of the commutative operands) looks like 
> completely bogus though it works for fixing the regression as well.
Yes, it is bogus.  But I don't see other easy way to close P1 PR except 
making it P2.

I still have plans to rewrite ira-costs.c (may be for GCC7).  The 
algorithm is actually adaptation of old regclass.c one.  There are few 
complaints about wrong costs calculations because of this (the most 
recent complaint was Peter Bergner's one about a power8 test). I don't 
like the algorithm, it ignores the fact that insn operands should be 
from the same alternative during pseudo (allocno) cost calculations.

So this change probably will not survive after rewriting ira-costs.c.

Michael Matz Feb. 8, 2016, 5:38 p.m. UTC | #11

Hi,

On Mon, 8 Feb 2016, Richard Biener wrote:

> 429.mcf          9120        243       37.6 S    9120        245       37.3 S
> 429.mcf          9120        224       40.7 S    9120        241       37.8 *
> 429.mcf          9120        225       40.5 *    9120        229       39.9 S

> 471.omnetpp      6250        326       19.1 S    6250        328       19.1 S
> 471.omnetpp      6250        305       20.5 *    6250        324       19.3 *
> 471.omnetpp      6250        296       21.1 S    6250        305       20.5 S

> 435.gromacs      7140        316       22.6 S    7140        264       27.1 S
> 435.gromacs      7140        314       22.7 S    7140        263       27.2 S
> 435.gromacs      7140        316       22.6 *    7140        263       27.1 *
> 
> So overall the patch is a loss for SPEC CPU 2006 INT due to the 429.mcf 
> and 471.omnetpp regressions and a win on SPEC FP. (I didn't test SPEC 
> INT previously, only FP)

That's no regression if you look at the detailed results (like above).  
mcf and omnetpp were noisy on your machine, if you took the minimum of 
each run only the large progression of gromacs would remain.

> The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
> working at some point past in time and thus it was added and the
> bug closed.  You could say RA does a better job after the patch
> as it uses 1 less register but that restricts the followup
> postreload combine attempts.  Though I wonder about what's "better"
> RA here - isn't the best allocation one that avoids spills but
> uses as many registers as possible (at least when targeting a CPU
> that cannot to register renaming)?

Yeah, I think the testcase was added for the wrong reasons.  It does 
exhibit a desirable outcome for sure, but the output became such by random 
changes, and hence also can go back by other changes.

> So I have applied the patch now,

I think the patch makes perfect sense.  ira_setup_alts should have no 
observable behaviour from the outside, except the returned value of merged 
acceptable alternatives.  Certainly it has no business to fiddle with 
recog_data.  It only does the swapping to merge alternatives, and 
accidentaly left them swapped; the merging could have been implemented 
without swapping recog_data.operands, and then the whole issue wouldn't 
have occurred (and addr-sel-1.c wouldn't have been added because it still 
would be "broken").

Ciao,
Michael.

Vladimir Makarov Feb. 8, 2016, 6:07 p.m. UTC | #12

On 02/08/2016 12:38 PM, Michael Matz wrote:
>
> I think the patch makes perfect sense.  ira_setup_alts should have no
> observable behaviour from the outside, except the returned value of merged
> acceptable alternatives.  Certainly it has no business to fiddle with
> recog_data.  It only does the swapping to merge alternatives, and
> accidentaly left them swapped; the merging could have been implemented
> without swapping recog_data.operands, and then the whole issue wouldn't
> have occurred (and addr-sel-1.c wouldn't have been added because it still
> would be "broken").
>
Sorry, I was confused by Richard's message thinking that his patch 
actually exchanges the operands.  I think we have some expression 
shaping optimizations and exchanging operands probably rejects the 
optimization effect.  With this point of view ira-costs.c should not 
exchange operands.   So the patch is not bogus as Richard wrote but 
perfectly legitimate.

Richard Biener Feb. 8, 2016, 6:23 p.m. UTC | #13

On February 8, 2016 7:07:46 PM GMT+01:00, Vladimir Makarov <vmakarov@redhat.com> wrote:
>On 02/08/2016 12:38 PM, Michael Matz wrote:
>>
>> I think the patch makes perfect sense.  ira_setup_alts should have no
>> observable behaviour from the outside, except the returned value of
>merged
>> acceptable alternatives.  Certainly it has no business to fiddle with
>> recog_data.  It only does the swapping to merge alternatives, and
>> accidentaly left them swapped; the merging could have been
>implemented
>> without swapping recog_data.operands, and then the whole issue
>wouldn't
>> have occurred (and addr-sel-1.c wouldn't have been added because it
>still
>> would be "broken").
>>
>Sorry, I was confused by Richard's message thinking that his patch 
>actually exchanges the operands.  I think we have some expression 
>shaping optimizations and exchanging operands probably rejects the 
>optimization effect.  With this point of view ira-costs.c should not 
>exchange operands.   So the patch is not bogus as Richard wrote but 
>perfectly legitimate.

Yes, I meant the posted alternative patch that just unswapped for the cases that Andreas patch changed whether we consider swapping or not.

The applied patch indeed makes perfect sense.

Richard.

Fix PR69274, 435.gromacs performance regression due to RA

Commit Message

Comments

Patch