Message ID | alpine.LSU.2.11.1602051016440.31122@t29.fhfr.qr
---|---
State | New
On Fri, 5 Feb 2016, Richard Biener wrote:

> The following patch fixes the performance regression for 435.gromacs
> on x86_64 with AVX2 (Haswell or bdver2) caused by
>
> 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
>
> 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> 	to the first loop to make it work even with disabled alternatives.
>
> which in itself is a desirable change giving the RA more freedom.
>
> It turns out the fix makes an existing issue more severe by detecting
> more swappable alternatives and thus exiting ira_setup_alts with
> operands swapped in recog_data.  This seems to give a slight preference
> to choosing alternatives with the operands swapped (I didn't try to
> investigate how IRA handles the "merged" alternative mask and
> operand swapping in its further processing).
>
> Of course previous RTL optimizers and canonicalization rules as well
> as backend patterns are tuned towards the non-swapped variant, and thus
> it happens that doing more swaps ends up in slower code (I didn't
> closely investigate).
>
> So I tested the following patch which simply makes sure that
> ira_setup_alts does not alter recog_data.
>
> On an Intel Haswell machine I get (base is with the patch, peak is with
> the above change reverted):
>
>                                   Estimated                     Estimated
>                 Base     Base      Base      Peak     Peak       Peak
> Benchmarks      Ref.   Run Time   Ratio      Ref.   Run Time     Ratio
> -------------- ------  ---------  -------   ------  ---------  ---------
> 435.gromacs      7140       264     27.1 S    7140       270      26.5 S
> 435.gromacs      7140       264     27.1 *    7140       269      26.6 S
> 435.gromacs      7140       263     27.1 S    7140       269      26.5 *
> ==============================================================================
> 435.gromacs      7140       264     27.1 *    7140       269      26.5 *
>
> which means the patched result is even better than before Andreas'
> change.  Current trunk homes in at a Run Time of 321s (which is
> the regression to fix).
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
It fails

  FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1

on i?86 (or x86_64 -m32) though, generating

f:
.LFB0:
	.cfi_startproc
	movl	4(%esp), %eax
	leal	1(%eax), %edx
	movsbl	a+1(%eax), %eax
	movsbl	b(%edx), %edx
	addl	%edx, %eax
	ret

Before IRA we have

(insn 6 3 8 2 (parallel [
            (set (reg:SI 87 [ _2 ])
                (plus:SI (reg/v:SI 93 [ i ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 218 {*addsi_1}
     (expr_list:REG_DEAD (reg/v:SI 93 [ i ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 8 6 10 2 (set (reg:SI 96)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("a") [flags 0x2] <var_decl 0x7fe25bdf0cf0 a>)) [0 a S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2}
     (nil))
(insn 10 8 11 2 (set (reg:SI 98)
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 87 [ _2 ])
                    (symbol_ref:SI ("b") [flags 0x2] <var_decl 0x7fe25bdf0d80 b>)) [0 b S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2}
     (expr_list:REG_DEAD (reg:SI 87 [ _2 ])

where I wonder, if we can use a single insn, why we didn't
combine/forwprop to this before RA.  Probably a non-single-use issue
(though I thought forwprop doesn't have this limitation).  Without the
patch it is postreload that manages to combine them.  After the patch we
allocate dx, so that

(insn 10 8 11 2 (set (reg:SI 1 dx [98])
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] [87])
                    (symbol_ref:SI ("b") [flags 0x2] <var_decl 0x7f02a1bd9d80 b>)) [0 b S1 A8]))) /space/rguenther/src/svn/trunk3/gcc/testsuite/gcc.target/i386/addr-sel-1.c:13 151 {extendqisi2}
     (nil))

which seems to confuse that transform (reload_combine) enough to not
carry it out.  Hopefully somebody else can help here in a more
efficient way than I could.  IMHO it shouldn't block the patch if that
looks good itself (I think it's sound).
I'll have a look at the reload_combine issue early next week if nobody
beats me to it.  PR69689.

Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
show any regressions.

Richard.

> Thanks,
> Richard.
>
> 2016-02-05  Richard Biener  <rguenther@suse.de>
>
> 	PR rtl-optimization/69274
> 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> 	order.
>
> Index: gcc/ira.c
> ===================================================================
> --- gcc/ira.c	(revision 231814)
> +++ gcc/ira.c	(working copy)
> @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
>  	    }
>  	  if (commutative < 0)
>  	    break;
> -	  if (curr_swapped)
> -	    break;
> +	  /* Swap forth and back to avoid changing recog_data.  */
>  	  std::swap (recog_data.operand[commutative],
>  		     recog_data.operand[commutative + 1]);
> +	  if (curr_swapped)
> +	    break;
>  	}
>      }
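[Editor's note: the swap-forth-and-back idiom in the patch can be shown in isolation.  A minimal sketch, assuming a two-element stand-in for recog_data.operand and an arbitrary "this order is acceptable" predicate; none of the names below are GCC's actual code:]

```cpp
#include <array>
#include <utility>

// Hypothetical stand-in for recog_data.operand: shared state that the
// scan below inspects under both operand orders but must leave
// unmodified on every exit path.
static std::array<int, 2> operand = {10, 21};

// Collect a mask of which operand orders are "acceptable" (here: an
// order is acceptable when its first operand is even).  The swap is
// executed before the curr_swapped check, as in the patched
// ira_setup_alts, so the loop always performs an even number of swaps
// and `operand` is unchanged when it returns.
static unsigned
collect_orders ()
{
  unsigned mask = 0;
  for (bool curr_swapped = false;; curr_swapped = true)
    {
      if (operand[0] % 2 == 0)
	mask |= 1u << (curr_swapped ? 1 : 0);
      /* Swap forth (first pass) and back (second pass).  */
      std::swap (operand[0], operand[1]);
      if (curr_swapped)
	break;
    }
  return mask;
}
```

With `operand = {10, 21}` only the unswapped order qualifies (mask 1) and the array reads `{10, 21}` again afterwards; with the pre-patch ordering, where the `curr_swapped` break preceded the swap, an exit on the second pass left the operands swapped.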
On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> show any regressions.

Any improvements there?

	Jakub
On 02/05/2016 01:10 PM, Richard Biener wrote:
> It fails
>
>   FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1
>
> on i?86 (or x86_64 -m32) though, generating
>
> f:
> .LFB0:
> 	.cfi_startproc
> 	movl	4(%esp), %eax
> 	leal	1(%eax), %edx
> 	movsbl	a+1(%eax), %eax
> 	movsbl	b(%edx), %edx
> 	addl	%edx, %eax
> 	ret

Well, it looks like the first movsbl load clobbers the potentially
better base register, so trivial propagation doesn't work.  It might be
another case where allowing 2->2 in combine would help.  Or enabling
-frename-registers and rerunning reload_combine afterwards.

Bernd
On Fri, 5 Feb 2016, Jakub Jelinek wrote:

> On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > show any regressions.
>
> Any improvements there?

Noise, but I only did 1 run (I did 3 only for 435.gromacs, to confirm
the progress over the pre-Andreas-patch state, which I would otherwise
have classified as noise as well).

I can do a 3-run over the weekend if you think that's useful.  OTOH
checking the patch in will get more benchmark/flag/HW coverage
from our auto-testers (which also only do 1-runs of course and will
see unrelated changes as well).

Richard.
On Fri, Feb 05, 2016 at 01:35:03PM +0100, Richard Biener wrote:
> On Fri, 5 Feb 2016, Jakub Jelinek wrote:
> > On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > > show any regressions.
> >
> > Any improvements there?
>
> Noise, but I only did 1 run (I did 3 only for 435.gromacs to confirm
> the progress over pre-Andreas-patch state which I would otherwise
> classified as noise as well).

And on that single run gromacs was non-noise?

> I can do a 3-run over the weekend if you think that's useful.  OTOH
> checking the patch in will get more benchmark/flag/HW covering
> from our auto-testers (which also only do 1-runs of course and will
> see unrelated changes as well).

Maybe that is good enough.  Anyway, the patch is Vlad's area of
expertise...

	Jakub
On Fri, 5 Feb 2016, Jakub Jelinek wrote:

> On Fri, Feb 05, 2016 at 01:35:03PM +0100, Richard Biener wrote:
> > On Fri, 5 Feb 2016, Jakub Jelinek wrote:
> > > On Fri, Feb 05, 2016 at 01:10:26PM +0100, Richard Biener wrote:
> > > > Otherwise bootstrap / testing went ok and a full SPEC 2k6 run doesn't
> > > > show any regressions.
> > >
> > > Any improvements there?
> >
> > Noise, but I only did 1 run (I did 3 only for 435.gromacs to confirm
> > the progress over pre-Andreas-patch state which I would otherwise
> > classified as noise as well).
>
> And on that single run gromacs was non-noise?

Well, it was like 264s vs. 269s, but it looked good, so I tried to
confirm the progress independent of Andreas' change ;)  You can look at
the usual variance of the machine here:

  http://gcc.opensuse.org/SPEC/CFP/sb-czerny-head-64-2006/recent.html

(that's of course including source changes for each dot).  So yes, in
the full 1-run I account +-5s (out of ~260s) as noise.

Note that all SPEC tests ran on top of r23181[34].

> > I can do a 3-run over the weekend if you think that's useful.  OTOH
> > checking the patch in will get more benchmark/flag/HW covering
> > from our auto-testers (which also only do 1-runs of course and will
> > see unrelated changes as well).
>
> Maybe that is good enough.  Anyway, the patch is Vlad's area of expertise...

Yes, and we need to decide on that fallout and whether we should go for
the more ugly variant (it also fixes the gromacs regression for me).

Richard.
On 02/05/2016 04:25 AM, Richard Biener wrote:
> The following patch fixes the performance regression for 435.gromacs
> on x86_64 with AVX2 (Haswell or bdver2) caused by
>
> 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
>
> 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> 	to the first loop to make it work even with disabled alternatives.
>
> which in itself is a desirable change giving the RA more freedom.
>
> It turns out the fix makes an existing issue more severe in detecting
> more swappable alternatives and thus exiting ira_setup_alts with
> operands swapped in recog_data.  This seems to give a slight preference
> to choose alternatives with the operands swapped (I didn't try to
> investigate how IRA handles the "merged" alternative mask and
> operand swapping in its further processing).

The alternative mask excludes alternatives which will definitely be
rejected in LRA.  This approach is to speed up LRA (a lot was done to
speed up RA, but it still consumes a big chunk of compiler time, which
is unusual for all compilers).

LRA and reload prefer insns without a commutative operand swap when all
other costs are the same.

> Of course previous RTL optimizers and canonicalization rules as well
> as backend patterns are tuned towards the not swapped variant and thus
> it happens doing more swaps ends up in slower code (I didn't closely
> investigate).
>
> So I tested the following patch which simply makes sure that
> ira_setup_alts does not alter recog_data.
>
> On a Intel Haswell machine I get (base is with the patch, peak is with
> the above change reverted):
>
>                                   Estimated                     Estimated
>                 Base     Base      Base      Peak     Peak       Peak
> Benchmarks      Ref.   Run Time   Ratio      Ref.   Run Time     Ratio
> -------------- ------  ---------  -------   ------  ---------  ---------
> 435.gromacs      7140       264     27.1 S    7140       270      26.5 S
> 435.gromacs      7140       264     27.1 *    7140       269      26.6 S
> 435.gromacs      7140       263     27.1 S    7140       269      26.5 *
> ==============================================================================
> 435.gromacs      7140       264     27.1 *    7140       269      26.5 *
>
> which means the patched result is even better than before Andreas
> change.  Current trunk homes in at a Run Time of 321s (which is
> the regression to fix).

Thanks for working on this, Richard.  It is not easy to find reasons for
worse code on modern processors after such a small change.  As RA is
based on heuristics, it is hard to predict the change for a specific
benchmark.  I remember I checked Andreas' patch on SPEC2000 in a hope
that it also improves x86-64 code, but I did not see a difference.

It is sometimes even hard to say how a specific (non-heuristic)
optimization will affect a specific benchmark's performance when a lot
of unknowns (from alignments to CPU internals) are involved.  A year ago
I tried to use ML to choose the best options.  I used a set of about 100
C benchmarks (and even more functions).  For practically every
benchmark I had an option modification to -Ofast resulting in faster
code, but ML prediction did not work at all.

> Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?

OK.  Thanks again.

> 2016-02-05  Richard Biener  <rguenther@suse.de>
>
> 	PR rtl-optimization/69274
> 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> 	order.
>
> Index: gcc/ira.c
> ===================================================================
> --- gcc/ira.c	(revision 231814)
> +++ gcc/ira.c	(working copy)
> @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
>  	    }
>  	  if (commutative < 0)
>  	    break;
> -	  if (curr_swapped)
> -	    break;
> +	  /* Swap forth and back to avoid changing recog_data.  */
>  	  std::swap (recog_data.operand[commutative],
>  		     recog_data.operand[commutative + 1]);
> +	  if (curr_swapped)
> +	    break;
>  	}
>      }
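[Editor's note: the tie-break Vladimir describes — same cost, prefer the non-swapped insn — can be sketched outside of LRA.  The record type and cost values below are hypothetical, not LRA's actual data structures:]

```cpp
#include <algorithm>
#include <vector>

// Hypothetical candidate record: cost of an insn alternative plus a
// flag saying whether it needs the commutative operands swapped.
struct alt_candidate { int cost; bool swapped; };

// Pick the cheapest candidate; on equal cost prefer the non-swapped
// variant, so the canonical (unswapped) operand order wins ties.
static alt_candidate
pick_best (const std::vector<alt_candidate> &alts)
{
  return *std::min_element (alts.begin (), alts.end (),
			    [] (const alt_candidate &a,
				const alt_candidate &b)
			    {
			      if (a.cost != b.cost)
				return a.cost < b.cost;
			      // Tie: non-swapped is "less" (preferred).
			      return !a.swapped && b.swapped;
			    });
}
```

Given candidates with costs {4 swapped, 4 unswapped, 5 unswapped}, the unswapped cost-4 variant is chosen; this is the behavior that made leaving recog_data swapped in ira_setup_alts tilt the later choice.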
On Fri, 5 Feb 2016, Vladimir Makarov wrote:

> On 02/05/2016 04:25 AM, Richard Biener wrote:
> > The following patch fixes the performance regression for 435.gromacs
> > on x86_64 with AVX2 (Haswell or bdver2) caused by
> >
> > 2015-12-18  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>
> >
> > 	* ira.c (ira_setup_alts): Move the scan for commutative modifier
> > 	to the first loop to make it work even with disabled alternatives.
> >
> > which in itself is a desirable change giving the RA more freedom.
> >
> > It turns out the fix makes an existing issue more severe in detecting
> > more swappable alternatives and thus exiting ira_setup_alts with
> > operands swapped in recog_data.  This seems to give a slight preference
> > to choose alternatives with the operands swapped (I didn't try to
> > investigate how IRA handles the "merged" alternative mask and
> > operand swapping in its further processing).
>
> Alternative mask excludes alternatives which will be definitely rejected
> in LRA.  This approach is to speed up LRA (a lot was done to speed up RA
> but still it consumes a big chunk of compiler time which is unusual for
> all compilers).
>
> LRA and reload prefer insns without commutative operands swap when all
> other costs are the same.

Ok, so leaving operands swapped in ira_setup_alts will prefer the
swapped variants in case other costs are the same.

> > Of course previous RTL optimizers and canonicalization rules as well
> > as backend patterns are tuned towards the not swapped variant and thus
> > it happens doing more swaps ends up in slower code (I didn't closely
> > investigate).
> >
> > So I tested the following patch which simply makes sure that
> > ira_setup_alts does not alter recog_data.
> >
> > On a Intel Haswell machine I get (base is with the patch, peak is with
> > the above change reverted):
> >
> >                                   Estimated                     Estimated
> >                 Base     Base      Base      Peak     Peak       Peak
> > Benchmarks      Ref.   Run Time   Ratio      Ref.   Run Time     Ratio
> > -------------- ------  ---------  -------   ------  ---------  ---------
> > 435.gromacs      7140       264     27.1 S    7140       270      26.5 S
> > 435.gromacs      7140       264     27.1 *    7140       269      26.6 S
> > 435.gromacs      7140       263     27.1 S    7140       269      26.5 *
> > ==============================================================================
> > 435.gromacs      7140       264     27.1 *    7140       269      26.5 *
> >
> > which means the patched result is even better than before Andreas
> > change.  Current trunk homes in at a Run Time of 321s (which is
> > the regression to fix).
>
> Thanks for working on this, Richard.  It is not easy to find reasons for
> worse code on modern processors after such small change.  As RA is based
> on heuristics it hard to predict the change for a specific benchmark.
> I remember I checked Andreas patch on SPEC2000 in a hope that it also
> improves x86-64 code but I did not see a difference.
>
> It is even hard to say sometimes how a specific (non-heuristic)
> optimization will affect a specific benchmark performance when a lot of
> unknown (from alignments to CPU internals are involved).  An year ago I
> tried to use ML to choose best options.  I used a set of about 100 C
> benchmarks (and even more functions).  For practically every benchmark,
> I had an option modification to -Ofast resulting in faster code but ML
> prediction did not work at all.
>
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk?
>
> OK.  Thanks again.

Thanks.

Over the weekend I did a full 3-run SPEC 2k6 with the following result.
Base is r231814 while peak is r231814 patched, flags are
-Ofast -march=native (on a Haswell machine).

                                  Estimated                     Estimated
                Base     Base      Base      Peak     Peak       Peak
Benchmarks      Ref.   Run Time   Ratio      Ref.   Run Time     Ratio
-------------- ------  ---------  -------   ------  ---------  ---------
400.perlbench    9770       255     38.4 *    9770       251      39.0 S
400.perlbench    9770       258     37.8 S    9770       250      39.0 *
400.perlbench    9770       253     38.6 S    9770       250      39.0 S
401.bzip2        9650       407     23.7 *    9650       400      24.1 *
401.bzip2        9650       412     23.4 S    9650       417      23.1 S
401.bzip2        9650       396     24.4 S    9650       398      24.3 S
403.gcc          8050       252     31.9 S    8050       245      32.9 S
403.gcc          8050       240     33.6 S    8050       244      32.9 *
403.gcc          8050       242     33.2 *    8050       241      33.4 S
429.mcf          9120       243     37.6 S    9120       245      37.3 S
429.mcf          9120       224     40.7 S    9120       241      37.8 *
429.mcf          9120       225     40.5 *    9120       229      39.9 S
445.gobmk       10490       394     26.6 S   10490       393      26.7 S
445.gobmk       10490       394     26.6 *   10490       392      26.8 *
445.gobmk       10490       393     26.7 S   10490       392      26.8 S
456.hmmer        9330       340     27.4 S    9330       340      27.5 *
456.hmmer        9330       340     27.5 S    9330       340      27.5 S
456.hmmer        9330       340     27.5 *    9330       340      27.5 S
458.sjeng       12100       407     29.7 *   12100       407      29.8 *
458.sjeng       12100       406     29.8 S   12100       406      29.8 S
458.sjeng       12100       408     29.6 S   12100       407      29.8 S
462.libquantum  20720       300     69.0 S   20720       300      69.0 *
462.libquantum  20720       300     69.1 *   20720       300      69.2 S
462.libquantum  20720       300     69.1 S   20720       301      68.9 S
464.h264ref     22130       444     49.8 S   22130       444      49.9 S
464.h264ref     22130       443     50.0 S   22130       442      50.1 S
464.h264ref     22130       444     49.9 *   22130       443      50.0 *
471.omnetpp      6250       326     19.1 S    6250       328      19.1 S
471.omnetpp      6250       305     20.5 *    6250       324      19.3 *
471.omnetpp      6250       296     21.1 S    6250       305      20.5 S
473.astar        7020       313     22.4 *    7020       316      22.2 S
473.astar        7020       314     22.3 S    7020       309      22.7 S
473.astar        7020       308     22.8 S    7020       310      22.7 *
483.xalancbmk    6900       193     35.7 S    6900       193      35.7 S
483.xalancbmk    6900       189     36.5 *    6900       189      36.5 *
483.xalancbmk    6900       185     37.4 S    6900       189      36.6 S
==============================================================================
400.perlbench    9770       255     38.4 *    9770       250      39.0 *
401.bzip2        9650       407     23.7 *    9650       400      24.1 *
403.gcc          8050       242     33.2 *    8050       244      32.9 *
429.mcf          9120       225     40.5 *    9120       241      37.8 *
445.gobmk       10490       394     26.6 *   10490       392      26.8 *
456.hmmer        9330       340     27.5 *    9330       340      27.5 *
458.sjeng       12100       407     29.7 *   12100       407      29.8 *
462.libquantum  20720       300     69.1 *   20720       300      69.0 *
464.h264ref     22130       444     49.9 *   22130       443      50.0 *
471.omnetpp      6250       305     20.5 *    6250       324      19.3 *
473.astar        7020       313     22.4 *    7020       310      22.7 *
483.xalancbmk    6900       189     36.5 *    6900       189      36.5 *
 Est. SPECint(R)_base2006            32.8
 Est. SPECint2006                                                 32.5

                                  Estimated                     Estimated
                Base     Base      Base      Peak     Peak       Peak
Benchmarks      Ref.   Run Time   Ratio      Ref.   Run Time     Ratio
-------------- ------  ---------  -------   ------  ---------  ---------
410.bwaves      13590       196     69.4 S   13590       202      67.1 S
410.bwaves      13590       197     69.0 *   13590       197      69.1 S
410.bwaves      13590       199     68.3 S   13590       197      69.0 *
416.gamess      19580       592     33.1 *   19580       589      33.2 S
416.gamess      19580       592     33.1 S   19580       588      33.3 S
416.gamess      19580       593     33.0 S   19580       588      33.3 *
433.milc         9180       350     26.2 S    9180       347      26.5 S
433.milc         9180       335     27.4 *    9180       334      27.5 S
433.milc         9180       334     27.5 S    9180       337      27.2 *
434.zeusmp       9100       231     39.3 S    9100       232      39.2 S
434.zeusmp       9100       233     39.1 S    9100       233      39.1 *
434.zeusmp       9100       232     39.2 *    9100       235      38.8 S
435.gromacs      7140       316     22.6 S    7140       264      27.1 S
435.gromacs      7140       314     22.7 S    7140       263      27.2 S
435.gromacs      7140       316     22.6 *    7140       263      27.1 *
436.cactusADM   11950       201     59.3 *   11950       211      56.6 S
436.cactusADM   11950       214     55.7 S   11950       205      58.3 *
436.cactusADM   11950       198     60.4 S   11950       201      59.5 S
437.leslie3d     9400       218     43.1 S    9400       219      42.9 S
437.leslie3d     9400       219     42.9 *    9400       219      42.9 *
437.leslie3d     9400       220     42.8 S    9400       220      42.7 S
444.namd         8020       301     26.7 S    8020       302      26.5 S
444.namd         8020       300     26.7 *    8020       302      26.5 *
444.namd         8020       300     26.7 S    8020       302      26.6 S
447.dealII      11440       248     46.2 S   11440       245      46.7 S
447.dealII      11440       248     46.0 S   11440       246      46.4 S
447.dealII      11440       248     46.2 *   11440       246      46.5 *
450.soplex       8340       216     38.6 S    8340       223      37.5 S
450.soplex       8340       215     38.8 *    8340       215      38.7 S
450.soplex       8340       214     38.9 S    8340       216      38.6 *
453.povray       5320       121     44.1 S    5320       119      44.8 *
453.povray       5320       120     44.2 *    5320       117      45.3 S
453.povray       5320       120     44.3 S    5320       121      44.0 S
454.calculix     8250       277     29.8 S    8250       277      29.8 S
454.calculix     8250       277     29.8 *    8250       276      29.9 *
454.calculix     8250       277     29.7 S    8250       276      29.9 S
459.GemsFDTD    10610       326     32.5 S   10610       333      31.8 S
459.GemsFDTD    10610       316     33.6 S   10610       318      33.4 S
459.GemsFDTD    10610       317     33.5 *   10610       320      33.2 *
465.tonto        9840       421     23.4 S    9840       419      23.5 S
465.tonto        9840       419     23.5 S    9840       419      23.5 *
465.tonto        9840       420     23.5 *    9840       418      23.5 S
470.lbm         13740       253     54.2 *   13740       254      54.1 S
470.lbm         13740       251     54.6 S   13740       251      54.8 S
470.lbm         13740       254     54.0 S   13740       251      54.7 *
481.wrf         11170       291     38.4 S   11170       293      38.2 S
481.wrf         11170       288     38.8 S   11170       288      38.8 S
481.wrf         11170       289     38.6 *   11170       289      38.6 *
482.sphinx3     19490       398     49.0 S   19490       406      48.0 S
482.sphinx3     19490       406     48.1 S   19490       401      48.6 S
482.sphinx3     19490       399     48.9 *   19490       401      48.6 *
==============================================================================
410.bwaves      13590       197     69.0 *   13590       197      69.0 *
416.gamess      19580       592     33.1 *   19580       588      33.3 *
433.milc         9180       335     27.4 *    9180       337      27.2 *
434.zeusmp       9100       232     39.2 *    9100       233      39.1 *
435.gromacs      7140       316     22.6 *    7140       263      27.1 *
436.cactusADM   11950       201     59.3 *   11950       205      58.3 *
437.leslie3d     9400       219     42.9 *    9400       219      42.9 *
444.namd         8020       300     26.7 *    8020       302      26.5 *
447.dealII      11440       248     46.2 *   11440       246      46.5 *
450.soplex       8340       215     38.8 *    8340       216      38.6 *
453.povray       5320       120     44.2 *    5320       119      44.8 *
454.calculix     8250       277     29.8 *    8250       276      29.9 *
459.GemsFDTD    10610       317     33.5 *   10610       320      33.2 *
465.tonto        9840       420     23.5 *    9840       419      23.5 *
470.lbm         13740       253     54.2 *   13740       251      54.7 *
481.wrf         11170       289     38.6 *   11170       289      38.6 *
482.sphinx3     19490       399     48.9 *   19490       401      48.6 *
 Est. SPECfp(R)_base2006             38.0
 Est. SPECfp2006                                                  38.4

So overall the patch is a loss for SPEC CPU 2006 INT, due to the 429.mcf
and 471.omnetpp regressions, and a win on SPEC FP.  (I didn't test SPEC
INT previously, only FP.)  But as you noted, the patch only changes
allocation preference in case of equal costs.

I've also looked at stage2 binary sizes, and patched we see a growth of
cc1 from 35450264 bytes to 35459552 (executable size).
There are both ups and downs for individual .o files though.

The gcc.target/i386/addr-sel-1.c test (for PR28940) seems to have just
started working at some point in the past, and thus it was added and the
bug closed.  You could say RA does a better job after the patch, as it
uses one less register, but that restricts the followup postreload
combine attempts.  Though I wonder what "better" RA is here: isn't the
best allocation one that avoids spills but uses as many registers as
possible (at least when targeting a CPU that cannot do register
renaming)?  regrename doesn't help this testcase either (it runs too
late and does a renaming that doesn't help).

With all of the above I'm not sure what to do for GCC 6 (even though you
just approved the patch).  Going with the patch alternative (just revert
the swapping parts for the commutative operands) looks completely bogus,
though it works for fixing the regression as well.

So I have applied the patch now, giving us a few days to get other
people's (and my own) auto-testers the chance to pick up results with
it, and after that we can consider reverting or going with the (IMHO
bogus) half-way variant.

I've XFAILed half of gcc.target/i386/addr-sel-1.c.

Thanks,
Richard.

> > 2016-02-05  Richard Biener  <rguenther@suse.de>
> >
> > 	PR rtl-optimization/69274
> > 	* ira.c (ira_setup_alts): Do not change recog_data.operand
> > 	order.
> >
> > Index: gcc/ira.c
> > ===================================================================
> > --- gcc/ira.c	(revision 231814)
> > +++ gcc/ira.c	(working copy)
> > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG
> >  	    }
> >  	  if (commutative < 0)
> >  	    break;
> > -	  if (curr_swapped)
> > -	    break;
> > +	  /* Swap forth and back to avoid changing recog_data.  */
> >  	  std::swap (recog_data.operand[commutative],
> >  		     recog_data.operand[commutative + 1]);
> > +	  if (curr_swapped)
> > +	    break;
> >  	}
> >      }
On 02/08/2016 10:09 AM, Richard Biener wrote:
> The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
> working at some point past in time and thus it was added and the
> bug closed.  You could say RA does a better job after the patch
> as it uses 1 less register but that restricts the followup
> postreload combine attempts.  Though I wonder about what's "better"
> RA here - isn't the best allocation one that avoids spills but
> uses as many registers as possible (at least when targeting a CPU
> that cannot to register renaming)?  regrename doesn't help this
> testcase either (it runs too late and does a renaming that doesn't help).

I don't think regrename's place in the pass pipeline is set in stone.
If we enable it for gcc-7, we could also experiment with putting it just
before postreload, or schedule another reload-combine afterwards.

Bernd
On 02/08/2016 04:09 AM, Richard Biener wrote:
> With all of the above I'm not sure what to do for GCC 6 (even though
> you just approved the patch).  Going with the patch alternative (just
> revert swapping parts of the commutative operands) looks like
> completely bogus though it works for fixing the regression as well.

Yes, it is bogus.  But I don't see another easy way to close the P1 PR
except making it P2.

I still have plans to rewrite ira-costs.c (maybe for GCC 7).  The
algorithm is actually an adaptation of the old regclass.c one.  There
are a few complaints about wrong cost calculations because of this (the
most recent complaint was Peter Bergner's one about a power8 test).  I
don't like the algorithm; it ignores the fact that insn operands should
be from the same alternative during pseudo (allocno) cost calculations.
So this change probably will not survive the rewrite of ira-costs.c.
Hi,

On Mon, 8 Feb 2016, Richard Biener wrote:

> 429.mcf          9120       243     37.6 S    9120       245      37.3 S
> 429.mcf          9120       224     40.7 S    9120       241      37.8 *
> 429.mcf          9120       225     40.5 *    9120       229      39.9 S
> 471.omnetpp      6250       326     19.1 S    6250       328      19.1 S
> 471.omnetpp      6250       305     20.5 *    6250       324      19.3 *
> 471.omnetpp      6250       296     21.1 S    6250       305      20.5 S
> 435.gromacs      7140       316     22.6 S    7140       264      27.1 S
> 435.gromacs      7140       314     22.7 S    7140       263      27.2 S
> 435.gromacs      7140       316     22.6 *    7140       263      27.1 *
>
> So overall the patch is a loss for SPEC CPU 2006 INT due to the 429.mcf
> and 471.omnetpp regressions and a win on SPEC FP.  (I didn't test SPEC
> INT previously, only FP)

That's no regression if you look at the detailed results (like the
above).  mcf and omnetpp were noisy on your machine; if you took the
minimum of each run, only the large progression of gromacs would remain.

> The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started
> working at some point past in time and thus it was added and the
> bug closed.  You could say RA does a better job after the patch
> as it uses 1 less register but that restricts the followup
> postreload combine attempts.  Though I wonder about what's "better"
> RA here - isn't the best allocation one that avoids spills but
> uses as many registers as possible (at least when targeting a CPU
> that cannot to register renaming)?

Yeah, I think the testcase was added for the wrong reasons.  It does
exhibit a desirable outcome for sure, but the output became such by
random changes, and hence can also go back by other changes.

> So I have applied the patch now,

I think the patch makes perfect sense.  ira_setup_alts should have no
observable behaviour from the outside, except the returned value of
merged acceptable alternatives.  Certainly it has no business fiddling
with recog_data.  It only does the swapping to merge alternatives, and
accidentally left them swapped; the merging could have been implemented
without swapping recog_data.operands, and then the whole issue wouldn't
have occurred (and addr-sel-1.c wouldn't have been added because it
still would be "broken").


Ciao,
Michael.
On 02/08/2016 12:38 PM, Michael Matz wrote:
>
> I think the patch makes perfect sense.  ira_setup_alts should have no
> observable behaviour from the outside, except the returned value of merged
> acceptable alternatives.  Certainly it has no business to fiddle with
> recog_data.  It only does the swapping to merge alternatives, and
> accidentaly left them swapped; the merging could have been implemented
> without swapping recog_data.operands, and then the whole issue wouldn't
> have occurred (and addr-sel-1.c wouldn't have been added because it still
> would be "broken").
>
Sorry, I was confused by Richard's message, thinking that his patch
actually exchanges the operands.  I think we have some expression-shaping
optimizations, and exchanging the operands probably defeats their effect.
From this point of view ira-costs.c should not exchange operands.  So the
patch is not bogus as Richard wrote, but perfectly legitimate.
On February 8, 2016 7:07:46 PM GMT+01:00, Vladimir Makarov <vmakarov@redhat.com> wrote:
> On 02/08/2016 12:38 PM, Michael Matz wrote:
> >
> > I think the patch makes perfect sense.  ira_setup_alts should have no
> > observable behaviour from the outside, except the returned value of
> > merged acceptable alternatives.  Certainly it has no business to fiddle
> > with recog_data.  It only does the swapping to merge alternatives, and
> > accidentaly left them swapped; the merging could have been implemented
> > without swapping recog_data.operands, and then the whole issue wouldn't
> > have occurred (and addr-sel-1.c wouldn't have been added because it
> > still would be "broken").
> >
> Sorry, I was confused by Richard's message thinking that his patch
> actually exchanges the operands.  I think we have some expression
> shaping optimizations and exchanging operands probably rejects the
> optimization effect.  With this point of view ira-costs.c should not
> exchange operands.  So the patch is not bogus as Richard wrote but
> perfectly legitimate.

Yes, I meant the posted alternative patch that just unswapped for the
cases where Andreas' patch changed whether we consider swapping or not.
The applied patch indeed makes perfect sense.

Richard.