Message ID: 4C2AF62F.4030700@codesourcery.com
State: New
On 06/30/10 01:45, Jie Zhang wrote:
> Currently reload_cse_move2add can transform
>
>   (set (REGX) (CONST_INT A))
>   ...
>   (set (REGX) (CONST_INT B))
>
> to
>
>   (set (REGX) (CONST_INT A))
>   ...
>   (set (REGX) (plus (REGX) (CONST_INT B-A)))
>
> This patch enhances it to be able to transform
>
>   (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>   ...
>   (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>
> to
>
>   (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>   ...
>   (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>
> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
> improvement on one test:
>
>   idctrn01: 6%

Was this a size or runtime performance improvement?

> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
> regression on CINT2000 and 0.1% improvement on CFP2000.
>
> Bootstrapped and regression tested on x86_64.

Any thoughts on why spec2k showed a regression, and was it a size or runtime regression?

I'm generally OK with the patch, but want to understand what's happening benchmark-wise before final approval.

jeff
On 07/01/2010 12:47 AM, Jeff Law wrote:
> On 06/30/10 01:45, Jie Zhang wrote:
>> Currently reload_cse_move2add can transform
>>
>>   (set (REGX) (CONST_INT A))
>>   ...
>>   (set (REGX) (CONST_INT B))
>>
>> to
>>
>>   (set (REGX) (CONST_INT A))
>>   ...
>>   (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>
>> This patch enhances it to be able to transform
>>
>>   (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>   ...
>>   (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>
>> to
>>
>>   (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>   ...
>>   (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>
>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>> improvement on one test:
>>
>>   idctrn01: 6%
> Was this a size or runtime performance improvement?

This is a runtime performance improvement.

>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>
>> Bootstrapped and regression tested on x86_64.
> Any thoughts on why spec2k showed a regression and was it a size or
> runtime regression?

I'm not sure what caused the regressions. I'm redoing the benchmarking. This time I'm doing it without X and with running servers shut down as much as I can. Hopefully this will remove measurement error.

> I'm generally OK with the patch, but want to understand what's happening
> benchmark-wise before final approval.

OK.

Thanks,
Jie
On 06/30/10 11:13, Jie Zhang wrote: > On 07/01/2010 12:47 AM, Jeff Law wrote: >> On 06/30/10 01:45, Jie Zhang wrote: >>> Currently reload_cse_move2add can transform >>> >>> (set (REGX) (CONST_INT A)) >>> ... >>> (set (REGX) (CONST_INT B)) >>> >>> to >>> >>> (set (REGX) (CONST_INT A)) >>> ... >>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>> >>> This patch enhances it to be able to transform >>> >>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>> ... >>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>> >>> to >>> >>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>> ... >>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>> >>> >>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>> improvement on one test: >>> >>> idctrn01: 6% >> Was this a size or runtime performance improvement? >> > This is a runtime performance improvement. That's quite a surprise. Just for giggles, does x86 show any change on idctrn01? I realize it's an eembc benchmark, but if it's that sensitive to this optimization, we ought to see some change in behaviour for x86 as well. > >>> >>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>> regression on CINT2000 and 0.1% improvement on CFP2000. >>> >>> Bootstrapped and regression tested on x86_64. >> Any thoughts on why spec2k showed a regression and was it a size or >> runtime regression? >> > I'm not sure what caused the regressions. I'm redoing the > benchmarking. This time I do it without X and shut down running > servers as much as I can. Hope this can remove measuring error. Strongly advised (shut down X and as many services as possible, turn off speed scaling in the processor, etc). If you can get size #s, that would be interesting too -- I'd expect this to be a small size improvement independent of the processor architecture. Runtime performance I don't have a good feel for -- I can easily envision cases where it's going to be better and others where it's going to be worse. Jeff
On 07/02/2010 02:10 AM, Jeff Law wrote: > On 06/30/10 11:13, Jie Zhang wrote: >> On 07/01/2010 12:47 AM, Jeff Law wrote: >>> On 06/30/10 01:45, Jie Zhang wrote: >>>> Currently reload_cse_move2add can transform >>>> >>>> (set (REGX) (CONST_INT A)) >>>> ... >>>> (set (REGX) (CONST_INT B)) >>>> >>>> to >>>> >>>> (set (REGX) (CONST_INT A)) >>>> ... >>>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>>> >>>> This patch enhances it to be able to transform >>>> >>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>> ... >>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>>> >>>> to >>>> >>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>> ... >>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>>> >>>> >>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>>> improvement on one test: >>>> >>>> idctrn01: 6% >>> Was this a size or runtime performance improvement? >>> >> This is a runtime performance improvement. > That's quite a surprise. Just for giggles, does x86 show any change on > idctrn01? I realize it's an eembc benchmark, but if it's that sensitive > to this optimization, we ought to see some change in behaviour for x86 > as well. > I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for benchmarking on x86 here. I need to ask if it's possible to setup EEMBC for x86 in our environment. > >> >>>> >>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>>> regression on CINT2000 and 0.1% improvement on CFP2000. >>>> >>>> Bootstrapped and regression tested on x86_64. >>> Any thoughts on why spec2k showed a regression and was it a size or >>> runtime regression? >>> >> I'm not sure what caused the regressions. I'm redoing the >> benchmarking. This time I do it without X and shut down running >> servers as much as I can. Hope this can remove measuring error. > Strongly advised (shut down X and as many services as possible, turn off > speed scaling in the processor, etc). 
>
Yes, I did all of these this time.

> If you can get size #s, that would be interesting too -- I'd expect this
> to be a small size improvement independent of the processor
> architecture. Runtime performance I don't have a good feel for -- I can
> easily envision cases where it's going to be better and others where
> it's going to be worse.
>
I used 5 iterations and tested both -O2 and -O3 this time. It has been running for 24 hours and might need one or two more hours to complete, so I will have to report the results tomorrow.

Thanks,
On 07/02/2010 02:26 AM, Jie Zhang wrote: > On 07/02/2010 02:10 AM, Jeff Law wrote: >> On 06/30/10 11:13, Jie Zhang wrote: >>> On 07/01/2010 12:47 AM, Jeff Law wrote: >>>> On 06/30/10 01:45, Jie Zhang wrote: >>>>> Currently reload_cse_move2add can transform >>>>> >>>>> (set (REGX) (CONST_INT A)) >>>>> ... >>>>> (set (REGX) (CONST_INT B)) >>>>> >>>>> to >>>>> >>>>> (set (REGX) (CONST_INT A)) >>>>> ... >>>>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>>>> >>>>> This patch enhances it to be able to transform >>>>> >>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>> ... >>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>>>> >>>>> to >>>>> >>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>> ... >>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>>>> >>>>> >>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>>>> improvement on one test: >>>>> >>>>> idctrn01: 6% >>>> Was this a size or runtime performance improvement? >>>> >>> This is a runtime performance improvement. >> That's quite a surprise. Just for giggles, does x86 show any change on >> idctrn01? I realize it's an eembc benchmark, but if it's that sensitive >> to this optimization, we ought to see some change in behaviour for x86 >> as well. >> > I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for > benchmarking on x86 here. I need to ask if it's possible to setup EEMBC > for x86 in our environment. > It's not easy for me to run EEMBC on x86. We have no dedicated x86 hardware for running EEMBC. >> >>> >>>>> >>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>>>> regression on CINT2000 and 0.1% improvement on CFP2000. >>>>> >>>>> Bootstrapped and regression tested on x86_64. >>>> Any thoughts on why spec2k showed a regression and was it a size or >>>> runtime regression? >>>> >>> I'm not sure what caused the regressions. I'm redoing the >>> benchmarking. 
>>> This time I do it without X and shut down running
>>> servers as much as I can. Hope this can remove measuring error.
>> Strongly advised (shut down X and as many services as possible, turn off
>> speed scaling in the processor, etc).
>>
> Yes, I did all of these this time.
>
>> If you can get size #s, that would be interesting too -- I'd expect this
>> to be a small size improvement independent of the processor
>> architecture. Runtime performance I don't have a good feel for -- I can
>> easily envision cases where it's going to be better and others where
>> it's going to be worse.
>>
> I use 5 iterations and test -O2 and -O3 this time. It have been running
> for 24 hours. It might need one or two more hours to complete. So I have
> to report the results tomorrow.
>

Here are the new results. 'Before' is before applying my patch, 'After'
is after applying my patch.

First the results for -O3.

---------------------------------------------------
CINT2000 -O3          Before    After    Change
---------------------------------------------------
164.gzip                1016     1015    -0.10%
175.vpr                  813      810    -0.37%
176.gcc                 1048     1048     0.00%
181.mcf                  625      624    -0.16%
186.crafty              1705     1706     0.06%
197.parser               769      773     0.52%
252.eon                 1857     1857     0.00%
253.perlbmk             1264     1267     0.24%
254.gap                 1097     1093    -0.36%
255.vortex              1513     1510    -0.20%
256.bzip2                980      980     0.00%
300.twolf                808      807    -0.12%
---------------------------------------------------
SPECint_base2000        1067     1067     0.00%
---------------------------------------------------

Notes on the above result:

* No significant changes.
---------------------------------------------------
CFP2000 -O3           Before    After    Change
---------------------------------------------------
168.wupwise             1187     1186    -0.08%
171.swim                1351     1364     0.96%
172.mgrid                881      882     0.11%
173.applu               1072     1069    -0.28%
177.mesa                1292     1290    -0.15%
178.galgel              1576     1583     0.44%
179.art                 1013      998    -1.48%
183.equake              1442     1428    -0.97%
187.facerec             1077     1078     0.09%
188.ammp                 873      876     0.34%
189.lucas               1399     1417     1.29%
191.fma3d               1078     1074    -0.37%
200.sixtrack             527      527     0.00%
301.apsi                 963      970     0.73%
---------------------------------------------------
SPECfp_base2000         1088     1088     0.00%
---------------------------------------------------

Notes on the above results:

* 171: The 5 results for "Before" are 1351, 1298, 1325, 1357, 1360,
  while for "After" are 1364, 1368, 1357, 1358, 1375.
  Seems a real improvement.
* 179: The 5 results for "Before" are 1013, 1023, 1004, 1012, 1016,
  while for "After" are 993, 998, 998, 1009, 1005.
  Seems a real regression.
* 183: The 5 results for "Before" are 1442, 1442, 1442, 1443, 1441,
  while for "After" are 1427, 1431, 1427, 1433, 1428.
  Seems a real regression.
* 189: The 5 results for "Before" are 1396, 1399, 1401, 1394, 1399,
  while for "After" are 1418, 1416, 1417, 1401, 1418.
  Seems a real improvement.

Then the results for -O2:

---------------------------------------------------
CINT2000 -O2          Before    After    Change
---------------------------------------------------
164.gzip                1011     1012     0.10%
175.vpr                  791      810     2.40%
176.gcc                 1033     1031    -0.19%
181.mcf                  624      624     0.00%
186.crafty              1699     1698    -0.06%
197.parser               739      736    -0.41%
252.eon                 1880     1880     0.00%
253.perlbmk             1341     1342     0.07%
254.gap                 1089     1094     0.46%
255.vortex              1449     1448    -0.07%
256.bzip2                970      971     0.10%
300.twolf                805      801    -0.50%
---------------------------------------------------
SPECint_base2000        1060     1062     0.19%
---------------------------------------------------

Notes on the above results:

* 175: The 5 results for "Before" are 789, 791, 789, 793, 801,
  while for "After" are 810, 811, 811, 808, 808.
  Seems a real improvement.

---------------------------------------------------
CFP2000 -O2           Before    After    Change
---------------------------------------------------
168.wupwise             1260     1232    -2.22%
171.swim                1382     1373    -0.65%
172.mgrid                905      900    -0.55%
173.applu               1049     1051     0.19%
177.mesa                1301     1300    -0.08%
178.galgel              1573     1565    -0.51%
179.art                 1003     1002    -0.10%
183.equake              1352     1368     1.18%
187.facerec             1057     1062     0.47%
188.ammp                 886      885    -0.11%
189.lucas               1406     1380    -1.85%
191.fma3d               1046     1048     0.19%
200.sixtrack             526      526     0.00%
301.apsi                 965      955    -1.04%
---------------------------------------------------
SPECfp_base2000         1087     1083    -0.37%
---------------------------------------------------

Notes on the above results:

* 168: The 5 results for "Before" are 1261, 1231, 1261, 1253, 1260,
  while for "After" are 1257, 1255, 1229, 1232, 1232. Hard to say if
  it's a real regression; at least not a regression as large as -2.22%.
* 183: The 5 results for "Before" are 1352, 1353, 1352, 1353, 1352,
  while for "After" are 1368, 1368, 1368, 1367, 1367.
  Seems a real improvement.
* 189: The 5 results for "Before" are 1403, 1406, 1424, 1402, 1425,
  while for "After" are 1380, 1399, 1380, 1380, 1382.
  Seems a real regression.
* 301: The 5 results for "Before" are 963, 967, 965, 964, 966,
  while for "After" are 955, 954, 955, 954, 955.
  Seems a real regression.

Summary
=======

I list all tests with ABS(change) >= 1% below:

CINT2000            Change(-O3)    Change(-O2)
----------------------------------------
175.vpr                -0.37%          2.40%
----------------------------------------
SPECint_base2000        0.00%          0.19%

CFP2000             Change(-O3)    Change(-O2)
----------------------------------------
168.wupwise            -0.08%         -2.22%
171.swim                0.96%         -0.65%
179.art                -1.48%         -0.10%
183.equake             -0.97%          1.18%
189.lucas               1.29%         -1.85%
301.apsi                0.73%         -1.04%
----------------------------------------
SPECfp_base2000         0.00%         -0.37%

Some tests have regressions while some other tests have improvements.
Some tests regress at one optimization level but show no change, only a minor change, or even an improvement at the other.

With these test results, is it safe to commit my patch?

Regards,
On 07/02/2010 12:58 PM, Jie Zhang wrote:
> 168.wupwise 1260 1232 -2.22%
> 189.lucas 1406 1380 -1.85%

Lucas is insanely dependent on register allocation and scheduling. With -fweb and -frename-registers both disabled at -O2, I suspect the reason for the regression is that your patch increases register lifetimes and makes scheduling a bit harder.

I don't think this should be a reason to hold up your patch though; it would be more profitable to add some kind of rematerialization to the scheduler.

Paolo
On 07/02/10 05:59, Paolo Bonzini wrote:
> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>> 168.wupwise 1260 1232 -2.22%
>> 189.lucas 1406 1380 -1.85%
> I suspect the reason of the regression is that your patch increases
> register lifetimes and makes scheduling a bit harder.

Which is precisely what I was trying to figure out. Note the patch changes the post-reload code, so while lifetimes are increased, it really doesn't affect register allocation. So any regressions ought to be scheduling or other micro-architecture issues, and I was having a hard time believing that we're going to see a measurable regression, particularly on any modern x86 implementation.

> I don't think this should be a reason to hold up your patch though, it
> would be more profitable to add some kind of rematerialization to the
> scheduler.

One way to do this would be to attach the equivalent (original) form and somehow note to the scheduler that the equivalent form can be directly substituted for the current form. When the scheduler has a bubble in the pipeline it could search QUEUE for an insn with an equivalent form; if found, it would verify dependencies and possibly issue the insn with the equivalent (original) form.

There are other instances where such capabilities might be useful, but I'd consider them outside the scope of this patch.

I'm still reviewing the data -- a pair of 2% regressions is pretty significant.

> Paolo
> On 07/02/2010 02:26 AM, Jie Zhang wrote:
>>
>> I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for
>> benchmarking on x86 here. I need to ask if it's possible to setup EEMBC
>> for x86 in our environment.
>>
> It's not easy for me to run EEMBC on x86. We have no dedicated x86
> hardware for running EEMBC.

Even running it just a handful of times on whatever your development workstation happens to be would help, and it's only one test that's of particular interest (idctrn01). Basically this one test is a clear outlier on ARM, so I'd like to see if that behaviour shows up on x86 as well.

>>> If you can get size #s, that would be interesting too -- I'd expect
>>> this
>>> to be a small size improvement independent of the processor
>>> architecture. Runtime performance I don't have a good feel for -- I can
>>> easily envision cases where it's going to be better and others where
>>> it's going to be worse.
>>>
>> I use 5 iterations and test -O2 and -O3 this time. It have been running
>> for 24 hours. It might need one or two more hours to complete. So I have
>> to report the results tomorrow.
>>
>
> Here are the new results. 'Before' is before applying my patch,
> 'After' is after applying my patch.

Thanks. Overall it looks like a wash (with the exception of idctrn01 on arm).

Did you get the codesize #s? I'd expect them to show a small, but consistent improvement which ought to be enough to push the patch from a wash to a clear improvement on both x86 & arm.

jeff
On 07/02/2010 01:59 PM, Paolo Bonzini wrote: > On 07/02/2010 12:58 PM, Jie Zhang wrote: >> 168.wupwise 1260 1232 -2.22% >> 189.lucas 1406 1380 -1.85% > > Lucas is insanely dependent on register allocation and scheduling. With > -fweb and -frename-registers both disabled at -O2 This reminds me. I committed a number of regrename speedup patches earlier this year - does anyone think it would be a good idea to add that to -O2 now? Bernd
On 07/08/2010 12:54 AM, Jeff Law wrote:
> On 07/02/10 05:59, Paolo Bonzini wrote:
>> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>>> 168.wupwise 1260 1232 -2.22%
>>> 189.lucas 1406 1380 -1.85%
>> I suspect the reason of the regression is that your patch increases
>> register lifetimes and makes scheduling a bit harder.
> Which is precisely what I was trying to figure out. Note the patch
> changes the post-reload code, so while lifetimes are
> increased, it really doesn't affect register allocation. So any
> regressions ought to be scheduling or other micro-architecture issues
> and I was having a hard time believing that we're going to see a
> measurable regression, particularly on any modern x86 implementation.

I was wondering whether it would be different if I ran the benchmark on an Intel machine, but currently I have no such spare machine to run it on.

>> I don't think this should be a reason to hold up your patch though, it
>> would be more profitable to add some kind of rematerialization to the
>> scheduler.
> One way to do this would be to attach the equivalent (original) form and
> somehow note to the scheduler that the equivalent form can be directly
> substituted for the current form. When the scheduler has a bubble in the
> pipeline it could search QUEUE for an insn with an equivalent form, if
> found, it would verify dependencies and possibly issue the insn with the
> equivalent (original) form.

This is a really good idea. ARM Cortex-A8 has two ways to load large constants: one uses the constant pool, the other uses MOVW/MOVT. These two methods have different pipeline usage. It would be good if the scheduler could choose between them according to the current scheduling requirements.

> There's other instances where such capabilities might be useful, but I'd
> consider them outside the scope of this patch.
>
> I'm still reviewing the data -- a pair of 2% regressions is pretty
> significant.

Yeah. And -O3 does not have them. I would like to take a closer look.
But tests in SPEC are not like EEMBC...
On 07/07/2010 06:54 PM, Jeff Law wrote:
> I'm still reviewing the data -- a pair of 2% regressions is pretty
> significant.

Note that Jie's raw data for wupwise suggests that there isn't actually that big a regression there; it just picked a high value in one run and a low value in the other, but there were instances of both in either run.

Bernd
On 07/08/2010 01:02 AM, Bernd Schmidt wrote:
> On 07/02/2010 01:59 PM, Paolo Bonzini wrote:
>> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>>> 168.wupwise 1260 1232 -2.22%
>>> 189.lucas 1406 1380 -1.85%
>>
>> Lucas is insanely dependent on register allocation and scheduling. With
>> -fweb and -frename-registers both disabled at -O2
>
> This reminds me. I committed a number of regrename speedup patches
> earlier this year - does anyone think it would be a good idea to add
> that to -O2 now?
>

I'm curious why regrename is not enabled at -O2.
On 07/07/2010 07:02 PM, Bernd Schmidt wrote: > On 07/02/2010 01:59 PM, Paolo Bonzini wrote: >> On 07/02/2010 12:58 PM, Jie Zhang wrote: >>> 168.wupwise 1260 1232 -2.22% >>> 189.lucas 1406 1380 -1.85% >> >> Lucas is insanely dependent on register allocation and scheduling. With >> -fweb and -frename-registers both disabled at -O2 > > This reminds me. I committed a number of regrename speedup patches > earlier this year - does anyone think it would be a good idea to add > that to -O2 now? Definitely on i686, since it disables pre-regalloc scheduling. You cannot get decent scheduling there without regrename. Paolo
On 07/07/2010 06:54 PM, Jeff Law wrote:
> Note the patch changes the post-reload code, so while lifetimes are
> increased, it really doesn't affect register allocation. So any
> regressions ought to be scheduling or other micro-architecture issues
> and I was having a hard time believing that we're going to see a
> measurable regression, particularly on any modern x86 implementation.

Out-of-order execution and elimination of inter-instruction dependencies (especially load-to-use dependencies) are much more important on modern x86 cores than Intel would like you to believe. See PR38824 and PR27827 for examples.

Also, even if scheduling only buys decode bandwidth, it may require more freedom than what IRA leaves. Bernd's idea of turning regrename on at -O2 is a good one.

Paolo
On 07/08/2010 06:21 PM, Paolo Bonzini wrote: > On 07/07/2010 06:54 PM, Jeff Law wrote: >> Note the patch changes the post-reload code, so while lifetimes are >> increased, it really doesn't affect register allocation. So any >> regressions ought to be scheduling or other micro-architecture issues >> and I was having a hard time believing that we're going to see a >> measurable regression, particularly on any modern x86 implementation. > > Out-of-order execution and elimination of inter-instruction dependencies > (especially load-to-use dependencies) are much more important on modern > x86 cores than Intel would like you to believe. See PR38824 and PR27827 > for examples. > > Also, even if scheduling is only decoding bandwidth, it may require more > freedom than what IRA leaves. Bernd's idea of turning regrename on at > -O2 is a good one. Here's a slightly crazier idea then: What we could additionally do in IRA is to allocate some hard registers speculatively if we think there's an important loop in the function. That would give us more room to work with in regrename, at the expense of saving more regs in the prologue. Bernd
On 07/08/2010 01:02 AM, Jeff Law wrote:
>> On 07/02/2010 02:26 AM, Jie Zhang wrote:
>>>
>>> I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for
>>> benchmarking on x86 here. I need to ask if it's possible to setup EEMBC
>>> for x86 in our environment.
>>>
>> It's not easy for me to run EEMBC on x86. We have no dedicated x86
>> hardware for running EEMBC.
> Even if you just run it a handful of times on whatever your development
> workstation happens to be and it's only one test that's of particular
> interest (idctrn01). Basically this one test is a clear outlier on ARM,
> so I'd like to see if that behaviour shows up on x86 as well.
>

I just tested idctrn01 on my pentium-m laptop. My patch does not change the performance of idctrn01 on i386. I did some investigation. It's because i386 can directly use "symbol_ref + offset" as the address in load instructions, so there is no optimization opportunity that my patch can utilize. That's the difference between RISC and CISC.

I also tried my patch on PowerPC. It does not make any difference on PowerPC either. That's because when the reload pass inserts instructions to load "symbol_ref + offset" into a register, it uses two instructions instead of one:

  (set (reg:SI 23 23)
       (high:SI (const:SI (plus:SI (symbol_ref:SI ("*.LANCHOR0"))
                                   (const_int 336 [0x150])))))
  (set (reg:SI 23 23)
       (lo_sum:SI (reg:SI 23 23)
                  (const:SI (plus:SI (symbol_ref:SI ("*.LANCHOR0"))
                                     (const_int 336 [0x150])))))

Currently my patch cannot optimize this case, but it could be enhanced to handle it with some effort.

>>>> If you can get size #s, that would be interesting too -- I'd expect
>>>> this
>>>> to be a small size improvement independent of the processor
>>>> architecture. Runtime performance I don't have a good feel for -- I can
>>>> easily envision cases where it's going to be better and others where
>>>> it's going to be worse.
>>>>
>>> I use 5 iterations and test -O2 and -O3 this time. It have been running
>>> for 24 hours.
>>> It might need one or two more hours to complete. So I have
>>> to report the results tomorrow.
>>>
>>
>> Here are the new results. 'Before' is before applying my patch,
>> 'After' is after applying my patch.
> Thanks. Overall it looks like a wash (with the exception of idctrn01 on
> arm).
>
> Did you get the codesize #s? I'd expect them to show a small, but
> consistent improvement which ought to be enough to push the patch from a
> wash to a clear improvement on both x86 & arm.
>

I collected the code size data from SPEC2000 on AMD64. My patch does not change code size for any test, at either -O2 or -O3, except:

Code Size
=========
Test -O2            Before     After    Change
----------------------------------------------
172.mgrid            14962     14976        14

Code Size
Test -O3            Before     After    Change
----------------------------------------------
171.swim             19231     19247        16

For EEMBC on ARM, my patch saves some code bytes in the following 4 tests. No changes on other tests, and no changes in data/bss size on any test.

Code Size
=========
Test -O3                  Before     After    Change
----------------------------------------------
automotive/aifftr01        86444     86396       -48
automotive/bitmnp01        60129     60112       -17
automotive/idctrn01        84812     84557      -255
office/text01              72685     72668       -17

Regards,
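To make the possible enhancement Jie mentions concrete, here is a hedged RTL sketch (offsets hypothetical, not taken from the actual patch) of what handling the PowerPC high/lo_sum case could look like once some REGX is already known to hold the same symbol plus a nearby offset:

```
;; before: two insns rebuild "*.LANCHOR0 + 336" from scratch
(set (reg:SI 23)
     (high:SI (const:SI (plus:SI (symbol_ref:SI "*.LANCHOR0")
                                 (const_int 336)))))
(set (reg:SI 23)
     (lo_sum:SI (reg:SI 23)
                (const:SI (plus:SI (symbol_ref:SI "*.LANCHOR0")
                                   (const_int 336)))))

;; after (hypothetical): REGX already holds "*.LANCHOR0 + 320",
;; so derive the new address with one add of the delta 336-320
(set (reg:SI 23) (plus:SI (reg:SI REGX) (const_int 16)))
```

This mirrors the B-A rewrite the patch already performs for single-insn constant loads, applied to the two-insn sequence.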
On 07/09/2010 06:58 PM, Jie Zhang wrote: > I just tested idctrn01 on my pentium-m laptop. My patch does not change > the performance of idctrn01 on i386. I did some investigation. It's > because i386 can directly use "symbol_ref + offset" as the address in > load instructions. So there is no optimization opportunity that my patch > can utilize. That's the difference between RISC and CISC. [...] > I collected the code size data from SPEC2000 on AMD64. My patch does not > change code size for any test whether -O2 or -O3, except > > Code Size > ========= > Test -O2 Before After Change > ---------------------------------------------- > 172.mgrid 14962 14976 14 > > Code Size > Test -O3 Before After Change > ---------------------------------------------- > 171.swim 19231 19247 16 So what kinds of changes are there that would explain a 1.48% drop on 179.art? If you can verify that the code is in fact identical, and the performance change is noise, the patch is ok. Bernd
On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
> On 07/09/2010 06:58 PM, Jie Zhang wrote:
>> I just tested idctrn01 on my pentium-m laptop. My patch does not change
>> the performance of idctrn01 on i386. I did some investigation. It's
>> because i386 can directly use "symbol_ref + offset" as the address in
>> load instructions. So there is no optimization opportunity that my patch
>> can utilize. That's the difference between RISC and CISC.
> [...]
>> I collected the code size data from SPEC2000 on AMD64. My patch does not
>> change code size for any test whether -O2 or -O3, except
>>
>> Code Size
>> =========
>> Test -O2 Before After Change
>> ----------------------------------------------
>> 172.mgrid 14962 14976 14
>>
>> Code Size
>> Test -O3 Before After Change
>> ----------------------------------------------
>> 171.swim 19231 19247 16
>
> So what kinds of changes are there that would explain a 1.48% drop on
> 179.art? If you can verify that the code is in fact identical, and the
> performance change is noise, the patch is ok.
>

I checked the disassemblies of all SPEC2000 tests. My patch generates different code only for 171.swim, 172.mgrid and 173.applu, at either -O2 or -O3. It seems the noise can be as large as 1.48%.
On Mon, Jul 12, 2010 at 9:10 AM, Jie Zhang <jie@codesourcery.com> wrote: > On 07/10/2010 08:06 AM, Bernd Schmidt wrote: >> >> On 07/09/2010 06:58 PM, Jie Zhang wrote: >>> >>> I just tested idctrn01 on my pentium-m laptop. My patch does not change >>> the performance of idctrn01 on i386. I did some investigation. It's >>> because i386 can directly use "symbol_ref + offset" as the address in >>> load instructions. So there is no optimization opportunity that my patch >>> can utilize. That's the difference between RISC and CISC. >> >> [...] >>> >>> I collected the code size data from SPEC2000 on AMD64. My patch does not >>> change code size for any test whether -O2 or -O3, except >>> >>> Code Size >>> ========= >>> Test -O2 Before After Change >>> ---------------------------------------------- >>> 172.mgrid 14962 14976 14 >>> >>> Code Size >>> Test -O3 Before After Change >>> ---------------------------------------------- >>> 171.swim 19231 19247 16 >> >> So what kinds of changes are there that would explain a 1.48% drop on >> 179.art? If you can verify that the code is in fact identical, and the >> performance change is noise, the patch is ok. >> > I checked the disassemblies of all SPEC2000 tests. My patch generates > different code only for 171.swim, 172.mgrid and 173.applu whether with -O2 > or -O3. It seems the noise can be as large as 1.48%. It can be much worse, depending on your machine configuration, cache, and whether the SPEC*CPU* test is actually memory-bound (like art, mcf, ammp, and a few others)... Ciao! Steven
On 07/12/2010 03:17 PM, Steven Bosscher wrote:
> On Mon, Jul 12, 2010 at 9:10 AM, Jie Zhang<jie@codesourcery.com> wrote:
>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>> On 07/09/2010 06:58 PM, Jie Zhang wrote:
>>>> I just tested idctrn01 on my pentium-m laptop. My patch does not change
>>>> the performance of idctrn01 on i386. I did some investigation. It's
>>>> because i386 can directly use "symbol_ref + offset" as the address in
>>>> load instructions. So there is no optimization opportunity that my patch
>>>> can utilize. That's the difference between RISC and CISC.
>>> [...]
>>>> I collected the code size data from SPEC2000 on AMD64. My patch does not
>>>> change code size for any test whether -O2 or -O3, except
>>>>
>>>> Code Size
>>>> =========
>>>> Test -O2 Before After Change
>>>> ----------------------------------------------
>>>> 172.mgrid 14962 14976 14
>>>>
>>>> Code Size
>>>> Test -O3 Before After Change
>>>> ----------------------------------------------
>>>> 171.swim 19231 19247 16
>>>
>>> So what kinds of changes are there that would explain a 1.48% drop on
>>> 179.art? If you can verify that the code is in fact identical, and the
>>> performance change is noise, the patch is ok.
>>>
>> I checked the disassemblies of all SPEC2000 tests. My patch generates
>> different code only for 171.swim, 172.mgrid and 173.applu whether with -O2
>> or -O3. It seems the noise can be as large as 1.48%.
>
> It can be much worse, depending on your machine configuration, cache,
> and whether the SPEC*CPU* test is actually memory-bound (like art,
> mcf, ammp, and a few others)...

Even though I ran 5 iterations, there can still be noise like what I got for 179.art:

* 179: The 5 results for "Before" are 1013, 1023, 1004, 1012, 1016,
  while for "After" are 993, 998, 998, 1009, 1005.

I'm lucky this time since the patch does not change the code. But if it had changed the code, telling whether it's noise or a real regression would be much harder.
I wonder how many iterations we have to use if we want to make sure the result can be trusted.
On 06/30/10 01:45, Jie Zhang wrote: > Currently reload_cse_move2add can transform > > (set (REGX) (CONST_INT A)) > ... > (set (REGX) (CONST_INT B)) > > to > > (set (REGX) (CONST_INT A)) > ... > (set (REGX) (plus (REGX) (CONST_INT B-A))) > > This patch enhances it to be able to transform > > (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) > ... > (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) > > to > > (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) > ... > (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) > > > Benchmarking using EEMBC on ARM Cortex-A8 shows performance > improvement on one test: > > idctrn01: 6% > > No meaningful changes on other tests. > > Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% > regression on CINT2000 and 0.1% improvement on CFP2000. > > Bootstrapped and regression tested on x86_64. > > Is it OK? Thanks for your patience, particularly with my requests for additional data. I think we're OK with the benchmarking #s, especially given the amount of noise in the x86_64 runs we're seeing. Approved, Jeff
On 07/13/2010 12:37 AM, Jeff Law wrote: > On 06/30/10 01:45, Jie Zhang wrote: >> Currently reload_cse_move2add can transform >> >> (set (REGX) (CONST_INT A)) >> ... >> (set (REGX) (CONST_INT B)) >> >> to >> >> (set (REGX) (CONST_INT A)) >> ... >> (set (REGX) (plus (REGX) (CONST_INT B-A))) >> >> This patch enhances it to be able to transform >> >> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >> ... >> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >> >> to >> >> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >> ... >> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >> >> >> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >> improvement on one test: >> >> idctrn01: 6% >> >> No meaningful changes on other tests. >> >> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >> regression on CINT2000 and 0.1% improvement on CFP2000. >> >> Bootstrapped and regression tested on x86_64. >> >> Is it OK? > Thanks for your patience, particularly when my requests for additional > data. I think we're OK with the benchmarking #s, particularly looking at > the amount of noise in the runs we're seeing for x86_64. > > Appoved, > Jeff, Thanks! I learned a lot during the procedure. On 07/10/2010 08:06 AM, Bernd Schmidt wrote: > So what kinds of changes are there that would explain a 1.48% drop on > 179.art? If you can verify that the code is in fact identical, and the > performance change is noise, the patch is ok. > Bernd, I didn't realize it was a conditional approval until I asked you privately. Thanks! I have committed it on trunk. Regards,
On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang <jie@codesourcery.com> wrote: > On 07/13/2010 12:37 AM, Jeff Law wrote: >> >> On 06/30/10 01:45, Jie Zhang wrote: >>> >>> Currently reload_cse_move2add can transform >>> >>> (set (REGX) (CONST_INT A)) >>> ... >>> (set (REGX) (CONST_INT B)) >>> >>> to >>> >>> (set (REGX) (CONST_INT A)) >>> ... >>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>> >>> This patch enhances it to be able to transform >>> >>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>> ... >>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>> >>> to >>> >>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>> ... >>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>> >>> >>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>> improvement on one test: >>> >>> idctrn01: 6% >>> >>> No meaningful changes on other tests. >>> >>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>> regression on CINT2000 and 0.1% improvement on CFP2000. >>> >>> Bootstrapped and regression tested on x86_64. >>> >>> Is it OK? >> >> Thanks for your patience, particularly when my requests for additional >> data. I think we're OK with the benchmarking #s, particularly looking at >> the amount of noise in the runs we're seeing for x86_64. >> >> Appoved, >> > Jeff, > > Thanks! I learned a lot during the procedure. > > On 07/10/2010 08:06 AM, Bernd Schmidt wrote: >> So what kinds of changes are there that would explain a 1.48% drop on >> 179.art? If you can verify that the code is in fact identical, and the >> performance change is noise, the patch is ok. >> > Bernd, > > I didn't realize it was a conditional approval until I asked you privately. > Thanks! > > > I have committed it on trunk. 
> I think it breaks bootstrap on Linux/ia32: ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs': ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be used uninitialized in this function [-Werror=uninitialized] ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here cc1: all warnings being treated as errors
On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang <jie@codesourcery.com> wrote: >> On 07/13/2010 12:37 AM, Jeff Law wrote: >>> >>> On 06/30/10 01:45, Jie Zhang wrote: >>>> >>>> Currently reload_cse_move2add can transform >>>> >>>> (set (REGX) (CONST_INT A)) >>>> ... >>>> (set (REGX) (CONST_INT B)) >>>> >>>> to >>>> >>>> (set (REGX) (CONST_INT A)) >>>> ... >>>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>>> >>>> This patch enhances it to be able to transform >>>> >>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>> ... >>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>>> >>>> to >>>> >>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>> ... >>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>>> >>>> >>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>>> improvement on one test: >>>> >>>> idctrn01: 6% >>>> >>>> No meaningful changes on other tests. >>>> >>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>>> regression on CINT2000 and 0.1% improvement on CFP2000. >>>> >>>> Bootstrapped and regression tested on x86_64. >>>> >>>> Is it OK? >>> >>> Thanks for your patience, particularly when my requests for additional >>> data. I think we're OK with the benchmarking #s, particularly looking at >>> the amount of noise in the runs we're seeing for x86_64. >>> >>> Appoved, >>> >> Jeff, >> >> Thanks! I learned a lot during the procedure. >> >> On 07/10/2010 08:06 AM, Bernd Schmidt wrote: >>> So what kinds of changes are there that would explain a 1.48% drop on >>> 179.art? If you can verify that the code is in fact identical, and the >>> performance change is noise, the patch is ok. >>> >> Bernd, >> >> I didn't realize it was a conditional approval until I asked you privately. >> Thanks! >> >> >> I have committed it on trunk. 
>> > > I think it breaks bootstrap on Linux/ia32: > > ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs': > ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be > used uninitialized in this function [-Werror=uninitialized] > ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here > cc1: all warnings being treated as errors > I opened: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921
On 07/13/2010 02:38 AM, H.J. Lu wrote: > On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu<hjl.tools@gmail.com> wrote: >> On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang<jie@codesourcery.com> wrote: >>> On 07/13/2010 12:37 AM, Jeff Law wrote: >>>> >>>> On 06/30/10 01:45, Jie Zhang wrote: >>>>> >>>>> Currently reload_cse_move2add can transform >>>>> >>>>> (set (REGX) (CONST_INT A)) >>>>> ... >>>>> (set (REGX) (CONST_INT B)) >>>>> >>>>> to >>>>> >>>>> (set (REGX) (CONST_INT A)) >>>>> ... >>>>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>>>> >>>>> This patch enhances it to be able to transform >>>>> >>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>> ... >>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>>>> >>>>> to >>>>> >>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>> ... >>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>>>> >>>>> >>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>>>> improvement on one test: >>>>> >>>>> idctrn01: 6% >>>>> >>>>> No meaningful changes on other tests. >>>>> >>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>>>> regression on CINT2000 and 0.1% improvement on CFP2000. >>>>> >>>>> Bootstrapped and regression tested on x86_64. >>>>> >>>>> Is it OK? >>>> >>>> Thanks for your patience, particularly when my requests for additional >>>> data. I think we're OK with the benchmarking #s, particularly looking at >>>> the amount of noise in the runs we're seeing for x86_64. >>>> >>>> Appoved, >>>> >>> Jeff, >>> >>> Thanks! I learned a lot during the procedure. >>> >>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote: >>>> So what kinds of changes are there that would explain a 1.48% drop on >>>> 179.art? If you can verify that the code is in fact identical, and the >>>> performance change is noise, the patch is ok. >>>> >>> Bernd, >>> >>> I didn't realize it was a conditional approval until I asked you privately. >>> Thanks! >>> >>> >>> I have committed it on trunk. 
>>> >> >> I think it breaks bootstrap on Linux/ia32: >> >> ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs': >> ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be >> used uninitialized in this function [-Werror=uninitialized] >> ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here >> cc1: all warnings being treated as errors >> > > I opened: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921 > Thanks for catching that. I did the bootstrapping and regression testing when I submitted my patch, but I didn't redo them before committing. I saw you have already committed the fix: http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01021.html But as amylaar (sorry I don't know the real name since the bugzilla is broken now) pointed out in the bugzilla, this is a false warning. I ran several tests of my patch against old revisions this morning: r161569 with my patch bootstraps OK, while r161614 with my patch fails bootstrap. So some revision between them introduced the false warning; we need to find out which one, and whether it's better to fix that revision than to add an initialization to min_regno. Regards,
On 07/13/2010 12:50 PM, Jie Zhang wrote: > On 07/13/2010 02:38 AM, H.J. Lu wrote: >> On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu<hjl.tools@gmail.com> wrote: >>> On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang<jie@codesourcery.com> wrote: >>>> On 07/13/2010 12:37 AM, Jeff Law wrote: >>>>> >>>>> On 06/30/10 01:45, Jie Zhang wrote: >>>>>> >>>>>> Currently reload_cse_move2add can transform >>>>>> >>>>>> (set (REGX) (CONST_INT A)) >>>>>> ... >>>>>> (set (REGX) (CONST_INT B)) >>>>>> >>>>>> to >>>>>> >>>>>> (set (REGX) (CONST_INT A)) >>>>>> ... >>>>>> (set (REGX) (plus (REGX) (CONST_INT B-A))) >>>>>> >>>>>> This patch enhances it to be able to transform >>>>>> >>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>>> ... >>>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) >>>>>> >>>>>> to >>>>>> >>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) >>>>>> ... >>>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) >>>>>> >>>>>> >>>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance >>>>>> improvement on one test: >>>>>> >>>>>> idctrn01: 6% >>>>>> >>>>>> No meaningful changes on other tests. >>>>>> >>>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% >>>>>> regression on CINT2000 and 0.1% improvement on CFP2000. >>>>>> >>>>>> Bootstrapped and regression tested on x86_64. >>>>>> >>>>>> Is it OK? >>>>> >>>>> Thanks for your patience, particularly when my requests for additional >>>>> data. I think we're OK with the benchmarking #s, particularly >>>>> looking at >>>>> the amount of noise in the runs we're seeing for x86_64. >>>>> >>>>> Appoved, >>>>> >>>> Jeff, >>>> >>>> Thanks! I learned a lot during the procedure. >>>> >>>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote: >>>>> So what kinds of changes are there that would explain a 1.48% drop on >>>>> 179.art? If you can verify that the code is in fact identical, and the >>>>> performance change is noise, the patch is ok. 
>>>>> >>>> Bernd, >>>> >>>> I didn't realize it was a conditional approval until I asked you >>>> privately. >>>> Thanks! >>>> >>>> >>>> I have committed it on trunk. >>>> >>> >>> I think it breaks bootstrap on Linux/ia32: >>> >>> ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs': >>> ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be >>> used uninitialized in this function [-Werror=uninitialized] >>> ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was >>> declared here >>> cc1: all warnings being treated as errors >>> >> >> I opened: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921 >> > Thanks for catching that. I did the bootstraping and regression testing > when I submitted my patch. But I didn't do them before committing. > > I saw your have already committed the fix: > > http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01021.html > > But as amylaar (sorry I don't know the real name since the bugzilla is > broken now) pointed out in the bugzilla, this is a false warning. I did > several testing of my patch with old revisions this morning. I found > r161569 with my patch bootstrap OK. r161614 with my patch fails > bootstrap. So there is some revision between them caused the false > warning and we need to find out which and if it's better to fix that > revision than adding initialization to min_regno. > It's revision 161605. I will add a followup to the corresponding thread.
* postreload.c (reg_symbol_ref[]): New. (move2add_use_add2_insn): New. (move2add_use_add3_insn): New. (reload_cse_move2add): Handle SYMBOL + OFFSET case. (move2add_note_store): Likewise. Index: postreload.c =================================================================== --- postreload.c (revision 290047) +++ postreload.c (working copy) @@ -1156,17 +1156,19 @@ reload_combine_note_use (rtx *xp, rtx in information about register contents we have would be costly, so we use move2add_last_label_luid to note where the label is and then later disable any optimization that would cross it. - reg_offset[n] / reg_base_reg[n] / reg_mode[n] are only valid if - reg_set_luid[n] is greater than move2add_last_label_luid. */ + reg_offset[n] / reg_base_reg[n] / reg_symbol_ref[n] / reg_mode[n] + are only valid if reg_set_luid[n] is greater than + move2add_last_label_luid. */ static int reg_set_luid[FIRST_PSEUDO_REGISTER]; /* If reg_base_reg[n] is negative, register n has been set to - reg_offset[n] in mode reg_mode[n] . + reg_offset[n] or reg_symbol_ref[n] + reg_offset[n] in mode reg_mode[n]. If reg_base_reg[n] is non-negative, register n has been set to the sum of reg_offset[n] and the value of register reg_base_reg[n] before reg_set_luid[n], calculated in mode reg_mode[n] . */ static HOST_WIDE_INT reg_offset[FIRST_PSEUDO_REGISTER]; static int reg_base_reg[FIRST_PSEUDO_REGISTER]; +static rtx reg_symbol_ref[FIRST_PSEUDO_REGISTER]; static enum machine_mode reg_mode[FIRST_PSEUDO_REGISTER]; /* move2add_luid is linearly increased while scanning the instructions @@ -1186,6 +1188,151 @@ static int move2add_last_label_luid; && TRULY_NOOP_TRUNCATION (GET_MODE_BITSIZE (OUTMODE), \ GET_MODE_BITSIZE (INMODE)))) +/* This function is called with INSN that sets REG to (SYM + OFF), + while REG is known to already have value (SYM + offset). + This function tries to change INSN into an add instruction + (set (REG) (plus (REG) (OFF - offset))) using the known value. 
+ It also updates the information about REG's known value. */ + +static void +move2add_use_add2_insn (rtx reg, rtx sym, rtx off, rtx insn) +{ + rtx pat = PATTERN (insn); + rtx src = SET_SRC (pat); + int regno = REGNO (reg); + rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[regno], + GET_MODE (reg)); + bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn)); + + /* (set (reg) (plus (reg) (const_int 0))) is not canonical; + use (set (reg) (reg)) instead. + We don't delete this insn, nor do we convert it into a + note, to avoid losing register notes or the return + value flag. jump2 already knows how to get rid of + no-op moves. */ + if (new_src == const0_rtx) + { + /* If the constants are different, this is a + truncation, that, if turned into (set (reg) + (reg)), would be discarded. Maybe we should + try a truncMN pattern? */ + if (INTVAL (off) == reg_offset [regno]) + validate_change (insn, &SET_SRC (pat), reg, 0); + } + else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed) + && have_add2_insn (reg, new_src)) + { + rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src); + validate_change (insn, &SET_SRC (pat), tem, 0); + } + else if (sym == NULL_RTX && GET_MODE (reg) != BImode) + { + enum machine_mode narrow_mode; + for (narrow_mode = GET_CLASS_NARROWEST_MODE (MODE_INT); + narrow_mode != VOIDmode + && narrow_mode != GET_MODE (reg); + narrow_mode = GET_MODE_WIDER_MODE (narrow_mode)) + { + if (have_insn_for (STRICT_LOW_PART, narrow_mode) + && ((reg_offset[regno] + & ~GET_MODE_MASK (narrow_mode)) + == (INTVAL (off) + & ~GET_MODE_MASK (narrow_mode)))) + { + rtx narrow_reg = gen_rtx_REG (narrow_mode, + REGNO (reg)); + rtx narrow_src = gen_int_mode (INTVAL (off), + narrow_mode); + rtx new_set = + gen_rtx_SET (VOIDmode, + gen_rtx_STRICT_LOW_PART (VOIDmode, + narrow_reg), + narrow_src); + if (validate_change (insn, &PATTERN (insn), + new_set, 0)) + break; + } + } + } + reg_set_luid[regno] = move2add_luid; + reg_base_reg[regno] = -1; + 
reg_mode[regno] = GET_MODE (reg); + reg_symbol_ref[regno] = sym; + reg_offset[regno] = INTVAL (off); +} + + +/* This function is called with INSN that sets REG to (SYM + OFF), + but REG doesn't have known value (SYM + offset). This function + tries to find another register which is known to already have + value (SYM + offset) and change INSN into an add instruction + (set (REG) (plus (the found register) (OFF - offset))) if such + a register is found. It also updates the information about + REG's known value. */ + +static void +move2add_use_add3_insn (rtx reg, rtx sym, rtx off, rtx insn) +{ + rtx pat = PATTERN (insn); + rtx src = SET_SRC (pat); + int regno = REGNO (reg); + int min_cost = INT_MAX; + int min_regno; + bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn)); + int i; + + for (i = 0; i < FIRST_PSEUDO_REGISTER; i++) + if (reg_set_luid[i] > move2add_last_label_luid + && reg_mode[i] == GET_MODE (reg) + && reg_base_reg[i] < 0 + && reg_symbol_ref[i] != NULL_RTX + && rtx_equal_p (sym, reg_symbol_ref[i])) + { + rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[i], + GET_MODE (reg)); + /* (set (reg) (plus (reg) (const_int 0))) is not canonical; + use (set (reg) (reg)) instead. + We don't delete this insn, nor do we convert it into a + note, to avoid losing register notes or the return + value flag. jump2 already knows how to get rid of + no-op moves. 
*/ + if (new_src == const0_rtx) + { + min_cost = 0; + min_regno = i; + break; + } + else + { + int cost = rtx_cost (new_src, PLUS, speed); + if (cost < min_cost) + { + min_cost = cost; + min_regno = i; + } + } + } + + if (min_cost < rtx_cost (src, SET, speed)) + { + rtx tem; + + tem = gen_rtx_REG (GET_MODE (reg), min_regno); + if (i != min_regno) + { + rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[min_regno], + GET_MODE (reg)); + tem = gen_rtx_PLUS (GET_MODE (reg), tem, new_src); + } + validate_change (insn, &SET_SRC (pat), tem, 0); + } + reg_set_luid[regno] = move2add_luid; + reg_base_reg[regno] = -1; + reg_mode[regno] = GET_MODE (reg); + reg_symbol_ref[regno] = sym; + reg_offset[regno] = INTVAL (off); +} + static void reload_cse_move2add (rtx first) { @@ -1193,7 +1340,13 @@ reload_cse_move2add (rtx first) rtx insn; for (i = FIRST_PSEUDO_REGISTER - 1; i >= 0; i--) - reg_set_luid[i] = 0; + { + reg_set_luid[i] = 0; + reg_offset[i] = 0; + reg_base_reg[i] = 0; + reg_symbol_ref[i] = NULL_RTX; + reg_mode[i] = VOIDmode; + } move2add_last_label_luid = 0; move2add_luid = 2; @@ -1241,65 +1394,11 @@ reload_cse_move2add (rtx first) (set (STRICT_LOW_PART (REGX)) (CONST_INT B)) */ - if (CONST_INT_P (src) && reg_base_reg[regno] < 0) + if (CONST_INT_P (src) + && reg_base_reg[regno] < 0 + && reg_symbol_ref[regno] == NULL_RTX) { - rtx new_src = gen_int_mode (INTVAL (src) - reg_offset[regno], - GET_MODE (reg)); - bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn)); - - /* (set (reg) (plus (reg) (const_int 0))) is not canonical; - use (set (reg) (reg)) instead. - We don't delete this insn, nor do we convert it into a - note, to avoid losing register notes or the return - value flag. jump2 already knows how to get rid of - no-op moves. */ - if (new_src == const0_rtx) - { - /* If the constants are different, this is a - truncation, that, if turned into (set (reg) - (reg)), would be discarded. Maybe we should - try a truncMN pattern? 
*/ - if (INTVAL (src) == reg_offset [regno]) - validate_change (insn, &SET_SRC (pat), reg, 0); - } - else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed) - && have_add2_insn (reg, new_src)) - { - rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src); - validate_change (insn, &SET_SRC (pat), tem, 0); - } - else if (GET_MODE (reg) != BImode) - { - enum machine_mode narrow_mode; - for (narrow_mode = GET_CLASS_NARROWEST_MODE (MODE_INT); - narrow_mode != VOIDmode - && narrow_mode != GET_MODE (reg); - narrow_mode = GET_MODE_WIDER_MODE (narrow_mode)) - { - if (have_insn_for (STRICT_LOW_PART, narrow_mode) - && ((reg_offset[regno] - & ~GET_MODE_MASK (narrow_mode)) - == (INTVAL (src) - & ~GET_MODE_MASK (narrow_mode)))) - { - rtx narrow_reg = gen_rtx_REG (narrow_mode, - REGNO (reg)); - rtx narrow_src = gen_int_mode (INTVAL (src), - narrow_mode); - rtx new_set = - gen_rtx_SET (VOIDmode, - gen_rtx_STRICT_LOW_PART (VOIDmode, - narrow_reg), - narrow_src); - if (validate_change (insn, &PATTERN (insn), - new_set, 0)) - break; - } - } - } - reg_set_luid[regno] = move2add_luid; - reg_mode[regno] = GET_MODE (reg); - reg_offset[regno] = INTVAL (src); + move2add_use_add2_insn (reg, NULL_RTX, src, insn); continue; } @@ -1369,6 +1468,51 @@ reload_cse_move2add (rtx first) } } } + + /* Try to transform + (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) + ... + (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B)))) + to + (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A)))) + ... 
+ (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A)))) */ + if ((GET_CODE (src) == SYMBOL_REF + || (GET_CODE (src) == CONST + && GET_CODE (XEXP (src, 0)) == PLUS + && GET_CODE (XEXP (XEXP (src, 0), 0)) == SYMBOL_REF + && CONST_INT_P (XEXP (XEXP (src, 0), 1)))) + && dbg_cnt (cse2_move2add)) + { + rtx sym, off; + + if (GET_CODE (src) == SYMBOL_REF) + { + sym = src; + off = const0_rtx; + } + else + { + sym = XEXP (XEXP (src, 0), 0); + off = XEXP (XEXP (src, 0), 1); + } + + /* If the reg already contains the value which is sum of + sym and some constant value, we can use an add2 insn. */ + if (reg_set_luid[regno] > move2add_last_label_luid + && MODES_OK_FOR_MOVE2ADD (GET_MODE (reg), reg_mode[regno]) + && reg_base_reg[regno] < 0 + && reg_symbol_ref[regno] != NULL_RTX + && rtx_equal_p (sym, reg_symbol_ref[regno])) + move2add_use_add2_insn (reg, sym, off, insn); + + /* Otherwise, we have to find a register whose value is sum + of sym and some constant value. */ + else + move2add_use_add3_insn (reg, sym, off, insn); + + continue; + } } for (note = REG_NOTES (insn); note; note = XEXP (note, 1)) @@ -1382,7 +1526,7 @@ reload_cse_move2add (rtx first) reg_set_luid[regno] = 0; } } - note_stores (PATTERN (insn), move2add_note_store, NULL); + note_stores (PATTERN (insn), move2add_note_store, insn); /* If INSN is a conditional branch, we try to extract an implicit set out of it. */ @@ -1404,7 +1548,7 @@ reload_cse_move2add (rtx first) { rtx implicit_set = gen_rtx_SET (VOIDmode, XEXP (cnd, 0), XEXP (cnd, 1)); - move2add_note_store (SET_DEST (implicit_set), implicit_set, 0); + move2add_note_store (SET_DEST (implicit_set), implicit_set, insn); } } @@ -1422,13 +1566,15 @@ reload_cse_move2add (rtx first) } } -/* SET is a SET or CLOBBER that sets DST. +/* SET is a SET or CLOBBER that sets DST. DATA is the insn which + contains SET. Update reg_set_luid, reg_offset and reg_base_reg accordingly. Called from reload_cse_move2add via note_stores. 
*/ static void -move2add_note_store (rtx dst, const_rtx set, void *data ATTRIBUTE_UNUSED) +move2add_note_store (rtx dst, const_rtx set, void *data) { + rtx insn = (rtx) data; unsigned int regno = 0; unsigned int nregs = 0; unsigned int i; @@ -1462,6 +1608,38 @@ move2add_note_store (rtx dst, const_rtx nregs = hard_regno_nregs[regno][mode]; if (SCALAR_INT_MODE_P (GET_MODE (dst)) + && nregs == 1 && GET_CODE (set) == SET) + { + rtx note, sym = NULL_RTX; + HOST_WIDE_INT off; + + note = find_reg_equal_equiv_note (insn); + if (note && GET_CODE (XEXP (note, 0)) == SYMBOL_REF) + { + sym = XEXP (note, 0); + off = 0; + } + else if (note && GET_CODE (XEXP (note, 0)) == CONST + && GET_CODE (XEXP (XEXP (note, 0), 0)) == PLUS + && GET_CODE (XEXP (XEXP (XEXP (note, 0), 0), 0)) == SYMBOL_REF + && CONST_INT_P (XEXP (XEXP (XEXP (note, 0), 0), 1))) + { + sym = XEXP (XEXP (XEXP (note, 0), 0), 0); + off = INTVAL (XEXP (XEXP (XEXP (note, 0), 0), 1)); + } + + if (sym != NULL_RTX) + { + reg_base_reg[regno] = -1; + reg_symbol_ref[regno] = sym; + reg_offset[regno] = off; + reg_mode[regno] = mode; + reg_set_luid[regno] = move2add_luid; + return; + } + } + + if (SCALAR_INT_MODE_P (GET_MODE (dst)) && nregs == 1 && GET_CODE (set) == SET && GET_CODE (SET_DEST (set)) != ZERO_EXTRACT && GET_CODE (SET_DEST (set)) != STRICT_LOW_PART) @@ -1521,6 +1699,7 @@ move2add_note_store (rtx dst, const_rtx case CONST_INT: /* Start tracking the register as a constant. */ reg_base_reg[regno] = -1; + reg_symbol_ref[regno] = NULL_RTX; reg_offset[regno] = INTVAL (SET_SRC (set)); /* We assign the same luid to all registers set to constants. 
*/ reg_set_luid[regno] = move2add_last_label_luid + 1; @@ -1541,6 +1720,7 @@ move2add_note_store (rtx dst, const_rtx if (reg_set_luid[base_regno] <= move2add_last_label_luid) { reg_base_reg[base_regno] = base_regno; + reg_symbol_ref[base_regno] = NULL_RTX; reg_offset[base_regno] = 0; reg_set_luid[base_regno] = move2add_luid; reg_mode[base_regno] = mode; @@ -1554,6 +1734,7 @@ move2add_note_store (rtx dst, const_rtx /* Copy base information from our base register. */ reg_set_luid[regno] = reg_set_luid[base_regno]; reg_base_reg[regno] = reg_base_reg[base_regno]; + reg_symbol_ref[regno] = reg_symbol_ref[base_regno]; /* Compute the sum of the offsets or constants. */ reg_offset[regno] = trunc_int_for_mode (offset