| Submitter | Dmitry Melnik |
|---|---|
| Date | June 27, 2012, 2:58 p.m. |
| Message ID | <4FEB1F9C.40004@ispras.ru> |
| Download | mbox | patch |
| Permalink | /patch/167662/ |
| State | New |
| Headers | show |
Comments
On Wed, Jun 27, 2012 at 4:58 PM, Dmitry Melnik <dm@ispras.ru> wrote: > This patch can be applied to current trunk and passes regtest successfully > on qemu-arm. > Maybe it will be good to have it in trunk? > If everybody agrees, we can take care of committing it. If the patch is approved, can you please add a brief comment before the define_split to explain why it's there and what it does? Ciao! Steven
On Wed, 27 Jun 2012 18:58:36 +0400 Dmitry Melnik <dm@ispras.ru> wrote: > This patch can be applied to current trunk and passes regtest > successfully on qemu-arm. > Maybe it will be good to have it in trunk? > If everybody agrees, we can take care of committing it. No objection from me (as the original author), FWIW. Thanks! Julian
On 27/06/12 15:58, Dmitry Melnik wrote: > Hi, > > We'd like to note about CodeSourcery's patch for ARM backend, from which > GCC mainline can gain 4% on SPEC2K INT: > http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch > (also the patch is attached). > > Originally, we noticed that GNU Go works 6% faster on cortex-a8 with > -fno-gcse. After profiling we found that this is most likely caused by > cache misses when accessing global variables. GCC generates ldr > instructions for them, while this can be avoided by emitting movt/movw > pair for such cases. RTL expressions for these instructions is high_ and > lo_sum. Currently, symbol_ref expands as high_ and lo_sum but then > cprop1 decides that this is redundant and merges them into one load insn. > > The problem was also found by Linaro community: > https://bugs.launchpad.net/gcc-linaro/+bug/886124 . > Also there is a patch from codesourcery (attached), which was ported to > linaro gcc 4.5, but is missing in later linaro releases. > This patch makes split of symbol_refs at the later stage (after cprop), > instead of generating movt/movw at expand. > > It fixed our test case on GNU Go. Also we tested it on SPEC2K INT (ref) > with GCC 4.8 snapshot from May 12, 2012 on cortex-a9 with -O2 and -mthumb: > > Base Base Base Peak Peak Peak > Benchmarks Ref Time Run Time Ratio Ref Time Run Time Ratio > ---------- -------- -------- -------- -------- -------- ------- > 164.gzip 1400 492 284 1400 497 282 -0.70% > 175.vpr 1400 433 323 1400 458 306 -5.26% > 176.gcc 1100 203 542 1100 198 557 2.77% > 181.mcf 1800 529 340 1800 528 341 0.29% > 186.crafty 1000 261 383 1000 256 391 2.09% > 197.parser 1800 709 254 1800 701 257 1.18% > 252.eon 1300 219 594 1300 202 644 8.42% > 253.perlbmk 1800 389 463 1800 367 490 5.83% > 254.gap 1100 259 425 1100 236 467 9.88% > 255.vortex 1900 498 382 1900 442 430 12.57% > 256.bzip2 1500 452 332 1500 424 354 6.63% > 300.twolf 3000 916 328 3000 853 352 7.32% > SPECint_base2000 376 > SPECint2000 391 3.99% > > > SPEC2K INT grows by 4% (up to 12.5% on vortex; vpr slowdown is likely > because of big variance on this test). > > Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on > cortex-a8 (thumb2 and ARM modes). > > This patch can be applied to current trunk and passes regtest > successfully on qemu-arm. > Maybe it will be good to have it in trunk? > If everybody agrees, we can take care of committing it. > > -- > Best regards, > Dmitry > > > gcc-4.5-linaro-r99369.patch > Please update the ChangeLog entry (it's not appropriate to mention Sourcery G++) and add a comment as Steven has suggested. Otherwise OK. R.
On 27 June 2012 15:58, Dmitry Melnik <dm@ispras.ru> wrote: > Hi, > > We'd like to note about CodeSourcery's patch for ARM backend, from which GCC > mainline can gain 4% on SPEC2K INT: > http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch > (also the patch is attached). > > Originally, we noticed that GNU Go works 6% faster on cortex-a8 with > -fno-gcse. After profiling we found that this is most likely caused by > cache misses when accessing global variables. GCC generates ldr > instructions for them, while this can be avoided by emitting movt/movw pair > for such cases. RTL expressions for these instructions is high_ and lo_sum. > Currently, symbol_ref expands as high_ and lo_sum but then cprop1 decides > that this is redundant and merges them into one load insn. > > The problem was also found by Linaro community: > https://bugs.launchpad.net/gcc-linaro/+bug/886124 . The reason IIRC this isn't in our later releases is that it wasn't thought beneficial enough to upstream. Now you've got some evidence to the contrary. > Also there is a patch from codesourcery (attached), which was ported to > linaro gcc 4.5, but is missing in later linaro releases. > This patch makes split of symbol_refs at the later stage (after cprop), > instead of generating movt/movw at expand. I must admit that I had been suggesting to Zhenqiang about turning this off by tightening the movsi_insn predicates rather than adding a split, but given that it appears to produce enough benefit in this case I don't have any reasons to object ... However it's interesting that this doesn't seem to help vpr .... Ramana
On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote: > I must admit that I had been suggesting to Zhenqiang about turning > this off by tightening the movsi_insn predicates rather than adding a > split, but given that it appears to produce enough benefit in this > case I don't have any reasons to object ... > > However it's interesting that this doesn't seem to help vpr .... We retested vpr, but it just seems to be unstable: base peak time time 175.vpr 1400 502 X 1400 526 X 175.vpr 1400 500 X 1400 524 X 175.vpr 1400 516 X 1400 526 X 175.vpr 1400 492 X 1400 481 X 175.vpr 1400 496 X 1400 485 X median 500 524 However, the minimum time is still better with the patch. And here are all 3 runs for previously reported data: test base ratio peak ratio median base median peak improvement 164.gzip 284 281 284 282 -0.70% 164.gzip 284 282 164.gzip 285 283 175.vpr 329 306 323 306 -5.26% 175.vpr 323 305 175.vpr 306 307 176.gcc 542 554 542 557 2.77% 176.gcc 541 558 176.gcc 544 557 181.mcf 343 340 340 341 0.29% 181.mcf 339 342 181.mcf 340 341 186.crafty 383 399 383 391 2.09% 186.crafty 390 391 186.crafty 380 386 197.parser 254 257 254 257 1.18% 197.parser 254 257 197.parser 254 257 252.eon 591 644 594 644 8.42% 252.eon 598 644 252.eon 594 643 253.perlbmk 462 490 463 490 5.83% 253.perlbmk 463 490 253.perlbmk 463 490 254.gap 415 473 425 467 9.88% 254.gap 430 467 254.gap 425 464 255.vortex 384 430 382 430 12.57% 255.vortex 382 430 255.vortex 381 430 256.bzip2 331 354 332 354 6.63% 256.bzip2 332 354 256.bzip2 335 349 300.twolf 323 356 328 352 7.32% 300.twolf 347 337 300.twolf 328 352 -- Best regards, Dmitry
On 29 June 2012 14:48, Dmitry Melnik <dm@ispras.ru> wrote: > > On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote: > >> I must admit that I had been suggesting to Zhenqiang about turning >> this off by tightening the movsi_insn predicates rather than adding a >> split, but given that it appears to produce enough benefit in this >> case I don't have any reasons to object ... >> >> However it's interesting that this doesn't seem to help vpr .... > > We retested vpr, but it just seems to be unstable: Unfortunate but ok. Ramana
Patch
2010-08-20 Jie Zhang <jie@codesourcery.com> Merged from Sourcery G++ 4.4: gcc/ 2009-05-29 Julian Brown <julian@codesourcery.com> Merged from Sourcery G++ 4.3: * config/arm/arm.md (movsi): Don't split symbol refs here. (define_split): New. 2010-08-18 Julian Brown <julian@codesourcery.com> Issue #9222 === modified file 'gcc/config/arm/arm.md' --- old/gcc/config/arm/arm.md 2010-08-20 16:41:37 +0000 +++ new/gcc/config/arm/arm.md 2010-08-23 14:39:12 +0000 @@ -5150,14 +5150,6 @@ optimize && can_create_pseudo_p ()); DONE; } - - if (TARGET_USE_MOVT && !target_word_relocations - && GET_CODE (operands[1]) == SYMBOL_REF - && !flag_pic && !arm_tls_referenced_p (operands[1])) - { - arm_emit_movpair (operands[0], operands[1]); - DONE; - } } else /* TARGET_THUMB1... */ { @@ -5265,6 +5257,19 @@ " ) +(define_split + [(set (match_operand:SI 0 "arm_general_register_operand" "") + (match_operand:SI 1 "general_operand" ""))] + "TARGET_32BIT + && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF + && !flag_pic && !target_word_relocations + && !arm_tls_referenced_p (operands[1])" + [(clobber (const_int 0))] +{ + arm_emit_movpair (operands[0], operands[1]); + DONE; +}) + (define_insn "*thumb1_movsi_insn" [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*lhk") (match_operand:SI 1 "general_operand" "l, I,J,K,>,l,mi,l,*lhk"))]
Hi, We'd like to note about CodeSourcery's patch for ARM backend, from which GCC mainline can gain 4% on SPEC2K INT: http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch (also the patch is attached). Originally, we noticed that GNU Go works 6% faster on cortex-a8 with -fno-gcse. After profiling we found that this is most likely caused by cache misses when accessing global variables. GCC generates ldr instructions for them, while this can be avoided by emitting movt/movw pair for such cases. RTL expressions for these instructions is high_ and lo_sum. Currently, symbol_ref expands as high_ and lo_sum but then cprop1 decides that this is redundant and merges them into one load insn. The problem was also found by Linaro community: https://bugs.launchpad.net/gcc-linaro/+bug/886124 . Also there is a patch from codesourcery (attached), which was ported to linaro gcc 4.5, but is missing in later linaro releases. This patch makes split of symbol_refs at the later stage (after cprop), instead of generating movt/movw at expand. It fixed our test case on GNU Go. Also we tested it on SPEC2K INT (ref) with GCC 4.8 snapshot from May 12, 2012 on cortex-a9 with -O2 and -mthumb: Base Base Base Peak Peak Peak Benchmarks Ref Time Run Time Ratio Ref Time Run Time Ratio ---------- -------- -------- -------- -------- -------- ------- 164.gzip 1400 492 284 1400 497 282 -0.70% 175.vpr 1400 433 323 1400 458 306 -5.26% 176.gcc 1100 203 542 1100 198 557 2.77% 181.mcf 1800 529 340 1800 528 341 0.29% 186.crafty 1000 261 383 1000 256 391 2.09% 197.parser 1800 709 254 1800 701 257 1.18% 252.eon 1300 219 594 1300 202 644 8.42% 253.perlbmk 1800 389 463 1800 367 490 5.83% 254.gap 1100 259 425 1100 236 467 9.88% 255.vortex 1900 498 382 1900 442 430 12.57% 256.bzip2 1500 452 332 1500 424 354 6.63% 300.twolf 3000 916 328 3000 853 352 7.32% SPECint_base2000 376 SPECint2000 391 3.99% SPEC2K INT grows by 4% (up to 12.5% on vortex; vpr slowdown is likely because of big variance on this test). Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on cortex-a8 (thumb2 and ARM modes). This patch can be applied to current trunk and passes regtest successfully on qemu-arm. Maybe it will be good to have it in trunk? If everybody agrees, we can take care of committing it. -- Best regards, Dmitry