Patchwork [RFC,ARM] later split of symbol_refs

login
register
mail settings
Submitter Dmitry Melnik
Date June 27, 2012, 2:58 p.m.
Message ID <4FEB1F9C.40004@ispras.ru>
Download mbox | patch
Permalink /patch/167662/
State New
Headers show

Comments

Dmitry Melnik - June 27, 2012, 2:58 p.m.
Hi,

We'd like to note about CodeSourcery's patch for ARM backend, from which 
GCC mainline can gain 4% on SPEC2K INT: 
http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch 
(also the patch is attached).

Originally, we noticed that GNU Go works 6% faster on cortex-a8 with 
-fno-gcse.  After profiling we found that this is most likely caused by 
cache misses when accessing global variables.  GCC generates ldr 
instructions for them, while this can be avoided by emitting movt/movw 
pair for such cases. RTL expressions for these instructions is high_ and 
lo_sum.  Currently, symbol_ref expands as high_ and lo_sum but then 
cprop1 decides that this is redundant and merges them into one load insn.

The problem was also found by Linaro community: 
https://bugs.launchpad.net/gcc-linaro/+bug/886124 .
Also there is a patch from codesourcery (attached), which was ported to 
linaro gcc 4.5, but is missing in later linaro releases.
This patch makes split of symbol_refs at the later stage (after cprop), 
instead of generating movt/movw at expand.

It fixed our test case on GNU Go.  Also we tested it on SPEC2K INT (ref) 
with GCC 4.8 snapshot from May 12, 2012 on cortex-a9 with -O2 and -mthumb:

             Base      Base      Base      Peak      Peak      Peak
Benchmarks  Ref Time  Run Time   Ratio    Ref Time  Run Time  Ratio
----------  --------  --------  --------  --------  -------- -------
164.gzip    1400      492       284     1400       497       282  -0.70%
175.vpr     1400      433       323     1400       458       306  -5.26%
176.gcc     1100      203       542     1100       198       557   2.77%
181.mcf     1800      529       340     1800       528       341   0.29%
186.crafty  1000      261       383     1000       256       391   2.09%
197.parser  1800      709       254     1800       701       257   1.18%
252.eon     1300      219       594     1300       202       644   8.42%
253.perlbmk 1800      389       463     1800       367       490   5.83%
254.gap     1100      259       425     1100       236       467   9.88%
255.vortex  1900      498       382     1900       442       430  12.57%
256.bzip2   1500      452       332     1500       424       354   6.63%
300.twolf   3000      916       328     3000       853       352   7.32%
SPECint_base2000                376
SPECint2000                                                  391   3.99%


SPEC2K INT grows by 4% (up to 12.5% on vortex; vpr slowdown is likely 
because of big variance on this test).

Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on 
cortex-a8 (thumb2 and ARM modes).

This patch can be applied to current trunk and passes regtest 
successfully on qemu-arm.
Maybe it will be good to have it in trunk?
If everybody agrees, we can take care of committing it.

--
Best regards,
   Dmitry
Steven Bosscher - June 27, 2012, 3:06 p.m.
On Wed, Jun 27, 2012 at 4:58 PM, Dmitry Melnik <dm@ispras.ru> wrote:
> This patch can be applied to current trunk and passes regtest successfully
> on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

If the patch is approved, can you please add a brief comment before
the define_split to explain why it's there and what it does?

Ciao!
Steven
Julian Brown - June 27, 2012, 3:53 p.m.
On Wed, 27 Jun 2012 18:58:36 +0400
Dmitry Melnik <dm@ispras.ru> wrote:

> This patch can be applied to current trunk and passes regtest 
> successfully on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

No objection from me (as the original author), FWIW.

Thanks!

Julian
Richard Earnshaw - June 27, 2012, 3:53 p.m.
On 27/06/12 15:58, Dmitry Melnik wrote:
> Hi,
> 
> We'd like to note about CodeSourcery's patch for ARM backend, from which 
> GCC mainline can gain 4% on SPEC2K INT: 
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch 
> (also the patch is attached).
> 
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with 
> -fno-gcse.  After profiling we found that this is most likely caused by 
> cache misses when accessing global variables.  GCC generates ldr 
> instructions for them, while this can be avoided by emitting movt/movw 
> pair for such cases. RTL expressions for these instructions is high_ and 
> lo_sum.  Currently, symbol_ref expands as high_ and lo_sum but then 
> cprop1 decides that this is redundant and merges them into one load insn.
> 
> The problem was also found by Linaro community: 
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .
> Also there is a patch from codesourcery (attached), which was ported to 
> linaro gcc 4.5, but is missing in later linaro releases.
> This patch makes split of symbol_refs at the later stage (after cprop), 
> instead of generating movt/movw at expand.
> 
> It fixed our test case on GNU Go.  Also we tested it on SPEC2K INT (ref) 
> with GCC 4.8 snapshot from May 12, 2012 on cortex-a9 with -O2 and -mthumb:
> 
>              Base      Base      Base      Peak      Peak      Peak
> Benchmarks  Ref Time  Run Time   Ratio    Ref Time  Run Time  Ratio
> ----------  --------  --------  --------  --------  -------- -------
> 164.gzip    1400      492       284     1400       497       282  -0.70%
> 175.vpr     1400      433       323     1400       458       306  -5.26%
> 176.gcc     1100      203       542     1100       198       557   2.77%
> 181.mcf     1800      529       340     1800       528       341   0.29%
> 186.crafty  1000      261       383     1000       256       391   2.09%
> 197.parser  1800      709       254     1800       701       257   1.18%
> 252.eon     1300      219       594     1300       202       644   8.42%
> 253.perlbmk 1800      389       463     1800       367       490   5.83%
> 254.gap     1100      259       425     1100       236       467   9.88%
> 255.vortex  1900      498       382     1900       442       430  12.57%
> 256.bzip2   1500      452       332     1500       424       354   6.63%
> 300.twolf   3000      916       328     3000       853       352   7.32%
> SPECint_base2000                376
> SPECint2000                                                  391   3.99%
> 
> 
> SPEC2K INT grows by 4% (up to 12.5% on vortex; vpr slowdown is likely 
> because of big variance on this test).
> 
> Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on 
> cortex-a8 (thumb2 and ARM modes).
> 
> This patch can be applied to current trunk and passes regtest 
> successfully on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.
> 
> --
> Best regards,
>    Dmitry
> 
> 
> gcc-4.5-linaro-r99369.patch
> 

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.

R.
Ramana Radhakrishnan - June 27, 2012, 3:55 p.m.
On 27 June 2012 15:58, Dmitry Melnik <dm@ispras.ru> wrote:
> Hi,
>
> We'd like to note about CodeSourcery's patch for ARM backend, from which GCC
> mainline can gain 4% on SPEC2K INT:
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
> (also the patch is attached).
>
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
> -fno-gcse.  After profiling we found that this is most likely caused by
> cache misses when accessing global variables.  GCC generates ldr
> instructions for them, while this can be avoided by emitting movt/movw pair
> for such cases. RTL expressions for these instructions is high_ and lo_sum.
>  Currently, symbol_ref expands as high_ and lo_sum but then cprop1 decides
> that this is redundant and merges them into one load insn.
>
> The problem was also found by Linaro community:
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .

The reason IIRC this isn't in our later releases is that it wasn't
thought beneficial enough to upstream. Now you've got some evidence to
the contrary.

> Also there is a patch from codesourcery (attached), which was ported to
> linaro gcc 4.5, but is missing in later linaro releases.
> This patch makes split of symbol_refs at the later stage (after cprop),
> instead of generating movt/movw at expand.

I must admit that I had been suggesting to Zhenqiang about turning
this off by tightening the movsi_insn predicates rather than adding a
split, but given that it appears to produce enough benefit in this
case I don't have any reasons to object ...

However it's interesting that this doesn't seem to help vpr ....


Ramana
Dmitry Melnik - June 29, 2012, 1:48 p.m.
On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:

 > I must admit that I had been suggesting to Zhenqiang about turning
 > this off by tightening the movsi_insn predicates rather than adding a
 > split, but given that it appears to produce enough benefit in this
 > case I don't have any reasons to object ...
 >
 > However it's interesting that this doesn't seem to help vpr ....

We retested vpr, but it just seems to be unstable:

base                             peak
time                              time

   175.vpr           1400       502          X     1400 526          X
   175.vpr           1400       500          X     1400 524          X
   175.vpr           1400       516          X     1400 526          X
   175.vpr           1400       492          X     1400 481          X
   175.vpr           1400       496          X     1400 485          X
   median 500                                524

However, the minimum time is still better with the patch.

And here are all 3 runs for previously reported data:

test    base ratio    peak ratio    median base    median peak improvement
164.gzip    284    281    284    282    -0.70%
164.gzip    284    282
164.gzip    285    283
175.vpr    329    306    323    306    -5.26%
175.vpr    323    305
175.vpr    306    307
176.gcc    542    554    542    557    2.77%
176.gcc    541    558
176.gcc    544    557
181.mcf    343    340    340    341    0.29%
181.mcf    339    342
181.mcf    340    341
186.crafty    383    399    383    391    2.09%
186.crafty    390    391
186.crafty    380    386
197.parser    254    257    254    257    1.18%
197.parser    254    257
197.parser    254    257
252.eon    591    644    594    644    8.42%
252.eon    598    644
252.eon    594    643
253.perlbmk    462    490    463    490    5.83%
253.perlbmk    463    490
253.perlbmk    463    490
254.gap    415    473    425    467    9.88%
254.gap    430    467
254.gap    425    464
255.vortex    384    430    382    430    12.57%
255.vortex    382    430
255.vortex    381    430
256.bzip2    331    354    332    354    6.63%
256.bzip2    332    354
256.bzip2    335    349
300.twolf    323    356    328    352    7.32%
300.twolf    347    337
300.twolf    328    352


--
Best regards,
   Dmitry
Ramana Radhakrishnan - June 29, 2012, 2:32 p.m.
On 29 June 2012 14:48, Dmitry Melnik <dm@ispras.ru> wrote:
>
> On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:
>
>> I must admit that I had been suggesting to Zhenqiang about turning
>> this off by tightening the movsi_insn predicates rather than adding a
>> split, but given that it appears to produce enough benefit in this
>> case I don't have any reasons to object ...
>>
>> However it's interesting that this doesn't seem to help vpr ....
>
> We retested vpr, but it just seems to be unstable:

Unfortunate but ok.

Ramana

Patch

2010-08-20  Jie Zhang  <jie@codesourcery.com>

	Merged from Sourcery G++ 4.4:

	gcc/
	2009-05-29  Julian Brown  <julian@codesourcery.com>
	Merged from Sourcery G++ 4.3:
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

 2010-08-18  Julian Brown  <julian@codesourcery.com>
 
 	Issue #9222

=== modified file 'gcc/config/arm/arm.md'
--- old/gcc/config/arm/arm.md	2010-08-20 16:41:37 +0000
+++ new/gcc/config/arm/arm.md	2010-08-23 14:39:12 +0000
@@ -5150,14 +5150,6 @@ 
 			       optimize && can_create_pseudo_p ());
           DONE;
         }
-
-      if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
     }
   else /* TARGET_THUMB1...  */
     {
@@ -5265,6 +5257,19 @@ 
   "
 )
 
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+	(match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*lhk")
 	(match_operand:SI 1 "general_operand"      "l, I,J,K,>,l,mi,l,*lhk"))]