
[v4] Repeat jump threading after combine

Message ID 20181126121140.7176-1-iii@linux.ibm.com
State New
Series [v4] Repeat jump threading after combine

Commit Message

Ilya Leoshkevich Nov. 26, 2018, 12:11 p.m. UTC
Bootstrapped and regtested on x86_64-redhat-linux, s390x-redhat-linux
and ppc64le-redhat-linux.

Previous iteration:
https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00495.html

In the end, the main question was: does this make the code better on
architectures other than s390?
https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00993.html

Not sure whether it's already too late for this one, but I'd like to at
least post the updated code, my observations and SPEC CPU results.

- Code size decreases in most cases.  In general, the main side effect of
this patch is that after jump threading the bbro pass builds different
traces and reorders and merges basic blocks differently:

# x86_64-redhat-linux:
436.cactusADM 274479  insns -528 smaller    # maximum decrease
526.blender_r 2773303 insns -203 smaller
502.gcc_r     2262388 insns -142 smaller
403.gcc       815367  insns -106 smaller
...
525.x264_r    174450  insns +10 bigger      # maximum increase

# ppc64le-redhat-linux:
526.blender_r   3422613 insns -276 smaller  # maximum decrease
521.wrf_r       6008722 insns -228 smaller
520.omnetpp_r   612626  insns -52 smaller
...
435.gromacs     338597  insns +16 bigger    # maximum increase

- Compilation performance does not seem to be affected in a measurable
way.  According to -ftime-report, the total user time of the SPEC CPU
build used to be 26018s, and now it is 25985s, a difference of -0.12%.

- Run time differences are all over the place:

# x86_64-redhat-linux:
548.exchange2_r -1.82%
541.leela_r     -1.59%
538.imagick_r   -0.95%
520.omnetpp_r   -0.94%
403.gcc         -0.76%
447.dealII      -0.58%
526.blender_r   -0.56%
450.soplex      -0.51%
# skip |dt| < 0.5%
523.xalancbmk_r +0.52%
416.gamess      +0.61%
503.bwaves_r    +0.62%
445.gobmk       +0.66%
456.hmmer       +0.70%
549.fotonik3d_r +0.74%
471.omnetpp     +0.99%
459.GemsFDTD    +1.09%
554.roms_r      +1.30%
500.perlbench_r +1.56%
483.xalancbmk   +1.60%

# ppc64le-redhat-linux:
511.povray_r       -1.29%
482.sphinx3        -0.65%
456.hmmer          -0.53%
519.lbm_r          -0.51%
# skip |dt| < 0.5%
549.fotonik3d_r    +1.13%
403.gcc            +1.76%
500.perlbench_r    +2.35%

I've investigated the 483.xalancbmk and 500.perlbench_r regressions on
x86_64.

Even though the total 483.xalancbmk code size decreases slightly, we get
4% more icache misses and 25% more stalls as a result.  I couldn't
pinpoint this to a particular function or line of code - could it be due
to generally worsened locality somehow?

500.perlbench_r has 25% more indirect branch mispredicts, particularly
when perl_run ends up calling Perl_pp_rv2av, Perl_pp_gvsv and
Perl_pp_nextstate.  I have to admit I don't know what could have caused
that.



Consider the following RTL:

(insn (set (reg 65) (if_then_else (eq %cc 0) 1 0)))
(insn (parallel [(set %cc (compare (reg 65) 0)) (clobber %scratch)]))
(jump_insn (set %pc (if_then_else (ne %cc 0) (label_ref 23) %pc)))

Combine simplifies this into:

(note NOTE_INSN_DELETED)
(note NOTE_INSN_DELETED)
(jump_insn (set %pc (if_then_else (eq %cc 0) (label_ref 23) %pc)))

opening up the possibility to perform jump threading.
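
For a concrete source-level picture: the new s390 test included below
(gcc.target/s390/pr80080-4.c) exercises this pattern.  The sketch here
just restates it, with the mapping to the insns above spelled out in
comments; the names are taken from that test, and the comments assume
the usual s390 expansion where the compare-and-swap instruction sets
the condition code:

/* Sketch mirroring gcc/testsuite/gcc.target/s390/pr80080-4.c.  */
extern void bar (int *mem);

void
foo4 (int *mem)
{
  int oldval = 0;
  /* The compare-and-swap sets %cc; the builtin's boolean result is
     materialized into a register - (reg 65) in the RTL above.  */
  if (!__atomic_compare_exchange_n (mem, &oldval, 1, 1,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    /* The "if (!...)" re-compares that register against 0 and
       branches - the compare/branch pair that combine folds into a
       single conditional jump on %cc.  */
    bar (mem);
}

Once combine has folded the re-compare away, the remaining conditional
jump can be threaded so that the failure path branches straight to the
call to bar, which is roughly what the scan-assembler pattern in the
new test checks for.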

gcc/ChangeLog:

2018-09-19  Ilya Leoshkevich  <iii@linux.ibm.com>

	PR target/80080
	* cfgcleanup.c (class pass_postreload_jump): New pass.
	(pass_postreload_jump::execute): Likewise.
	(make_pass_postreload_jump): Likewise.
	* passes.def: Add pass_postreload_jump before
	pass_postreload_cse.
	* tree-pass.h (make_pass_postreload_jump): New pass.

gcc/testsuite/ChangeLog:

2018-09-05  Ilya Leoshkevich  <iii@linux.ibm.com>

	PR target/80080
	* gcc.target/s390/pr80080-4.c: New test.
---
 gcc/cfgcleanup.c                          | 42 +++++++++++++++++++++++
 gcc/passes.def                            |  1 +
 gcc/testsuite/gcc.target/s390/pr80080-4.c | 16 +++++++++
 gcc/tree-pass.h                           |  1 +
 4 files changed, 60 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/s390/pr80080-4.c

Comments

Segher Boessenkool Nov. 26, 2018, 3:07 p.m. UTC | #1
Hi!

On Mon, Nov 26, 2018 at 01:11:40PM +0100, Ilya Leoshkevich wrote:
> In the end, the main question was: does this make the code better on
> architectures other than s390?
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00993.html

Yup.

> - Code size decreases in most cases.  In general, the main side effect of
> this patch is that after jump threading the bbro pass builds different
> traces and reorders and merges basic blocks differently:

It's (almost) all in the noise though.

> # ppc64le-redhat-linux:
> 511.povray_r       -1.29%
> 482.sphinx3        -0.65%
> 456.hmmer          -0.53%
> 519.lbm_r          -0.51%
> # skip |dt| < 0.5%
> 549.fotonik3d_r    +1.13%
> 403.gcc            +1.76%
> 500.perlbench_r    +2.35%

2% degradation on gcc and perlbench isn't really acceptable.  It is
certainly possible this is an uarch effect of indirect jumps and we are
just very unlucky now (and were lucky before), but this doesn't sound
good to me :-/

What did you run this on?  p8?


Segher
Ilya Leoshkevich Nov. 26, 2018, 3:26 p.m. UTC | #2
> Am 26.11.2018 um 16:07 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
> 
>> # ppc64le-redhat-linux:
>> 511.povray_r       -1.29%
>> 482.sphinx3        -0.65%
>> 456.hmmer          -0.53%
>> 519.lbm_r          -0.51%
>> # skip |dt| < 0.5%
>> 549.fotonik3d_r    +1.13%
>> 403.gcc            +1.76%
>> 500.perlbench_r    +2.35%
> 
> 2% degradation on gcc and perlbench isn't really acceptable.  It is
> certainly possible this is an uarch effect of indirect jumps and we are
> just very unlucky now (and were lucky before), but this doesn't sound
> good to me :-/
> 
> What did you run this on?  p8?

That was p9 (gcc135).
Ilya Leoshkevich Nov. 27, 2018, 4:07 p.m. UTC | #3
> Am 26.11.2018 um 16:26 schrieb Ilya Leoshkevich <iii@linux.ibm.com>:
> 
>> Am 26.11.2018 um 16:07 schrieb Segher Boessenkool <segher@kernel.crashing.org>:
>> 
>>> # ppc64le-redhat-linux:
>>> 511.povray_r       -1.29%
>>> 482.sphinx3        -0.65%
>>> 456.hmmer          -0.53%
>>> 519.lbm_r          -0.51%
>>> # skip |dt| < 0.5%
>>> 549.fotonik3d_r    +1.13%
>>> 403.gcc            +1.76%
>>> 500.perlbench_r    +2.35%
>> 
>> 2% degradation on gcc and perlbench isn't really acceptable.  It is
>> certainly possible this is an uarch effect of indirect jumps and we are
>> just very unlucky now (and were lucky before), but this doesn't sound
>> good to me :-/
>> 
>> What did you run this on?  p8?
> 
> That was p9 (gcc135).

I've had a look at the gcc regression with perf.  I made a 5x re-run and
confirmed that the run time grew from 225.76s to 228.82s (+1.4%).

perf stat shows that the slow version consumes an additional ~11e9 cycles:

   856,588,095,385      cycles:u                  #    3.740 GHz                      (33.33%)
    36,451,588,171      stalled-cycles-frontend:u #    4.26% frontend cycles idle     (50.01%)
   438,654,175,652      stalled-cycles-backend:u  #   51.21% backend cycles idle      (16.68%)
   937,926,993,826      instructions:u            #    1.09  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.36%)
   205,289,212,856      branches:u                #  896.253 M/sec                    (50.02%)
     9,019,757,337      branch-misses:u           #    4.39% of all branches          (16.65%)

vs

   867,688,505,674      cycles:u                  #    3.731 GHz                      (33.34%)  Δ=11100410289 (+1.29%)
    36,672,094,462      stalled-cycles-frontend:u #    4.23% frontend cycles idle     (50.02%)  Δ=  220506291 (+0.60%)
   438,837,922,096      stalled-cycles-backend:u  #   50.58% backend cycles idle      (16.68%)  Δ=  183746444 (+0.04%)
   937,918,212,318      instructions:u            #    1.08  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.37%)
   205,201,306,341      branches:u                #  882.403 M/sec                    (50.02%)
     9,072,857,028      branch-misses:u           #    4.42% of all branches          (16.65%)  Δ=   53099691 (+0.58%)

It also shows that the slowdown cannot be explained by pipeline stalls,
additional instructions or branch misses (the latter could still be the
case only if each additional branch miss somehow translated to ~200
cycles on p9: 11.1e9 extra cycles / 53.1e6 extra misses is about 209).

perf diff -c wdiff:1,1 shows that there is just one function
(htab_traverse) that is significantly slower now:

     2.98%     11768891764  exe                [.] htab_traverse
     1.91%       563949986  exe                [.] compute_dominance_frontiers_1

The additional cycles consumed by this function match the overall
number of additionally consumed cycles, and the contribution of the
runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
think it's really just this one function.

However, the generated assembly is completely identical in both cases!

I saw similar situations in the past, so I tried adding a nop to
htab_traverse:

--- hashtab.c
+++ hashtab.c
@@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
      htab_trav callback;
      PTR info;
 {
+  __asm__ volatile("nop\n");
+
   PTR *slot = htab->entries;
   PTR *limit = slot + htab->size;

and made a 5x re-run.  The new measurements are 227.01s and 227.44s
(+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
looks like noise.

Can this be explained by some microarchitectural quirk after all?
Segher Boessenkool Nov. 28, 2018, 8:51 p.m. UTC | #4
Hi!

On Tue, Nov 27, 2018 at 05:07:11PM +0100, Ilya Leoshkevich wrote:
> perf diff -c wdiff:1,1 shows that there is just one function
> (htab_traverse) that is significantly slower now:
> 
>      2.98%     11768891764  exe                [.] htab_traverse
>      1.91%       563949986  exe                [.] compute_dominance_frontiers_1
> 
> The additional cycles consumed by this function match the overall
> number of additionally consumed cycles, and the contribution of the
> runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
> think it's really just this one function.
> 
> However, the generated assembly is completely identical in both cases!

Ugh.  We have seen this before :-(

Thanks for investigating.  I don't consider the Power degradation to really be
caused by your patch, then.

> I saw similar situations in the past, so I tried adding a nop to
> htab_traverse:
> 
> --- hashtab.c
> +++ hashtab.c
> @@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
>       htab_trav callback;
>       PTR info;
>  {
> +  __asm__ volatile("nop\n");
> +
>    PTR *slot = htab->entries;
>    PTR *limit = slot + htab->size;
> 
> and made a 5x re-run.  The new measurements are 227.01s and 227.44s
> (+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
> looks like noise.
> 
> Can this be explained by some microarchitectural quirk after all?

Two frequent branch targets that get thrown into the same bin for prediction.
Results change based on random compiler changes, ASLR settings, phase of the
moon, how many people in your neighbourhood have had porridge for breakfast
this morning, etc.


Segher
Jeff Law Nov. 29, 2018, 3:35 a.m. UTC | #5
On 11/28/18 1:51 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Nov 27, 2018 at 05:07:11PM +0100, Ilya Leoshkevich wrote:
>> perf diff -c wdiff:1,1 shows that there is just one function
>> (htab_traverse) that is significantly slower now:
>>
>>      2.98%     11768891764  exe                [.] htab_traverse
>>      1.91%       563949986  exe                [.] compute_dominance_frontiers_1
>>
>> The additional cycles consumed by this function match the overall
>> number of additionally consumed cycles, and the contribution of the
>> runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
>> think it's really just this one function.
>>
>> However, the generated assembly is completely identical in both cases!
> 
> Ugh.  We have seen this before :-(
> 
> Thanks for investigating.  I don't consider the Power degradation to really be
> caused by your patch, then.
> 
>> I saw similar situations in the past, so I tried adding a nop to
>> htab_traverse:
>>
>> --- hashtab.c
>> +++ hashtab.c
>> @@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
>>       htab_trav callback;
>>       PTR info;
>>  {
>> +  __asm__ volatile("nop\n");
>> +
>>    PTR *slot = htab->entries;
>>    PTR *limit = slot + htab->size;
>>
>> and made a 5x re-run.  The new measurements are 227.01s and 227.44s
>> (+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
>> looks like noise.
>>
>> Can this be explained by some microarchitectural quirk after all?
> 
> Two frequent branch targets that get thrown into the same bin for prediction.
> Results change based on random compiler changes, ASLR settings, phase of the
> moon, how many people in your neighbourhood have had porridge for breakfast
> this morning, etc.
FWIW, I've found the hashtable code particularly vulnerable to this kind
of performance jitter.  I've long suspected it's more related to the
data locations, as I can see the jitter with the same binary running
under valgrind/cachegrind control.  ASLR is the most likely culprit
in my mind.

However, in this case it seems different -- adding a NOP is changing the
instruction stream.  Could be collisions in the branch predictors or
something similar.

Ilya, can you repost the final patch?
Jeff

> 
> 
> Segher
>

Patch

diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c
index 4a5dc29d14f..bc4a78889db 100644
--- a/gcc/cfgcleanup.c
+++ b/gcc/cfgcleanup.c
@@ -3259,6 +3259,48 @@  make_pass_jump (gcc::context *ctxt)
 
 namespace {
 
+const pass_data pass_data_postreload_jump =
+{
+  RTL_PASS, /* type */
+  "postreload_jump", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_JUMP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_postreload_jump : public rtl_opt_pass
+{
+public:
+  pass_postreload_jump (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_postreload_jump, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual unsigned int execute (function *);
+
+}; // class pass_postreload_jump
+
+unsigned int
+pass_postreload_jump::execute (function *)
+{
+  cleanup_cfg (flag_thread_jumps ? CLEANUP_THREADING : 0);
+  return 0;
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_postreload_jump (gcc::context *ctxt)
+{
+  return new pass_postreload_jump (ctxt);
+}
+
+namespace {
+
 const pass_data pass_data_jump2 =
 {
   RTL_PASS, /* type */
diff --git a/gcc/passes.def b/gcc/passes.def
index 82ad9404b9e..0079fecef32 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -458,6 +458,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_reload);
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
+	  NEXT_PASS (pass_postreload_jump);
 	  NEXT_PASS (pass_postreload_cse);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
diff --git a/gcc/testsuite/gcc.target/s390/pr80080-4.c b/gcc/testsuite/gcc.target/s390/pr80080-4.c
new file mode 100644
index 00000000000..5fc6a558008
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/pr80080-4.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile { target { lp64 } } } */
+/* { dg-options "-march=z196 -O2" } */
+
+extern void bar(int *mem);
+
+void foo4(int *mem)
+{
+  int oldval = 0;
+  if (!__atomic_compare_exchange_n (mem, (void *) &oldval, 1,
+				    1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+    {
+      bar (mem);
+    }
+}
+
+/* { dg-final { scan-assembler {(?n)\n\tlt\t.*\n\tjne\t(\.L\d+)\n(.*\n)*\tcs\t.*\n\tber\t%r14\n\1:\n\tjg\tbar\n} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 2f8779ee4b8..b20d34c15e9 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -579,6 +579,7 @@  extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
+extern rtl_opt_pass *make_pass_postreload_jump (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);