| Message ID | 20181126121140.7176-1-iii@linux.ibm.com |
|---|---|
| State | New |
| Series | [v4] Repeat jump threading after combine |
Hi!

On Mon, Nov 26, 2018 at 01:11:40PM +0100, Ilya Leoshkevich wrote:
> In the end, the main question was: does this make the code better on
> architectures other than s390?
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00993.html

Yup.

> - Code size decreases in most cases. In general, the main side-effect of
>   this patch is that after jump threading bbro pass builds different
>   traces and reorders and merges basic blocks differently:

It's (almost) all in the noise though.

> # ppc64le-redhat-linux:
> 511.povray_r    -1.29%
> 482.sphinx3     -0.65%
> 456.hmmer       -0.53%
> 519.lbm_r       -0.51%
> # skip |dt| < 0.5%
> 549.fotonik3d_r +1.13%
> 403.gcc         +1.76%
> 500.perlbench_r +2.35%

2% degradation on gcc and perlbench isn't really acceptable.  It is
certainly possible this is an uarch effect of indirect jumps and we are
just very unlucky now (and were lucky before), but this doesn't sound
good to me :-/

What did you run this on?  p8?


Segher
> On 26.11.2018, at 16:07, Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
>> # ppc64le-redhat-linux:
>> 511.povray_r    -1.29%
>> 482.sphinx3     -0.65%
>> 456.hmmer       -0.53%
>> 519.lbm_r       -0.51%
>> # skip |dt| < 0.5%
>> 549.fotonik3d_r +1.13%
>> 403.gcc         +1.76%
>> 500.perlbench_r +2.35%
>
> 2% degradation on gcc and perlbench isn't really acceptable.  It is
> certainly possible this is an uarch effect of indirect jumps and we are
> just very unlucky now (and were lucky before), but this doesn't sound
> good to me :-/
>
> What did you run this on?  p8?

That was p9 (gcc135).
> On 26.11.2018, at 16:26, Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
>> On 26.11.2018, at 16:07, Segher Boessenkool <segher@kernel.crashing.org> wrote:
>>
>>> # ppc64le-redhat-linux:
>>> 511.povray_r    -1.29%
>>> 482.sphinx3     -0.65%
>>> 456.hmmer       -0.53%
>>> 519.lbm_r       -0.51%
>>> # skip |dt| < 0.5%
>>> 549.fotonik3d_r +1.13%
>>> 403.gcc         +1.76%
>>> 500.perlbench_r +2.35%
>>
>> 2% degradation on gcc and perlbench isn't really acceptable.  It is
>> certainly possible this is an uarch effect of indirect jumps and we are
>> just very unlucky now (and were lucky before), but this doesn't sound
>> good to me :-/
>>
>> What did you run this on?  p8?
>
> That was p9 (gcc135).

I've had a look at the gcc regression with perf.  I made a 5x re-run,
and confirmed that the run time grew from 225.76s to 228.82s (+1.4%).

perf stat shows that the slow version consumes an additional ~11E+9
cycles:

   856,588,095,385  cycles:u                  #  3.740 GHz                      (33.33%)
    36,451,588,171  stalled-cycles-frontend:u #  4.26% frontend cycles idle     (50.01%)
   438,654,175,652  stalled-cycles-backend:u  # 51.21% backend cycles idle      (16.68%)
   937,926,993,826  instructions:u            #  1.09 insn per cycle
                                              #  0.47 stalled cycles per insn  (33.36%)
   205,289,212,856  branches:u                # 896.253 M/sec                   (50.02%)
     9,019,757,337  branch-misses:u           #  4.39% of all branches          (16.65%)

vs

   867,688,505,674  cycles:u                  #  3.731 GHz                      (33.34%)  Δ=11100410289 (+1.29%)
    36,672,094,462  stalled-cycles-frontend:u #  4.23% frontend cycles idle     (50.02%)  Δ=220506291 (+0.60%)
   438,837,922,096  stalled-cycles-backend:u  # 50.58% backend cycles idle      (16.68%)  Δ=183746444 (+0.04%)
   937,918,212,318  instructions:u            #  1.08 insn per cycle
                                              #  0.47 stalled cycles per insn  (33.37%)
   205,201,306,341  branches:u                # 882.403 M/sec                   (50.02%)
     9,072,857,028  branch-misses:u           #  4.42% of all branches          (16.65%)  Δ=53099691 (+0.58%)

It also shows that the slowdown cannot be explained by pipeline stalls,
additional instructions or branch misses (the latter could still be the
case if a single branch miss somehow translated to 200 cycles on p9).

perf diff -c wdiff:1,1 shows that there is just one function
(htab_traverse) that is significantly slower now:

   2.98%  11768891764  exe  [.] htab_traverse
   1.91%    563949986  exe  [.] compute_dominance_frontiers_1

The additional cycles consumed by this function match the overall
number of additionally consumed cycles, and the contribution of the
runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
think it's really just this one function.

However, the generated assembly is completely identical in both cases!

I saw similar situations in the past, so I tried adding a nop to
htab_traverse:

--- hashtab.c
+++ hashtab.c
@@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
      htab_trav callback;
      PTR info;
 {
+  __asm__ volatile("nop\n");
+
   PTR *slot = htab->entries;
   PTR *limit = slot + htab->size;

and made a 5x re-run.  The new measurements are 227.01s and 227.44s
(+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
looks like noise.

Can this be explained by some microarchitectural quirk after all?
Hi!

On Tue, Nov 27, 2018 at 05:07:11PM +0100, Ilya Leoshkevich wrote:
> perf diff -c wdiff:1,1 shows that there is just one function
> (htab_traverse) that is significantly slower now:
>
>   2.98%  11768891764  exe  [.] htab_traverse
>   1.91%    563949986  exe  [.] compute_dominance_frontiers_1
>
> The additional cycles consumed by this function match the overall
> number of additionally consumed cycles, and the contribution of the
> runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
> think it's really just this one function.
>
> However, the generated assembly is completely identical in both cases!

Ugh.  We have seen this before :-(

Thanks for investigating.  I don't consider the Power degradation as
really caused by your patch, then.

> I saw similar situations in the past, so I tried adding a nop to
> htab_traverse:
>
> --- hashtab.c
> +++ hashtab.c
> @@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
>      htab_trav callback;
>      PTR info;
>  {
> +  __asm__ volatile("nop\n");
> +
>   PTR *slot = htab->entries;
>   PTR *limit = slot + htab->size;
>
> and made a 5x re-run.  The new measurements are 227.01s and 227.44s
> (+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
> looks like noise.
>
> Can this be explained by some microarchitectural quirk after all?

Two frequent branch targets that get thrown into the same bin for
prediction.  Results change based on random compiler changes, ASLR
settings, phase of the moon, how many people in your neighbourhood have
had porridge for breakfast this morning, etc.


Segher
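[The "same bin" remark refers to branch predictors indexing their
tables by a subset of the branch address bits, so two unrelated hot
branches can alias and corrupt each other's history.  Below is a
minimal, hypothetical C sketch of the effect, mirroring the nop
experiment above: the PAD macro and the specific branch conditions are
illustrative inventions, and whether the two branches actually collide
in a predictor slot depends entirely on the microarchitecture, so a
given CPU may show no difference at all.]

/* Hypothetical demo: two hot, hard-to-predict branches whose relative
   placement is shifted by nop padding.  Build with and without -DPAD
   and compare `perf stat -e branches,branch-misses`.  */
#include <stdlib.h>

int
main (void)
{
  volatile long sum = 0;
  srand (42);
  for (long i = 0; i < 100000000L; i++)
    {
      int r = rand ();
      if (r & 1)                  /* first hard-to-predict branch */
        sum += r;
#ifdef PAD
      __asm__ volatile ("nop");   /* shifts the next branch's address */
#endif
      if (r & 2)                  /* second hard-to-predict branch */
        sum -= r;
    }
  return sum & 1;
}

[If the padding moves the second branch into, or out of, a predictor
slot shared with the first, the branch-miss rate shifts even though
the instruction count is essentially unchanged, which is the same
signature seen in the htab_traverse measurements.]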
On 11/28/18 1:51 PM, Segher Boessenkool wrote:
> Hi!
>
> On Tue, Nov 27, 2018 at 05:07:11PM +0100, Ilya Leoshkevich wrote:
>> perf diff -c wdiff:1,1 shows that there is just one function
>> (htab_traverse) that is significantly slower now:
>>
>>   2.98%  11768891764  exe  [.] htab_traverse
>>   1.91%    563949986  exe  [.] compute_dominance_frontiers_1
>>
>> The additional cycles consumed by this function match the overall
>> number of additionally consumed cycles, and the contribution of the
>> runner-up (compute_dominance_frontiers_1) is 20 times smaller, so I
>> think it's really just this one function.
>>
>> However, the generated assembly is completely identical in both cases!
>
> Ugh.  We have seen this before :-(
>
> Thanks for investigating.  I don't consider the Power degradation as
> really caused by your patch, then.
>
>> I saw similar situations in the past, so I tried adding a nop to
>> htab_traverse:
>>
>> --- hashtab.c
>> +++ hashtab.c
>> @@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
>>      htab_trav callback;
>>      PTR info;
>>  {
>> +  __asm__ volatile("nop\n");
>> +
>>   PTR *slot = htab->entries;
>>   PTR *limit = slot + htab->size;
>>
>> and made a 5x re-run.  The new measurements are 227.01s and 227.44s
>> (+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
>> looks like noise.
>>
>> Can this be explained by some microarchitectural quirk after all?
>
> Two frequent branch targets that get thrown into the same bin for
> prediction.  Results change based on random compiler changes, ASLR
> settings, phase of the moon, how many people in your neighbourhood have
> had porridge for breakfast this morning, etc.

FWIW, I've found the hashtable code particularly vulnerable to this
kind of performance jitter.  I've long suspected it's more related to
the data locations, as I can see the jitter with the same binary
running under valgrind/cachegrind control.  ASLR is the most likely
culprit in my mind.

However, in this case it seems different -- adding a NOP changes the
instruction stream.  It could be collisions in the branch predictors or
something similar.

Ilya, can you repost the final patch?

Jeff
diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c
index 4a5dc29d14f..bc4a78889db 100644
--- a/gcc/cfgcleanup.c
+++ b/gcc/cfgcleanup.c
@@ -3259,6 +3259,48 @@ make_pass_jump (gcc::context *ctxt)
 
 namespace {
 
+const pass_data pass_data_postreload_jump =
+{
+  RTL_PASS, /* type */
+  "postreload_jump", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_JUMP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_postreload_jump : public rtl_opt_pass
+{
+public:
+  pass_postreload_jump (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_postreload_jump, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual unsigned int execute (function *);
+
+}; // class pass_postreload_jump
+
+unsigned int
+pass_postreload_jump::execute (function *)
+{
+  cleanup_cfg (flag_thread_jumps ? CLEANUP_THREADING : 0);
+  return 0;
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_postreload_jump (gcc::context *ctxt)
+{
+  return new pass_postreload_jump (ctxt);
+}
+
+namespace {
+
 const pass_data pass_data_jump2 =
 {
   RTL_PASS, /* type */
diff --git a/gcc/passes.def b/gcc/passes.def
index 82ad9404b9e..0079fecef32 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -458,6 +458,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_reload);
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
+	  NEXT_PASS (pass_postreload_jump);
 	  NEXT_PASS (pass_postreload_cse);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
diff --git a/gcc/testsuite/gcc.target/s390/pr80080-4.c b/gcc/testsuite/gcc.target/s390/pr80080-4.c
new file mode 100644
index 00000000000..5fc6a558008
--- /dev/null
+++ b/gcc/testsuite/gcc.target/s390/pr80080-4.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { lp64 } } } */
+/* { dg-options "-march=z196 -O2" } */
+
+extern void bar(int *mem);
+
+void foo4(int *mem)
+{
+  int oldval = 0;
+  if (!__atomic_compare_exchange_n (mem, (void *) &oldval, 1,
+				    1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+    {
+      bar (mem);
+    }
+}
+
+/* { dg-final { scan-assembler {(?n)\n\tlt\t.*\n\tjne\t(\.L\d+)\n(.*\n)*\tcs\t.*\n\tber\t%r14\n\1:\n\tjg\tbar\n} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 2f8779ee4b8..b20d34c15e9 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -579,6 +579,7 @@ extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
+extern rtl_opt_pass *make_pass_postreload_jump (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
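
[For context on what the new pass does: cleanup_cfg (CLEANUP_THREADING)
performs RTL jump threading, redirecting a jump whose destination
immediately tests an already-known condition straight to the final
target.  Repeating it after combine, as the series title says, catches
cases where combine has only just exposed that two tests are on the
same condition.  The function below is a rough, hypothetical C-level
picture of that control-flow shape; it is not taken from the patch,
and the real transformation runs on RTL after reload, not on source.]

/* Hypothetical illustration: both `if (x)` tests are on the same,
   unchanged condition, so threading redirects the taken edge of the
   first branch straight into the body of the second, skipping the
   redundant retest at the join point.  */
int
thread_me (int x, int a, int b)
{
  int r;
  if (x)      /* taken edge can go directly to "r += 1" below */
    r = a;
  else
    r = b;
  if (x)      /* redundant retest removed by threading */
    r += 1;
  return r;
}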