Message ID | cc527602-2ed8-79ad-f082-e0e0279aa310@suse.cz |
---|---|
State | New |
Series | Increase min-lto-partition. |
> Hi.
>
> I played a bit with -flinker-output=nolto-rel of gimple-match.ii and I identified
> that current default of min-lto-partition leads to too many LTRANS. We pay with
> LTO overhead and so that user time is high.
>
> Base is:
> $ g++ -O2 /tmp/gimple-match.ii -c -fno-checking -o x.o
> real 0m40.130s
> user 0m39.911s

Did you configure the compiler with checking? If so, I think the benchmarks
are not that representative, because -fchecking does not control everything.
It would be relevant for GCC bootstrap but not much else. In that case I
would go with an explicit --param in our Makefile.

I tried your experiment with linking tramp3d (with -O2 -flto on EPYC):

hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=10000

real 0m12.574s
user 1m5.010s
sys 0m0.970s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=20000

real 0m17.926s
user 1m1.259s
sys 0m1.153s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=30000

real 0m22.115s
user 0m56.964s
sys 0m0.892s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=40000

real 0m23.510s
user 0m50.783s
sys 0m0.983s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=50000

real 0m28.410s
user 0m46.146s
sys 0m0.680s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=60000

real 0m32.304s
user 0m46.114s
sys 0m0.720s
hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=70000

real 0m42.332s
user 0m50.521s
sys 0m0.749s

So going from 10000 to 70000 seems to decrease user time from 65.0s to 50.5s
(about a 22% reduction); however, the overall link time goes up from 12.6s to
42.3s (about 3.4 times), which does not seem like a great tradeoff.

Moreover, I seem to get the best results with:

hubicka@lomikamen-jh:~$ time /aux/hubicka/trunk-install/bin/g++ -flto=auto tramp3d-v44.o --param lto-min-partition=1 --param lto-partitions=200

real 0m5.752s
user 1m8.949s
sys 0m3.826s

Both genmatch and tramp3d seem to be somewhat extreme sources, but perhaps we
want to explore this a bit further. I will try to re-measure your results on
my setup so we get an idea of how sensitive it is :)

Honza
>
> LGEN:
>
> $ time g++ -O2 /tmp/gimple-match.ii -c -flto -fno-checking
> real 0m8.709s
> user 0m8.543s
>
> WPA+LTRANS:
>
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=4 -fno-checking
> real 0m11.220s
> user 0m33.067s
>
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=6 -fno-checking
> real 0m9.880s
> user 0m35.599s
>
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=8 -fno-checking
> real 0m6.681s
> user 0m39.746s
>
> default:
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o -fno-checking
> real 0m6.065s
> user 1m22.698s
>
> So I would recommend to set the param value to 75000, which leads to 6 partitions. That would be:
>
> 9+10s = 19s vs. 40s (total real time 44s). That seems reasonable to me.
>
> Thoughts?
> Thanks,
> Martin
>
> gcc/ChangeLog:
>
> 2020-03-13 Martin Liska <mliska@suse.cz>
>
> * params.opt: Bump min-lto-partition in order to not create
> too many LTRANS.
> ---
>  gcc/params.opt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index e39216aa7d0..49fafac20af 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -363,7 +363,7 @@ Common Joined UInteger Var(param_max_lto_streaming_parallelism) Init(32) Integer
>  maximal number of LTO partitions streamed in parallel.
>
>  -param=lto-min-partition=
> -Common Joined UInteger Var(param_min_partition_size) Init(10000) Param
> +Common Joined UInteger Var(param_min_partition_size) Init(75000) Param
>  Minimal size of a partition for LTO (in estimated instructions).
>
>  -param=lto-partitions=
>
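Editorial note: the tradeoff in the tramp3d runs above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original thread. The timings are copied from the runs quoted above, while the names `TRAMP3D` and `tradeoff` are made up for this illustration.

```python
# Timings (in seconds) copied from the tramp3d link runs quoted above.
# Keys are the values passed to --param lto-min-partition.
# TRAMP3D and tradeoff are illustrative names, not anything from GCC.
TRAMP3D = {
    10000: {"real": 12.574, "user": 65.010},
    20000: {"real": 17.926, "user": 61.259},
    30000: {"real": 22.115, "user": 56.964},
    40000: {"real": 23.510, "user": 50.783},
    50000: {"real": 28.410, "user": 46.146},
    60000: {"real": 32.304, "user": 46.114},
    70000: {"real": 42.332, "user": 50.521},
}


def tradeoff(runs, baseline=10000):
    """Map each param value to (user time saved in %, real time factor),
    both measured relative to the baseline run."""
    base = runs[baseline]
    return {
        param: (
            100.0 * (base["user"] - t["user"]) / base["user"],
            t["real"] / base["real"],
        )
        for param, t in runs.items()
    }


if __name__ == "__main__":
    for param, (saved, factor) in sorted(tradeoff(TRAMP3D).items()):
        print(f"lto-min-partition={param:6d}  "
              f"user saved {saved:5.1f}%  real time x{factor:.2f}")
```

Under these numbers the largest user-time saving is at 60000 (about 29%), but wall-clock time grows at every step, which is the tradeoff questioned in the message.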
> > $ time g++ -O2 /tmp/gimple-match.ii -c -flto -fno-checking
> > real 0m8.709s
> > user 0m8.543s
> >
> > WPA+LTRANS:
> >
> > $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=4 -fno-checking
> > real 0m11.220s
> > user 0m33.067s
> >
> > $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=6 -fno-checking
> > real 0m9.880s
> > user 0m35.599s
> >
> > $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=8 -fno-checking
> > real 0m6.681s
> > user 0m39.746s
> >
> > default:
> > $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o -fno-checking
> > real 0m6.065s
> > user 1m22.698s

I did
/aux/hubicka/trunk-git/build2/./prev-gcc/xg++ -B/aux/hubicka/trunk-git/build2/./prev-gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -nostdinc++ -B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs -B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs -I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu -I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include -I/aux/hubicka/trunk-git/libstdc++-v3/libsupc++ -L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs -L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs -fno-PIE -c -g -O2 -fchecking=0 -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-error=format-diag -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common -Wno-unused -DHAVE_CONFIG_H -I. -I. -I../../gcc -I../../gcc/. -I../../gcc/../include -I../../gcc/../libcpp/include -I/aux/hubicka/trunk-git/build2/./gmp -I/aux/hubicka/trunk-git/gmp -I/aux/hubicka/trunk-git/build2/./mpfr/src -I/aux/hubicka/trunk-git/mpfr/src -I/aux/hubicka/trunk-git/mpc/src -I../../gcc/../libdecnumber -I../../gcc/../libdecnumber/bid -I../libdecnumber -I../../gcc/../libbacktrace -I/aux/hubicka/trunk-git/build2/./isl/include -I/aux/hubicka/trunk-git/isl/include -o gimple-match.o -MT gimple-match.o -MMD -MP -MF ./.deps/gimple-match.TPo gimple-match.c -flto

(copied from the build, with checking disabled and -flto added) and I get:

hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r

real 0m10.394s
user 2m13.809s
sys 0m3.896s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r

real 0m21.033s
user 2m3.063s
sys 0m2.539s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=6 -r

real 0m23.975s
user 1m56.139s
sys 0m2.595s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r

real 0m32.383s
user 1m39.411s
sys 0m2.213s

With debug info disabled (as you do, though I guess that is a less realistic
setting) I get:

hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r

real 0m10.905s
user 1m55.065s
sys 0m2.956s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r

real 0m17.297s
user 1m26.513s
sys 0m1.626s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=6 -r

real 0m22.365s
user 1m30.969s
sys 0m1.386s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r

real 0m26.534s
user 1m21.593s
sys 0m0.902s

So I do not see such a notable difference in user times (though mine are
consistently worse than yours). Perhaps you could try to perf it, including
the system profile? It may give us some idea why things behave differently.

The compiler binary I use was built with profiledbootstrap and LTO.

Honza
> >
> > So I would recommend to set the param value to 75000, which leads to 6 partitions. That would be:
> >
> > 9+10s = 19s vs. 40s (total real time 44s). That seems reasonable to me.
> >
> > Thoughts?
> > Thanks,
> > Martin
> >
> > gcc/ChangeLog:
> >
> > 2020-03-13 Martin Liska <mliska@suse.cz>
> >
> > * params.opt: Bump min-lto-partition in order to not create
> > too many LTRANS.
> > ---
> >  gcc/params.opt | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >
>
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index e39216aa7d0..49fafac20af 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -363,7 +363,7 @@ Common Joined UInteger Var(param_max_lto_streaming_parallelism) Init(32) Integer
> >  maximal number of LTO partitions streamed in parallel.
> >
> >  -param=lto-min-partition=
> > -Common Joined UInteger Var(param_min_partition_size) Init(10000) Param
> > +Common Joined UInteger Var(param_min_partition_size) Init(75000) Param
> >  Minimal size of a partition for LTO (in estimated instructions).
> >
> >  -param=lto-partitions=
> >
>
On 3/13/20 4:11 PM, Jan Hubicka wrote:
>>> $ time g++ -O2 /tmp/gimple-match.ii -c -flto -fno-checking
>>> real 0m8.709s
>>> user 0m8.543s
>>>
>>> WPA+LTRANS:
>>>
>>> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=4 -fno-checking
>>> real 0m11.220s
>>> user 0m33.067s
>>>
>>> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=6 -fno-checking
>>> real 0m9.880s
>>> user 0m35.599s
>>>
>>> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=8 -fno-checking
>>> real 0m6.681s
>>> user 0m39.746s
>>>
>>> default:
>>> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o -fno-checking
>>> real 0m6.065s
>>> user 1m22.698s
>
> I did
> /aux/hubicka/trunk-git/build2/./prev-gcc/xg++ -B/aux/hubicka/trunk-git/build2/./prev-gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -nostdinc++ -B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs -B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs -I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu -I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include -I/aux/hubicka/trunk-git/libstdc++-v3/libsupc++ -L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs -L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs -fno-PIE -c -g -O2 -fchecking=0 -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-error=format-diag -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common -Wno-unused -DHAVE_CONFIG_H -I. -I. -I../../gcc -I../../gcc/. -I../../gcc/../include -I../../gcc/../libcpp/include -I/aux/hubicka/trunk-git/build2/./gmp -I/aux/hubicka/trunk-git/gmp -I/aux/hubicka/trunk-git/build2/./mpfr/src -I/aux/hubicka/trunk-git/mpfr/src -I/aux/hubicka/trunk-git/mpc/src -I../../gcc/../libdecnumber -I../../gcc/../libdecnumber/bid -I../libdecnumber -I../../gcc/../libbacktrace -I/aux/hubicka/trunk-git/build2/./isl/include -I/aux/hubicka/trunk-git/isl/include -o gimple-match.o -MT gimple-match.o -MMD -MP -MF ./.deps/gimple-match.TPo gimple-match.c -flto
>
> (copying from build disabling checking and adding -flto) and I get:
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r
>
> real 0m10.394s
> user 2m13.809s
> sys 0m3.896s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r
>
> real 0m21.033s
> user 2m3.063s
> sys 0m2.539s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=6 -r
>
> real 0m23.975s
> user 1m56.139s
> sys 0m2.595s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r
>
> real 0m32.383s
> user 1m39.411s
> sys 0m2.213s
>
> With debug info disabled (like you do, but I guess in less realistic
> setting) I get:
>
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
> /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
> gimple-match.o -fno-checking --param lto-partitions=128 -r
>
> real 0m10.905s
> user 1m55.065s
> sys 0m2.956s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
> /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
> gimple-match.o -fno-checking --param lto-partitions=8 -r
>
> real 0m17.297s
> user 1m26.513s
> sys 0m1.626s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
> /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
> gimple-match.o -fno-checking --param lto-partitions=6 -r
>
> real 0m22.365s
> user 1m30.969s
> sys 0m1.386s
> hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
> /aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
> gimple-match.o -fno-checking --param lto-partitions=4 -r
>
> real 0m26.534s
> user 1m21.593s
> sys 0m0.902s
>
> So I do not see such notable difference in user times (but they are
> consistently worse than yours). Perhaps, can you try to perf it
> including the system profile? It may give us some idea why things behave
> differently.

That's strange. So let's take my gimple-match.ii:
https://drive.google.com/file/d/1B8d3bIvz1KA_ksIo8h-JgkaJTCRiSPR4/view?usp=sharing

With the gcc9 package (LTO+PGO) I get:

$ time g++ -O2 gimple-match.ii -c -flto

real 0m8.180s
user 0m7.992s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r

real 0m9.041s
user 0m28.157s
sys 0m0.493s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r

real 0m6.011s
user 1m20.326s
sys 0m2.147s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking -r

real 0m6.303s
user 1m18.789s
sys 0m2.244s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r

real 0m5.875s
user 0m38.938s
sys 0m0.784s

For the default I get:

perf report --stdio | head -n30
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 351K of event 'cycles:u'
# Event count (approx.): 341558047686
#
# Overhead  Command          Shared Object                Symbol
# ........  ...............  ...........................  ............................................................................
#
     3.61%  lto1-ltrans      lto1                         [.] df_worklist_dataflow
     1.93%  lto1-ltrans      lto1                         [.] cleanup_cfg
     1.15%  lto1-ltrans      lto1                         [.] init_alias_analysis
     1.02%  lto1-ltrans      lto1                         [.] pre_and_rev_post_order_compute_fn
     0.93%  lto1-ltrans      lto1                         [.] calculate_dominance_info
     0.84%  lto1-ltrans      lto1                         [.] inverted_post_order_compute
     0.75%  lto1-ltrans      lto1                         [.] post_order_compute
     0.71%  lto1-ltrans      libc-2.31.so                 [.] _int_malloc
     0.69%  lto1-ltrans      lto1                         [.] constrain_operands
     0.68%  lto1-ltrans      lto1                         [.] df_bb_refs_record
     0.59%  lto1-ltrans      lto1                         [.] side_effects_p
     0.53%  lto1-ltrans      lto1                         [.] delete_unreachable_blocks
     0.53%  lto1-ltrans      lto1                         [.] rewrite_update_dom_walker::before_dom_children
     0.49%  lto1-ltrans      lto1                         [.] bitmap_set_bit
     0.47%  lto1-ltrans      lto1                         [.] record_temporary_equivalences
     0.46%  lto1-ltrans      lto1                         [.] single_def_use_dom_walker::before_dom_children
     0.46%  lto1-ltrans      lto1                         [.] df_compact_blocks
     0.45%  lto1-ltrans      lto1                         [.] substitute_and_fold_engine::substitute_and_fold
     0.45%  lto1-ltrans      libc-2.31.so                 [.] _int_free

Martin

>
> Compiler binary I use is profiledbootstrapped with LTO.
>
> Honza
>>>
>>> So I would recommend to set the param value to 75000, which leads to 6 partitions. That would be:
>>>
>>> 9+10s = 19s vs. 40s (total real time 44s). That seems reasonable to me.
>>>
>>> Thoughts?
>>> Thanks,
>>> Martin
>>>
>>> gcc/ChangeLog:
>>>
>>> 2020-03-13 Martin Liska <mliska@suse.cz>
>>>
>>> * params.opt: Bump min-lto-partition in order to not create
>>> too many LTRANS.
>>> ---
>>>  gcc/params.opt | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>>
>>
>>> diff --git a/gcc/params.opt b/gcc/params.opt
>>> index e39216aa7d0..49fafac20af 100644
>>> --- a/gcc/params.opt
>>> +++ b/gcc/params.opt
>>> @@ -363,7 +363,7 @@ Common Joined UInteger Var(param_max_lto_streaming_parallelism) Init(32) Integer
>>>  maximal number of LTO partitions streamed in parallel.
>>>
>>>  -param=lto-min-partition=
>>> -Common Joined UInteger Var(param_min_partition_size) Init(10000) Param
>>> +Common Joined UInteger Var(param_min_partition_size) Init(75000) Param
>>>  Minimal size of a partition for LTO (in estimated instructions).
>>>
>>>  -param=lto-partitions=
>>>
>>
And using EPYC2 with 64 cores I get:

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r

real 0m11.040s
user 0m33.479s
sys 0m0.718s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r

real 0m6.542s
user 0m39.334s
sys 0m0.945s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r

real 0m4.945s
user 0m59.344s
sys 0m2.475s

So here the growth of user time is only about 77% (from 33.5s to 59.3s).

And the baseline is:

$ time g++ -O2 /tmp/gimple-match.ii -c

real 0m39.783s
user 0m39.385s
sys 0m0.372s

Martin
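Editorial note: the disagreement in this thread is about how much user time grows when going from 4 to 128 LTRANS partitions for gimple-match.o. The Python sketch below is not from the thread. It puts the four reported setups side by side, with the numbers copied from the messages above; the setup labels and the names `SETUPS` and `growth_percent` are made up for this illustration.

```python
# (user seconds with lto-partitions=4, user seconds with lto-partitions=128),
# copied from the gimple-match.o measurements reported in this thread.
# The labels are informal descriptions, not official configuration names.
SETUPS = {
    "Honza, built with -g": (99.411, 133.809),
    "Honza, debug info disabled": (81.593, 115.065),
    "Martin, gcc9 package (LTO+PGO)": (28.157, 80.326),
    "Martin, EPYC2 (64 cores)": (33.479, 59.344),
}


def growth_percent(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    for label, (user_4, user_128) in SETUPS.items():
        print(f"{label:32s} {user_4:7.1f}s -> {user_128:7.1f}s "
              f"(+{growth_percent(user_4, user_128):.0f}%)")
```

With these inputs the growth ranges from roughly 35% on Honza's machine to roughly 185% with the gcc9 package, which is the discrepancy the perf profile was meant to help explain.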
diff --git a/gcc/params.opt b/gcc/params.opt
index e39216aa7d0..49fafac20af 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -363,7 +363,7 @@ Common Joined UInteger Var(param_max_lto_streaming_parallelism) Init(32) Integer
 maximal number of LTO partitions streamed in parallel.
 
 -param=lto-min-partition=
-Common Joined UInteger Var(param_min_partition_size) Init(10000) Param
+Common Joined UInteger Var(param_min_partition_size) Init(75000) Param
 Minimal size of a partition for LTO (in estimated instructions).
 
 -param=lto-partitions=
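Editorial note: to see why raising the minimum partition size reduces the number of LTRANS units, the following Python sketch models only the clamping step. It is a simplified model and not GCC's partitioner, which also takes the call graph and other limits into account. The function `estimated_partitions`, the default of 128 for lto-partitions, and the unit size of 450000 estimated instructions are all assumptions made for this illustration. The size was picked so that a minimum of 75000 yields the 6 partitions mentioned in the patch description.

```python
import math


def estimated_partitions(total_size, lto_partitions=128, min_partition_size=10000):
    """Simplified model (an assumption, not GCC code): aim for
    total_size / lto_partitions per partition, but never go below
    min_partition_size, then count how many partitions fit."""
    target = max(total_size // lto_partitions, min_partition_size)
    return max(1, math.ceil(total_size / target))


# Hypothetical size of the translation unit in estimated instructions.
HYPOTHETICAL_UNIT_SIZE = 450_000

if __name__ == "__main__":
    for minimum in (10000, 75000):
        count = estimated_partitions(HYPOTHETICAL_UNIT_SIZE,
                                     min_partition_size=minimum)
        print(f"lto-min-partition={minimum}: about {count} partitions")
```

In this model the minimum size only matters when the requested partition count would make partitions smaller than the minimum. An explicit `--param lto-partitions=4` is unaffected by either value.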