Message ID | f61115af-55e1-8d59-545f-295f7e5b53cb@suse.cz |
---|---|
State | New |
On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> Hi.
>
> As I spoke about the PGO with Honza and Richi, current 3-stage is not ideal for following
> 2 reasons:
>
> 1) stageprofile compiler is trained just on libraries that are built during stage2
> 2) apart from that, as the compiler is also used to build the final compiler, profile
> is being updated during the build. So the stage2 compiler is making different decisions.
>
> Both problems can be resolved by adding another step in between current stage2 and stage3
> where we train stage2 compiler by building compiler with default options.
>
> I'm going to do some measurements.

I did some measurements on gcc67 (trunk with --enable-checking=release).
The apparent speedup is in the noise.

Without your patch:

 Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):

      15749.058451      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.13% )
             1,352      context-switches          #    0.086 K/sec                    ( +-  0.16% )
                 7      cpu-migrations            #    0.000 K/sec                    ( +-  5.73% )
           269,142      page-faults               #    0.017 M/sec                    ( +-  0.01% )
    60,676,581,181      cycles                    #    3.853 GHz                      ( +-  0.09% )  (83.35%)
    13,401,784,189      stalled-cycles-frontend   #   22.09% frontend cycles idle     ( +-  0.20% )  (83.33%)
    12,926,843,370      stalled-cycles-backend    #   21.30% backend cycles idle      ( +-  0.04% )  (83.31%)
    73,074,099,356      instructions              #    1.20  insn per cycle
                                                  #    0.18  stalled cycles per insn  ( +-  0.02% )  (83.34%)
    16,607,220,814      branches                  # 1054.490 M/sec                    ( +-  0.03% )  (83.36%)
       616,673,310      branch-misses             #    3.71% of all branches          ( +-  0.08% )  (83.36%)

      15.803602619 seconds time elapsed                                          ( +-  0.14% )

With your patch:

 Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):

      15735.220610      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.11% )
             1,354      context-switches          #    0.086 K/sec                    ( +-  0.22% )
                 6      cpu-migrations            #    0.000 K/sec                    ( +-  6.67% )
           269,164      page-faults               #    0.017 M/sec                    ( +-  0.01% )
    60,723,862,242      cycles                    #    3.859 GHz                      ( +-  0.08% )  (83.35%)
    13,382,554,421      stalled-cycles-frontend   #   22.04% frontend cycles idle     ( +-  0.14% )  (83.31%)
    12,912,171,664      stalled-cycles-backend    #   21.26% backend cycles idle      ( +-  0.03% )  (83.34%)
    73,109,081,227      instructions              #    1.20  insn per cycle
                                                  #    0.18  stalled cycles per insn  ( +-  0.03% )  (83.34%)
    16,590,421,798      branches                  # 1054.349 M/sec                    ( +-  0.02% )  (83.35%)
       616,669,135      branch-misses             #    3.72% of all branches          ( +-  0.08% )  (83.36%)

      15.788772466 seconds time elapsed                                          ( +-  0.12% )

--
Markus
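
To make the "in the noise" reading concrete, here is a minimal sketch that compares the two "seconds time elapsed" figures from the perf output above. It assumes the "+-" percentages can be treated as one-sigma relative deviations and combines them in quadrature; the helper name `within_noise` is made up for this illustration and is not part of perf or GCC.

```python
import math


def within_noise(mean_a, rel_err_a, mean_b, rel_err_b):
    """Return True if the two means differ by less than their combined spread.

    rel_err_a / rel_err_b are the '+-' percentages printed by perf stat,
    treated here as one-sigma relative deviations (an assumption of this sketch).
    """
    sigma_a = mean_a * rel_err_a / 100.0
    sigma_b = mean_b * rel_err_b / 100.0
    combined = math.hypot(sigma_a, sigma_b)  # sqrt(sigma_a**2 + sigma_b**2)
    return abs(mean_a - mean_b) < combined


# (seconds elapsed, +- percent) copied from the two perf stat runs above.
without_patch = (15.803602619, 0.14)
with_patch = (15.788772466, 0.12)

delta = without_patch[0] - with_patch[0]
print(f"delta = {delta:.4f} s ({100 * delta / without_patch[0]:.3f}% of the unpatched time)")
print("within noise:", within_noise(*without_patch, *with_patch))
```

Under these assumptions the difference between the two runs is smaller than the combined spread, which matches the conclusion stated in the message.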
On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
>> Hi.
>>
>> As I spoke about the PGO with Honza and Richi, current 3-stage is not ideal for following
>> 2 reasons:
>>
>> 1) stageprofile compiler is trained just on libraries that are built during stage2
>> 2) apart from that, as the compiler is also used to build the final compiler, profile
>> is being updated during the build. So the stage2 compiler is making different decisions.
>>
>> Both problems can be resolved by adding another step in between current stage2 and stage3
>> where we train stage2 compiler by building compiler with default options.
>>
>> I'm going to do some measurements.
>
> I did some measurements on gcc67 (trunk with --enable-checking=release).
> The apparent speedup is in the noise.

Hello.

Thanks for the measurements. I can see a difference for GCC 7.1:

g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done

before: 2m25.133s
after: real 2m25.133s

which is 99.09124426480228%. It's probably within the noise level.

And apparently the file size of the binary is bigger:

before (using bloaty):

     VM SIZE                          FILE SIZE
 --------------                    --------------
  59.0%  15.1Mi .text               15.1Mi  62.3%
  21.3%  5.45Mi .rodata             5.45Mi  22.5%
   6.6%  1.69Mi .eh_frame           1.69Mi   6.9%
   5.4%  1.38Mi .bss                     0   0.0%
   3.3%   874Ki .dynstr              874Ki   3.5%
   1.8%   480Ki .dynsym              480Ki   1.9%
   1.1%   285Ki .eh_frame_hdr        285Ki   1.1%
   0.6%   158Ki .gnu.hash            158Ki   0.6%
   0.5%   144Ki .hash                144Ki   0.6%
   0.2%  44.4Ki .data               44.4Ki   0.2%
   0.2%  40.0Ki .gnu.version        40.0Ki   0.2%
   0.0%  11.1Ki .rela.plt           11.1Ki   0.0%
   0.0%  7.44Ki .plt                7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro        4.56Ki   0.0%
   0.0%  3.73Ki .got.plt            3.73Ki   0.0%
   0.0%      38 [Unmapped]          2.75Ki   0.0%
   0.0%     624 [ELF Headers]       2.55Ki   0.0%
   0.0%     848 [Other]             1.13Ki   0.0%
   0.0%     917 .gcc_except_table      917   0.0%
   0.0%     608 .dynamic               608   0.0%
   0.0%      16 [None]                   0   0.0%
 100.0%  25.7Mi TOTAL               24.3Mi 100.0%

after:

     VM SIZE                          FILE SIZE
 --------------                    --------------
  58.3%  14.6Mi .text               14.6Mi  54.2%
  21.6%  5.41Mi .rodata             5.41Mi  20.1%
   0.0%       0 .strtab             2.13Mi   7.9%
   6.7%  1.67Mi .eh_frame           1.67Mi   6.2%
   5.5%  1.38Mi .bss                     0   0.0%
   0.0%       0 .symtab             1.11Mi   4.1%
   3.4%   876Ki .dynstr              876Ki   3.2%
   1.9%   480Ki .dynsym              480Ki   1.7%
   1.1%   280Ki .eh_frame_hdr        280Ki   1.0%
   0.6%   158Ki .gnu.hash            158Ki   0.6%
   0.6%   144Ki .hash                144Ki   0.5%
   0.2%  44.4Ki .data               44.4Ki   0.2%
   0.2%  40.1Ki .gnu.version        40.1Ki   0.1%
   0.0%  11.1Ki .rela.plt           11.1Ki   0.0%
   0.0%  7.44Ki .plt                7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro        4.56Ki   0.0%
   0.0%  3.73Ki .got.plt            3.73Ki   0.0%
   0.0%      58 [Unmapped]          3.11Ki   0.0%
   0.0%     624 [ELF Headers]       2.61Ki   0.0%
   0.0%  2.32Ki [Other]             2.60Ki   0.0%
   0.0%      16 [None]                   0   0.0%
 100.0%  25.1Mi TOTAL               26.9Mi 100.0%

As I discussed with Honza, we still have a problem in GCC: with the current working sets,
get_hot_bb_threshold () is very close to the number of runs, which is effectively 1 for a
single run. That's a mistake and it should be fixed.

Martin
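
As a rough aid for reading the two bloaty listings in the message above, the following sketch does the arithmetic on the FILE SIZE column. The numbers are copied from the listings; the variable names are illustrative only. Note that the thread does not say whether both binaries were stripped the same way, so this only reports what the listings themselves show.

```python
# FILE SIZE totals in MiB, copied from the "before" and "after" bloaty listings.
before_total = 24.3
after_total = 26.9

# Sections that appear only in the "after" listing (FILE SIZE column, MiB).
after_only = {".strtab": 2.13, ".symtab": 1.11}

growth = after_total - before_total
symbol_tables = sum(after_only.values())
after_without_symbols = after_total - symbol_tables

print(f"file size growth:       {growth:+.2f} MiB")
print(f".strtab + .symtab:      {symbol_tables:.2f} MiB")
print(f"after, minus those two: {after_without_symbols:.2f} MiB (before: {before_total:.2f} MiB)")
```

Going by these figures, the two symbol-table sections listed only in the "after" output account for more than the whole file size increase, while the VM SIZE total and .text are slightly smaller after the patch.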
On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> Hi.
>
> As I spoke about the PGO with Honza and Richi, current 3-stage is not ideal for following
> 2 reasons:
>
> 1) stageprofile compiler is trained just on libraries that are built during stage2
> 2) apart from that, as the compiler is also used to build the final compiler, profile
> is being updated during the build. So the stage2 compiler is making different decisions.
>
> Both problems can be resolved by adding another step in between current stage2 and stage3
> where we train stage2 compiler by building compiler with default options.

Another issue that I've noticed is that LTO doesn't get used in the final
stage (stagefeedback) with "bootstrap-O3 bootstrap-lto". It is only used
during training. So either move -flto to stagefeedback, or use -flto both
during training and during the final stage.
> As I had chat with Honza, we still have problem in GCC that using current working sets,
> get_hot_bb_threshold () is very close to number of runs, which is effectively 1 for a single
> run. That's mistake and that should be fixed.

Yep, with LTO+PGO bootstrap I think we also hit the problem that the PGO inliner
was never seriously tuned (we basically use the very first badness metric I
introduced and we never experimented with parameters). The reason is that
hot/cold partitioning, even when it is very coarse, does work reasonably well
for the per-file compilation model. With LTO we are facing very many inline
decisions and there is probably a lot of low-hanging fruit.

GCC is currently in transition to new profile counter code. I will push out
the initial patch retiring gcov_type soon (once I finish updating it to the
current tree - it is very annoying) and that will let us track hotness more
conservatively and fix the old problem that a count becomes unrealistically low
because of broken profile updates and thus becomes cold. This should make it
possible to increase the threshold and start re-tuning (hopefully this or
next week).

Honza
>
> Martin
From 0a9c9a7f7d335e5e053ab37c5649371996e95325 Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Thu, 25 May 2017 11:35:29 +0200
Subject: [PATCH] Introduce 4-stages profiledbootstrap to get a better profile.

gcc/ChangeLog:

2017-05-25  Martin Liska  <mliska@suse.cz>

	* doc/install.texi: Document that PGO runs in 4 stages.

ChangeLog:

2017-05-25  Martin Liska  <mliska@suse.cz>

	* Makefile.def: Define 4 stages PGO bootstrap.
	* Makefile.tpl: Define FLAGS.
	* Makefile.in: Regenerate.
---
 Makefile.in          | 7 +++++--
 Makefile.tpl         | 7 +++++--
 gcc/doc/install.texi | 5 +++--
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..75e5a1a912b 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -522,8 +522,11 @@ STAGE1_CONFIGURE_FLAGS = --disable-intermodule $(STAGE1_CHECKING) \
 STAGEprofile_CFLAGS = $(STAGE2_CFLAGS) -fprofile-generate
 STAGEprofile_TFLAGS = $(STAGE2_TFLAGS)
 
-STAGEfeedback_CFLAGS = $(STAGE3_CFLAGS) -fprofile-use
-STAGEfeedback_TFLAGS = $(STAGE3_TFLAGS)
+STAGEtrain_CFLAGS = $(STAGE3_CFLAGS)
+STAGEtrain_TFLAGS = $(STAGE3_TFLAGS)
+
+STAGEfeedback_CFLAGS = $(STAGE4_CFLAGS) -fprofile-use
+STAGEfeedback_TFLAGS = $(STAGE4_TFLAGS)
 
 STAGEautoprofile_CFLAGS = $(STAGE2_CFLAGS) -g
 STAGEautoprofile_TFLAGS = $(STAGE2_TFLAGS)
diff --git a/Makefile.tpl b/Makefile.tpl
index d0fa07005be..5fcd7e358d9 100644
--- a/Makefile.tpl
+++ b/Makefile.tpl
@@ -455,8 +455,11 @@ STAGE1_CONFIGURE_FLAGS = --disable-intermodule $(STAGE1_CHECKING) \
 STAGEprofile_CFLAGS = $(STAGE2_CFLAGS) -fprofile-generate
 STAGEprofile_TFLAGS = $(STAGE2_TFLAGS)
 
-STAGEfeedback_CFLAGS = $(STAGE3_CFLAGS) -fprofile-use
-STAGEfeedback_TFLAGS = $(STAGE3_TFLAGS)
+STAGEtrain_CFLAGS = $(STAGE3_CFLAGS)
+STAGEtrain_TFLAGS = $(STAGE3_TFLAGS)
+
+STAGEfeedback_CFLAGS = $(STAGE4_CFLAGS) -fprofile-use
+STAGEfeedback_TFLAGS = $(STAGE4_TFLAGS)
 
 STAGEautoprofile_CFLAGS = $(STAGE2_CFLAGS) -g
 STAGEautoprofile_TFLAGS = $(STAGE2_TFLAGS)
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index b13fc1f6f42..386771872ba 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -2611,8 +2611,9 @@ bootstrap the compiler with profile feedback, use @code{make profiledbootstrap}.
 When @samp{make profiledbootstrap} is run, it will first build a @code{stage1}
 compiler. This compiler is used to build a @code{stageprofile} compiler
 instrumented to collect execution counts of instruction and branch
-probabilities. Then runtime libraries are compiled with profile collected.
-Finally a @code{stagefeedback} compiler is built using the information collected.
+probabilities. Training run is done by building @code{stagetrain}
+compiler. Finally a @code{stagefeedback} compiler is built
+using the information collected.
 
 Unlike standard bootstrap, several additional restrictions apply. The
 compiler used to build @code{stage1} needs to support a 64-bit integral type.
--
2.12.2