
Introduce 4-stages profiledbootstrap to get a better profile.

Message ID f61115af-55e1-8d59-545f-295f7e5b53cb@suse.cz
State New

Commit Message

Martin Liška May 25, 2017, 9:55 a.m. UTC
Hi.

As I discussed PGO with Honza and Richi, the current 3-stage build is not ideal for the following
2 reasons:

1) the stageprofile compiler is trained only on the libraries that are built during stage2
2) apart from that, as the compiler is also used to build the final compiler, the profile
is being updated during the build. So the stage2 compiler is making different decisions.

Both problems can be resolved by adding another step between the current stage2 and stage3,
where we train the stage2 compiler by building the compiler with default options.
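
The proposed stage order can be sketched as a small table. The stage names and the -fprofile-generate / -fprofile-use flags are taken from the patch; the Python structure, the "built_by" field (which assumes the usual bootstrap chaining, where each stage is built by the previous one) and the helper name training_workload are only illustrative.

```python
# Illustrative model of the proposed 4-stage profiledbootstrap.
# Stage names and flags come from the patch; "built_by" assumes the usual
# bootstrap chaining and is not taken from the Makefile itself.
STAGES = [
    {"name": "stage1", "built_by": "host compiler", "extra_flags": ""},
    {"name": "stageprofile", "built_by": "stage1", "extra_flags": "-fprofile-generate"},
    {"name": "stagetrain", "built_by": "stageprofile", "extra_flags": ""},
    {"name": "stagefeedback", "built_by": "stagetrain", "extra_flags": "-fprofile-use"},
]


def training_workload(stages):
    """Return the name of the stage built by the instrumented compiler.

    Building that stage is the workload whose execution counts end up in
    the profile consumed later with -fprofile-use.
    """
    instrumented = next(
        s["name"] for s in stages if "-fprofile-generate" in s["extra_flags"]
    )
    return next(s["name"] for s in stages if s["built_by"] == instrumented)


print(training_workload(STAGES))
```

In this model the training workload is a whole compiler build with default options (stagetrain) rather than only the runtime libraries.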

I'm going to do some measurements.

Ready for trunk?
Martin

Comments

Markus Trippelsdorf May 25, 2017, 11:22 a.m. UTC | #1
On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> Hi.
>
> As I discussed PGO with Honza and Richi, the current 3-stage build is not ideal for the following
> 2 reasons:
>
> 1) the stageprofile compiler is trained only on the libraries that are built during stage2
> 2) apart from that, as the compiler is also used to build the final compiler, the profile
> is being updated during the build. So the stage2 compiler is making different decisions.
>
> Both problems can be resolved by adding another step between the current stage2 and stage3,
> where we train the stage2 compiler by building the compiler with default options.
>
> I'm going to do some measurements.

I did some measurements on gcc67 (trunk with --enable-checking=release).
The apparent speedup is in the noise.

Without your patch:

 Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):

      15749.058451      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.13% )
             1,352      context-switches          #    0.086 K/sec                    ( +-  0.16% )
                 7      cpu-migrations            #    0.000 K/sec                    ( +-  5.73% )
           269,142      page-faults               #    0.017 M/sec                    ( +-  0.01% )
    60,676,581,181      cycles                    #    3.853 GHz                      ( +-  0.09% )  (83.35%)
    13,401,784,189      stalled-cycles-frontend   #   22.09% frontend cycles idle     ( +-  0.20% )  (83.33%)
    12,926,843,370      stalled-cycles-backend    #   21.30% backend cycles idle      ( +-  0.04% )  (83.31%)
    73,074,099,356      instructions              #    1.20  insn per cycle
                                                  #    0.18  stalled cycles per insn  ( +-  0.02% )  (83.34%)
    16,607,220,814      branches                  # 1054.490 M/sec                    ( +-  0.03% )  (83.36%)
       616,673,310      branch-misses             #    3.71% of all branches          ( +-  0.08% )  (83.36%)

      15.803602619 seconds time elapsed                                          ( +-  0.14% )

With your patch:

 Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):

      15735.220610      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.11% )
             1,354      context-switches          #    0.086 K/sec                    ( +-  0.22% )
                 6      cpu-migrations            #    0.000 K/sec                    ( +-  6.67% )
           269,164      page-faults               #    0.017 M/sec                    ( +-  0.01% )
    60,723,862,242      cycles                    #    3.859 GHz                      ( +-  0.08% )  (83.35%)
    13,382,554,421      stalled-cycles-frontend   #   22.04% frontend cycles idle     ( +-  0.14% )  (83.31%)
    12,912,171,664      stalled-cycles-backend    #   21.26% backend cycles idle      ( +-  0.03% )  (83.34%)
    73,109,081,227      instructions              #    1.20  insn per cycle
                                                  #    0.18  stalled cycles per insn  ( +-  0.03% )  (83.34%)
    16,590,421,798      branches                  # 1054.349 M/sec                    ( +-  0.02% )  (83.35%)
       616,669,135      branch-misses             #    3.72% of all branches          ( +-  0.08% )  (83.36%)

      15.788772466 seconds time elapsed                                          ( +-  0.12% )
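
To make "in the noise" concrete, here is a minimal sketch that compares the two elapsed times with the run-to-run variation reported by perf. The numbers are copied from the output above; the helper name relative_change_pct and the rule of comparing against the larger of the two "+-" values are assumptions made for illustration, not a rigorous statistical test.

```python
def relative_change_pct(before, after):
    """Improvement of `after` over `before`, in percent of `before`."""
    return (before - after) / before * 100.0


# "seconds time elapsed" and the "+-" percentages from the two perf runs above.
before_s, before_noise_pct = 15.803602619, 0.14
after_s, after_noise_pct = 15.788772466, 0.12

speedup_pct = relative_change_pct(before_s, after_s)

# Crude check (an assumption, not a proper significance test): call the change
# noise if it is smaller than the larger of the two reported variations.
within_noise = abs(speedup_pct) <= max(before_noise_pct, after_noise_pct)

print(f"speedup: {speedup_pct:.3f}%  within noise: {within_noise}")
```

The difference comes out a little under 0.1%, which is below both reported variations.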



--
Markus
Martin Liška May 25, 2017, 3:50 p.m. UTC | #2
On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
>> Hi.
>>
>> As I discussed PGO with Honza and Richi, the current 3-stage build is not ideal for the following
>> 2 reasons:
>>
>> 1) the stageprofile compiler is trained only on the libraries that are built during stage2
>> 2) apart from that, as the compiler is also used to build the final compiler, the profile
>> is being updated during the build. So the stage2 compiler is making different decisions.
>>
>> Both problems can be resolved by adding another step between the current stage2 and stage3,
>> where we train the stage2 compiler by building the compiler with default options.
>>
>> I'm going to do some measurements.
> 
> I did some measurements on gcc67 (trunk with --enable-checking=release).
> The apparent speedup is in the noise.

Hello.

Thanks for the measurements.

I can see a difference for GCC 7.1:

g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done

before: 2m25.133s
after: real	2m25.133s

which is about 99.09% of the time before the patch. It's probably within the noise level.
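
As a sanity check on these numbers, the sketch below converts the `time` output to seconds and works backwards from the quoted percentage. It assumes the percentage is after/before; the helper name parse_time is hypothetical, and the implied baseline is only an inference from the two figures, not a measured value.

```python
import re


def parse_time(text):
    """Convert a shell `time` value such as '2m25.133s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


after_s = parse_time("2m25.133s")

# Percentage quoted in the mail, assumed here to mean after / before.
ratio = 0.9909124426480228

# Baseline implied by that ratio (an inference, not a measurement).
implied_before_s = after_s / ratio

print(f"after: {after_s:.3f}s  implied before: {implied_before_s:.3f}s")
```

Under that assumption the ten compilations took roughly 1.3 seconds less in total, i.e. about 0.9%.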

And apparently the file size of the binary is bigger:

before (using bloaty):

     VM SIZE                         FILE SIZE
 --------------                   --------------
  59.0%  15.1Mi .text              15.1Mi  62.3%
  21.3%  5.45Mi .rodata            5.45Mi  22.5%
   6.6%  1.69Mi .eh_frame          1.69Mi   6.9%
   5.4%  1.38Mi .bss                    0   0.0%
   3.3%   874Ki .dynstr             874Ki   3.5%
   1.8%   480Ki .dynsym             480Ki   1.9%
   1.1%   285Ki .eh_frame_hdr       285Ki   1.1%
   0.6%   158Ki .gnu.hash           158Ki   0.6%
   0.5%   144Ki .hash               144Ki   0.6%
   0.2%  44.4Ki .data              44.4Ki   0.2%
   0.2%  40.0Ki .gnu.version       40.0Ki   0.2%
   0.0%  11.1Ki .rela.plt          11.1Ki   0.0%
   0.0%  7.44Ki .plt               7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro       4.56Ki   0.0%
   0.0%  3.73Ki .got.plt           3.73Ki   0.0%
   0.0%      38 [Unmapped]         2.75Ki   0.0%
   0.0%     624 [ELF Headers]      2.55Ki   0.0%
   0.0%     848 [Other]            1.13Ki   0.0%
   0.0%     917 .gcc_except_table     917   0.0%
   0.0%     608 .dynamic              608   0.0%
   0.0%      16 [None]                  0   0.0%
 100.0%  25.7Mi TOTAL              24.3Mi 100.0%

after:

     VM SIZE                     FILE SIZE
 --------------               --------------
  58.3%  14.6Mi .text          14.6Mi  54.2%
  21.6%  5.41Mi .rodata        5.41Mi  20.1%
   0.0%       0 .strtab        2.13Mi   7.9%
   6.7%  1.67Mi .eh_frame      1.67Mi   6.2%
   5.5%  1.38Mi .bss                0   0.0%
   0.0%       0 .symtab        1.11Mi   4.1%
   3.4%   876Ki .dynstr         876Ki   3.2%
   1.9%   480Ki .dynsym         480Ki   1.7%
   1.1%   280Ki .eh_frame_hdr   280Ki   1.0%
   0.6%   158Ki .gnu.hash       158Ki   0.6%
   0.6%   144Ki .hash           144Ki   0.5%
   0.2%  44.4Ki .data          44.4Ki   0.2%
   0.2%  40.1Ki .gnu.version   40.1Ki   0.1%
   0.0%  11.1Ki .rela.plt      11.1Ki   0.0%
   0.0%  7.44Ki .plt           7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro   4.56Ki   0.0%
   0.0%  3.73Ki .got.plt       3.73Ki   0.0%
   0.0%      58 [Unmapped]     3.11Ki   0.0%
   0.0%     624 [ELF Headers]  2.61Ki   0.0%
   0.0%  2.32Ki [Other]        2.60Ki   0.0%
   0.0%      16 [None]              0   0.0%
 100.0%  25.1Mi TOTAL          26.9Mi 100.0%

As I discussed with Honza, we still have a problem in GCC: with the current working sets,
get_hot_bb_threshold () is very close to the number of runs, which is effectively 1 for a single
run. That's a mistake and it should be fixed.
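
A simplified model shows how a working-set based threshold can collapse to 1. This is not GCC's implementation: the function hot_threshold, the 999-permille coverage target and the counter values are assumptions chosen only to illustrate the effect.

```python
def hot_threshold(counters, permille=999):
    """Smallest count among the hottest counters that together cover
    `permille`/1000 of the total execution count (simplified model)."""
    total = sum(counters)
    covered = 0
    for count in sorted(counters, reverse=True):
        covered += count
        # Integer comparison avoids floating-point rounding.
        if covered * 1000 >= total * permille:
            return count
    return 0


# A few hot blocks plus a long tail of blocks executed once in a single run.
long_tail = [1000] * 10 + [1] * 5000
# The same hot blocks with only a handful of once-executed blocks.
short_tail = [1000] * 10 + [1] * 5

print(hot_threshold(long_tail), hot_threshold(short_tail))
```

With the long tail, reaching the coverage target requires counters that ran only once, so the threshold drops to 1 and practically every executed block counts as hot.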

Martin



Markus Trippelsdorf May 29, 2017, 5:04 a.m. UTC | #3
On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> Hi.
> 
> As I discussed PGO with Honza and Richi, the current 3-stage build is not ideal for the following
> 2 reasons:
> 
> 1) the stageprofile compiler is trained only on the libraries that are built during stage2
> 2) apart from that, as the compiler is also used to build the final compiler, the profile
> is being updated during the build. So the stage2 compiler is making different decisions.
> 
> Both problems can be resolved by adding another step between the current stage2 and stage3,
> where we train the stage2 compiler by building the compiler with default options.

Another issue that I've noticed is that LTO doesn't get used in the
final stage (stagefeedback) with "bootstrap-O3 bootstrap-lto".
It is only used during training. So either move -flto to stagefeedback,
or use -flto both during training and during the final stage.
Jan Hubicka May 29, 2017, 2:57 p.m. UTC | #4
> On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> > On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> >> Hi.
> >>
> >> As I discussed PGO with Honza and Richi, the current 3-stage build is not ideal for the following
> >> 2 reasons:
> >>
> >> 1) the stageprofile compiler is trained only on the libraries that are built during stage2
> >> 2) apart from that, as the compiler is also used to build the final compiler, the profile
> >> is being updated during the build. So the stage2 compiler is making different decisions.
> >>
> >> Both problems can be resolved by adding another step between the current stage2 and stage3,
> >> where we train the stage2 compiler by building the compiler with default options.
> >>
> >> I'm going to do some measurements.
> > 
> > I did some measurements on gcc67 (trunk with --enable-checking=release).
> > The apparent speedup is in the noise.
> 
> Hello.
> 
> Thanks for the measurements.
> 
> I can see a difference for GCC 7.1:
> 
> g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done
> 
> before: 2m25.133s
> after: real	2m25.133s
> 
> which is about 99.09% of the time before the patch. It's probably within the noise level.
> 
> And apparently the file size of the binary is bigger:
> 
> before (using bloaty):
> 
>      VM SIZE                         FILE SIZE
>  --------------                   --------------
>   59.0%  15.1Mi .text              15.1Mi  62.3%
>   21.3%  5.45Mi .rodata            5.45Mi  22.5%
>    6.6%  1.69Mi .eh_frame          1.69Mi   6.9%
>    5.4%  1.38Mi .bss                    0   0.0%
>    3.3%   874Ki .dynstr             874Ki   3.5%
>    1.8%   480Ki .dynsym             480Ki   1.9%
>    1.1%   285Ki .eh_frame_hdr       285Ki   1.1%
>    0.6%   158Ki .gnu.hash           158Ki   0.6%
>    0.5%   144Ki .hash               144Ki   0.6%
>    0.2%  44.4Ki .data              44.4Ki   0.2%
>    0.2%  40.0Ki .gnu.version       40.0Ki   0.2%
>    0.0%  11.1Ki .rela.plt          11.1Ki   0.0%
>    0.0%  7.44Ki .plt               7.44Ki   0.0%
>    0.0%  4.56Ki .data.rel.ro       4.56Ki   0.0%
>    0.0%  3.73Ki .got.plt           3.73Ki   0.0%
>    0.0%      38 [Unmapped]         2.75Ki   0.0%
>    0.0%     624 [ELF Headers]      2.55Ki   0.0%
>    0.0%     848 [Other]            1.13Ki   0.0%
>    0.0%     917 .gcc_except_table     917   0.0%
>    0.0%     608 .dynamic              608   0.0%
>    0.0%      16 [None]                  0   0.0%
>  100.0%  25.7Mi TOTAL              24.3Mi 100.0%
> 
> after:
> 
>      VM SIZE                     FILE SIZE
>  --------------               --------------
>   58.3%  14.6Mi .text          14.6Mi  54.2%
>   21.6%  5.41Mi .rodata        5.41Mi  20.1%
>    0.0%       0 .strtab        2.13Mi   7.9%
>    6.7%  1.67Mi .eh_frame      1.67Mi   6.2%
>    5.5%  1.38Mi .bss                0   0.0%
>    0.0%       0 .symtab        1.11Mi   4.1%
>    3.4%   876Ki .dynstr         876Ki   3.2%
>    1.9%   480Ki .dynsym         480Ki   1.7%
>    1.1%   280Ki .eh_frame_hdr   280Ki   1.0%
>    0.6%   158Ki .gnu.hash       158Ki   0.6%
>    0.6%   144Ki .hash           144Ki   0.5%
>    0.2%  44.4Ki .data          44.4Ki   0.2%
>    0.2%  40.1Ki .gnu.version   40.1Ki   0.1%
>    0.0%  11.1Ki .rela.plt      11.1Ki   0.0%
>    0.0%  7.44Ki .plt           7.44Ki   0.0%
>    0.0%  4.56Ki .data.rel.ro   4.56Ki   0.0%
>    0.0%  3.73Ki .got.plt       3.73Ki   0.0%
>    0.0%      58 [Unmapped]     3.11Ki   0.0%
>    0.0%     624 [ELF Headers]  2.61Ki   0.0%
>    0.0%  2.32Ki [Other]        2.60Ki   0.0%
>    0.0%      16 [None]              0   0.0%
>  100.0%  25.1Mi TOTAL          26.9Mi 100.0%
> 
> As I discussed with Honza, we still have a problem in GCC: with the current working sets,
> get_hot_bb_threshold () is very close to the number of runs, which is effectively 1 for a single
> run. That's a mistake and it should be fixed.

Yep, with LTO+PGO bootstrap I think we also hit the problem that the PGO inliner was never
seriously tuned (we basically use the very first badness metric I introduced and we never
experimented with parameters). The reason is that hot/cold partitioning, even when it
is very coarse, works reasonably well for the per-file compilation model.  With LTO we
are facing very many inline decisions and there is probably a lot of low-hanging fruit.

GCC is currently in transition to the new profile counter code.  I will push out the initial
patch retiring gcov_type soon (once I finish updating it to the current tree - it is very
annoying) and that will let us track hotness more conservatively and fix the old
problem that a count becomes unrealistically low because of broken profile updates and thus
becomes cold.  This should make it possible to increase the threshold and start
re-tuning (hopefully this or next week).

Honza
> 
> Martin

Patch

From 0a9c9a7f7d335e5e053ab37c5649371996e95325 Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Thu, 25 May 2017 11:35:29 +0200
Subject: [PATCH] Introduce 4-stages profiledbootstrap to get a better profile.

gcc/ChangeLog:

2017-05-25  Martin Liska  <mliska@suse.cz>

	* doc/install.texi: Document that PGO runs in 4 stages.

ChangeLog:

2017-05-25  Martin Liska  <mliska@suse.cz>

	* Makefile.def: Define 4 stages PGO bootstrap.
	* Makefile.tpl: Define FLAGS.
	* Makefile.in: Regenerate.
---
 Makefile.in          | 7 +++++--
 Makefile.tpl         | 7 +++++--
 gcc/doc/install.texi | 5 +++--
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index b824e0a0ca1..75e5a1a912b 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -522,8 +522,11 @@  STAGE1_CONFIGURE_FLAGS = --disable-intermodule $(STAGE1_CHECKING) \
 STAGEprofile_CFLAGS = $(STAGE2_CFLAGS) -fprofile-generate
 STAGEprofile_TFLAGS = $(STAGE2_TFLAGS)
 
-STAGEfeedback_CFLAGS = $(STAGE3_CFLAGS) -fprofile-use
-STAGEfeedback_TFLAGS = $(STAGE3_TFLAGS)
+STAGEtrain_CFLAGS = $(STAGE3_CFLAGS)
+STAGEtrain_TFLAGS = $(STAGE3_TFLAGS)
+
+STAGEfeedback_CFLAGS = $(STAGE4_CFLAGS) -fprofile-use
+STAGEfeedback_TFLAGS = $(STAGE4_TFLAGS)
 
 STAGEautoprofile_CFLAGS = $(STAGE2_CFLAGS) -g
 STAGEautoprofile_TFLAGS = $(STAGE2_TFLAGS)
diff --git a/Makefile.tpl b/Makefile.tpl
index d0fa07005be..5fcd7e358d9 100644
--- a/Makefile.tpl
+++ b/Makefile.tpl
@@ -455,8 +455,11 @@  STAGE1_CONFIGURE_FLAGS = --disable-intermodule $(STAGE1_CHECKING) \
 STAGEprofile_CFLAGS = $(STAGE2_CFLAGS) -fprofile-generate
 STAGEprofile_TFLAGS = $(STAGE2_TFLAGS)
 
-STAGEfeedback_CFLAGS = $(STAGE3_CFLAGS) -fprofile-use
-STAGEfeedback_TFLAGS = $(STAGE3_TFLAGS)
+STAGEtrain_CFLAGS = $(STAGE3_CFLAGS)
+STAGEtrain_TFLAGS = $(STAGE3_TFLAGS)
+
+STAGEfeedback_CFLAGS = $(STAGE4_CFLAGS) -fprofile-use
+STAGEfeedback_TFLAGS = $(STAGE4_TFLAGS)
 
 STAGEautoprofile_CFLAGS = $(STAGE2_CFLAGS) -g
 STAGEautoprofile_TFLAGS = $(STAGE2_TFLAGS)
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index b13fc1f6f42..386771872ba 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -2611,8 +2611,9 @@  bootstrap the compiler with profile feedback, use @code{make profiledbootstrap}.
 When @samp{make profiledbootstrap} is run, it will first build a @code{stage1}
 compiler.  This compiler is used to build a @code{stageprofile} compiler
 instrumented to collect execution counts of instruction and branch
-probabilities.  Then runtime libraries are compiled with profile collected.
-Finally a @code{stagefeedback} compiler is built using the information collected.
+probabilities.  Training run is done by building @code{stagetrain}
+compiler.  Finally a @code{stagefeedback} compiler is built
+using the information collected.
 
 Unlike standard bootstrap, several additional restrictions apply.  The
 compiler used to build @code{stage1} needs to support a 64-bit integral type.
-- 
2.12.2