Reduce inline limits a bit to compensate changes in inlining metrics

Message ID 20180209133450.GA45817@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Feb. 9, 2018, 1:34 p.m.
Hi,
this patch addresses the code size regression by reducing 
max-inline-insns-auto 40->30 and increasing inline-min-speedup 8->15.

The main reasons why we need retuning are the following:

 - inline-min-speedup works in a way that if the expected runtime
   of the caller+callee combo after inlining reduces by more than 8%,
   the inliner is going to bypass inline-insns-auto (because it knows the
   inlining is beneficial rather than just inlining in the hope it will be).
   The decrease happens either because the callee is very lightweight on
   average or because we can track that it will optimize well.

   During GCC 8 development I have switched time estimates from int to sreal.
   The original estimates were capping time to about 1000 instructions and
   thus large functions rarely saw a speedup, because it was computed by
   comparing capped numbers.  With sreals we can now track benefits better.

 - We made quite some progress on early optimizations, making function
   bodies appear smaller to the inliner, which in turn inlines more of them.
   This is the reason why we want to decrease max-inline-insns-auto to gain
   some code size back.

   The code size estimate difference at the beginning of inlining is about 6%
   compared to gcc 6 and about 12% compared to gcc 4.9.
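
To make the first point concrete, here is a minimal C sketch of the bypass
rule and of why capped estimates hid the benefit for large functions.  It is
an illustration only, not the code in the inliner: the function names, the use
of double instead of sreal, and the exact cap of 1000 are assumptions.

```c
#include <stdbool.h>

/* Hypothetical cap mimicking the old integer estimates (about 1000).  */
#define OLD_TIME_CAP 1000.0

/* Relative speedup, in percent, of the caller+callee combo.  */
double
speedup_percent (double time_before, double time_after)
{
  if (time_before <= 0.0)
    return 0.0;
  return (time_before - time_after) * 100.0 / time_before;
}

/* Same computation, but with both estimates clamped first, the way
   the old integer estimates were clamped.  */
double
capped_speedup_percent (double time_before, double time_after)
{
  double b = time_before > OLD_TIME_CAP ? OLD_TIME_CAP : time_before;
  double a = time_after > OLD_TIME_CAP ? OLD_TIME_CAP : time_after;
  return speedup_percent (b, a);
}

/* True if the estimated speedup is large enough for the inliner to
   ignore the size limit.  MIN_SPEEDUP is the inline-min-speedup value.  */
bool
big_speedup_p (double percent, int min_speedup)
{
  return percent > min_speedup;
}
```

With estimates of 5000 before and 4000 after inlining, speedup_percent gives
20, while capped_speedup_percent gives 0 because both values clamp to 1000.
A combo going from 100 to 90 passes the old threshold of 8 but not the new
threshold of 15.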

I have benchmarked the patch on Haswell with SPEC2000, SPEC2006, polyhedron
and our C++ benchmarks.  I found no off-noise changes on SPEC2000/2006.  I
know that reducing inline-insns-auto to 10 still produces no regressions and
even improves facerec 6600->8000, but that seems to be a bit of good luck (it
also depends on the setting of branch predictor weights and needs to be
analyzed independently).  min-speedup can be increased to 30 without
measurable effects as well.

On the C++ benchmark suite I know that cray degrades with min-speedup set to
30 (it needs a value of 22).  Also there is a degradation with
profile-generate on tramp3d.

So overall I believe that for Haswell the reduction of inline limits gives a
very consistent code size improvement without performance tradeoffs.

I also tested Itanium and here things are slightly more sensitive.  The
reduction of limits affects gzip 337->332 (-1.5%), vpr 1000->980 (-2%), crafty
925->935 (+2%) and vortex 1165->1180 (+1%).  So overall it is specint2000
neutral.  Reducing inline-insns-auto to 10 brings an off-noise overall
degradation of 1%, and 20 is in between.

specfp2000 reacts positively, improving applu 520->525 (+1%) and mgrid
391->397 (+1.3%).  It would let me reduce inline-insns-auto to 10 without
any other regressions.

C++ benchmarks do not show any off-noise changes.

I have also done some limited testing on ppc and arm.  They reacted similarly
to Haswell, also showing no important changes from reducing the inlining
limits.

Now reducing inline limits triggers a failure of testsuite/g++.dg/pr83239.C,
which tests that inlining happens.  The reason why it does not happen is that
ipa-fnsplit is trying to second-guess whether the inliner will eventually
consider the function for inlining, and the test is out of date.  I decided to
hack around it for stage4 and will try to clean these things up next stage1.
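
The workaround amounts to loosening the size test in consider_split by a fixed
slack.  A minimal sketch of the adjusted comparison follows; the helper name
and the plain unsigned parameters are hypothetical stand-ins for GCC's
internal macros and data structures, and only the slack of 10 is taken from
the patch itself.

```c
#include <stdbool.h>

/* Slack added by the patch; the value 10 comes from the diff.  */
#define SPLIT_INLINE_SLACK 10u

/* Hypothetical helper, not a GCC function.  Returns true when the
   header left behind by splitting is too large to be considered an
   inlining candidate, so the split is refused.  */
bool
header_too_large_p (unsigned header_size, unsigned call_overhead,
                    bool declared_inline, unsigned max_insns_single,
                    unsigned max_insns_auto)
{
  /* Functions declared inline are judged against the larger limit.  */
  unsigned limit = declared_inline ? max_insns_single : max_insns_auto;
  return header_size + call_overhead >= limit + SPLIT_INLINE_SLACK;
}
```

With max-inline-insns-auto at 30, a header of size 25 plus a call overhead of
10 would have been refused without the slack (35 >= 30) but is accepted with
it (35 < 40).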

Bootstrapped/regtested x86_64-linux.  I know it is late in stage4, but would it
be OK for GCC 8?

	PR middle-end/83665
	* params.def (inline-min-speedup): Increase from 8 to 15.
	(max-inline-insns-auto): Decrease from 40 to 30.
	* ipa-split.c (consider_split): Add some buffer for function to
	be considered inlining candidate.
	* invoke.texi (max-inline-insns-auto, inline-min-speedup): Update
	default values.

Comments

Richard Biener Feb. 12, 2018, 9:22 a.m. | #1
On Fri, 9 Feb 2018, Jan Hubicka wrote:

> Bootstrapped/regtested x86_64-linux.  I know it is late in stage4, but would it
> be OK for GCC 8?

Ok.

Richard.

Martin Liška Feb. 13, 2018, 1:12 p.m. | #2
Hi.

I see quite a few SPEC 2006 benchmark changes for -Ofast -march=native:
(note that not all size changes are presented)

1) gillan (AMD bulldozer):

+------------------------------------------+-------+----------+----------+---+-------+-------+
| Performance Regressions - Execution Time |   Δ   | Previous | Current  | σ | Δ (B) | σ (B) |
+------------------------------------------+-------+----------+----------+---+-------+-------+
| SPEC/SPEC2006/FP/453.povray              | 6.06% | 180.9933 | 191.9582 | - | 3.48% |     - |
| SPEC/SPEC2006/FP/410.bwaves              | 5.13% | 307.0029 | 322.7495 | - | 8.11% |     - |
| SPEC/SPEC2006/FP/433.milc                | 4.80% | 310.4705 | 325.3810 | - | 4.01% |     - |
| SPEC/SPEC2006/FP/450.soplex              | 3.50% | 449.4463 | 465.1627 | - | 3.09% |     - |
| SPEC/SPEC2006/INT/471.omnetpp            | 1.41% | 562.2769 | 570.1829 | - | 1.65% |     - |
| SPEC/SPEC2006/FP/447.dealII              | 1.40% | 359.1753 | 364.1933 | - | 1.66% |     - |
| SPEC/SPEC2006/INT/483.xalancbmk          | 1.00% | 430.9722 | 435.2821 | - | 7.59% |     - |
+------------------------------------------+-------+----------+----------+---+-------+-------+

+-------------------------------------------+--------+----------+----------+---+--------+-------+
| Performance Improvements - Execution Time |   Δ    | Previous | Current  | σ | Δ (B)  | σ (B) |
+-------------------------------------------+--------+----------+----------+---+--------+-------+
| SPEC/SPEC2006/INT/429.mcf                 | -7.93% | 606.7478 | 558.6419 | - | -7.42% |     - |
| SPEC/SPEC2006/INT/473.astar               | -2.07% | 577.5786 | 565.6328 | - | -6.16% |     - |
| SPEC/SPEC2006/FP/435.gromacs              | -2.01% | 329.9626 | 323.3386 | - | -0.34% |     - |
| SPEC/SPEC2006/INT/458.sjeng               | -1.71% | 667.0133 | 655.5964 | - | -1.54% |     - |
| SPEC/SPEC2006/FP/436.cactusADM            | -1.07% | 379.5225 | 375.4560 | - | -4.12% |     - |
+-------------------------------------------+--------+----------+----------+---+--------+-------+

+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+
|          Performance Improvements - size          |    Δ    |   Previous   |   Current    | σ |  Δ (B)  | σ (B) |
+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+
| SPEC/SPEC2006/FP/482.sphinx3/elf/sections/text    | -15.92% |  244450.0000 |  205522.0000 | - | -9.45%  |     - |
| SPEC/SPEC2006/FP/482.sphinx3/elf                  | -12.16% |  335640.0000 |  294816.0000 | - | -6.46%  |     - |
| SPEC/SPEC2006/INT/456.hmmer/elf/sections/text     | -12.14% |  395778.0000 |  347746.0000 | - | -9.75%  |     - |
| SPEC/SPEC2006/INT/483.xalancbmk/elf/sections/text | -10.43% | 3282082.0000 | 2939698.0000 | - | -11.33% |     - |
| SPEC/SPEC2006/FP/453.povray/elf/sections/text     | -9.48%  | 1004658.0000 |  909426.0000 | - | -12.68% |     - |
| SPEC/SPEC2006/INT/400.perlbench/elf/sections/text | -9.42%  | 1141458.0000 | 1033890.0000 | - | -9.75%  |     - |
| SPEC/SPEC2006/INT/456.hmmer/elf                   | -8.77%  |  508624.0000 |  464024.0000 | - | -7.28%  |     - |
+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+

2) AMD Ryzen 5 machine:

+------------------------------------------+-------+----------+----------+---+-------+-------+
| Performance Regressions - Execution Time |   Δ   | Previous | Current  | σ | Δ (B) | σ (B) |
+------------------------------------------+-------+----------+----------+---+-------+-------+
| SPEC/SPEC2006/FP/450.soplex              | 6.28% | 204.6552 | 217.5132 | - | 8.06% |     - |
| SPEC/SPEC2006/FP/433.milc                | 6.11% | 200.5912 | 212.8397 | - | 5.81% |     - |
| SPEC/SPEC2006/FP/410.bwaves              | 3.28% | 145.9785 | 150.7609 | - | 5.37% |     - |
| SPEC/SPEC2006/INT/462.libquantum         | 1.98% | 306.7858 | 312.8472 | - | 2.14% |     - |
+------------------------------------------+-------+----------+----------+---+-------+-------+

+-------------------------------------------+--------+----------+----------+---+--------+-------+
| Performance Improvements - Execution Time |   Δ    | Previous | Current  | σ | Δ (B)  | σ (B) |
+-------------------------------------------+--------+----------+----------+---+--------+-------+
| SPEC/SPEC2006/INT/429.mcf                 | -4.71% | 255.8058 | 243.7667 | - | -1.18% |     - |
| SPEC/SPEC2006/INT/473.astar               | -3.33% | 361.8033 | 349.7520 | - | 0.67%  |     - |
| SPEC/SPEC2006/FP/482.sphinx3              | -1.28% | 301.7175 | 297.8409 | - | 5.49%  |     - |
+-------------------------------------------+--------+----------+----------+---+--------+-------+

+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+
|          Performance Improvements - size          |    Δ    |   Previous   |   Current    | σ |  Δ (B)  | σ (B) |
+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+
| SPEC/SPEC2006/FP/482.sphinx3/elf/sections/text    | -17.60% |  202370.0000 |  166754.0000 | - | -9.65%  |     - |
| SPEC/SPEC2006/FP/482.sphinx3/elf                  | -12.67% |  290680.0000 |  253856.0000 | - | -7.42%  |     - |
| SPEC/SPEC2006/INT/456.hmmer/elf/sections/text     | -10.58% |  340802.0000 |  304754.0000 | - | -7.92%  |     - |
| SPEC/SPEC2006/INT/400.perlbench/elf/sections/text | -9.50%  | 1166962.0000 | 1056066.0000 | - | -9.89%  |     - |
| SPEC/SPEC2006/FP/453.povray/elf/sections/text     | -9.48%  |  993042.0000 |  898882.0000 | - | -13.33% |     - |
| SPEC/SPEC2006/INT/483.xalancbmk/elf/sections/text | -9.13%  | 3225538.0000 | 2931026.0000 | - | -9.93%  |     - |
| SPEC/SPEC2006/INT/456.hmmer/elf                   | -8.01%  |  455472.0000 |  418968.0000 | - | -5.44%  |     - |
| SPEC/SPEC2006/INT/400.perlbench/elf               | -7.45%  | 1538352.0000 | 1423800.0000 | - | -7.69%  |     - |
+---------------------------------------------------+---------+--------------+--------------+---+---------+-------+
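
For reference, the Δ column appears to be the relative change from Previous to
Current, in percent; the rows above are consistent with that reading.  A small
sketch under that assumption (the helper name is made up and is not part of
the reporting tool):

```c
/* Relative change from PREVIOUS to CURRENT in percent.  Negative
   values mean the metric went down (faster run or smaller binary).  */
double
percent_change (double previous, double current)
{
  return (current - previous) * 100.0 / previous;
}
```

For example, percent_change (244450, 205522) is about -15.92, matching the
sphinx3 .text row on gillan, and percent_change (180.9933, 191.9582) is about
6.06, matching the povray execution time row.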

And I verified on my Haswell machine that sphinx3 really changes:

bloaty new -- old
     VM SIZE                     FILE SIZE
 ++++++++++++++ GROWING       ++++++++++++++
  +0.1%     +24 .eh_frame         +24  +0.1%
  [ = ]       0 .symtab           +24  +0.2%
  [ = ]       0 .strtab           +14  +0.2%
  +0.3%      +8 .eh_frame_hdr      +8  +0.3%

 -------------- SHRINKING     --------------
  [ = ]       0 .debug_loc     -164Ki -37.0%
  [ = ]       0 .debug_line    -118Ki -23.9%
 -16.7% -35.4Ki .text         -35.4Ki -16.7%
  [ = ]       0 .debug_ranges -33.1Ki -74.7%
  [ = ]       0 .debug_info   -31.0Ki  -9.8%
 -32.0%     -16 [Unmapped]       -679 -16.8%
  [ = ]       0 .debug_abbrev    -186  -0.5%

 -11.8% -35.3Ki TOTAL          -382Ki -23.2%

Martin

Jan Hubicka Feb. 13, 2018, 6:09 p.m. | #3
> Hi.
> 
> I see quite a few SPEC 2006 benchmark changes for -Ofast -march=native:
> (note that not all size changes are presented)

Thanks!  It is interesting that I do not see similar observations on Czerny,
which also runs spec2006.  In particular, both AMD machines agree on soplex,
milc and bwaves.  They are flat in the Czerny graphs, so it may again be the
dependency of AMD machines on code layout that we were seeing recently.
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/410_bwaves_recent_big.png
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/433_milc_recent_big.png
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/450_soplex_recent_big.png

We should take a look at soplex.  Megrez will pick up the patch tonight, so we
will see if the regression reproduces on both bulldozer machines.

Honza


Patch

Index: params.def
===================================================================
--- params.def	(revision 257520)
+++ params.def	(working copy)
@@ -52,13 +52,13 @@  DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCO
 DEFPARAM (PARAM_INLINE_MIN_SPEEDUP,
 	  "inline-min-speedup",
 	  "The minimal estimated speedup allowing inliner to ignore inline-insns-single and inline-insns-auto.",
-	  8, 0, 0)
+	  15, 0, 0)
 
 /* The single function inlining limit. This is the maximum size
    of a function counted in internal gcc instructions (not in
    real machine instructions) that is eligible for inlining
    by the tree inliner.
-   The default value is 450.
+   The default value is 400.
    Only functions marked inline (or methods defined in the class
    definition for C++) are affected by this.
    There are more restrictions to inlining: If inlined functions
@@ -77,11 +77,11 @@  DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
    that is applied to functions marked inlined (or defined in the
    class declaration in C++) given by the "max-inline-insns-single"
    parameter.
-   The default value is 40.  */
+   The default value is 30.  */
 DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
 	  "max-inline-insns-auto",
 	  "The maximum number of instructions when automatically inlining.",
-	  40, 0, 0)
+	  30, 0, 0)
 
 DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
 	  "max-inline-insns-recursive",
Index: ipa-split.c
===================================================================
--- ipa-split.c	(revision 257520)
+++ ipa-split.c	(working copy)
@@ -558,10 +558,13 @@  consider_split (struct split_point *curr
 		 "  Refused: split size is smaller than call overhead\n");
       return;
     }
+  /* FIXME: The logic here is not very precise, because inliner does use
+     inline predicates to reduce function body size.  We add 10 to anticipate
+     that.  Next stage1 we should try to be more meaningful here.  */
   if (current->header_size + call_overhead
       >= (unsigned int)(DECL_DECLARED_INLINE_P (current_function_decl)
 			? MAX_INLINE_INSNS_SINGLE
-			: MAX_INLINE_INSNS_AUTO))
+			: MAX_INLINE_INSNS_AUTO) + 10)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
@@ -574,7 +577,7 @@  consider_split (struct split_point *curr
      Limit this duplication.  This is consistent with limit in tree-sra.c  
      FIXME: with LTO we ought to be able to do better!  */
   if (DECL_ONE_ONLY (current_function_decl)
-      && current->split_size >= (unsigned int) MAX_INLINE_INSNS_AUTO)
+      && current->split_size >= (unsigned int) MAX_INLINE_INSNS_AUTO + 10)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file,
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 257520)
+++ doc/invoke.texi	(working copy)
@@ -10131,13 +10131,14 @@  a lot of functions that would otherwise
 by the compiler are investigated.  To those functions, a different
 (more restrictive) limit compared to functions declared inline can
 be applied.
-The default value is 40.
+The default value is 30.
 
 @item inline-min-speedup
 When estimated performance improvement of caller + callee runtime exceeds this
 threshold (in percent), the function can be inlined regardless of the limit on
 @option{--param max-inline-insns-single} and @option{--param
 max-inline-insns-auto}.
+The default value is 15.
 
 @item large-function-insns
 The limit specifying really large functions.  For functions larger than this