diff mbox series

-O2 inliner returning 1/n: reduce EARLY_INLINING_INSNS for O1 and O2

Message ID 20190916163914.macczibmd54rjqri@kam.mff.cuni.cz
State New
Headers show
Series -O2 inliner returning 1/n: reduce EARLY_INLINING_INSNS for O1 and O2 | expand

Commit Message

Jan Hubicka Sept. 16, 2019, 4:39 p.m. UTC
Hi,
as discussed on Cauldron this week I plan to push out changes enabling
-finline-functions at -O2 with limited parameters aiming to overal
better performance without large code size increases.

Currently we do inline agressively functions declared inline, we inline
when function size is expected to shrink and we also do limited
auto-inlining in early inliner for non-inline functions even if code
grows.  This is handled by PARAM_EARLY_INLINING_INSNS.

This patch tunes it down or -O2 in order to get some room for real
IPA inliner to do its work.
Combined efect of my chages are in
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_elf_detail_stats=on&min_percentage_change=0.001&revisions=ddee20190fa78935338bc3161c1b29b8528d82dd%2C9b247ee17d1030b88462531225cc842251507bb6

This involves further forking inline-insns-auto, inline-insns-single and
big-speedup params.

Generally I was able to mostly improve SPEC 2006 and 2017 scores as
follows:

O2 Kabylake
SPEC/SPEC2006/INT/total 		0.58% 	
SPEC/SPEC2006/FP/total 			0.19% 	
SPEC/SPEC2017/FP/total 		 	0.45% 	
SPEC/SPEC2017/INT/total 	 	0.18% 	

O2 LTO Kabylake
SPEC/SPEC2006/INT/total 	 	1.08% 	
SPEC/SPEC2006/FP/total 		 	0.60% 	

O2 Zen
SPEC/SPEC2006/INT/total 	 	1.64% 	
SPEC/SPEC2006/FP/total 		 	0.23% 	
SPEC/SPEC2017/INT/total 	 	-0.58% 	
SPEC/SPEC2017/FP/total 		 	0.52% 	

O2 Zen LTO
SPEC/SPEC2006/FP/total 	 	1.40% 	
SPEC/SPEC2006/INT/total	 	1.26% 	
SPEC/SPEC2017/INT/total 	0.93% 	
SPEC/SPEC2017/FP/total 		-0.22% 	

The SPEC2017 FP on Zen is affected by 10% regression on CactusBSSN that
seems to be due to microarchitectural behaviour depending on code layout
rather than any inlining changes in hot parts of program.  Other notable
regression is omnetpp that shows on Zen only too.  Comparing Zen and
Kaby result it seems that only consistent loser id gcc (3%) a xalancbmk
(2.8%) both with non-LTO only. I plan to investigate those if regression
persists even though it is bit small and there is no obvious problem in
the backtrace.

Code size improves by 0.67% or SPEC2006 non-LTO and regresses by 1.64% with LTO
For 2017 it is 2.2% improvement and 2.4% regression respectively.

The difference between LTO and non-LTO is mostly due to fact that LTO
units tends to hit overall unit growth cap of inlining since there are
too many inline candidates. For this reason the patch is not as
effective on Firefox and other realy big packages as I would like.  I
still plan number of changes to inliner this stage1 so this is not final
situation, but I think it is better to do the change early so it gets
tested on other architectures. (And it was concensus of the Caudlron
discussion by my understanding)

This patch is not enabling -finline-functions so it will temporarily
regress perofrmance (and improve code size). i am doing this in
incremental steps to get more data on both inliners.

Bootstrapped/regtested x86_64-linux, plan to commit it later today.

Honza

	* ipa-inline.c (want_early_inline_function_p): Use
	PARAM_EARLY_INLINING_INSNS_O2.
	* params.def (PARAM_EARLY_INLINING_INSNS_O2): New.
	(PARAM_EARLY_INLINING_INSNS): Update documentation.
	* invoke.texi (early-inlining-insns-O2): New.
	(early-inlining-insns): Update documentation.

Comments

Jan Hubicka Oct. 1, 2019, 5:04 p.m. UTC | #1
> 	* ipa-inline.c (want_early_inline_function_p): Use
> 	PARAM_EARLY_INLINING_INSNS_O2.
> 	* params.def (PARAM_EARLY_INLINING_INSNS_O2): New.
> 	(PARAM_EARLY_INLINING_INSNS): Update documentation.
> 	* invoke.texi (early-inlining-insns-O2): New.
> 	(early-inlining-insns): Update documentation.
Hi,
this is a variant of patch with testsuite compensation I comitted today.
There are some cases where we need early inlining to happen in order to
get the even we look for.

In most cases this will go back once late inlining autoinline,
but I filled https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91954
for missed vectorization issue and 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91955
for Wstringop.

Honza

	* doc/invoke.texi (early-inlining-insns-O2): Document.
	(early-inlining-insns): Update.
	* params.def (early-inlining-insns-O2): New bound.
	(early-inlining-insns): Update docs.
	* ipa-inline.c (want_early_inline_function_p): Use new bound.

	* g++.dg/tree-ssa/pr61034.C: Set early-inlining-insns-O2=14.
	* g++.dg/tree-ssa/pr8781.C: Likewise.
	* g++.dg/warn/Wstringop-truncation-1.C: Likewise.
	* gcc.dg/ipa/pr63416.c: likewise.
	* gcc.dg/vect/pr66142.c: Likewise.
	* gcc.dg/tree-ssa/ssa-thread-12.c: Mark compure_idf inline.
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 276272)
+++ doc/invoke.texi	(working copy)
@@ -11291,9 +11291,17 @@ recursion depth can be guessed from the
 via a given call expression.  This parameter limits inlining only to call
 expressions whose probability exceeds the given threshold (in percents).
 
+@item early-inlining-insns-O2
+Specify growth that the early inliner can make.  In effect it increases
+the amount of inlining for code having a large abstraction penalty.
+This is applied to functions compiled with @option{-O1} or @option{-O2}
+optimization levels.
+
 @item early-inlining-insns
 Specify growth that the early inliner can make.  In effect it increases
 the amount of inlining for code having a large abstraction penalty.
+This is applied to functions compiled with @option{-O3} or @option{-Ofast}
+optimization levels.
 
 @item max-early-inliner-iterations
 Limit of iterations of the early inliner.  This basically bounds
Index: ipa-inline.c
===================================================================
--- ipa-inline.c	(revision 276272)
+++ ipa-inline.c	(working copy)
@@ -641,6 +641,10 @@ want_early_inline_function_p (struct cgr
     {
       int growth = estimate_edge_growth (e);
       int n;
+      int early_inlining_insns = opt_for_fn (e->caller->decl, optimize) >= 3
+				 ? PARAM_VALUE (PARAM_EARLY_INLINING_INSNS)
+				 : PARAM_VALUE (PARAM_EARLY_INLINING_INSNS_O2);
+
 
       if (growth <= PARAM_VALUE (PARAM_MAX_INLINE_INSNS_SIZE))
 	;
@@ -654,26 +658,28 @@ want_early_inline_function_p (struct cgr
 			     growth);
 	  want_inline = false;
 	}
-      else if (growth > PARAM_VALUE (PARAM_EARLY_INLINING_INSNS))
+      else if (growth > early_inlining_insns)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, e->call_stmt,
 			     "  will not early inline: %C->%C, "
-			     "growth %i exceeds --param early-inlining-insns\n",
-			     e->caller, callee,
-			     growth);
+			     "growth %i exceeds --param early-inlining-insns%s\n",
+			     e->caller, callee, growth,
+			     opt_for_fn (e->caller->decl, optimize) >= 3
+			     ? "" : "-O2");
 	  want_inline = false;
 	}
       else if ((n = num_calls (callee)) != 0
-	       && growth * (n + 1) > PARAM_VALUE (PARAM_EARLY_INLINING_INSNS))
+	       && growth * (n + 1) > early_inlining_insns)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, e->call_stmt,
 			     "  will not early inline: %C->%C, "
-			     "growth %i exceeds --param early-inlining-insns "
+			     "growth %i exceeds --param early-inlining-insns%s "
 			     "divided by number of calls\n",
-			     e->caller, callee,
-			     growth);
+			     e->caller, callee, growth,
+			     opt_for_fn (e->caller->decl, optimize) >= 3
+			     ? "" : "-O2");
 	  want_inline = false;
 	}
     }
Index: params.def
===================================================================
--- params.def	(revision 276272)
+++ params.def	(working copy)
@@ -233,8 +233,12 @@ DEFPARAM(PARAM_IPCP_UNIT_GROWTH,
 	 10, 0, 0)
 DEFPARAM(PARAM_EARLY_INLINING_INSNS,
 	 "early-inlining-insns",
-	 "Maximal estimated growth of function body caused by early inlining of single call.",
+	 "Maximal estimated growth of function body caused by early inlining of single call with -O3 and -Ofast.",
 	 14, 0, 0)
+DEFPARAM(PARAM_EARLY_INLINING_INSNS_O2,
+	 "early-inlining-insns-O2",
+	 "Maximal estimated growth of function body caused by early inlining of single call with -O1 and -O2.",
+	 6, 0, 0)
 DEFPARAM(PARAM_LARGE_STACK_FRAME,
 	 "large-stack-frame",
 	 "The size of stack frame to be considered large.",
Index: testsuite/g++.dg/tree-ssa/pr61034.C
===================================================================
--- testsuite/g++.dg/tree-ssa/pr61034.C	(revision 276272)
+++ testsuite/g++.dg/tree-ssa/pr61034.C	(working copy)
@@ -1,5 +1,5 @@
 // { dg-do compile }
-// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized -fdelete-null-pointer-checks" }
+// { dg-options "-O2 -fdump-tree-fre3 -fdump-tree-optimized -fdelete-null-pointer-checks --param early-inlining-insns-O2=14" }
 
 #define assume(x) if(!(x))__builtin_unreachable()
 
Index: testsuite/g++.dg/tree-ssa/pr8781.C
===================================================================
--- testsuite/g++.dg/tree-ssa/pr8781.C	(revision 276272)
+++ testsuite/g++.dg/tree-ssa/pr8781.C	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O -fno-tree-sra -fdump-tree-fre1" } */
+/* { dg-options "-O -fno-tree-sra -fdump-tree-fre1 --param early-inlining-insns-O2=14" } */
 
 int f();
 
Index: testsuite/g++.dg/warn/Wstringop-truncation-1.C
===================================================================
--- testsuite/g++.dg/warn/Wstringop-truncation-1.C	(revision 276272)
+++ testsuite/g++.dg/warn/Wstringop-truncation-1.C	(working copy)
@@ -1,7 +1,7 @@
 /* PR/tree-optimization/84480 - bogus -Wstringop-truncation despite
    assignment with an inlined string literal
    { dg-do compile }
-   { dg-options "-O2 -Wstringop-truncation" }  */
+   { dg-options "-O2 -Wstringop-truncation --param early-inlining-insns-O2=14" }  */
 
 #include <string.h>
 
Index: testsuite/gcc.dg/ipa/pr63416.c
===================================================================
--- testsuite/gcc.dg/ipa/pr63416.c	(revision 276272)
+++ testsuite/gcc.dg/ipa/pr63416.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-optimized"  } */
+/* { dg-options "-O2 -fdump-tree-optimized --param early-inlining-insns-O2=14"  } */
 #define _UNUSED_ __attribute__((__unused__))
 
 typedef int TEST_F30 (int *v);
Index: testsuite/gcc.dg/tree-ssa/ssa-thread-12.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(revision 276272)
+++ testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(working copy)
@@ -56,7 +56,7 @@ bmp_iter_and_compl (bitmap_iterator * bi
 }
 
 extern int VEC_int_base_length (VEC_int_base *);
-bitmap
+inline bitmap
 compute_idf (bitmap def_blocks, bitmap_head * dfs)
 {
   bitmap_iterator bi;
Index: testsuite/gcc.dg/vect/pr66142.c
===================================================================
--- testsuite/gcc.dg/vect/pr66142.c	(revision 276272)
+++ testsuite/gcc.dg/vect/pr66142.c	(working copy)
@@ -1,6 +1,6 @@
 /* PR middle-end/66142 */
 /* { dg-do compile } */
-/* { dg-additional-options "-ffast-math -fopenmp-simd" } */
+/* { dg-additional-options "-ffast-math -fopenmp-simd --param early-inlining-insns-O2=14" } */
 /* { dg-additional-options "-mavx" { target avx_runtime } } */
 
 struct A { float x, y; };
Jakub Jelinek Oct. 2, 2019, 10:39 a.m. UTC | #2
On Tue, Oct 01, 2019 at 07:04:53PM +0200, Jan Hubicka wrote:
> 	* gcc.dg/tree-ssa/ssa-thread-12.c: Mark compure_idf inline.

> --- testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(revision 276272)
> +++ testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(working copy)
> @@ -56,7 +56,7 @@ bmp_iter_and_compl (bitmap_iterator * bi
>  }
>  
>  extern int VEC_int_base_length (VEC_int_base *);
> -bitmap
> +inline bitmap
>  compute_idf (bitmap def_blocks, bitmap_head * dfs)
>  {
>    bitmap_iterator bi;

Neither this change nor the followup in http://gcc.gnu.org/r276418
actually works.
PASS: gcc.dg/tree-ssa/ssa-thread-12.c (test for excess errors)
gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread2 "FSM"
gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread3 "FSM"
gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread4 "FSM"

We shouldn't really have UNRESOLVED testcases in the testsuite.

The reason is that with the static __inline__ bitmap in there, no function
is actually emitted as nothing uses it, so there are no dumps other than
original (and 4 empty ones) with -fdump-tree-all.

	Jakub
Jan Hubicka Oct. 2, 2019, 10:42 a.m. UTC | #3
> On Tue, Oct 01, 2019 at 07:04:53PM +0200, Jan Hubicka wrote:
> > 	* gcc.dg/tree-ssa/ssa-thread-12.c: Mark compure_idf inline.
> 
> > --- testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(revision 276272)
> > +++ testsuite/gcc.dg/tree-ssa/ssa-thread-12.c	(working copy)
> > @@ -56,7 +56,7 @@ bmp_iter_and_compl (bitmap_iterator * bi
> >  }
> >  
> >  extern int VEC_int_base_length (VEC_int_base *);
> > -bitmap
> > +inline bitmap
> >  compute_idf (bitmap def_blocks, bitmap_head * dfs)
> >  {
> >    bitmap_iterator bi;
> 
> Neither this change nor the followup in http://gcc.gnu.org/r276418
> actually works.
> PASS: gcc.dg/tree-ssa/ssa-thread-12.c (test for excess errors)
> gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
> UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread2 "FSM"
> gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
> UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread3 "FSM"
> gcc.dg/tree-ssa/ssa-thread-12.c: dump file does not exist
> UNRESOLVED: gcc.dg/tree-ssa/ssa-thread-12.c scan-tree-dump thread4 "FSM"
> 
> We shouldn't really have UNRESOLVED testcases in the testsuite.
> 
> The reason is that with the static __inline__ bitmap in there, no function
> is actually emitted as nothing uses it, so there are no dumps other than
> original (and 4 empty ones) with -fdump-tree-all.

Aha, sorry for that.  I was looking for FAILs and missed the UNRESOLVED.
I will revert that change and go with --param then.  The testcase is
somewhat fragile depending on particular chain of inliner decisions.

Honza
> 
> 	Jakub
diff mbox series

Patch

Index: ipa-inline.c
===================================================================
--- ipa-inline.c	(revision 275716)
+++ ipa-inline.c	(working copy)
@@ -641,6 +641,10 @@  want_early_inline_function_p (struct cgr
     {
       int growth = estimate_edge_growth (e);
       int n;
+      int early_inlining_insns = opt_for_fn (e->caller->decl, optimize) >= 3
+				 ? PARAM_VALUE (PARAM_EARLY_INLINING_INSNS)
+				 : PARAM_VALUE (PARAM_EARLY_INLINING_INSNS_O2);
+
 
       if (growth <= PARAM_VALUE (PARAM_MAX_INLINE_INSNS_SIZE))
 	;
@@ -654,26 +658,28 @@  want_early_inline_function_p (struct cgr
 			     growth);
 	  want_inline = false;
 	}
-      else if (growth > PARAM_VALUE (PARAM_EARLY_INLINING_INSNS))
+      else if (growth > early_inlining_insns)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, e->call_stmt,
 			     "  will not early inline: %C->%C, "
-			     "growth %i exceeds --param early-inlining-insns\n",
-			     e->caller, callee,
-			     growth);
+			     "growth %i exceeds --param early-inlining-insns%s\n",
+			     e->caller, callee, growth,
+			     opt_for_fn (e->caller->decl, optimize) >= 3
+			     ? "" : "-O2");
 	  want_inline = false;
 	}
       else if ((n = num_calls (callee)) != 0
-	       && growth * (n + 1) > PARAM_VALUE (PARAM_EARLY_INLINING_INSNS))
+	       && growth * (n + 1) > early_inlining_insns)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, e->call_stmt,
 			     "  will not early inline: %C->%C, "
-			     "growth %i exceeds --param early-inlining-insns "
+			     "growth %i exceeds --param early-inlining-insns%s "
 			     "divided by number of calls\n",
-			     e->caller, callee,
-			     growth);
+			     e->caller, callee, growth,
+			     opt_for_fn (e->caller->decl, optimize) >= 3
+			     ? "" : "-O2");
 	  want_inline = false;
 	}
     }
Index: params.def
===================================================================
--- params.def	(revision 275716)
+++ params.def	(working copy)
@@ -233,8 +233,12 @@  DEFPARAM(PARAM_IPCP_UNIT_GROWTH,
 	 10, 0, 0)
 DEFPARAM(PARAM_EARLY_INLINING_INSNS,
 	 "early-inlining-insns",
-	 "Maximal estimated growth of function body caused by early inlining of single call.",
+	 "Maximal estimated growth of function body caused by early inlining of single call with -O3 and -Ofast.",
 	 14, 0, 0)
+DEFPARAM(PARAM_EARLY_INLINING_INSNS_O2,
+	 "early-inlining-insns-O2",
+	 "Maximal estimated growth of function body caused by early inlining of single call with -O1 and -O2.",
+	 6, 0, 0)
 DEFPARAM(PARAM_LARGE_STACK_FRAME,
 	 "large-stack-frame",
 	 "The size of stack frame to be considered large.",
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 275716)
+++ doc/invoke.texi	(working copy)
@@ -11290,9 +11290,17 @@  recursion depth can be guessed from the
 via a given call expression.  This parameter limits inlining only to call
 expressions whose probability exceeds the given threshold (in percents).
 
+@item early-inlining-insns-O2
+Specify growth that the early inliner can make.  In effect it increases
+the amount of inlining for code having a large abstraction penalty.
+This is applied to functions compiled with @option{-O1} or @option{-O2}
+optimization levels.
+
 @item early-inlining-insns
 Specify growth that the early inliner can make.  In effect it increases
 the amount of inlining for code having a large abstraction penalty.
+This is applied to functions compiled with @option{-O3} or @option{-Ofast}
+optimization levels.
 
 @item max-early-inliner-iterations
 Limit of iterations of the early inliner.  This basically bounds