
Enable inliner to bypass inline-insns-single/auto when it knows the performance will improve

Message ID 20121107094038.GB29254@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Nov. 7, 2012, 9:40 a.m. UTC
Hi,
with the inliner predicates, the inliner heuristic is now able to prove that
some of the inlined function body will be optimized out after inlining.
This makes it possible to estimate the speedup, which is now used to drive
the badness metric but is ignored in the actual decision of whether a
function is an inline candidate.

In general, the decision on when to inline can be
 1) conservative on code size - when we know the code will shrink, it is
    almost surely a win
 2) an uninformed guess - we can just inline and hope something will simplify;
    this makes sense for small enough functions, especially when the user
    asks for -O3
 3) an informed inline - we know something important will simplify.

We already have inline hints handling some cases of 3), like loop strides
and bounds.  This patch just adds the time-based hint: if the estimated
speedup of the caller+callee runtime exceeds 10%, inlining is quite likely
a win.  The inlining still may not happen in the end due to other inlining
limits.
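
As a simplified sketch of the check (the real implementation is the
big_speedup_p function in the patch below; the names and the numbers in the
comment are only illustrative):

  /* Return true when inlining saves more than THRESHOLD percent of the
     estimated combined caller+callee runtime.  THRESHOLD corresponds to
     --param inline-min-speedup, 10 by default.  */
  static bool
  speedup_exceeds_threshold (long long time, long long inlined_time,
                             int threshold)
  {
    /* Example: time = 1000, inlined_time = 850.  The saving of 150 is
       bigger than 1000 * 10 / 100 = 100, so the size limits may be
       bypassed.  */
    return time - inlined_time > time * threshold / 100;
  }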

Bootstrapped/regtested on x86_64-linux. Benchmarked on SPEC2k, SPEC2k6, C++
tests, polyhedron and Mozilla.  The largest single win is on c-ray, where we
now inline ray_sphere because it will become loop invariant.  There are also
improvements on polyhedron and Mozilla.

I will commit it today or tomorrow, depending on when the autotesters hit the
other changes.

Honza

	PR middle-end/48636
	* ipa-inline.c (big_speedup_p): New function.
	(want_inline_small_function_p): Use it.
	(edge_badness): Dump it.
	* params.def (inline-min-speedup): New parameter.
	* doc/invoke.texi (inline-min-speedup): Document.

Index: doc/invoke.texi
===================================================================
*** doc/invoke.texi	(revision 193284)
--- doc/invoke.texi	(working copy)
*************** by the compiler are investigated.  To th
*** 8941,8946 ****
--- 8941,8952 ----
  be applied.
  The default value is 40.
  
+ @item inline-min-speedup
+ When the estimated performance improvement of caller + callee runtime exceeds this
+ threshold (in percent), the function can be inlined regardless of the limits on
+ @option{--param max-inline-insns-single} and @option{--param
+ max-inline-insns-auto}.
+ 
  @item large-function-insns
  The limit specifying really large functions.  For functions larger than this
  limit after inlining, inlining is constrained by
Index: ipa-inline.c
===================================================================
*** ipa-inline.c	(revision 193284)
--- ipa-inline.c	(working copy)
*************** compute_inlined_call_time (struct cgraph
*** 493,498 ****
--- 493,514 ----
    return time;
  }
  
+ /* Return true if the speedup for inlining E is bigger than
+    PARAM_INLINE_MIN_SPEEDUP.  */
+ 
+ static bool
+ big_speedup_p (struct cgraph_edge *e)
+ {
+   gcov_type time = compute_uninlined_call_time (inline_summary (e->callee),
+ 					  e);
+   gcov_type inlined_time = compute_inlined_call_time (e,
+ 					        estimate_edge_time (e));
+   if (time - inlined_time
+       > RDIV (time * PARAM_VALUE (PARAM_INLINE_MIN_SPEEDUP), 100))
+     return true;
+   return false;
+ }
+ 
  /* Return true if we are interested in inlining small function.
     When REPORT is true, report reason to dump file.  */
  
*************** want_inline_small_function_p (struct cgr
*** 514,519 ****
--- 530,536 ----
      {
        int growth = estimate_edge_growth (e);
        inline_hints hints = estimate_edge_hints (e);
+       bool big_speedup = big_speedup_p (e);
  
        if (growth <= 0)
  	;
*************** want_inline_small_function_p (struct cgr
*** 521,526 ****
--- 538,544 ----
  	 hints suggests that inlining given function is very profitable.  */
        else if (DECL_DECLARED_INLINE_P (callee->symbol.decl)
  	       && growth >= MAX_INLINE_INSNS_SINGLE
+ 	       && !big_speedup
  	       && !(hints & (INLINE_HINT_indirect_call
  			     | INLINE_HINT_loop_iterations
  			     | INLINE_HINT_loop_stride)))
*************** want_inline_small_function_p (struct cgr
*** 574,579 ****
--- 592,598 ----
  	 Upgrade it to MAX_INLINE_INSNS_SINGLE when hints suggests that
  	 inlining given function is very profitable.  */
        else if (!DECL_DECLARED_INLINE_P (callee->symbol.decl)
+ 	       && !big_speedup
  	       && growth >= ((hints & (INLINE_HINT_indirect_call
  				       | INLINE_HINT_loop_iterations
  				       | INLINE_HINT_loop_stride))
*************** edge_badness (struct cgraph_edge *edge, 
*** 836,841 ****
--- 855,862 ----
  	       growth,
  	       edge_time);
        dump_inline_hints (dump_file, hints);
+       if (big_speedup_p (edge))
+ 	fprintf (dump_file, " big_speedup");
        fprintf (dump_file, "\n");
      }

Comments

Steven Bosscher Nov. 7, 2012, 9:44 a.m. UTC | #1
On Wed, Nov 7, 2012 at 10:40 AM, Jan Hubicka wrote:
> Hi,
> with the inliner predicates, the inliner heuristic is now able to prove that
> some of the inlined function body will be optimized out after inlining.
> This makes it possible to estimate the speedup, which is now used to drive
> the badness metric but is ignored in the actual decision of whether a
> function is an inline candidate.

Is it really still the time for this kind of change? Development
stage 3 means "regression fixes only", and this isn't a regression...

Ciao!
Steven
Jan Hubicka Nov. 7, 2012, 10:38 a.m. UTC | #2
> On Wed, Nov 7, 2012 at 10:40 AM, Jan Hubicka wrote:
> > Hi,
> > with the inliner predicates, the inliner heuristic is now able to prove that
> > some of the inlined function body will be optimized out after inlining.
> > This makes it possible to estimate the speedup, which is now used to drive
> > the badness metric but is ignored in the actual decision of whether a
> > function is an inline candidate.
> 
> Is it really still the time for this kind of change? Development
> stage 3 means "regression fixes only", and this isn't a regression...

I discussed with Jakub/Richi that I would like to do inliner heuristic
re-tuning in early stage 3, and this is part of it.  I am hoping to be done
soon.  While the changes were done a while ago, I am pushing them out slowly
so they can be independently benchmarked; I am not able to do too many
SPEC2k6 runs in a week.

I had a bit of a hard time getting the inliner to the level of 4.7 on Mozilla
LTO and tramp3d, which are both hard to analyze.  This turned out to be mostly
the addr_expr issues: we no longer forward propagate as much as we did, in
order to keep info for the objsize pass, and that made a lot of C++
abstraction no longer zero cost.  Also there was the stupid overflow in the
time metric making some inlining decisions completely random.  The inliner
seems to be in relatively good shape performance-wise, getting quite
consistent improvements in C++ (tramp3d is 50% smaller and faster than before,
wave and DLV also improved in both code size and speed, Mozilla is faster &
smaller and we now get smaller code from -Os than -O2 on the C++ stuff, LTO
SPEC builds got smaller with the same speed,
http://gcc.opensuse.org/c++bench-frescobaldi/).

Overall, I plan to add one extra inliner hint for array indexes to help
Fortran array descriptors, and to enable use of gcov's histograms, which
Google apparently forgot to do (that is FDO only). So if I wait today to see
the effect of the ipa-cp change that will probably go in, I should be done by
Saturday (at a speed of a patch a day).

Next week I plan to run some benchmarks to see if the inlining limits can be
pushed down a bit, but it does not seem to be critical.  Pushing overall
growth to 15% or less would work wonders for Firefox with LTO (that probably
won't matter much in practice since we are impractically slow and memory
hungry at WPA); reducing inline-insns-auto/single may work given that we can
now bypass them in the cases that matter. Neither one is too critical,
however.

I also still need to analyze the botan regression, which is the only one left
on the table for x86 (not necessarily inliner related), and see whether the
IA-64 regressions are inliner related or something else.  There is also the
EON regression at -O2 that seems to be related to an unrolling heuristic
decision. It seems to reproduce on AMD hardware only, so it may be a simple
code layout problem.

I also plan to look into the compile-time regression with a large number of
callees in a single function (one of the old Lucier PRs). This can be fixed by
incrementally updating the call statement costs as edges are added/removed
instead of recomputing them from scratch.
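
A rough sketch of that incremental bookkeeping (the names here are purely
hypothetical, not the actual inline summary structures):

  /* Hypothetical per-caller summary caching the aggregate cost of its
     call statements.  */
  struct caller_summary
  {
    int self_size;        /* Size of the body without call statements.  */
    int call_stmt_size;   /* Cached sum of the costs of outgoing calls.  */
  };

  /* Adjust the cached sum when a single edge is added or removed,
     instead of walking all callees to recompute it.  */
  static void
  account_edge_added (struct caller_summary *sum, int edge_cost)
  {
    sum->call_stmt_size += edge_cost;
  }

  static void
  account_edge_removed (struct caller_summary *sum, int edge_cost)
  {
    sum->call_stmt_size -= edge_cost;
  }

That way adding or removing one edge is O(1) work instead of a walk over all
call statements of the caller.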

Honza
> 
> Ciao!
> Steven

Patch

Index: params.def
===================================================================
--- params.def  (revision 193286)
+++ params.def  (working copy)
@@ -46,6 +46,11 @@  DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCO
          "Maximal estimated outcome of branch considered predictable",
          2, 0, 50)
 
+DEFPARAM (PARAM_INLINE_MIN_SPEEDUP,
+         "inline-min-speedup",
+         "The minimal estimated speedup allowing inliner to ignore inline-insns-single and inline-isnsns-auto",
+         10, 0, 0)
+
 /* The single function inlining limit. This is the maximum size
    of a function counted in internal gcc instructions (not in
    real machine instructions) that is eligible for inlining