diff mbox

Partial inlining

Message ID 20100622105258.GA15547@kam.mff.cuni.cz
State New
Headers show

Commit Message

Jan Hubicka June 22, 2010, 10:52 a.m. UTC
Hi,
this patch implements function splitting that allows partial inlining.
Functions are split if there is BB whose frequency is smaller than frequency of
entry BB such that the control flow must past through it just once and
afterwards we don't use values computed beforehand.

I tested the patch on spec2k with and without LTO, spec2k6 and C++ benchmarks.
The storry is always pretty much the same, the patch allows to reduce function
size for inlining from 50 to 40 estimated instructions.  This is still a lot.
However previously we hit number of issues on spec2000 that are gone now.  First
offender is botan that gets huge (>40%) regressions.  It might be interesting to
analyze those.  Offnoise changes on spec2k6 are as follows:

SPEC2k6 time (bigger is better):
milc 6%
povray 4.2%
lbm -2%
wrf 1.1%

SPEC2k6 size (smaller is better):
Perlbench -1.3%
Bzip2 0.3%
gcc 0.1%
gobmk 0.4%
hmmer 0.2%
sjeng 0.1%
libquantum -36%
h264ref -0.7%
omnetpp 0.6%
astar +4%
xlanbmk -1%

So largest gain is probably 36% code size saving at libquantum and 6% speedup
on milc.

We get regression in lbm and some small code size growth when splitting makes
inlining possible, but the inlining is useless.

Total compile time on spec2k6 has reduced a bit (as result of libquantum I
guess), but within noise factor.

Bootstrap time seems also comparable even if we do quite a lot of splitting.

On GCC LTO bootstrap we actually do quite a lot of useful splitting, in
cgraph_node, some vector functions (splitting away the error code) and in many
places where we have cheap guard in header of fucntion.

Due to nature of the problem, there are many possible improvements to the pass,
I want to track them incrementally to avoid pass getting too complicated
without too much of testing.  In particular I think it would be useful

1) resolve some stupid limitations we have now (return by value, EH code etc)
2) make it split large functions (i.e. not to help inlining, but to speed up
   compilation)
3) Add support for passing non-SSA values (currently if split part needs something
   non-SSA we just give up)
4) Make split part to recompute cheap values computed in header
5) Allow splitting into multiple parts
6) ...

There is also problem with debug quality.  Currently the split function will
appear as two functions in backtrace.  I hoped to solve it same way as openMP
but as Jakub told me we don't solve it there and we would probably need dwarf
extension to properly declare split function.
Well, in any case we should be able to fix things back when the split part gets
inlined back to header.

I had the patch in my local tree for a while and it is not causing troubles,
does it seem OK for mainline?

Bootstrapped/regtested x86_64-linux.

Honza

	* tree-pass.h (pass_split_functions): Declare.
	* opts.c (decode_options): Enable function splitting at -O2
	* timevar.def (TV_IPA_FNSPLIT): New macro.
	* ipa-split.c: New file.
	* common.opt (-fpartial-inlining): New flag.
	* Makefile.in (ipa-split.o): New object file.
	* passes.c (init_optimization_passes): Add ipa-split.
	* params.def (max-inline-insns-auto): Reduce max-inline-insns-auto to 40.
	(partial-inlining-entry-probability): New parameters.
	* doc/invoke.texi (-fpartial-inlining): New.
	
	* testsuite/gcc.dg/tree-ssa/ipa-split.c

/* { dg-do compile } */
/* { dg-options "-O3 -fdump-tree-fnsplit" } */
int test2(a)
{
   if (a<100)
     return 1;
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   do_something_big ();
   return 0;
}

test()
{
  test2(10);
  test2(20);
}
/* { dg-final { scan-tree-dump-times "Splitting function" 1 "fnsplit"} } */
/* { dg-final { cleanup-tree-dump "fnsplit" } } */
Index: doc/invoke.texi
===================================================================
*** doc/invoke.texi	(revision 159621)
--- doc/invoke.texi	(working copy)
*************** Objective-C and Objective-C++ Dialects}.
*** 361,367 ****
  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
! -fpeel-loops -fpredictive-commoning -fprefetch-loop-arrays @gol
  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
  -fprofile-generate=@var{path} @gol
  -fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
--- 361,367 ----
  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
! -fpartial-inlining -fpeel-loops -fpredictive-commoning -fprefetch-loop-arrays @gol
  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
  -fprofile-generate=@var{path} @gol
  -fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
*************** also turns on the following optimization
*** 5851,5856 ****
--- 5851,5857 ----
  -findirect-inlining @gol
  -fipa-sra @gol
  -foptimize-sibling-calls @gol
+ -fpartial-inlining @gol
  -fpeephole2 @gol
  -fregmove @gol
  -freorder-blocks  -freorder-functions @gol
*************** This optimization is enabled by default.
*** 7001,7006 ****
--- 7002,7015 ----
  With this option, the compiler will create multiple copies of some
  local variables when unrolling a loop which can result in superior code.
  
+ @item -fpartial-inlining
+ @opindex fpartial-inlining
+ Inline parts of functions.  This option has any effect only
+ when inlining itself is turned on by the @option{-finline-functions}
+ or @option{-finline-small-functions} options.
+ 
+ Enabled at level @option{-O2}.
+ 
  @item -fpredictive-commoning
  @opindex fpredictive-commoning
  Perform predictive commoning optimization, i.e., reusing computations

Comments

Richard Biener June 22, 2010, 1:12 p.m. UTC | #1
On Tue, 22 Jun 2010, Jan Hubicka wrote:

> Hi,
> this patch implements function splitting that allows partial inlining.
> Functions are split if there is BB whose frequency is smaller than frequency of
> entry BB such that the control flow must past through it just once and
> afterwards we don't use values computed beforehand.
> 
> I tested the patch on spec2k with and without LTO, spec2k6 and C++ benchmarks.
> The storry is always pretty much the same, the patch allows to reduce function
> size for inlining from 50 to 40 estimated instructions.  This is still a lot.
> However previously we hit number of issues on spec2000 that are gone now.  First
> offender is botan that gets huge (>40%) regressions.  It might be interesting to
> analyze those.  Offnoise changes on spec2k6 are as follows:
> 
> SPEC2k6 time (bigger is better):
> milc 6%
> povray 4.2%
> lbm -2%
> wrf 1.1%
> 
> SPEC2k6 size (smaller is better):
> Perlbench -1.3%
> Bzip2 0.3%
> gcc 0.1%
> gobmk 0.4%
> hmmer 0.2%
> sjeng 0.1%
> libquantum -36%
> h264ref -0.7%
> omnetpp 0.6%
> astar +4%
> xlanbmk -1%
> 
> So largest gain is probably 36% code size saving at libquantum and 6% speedup
> on milc.
> 
> We get regression in lbm and some small code size growth when splitting makes
> inlining possible, but the inlining is useless.
> 
> Total compile time on spec2k6 has reduced a bit (as result of libquantum I
> guess), but within noise factor.
> 
> Bootstrap time seems also comparable even if we do quite a lot of splitting.
> 
> On GCC LTO bootstrap we actually do quite a lot of useful splitting, in
> cgraph_node, some vector functions (splitting away the error code) and in many
> places where we have cheap guard in header of fucntion.

The above numbers are all with the max-inline-insns-auto reduction?
I'd like to have that adjustment split out as a separate followup patch.

The partial-inlining-entry-probability default looks quite high.  I would
have expected that guessed probabilities of say

  if (flag)
    return;

make the entry taken 50% of the time?  So, did you experiment with
other values than 70 or is that something that is suggested by literature?

You run ipa-split late during early optimizations.  Will the new clone
you create go through the early opts again?  Thus, is the placement
at the very end of early opts critical?  If so please add a comment
before the NEXT_PASS.  I suppose we don't re-run early opts on the
clone as otherwise we could end up splitting it again (and so for
a very large function with cascaded if (x) cheap; if (x) cheap; ....
we will take a very long time to compile because we split of
each if (x) cheap; one after the other?  Please add a testcase
like this.)

+  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
+       parm = TREE_CHAIN (parm))
+    {
+      if (!is_gimple_reg (parm))
+       {
+         if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
+           {
+             if (dump_file)
+               fprintf (dump_file,
+                        "  Refused: need to pass non-ssa param 
values\n");

I'm confused by the non_ssa_vars thing.  This is just tracking
whether a non-SSA parameter is being used?

+  if (current->split_size <= call_overhead)
+    {
+      if (dump_file)
+       fprintf (dump_file,
+                "  Refused: split size is smaller than call overhead\n");
+      return;
+    }

also refuse to split if the size of the header is bigger than that
of the tail?  Otherwise we're just going to inline it back, no?

+  /* When there are non-ssa vars used in the split region, see if they
+     are used in the header region.  If so, reject the split.
+     FIXME: we can use nested function support to access both.  */
+  if (!bitmap_empty_p (non_ssa_vars))
+    {
+      basic_block bb;
+      FOR_EACH_BB (bb)
+       if (!bitmap_bit_p (current->split_bbs, bb->index))
+         {

use

  if (bitmap_bit_p (current->split_bbs, bb->index))
    continue;

to avoid excessive indentation.

+               if (is_gimple_debug (gsi_stmt (bsi)))
+                 continue;

I wonder if debug gets confused if you reference vars that are not 
there...

+           FOR_EACH_EDGE (e, ei, bb->succs)
+             if (e->dest == return_bb)
+               {
+                 for (bsi = gsi_start_phis (return_bb); !gsi_end_p (bsi);
+                      gsi_next (&bsi))
+                   {
+                     gimple stmt = gsi_stmt (bsi);
+                     tree op = gimple_phi_arg_def (stmt, e->dest_idx);
+
+                     if (!is_gimple_reg (gimple_phi_result (stmt)))
+                       continue;
+                     STRIP_NOPS (op);

There are never NOPs in PHI arguments.

+                     if (TREE_CODE (op) != SSA_NAME
+                         && test_nonssa_use (stmt, op, non_ssa_vars))
+                       {

+/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if 
none
+   found.  */

What about BUILT_IN_RETURN?  Or returns via EH or abnormally returning
fns like longjmp?

+static bool
+mark_nonssa_use (gimple stmt ATTRIBUTE_UNUSED, tree t,
+                void *data ATTRIBUTE_UNUSED)
+{
+  t = get_base_address (t);
+
+  if (!t || is_gimple_reg (t))
+    return false;
+
+  /* At present we can't pass non-SSA arguments to split function.
+     FIXME: this can be relaxed by passing references to arguments.  */
+  if (TREE_CODE (t) == PARM_DECL)
+    {
+      if (dump_file)
+       fprintf (dump_file, "Can not split use of non-ssa function 
parameter.\n");
+      return true;
+    }

I think this will do excessive dumping.  At least make it
dump_flags & TDF_DETAILS.

+/* Compute local properties of basic block BB we collect when looking for 
split points
+   point.

double "point"

+      if (gimple_code (stmt) == GIMPLE_RESX
+         && stmt_can_throw_external (stmt))
+       {
+         if (dump_file)
+           fprintf (dump_file, "Can not split external resx.\n");

Please make all these dumps && (dump_flags & TDF_DETAILS).

+      /* Check builtins that prevent splitting.  */
+      if (gimple_code (stmt) == GIMPLE_CALL
+         && (decl = gimple_call_fndecl (stmt)) != NULL_TREE
+         && DECL_BUILT_IN (decl))

DECL_BUILT_IN_CLASS (decl) == BUILT_IN_NORMAL

+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
+       if (TREE_CODE (op) == SSA_NAME)
+         bitmap_set_bit (set_ssa_names, SSA_NAME_VERSION (op));
+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
+       if (TREE_CODE (op) == SSA_NAME)
+         bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));

all SSA uses/defs are SSA_NAMEs, no need to check that.  Likewise
below in the two loops over PHIs.

+  FOR_EACH_EDGE (e, ei, bb->succs)
+    if (e->dest == return_bb)
+      {

if (e->dest != return_bb)
  continue;

+      /* We are wlaking acyclic graph, so edge_num counts

walking an

+             dest->aux = (void *)(ptrdiff_t)VEC_length (stack_entry, 
stack);

ugh.  Please use intptr_t, not ptrdiff_t.

+  /* See if the split functionwill return.  */

function will

+  /* Update return value.  This is bit tricky.  When we do not return,

Update the return value.  This is a bit tricky.

+     or produce new return statement.  */

a new return

+  if ((!node->callers || !node->callers->next_caller)
+      && !node->address_taken
+      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
+    {
+      if (dump_file)
+       fprintf (dump_file, "Not splitting: not called directly or called 
once.\n");
+      return 0;
+    }

Why the check for node->address_taken?  Why the check
for external visibility?

> Due to nature of the problem, there are many possible improvements to the pass,
> I want to track them incrementally to avoid pass getting too complicated
> without too much of testing.  In particular I think it would be useful
> 
> 1) resolve some stupid limitations we have now (return by value, EH code etc)
> 2) make it split large functions (i.e. not to help inlining, but to speed up
>    compilation)
> 3) Add support for passing non-SSA values (currently if split part needs something
>    non-SSA we just give up)
> 4) Make split part to recompute cheap values computed in header
> 5) Allow splitting into multiple parts
> 6) ...

6) Allow splitting of varargs functions (needed for libquantum)

> There is also problem with debug quality.  Currently the split function will
> appear as two functions in backtrace.  I hoped to solve it same way as openMP
> but as Jakub told me we don't solve it there and we would probably need dwarf
> extension to properly declare split function.
> Well, in any case we should be able to fix things back when the split part gets
> inlined back to header.
> 
> I had the patch in my local tree for a while and it is not causing troubles,
> does it seem OK for mainline?

Ok with the above changes (please go over all dumping and make sure
that without -details we only say that we are splitting and where
and dump the split function).

Please commit the inliner parm change separately, best after the
testers had a chance to catch up to the splitting once.

Thanks,
Richard.

> Bootstrapped/regtested x86_64-linux.
> 
> Honza
> 
> 	* tree-pass.h (pass_split_functions): Declare.
> 	* opts.c (decode_options): Enable function splitting at -O2
> 	* timevar.def (TV_IPA_FNSPLIT): New macro.
> 	* ipa-split.c: New file.
> 	* common.opt (-fpartial-inlining): New flag.
> 	* Makefile.in (ipa-split.o): New object file.
> 	* passes.c (init_optimization_passes): Add ipa-split.
> 	* params.def (max-inline-insns-auto): Reduce max-inline-insns-auto to 40.
> 	(partial-inlining-entry-probability): New parameters.
> 	* doc/invoke.texi (-fpartial-inlining): New.
> 	
> 	* testsuite/gcc.dg/tree-ssa/ipa-split.c
> 
> /* { dg-do compile } */
> /* { dg-options "-O3 -fdump-tree-fnsplit" } */
> int test2(a)
> {
>    if (a<100)
>      return 1;
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    do_something_big ();
>    return 0;
> }
> 
> test()
> {
>   test2(10);
>   test2(20);
> }
> /* { dg-final { scan-tree-dump-times "Splitting function" 1 "fnsplit"} } */
> /* { dg-final { cleanup-tree-dump "fnsplit" } } */
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h	(revision 160109)
> +++ tree-pass.h	(working copy)
> @@ -442,6 +442,7 @@ extern struct gimple_opt_pass pass_build
>  extern struct gimple_opt_pass pass_local_pure_const;
>  extern struct gimple_opt_pass pass_tracer;
>  extern struct gimple_opt_pass pass_warn_unused_result;
> +extern struct gimple_opt_pass pass_split_functions;
>  
>  /* IPA Passes */
>  extern struct simple_ipa_opt_pass pass_ipa_function_and_variable_visibility;
> Index: opts.c
> ===================================================================
> --- opts.c	(revision 160109)
> +++ opts.c	(working copy)
> @@ -910,6 +910,7 @@ decode_options (unsigned int argc, const
>    opt2 = (optimize >= 2);
>    flag_inline_small_functions = opt2;
>    flag_indirect_inlining = opt2;
> +  flag_partial_inlining = opt2;
>    flag_thread_jumps = opt2;
>    flag_crossjumping = opt2;
>    flag_optimize_sibling_calls = opt2;
> Index: timevar.def
> ===================================================================
> --- timevar.def	(revision 160109)
> +++ timevar.def	(working copy)
> @@ -52,6 +52,7 @@ DEFTIMEVAR (TV_CGRAPH                , "
>  DEFTIMEVAR (TV_CGRAPHOPT             , "callgraph optimization")
>  DEFTIMEVAR (TV_VARPOOL               , "varpool construction")
>  DEFTIMEVAR (TV_IPA_CONSTANT_PROP     , "ipa cp")
> +DEFTIMEVAR (TV_IPA_FNSPLIT           , "ipa function splitting")
>  DEFTIMEVAR (TV_IPA_LTO_GIMPLE_IO     , "ipa lto gimple I/O")
>  DEFTIMEVAR (TV_IPA_LTO_DECL_IO       , "ipa lto decl I/O")
>  DEFTIMEVAR (TV_IPA_LTO_DECL_INIT_IO  , "ipa lto decl init I/O")
> Index: ipa-split.c
> ===================================================================
> --- ipa-split.c	(revision 0)
> +++ ipa-split.c	(revision 0)
> @@ -0,0 +1,1069 @@
> +/* Function splitting pass
> +   Copyright (C) 2010
> +   Free Software Foundation, Inc.
> +   Contributed by Jan Hubicka  <jh@suse.cz>
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* The purpose of this pass is to split function bodies to improve
> +   inlining.  I.e. for function of the form:
> +
> +   func (...)
> +     {
> +       if (cheap_test)
> +	 something_small
> +       else
> +	 something_big
> +     }
> +
> +   Produce:
> +
> +   func.part (...)
> +     {
> +	something_big
> +     }
> +
> +   func (...)
> +     {
> +       if (cheap_test)
> +	 something_small
> +       else
> +	 func.part (...);
> +     }
> +
> +   When func becomes inlinable and when cheap_test is often true, inlining func,
> +   but not fund.part leads to performance imrovement similar as inlining original
> +   func while the code size growth is smaller.
> +
> +   The pass is organized in three stages:
> +   1) Collect local info about basic block into BB_INFO structure and
> +      compute function body estimated size and time.
> +   2) Via DFS walk find all possible basic blocks where we can split
> +      and chose best one.
> +   3) If split point is found, split at the specified BB by creating a clone
> +      and updating function to call it.  
> +
> +   The decisions what functions to split are in execute_split_functions
> +   and consider_split.  
> +
> +   There are several possible future improvements for this pass including:
> +
> +   1) Splitting to break up large functions
> +   2) Splitting to reduce stack frame usage
> +   3) Allow split part of function to use values computed in the header part.
> +      The values needs to be passed to split function, perhaps via same interface
> +      as for nested functions or as argument.
> +   4) Support for simple rematerialization.  I.e. when split part use
> +      value computed in header from function parameter in very cheap way, we
> +      can just recompute it.
> +   5) Support splitting of nested functions.
> +   6) Support non-SSA arguments.  
> +   7) There is nothing preventing us from producing multiple parts of single function
> +      when needed or splitting also the parts.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tree.h"
> +#include "target.h"
> +#include "cgraph.h"
> +#include "ipa-prop.h"
> +#include "tree-flow.h"
> +#include "tree-pass.h"
> +#include "flags.h"
> +#include "timevar.h"
> +#include "diagnostic.h"
> +#include "tree-dump.h"
> +#include "tree-inline.h"
> +#include "fibheap.h"
> +#include "params.h"
> +#include "gimple-pretty-print.h"
> +
> +/* Per basic block info.  */
> +
> +typedef struct
> +{
> +  unsigned int size;
> +  unsigned int time;
> +} bb_info;
> +DEF_VEC_O(bb_info);
> +DEF_VEC_ALLOC_O(bb_info,heap);
> +
> +static VEC(bb_info, heap) *bb_info_vec;
> +
> +/* Description of split point.  */
> +
> +struct split_point
> +{
> +  /* Size of the partitions.  */
> +  unsigned int header_time, header_size, split_time, split_size;
> +
> +  /* SSA names that need to be passed into spit funciton.  */
> +  bitmap ssa_names_to_pass;
> +
> +  /* Basic block where we split (that will become entry point of new function.  */
> +  basic_block entry_bb;
> +
> +  /* Basic blocks we are splitting away.  */
> +  bitmap split_bbs;
> +};
> +
> +/* Best split point found.  */
> +
> +struct split_point best_split_point;
> +
> +/* Callback for walk_stmt_load_store_addr_ops.  If T is non-ssa automatic
> +   variable, check it if it is present in bitmap passed via DATA.  */
> +
> +static bool
> +test_nonssa_use (gimple stmt ATTRIBUTE_UNUSED, tree t,
> +	         void *data ATTRIBUTE_UNUSED)
> +{
> +  t = get_base_address (t);
> +
> +  if (t && !is_gimple_reg (t)
> +      && ((TREE_CODE (t) == VAR_DECL && auto_var_in_fn_p (t, current_function_decl))
> +	  || (TREE_CODE (t) == PARM_DECL)))
> +    return bitmap_bit_p ((bitmap)data, DECL_UID (t));
> +  return false;
> +}
> +
> +/* Dump split point CURRENT.  */
> +
> +static void
> +dump_split_point (FILE * file, struct split_point *current)
> +{
> +  fprintf (file,
> +	   "Split point at BB %i header time:%i header size: %i split time: %i split size: %i\n  bbs: ",
> +	   current->entry_bb->index, current->header_time,
> +	   current->header_size, current->split_time, current->split_size);
> +  dump_bitmap (file, current->split_bbs);
> +  fprintf (file, "  SSA names to pass: ");
> +  dump_bitmap (file, current->ssa_names_to_pass);
> +}
> +
> +/* We found an split_point CURRENT.  NON_SSA_VARS is bitmap of all non ssa
> +   variables used and RETURN_BB is return basic block.
> +   See if we can split function here.  */
> +
> +static void
> +consider_split (struct split_point *current, bitmap non_ssa_vars,
> +		basic_block return_bb)
> +{
> +  tree parm;
> +  unsigned int num_args = 0;
> +  unsigned int call_overhead;
> +  edge e;
> +  edge_iterator ei;
> +  if (dump_file)
> +    dump_split_point (dump_file, current);
> +
> +  /* Do not split when we would end up calling function anyway.  */
> +  if (current->entry_bb->frequency
> +      >= (ENTRY_BLOCK_PTR->frequency
> +	  * PARAM_VALUE (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY) / 100))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  Refused: split BB frequency is too large.\n");
> +      return;
> +    }
> +
> +  if (!current->header_size)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "  Refused: header empty\n");
> +      gcc_unreachable ();
> +      return;
> +    }
> +
> +  /* FIXME: We can do better: if the split region start with a loop and there is only
> +     one entry point from outer wrold, we can update PHI.  */
> +  if (!gsi_end_p (gsi_start_phis (current->entry_bb)))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  Refused: entry BB has PHI\n");
> +      return;
> +    }
> +
> +
> +  /* See what argument we will pass to the split function and compute
> +     call overhead.  */
> +  call_overhead = eni_size_weights.call_cost;
> +  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
> +       parm = TREE_CHAIN (parm))
> +    {
> +      if (!is_gimple_reg (parm))
> +	{
> +	  if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
> +	    {
> +	      if (dump_file)
> +		fprintf (dump_file,
> +			 "  Refused: need to pass non-ssa param values\n");
> +	      return;
> +	    }
> +	}
> +      else if (gimple_default_def (cfun, parm)
> +	       && bitmap_bit_p (current->ssa_names_to_pass,
> +				SSA_NAME_VERSION (gimple_default_def
> +						  (cfun, parm))))
> +	{
> +	  if (!VOID_TYPE_P (TREE_TYPE (parm)))
> +	    call_overhead += estimate_move_cost (TREE_TYPE (parm));
> +	  num_args++;
> +	}
> +    }
> +  if (!VOID_TYPE_P (TREE_TYPE (current_function_decl)))
> +    call_overhead += estimate_move_cost (TREE_TYPE (current_function_decl));
> +
> +  if (current->split_size <= call_overhead)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  Refused: split size is smaller than call overhead\n");
> +      return;
> +    }
> +  if (current->header_size + call_overhead
> +      >= (unsigned int)(DECL_DECLARED_INLINE_P (current_function_decl)
> +			? MAX_INLINE_INSNS_SINGLE
> +			: MAX_INLINE_INSNS_AUTO))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  Refused: header size is too large for inline candidate\n");
> +      return;
> +    }
> +
> +  /* FIXME: we currently can pass only SSA function parameters to the split
> +     arguments.  Once parm_adjustment infrastructure is supported by clonning,
> +     we can pass more than that.  */
> +  if (num_args != bitmap_count_bits (current->ssa_names_to_pass))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  Refused: need to pass non-param values\n");
> +      return;
> +    }
> +
> +  /* When there are non-ssa vars used in the split region, see if they
> +     are used in the header region.  If so, reject the split.
> +     FIXME: we can use nested function support to access both.  */
> +  if (!bitmap_empty_p (non_ssa_vars))
> +    {
> +      basic_block bb;
> +      FOR_EACH_BB (bb)
> +	if (!bitmap_bit_p (current->split_bbs, bb->index))
> +	  {
> +	    gimple_stmt_iterator bsi;
> +	    for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +	      {
> +		if (is_gimple_debug (gsi_stmt (bsi)))
> +		  continue;
> +		if (walk_stmt_load_store_addr_ops
> +		    (gsi_stmt (bsi), non_ssa_vars, test_nonssa_use,
> +		     test_nonssa_use, test_nonssa_use))
> +		  {
> +		    if (dump_file)
> +		      fprintf (dump_file,
> +			       "  Refused: split part has non-ssa uses\n");
> +		    return;
> +		  }
> +	      }
> +	    for (bsi = gsi_start_phis (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +	      {
> +		if (is_gimple_debug (gsi_stmt (bsi)))
> +		  continue;
> +		if (walk_stmt_load_store_addr_ops
> +		    (gsi_stmt (bsi), non_ssa_vars, test_nonssa_use,
> +		     test_nonssa_use, test_nonssa_use))
> +		  {
> +		    if (dump_file)
> +		      fprintf (dump_file,
> +			       "  Refused: split part has non-ssa uses\n");
> +		    return;
> +		  }
> +	      }
> +	    FOR_EACH_EDGE (e, ei, bb->succs)
> +	      if (e->dest == return_bb)
> +		{
> +		  for (bsi = gsi_start_phis (return_bb); !gsi_end_p (bsi);
> +		       gsi_next (&bsi))
> +		    {
> +		      gimple stmt = gsi_stmt (bsi);
> +		      tree op = gimple_phi_arg_def (stmt, e->dest_idx);
> +
> +		      if (!is_gimple_reg (gimple_phi_result (stmt)))
> +			continue;
> +		      STRIP_NOPS (op);
> +		      if (TREE_CODE (op) != SSA_NAME
> +			  && test_nonssa_use (stmt, op, non_ssa_vars))
> +			{
> +			  if (dump_file)
> +			    fprintf (dump_file,
> +				     "  Refused: split part has non-ssa uses\n");
> +			  return;
> +			}
> +		    }
> +		}
> +	  }
> +      return;
> +    }
> +  if (dump_file)
> +    fprintf (dump_file, "  Accepted!\n");
> +
> +  /* At the moment chose split point with lowest frequency and that leaves
> +     out smallest size of header.
> +     In future we might re-consider this heuristics.  */
> +  if (!best_split_point.split_bbs
> +      || best_split_point.entry_bb->frequency > current->entry_bb->frequency
> +      || (best_split_point.entry_bb->frequency == current->entry_bb->frequency
> +	  && best_split_point.split_size < current->split_size))
> +	
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "  New best split point!\n");
> +      if (best_split_point.ssa_names_to_pass)
> +	{
> +	  BITMAP_FREE (best_split_point.ssa_names_to_pass);
> +	  BITMAP_FREE (best_split_point.split_bbs);
> +	}
> +      best_split_point = *current;
> +      best_split_point.ssa_names_to_pass = BITMAP_ALLOC (NULL);
> +      bitmap_copy (best_split_point.ssa_names_to_pass,
> +		   current->ssa_names_to_pass);
> +      best_split_point.split_bbs = BITMAP_ALLOC (NULL);
> +      bitmap_copy (best_split_point.split_bbs, current->split_bbs);
> +    }
> +}
> +
> +/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if none
> +   found.  */
> +
> +static basic_block
> +find_return_bb (void)
> +{
> +  edge e;
> +  edge_iterator ei;
> +  bool found_return = false;
> +
> +  if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1)
> +    FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
> +      {
> +	gimple_stmt_iterator bsi;
> +	for (bsi = gsi_start_bb (e->src); !gsi_end_p (bsi); gsi_next (&bsi))
> +	  if (gimple_code (gsi_stmt (bsi)) != GIMPLE_RETURN
> +	      && gimple_code (gsi_stmt (bsi)) != GIMPLE_LABEL
> +	      && !is_gimple_debug (gsi_stmt (bsi)))
> +	    break;
> +	  else if (gimple_code (gsi_stmt (bsi)) == GIMPLE_RETURN)
> +	    found_return = true;
> +	if (gsi_end_p (bsi) && found_return)
> +	  return e->src;
> +      }
> +  return EXIT_BLOCK_PTR;
> +}
> +
> +/* Callback for walk_stmt_load_store_addr_ops.  If T is non-ssa automatic
> +   variable, mark it as used in bitmap passed via DATA. 
> +   Return true when access to T prevents splitting the function.  */
> +
> +static bool
> +mark_nonssa_use (gimple stmt ATTRIBUTE_UNUSED, tree t,
> +	         void *data ATTRIBUTE_UNUSED)
> +{
> +  t = get_base_address (t);
> +
> +  if (!t || is_gimple_reg (t))
> +    return false;
> +
> +  /* At present we can't pass non-SSA arguments to split function.
> +     FIXME: this can be relaxed by passing references to arguments.  */
> +  if (TREE_CODE (t) == PARM_DECL)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Can not split use of non-ssa function parameter.\n");
> +      return true;
> +    }
> +
> +  if (TREE_CODE (t) == VAR_DECL && auto_var_in_fn_p (t, current_function_decl))
> +    bitmap_set_bit ((bitmap)data, DECL_UID (t));
> +  return false;
> +}
> +
> +/* Compute local properties of basic block BB we collect when looking for split points
> +   point.
> +   We look for ssa defs and store them in SET_SSA_NAMES, for ssa uses and store them
> +   in USED_SSA_NAMES and for any non-SSA automatic vars stored in NON_SSA_VARS.
> +
> +   When BB has edge to RETURN_BB, collect uses in RETURN_BB too.  
> +
> +   Return false when BB contains something that prevents it from being put into
> +   split function.  */
> +
> +static bool
> +visit_bb (basic_block bb, basic_block return_bb,
> +	  bitmap set_ssa_names, bitmap used_ssa_names,
> +	  bitmap non_ssa_vars)
> +{
> +  gimple_stmt_iterator bsi;
> +  edge e;
> +  edge_iterator ei;
> +  bool can_split = true;
> +
> +  for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +    {
> +      gimple stmt = gsi_stmt (bsi);
> +      tree op;
> +      ssa_op_iter iter;
> +      tree decl;
> +
> +      if (is_gimple_debug (stmt))
> +	continue;
> +
> +      /* FIXME: We can split regions containing EH.  We can not however
> +	 split RESX, EH_DISPATCH and EH_POINTER referring to same region
> +	 into different partitions.  This would require tracking of
> +	 EH regions and checking in consider_split_point if they 
> +	 are not used elsewhere.  */
> +      if (gimple_code (stmt) == GIMPLE_RESX
> +	  && stmt_can_throw_external (stmt))
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file, "Can not split external resx.\n");
> +	  can_split = false;
> +	}
> +      if (gimple_code (stmt) == GIMPLE_EH_DISPATCH)
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file, "Can not split eh dispatch.\n");
> +	  can_split = false;
> +	}
> +
> +      /* Check builtins that prevent splitting.  */
> +      if (gimple_code (stmt) == GIMPLE_CALL
> +	  && (decl = gimple_call_fndecl (stmt)) != NULL_TREE
> +	  && DECL_BUILT_IN (decl))
> +	switch (DECL_FUNCTION_CODE (decl))
> +	  {
> +	  /* FIXME: once we will allow passing non-parm values to split part,
> +	     we need to be sure to handle correct builtin_stack_save and
> +	     builtin_stack_restore.  At the moment we are safe; there is no
> +	     way to store builtin_stack_save result in non-SSA variable
> +	     since all calls to those are compiler generated.  */
> +	  case BUILT_IN_APPLY:
> +	  case BUILT_IN_VA_START:
> +	    if (dump_file)
> +	      fprintf (dump_file, "Can not split builtin_apply and va_start.\n");
> +	    can_split = false;
> +	    break;
> +	  case BUILT_IN_EH_POINTER:
> +	    if (dump_file)
> +	      fprintf (dump_file, "Can not split builtin_eh_pointer.\n");
> +	    can_split = false;
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +
> +      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
> +	if (TREE_CODE (op) == SSA_NAME)
> +	  bitmap_set_bit (set_ssa_names, SSA_NAME_VERSION (op));
> +      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
> +	if (TREE_CODE (op) == SSA_NAME)
> +	  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
> +      can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
> +						   mark_nonssa_use,
> +						   mark_nonssa_use,
> +						   mark_nonssa_use);
> +    }
> +  for (bsi = gsi_start_phis (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +    {
> +      gimple stmt = gsi_stmt (bsi);
> +      tree op;
> +      ssa_op_iter iter;
> +
> +      if (is_gimple_debug (stmt))
> +	continue;
> +      if (!is_gimple_reg (gimple_phi_result (stmt)))
> +	continue;
> +      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
> +	if (TREE_CODE (op) == SSA_NAME)
> +	  bitmap_set_bit (set_ssa_names, SSA_NAME_VERSION (op));
> +      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
> +	if (TREE_CODE (op) == SSA_NAME)
> +	  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
> +      can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
> +						   mark_nonssa_use,
> +						   mark_nonssa_use,
> +						   mark_nonssa_use);
> +    }
> +  /* Record also uses comming from PHI operand in return BB.  */
> +  FOR_EACH_EDGE (e, ei, bb->succs)
> +    if (e->dest == return_bb)
> +      {
> +	bool found_phi = false;
> +	for (bsi = gsi_start_phis (return_bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +	  {
> +	    gimple stmt = gsi_stmt (bsi);
> +	    tree op = gimple_phi_arg_def (stmt, e->dest_idx);
> +
> +	    if (is_gimple_debug (stmt))
> +	      continue;
> +	    if (!is_gimple_reg (gimple_phi_result (stmt)))
> +	      continue;
> +	    found_phi = true;
> +	    STRIP_NOPS (op);
> +	    if (TREE_CODE (op) == SSA_NAME)
> +	      bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
> +	    else
> +	      can_split &= !mark_nonssa_use (stmt, op, non_ssa_vars);
> +	  }
> +	if (!gsi_end_p (gsi_last_bb (return_bb)))
> +	  {
> +	    ssa_op_iter iter;
> +	    gimple stmt = gsi_stmt (gsi_last_bb (return_bb));
> +	    tree op;
> +	    if (!found_phi)
> +	      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
> +		if (TREE_CODE (op) == SSA_NAME)
> +		  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
> +	    can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
> +							 mark_nonssa_use,
> +							 mark_nonssa_use,
> +							 mark_nonssa_use);
> +	  }
> +      }
> +  return can_split;
> +}
> +
> +/* Stack entry for recursive DFS walk in find_split_point.  */
> +
> +typedef struct
> +{
> +  /* Basic block we are examining.  */
> +  basic_block bb;
> +
> +  /* SSA names set and used by the BB and all BBs reachable
> +     from it via DFS walk.  */
> +  bitmap set_ssa_names, used_ssa_names;
> +  bitmap non_ssa_vars;
> +
> +  /* All BBS visited from this BB via DFS walk.  */
> +  bitmap bbs_visited;
> +
> +  /* Last examined edge in DFS walk.  Since we walk unoriented graph,
> +     the value is up to sum of incomming and outgoing edges of BB.  */
> +  unsigned int edge_num;
> +
> +  /* Stack entry index of earliest BB reachable from current BB
> +     or any BB visited later in DFS valk.  */
> +  int earliest;
> +
> +  /* Overall time and size of all BBs reached from this BB in DFS walk.  */
> +  int overall_time, overall_size;
> +
> +  /* When false we can not split on this BB.  */
> +  bool can_split;
> +} stack_entry;
> +DEF_VEC_O(stack_entry);
> +DEF_VEC_ALLOC_O(stack_entry,heap);
> +
> +
> +/* Find all articulations and call consider_split on them.
> +   OVERALL_TIME and OVERALL_SIZE is time and size of the function.
> +
> +   We perform basic algorithm for finding an articulation in a graph
> +   created from CFG by considering it to be an unoriented graph.
> +
> +   The articulation is discovered via DFS walk. We collect earliest
> +   basic block on stack that is reachable via backward edge.  Articulation
> +   is any basic block such that there is no backward edge bypassing it.
> +   To reduce stack usage we maintain heap allocated stack in STACK vector.
> +   AUX pointer of BB is set to index it appears in the stack or -1 once
> +   it is visited and popped off the stack.
> +
> +   The algorithm finds articulation after visiting the whole component
> +   reachable by it.  This makes it convenient to collect information about
> +   the component used by consider_split.  */
> +
> +static void
> +find_split_points (int overall_time, int overall_size)
> +{
> +  stack_entry first;
> +  VEC(stack_entry, heap) *stack = NULL;
> +  basic_block bb;
> +  basic_block return_bb = find_return_bb ();
> +  struct split_point current;
> +
> +  current.header_time = overall_time;
> +  current.header_size = overall_size;
> +  current.split_time = 0;
> +  current.split_size = 0;
> +  current.ssa_names_to_pass = BITMAP_ALLOC (NULL);
> +
> +  first.bb = ENTRY_BLOCK_PTR;
> +  first.edge_num = 0;
> +  first.overall_time = 0;
> +  first.overall_size = 0;
> +  first.earliest = INT_MAX;
> +  first.set_ssa_names = 0;
> +  first.used_ssa_names = 0;
> +  first.bbs_visited = 0;
> +  VEC_safe_push (stack_entry, heap, stack, &first);
> +  ENTRY_BLOCK_PTR->aux = (void *)(ptrdiff_t)-1;
> +
> +  while (!VEC_empty (stack_entry, stack))
> +    {
> +      stack_entry *entry = VEC_last (stack_entry, stack);
> +
> +      /* We are wlaking acyclic graph, so edge_num counts
> +	 succ and pred edges together.  However when considering
> +         articulation, we want to have processed everything reachable
> +	 from articulation but nothing that reaches into it.  */
> +      if (entry->edge_num == EDGE_COUNT (entry->bb->succs)
> +	  && entry->bb != ENTRY_BLOCK_PTR)
> +	{
> +	  int pos = VEC_length (stack_entry, stack);
> +	  entry->can_split &= visit_bb (entry->bb, return_bb,
> +					entry->set_ssa_names,
> +					entry->used_ssa_names,
> +					entry->non_ssa_vars);
> +	  if (pos <= entry->earliest && !entry->can_split && dump_file)
> +	    fprintf (dump_file,
> +		     "found articulation at bb %i but can not split\n",
> +		     entry->bb->index);
> +	  if (pos <= entry->earliest && entry->can_split)
> +	     {
> +	       if (dump_file)
> +		 fprintf (dump_file, "found articulation at bb %i\n",
> +			  entry->bb->index);
> +	       current.entry_bb = entry->bb;
> +	       current.ssa_names_to_pass = BITMAP_ALLOC (NULL);
> +	       bitmap_and_compl (current.ssa_names_to_pass,
> +				 entry->used_ssa_names, entry->set_ssa_names);
> +	       current.header_time = overall_time - entry->overall_time;
> +	       current.header_size = overall_size - entry->overall_size;
> +	       current.split_time = entry->overall_time;
> +	       current.split_size = entry->overall_size;
> +	       current.split_bbs = entry->bbs_visited;
> +	       consider_split (&current, entry->non_ssa_vars, return_bb);
> +	       BITMAP_FREE (current.ssa_names_to_pass);
> +	     }
> +	}
> +      /* Do actual DFS walk.  */
> +      if (entry->edge_num
> +	  < (EDGE_COUNT (entry->bb->succs)
> +	     + EDGE_COUNT (entry->bb->preds)))
> +	{
> +          edge e;
> +	  basic_block dest;
> +	  if (entry->edge_num < EDGE_COUNT (entry->bb->succs))
> +	    {
> +	      e = EDGE_SUCC (entry->bb, entry->edge_num);
> +	      dest = e->dest;
> +	    }
> +	  else
> +	    {
> +	      e = EDGE_PRED (entry->bb, entry->edge_num - EDGE_COUNT (entry->bb->succs));
> +	      dest = e->src;
> +	    }
> +
> +	  entry->edge_num++;
> +
> +	  /* New BB to visit, push it to the stack.  */
> +	  if (dest != return_bb && dest != EXIT_BLOCK_PTR
> +	      && !dest->aux)
> +	    {
> +	      stack_entry new_entry;
> +
> +	      new_entry.bb = dest;
> +	      new_entry.edge_num = 0;
> +	      new_entry.overall_time
> +		 = VEC_index (bb_info, bb_info_vec, dest->index)->time;
> +	      new_entry.overall_size
> +		 = VEC_index (bb_info, bb_info_vec, dest->index)->size;
> +	      new_entry.earliest = INT_MAX;
> +	      new_entry.set_ssa_names = BITMAP_ALLOC (NULL);
> +	      new_entry.used_ssa_names = BITMAP_ALLOC (NULL);
> +	      new_entry.bbs_visited = BITMAP_ALLOC (NULL);
> +	      new_entry.non_ssa_vars = BITMAP_ALLOC (NULL);
> +	      new_entry.can_split = true;
> +	      bitmap_set_bit (new_entry.bbs_visited, dest->index);
> +	      VEC_safe_push (stack_entry, heap, stack, &new_entry);
> +	      dest->aux = (void *)(ptrdiff_t)VEC_length (stack_entry, stack);
> +	    }
> +	  /* Back edge found, record the earliest point.  */
> +	  else if ((ptrdiff_t)dest->aux > 0
> +		   && (ptrdiff_t)dest->aux < entry->earliest)
> +	    entry->earliest = (ptrdiff_t)dest->aux;
> +	}
> +      /* We are done with examing the edges. pop off the value from stack and
> +	 merge stuff we cummulate during the walk.  */
> +      else if (entry->bb != ENTRY_BLOCK_PTR)
> +	{
> +	  stack_entry *prev = VEC_index (stack_entry, stack,
> +					 VEC_length (stack_entry, stack) - 2);
> +
> +	  entry->bb->aux = (void *)(ptrdiff_t)-1;
> +	  prev->can_split &= entry->can_split;
> +	  if (prev->set_ssa_names)
> +	    {
> +	      bitmap_ior_into (prev->set_ssa_names, entry->set_ssa_names);
> +	      bitmap_ior_into (prev->used_ssa_names, entry->used_ssa_names);
> +	      bitmap_ior_into (prev->bbs_visited, entry->bbs_visited);
> +	      bitmap_ior_into (prev->non_ssa_vars, entry->non_ssa_vars);
> +	    }
> +	  if (prev->earliest > entry->earliest)
> +	    prev->earliest = entry->earliest;
> +	  prev->overall_time += entry->overall_time;
> +	  prev->overall_size += entry->overall_size;
> +	  BITMAP_FREE (entry->set_ssa_names);
> +	  BITMAP_FREE (entry->used_ssa_names);
> +	  BITMAP_FREE (entry->bbs_visited);
> +	  BITMAP_FREE (entry->non_ssa_vars);
> +	  VEC_pop (stack_entry, stack);
> +	}
> +      else
> +        VEC_pop (stack_entry, stack);
> +    }
> +  ENTRY_BLOCK_PTR->aux = NULL;
> +  FOR_EACH_BB (bb)
> +    bb->aux = NULL;
> +  BITMAP_FREE (current.ssa_names_to_pass);
> +}
> +
> +/* Split function at SPLIT_POINT.  */
> +
> +static void
> +split_function (struct split_point *split_point)
> +{
> +  VEC (tree, heap) *args_to_pass = NULL;
> +  bitmap args_to_skip = BITMAP_ALLOC (NULL);
> +  tree parm;
> +  int num = 0;
> +  struct cgraph_node *node;
> +  basic_block return_bb = find_return_bb ();
> +  basic_block call_bb;
> +  gimple_stmt_iterator gsi;
> +  gimple call;
> +  edge e;
> +  edge_iterator ei;
> +  tree retval = NULL, real_retval = NULL;
> +  bool split_part_return_p = false;
> +  gimple last_stmt = NULL;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "\n\nSplitting function at:\n");
> +      dump_split_point (dump_file, split_point);
> +    }
> +
> +  /* Collect the parameters of new function and args_to_skip bitmap.  */
> +  for (parm = DECL_ARGUMENTS (current_function_decl);
> +       parm; parm = TREE_CHAIN (parm), num++)
> +    if (!is_gimple_reg (parm)
> +	|| !gimple_default_def (cfun, parm)
> +	|| !bitmap_bit_p (split_point->ssa_names_to_pass,
> +			  SSA_NAME_VERSION (gimple_default_def (cfun, parm))))
> +      bitmap_set_bit (args_to_skip, num);
> +    else
> +      VEC_safe_push (tree, heap, args_to_pass, gimple_default_def (cfun, parm));
> +
> +  /* See if the split functionwill return.  */
> +  FOR_EACH_EDGE (e, ei, return_bb->preds)
> +    if (bitmap_bit_p (split_point->split_bbs, e->src->index))
> +      break;
> +  if (e)
> +    split_part_return_p = true;
> +
> +  /* If we return, we will need the return block.  */
> +  if (return_bb != EXIT_BLOCK_PTR && split_part_return_p)
> +    bitmap_set_bit (split_point->split_bbs, return_bb->index);
> +
> +  /* Now create the actual clone.  */
> +  rebuild_cgraph_edges ();
> +  node = cgraph_function_versioning (cgraph_node (current_function_decl),
> +				     NULL, NULL,
> +				     args_to_skip,
> +				     split_point->split_bbs,
> +				     split_point->entry_bb, "_part");
> +  cgraph_node_remove_callees (cgraph_node (current_function_decl));
> +  if (!split_part_return_p)
> +    TREE_THIS_VOLATILE (node->decl) = 1;
> +  if (dump_file)
> +    dump_function_to_file (node->decl, dump_file, dump_flags);
> +
> +  /* Create the basic block we place call into.  It is the entry basic block
> +     split after last label.  */
> +  call_bb = split_point->entry_bb;
> +  for (gsi = gsi_start_bb (call_bb); !gsi_end_p (gsi);)
> +    if (gimple_code (gsi_stmt (gsi)) == GIMPLE_LABEL)
> +      {
> +	last_stmt = gsi_stmt (gsi);
> +	gsi_next (&gsi);
> +      }
> +    else
> +      break;
> +  e = split_block (split_point->entry_bb, last_stmt);
> +  remove_edge (e);
> +
> +  /* Produce the call statement.  */
> +  gsi = gsi_last_bb (call_bb);
> +  call = gimple_build_call_vec (node->decl, args_to_pass);
> +  gimple_set_block (call, DECL_INITIAL (current_function_decl));
> +
> +  /* Update return value.  This is bit tricky.  When we do not return,
> +     do nothing.  When we return we might need to update return_bb
> +     or produce new return statement.  */
> +  if (!split_part_return_p)
> +    gsi_insert_after (&gsi, call, GSI_NEW_STMT);
> +  else
> +    {
> +      e = make_edge (call_bb, return_bb,
> +		     return_bb == EXIT_BLOCK_PTR ? 0 : EDGE_FALLTHRU);
> +      e->count = call_bb->count;
> +      e->probability = REG_BR_PROB_BASE;
> +      if (return_bb != EXIT_BLOCK_PTR)
> +	{
> +	  gimple return_stmt = gsi_stmt (gsi_last_bb (return_bb));
> +	  gcc_assert (gimple_code (return_stmt) == GIMPLE_RETURN);
> +
> +	  if ((real_retval = retval = gimple_return_retval (return_stmt))
> +	      && !is_gimple_min_invariant (retval)
> +	      && (TREE_CODE (retval) != SSA_NAME
> +		  || !SSA_NAME_IS_DEFAULT_DEF (retval)))
> +	    {
> +	      gimple_stmt_iterator psi;
> +
> +	      /* See if there is PHI definind return value.  */
> +	      for (psi = gsi_start_phis (return_bb);
> +		   !gsi_end_p (psi); gsi_next (&psi))
> +		if (is_gimple_reg (gimple_phi_result (gsi_stmt (psi))))
> +		  break;
> +
> +	      /* When we have PHI, update PHI.  When there is no PHI,
> +		 update the return statement itself.  */
> +	      if (TREE_CODE (retval) == SSA_NAME)
> +		{
> +		  retval = make_ssa_name (SSA_NAME_VAR (retval), call);
> +		  if (TREE_CODE (retval) == SSA_NAME
> +		      && !gsi_end_p (psi))
> +		    add_phi_arg (gsi_stmt (psi), retval, e, UNKNOWN_LOCATION);
> +		  else if (TREE_CODE (retval) == SSA_NAME)
> +		    {
> +		      gimple_return_set_retval (return_stmt, retval);
> +		      update_stmt (return_stmt);
> +		    }
> +		}
> +	      gimple_call_set_lhs (call, retval);
> +	    }
> +          gsi_insert_after (&gsi, call, GSI_NEW_STMT);
> +	}
> +      else
> +	{
> +	  gimple ret;
> +	  if (!VOID_TYPE_P (TREE_TYPE (TREE_TYPE (current_function_decl))))
> +	    {
> +	      retval
> +	        = create_tmp_var (TREE_TYPE (TREE_TYPE (current_function_decl)),
> +				  "RET");
> +	      if (is_gimple_reg (retval))
> +		retval = make_ssa_name (retval, call);
> +	      gimple_call_set_lhs (call, retval);
> +	    }
> +          gsi_insert_after (&gsi, call, GSI_NEW_STMT);
> +	  ret = gimple_build_return (retval);
> +	  gsi_insert_after (&gsi, ret, GSI_NEW_STMT);
> +	}
> +    }
> +  free_dominance_info (CDI_DOMINATORS);
> +  free_dominance_info (CDI_POST_DOMINATORS);
> +  compute_inline_parameters (node);
> +}
> +
> +/* Execute function splitting pass.  */
> +
> +static unsigned int
> +execute_split_functions (void)
> +{
> +  gimple_stmt_iterator bsi;
> +  basic_block bb;
> +  int overall_time = 0, overall_size = 0;
> +  int todo = 0;
> +  edge e;
> +  edge_iterator ei;
> +  int nreturns = 0;
> +  struct cgraph_node *node = cgraph_node (current_function_decl);
> +
> +  if (flags_from_decl_or_type (current_function_decl) & ECF_NORETURN)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: noreturn function.\n");
> +      return 0;
> +    }
> +  if (MAIN_NAME_P (DECL_NAME (current_function_decl)))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: main function.\n");
> +      return 0;
> +    }
> +  /* This can be relaxed; function might become inlinable after splitting
> +     away the uninlinable part.  */
> +  if (!node->local.inlinable)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: not inlinable.\n");
> +      return 0;
> +    }
> +  if (node->local.disregard_inline_limits)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: disregading inline limits.\n");
> +      return 0;
> +    }
> +  /* This can be relaxed; most of versioning tests actually prevents
> +     a duplication.  */
> +  if (!tree_versionable_function_p (current_function_decl))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: not versionable.\n");
> +      return 0;
> +    }
> +  /* FIXME: we could support this.  */
> +  if (DECL_STRUCT_FUNCTION (current_function_decl)->static_chain_decl)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: nested function.\n");
> +      return 0;
> +    }
> +  /* FIXME: Should be easy to support.  */
> +  if (DECL_BY_REFERENCE (DECL_RESULT (current_function_decl)))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: returns value by reference.\n");
> +      return 0;
> +    }
> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
> +    if (gimple_code (gsi_stmt (gsi_last_bb (e->src))) == GIMPLE_RETURN)
> +      nreturns++;
> +  /* FIXME: This happens only when we have one return with a value and other
> +     return without.  We can handle this easily by making find_return_bb to find
> +     version with a return value.  */
> +  if (nreturns > 1)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: multiple return statements.\n");
> +      return 0;
> +    }
> +
> +  /* See if it makes sense to try to split.
> +     It makes sense to split if we inline, that is if we have direct calls to
> +     handle or direct calls are possibly going to appear as result of indirect
> +     inlining or LTO.
> +     Note that we are not completely conservative about disqualifying functions
> +     called once.  It is possible that the caller is called more then once and
> +     then inlining would still benefit.  */
> +  if ((!node->callers || !node->callers->next_caller)
> +      && !node->address_taken
> +      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: not called directly or called once.\n");
> +      return 0;
> +    }
> +
> +  /* FIXME: We can actually split if splitting reduces call overhead.  */
> +  if (!flag_inline_small_functions
> +      && !DECL_DECLARED_INLINE_P (current_function_decl))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Not splitting: not autoinlining and function is not inline.\n");
> +      return 0;
> +    }
> +
> +  /* Compute local info about basic blocks and determine function size/time.  */
> +  VEC_safe_grow_cleared (bb_info, heap, bb_info_vec, last_basic_block + 1);
> +  memset (&best_split_point, 0, sizeof (best_split_point));
> +  FOR_EACH_BB (bb)
> +    {
> +      int time = 0;
> +      int size = 0;
> +      int freq = compute_call_stmt_bb_frequency (current_function_decl, bb);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "Basic block %i\n", bb->index);
> +
> +      for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> +	{
> +	  int this_time, this_size;
> +	  gimple stmt = gsi_stmt (bsi);
> +
> +	  this_size = estimate_num_insns (stmt, &eni_size_weights);
> +	  this_time = estimate_num_insns (stmt, &eni_time_weights) * freq;
> +	  size += this_size;
> +	  time += this_time;
> +
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    {
> +	      fprintf (dump_file, "  freq:%6i size:%3i time:%3i ",
> +		       freq, this_size, this_time);
> +	      print_gimple_stmt (dump_file, stmt, 0, 0);
> +	    }
> +	}
> +      overall_time += time;
> +      overall_size += size;
> +      VEC_index (bb_info, bb_info_vec, bb->index)->time = time;
> +      VEC_index (bb_info, bb_info_vec, bb->index)->size = size;
> +    }
> +  find_split_points (overall_time, overall_size);
> +  if (best_split_point.split_bbs)
> +    {
> +      split_function (&best_split_point);
> +      BITMAP_FREE (best_split_point.ssa_names_to_pass);
> +      BITMAP_FREE (best_split_point.split_bbs);
> +      todo = TODO_update_ssa | TODO_cleanup_cfg;
> +    }
> +  VEC_free (bb_info, heap, bb_info_vec);
> +  bb_info_vec = NULL;
> +  return todo;
> +}
> +
> +static bool
> +gate_split_functions (void)
> +{
> +  return flag_partial_inlining;
> +}
> +
> +struct gimple_opt_pass pass_split_functions =
> +{
> + {
> +  GIMPLE_PASS,
> +  "fnsplit",				/* name */
> +  gate_split_functions,			/* gate */
> +  execute_split_functions,		/* execute */
> +  NULL,					/* sub */
> +  NULL,					/* next */
> +  0,					/* static_pass_number */
> +  TV_IPA_FNSPLIT,			/* tv_id */
> +  PROP_cfg,				/* properties_required */
> +  0,					/* properties_provided */
> +  0,					/* properties_destroyed */
> +  0,					/* todo_flags_start */
> +  TODO_dump_func			/* todo_flags_finish */
> + }
> +};
> Index: common.opt
> ===================================================================
> --- common.opt	(revision 160109)
> +++ common.opt	(working copy)
> @@ -876,6 +876,10 @@ foptimize-sibling-calls
>  Common Report Var(flag_optimize_sibling_calls) Optimization
>  Optimize sibling and tail recursive calls
>  
> +fpartial-inlining
> +Common Report Var(flag_partial_inlining)
> +Perform partial inlining
> +
>  fpre-ipa-mem-report
>  Common Report Var(pre_ipa_mem_report)
>  Report on memory allocation before interprocedural optimization
> Index: Makefile.in
> ===================================================================
> --- Makefile.in	(revision 160111)
> +++ Makefile.in	(working copy)
> @@ -1429,6 +1429,7 @@ OBJS-archive = \
>  	cppdefault.o \
>  	incpath.o \
>  	ipa-cp.o \
> +        ipa-split.o \
>  	ipa-inline.o \
>  	ipa-prop.o \
>  	ipa-pure-const.o \
> @@ -2966,6 +2967,10 @@ ipa-cp.o : ipa-cp.c $(CONFIG_H) $(SYSTEM
>     $(TREE_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(TREE_FLOW_H) \
>     $(TREE_PASS_H) $(FLAGS_H) $(TIMEVAR_H) $(DIAGNOSTIC_H) $(TREE_DUMP_H) \
>     $(TREE_INLINE_H) $(FIBHEAP_H) $(PARAMS_H) tree-pretty-print.h
> +ipa-split.o : ipa-split.c $(CONFIG_H) $(SYSTEM_H) coretypes.h  \
> +   $(TREE_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(TREE_FLOW_H) \
> +   $(TREE_PASS_H) $(FLAGS_H) $(TIMEVAR_H) $(DIAGNOSTIC_H) $(TREE_DUMP_H) \
> +   $(TREE_INLINE_H) $(FIBHEAP_H) $(PARAMS_H)
>  matrix-reorg.o : matrix-reorg.c $(CONFIG_H) $(SYSTEM_H) coretypes.h  \
>     $(TM_H) $(TREE_H) $(RTL_H) $(TREE_INLINE_H) $(TREE_FLOW_H) \
>     tree-flow-inline.h langhooks.h $(HASHTAB_H) $(TOPLEV_H) $(FLAGS_H) $(GGC_H) \
> Index: passes.c
> ===================================================================
> --- passes.c	(revision 160109)
> +++ passes.c	(working copy)
> @@ -795,6 +795,7 @@ init_optimization_passes (void)
>            NEXT_PASS (pass_cleanup_eh);
>            NEXT_PASS (pass_profile);
>            NEXT_PASS (pass_local_pure_const);
> +          NEXT_PASS (pass_split_functions);
>  	}
>        NEXT_PASS (pass_release_ssa_names);
>        NEXT_PASS (pass_rebuild_cgraph_edges);
> Index: params.def
> ===================================================================
> --- params.def	(revision 160109)
> +++ params.def	(working copy)
> @@ -78,11 +78,11 @@ DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
>     that is applied to functions marked inlined (or defined in the
>     class declaration in C++) given by the "max-inline-insns-single"
>     parameter.
> -   The default value is 90.  */
> +   The default value is 40.  */
>  DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
>  	  "max-inline-insns-auto",
>  	  "The maximum number of instructions when automatically inlining",
> -	  50, 0, 0)
> +	  40, 0, 0)
>  
>  DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
>  	  "max-inline-insns-recursive",
> @@ -117,6 +117,12 @@ DEFPARAM (PARAM_EARLY_INLINER_MAX_ITERAT
>  	  "The maximum number of nested indirect inlining performed by early inliner",
>  	  10, 0, 0)
>  
> +/* Limit on probability of entry BB.  */
> +DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY,
> +	  "partial-inlining-entry-probability",
> +	  "Maximum probability of the entry BB of split region (in percent relative to entry BB of the function) to make partial inlining happen",
> +	  70, 0, 0)
> +
>  /* Limit the number of expansions created by the variable expansion
>     optimization to avoid register pressure.  */
>  DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,
> Index: doc/invoke.texi
> ===================================================================
> *** doc/invoke.texi	(revision 159621)
> --- doc/invoke.texi	(working copy)
> *************** Objective-C and Objective-C++ Dialects}.
> *** 361,367 ****
>   -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>   -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>   -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> ! -fpeel-loops -fpredictive-commoning -fprefetch-loop-arrays @gol
>   -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>   -fprofile-generate=@var{path} @gol
>   -fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
> --- 361,367 ----
>   -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>   -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>   -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> ! -fpartial-inlining -fpeel-loops -fpredictive-commoning -fprefetch-loop-arrays @gol
>   -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>   -fprofile-generate=@var{path} @gol
>   -fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
> *************** also turns on the following optimization
> *** 5851,5856 ****
> --- 5851,5857 ----
>   -findirect-inlining @gol
>   -fipa-sra @gol
>   -foptimize-sibling-calls @gol
> + -fpartial-inlining @gol
>   -fpeephole2 @gol
>   -fregmove @gol
>   -freorder-blocks  -freorder-functions @gol
> *************** This optimization is enabled by default.
> *** 7001,7006 ****
> --- 7002,7015 ----
>   With this option, the compiler will create multiple copies of some
>   local variables when unrolling a loop which can result in superior code.
>   
> + @item -fpartial-inlining
> + @opindex fpartial-inlining
> + Inline parts of functions.  This option has any effect only
> + when inlining itself is turned on by the @option{-finline-functions}
> + or @option{-finline-small-functions} options.
> + 
> + Enabled at level @option{-O2}.
> + 
>   @item -fpredictive-commoning
>   @opindex fpredictive-commoning
>   Perform predictive commoning optimization, i.e., reusing computations
> 
>
Jan Hubicka June 25, 2010, 8:33 a.m. UTC | #2
> The above numbers are all with the max-inline-insns-auto reduction?
> I'd like to have that adjustment split out as a separate followup patch.

Yes, without the change we generally tend to do more inlining (as we produce
more inline candidates).  Not exactly in all cases: when we split at point
that the call to split function is cold, we reduce size even for unchanges
max-inlie-insns-auto.

I've removed this change from patch and will commit it sequentially.
> 
> The partial-inlining-entry-probability default looks quite high.  I would
> have expected that guessed probabilities of say
> 
>   if (flag)
>     return;
> 
> make the entry taken 50% of the time?  So, did you experiment with
> other values than 70 or is that something that is suggested by literature?

Well, not really.  It depends on what is inside if conditional.
If it is if (flag < 0), lets say, we will hit non-negative heuristics
and will predict it with 79%.

The value of 70 seemed high enough to take care of most of random fuzz from the
heruistics, but it is something I plan to experiment with in the future.
In general the predictors on tests of this type are quite ineffective.
> 
> You run ipa-split late during early optimizations.  Will the new clone
> you create go through the early opts again?  Thus, is the placement

New clone is not considered by early optimizers. It is added as new
function just afterwards.  We might want to change this in future, but
that is how cgraph_add_new_function works for now (it queues the function
to be inserted once IPA pass is done and early optimizations are one
IPA pass).

> at the very end of early opts critical?  If so please add a comment

Well, it will work elsewhere too, just you miss optimizations on the split part
of function then.  In general I see little value in moving it earlier, but I've
added coment.

> before the NEXT_PASS.  I suppose we don't re-run early opts on the
> clone as otherwise we could end up splitting it again (and so for
> a very large function with cascaded if (x) cheap; if (x) cheap; ....
> we will take a very long time to compile because we split of
> each if (x) cheap; one after the other?  Please add a testcase
> like this.)

Splitting is considered just once.  But while working on pass, just
for fun I tried to split at every possible split point and quite surprisingly
it did not introduce that much of compile time penalty for such a testcase.
(it sped up insn-attrtab compilation that is pretty much like this)

So multiple split points is something I plan to add, just want to do that
incrementally so the benefits can be more easilly tracked.

What should the testcase test?
> 
> +  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
> +       parm = TREE_CHAIN (parm))
> +    {
> +      if (!is_gimple_reg (parm))
> +       {
> +         if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
> +           {
> +             if (dump_file)
> +               fprintf (dump_file,
> +                        "  Refused: need to pass non-ssa param 
> values\n");
> 
> I'm confused by the non_ssa_vars thing.  This is just tracking
> whether a non-SSA parameter is being used?

Yes, to handle non-ssa arguments, we will need to pass them by reference
that means that we will need to change function prototype in less trivial
way than by removing its arguments.

arg adjustments used in ipa-sra can do this, all we need is to hook them into
clonning machinery that is something I plan to do relatively soon too (it will
imply support in virutal clones that needs way to stream them into cgraphopt
section for WHOPR so it is not completely trivial, but the adjustment code was
meant for this from beggining).

There are two places we check them.  First is for SRA arguments, the other
place is checking that if non-ssa variable is used, it is either used only by
header or by split part, but not by both.

(for same reason I also do not support yet sharing values computed by header
and used by split part that is something I can do by passing them as argument
or recomputing them in split part if they are cheap.  I think especially
for casts in C++ we will want to do the recomputation in trivial cases)
> 
> +  if (current->split_size <= call_overhead)
> +    {
> +      if (dump_file)
> +       fprintf (dump_file,
> +                "  Refused: split size is smaller than call overhead\n");
> +      return;
> +    }
> 
> also refuse to split if the size of the header is bigger than that
> of the tail?  Otherwise we're just going to inline it back, no?

Not necesarily - especially when the split point is cold and thus we end
up inlining for size.

This is something I want to experiment with too.  At the moment we produce
more splitting than needed (i.e. we often inline things back) that has
negative effect on compile time and debug info. In practice I didn't found
this bad.

Keeping the function splitting spearate from inliner heuristics, we can also
play safe and split only when the function becomes inlinable after splitting
or the split part is cold and its size
> +               if (is_gimple_debug (gsi_stmt (bsi)))
> +                 continue;
> 
> I wonder if debug gets confused if you reference vars that are not 
> there...

tree-inline will remove the debug statement (and thus we lose some debug info
when split part is inlined back), but in general it works.
> +/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if 
> none
> +   found.  */
> 
> What about BUILT_IN_RETURN?  Or returns via EH or abnormally returning
> fns like longjmp?

EH or abnormal jumps are not problem.  Functions returning both via
normal return and BUILT_IN_RETURN (or functions having two return statement
one with and one without return value) are currently outlawed.

We can support multiple return BBs if needed, I added TODO for that reason.
> +  /* At present we can't pass non-SSA arguments to split function.
> +     FIXME: this can be relaxed by passing references to arguments.  */
> +  if (TREE_CODE (t) == PARM_DECL)
> +    {
> +      if (dump_file)
> +       fprintf (dump_file, "Can not split use of non-ssa function 
> parameter.\n");
> +      return true;
> +    }
> 
> I think this will do excessive dumping.  At least make it
> dump_flags & TDF_DETAILS.

Done.
> +  if ((!node->callers || !node->callers->next_caller)
> +      && !node->address_taken
> +      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
> +    {
> +      if (dump_file)
> +       fprintf (dump_file, "Not splitting: not called directly or called 
> once.\n");
> +      return 0;
> +    }
> 
> Why the check for node->address_taken?  Why the check
> for external visibility?

We need to decide if function can possibly end up calling more than once when it reach
inliner.  When it has address taken, we can devirtualize the call.
When doing LTO, it might be called directly from other unit.
> 
> Ok with the above changes (please go over all dumping and make sure
> that without -details we only say that we are splitting and where
> and dump the split function).

You seem to be consistently on less dumping by default than me.  I find dumping
of reasons why split points are not accepted quite useful, but I guess using
-details is not problem.

Honza
Richard Biener June 25, 2010, 9:15 a.m. UTC | #3
On Fri, Jun 25, 2010 at 10:33 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The above numbers are all with the max-inline-insns-auto reduction?
>> I'd like to have that adjustment split out as a separate followup patch.
>
> Yes, without the change we generally tend to do more inlining (as we produce
> more inline candidates).  Not exactly in all cases: when we split at point
> that the call to split function is cold, we reduce size even for unchanges
> max-inlie-insns-auto.
>
> I've removed this change from patch and will commit it sequentially.
>>
>> The partial-inlining-entry-probability default looks quite high.  I would
>> have expected that guessed probabilities of say
>>
>>   if (flag)
>>     return;
>>
>> make the entry taken 50% of the time?  So, did you experiment with
>> other values than 70 or is that something that is suggested by literature?
>
> Well, not really.  It depends on what is inside if conditional.
> If it is if (flag < 0), lets say, we will hit non-negative heuristics
> and will predict it with 79%.
>
> The value of 70 seemed high enough to take care of most of random fuzz from the
> heruistics, but it is something I plan to experiment with in the future.
> In general the predictors on tests of this type are quite ineffective.
>>
>> You run ipa-split late during early optimizations.  Will the new clone
>> you create go through the early opts again?  Thus, is the placement
>
> New clone is not considered by early optimizers. It is added as new
> function just afterwards.  We might want to change this in future, but
> that is how cgraph_add_new_function works for now (it queues the function
> to be inserted once IPA pass is done and early optimizations are one
> IPA pass).
>
>> at the very end of early opts critical?  If so please add a comment
>
> Well, it will work elsewhere too, just you miss optimizations on the split part
> of function then.  In general I see little value in moving it earlier, but I've
> added coment.
>
>> before the NEXT_PASS.  I suppose we don't re-run early opts on the
>> clone as otherwise we could end up splitting it again (and so for
>> a very large function with cascaded if (x) cheap; if (x) cheap; ....
>> we will take a very long time to compile because we split of
>> each if (x) cheap; one after the other?  Please add a testcase
>> like this.)
>
> Splitting is considered just once.  But while working on pass, just
> for fun I tried to split at every possible split point and quite surprisingly
> it did not introduce that much of compile time penalty for such a testcase.
> (it sped up insn-attrtab compilation that is pretty much like this)
>
> So multiple split points is something I plan to add, just want to do that
> incrementally so the benefits can be more easilly tracked.
>
> What should the testcase test?

That compile-time is properly bounded.  So I'd use something like

#define FOO if (p) return 1;
#define FOO10 FOO FOO FOO FOO FOO FOO FOO FOO FOO
#define FOO100 ....

int foo(unsigned p)
{
  FOO1000
  return 0;
}

>>
>> +  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
>> +       parm = TREE_CHAIN (parm))
>> +    {
>> +      if (!is_gimple_reg (parm))
>> +       {
>> +         if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
>> +           {
>> +             if (dump_file)
>> +               fprintf (dump_file,
>> +                        "  Refused: need to pass non-ssa param
>> values\n");
>>
>> I'm confused by the non_ssa_vars thing.  This is just tracking
>> whether a non-SSA parameter is being used?
>
> Yes, to handle non-ssa arguments, we will need to pass them by reference
> that means that we will need to change function prototype in less trivial
> way than by removing its arguments.
>
> arg adjustments used in ipa-sra can do this, all we need is to hook them into
> clonning machinery that is something I plan to do relatively soon too (it will
> imply support in virutal clones that needs way to stream them into cgraphopt
> section for WHOPR so it is not completely trivial, but the adjustment code was
> meant for this from beggining).
>
> There are two places we check them.  First is for SRA arguments, the other
> place is checking that if non-ssa variable is used, it is either used only by
> header or by split part, but not by both.
>
> (for same reason I also do not support yet sharing values computed by header
> and used by split part that is something I can do by passing them as argument
> or recomputing them in split part if they are cheap.  I think especially
> for casts in C++ we will want to do the recomputation in trivial cases)
>>
>> +  if (current->split_size <= call_overhead)
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file,
>> +                "  Refused: split size is smaller than call overhead\n");
>> +      return;
>> +    }
>>
>> also refuse to split if the size of the header is bigger than that
>> of the tail?  Otherwise we're just going to inline it back, no?
>
> Not necesarily - especially when the split point is cold and thus we end
> up inlining for size.
>
> This is something I want to experiment with too.  At the moment we produce
> more splitting than needed (i.e. we often inline things back) that has
> negative effect on compile time and debug info. In practice I didn't found
> this bad.
>
> Keeping the function splitting spearate from inliner heuristics, we can also
> play safe and split only when the function becomes inlinable after splitting
> or the split part is cold and its size
>> +               if (is_gimple_debug (gsi_stmt (bsi)))
>> +                 continue;
>>
>> I wonder if debug gets confused if you reference vars that are not
>> there...
>
> tree-inline will remove the debug statement (and thus we lose some debug info
> when split part is inlined back), but in general it works.
>> +/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if
>> none
>> +   found.  */
>>
>> What about BUILT_IN_RETURN?  Or returns via EH or abnormally returning
>> fns like longjmp?
>
> EH or abnormal jumps are not problem.  Functions returning both via
> normal return and BUILT_IN_RETURN (or functions having two return statement
> one with and one without return value) are currently outlawed.
>
> We can support multiple return BBs if needed, I added TODO for that reason.
>> +  /* At present we can't pass non-SSA arguments to split function.
>> +     FIXME: this can be relaxed by passing references to arguments.  */
>> +  if (TREE_CODE (t) == PARM_DECL)
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file, "Can not split use of non-ssa function
>> parameter.\n");
>> +      return true;
>> +    }
>>
>> I think this will do excessive dumping.  At least make it
>> dump_flags & TDF_DETAILS.
>
> Done.
>> +  if ((!node->callers || !node->callers->next_caller)
>> +      && !node->address_taken
>> +      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file, "Not splitting: not called directly or called
>> once.\n");
>> +      return 0;
>> +    }
>>
>> Why the check for node->address_taken?  Why the check
>> for external visibility?
>
> We need to decide if function can possibly end up calling more than once when it reach
> inliner.  When it has address taken, we can devirtualize the call.
> When doing LTO, it might be called directly from other unit.
>>
>> Ok with the above changes (please go over all dumping and make sure
>> that without -details we only say that we are splitting and where
>> and dump the split function).
>
> You seem to be consistently on less dumping by default than me.  I find dumping
> of reasons why split points are not accepted quite useful, but I guess using
> -details is not problem.

Well, I think -fdump-tree-all should just dump all functions after
each pass.  Maybe with very little extra info.  -details is for
dumping what it did (or not) and why.

-fdump-tree-all already is several GB for tramp3d.

Richard.
Jan Hubicka June 25, 2010, 9:44 a.m. UTC | #4
Hi,
BTW collecting statistics on resons why splitting was rejected one gets:
     60  split part has non-ssa uses
    159  not inlinable.
    800  articulation found but can not split
   3862  need to pass non-param values
   8943  split size is smaller than call overhead
   9475  not called directly or called once.
  13603  entry BB has PHI
  19935  header size is too large for inline candidate
  42593  split BB frequency is too large.

I guess most of split BB frequency is too large comes from attempts to split
away checking code, but I will need to verify. Also most of the large header
sizes attempts from some silly splits, like splitting away my_friendly_abort
call out of huge function.

So it looks I should look into the PHI statements first.

Honza
H.J. Lu June 25, 2010, 6:48 p.m. UTC | #5
On Fri, Jun 25, 2010 at 1:33 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The above numbers are all with the max-inline-insns-auto reduction?
>> I'd like to have that adjustment split out as a separate followup patch.
>
> Yes, without the change we generally tend to do more inlining (as we produce
> more inline candidates).  Not exactly in all cases: when we split at point
> that the call to split function is cold, we reduce size even for unchanges
> max-inlie-insns-auto.
>
> I've removed this change from patch and will commit it sequentially.
>>
>> The partial-inlining-entry-probability default looks quite high.  I would
>> have expected that guessed probabilities of say
>>
>>   if (flag)
>>     return;
>>
>> make the entry taken 50% of the time?  So, did you experiment with
>> other values than 70 or is that something that is suggested by literature?
>
> Well, not really.  It depends on what is inside if conditional.
> If it is if (flag < 0), lets say, we will hit non-negative heuristics
> and will predict it with 79%.
>
> The value of 70 seemed high enough to take care of most of random fuzz from the
> heruistics, but it is something I plan to experiment with in the future.
> In general the predictors on tests of this type are quite ineffective.
>>
>> You run ipa-split late during early optimizations.  Will the new clone
>> you create go through the early opts again?  Thus, is the placement
>
> New clone is not considered by early optimizers. It is added as new
> function just afterwards.  We might want to change this in future, but
> that is how cgraph_add_new_function works for now (it queues the function
> to be inserted once IPA pass is done and early optimizations are one
> IPA pass).
>
>> at the very end of early opts critical?  If so please add a comment
>
> Well, it will work elsewhere too, just you miss optimizations on the split part
> of function then.  In general I see little value in moving it earlier, but I've
> added coment.
>
>> before the NEXT_PASS.  I suppose we don't re-run early opts on the
>> clone as otherwise we could end up splitting it again (and so for
>> a very large function with cascaded if (x) cheap; if (x) cheap; ....
>> we will take a very long time to compile because we split of
>> each if (x) cheap; one after the other?  Please add a testcase
>> like this.)
>
> Splitting is considered just once.  But while working on pass, just
> for fun I tried to split at every possible split point and quite surprisingly
> it did not introduce that much of compile time penalty for such a testcase.
> (it sped up insn-attrtab compilation that is pretty much like this)
>
> So multiple split points is something I plan to add, just want to do that
> incrementally so the benefits can be more easilly tracked.
>
> What should the testcase test?
>>
>> +  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
>> +       parm = TREE_CHAIN (parm))
>> +    {
>> +      if (!is_gimple_reg (parm))
>> +       {
>> +         if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
>> +           {
>> +             if (dump_file)
>> +               fprintf (dump_file,
>> +                        "  Refused: need to pass non-ssa param
>> values\n");
>>
>> I'm confused by the non_ssa_vars thing.  This is just tracking
>> whether a non-SSA parameter is being used?
>
> Yes, to handle non-ssa arguments, we will need to pass them by reference
> that means that we will need to change function prototype in less trivial
> way than by removing its arguments.
>
> arg adjustments used in ipa-sra can do this, all we need is to hook them into
> clonning machinery that is something I plan to do relatively soon too (it will
> imply support in virutal clones that needs way to stream them into cgraphopt
> section for WHOPR so it is not completely trivial, but the adjustment code was
> meant for this from beggining).
>
> There are two places we check them.  First is for SRA arguments, the other
> place is checking that if non-ssa variable is used, it is either used only by
> header or by split part, but not by both.
>
> (for same reason I also do not support yet sharing values computed by header
> and used by split part that is something I can do by passing them as argument
> or recomputing them in split part if they are cheap.  I think especially
> for casts in C++ we will want to do the recomputation in trivial cases)
>>
>> +  if (current->split_size <= call_overhead)
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file,
>> +                "  Refused: split size is smaller than call overhead\n");
>> +      return;
>> +    }
>>
>> also refuse to split if the size of the header is bigger than that
>> of the tail?  Otherwise we're just going to inline it back, no?
>
> Not necesarily - especially when the split point is cold and thus we end
> up inlining for size.
>
> This is something I want to experiment with too.  At the moment we produce
> more splitting than needed (i.e. we often inline things back) that has
> negative effect on compile time and debug info. In practice I didn't found
> this bad.
>
> Keeping the function splitting spearate from inliner heuristics, we can also
> play safe and split only when the function becomes inlinable after splitting
> or the split part is cold and its size
>> +               if (is_gimple_debug (gsi_stmt (bsi)))
>> +                 continue;
>>
>> I wonder if debug gets confused if you reference vars that are not
>> there...
>
> tree-inline will remove the debug statement (and thus we lose some debug info
> when split part is inlined back), but in general it works.
>> +/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if
>> none
>> +   found.  */
>>
>> What about BUILT_IN_RETURN?  Or returns via EH or abnormally returning
>> fns like longjmp?
>
> EH or abnormal jumps are not problem.  Functions returning both via
> normal return and BUILT_IN_RETURN (or functions having two return statement
> one with and one without return value) are currently outlawed.
>
> We can support multiple return BBs if needed, I added TODO for that reason.
>> +  /* At present we can't pass non-SSA arguments to split function.
>> +     FIXME: this can be relaxed by passing references to arguments.  */
>> +  if (TREE_CODE (t) == PARM_DECL)
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file, "Can not split use of non-ssa function
>> parameter.\n");
>> +      return true;
>> +    }
>>
>> I think this will do excessive dumping.  At least make it
>> dump_flags & TDF_DETAILS.
>
> Done.
>> +  if ((!node->callers || !node->callers->next_caller)
>> +      && !node->address_taken
>> +      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
>> +    {
>> +      if (dump_file)
>> +       fprintf (dump_file, "Not splitting: not called directly or called
>> once.\n");
>> +      return 0;
>> +    }
>>
>> Why the check for node->address_taken?  Why the check
>> for external visibility?
>
> We need to decide if function can possibly end up calling more than once when it reach
> inliner.  When it has address taken, we can devirtualize the call.
> When doing LTO, it might be called directly from other unit.
>>
>> Ok with the above changes (please go over all dumping and make sure
>> that without -details we only say that we are splitting and where
>> and dump the split function).
>
> You seem to be consistently on less dumping by default than me.  I find dumping
> of reasons why split points are not accepted quite useful, but I guess using
> -details is not problem.
>

This breaks C++:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671
Jan Hubicka June 26, 2010, 9:16 a.m. UTC | #6
> > You seem to be consistently on less dumping by default than me.  I find dumping
> > of reasons why split points are not accepted quite useful, but I guess using
> > -details is not problem.
> >
> 
> This breaks C++:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671

This does not reproduce for me (at least with "Fix PHI handling in ipa-split" patch my tree)
I am now re-trying from scratch.

Honza
> 
> -- 
> H.J.
Paolo Carlini June 26, 2010, 9:35 a.m. UTC | #7
On 06/26/2010 11:16 AM, Jan Hubicka wrote:
>>> You seem to be consistently on less dumping by default than me.  I find dumping
>>> of reasons why split points are not accepted quite useful, but I guess using
>>> -details is not problem.
>>>
>>>       
>> This breaks C++:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671
>>     
> This does not reproduce for me (at least with "Fix PHI handling in ipa-split" patch my tree)
> I am now re-trying from scratch.
>   
The problem is definitely there. See, for example:

    http://gcc.gnu.org/ml/gcc-testresults/2010-06/msg02660.html

and I have just reproduced it myself once more with r161427.

Paolo.
Jan Hubicka June 26, 2010, 10 a.m. UTC | #8
> On 06/26/2010 11:16 AM, Jan Hubicka wrote:
> >>> You seem to be consistently on less dumping by default than me.  I find dumping
> >>> of reasons why split points are not accepted quite useful, but I guess using
> >>> -details is not problem.
> >>>
> >>>       
> >> This breaks C++:
> >>
> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671
> >>     
> > This does not reproduce for me (at least with "Fix PHI handling in ipa-split" patch my tree)
> > I am now re-trying from scratch.
> >   
> The problem is definitely there. See, for example:
> 
>     http://gcc.gnu.org/ml/gcc-testresults/2010-06/msg02660.html
> 
> and I have just reproduced it myself once more with r161427.

It is interesting indeed, I get
                === libstdc++ Summary ===

# of expected passes            7160
# of expected failures          61
# of unsupported tests          341

on build on gcc14 compilation farm.  On gcc17 the problem however reproduce and goes away when the
testsuite_shard is comiled with -fno-partial-inlining.

Honza
> 
> Paolo.
Jan Hubicka June 26, 2010, 10:20 a.m. UTC | #9
> > On 06/26/2010 11:16 AM, Jan Hubicka wrote:
> > >>> You seem to be consistently on less dumping by default than me.  I find dumping
> > >>> of reasons why split points are not accepted quite useful, but I guess using
> > >>> -details is not problem.
> > >>>
> > >>>       
> > >> This breaks C++:
> > >>
> > >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671
> > >>     
> > > This does not reproduce for me (at least with "Fix PHI handling in ipa-split" patch my tree)
> > > I am now re-trying from scratch.
> > >   
> > The problem is definitely there. See, for example:
> > 
> >     http://gcc.gnu.org/ml/gcc-testresults/2010-06/msg02660.html
> > 
> > and I have just reproduced it myself once more with r161427.
> 
> It is interesting indeed, I get
>                 === libstdc++ Summary ===
> 
> # of expected passes            7160
> # of expected failures          61
> # of unsupported tests          341
> 
> on build on gcc14 compilation farm.  On gcc17 the problem however reproduce and goes away when the
> testsuite_shard is comiled with -fno-partial-inlining.

The difference is use of GOLD as linker.  I am investigating if we are hitting GNU LD bug or if GCC incorrectly
emits direct calls to the function in question.

Honza
Paolo Carlini June 26, 2010, 10:30 a.m. UTC | #10
Hi,

> It is interesting indeed, I get
>                === libstdc++ Summary ===
> 
> # of expected passes            7160
> # of expected failures          61
> # of unsupported tests          341
> 
> on build on gcc14 compilation farm.  On gcc17 the problem however reproduce and goes away when the
> testsuite_shard is comiled with -fno-partial-inlining.

Fine. About gcc14, I have frankly no idea - are OS and toolchain up to date on that machine? - apparently the problem triggers in different ways on 32-bit vs 64-bit and may even require a recent linker.

By the way, the 341 unsupported tell me that the GNU locale model is not used and many locale-related tests are thus not run at all, in other terms, I would not recommend such setup for serious testing of the C++ library.

Paolo
H.J. Lu June 28, 2010, 12:23 a.m. UTC | #11
On Fri, Jun 25, 2010 at 11:48 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 25, 2010 at 1:33 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> The above numbers are all with the max-inline-insns-auto reduction?
>>> I'd like to have that adjustment split out as a separate followup patch.
>>
>> Yes, without the change we generally tend to do more inlining (as we produce
>> more inline candidates).  Not exactly in all cases: when we split at point
>> that the call to split function is cold, we reduce size even for unchanges
>> max-inlie-insns-auto.
>>
>> I've removed this change from patch and will commit it sequentially.
>>>
>>> The partial-inlining-entry-probability default looks quite high.  I would
>>> have expected that guessed probabilities of say
>>>
>>>   if (flag)
>>>     return;
>>>
>>> make the entry taken 50% of the time?  So, did you experiment with
>>> other values than 70 or is that something that is suggested by literature?
>>
>> Well, not really.  It depends on what is inside if conditional.
>> If it is if (flag < 0), lets say, we will hit non-negative heuristics
>> and will predict it with 79%.
>>
>> The value of 70 seemed high enough to take care of most of random fuzz from the
>> heruistics, but it is something I plan to experiment with in the future.
>> In general the predictors on tests of this type are quite ineffective.
>>>
>>> You run ipa-split late during early optimizations.  Will the new clone
>>> you create go through the early opts again?  Thus, is the placement
>>
>> New clone is not considered by early optimizers. It is added as new
>> function just afterwards.  We might want to change this in future, but
>> that is how cgraph_add_new_function works for now (it queues the function
>> to be inserted once IPA pass is done and early optimizations are one
>> IPA pass).
>>
>>> at the very end of early opts critical?  If so please add a comment
>>
>> Well, it will work elsewhere too, just you miss optimizations on the split part
>> of function then.  In general I see little value in moving it earlier, but I've
>> added coment.
>>
>>> before the NEXT_PASS.  I suppose we don't re-run early opts on the
>>> clone as otherwise we could end up splitting it again (and so for
>>> a very large function with cascaded if (x) cheap; if (x) cheap; ....
>>> we will take a very long time to compile because we split of
>>> each if (x) cheap; one after the other?  Please add a testcase
>>> like this.)
>>
>> Splitting is considered just once.  But while working on pass, just
>> for fun I tried to split at every possible split point and quite surprisingly
>> it did not introduce that much of compile time penalty for such a testcase.
>> (it sped up insn-attrtab compilation that is pretty much like this)
>>
>> So multiple split points is something I plan to add, just want to do that
>> incrementally so the benefits can be more easilly tracked.
>>
>> What should the testcase test?
>>>
>>> +  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
>>> +       parm = TREE_CHAIN (parm))
>>> +    {
>>> +      if (!is_gimple_reg (parm))
>>> +       {
>>> +         if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
>>> +           {
>>> +             if (dump_file)
>>> +               fprintf (dump_file,
>>> +                        "  Refused: need to pass non-ssa param
>>> values\n");
>>>
>>> I'm confused by the non_ssa_vars thing.  This is just tracking
>>> whether a non-SSA parameter is being used?
>>
>> Yes, to handle non-ssa arguments, we will need to pass them by reference
>> that means that we will need to change function prototype in less trivial
>> way than by removing its arguments.
>>
>> arg adjustments used in ipa-sra can do this, all we need is to hook them into
>> clonning machinery that is something I plan to do relatively soon too (it will
>> imply support in virutal clones that needs way to stream them into cgraphopt
>> section for WHOPR so it is not completely trivial, but the adjustment code was
>> meant for this from beggining).
>>
>> There are two places we check them.  First is for SRA arguments, the other
>> place is checking that if non-ssa variable is used, it is either used only by
>> header or by split part, but not by both.
>>
>> (for same reason I also do not support yet sharing values computed by header
>> and used by split part that is something I can do by passing them as argument
>> or recomputing them in split part if they are cheap.  I think especially
>> for casts in C++ we will want to do the recomputation in trivial cases)
>>>
>>> +  if (current->split_size <= call_overhead)
>>> +    {
>>> +      if (dump_file)
>>> +       fprintf (dump_file,
>>> +                "  Refused: split size is smaller than call overhead\n");
>>> +      return;
>>> +    }
>>>
>>> also refuse to split if the size of the header is bigger than that
>>> of the tail?  Otherwise we're just going to inline it back, no?
>>
>> Not necesarily - especially when the split point is cold and thus we end
>> up inlining for size.
>>
>> This is something I want to experiment with too.  At the moment we produce
>> more splitting than needed (i.e. we often inline things back) that has
>> negative effect on compile time and debug info. In practice I didn't found
>> this bad.
>>
>> Keeping the function splitting spearate from inliner heuristics, we can also
>> play safe and split only when the function becomes inlinable after splitting
>> or the split part is cold and its size
>>> +               if (is_gimple_debug (gsi_stmt (bsi)))
>>> +                 continue;
>>>
>>> I wonder if debug gets confused if you reference vars that are not
>>> there...
>>
>> tree-inline will remove the debug statement (and thus we lose some debug info
>> when split part is inlined back), but in general it works.
>>> +/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if
>>> none
>>> +   found.  */
>>>
>>> What about BUILT_IN_RETURN?  Or returns via EH or abnormally returning
>>> fns like longjmp?
>>
>> EH or abnormal jumps are not problem.  Functions returning both via
>> normal return and BUILT_IN_RETURN (or functions having two return statement
>> one with and one without return value) are currently outlawed.
>>
>> We can support multiple return BBs if needed, I added TODO for that reason.
>>> +  /* At present we can't pass non-SSA arguments to split function.
>>> +     FIXME: this can be relaxed by passing references to arguments.  */
>>> +  if (TREE_CODE (t) == PARM_DECL)
>>> +    {
>>> +      if (dump_file)
>>> +       fprintf (dump_file, "Can not split use of non-ssa function
>>> parameter.\n");
>>> +      return true;
>>> +    }
>>>
>>> I think this will do excessive dumping.  At least make it
>>> dump_flags & TDF_DETAILS.
>>
>> Done.
>>> +  if ((!node->callers || !node->callers->next_caller)
>>> +      && !node->address_taken
>>> +      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
>>> +    {
>>> +      if (dump_file)
>>> +       fprintf (dump_file, "Not splitting: not called directly or called
>>> once.\n");
>>> +      return 0;
>>> +    }
>>>
>>> Why the check for node->address_taken?  Why the check
>>> for external visibility?
>>
>> We need to decide if function can possibly end up calling more than once when it reach
>> inliner.  When it has address taken, we can devirtualize the call.
>> When doing LTO, it might be called directly from other unit.
>>>
>>> Ok with the above changes (please go over all dumping and make sure
>>> that without -details we only say that we are splitting and where
>>> and dump the split function).
>>
>> You seem to be consistently on less dumping by default than me.  I find dumping
>> of reasons why split points are not accepted quite useful, but I guess using
>> -details is not problem.
>>
>
> This breaks C++:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44671

Also caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44687
Eric Botcazou June 28, 2010, 10:16 p.m. UTC | #12
> this patch implements function splitting that allows partial inlining.
> Functions are split if there is BB whose frequency is smaller than
> frequency of entry BB such that the control flow must past through it just
> once and afterwards we don't use values computed beforehand.

Are you sure we want that at -O2?  This looks more like an -O3 kind of things 
to me, it can do dumb things like splitting a function and inlining back the 
broken out part into the remaining one, with a lot of useless tree copying 
in-between.

The attached testcase is even more interesting: the compiler used to inline a 
pragma Inline function, now it first splits it, then inlines back the broken 
out part, and finally doesn't inline the reconstructed function because the 
size estimate isn't the same as the original one!

To reproduce: put the testcase into $(objdir)/gcc and run
  ./gnat1 -quiet -O2 -gnatpgn -gnatwns -Iada -I$(srcdir)/gcc/ada exp_ch4.adb

The ICE is an unrelated known issue that I don't want to fix (for now) because 
it happens to catch problems upstream (it already caught the noreturn bug we 
discussed a few weeks ago).  The interesting function is Atree.Parent.
Jan Hubicka June 28, 2010, 10:55 p.m. UTC | #13
> > this patch implements function splitting that allows partial inlining.
> > Functions are split if there is BB whose frequency is smaller than
> > frequency of entry BB such that the control flow must past through it just
> > once and afterwards we don't use values computed beforehand.
> 
> Are you sure we want that at -O2?  This looks more like an -O3 kind of things 
> to me, it can do dumb things like splitting a function and inlining back the 
> broken out part into the remaining one, with a lot of useless tree copying 
> in-between.

Well, at -O2 it splits only functions declared inline so it should not be terribly
back (i.e. it does not show as slowdown at our testers).
One option would be to further limit splitting only when call to split part
is cold or it makes previously uninlinable function inlinable so the degenerate
case of splitting and re-combining does not happen.
> 
> The attached testcase is even more interesting: the compiler used to inline a 
> pragma Inline function, now it first splits it, then inlines back the broken 
> out part, and finally doesn't inline the reconstructed function because the 
> size estimate isn't the same as the original one!
> 
> To reproduce: put the testcase into $(objdir)/gcc and run
>   ./gnat1 -quiet -O2 -gnatpgn -gnatwns -Iada -I$(srcdir)/gcc/ada exp_ch4.adb
> 
> The ICE is an unrelated known issue that I don't want to fix (for now) because 
> it happens to catch problems upstream (it already caught the noreturn bug we 
> discussed a few weeks ago).  The interesting function is Atree.Parent.

Jakub sent me a testcase where similarly funny thing happen, there it is because
the split part has frequency of 0 everywhere (it is guarted by builtin_expect).
This confuses inliner herusitic by concluding that time spent in function is 0
and savings is a call cost (a roundof error to prevent division by zero).
I will work on fixing that one first hopefully fixing this one too.

I suspect this scenario can happen too in more legal cases when the split part
ends up looking better candidate for inlining than the header.  This can also
be computed beforehand.

Honza
> 
> -- 
> Eric Botcazou

> with Atree;    use Atree;
> with Einfo;    use Einfo;
> with Exp_Util; use Exp_Util;
> with Nlists;   use Nlists;
> with Nmake;    use Nmake;
> with Rtsfind;  use Rtsfind;
> with Sinfo;    use Sinfo;
> with Tbuild;   use Tbuild;
> 
> package body Exp_Ch4 is
> 
>    procedure Expand_N_Allocator (N : Node_Id) is
>       PtrT  : constant Entity_Id  := Etype (N);
>       Loc   : constant Source_Ptr := Sloc (N);
>       T : constant Entity_Id := Entity (Expression (N));
>       Args, Decls : List_Id;
>       Arg1, Decl, Temp_Decl : Node_Id;
>       Temp  : Entity_Id;
> 
>    begin
> 
>       Temp := Make_Temporary (Loc, 'P');
> 
>       Arg1 :=
>         Make_Explicit_Dereference (Loc,
>           Prefix => New_Reference_To (Temp, Loc));
> 
>       Args := New_List (Arg1);
> 
>       if Has_Task (T) then
>          if Nkind (Parent (N)) = N_Assignment_Statement then
>             declare
>                Nam : constant Node_Id := Name (Parent (N));
> 
>             begin
>                if Is_Entity_Name (Nam) then
>                   Decls :=
>                     Build_Task_Image_Decls
>                       (Loc,
>                        New_Occurrence_Of
>                          (Entity (Nam), Sloc (Nam)), T);
> 
>                else
>                   Decls := Build_Task_Image_Decls (Loc, T, T);
>                end if;
>             end;
> 
>          elsif Nkind (Parent (N)) = N_Object_Declaration then
>             Decls :=
>               Build_Task_Image_Decls
>                 (Loc, Defining_Identifier (Parent (N)), T);
> 
>          else
>             Decls := Build_Task_Image_Decls (Loc, T, T);
>          end if;
>       end if;
> 
>       Temp_Decl :=
>         Make_Object_Declaration (Loc,
>           Defining_Identifier => Temp,
>           Constant_Present    => True,
>           Object_Definition   => New_Reference_To (PtrT, Loc),
>           Expression          => N);
> 
>       Insert_Action (N, Temp_Decl, Suppress => All_Checks);
> 
>    exception
>       when RE_Not_Available => return;
>    end Expand_N_Allocator;
> 
> end Exp_Ch4;

> with Types; use Types;
> 
> package Exp_Ch4 is
> 
>    procedure Expand_N_Allocator (N : Node_Id);
> 
> end Exp_Ch4;
Eric Botcazou June 28, 2010, 11:27 p.m. UTC | #14
> Well, at -O2 it splits only functions declared inline so it should not be
> terribly back (i.e. it does not show as slowdown at our testers).
> One option would be to further limit splitting only when call to split part
> is cold or it makes previously uninlinable function inlinable so the
> degenerate case of splitting and re-combining does not happen.

OK, but historically we have restricted sophisticated optimizations to -O3.
I think users will be surprised when they discover that their small inline 
functions are broken apart at -O2 without clear reasons.  The recent problems 
with 4.5 show that computing good inlining estimates is relatively hard.

> Jakub sent me a testcase where similarly funny thing happen, there it is
> because the split part has frequency of 0 everywhere (it is guarted by
> builtin_expect). This confuses inliner herusitic by concluding that time
> spent in function is 0 and savings is a call cost (a roundof error to
> prevent division by zero). I will work on fixing that one first hopefully
> fixing this one too.

Thanks in advance.
Jan Hubicka June 29, 2010, 12:17 a.m. UTC | #15
> > Well, at -O2 it splits only functions declared inline so it should not be
> > terribly back (i.e. it does not show as slowdown at our testers).
> > One option would be to further limit splitting only when call to split part
> > is cold or it makes previously uninlinable function inlinable so the
> > degenerate case of splitting and re-combining does not happen.
> 
> OK, but historically we have restricted sophisticated optimizations to -O3.
> I think users will be surprised when they discover that their small inline 
> functions are broken apart at -O2 without clear reasons.  The recent problems 
> with 4.5 show that computing good inlining estimates is relatively hard.

I would say that -O3 is more about risky optimizations.  Those that might not
help or make aggressive tradeoffs in between size and speed.  We definitly do a
lot of smart things at -O2 that might surprise users.  For example type based
aliasing.  (or at IPA we introduced autoinlining for size in 4.3, cloning in
4.4 and more).

Partial inlining should not be risky in this way. It does not make programs
bigger. And the problems with tuning inliner is partly reason why I worked on
the pass - tuning is so hard because we mis other useful transforms otherwise.
Inline limits are pushed up by few benchmarks that reuiqre inlining very large
functions that, some of them can be cured with partial inlining and other
compilers do partial inlining by default too.

I agree that in current setting the pass is too active.  It helps to identify
issues and then it should be easy to selectively trottle it at -O2 (and partly
at -O3 too) while retuning inliner later in development of 4.6.

Honza
> 
> > Jakub sent me a testcase where similarly funny thing happen, there it is
> > because the split part has frequency of 0 everywhere (it is guarted by
> > builtin_expect). This confuses inliner herusitic by concluding that time
> > spent in function is 0 and savings is a call cost (a roundof error to
> > prevent division by zero). I will work on fixing that one first hopefully
> > fixing this one too.
> 
> Thanks in advance.
> 
> -- 
> Eric Botcazou
diff mbox

Patch

Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 160109)
+++ tree-pass.h	(working copy)
@@ -442,6 +442,7 @@  extern struct gimple_opt_pass pass_build
 extern struct gimple_opt_pass pass_local_pure_const;
 extern struct gimple_opt_pass pass_tracer;
 extern struct gimple_opt_pass pass_warn_unused_result;
+extern struct gimple_opt_pass pass_split_functions;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_function_and_variable_visibility;
Index: opts.c
===================================================================
--- opts.c	(revision 160109)
+++ opts.c	(working copy)
@@ -910,6 +910,7 @@  decode_options (unsigned int argc, const
   opt2 = (optimize >= 2);
   flag_inline_small_functions = opt2;
   flag_indirect_inlining = opt2;
+  flag_partial_inlining = opt2;
   flag_thread_jumps = opt2;
   flag_crossjumping = opt2;
   flag_optimize_sibling_calls = opt2;
Index: timevar.def
===================================================================
--- timevar.def	(revision 160109)
+++ timevar.def	(working copy)
@@ -52,6 +52,7 @@  DEFTIMEVAR (TV_CGRAPH                , "
 DEFTIMEVAR (TV_CGRAPHOPT             , "callgraph optimization")
 DEFTIMEVAR (TV_VARPOOL               , "varpool construction")
 DEFTIMEVAR (TV_IPA_CONSTANT_PROP     , "ipa cp")
+DEFTIMEVAR (TV_IPA_FNSPLIT           , "ipa function splitting")
 DEFTIMEVAR (TV_IPA_LTO_GIMPLE_IO     , "ipa lto gimple I/O")
 DEFTIMEVAR (TV_IPA_LTO_DECL_IO       , "ipa lto decl I/O")
 DEFTIMEVAR (TV_IPA_LTO_DECL_INIT_IO  , "ipa lto decl init I/O")
Index: ipa-split.c
===================================================================
--- ipa-split.c	(revision 0)
+++ ipa-split.c	(revision 0)
@@ -0,0 +1,1069 @@ 
+/* Function splitting pass
+   Copyright (C) 2010
+   Free Software Foundation, Inc.
+   Contributed by Jan Hubicka  <jh@suse.cz>
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* The purpose of this pass is to split function bodies to improve
+   inlining.  I.e. for function of the form:
+
+   func (...)
+     {
+       if (cheap_test)
+	 something_small
+       else
+	 something_big
+     }
+
+   Produce:
+
+   func.part (...)
+     {
+	something_big
+     }
+
+   func (...)
+     {
+       if (cheap_test)
+	 something_small
+       else
+	 func.part (...);
+     }
+
+   When func becomes inlinable and when cheap_test is often true, inlining func,
+   but not fund.part leads to performance imrovement similar as inlining original
+   func while the code size growth is smaller.
+
+   The pass is organized in three stages:
+   1) Collect local info about basic block into BB_INFO structure and
+      compute function body estimated size and time.
+   2) Via DFS walk find all possible basic blocks where we can split
+      and chose best one.
+   3) If split point is found, split at the specified BB by creating a clone
+      and updating function to call it.  
+
+   The decisions what functions to split are in execute_split_functions
+   and consider_split.  
+
+   There are several possible future improvements for this pass including:
+
+   1) Splitting to break up large functions
+   2) Splitting to reduce stack frame usage
+   3) Allow split part of function to use values computed in the header part.
+      The values needs to be passed to split function, perhaps via same interface
+      as for nested functions or as argument.
+   4) Support for simple rematerialization.  I.e. when split part use
+      value computed in header from function parameter in very cheap way, we
+      can just recompute it.
+   5) Support splitting of nested functions.
+   6) Support non-SSA arguments.  
+   7) There is nothing preventing us from producing multiple parts of single function
+      when needed or splitting also the parts.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tree.h"
+#include "target.h"
+#include "cgraph.h"
+#include "ipa-prop.h"
+#include "tree-flow.h"
+#include "tree-pass.h"
+#include "flags.h"
+#include "timevar.h"
+#include "diagnostic.h"
+#include "tree-dump.h"
+#include "tree-inline.h"
+#include "fibheap.h"
+#include "params.h"
+#include "gimple-pretty-print.h"
+
+/* Per basic block info.  */
+
+typedef struct
+{
+  unsigned int size;
+  unsigned int time;
+} bb_info;
+DEF_VEC_O(bb_info);
+DEF_VEC_ALLOC_O(bb_info,heap);
+
+static VEC(bb_info, heap) *bb_info_vec;
+
+/* Description of split point.  */
+
+struct split_point
+{
+  /* Size of the partitions.  */
+  unsigned int header_time, header_size, split_time, split_size;
+
+  /* SSA names that need to be passed into spit funciton.  */
+  bitmap ssa_names_to_pass;
+
+  /* Basic block where we split (that will become entry point of new function.  */
+  basic_block entry_bb;
+
+  /* Basic blocks we are splitting away.  */
+  bitmap split_bbs;
+};
+
+/* Best split point found.  */
+
+struct split_point best_split_point;
+
+/* Callback for walk_stmt_load_store_addr_ops.  If T is non-ssa automatic
+   variable, check it if it is present in bitmap passed via DATA.  */
+
+static bool
+test_nonssa_use (gimple stmt ATTRIBUTE_UNUSED, tree t,
+	         void *data ATTRIBUTE_UNUSED)
+{
+  t = get_base_address (t);
+
+  if (t && !is_gimple_reg (t)
+      && ((TREE_CODE (t) == VAR_DECL && auto_var_in_fn_p (t, current_function_decl))
+	  || (TREE_CODE (t) == PARM_DECL)))
+    return bitmap_bit_p ((bitmap)data, DECL_UID (t));
+  return false;
+}
+
+/* Dump split point CURRENT.  */
+
+static void
+dump_split_point (FILE * file, struct split_point *current)
+{
+  fprintf (file,
+	   "Split point at BB %i header time:%i header size: %i split time: %i split size: %i\n  bbs: ",
+	   current->entry_bb->index, current->header_time,
+	   current->header_size, current->split_time, current->split_size);
+  dump_bitmap (file, current->split_bbs);
+  fprintf (file, "  SSA names to pass: ");
+  dump_bitmap (file, current->ssa_names_to_pass);
+}
+
+/* We found an split_point CURRENT.  NON_SSA_VARS is bitmap of all non ssa
+   variables used and RETURN_BB is return basic block.
+   See if we can split function here.  */
+
+static void
+consider_split (struct split_point *current, bitmap non_ssa_vars,
+		basic_block return_bb)
+{
+  tree parm;
+  unsigned int num_args = 0;
+  unsigned int call_overhead;
+  edge e;
+  edge_iterator ei;
+  if (dump_file)
+    dump_split_point (dump_file, current);
+
+  /* Do not split when we would end up calling function anyway.  */
+  if (current->entry_bb->frequency
+      >= (ENTRY_BLOCK_PTR->frequency
+	  * PARAM_VALUE (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY) / 100))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  Refused: split BB frequency is too large.\n");
+      return;
+    }
+
+  if (!current->header_size)
+    {
+      if (dump_file)
+	fprintf (dump_file, "  Refused: header empty\n");
+      gcc_unreachable ();
+      return;
+    }
+
+  /* FIXME: We can do better: if the split region start with a loop and there is only
+     one entry point from outer wrold, we can update PHI.  */
+  if (!gsi_end_p (gsi_start_phis (current->entry_bb)))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  Refused: entry BB has PHI\n");
+      return;
+    }
+
+
+  /* See what argument we will pass to the split function and compute
+     call overhead.  */
+  call_overhead = eni_size_weights.call_cost;
+  for (parm = DECL_ARGUMENTS (current_function_decl); parm;
+       parm = TREE_CHAIN (parm))
+    {
+      if (!is_gimple_reg (parm))
+	{
+	  if (bitmap_bit_p (non_ssa_vars, DECL_UID (parm)))
+	    {
+	      if (dump_file)
+		fprintf (dump_file,
+			 "  Refused: need to pass non-ssa param values\n");
+	      return;
+	    }
+	}
+      else if (gimple_default_def (cfun, parm)
+	       && bitmap_bit_p (current->ssa_names_to_pass,
+				SSA_NAME_VERSION (gimple_default_def
+						  (cfun, parm))))
+	{
+	  if (!VOID_TYPE_P (TREE_TYPE (parm)))
+	    call_overhead += estimate_move_cost (TREE_TYPE (parm));
+	  num_args++;
+	}
+    }
+  if (!VOID_TYPE_P (TREE_TYPE (current_function_decl)))
+    call_overhead += estimate_move_cost (TREE_TYPE (current_function_decl));
+
+  if (current->split_size <= call_overhead)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  Refused: split size is smaller than call overhead\n");
+      return;
+    }
+  if (current->header_size + call_overhead
+      >= (unsigned int)(DECL_DECLARED_INLINE_P (current_function_decl)
+			? MAX_INLINE_INSNS_SINGLE
+			: MAX_INLINE_INSNS_AUTO))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  Refused: header size is too large for inline candidate\n");
+      return;
+    }
+
+  /* FIXME: we currently can pass only SSA function parameters to the split
+     arguments.  Once parm_adjustment infrastructure is supported by clonning,
+     we can pass more than that.  */
+  if (num_args != bitmap_count_bits (current->ssa_names_to_pass))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  Refused: need to pass non-param values\n");
+      return;
+    }
+
+  /* When there are non-ssa vars used in the split region, see if they
+     are used in the header region.  If so, reject the split.
+     FIXME: we can use nested function support to access both.  */
+  if (!bitmap_empty_p (non_ssa_vars))
+    {
+      basic_block bb;
+      FOR_EACH_BB (bb)
+	if (!bitmap_bit_p (current->split_bbs, bb->index))
+	  {
+	    gimple_stmt_iterator bsi;
+	    for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+	      {
+		if (is_gimple_debug (gsi_stmt (bsi)))
+		  continue;
+		if (walk_stmt_load_store_addr_ops
+		    (gsi_stmt (bsi), non_ssa_vars, test_nonssa_use,
+		     test_nonssa_use, test_nonssa_use))
+		  {
+		    if (dump_file)
+		      fprintf (dump_file,
+			       "  Refused: split part has non-ssa uses\n");
+		    return;
+		  }
+	      }
+	    for (bsi = gsi_start_phis (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+	      {
+		if (is_gimple_debug (gsi_stmt (bsi)))
+		  continue;
+		if (walk_stmt_load_store_addr_ops
+		    (gsi_stmt (bsi), non_ssa_vars, test_nonssa_use,
+		     test_nonssa_use, test_nonssa_use))
+		  {
+		    if (dump_file)
+		      fprintf (dump_file,
+			       "  Refused: split part has non-ssa uses\n");
+		    return;
+		  }
+	      }
+	    FOR_EACH_EDGE (e, ei, bb->succs)
+	      if (e->dest == return_bb)
+		{
+		  for (bsi = gsi_start_phis (return_bb); !gsi_end_p (bsi);
+		       gsi_next (&bsi))
+		    {
+		      gimple stmt = gsi_stmt (bsi);
+		      tree op = gimple_phi_arg_def (stmt, e->dest_idx);
+
+		      if (!is_gimple_reg (gimple_phi_result (stmt)))
+			continue;
+		      STRIP_NOPS (op);
+		      if (TREE_CODE (op) != SSA_NAME
+			  && test_nonssa_use (stmt, op, non_ssa_vars))
+			{
+			  if (dump_file)
+			    fprintf (dump_file,
+				     "  Refused: split part has non-ssa uses\n");
+			  return;
+			}
+		    }
+		}
+	  }
+      return;
+    }
+  if (dump_file)
+    fprintf (dump_file, "  Accepted!\n");
+
+  /* At the moment chose split point with lowest frequency and that leaves
+     out smallest size of header.
+     In future we might re-consider this heuristics.  */
+  if (!best_split_point.split_bbs
+      || best_split_point.entry_bb->frequency > current->entry_bb->frequency
+      || (best_split_point.entry_bb->frequency == current->entry_bb->frequency
+	  && best_split_point.split_size < current->split_size))
+	
+    {
+      if (dump_file)
+	fprintf (dump_file, "  New best split point!\n");
+      if (best_split_point.ssa_names_to_pass)
+	{
+	  BITMAP_FREE (best_split_point.ssa_names_to_pass);
+	  BITMAP_FREE (best_split_point.split_bbs);
+	}
+      best_split_point = *current;
+      best_split_point.ssa_names_to_pass = BITMAP_ALLOC (NULL);
+      bitmap_copy (best_split_point.ssa_names_to_pass,
+		   current->ssa_names_to_pass);
+      best_split_point.split_bbs = BITMAP_ALLOC (NULL);
+      bitmap_copy (best_split_point.split_bbs, current->split_bbs);
+    }
+}
+
+/* Return basic block containing RETURN statement, or EXIT_BLOCK_PTR if none
+   found.  */
+
+static basic_block
+find_return_bb (void)
+{
+  edge e;
+  edge_iterator ei;
+  bool found_return = false;
+
+  if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1)
+    FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
+      {
+	gimple_stmt_iterator bsi;
+	for (bsi = gsi_start_bb (e->src); !gsi_end_p (bsi); gsi_next (&bsi))
+	  if (gimple_code (gsi_stmt (bsi)) != GIMPLE_RETURN
+	      && gimple_code (gsi_stmt (bsi)) != GIMPLE_LABEL
+	      && !is_gimple_debug (gsi_stmt (bsi)))
+	    break;
+	  else if (gimple_code (gsi_stmt (bsi)) == GIMPLE_RETURN)
+	    found_return = true;
+	if (gsi_end_p (bsi) && found_return)
+	  return e->src;
+      }
+  return EXIT_BLOCK_PTR;
+}
+
+/* Callback for walk_stmt_load_store_addr_ops.  If T is non-ssa automatic
+   variable, mark it as used in bitmap passed via DATA. 
+   Return true when access to T prevents splitting the function.  */
+
+static bool
+mark_nonssa_use (gimple stmt ATTRIBUTE_UNUSED, tree t,
+	         void *data ATTRIBUTE_UNUSED)
+{
+  t = get_base_address (t);
+
+  if (!t || is_gimple_reg (t))
+    return false;
+
+  /* At present we can't pass non-SSA arguments to split function.
+     FIXME: this can be relaxed by passing references to arguments.  */
+  if (TREE_CODE (t) == PARM_DECL)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Can not split use of non-ssa function parameter.\n");
+      return true;
+    }
+
+  if (TREE_CODE (t) == VAR_DECL && auto_var_in_fn_p (t, current_function_decl))
+    bitmap_set_bit ((bitmap)data, DECL_UID (t));
+  return false;
+}
+
+/* Compute local properties of basic block BB we collect when looking for split points
+   point.
+   We look for ssa defs and store them in SET_SSA_NAMES, for ssa uses and store them
+   in USED_SSA_NAMES and for any non-SSA automatic vars stored in NON_SSA_VARS.
+
+   When BB has edge to RETURN_BB, collect uses in RETURN_BB too.  
+
+   Return false when BB contains something that prevents it from being put into
+   split function.  */
+
+static bool
+visit_bb (basic_block bb, basic_block return_bb,
+	  bitmap set_ssa_names, bitmap used_ssa_names,
+	  bitmap non_ssa_vars)
+{
+  gimple_stmt_iterator bsi;
+  edge e;
+  edge_iterator ei;
+  bool can_split = true;
+
+  for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+    {
+      gimple stmt = gsi_stmt (bsi);
+      tree op;
+      ssa_op_iter iter;
+      tree decl;
+
+      if (is_gimple_debug (stmt))
+	continue;
+
+      /* FIXME: We can split regions containing EH.  We can not however
+	 split RESX, EH_DISPATCH and EH_POINTER referring to same region
+	 into different partitions.  This would require tracking of
+	 EH regions and checking in consider_split_point if they 
+	 are not used elsewhere.  */
+      if (gimple_code (stmt) == GIMPLE_RESX
+	  && stmt_can_throw_external (stmt))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Can not split external resx.\n");
+	  can_split = false;
+	}
+      if (gimple_code (stmt) == GIMPLE_EH_DISPATCH)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Can not split eh dispatch.\n");
+	  can_split = false;
+	}
+
+      /* Check builtins that prevent splitting.  */
+      if (gimple_code (stmt) == GIMPLE_CALL
+	  && (decl = gimple_call_fndecl (stmt)) != NULL_TREE
+	  && DECL_BUILT_IN (decl))
+	switch (DECL_FUNCTION_CODE (decl))
+	  {
+	  /* FIXME: once we will allow passing non-parm values to split part,
+	     we need to be sure to handle correct builtin_stack_save and
+	     builtin_stack_restore.  At the moment we are safe; there is no
+	     way to store builtin_stack_save result in non-SSA variable
+	     since all calls to those are compiler generated.  */
+	  case BUILT_IN_APPLY:
+	  case BUILT_IN_VA_START:
+	    if (dump_file)
+	      fprintf (dump_file, "Can not split builtin_apply and va_start.\n");
+	    can_split = false;
+	    break;
+	  case BUILT_IN_EH_POINTER:
+	    if (dump_file)
+	      fprintf (dump_file, "Can not split builtin_eh_pointer.\n");
+	    can_split = false;
+	    break;
+	  default:
+	    break;
+	  }
+
+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
+	if (TREE_CODE (op) == SSA_NAME)
+	  bitmap_set_bit (set_ssa_names, SSA_NAME_VERSION (op));
+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
+	if (TREE_CODE (op) == SSA_NAME)
+	  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
+      can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
+						   mark_nonssa_use,
+						   mark_nonssa_use,
+						   mark_nonssa_use);
+    }
+  for (bsi = gsi_start_phis (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+    {
+      gimple stmt = gsi_stmt (bsi);
+      tree op;
+      ssa_op_iter iter;
+
+      if (is_gimple_debug (stmt))
+	continue;
+      if (!is_gimple_reg (gimple_phi_result (stmt)))
+	continue;
+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_DEF)
+	if (TREE_CODE (op) == SSA_NAME)
+	  bitmap_set_bit (set_ssa_names, SSA_NAME_VERSION (op));
+      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
+	if (TREE_CODE (op) == SSA_NAME)
+	  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
+      can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
+						   mark_nonssa_use,
+						   mark_nonssa_use,
+						   mark_nonssa_use);
+    }
+  /* Record also uses comming from PHI operand in return BB.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    if (e->dest == return_bb)
+      {
+	bool found_phi = false;
+	for (bsi = gsi_start_phis (return_bb); !gsi_end_p (bsi); gsi_next (&bsi))
+	  {
+	    gimple stmt = gsi_stmt (bsi);
+	    tree op = gimple_phi_arg_def (stmt, e->dest_idx);
+
+	    if (is_gimple_debug (stmt))
+	      continue;
+	    if (!is_gimple_reg (gimple_phi_result (stmt)))
+	      continue;
+	    found_phi = true;
+	    STRIP_NOPS (op);
+	    if (TREE_CODE (op) == SSA_NAME)
+	      bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
+	    else
+	      can_split &= !mark_nonssa_use (stmt, op, non_ssa_vars);
+	  }
+	if (!gsi_end_p (gsi_last_bb (return_bb)))
+	  {
+	    ssa_op_iter iter;
+	    gimple stmt = gsi_stmt (gsi_last_bb (return_bb));
+	    tree op;
+	    if (!found_phi)
+	      FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
+		if (TREE_CODE (op) == SSA_NAME)
+		  bitmap_set_bit (used_ssa_names, SSA_NAME_VERSION (op));
+	    can_split &= !walk_stmt_load_store_addr_ops (stmt, non_ssa_vars,
+							 mark_nonssa_use,
+							 mark_nonssa_use,
+							 mark_nonssa_use);
+	  }
+      }
+  return can_split;
+}
+
+/* Stack entry for recursive DFS walk in find_split_point.  */
+
+typedef struct
+{
+  /* Basic block we are examining.  */
+  basic_block bb;
+
+  /* SSA names set and used by the BB and all BBs reachable
+     from it via DFS walk.  */
+  bitmap set_ssa_names, used_ssa_names;
+  bitmap non_ssa_vars;
+
+  /* All BBS visited from this BB via DFS walk.  */
+  bitmap bbs_visited;
+
+  /* Last examined edge in DFS walk.  Since we walk unoriented graph,
+     the value is up to sum of incomming and outgoing edges of BB.  */
+  unsigned int edge_num;
+
+  /* Stack entry index of earliest BB reachable from current BB
+     or any BB visited later in DFS valk.  */
+  int earliest;
+
+  /* Overall time and size of all BBs reached from this BB in DFS walk.  */
+  int overall_time, overall_size;
+
+  /* When false we can not split on this BB.  */
+  bool can_split;
+} stack_entry;
+DEF_VEC_O(stack_entry);
+DEF_VEC_ALLOC_O(stack_entry,heap);
+
+
+/* Find all articulations and call consider_split on them.
+   OVERALL_TIME and OVERALL_SIZE is time and size of the function.
+
+   We perform basic algorithm for finding an articulation in a graph
+   created from CFG by considering it to be an unoriented graph.
+
+   The articulation is discovered via DFS walk. We collect earliest
+   basic block on stack that is reachable via backward edge.  Articulation
+   is any basic block such that there is no backward edge bypassing it.
+   To reduce stack usage we maintain heap allocated stack in STACK vector.
+   AUX pointer of BB is set to index it appears in the stack or -1 once
+   it is visited and popped off the stack.
+
+   The algorithm finds articulation after visiting the whole component
+   reachable by it.  This makes it convenient to collect information about
+   the component used by consider_split.  */
+
+static void
+find_split_points (int overall_time, int overall_size)
+{
+  stack_entry first;
+  VEC(stack_entry, heap) *stack = NULL;
+  basic_block bb;
+  basic_block return_bb = find_return_bb ();
+  struct split_point current;
+
+  current.header_time = overall_time;
+  current.header_size = overall_size;
+  current.split_time = 0;
+  current.split_size = 0;
+  current.ssa_names_to_pass = BITMAP_ALLOC (NULL);
+
+  first.bb = ENTRY_BLOCK_PTR;
+  first.edge_num = 0;
+  first.overall_time = 0;
+  first.overall_size = 0;
+  first.earliest = INT_MAX;
+  first.set_ssa_names = 0;
+  first.used_ssa_names = 0;
+  first.bbs_visited = 0;
+  VEC_safe_push (stack_entry, heap, stack, &first);
+  ENTRY_BLOCK_PTR->aux = (void *)(ptrdiff_t)-1;
+
+  while (!VEC_empty (stack_entry, stack))
+    {
+      stack_entry *entry = VEC_last (stack_entry, stack);
+
+      /* We are wlaking acyclic graph, so edge_num counts
+	 succ and pred edges together.  However when considering
+         articulation, we want to have processed everything reachable
+	 from articulation but nothing that reaches into it.  */
+      if (entry->edge_num == EDGE_COUNT (entry->bb->succs)
+	  && entry->bb != ENTRY_BLOCK_PTR)
+	{
+	  int pos = VEC_length (stack_entry, stack);
+	  entry->can_split &= visit_bb (entry->bb, return_bb,
+					entry->set_ssa_names,
+					entry->used_ssa_names,
+					entry->non_ssa_vars);
+	  if (pos <= entry->earliest && !entry->can_split && dump_file)
+	    fprintf (dump_file,
+		     "found articulation at bb %i but can not split\n",
+		     entry->bb->index);
+	  if (pos <= entry->earliest && entry->can_split)
+	     {
+	       if (dump_file)
+		 fprintf (dump_file, "found articulation at bb %i\n",
+			  entry->bb->index);
+	       current.entry_bb = entry->bb;
+	       current.ssa_names_to_pass = BITMAP_ALLOC (NULL);
+	       bitmap_and_compl (current.ssa_names_to_pass,
+				 entry->used_ssa_names, entry->set_ssa_names);
+	       current.header_time = overall_time - entry->overall_time;
+	       current.header_size = overall_size - entry->overall_size;
+	       current.split_time = entry->overall_time;
+	       current.split_size = entry->overall_size;
+	       current.split_bbs = entry->bbs_visited;
+	       consider_split (&current, entry->non_ssa_vars, return_bb);
+	       BITMAP_FREE (current.ssa_names_to_pass);
+	     }
+	}
+      /* Do actual DFS walk.  */
+      if (entry->edge_num
+	  < (EDGE_COUNT (entry->bb->succs)
+	     + EDGE_COUNT (entry->bb->preds)))
+	{
+          edge e;
+	  basic_block dest;
+	  if (entry->edge_num < EDGE_COUNT (entry->bb->succs))
+	    {
+	      e = EDGE_SUCC (entry->bb, entry->edge_num);
+	      dest = e->dest;
+	    }
+	  else
+	    {
+	      e = EDGE_PRED (entry->bb, entry->edge_num - EDGE_COUNT (entry->bb->succs));
+	      dest = e->src;
+	    }
+
+	  entry->edge_num++;
+
+	  /* New BB to visit, push it to the stack.  */
+	  if (dest != return_bb && dest != EXIT_BLOCK_PTR
+	      && !dest->aux)
+	    {
+	      stack_entry new_entry;
+
+	      new_entry.bb = dest;
+	      new_entry.edge_num = 0;
+	      new_entry.overall_time
+		 = VEC_index (bb_info, bb_info_vec, dest->index)->time;
+	      new_entry.overall_size
+		 = VEC_index (bb_info, bb_info_vec, dest->index)->size;
+	      new_entry.earliest = INT_MAX;
+	      new_entry.set_ssa_names = BITMAP_ALLOC (NULL);
+	      new_entry.used_ssa_names = BITMAP_ALLOC (NULL);
+	      new_entry.bbs_visited = BITMAP_ALLOC (NULL);
+	      new_entry.non_ssa_vars = BITMAP_ALLOC (NULL);
+	      new_entry.can_split = true;
+	      bitmap_set_bit (new_entry.bbs_visited, dest->index);
+	      VEC_safe_push (stack_entry, heap, stack, &new_entry);
+	      dest->aux = (void *)(ptrdiff_t)VEC_length (stack_entry, stack);
+	    }
+	  /* Back edge found, record the earliest point.  */
+	  else if ((ptrdiff_t)dest->aux > 0
+		   && (ptrdiff_t)dest->aux < entry->earliest)
+	    entry->earliest = (ptrdiff_t)dest->aux;
+	}
+      /* We are done with examing the edges. pop off the value from stack and
+	 merge stuff we cummulate during the walk.  */
+      else if (entry->bb != ENTRY_BLOCK_PTR)
+	{
+	  stack_entry *prev = VEC_index (stack_entry, stack,
+					 VEC_length (stack_entry, stack) - 2);
+
+	  entry->bb->aux = (void *)(ptrdiff_t)-1;
+	  prev->can_split &= entry->can_split;
+	  if (prev->set_ssa_names)
+	    {
+	      bitmap_ior_into (prev->set_ssa_names, entry->set_ssa_names);
+	      bitmap_ior_into (prev->used_ssa_names, entry->used_ssa_names);
+	      bitmap_ior_into (prev->bbs_visited, entry->bbs_visited);
+	      bitmap_ior_into (prev->non_ssa_vars, entry->non_ssa_vars);
+	    }
+	  if (prev->earliest > entry->earliest)
+	    prev->earliest = entry->earliest;
+	  prev->overall_time += entry->overall_time;
+	  prev->overall_size += entry->overall_size;
+	  BITMAP_FREE (entry->set_ssa_names);
+	  BITMAP_FREE (entry->used_ssa_names);
+	  BITMAP_FREE (entry->bbs_visited);
+	  BITMAP_FREE (entry->non_ssa_vars);
+	  VEC_pop (stack_entry, stack);
+	}
+      else
+        VEC_pop (stack_entry, stack);
+    }
+  ENTRY_BLOCK_PTR->aux = NULL;
+  FOR_EACH_BB (bb)
+    bb->aux = NULL;
+  BITMAP_FREE (current.ssa_names_to_pass);
+}
+
+/* Split function at SPLIT_POINT.  */
+
+static void
+split_function (struct split_point *split_point)
+{
+  VEC (tree, heap) *args_to_pass = NULL;
+  bitmap args_to_skip = BITMAP_ALLOC (NULL);
+  tree parm;
+  int num = 0;
+  struct cgraph_node *node;
+  basic_block return_bb = find_return_bb ();
+  basic_block call_bb;
+  gimple_stmt_iterator gsi;
+  gimple call;
+  edge e;
+  edge_iterator ei;
+  tree retval = NULL, real_retval = NULL;
+  bool split_part_return_p = false;
+  gimple last_stmt = NULL;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nSplitting function at:\n");
+      dump_split_point (dump_file, split_point);
+    }
+
+  /* Collect the parameters of new function and args_to_skip bitmap.  */
+  for (parm = DECL_ARGUMENTS (current_function_decl);
+       parm; parm = TREE_CHAIN (parm), num++)
+    if (!is_gimple_reg (parm)
+	|| !gimple_default_def (cfun, parm)
+	|| !bitmap_bit_p (split_point->ssa_names_to_pass,
+			  SSA_NAME_VERSION (gimple_default_def (cfun, parm))))
+      bitmap_set_bit (args_to_skip, num);
+    else
+      VEC_safe_push (tree, heap, args_to_pass, gimple_default_def (cfun, parm));
+
+  /* See if the split functionwill return.  */
+  FOR_EACH_EDGE (e, ei, return_bb->preds)
+    if (bitmap_bit_p (split_point->split_bbs, e->src->index))
+      break;
+  if (e)
+    split_part_return_p = true;
+
+  /* If we return, we will need the return block.  */
+  if (return_bb != EXIT_BLOCK_PTR && split_part_return_p)
+    bitmap_set_bit (split_point->split_bbs, return_bb->index);
+
+  /* Now create the actual clone.  */
+  rebuild_cgraph_edges ();
+  node = cgraph_function_versioning (cgraph_node (current_function_decl),
+				     NULL, NULL,
+				     args_to_skip,
+				     split_point->split_bbs,
+				     split_point->entry_bb, "_part");
+  cgraph_node_remove_callees (cgraph_node (current_function_decl));
+  if (!split_part_return_p)
+    TREE_THIS_VOLATILE (node->decl) = 1;
+  if (dump_file)
+    dump_function_to_file (node->decl, dump_file, dump_flags);
+
+  /* Create the basic block we place call into.  It is the entry basic block
+     split after last label.  */
+  call_bb = split_point->entry_bb;
+  for (gsi = gsi_start_bb (call_bb); !gsi_end_p (gsi);)
+    if (gimple_code (gsi_stmt (gsi)) == GIMPLE_LABEL)
+      {
+	last_stmt = gsi_stmt (gsi);
+	gsi_next (&gsi);
+      }
+    else
+      break;
+  e = split_block (split_point->entry_bb, last_stmt);
+  remove_edge (e);
+
+  /* Produce the call statement.  */
+  gsi = gsi_last_bb (call_bb);
+  call = gimple_build_call_vec (node->decl, args_to_pass);
+  gimple_set_block (call, DECL_INITIAL (current_function_decl));
+
+  /* Update return value.  This is bit tricky.  When we do not return,
+     do nothing.  When we return we might need to update return_bb
+     or produce new return statement.  */
+  if (!split_part_return_p)
+    gsi_insert_after (&gsi, call, GSI_NEW_STMT);
+  else
+    {
+      e = make_edge (call_bb, return_bb,
+		     return_bb == EXIT_BLOCK_PTR ? 0 : EDGE_FALLTHRU);
+      e->count = call_bb->count;
+      e->probability = REG_BR_PROB_BASE;
+      if (return_bb != EXIT_BLOCK_PTR)
+	{
+	  gimple return_stmt = gsi_stmt (gsi_last_bb (return_bb));
+	  gcc_assert (gimple_code (return_stmt) == GIMPLE_RETURN);
+
+	  if ((real_retval = retval = gimple_return_retval (return_stmt))
+	      && !is_gimple_min_invariant (retval)
+	      && (TREE_CODE (retval) != SSA_NAME
+		  || !SSA_NAME_IS_DEFAULT_DEF (retval)))
+	    {
+	      gimple_stmt_iterator psi;
+
+	      /* See if there is PHI definind return value.  */
+	      for (psi = gsi_start_phis (return_bb);
+		   !gsi_end_p (psi); gsi_next (&psi))
+		if (is_gimple_reg (gimple_phi_result (gsi_stmt (psi))))
+		  break;
+
+	      /* When we have PHI, update PHI.  When there is no PHI,
+		 update the return statement itself.  */
+	      if (TREE_CODE (retval) == SSA_NAME)
+		{
+		  retval = make_ssa_name (SSA_NAME_VAR (retval), call);
+		  if (TREE_CODE (retval) == SSA_NAME
+		      && !gsi_end_p (psi))
+		    add_phi_arg (gsi_stmt (psi), retval, e, UNKNOWN_LOCATION);
+		  else if (TREE_CODE (retval) == SSA_NAME)
+		    {
+		      gimple_return_set_retval (return_stmt, retval);
+		      update_stmt (return_stmt);
+		    }
+		}
+	      gimple_call_set_lhs (call, retval);
+	    }
+          gsi_insert_after (&gsi, call, GSI_NEW_STMT);
+	}
+      else
+	{
+	  gimple ret;
+	  if (!VOID_TYPE_P (TREE_TYPE (TREE_TYPE (current_function_decl))))
+	    {
+	      retval
+	        = create_tmp_var (TREE_TYPE (TREE_TYPE (current_function_decl)),
+				  "RET");
+	      if (is_gimple_reg (retval))
+		retval = make_ssa_name (retval, call);
+	      gimple_call_set_lhs (call, retval);
+	    }
+          gsi_insert_after (&gsi, call, GSI_NEW_STMT);
+	  ret = gimple_build_return (retval);
+	  gsi_insert_after (&gsi, ret, GSI_NEW_STMT);
+	}
+    }
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+  compute_inline_parameters (node);
+}
+
+/* Execute function splitting pass.  */
+
+static unsigned int
+execute_split_functions (void)
+{
+  gimple_stmt_iterator bsi;
+  basic_block bb;
+  int overall_time = 0, overall_size = 0;
+  int todo = 0;
+  edge e;
+  edge_iterator ei;
+  int nreturns = 0;
+  struct cgraph_node *node = cgraph_node (current_function_decl);
+
+  if (flags_from_decl_or_type (current_function_decl) & ECF_NORETURN)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: noreturn function.\n");
+      return 0;
+    }
+  if (MAIN_NAME_P (DECL_NAME (current_function_decl)))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: main function.\n");
+      return 0;
+    }
+  /* This can be relaxed; function might become inlinable after splitting
+     away the uninlinable part.  */
+  if (!node->local.inlinable)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: not inlinable.\n");
+      return 0;
+    }
+  if (node->local.disregard_inline_limits)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: disregading inline limits.\n");
+      return 0;
+    }
+  /* This can be relaxed; most of versioning tests actually prevents
+     a duplication.  */
+  if (!tree_versionable_function_p (current_function_decl))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: not versionable.\n");
+      return 0;
+    }
+  /* FIXME: we could support this.  */
+  if (DECL_STRUCT_FUNCTION (current_function_decl)->static_chain_decl)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: nested function.\n");
+      return 0;
+    }
+  /* FIXME: Should be easy to support.  */
+  if (DECL_BY_REFERENCE (DECL_RESULT (current_function_decl)))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: returns value by reference.\n");
+      return 0;
+    }
+  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
+    if (gimple_code (gsi_stmt (gsi_last_bb (e->src))) == GIMPLE_RETURN)
+      nreturns++;
+  /* FIXME: This happens only when we have one return with a value and other
+     return without.  We can handle this easily by making find_return_bb to find
+     version with a return value.  */
+  if (nreturns > 1)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: multiple return statements.\n");
+      return 0;
+    }
+
+  /* See if it makes sense to try to split.
+     It makes sense to split if we inline, that is if we have direct calls to
+     handle or direct calls are possibly going to appear as result of indirect
+     inlining or LTO.
+     Note that we are not completely conservative about disqualifying functions
+     called once.  It is possible that the caller is called more then once and
+     then inlining would still benefit.  */
+  if ((!node->callers || !node->callers->next_caller)
+      && !node->address_taken
+      && ((!flag_lto && !flag_whopr) || !node->local.externally_visible))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: not called directly or called once.\n");
+      return 0;
+    }
+
+  /* FIXME: We can actually split if splitting reduces call overhead.  */
+  if (!flag_inline_small_functions
+      && !DECL_DECLARED_INLINE_P (current_function_decl))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not splitting: not autoinlining and function is not inline.\n");
+      return 0;
+    }
+
+  /* Compute local info about basic blocks and determine function size/time.  */
+  VEC_safe_grow_cleared (bb_info, heap, bb_info_vec, last_basic_block + 1);
+  memset (&best_split_point, 0, sizeof (best_split_point));
+  FOR_EACH_BB (bb)
+    {
+      int time = 0;
+      int size = 0;
+      int freq = compute_call_stmt_bb_frequency (current_function_decl, bb);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "Basic block %i\n", bb->index);
+
+      for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+	{
+	  int this_time, this_size;
+	  gimple stmt = gsi_stmt (bsi);
+
+	  this_size = estimate_num_insns (stmt, &eni_size_weights);
+	  this_time = estimate_num_insns (stmt, &eni_time_weights) * freq;
+	  size += this_size;
+	  time += this_time;
+
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "  freq:%6i size:%3i time:%3i ",
+		       freq, this_size, this_time);
+	      print_gimple_stmt (dump_file, stmt, 0, 0);
+	    }
+	}
+      overall_time += time;
+      overall_size += size;
+      VEC_index (bb_info, bb_info_vec, bb->index)->time = time;
+      VEC_index (bb_info, bb_info_vec, bb->index)->size = size;
+    }
+  find_split_points (overall_time, overall_size);
+  if (best_split_point.split_bbs)
+    {
+      split_function (&best_split_point);
+      BITMAP_FREE (best_split_point.ssa_names_to_pass);
+      BITMAP_FREE (best_split_point.split_bbs);
+      todo = TODO_update_ssa | TODO_cleanup_cfg;
+    }
+  VEC_free (bb_info, heap, bb_info_vec);
+  bb_info_vec = NULL;
+  return todo;
+}
+
+static bool
+gate_split_functions (void)
+{
+  return flag_partial_inlining;
+}
+
+struct gimple_opt_pass pass_split_functions =
+{
+ {
+  GIMPLE_PASS,
+  "fnsplit",				/* name */
+  gate_split_functions,			/* gate */
+  execute_split_functions,		/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_IPA_FNSPLIT,			/* tv_id */
+  PROP_cfg,				/* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_dump_func			/* todo_flags_finish */
+ }
+};
Index: common.opt
===================================================================
--- common.opt	(revision 160109)
+++ common.opt	(working copy)
@@ -876,6 +876,10 @@  foptimize-sibling-calls
 Common Report Var(flag_optimize_sibling_calls) Optimization
 Optimize sibling and tail recursive calls
 
+fpartial-inlining
+Common Report Var(flag_partial_inlining)
+Perform partial inlining
+
 fpre-ipa-mem-report
 Common Report Var(pre_ipa_mem_report)
 Report on memory allocation before interprocedural optimization
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 160111)
+++ Makefile.in	(working copy)
@@ -1429,6 +1429,7 @@  OBJS-archive = \
 	cppdefault.o \
 	incpath.o \
 	ipa-cp.o \
+        ipa-split.o \
 	ipa-inline.o \
 	ipa-prop.o \
 	ipa-pure-const.o \
@@ -2966,6 +2967,10 @@  ipa-cp.o : ipa-cp.c $(CONFIG_H) $(SYSTEM
    $(TREE_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(TREE_FLOW_H) \
    $(TREE_PASS_H) $(FLAGS_H) $(TIMEVAR_H) $(DIAGNOSTIC_H) $(TREE_DUMP_H) \
    $(TREE_INLINE_H) $(FIBHEAP_H) $(PARAMS_H) tree-pretty-print.h
+ipa-split.o : ipa-split.c $(CONFIG_H) $(SYSTEM_H) coretypes.h  \
+   $(TREE_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(TREE_FLOW_H) \
+   $(TREE_PASS_H) $(FLAGS_H) $(TIMEVAR_H) $(DIAGNOSTIC_H) $(TREE_DUMP_H) \
+   $(TREE_INLINE_H) $(FIBHEAP_H) $(PARAMS_H)
 matrix-reorg.o : matrix-reorg.c $(CONFIG_H) $(SYSTEM_H) coretypes.h  \
    $(TM_H) $(TREE_H) $(RTL_H) $(TREE_INLINE_H) $(TREE_FLOW_H) \
    tree-flow-inline.h langhooks.h $(HASHTAB_H) $(TOPLEV_H) $(FLAGS_H) $(GGC_H) \
Index: passes.c
===================================================================
--- passes.c	(revision 160109)
+++ passes.c	(working copy)
@@ -795,6 +795,7 @@  init_optimization_passes (void)
           NEXT_PASS (pass_cleanup_eh);
           NEXT_PASS (pass_profile);
           NEXT_PASS (pass_local_pure_const);
+          NEXT_PASS (pass_split_functions);
 	}
       NEXT_PASS (pass_release_ssa_names);
       NEXT_PASS (pass_rebuild_cgraph_edges);
Index: params.def
===================================================================
--- params.def	(revision 160109)
+++ params.def	(working copy)
@@ -78,11 +78,11 @@  DEFPARAM (PARAM_MAX_INLINE_INSNS_SINGLE,
    that is applied to functions marked inlined (or defined in the
    class declaration in C++) given by the "max-inline-insns-single"
    parameter.
-   The default value is 90.  */
+   The default value is 40.  */
 DEFPARAM (PARAM_MAX_INLINE_INSNS_AUTO,
 	  "max-inline-insns-auto",
 	  "The maximum number of instructions when automatically inlining",
-	  50, 0, 0)
+	  40, 0, 0)
 
 DEFPARAM (PARAM_MAX_INLINE_INSNS_RECURSIVE,
 	  "max-inline-insns-recursive",
@@ -117,6 +117,12 @@  DEFPARAM (PARAM_EARLY_INLINER_MAX_ITERAT
 	  "The maximum number of nested indirect inlining performed by early inliner",
 	  10, 0, 0)
 
+/* Limit on probability of entry BB.  */
+DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY,
+	  "partial-inlining-entry-probability",
+	  "Maximum probability of the entry BB of split region (in percent relative to entry BB of the function) to make partial inlining happen",
+	  70, 0, 0)
+
 /* Limit the number of expansions created by the variable expansion
    optimization to avoid register pressure.  */
 DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,