[RFC] Add new ipa-reorder pass

Message ID 12f761ce-012d-f9be-ceef-6ad8de6324e8@suse.cz

Commit Message

Martin Liška Sept. 19, 2019, 8:33 a.m. UTC
Hi.

Function reordering has been around for quite some time and a naive
implementation was also part of my diploma thesis some time ago.
Currently, GCC can reorder functions based on first execution, which
happens with PGO and LTO of course. A known limitation is that the order
is preserved only partially, as various symbols go into different LTRANS partitions.

There has been some research in the area and I would point out the Facebook paper
([1]) and the Sony presentation ([2]). Based on that, I decided to make a new implementation
in GCC that does the same (in a proper way). The first part of the enablement is a pair of patches
to ld.bfd and ld.gold that introduce a new section .text.sorted, which is always sorted.
Here's a snippet from the modified default linker script:

  .text           :
  {
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(SORT(.text.sorted.*))
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf.em.  */
    *(.gnu.warning)
  }
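
For illustration only, this is roughly what the naming scheme looks like from
the source side (the attribute-based placement and the concrete numbers below
are made up for the example; the pass emits the numbered subsections itself):

  /* Hypothetical example: with zero-padded numeric suffixes, the
     SORT(.text.sorted.*) pattern above lays the functions out in
     exactly this numeric order.  */
  __attribute__((section(".text.sorted.0000000001")))
  void hot_callee (void) { /* ... */ }

  __attribute__((section(".text.sorted.0000000002")))
  void hot_caller (void) { hot_callee (); }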

The second part is the reordering pass, where I implemented both approaches:
the existing one (-freorder-functions-algorithm=first-run) and a new one
based on the Facebook paper (-freorder-functions-algorithm=call-chain-clustering).
The implementation is straightforward and utilizes the newly introduced text subsections.
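
To give an idea of what call-chain clustering does, below is a minimal,
self-contained sketch of the core idea from the Facebook paper ([1]); the
names, the fixed-size cluster type and the 4K page budget are illustrative
assumptions, not the data structures of the actual pass:

  #include <stdbool.h>
  #include <stdlib.h>

  #define PAGE_SIZE 4096
  #define MAX_FUNCS_PER_CLUSTER 64

  struct cluster
  {
    int funcs[MAX_FUNCS_PER_CLUSTER]; /* function ids in layout order */
    int n_funcs;
    long size;     /* estimated code size in bytes */
    long samples;  /* accumulated profile weight */
  };

  /* Append SRC after DST only when the merged cluster still fits into a
     single page, so that a hot caller/callee pair ends up sharing a page.  */
  static bool
  merge_if_fits (struct cluster *dst, struct cluster *src)
  {
    if (dst->size + src->size > PAGE_SIZE
        || dst->n_funcs + src->n_funcs > MAX_FUNCS_PER_CLUSTER)
      return false;
    for (int i = 0; i < src->n_funcs; i++)
      dst->funcs[dst->n_funcs++] = src->funcs[i];
    dst->size += src->size;
    dst->samples += src->samples;
    src->n_funcs = 0;
    src->size = 0;
    src->samples = 0;
    return true;
  }

  /* Final ordering: clusters sorted by density (samples per byte),
     hottest first; the numbered .text.sorted.* subsections then follow
     this order.  */
  static int
  cluster_density_cmp (const void *pa, const void *pb)
  {
    const struct cluster *a = (const struct cluster *) pa;
    const struct cluster *b = (const struct cluster *) pb;
    double da = a->size ? (double) a->samples / a->size : 0.0;
    double db = b->size ? (double) b->samples / b->size : 0.0;
    return (da < db) - (da > db);
  }

The driver (not shown) starts with one cluster per function, walks the call
edges from hottest to coldest, calls merge_if_fits on the clusters of caller
and callee, and finally qsorts the surviving clusters with cluster_density_cmp.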

Results:
I made a PGO+LTO bootstrap of GCC with the patch and -freorder-functions-algorithm=call-chain-clustering.
The pass computes a sorting order for about 7500 functions, and of these about 7300 are really placed in the desired
order. The remaining ones are functions that are later observed to be cold and eventually end up in the .text.unlikely
section. I verified the order with nm (sorted by address) and my -fdump-ipa-reorder dump. The aforementioned
sorted functions occupy 5.0 MB (out of 18.3 MB) of the final cc1plus binary. I made a simple benchmark:

./cc1plus /home/marxin/Programming/tramp3d/tramp3d-v4.ii -O2 -quiet

and ran it 20x before and after my patch:
run #020: old:15.033 s, new: 14.975 s, cmp: 99.61%

The numbers are stable and show a speed-up of about 0.4%. To verify that, I also measured iTLB misses
before:

18,475,504      iTLB-load-misses:u
and after:
13,779,347      iTLB-load-misses:u

Both of these numbers are very stable across my measurements. That said, I believe the patch is correct
and can help with huge binaries that have a flat profile.

Notes and limitations:
- The call-chain-clustering algorithm tries to fit as many functions as possible into a page (4K).
  Based on my measurements, that should correspond to ~1000 GIMPLE statements (IPA inliner size). I can
  make it a param in the future.
- One needs a modified binutils, and that would probably require a configure detection. The only way
  I see is based on ld --version. I'm planning to make the binutils submission soon.
- Right now, I prefer to put functions into .text.sorted.* rather than .text.hot. That's required by the
  algorithm in order to put hot function calls next to each other.
- Observation: very hot calls are likely already inlined, so the clustering ends up working with quite big functions.

Thoughts? I would definitely welcome any interesting measurements on a bigger workload.

Martin

[1] https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
[2] https://llvm.org/devmtg/2018-10/slides/Spencer-Profile%20Guided%20Function%20Layout%20in%20LLVM%20and%20LLD.pdf

Comments

Martin Liška Sept. 24, 2019, 12:05 p.m. UTC | #1
On 9/19/19 10:33 AM, Martin Liška wrote:
> - One needs modified binutils and I that would probably require a configure detection. The only way
>   which I see is based on ld --version. I'm planning to make the binutils submission soon.

The patch submission link:
https://sourceware.org/ml/binutils/2019-09/msg00219.html
Evgeny Kudryashov Sept. 25, 2019, 4:36 p.m. UTC | #2
On 2019-09-19 11:33, Martin Liška wrote:
> Hi.
> 
> Function reordering has been around for quite some time and a naive
> implementation was also part of my diploma thesis some time ago.
> Currently, the GCC can reorder function based on first execution, which
> happens with PGO and LTO of course. Known limitation is that the order
> is preserved only partially as various symbols go into different
> LTRANS partitions.
> 
> There has been some research in the area and I would point out the
> Facebook paper
> ([1]) and Sony presentation ([2]). Based on that, I decided to make a
> new implementation
> in the GCC that does the same (in a proper way). First part of the
> enablement are patches
> to ld.bfd and ld.gold that come up with a new section .text.sorted,
> that is always sorted.
> 
> Thoughts? I would definitely welcome any interesting measurement on a
> bigger load.
> 
> Martin
> 

Hi, Martin!

Some time ago I tried to do the same but didn't go that far.

I also used the C3 algorithm, except that I somehow missed the fact that
ipa_fn_summary contains information about size and time. The linker
option --sort-section=name was used for prototyping. It, obviously,
sorts sections and allows placing functions in the desired order (by
placing them into named sections .text.sorted.NNNN, without patching
linkers or adjusting linker scripts). For testing my implementation I
used several benchmarks from SPEC2006: perlbench, sjeng and gobmk.
Unfortunately, no significant positive changes were obtained.
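
For concreteness, that prototyping approach can be sketched like this (the
section names and the link command below are assumptions for the example,
not the actual prototype, which derived the numbers from profile data):

  /* Functions placed into numbered .text.sorted.NNNN sections; linking with
     a stock toolchain as, e.g.,
       gcc example.c -Wl,--sort-section=name
     asks ld to sort wildcard-matched input sections by name, so the
     zero-padded suffixes determine the final layout without patching the
     linker or the linker script.  */
  __attribute__((section(".text.sorted.0001"))) void first_in_layout (void)  {}
  __attribute__((section(".text.sorted.0002"))) void second_in_layout (void) {}

  int main (void) { first_in_layout (); second_in_layout (); return 0; }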

I've tested the proposed pass on perlbench with train and reference
input (PGO+LTO as a base) but couldn't obtain stable results (it is still
unclear whether that is due to the environment or to perlbench-specific reasons).

Evgeny.
Martin Liška Sept. 26, 2019, 9:37 a.m. UTC | #3
On 9/25/19 6:36 PM, Evgeny Kudryashov wrote:
> On 2019-09-19 11:33, Martin Liška wrote:
>> Hi.
>>
>> Function reordering has been around for quite some time and a naive
>> implementation was also part of my diploma thesis some time ago.
>> Currently, the GCC can reorder function based on first execution, which
>> happens with PGO and LTO of course. Known limitation is that the order
>> is preserved only partially as various symbols go into different
>> LTRANS partitions.
>>
>> There has been some research in the area and I would point out the
>> Facebook paper
>> ([1]) and Sony presentation ([2]). Based on that, I decided to make a
>> new implementation
>> in the GCC that does the same (in a proper way). First part of the
>> enablement are patches
>> to ld.bfd and ld.gold that come up with a new section .text.sorted,
>> that is always sorted.
>>
>> Thoughts? I would definitely welcome any interesting measurement on a
>> bigger load.
>>
>> Martin
>>
> 
> Hi, Martin!
> 
> Some time ago I tried to do the same but didn't go that far.

Hello.

> 
> I also used the C3 algorithm, except for the fact that ipa_fn_summary contains information about size and time (somehow missed it).

That is a key part of the algorithm, as ideally one does not want to cross a page-size boundary.


> The linker option --sort-section=name was used for prototyping. It, obviously, sorts sections and allows to place functions in the desired order (by placing them into named sections .text.sorted.NNNN, without patching linkers or adjusting linker scripts).

Yes, that can work, but having the new section explicitly sorted does not require passing an option to the linker.

> For testing my implementation I used several benchmarks from SPEC2006: perlbench, sjeng and gobmk. Unfortunately, no significant positive changes were obtained.
> 
> I've tested the proposed pass on perlbench with train and reference input (PGO+LTO as a base) but couldn't obtain stable results (still unclear environment or perlbench specific reasons).

Yes, it's quite difficult to measure anything. One needs a binary with a huge .text section and a test case that has a flat profile.
I was able to measure results on GCC itself.

Martin

> 
> Evgeny.
>
Fangrui Song Oct. 3, 2019, 4:52 a.m. UTC | #4
On 2019-09-24, Martin Liška wrote:
>On 9/19/19 10:33 AM, Martin Liška wrote:
>> - One needs modified binutils and I that would probably require a configure detection. The only way
>>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
>
>The patch submission link:
>https://sourceware.org/ml/binutils/2019-09/msg00219.html

Hi Martin,

I have a question about why .text.sorted.* are needed.

The Sony presentation (your [2] link) embedded a new section
.llvm.call-graph-profile[3] to represent edges in the object files. The
linker (lld) collects all .llvm.call-graph-profile sections and does a
C3 layout. There is no need for a new section type .text.sorted.*.

[3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s

(Please CC me. I am not subscribed.)
Andrew Pinski Oct. 3, 2019, 7:21 a.m. UTC | #5
On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
>
> On 2019-09-24, Martin Liška wrote:
> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> - One needs modified binutils and I that would probably require a configure detection. The only way
> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
> >
> >The patch submission link:
> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
>
> Hi Martin,
>
> I have a question about why .text.sorted.* are needed.
>
> The Sony presentation (your [2] link) embedded a new section
> .llvm.call-graph-profile[3] to represent edges in the object files. The
> linker (lld) collects all .llvm.call-graph-profile sections and does a
> C3 layout. There is no need for new section type .text.sorted.*
>
> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
>
> (Please CC me. I am not subscribed.)

The idea of GCC's version is that the modification needed to the
linker is very small.  And even without patching the linker
script/linker, you don't need much special handling and you don't need LD to
do much work at all.

Thanks,
Andrew
Fangrui Song Oct. 3, 2019, 7:45 a.m. UTC | #6
On 2019-10-03, Andrew Pinski wrote:
>On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
>>
>> On 2019-09-24, Martin Liška wrote:
>> >On 9/19/19 10:33 AM, Martin Liška wrote:
>> >> - One needs modified binutils and I that would probably require a configure detection. The only way
>> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
>> >
>> >The patch submission link:
>> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
>>
>> Hi Martin,
>>
>> I have a question about why .text.sorted.* are needed.
>>
>> The Sony presentation (your [2] link) embedded a new section
>> .llvm.call-graph-profile[3] to represent edges in the object files. The
>> linker (lld) collects all .llvm.call-graph-profile sections and does a
>> C3 layout. There is no need for new section type .text.sorted.*
>>
>> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
>>
>> (Please CC me. I am not subscribed.)
>
>The idea of GCC's version is to that the modification needed to the
>linker is very little.  And even without patching the linker
>script/linker, you don't need to much special and you don't need LD to
>do much work at all.

I am afraid this can be a significant limitation. Then .text.sorted.* can
only a) reorder functions within a translation unit, or b) reorder all
functions when LTO is enabled.

b) is possible only if all .text.sorted.* sections can be numbered.

For the LLVM case, call graph profile can be used without LTO. When both
ThinLTO+PGO are enabled, however, the additional performance improvement
offered by the global call graph profile reordering is insignificant,
smaller than 1%.
Andrew Pinski Oct. 3, 2019, 8:08 a.m. UTC | #7
On Thu, Oct 3, 2019 at 12:46 AM Fangrui Song <i@maskray.me> wrote:
>
>
> On 2019-10-03, Andrew Pinski wrote:
> >On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
> >>
> >> On 2019-09-24, Martin Liška wrote:
> >> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> >> - One needs modified binutils and I that would probably require a configure detection. The only way
> >> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
> >> >
> >> >The patch submission link:
> >> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
> >>
> >> Hi Martin,
> >>
> >> I have a question about why .text.sorted.* are needed.
> >>
> >> The Sony presentation (your [2] link) embedded a new section
> >> .llvm.call-graph-profile[3] to represent edges in the object files. The
> >> linker (lld) collects all .llvm.call-graph-profile sections and does a
> >> C3 layout. There is no need for new section type .text.sorted.*
> >>
> >> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
> >>
> >> (Please CC me. I am not subscribed.)
> >
> >The idea of GCC's version is to that the modification needed to the
> >linker is very little.  And even without patching the linker
> >script/linker, you don't need to much special and you don't need LD to
> >do much work at all.
>
> I am afraid this can be a large limitation. Then .text.sorted.* can
> only a) reorder functions within a translation unit, b) or reorder all
> functions when LTO is enabled.

I don't think it is that limited.  In fact I think the other way
around is more limited, as you are then constrained by the call
graph data format.  This approach really reduces the complexity of the linker.
Linkers are already slow; why make them even slower?  With this patch
you only need to change the linker once, and you don't have
dependencies between the linker and the compiler.

Thanks,
Andrew

>
> b) is possible only if all .text.sorted.* sections can be numbered.
>
> For the LLVM case, call graph profile can be used without LTO. When both
> ThinLTO+PGO are enabled, however, the additional performance improvement
> offered by the global call graph profile reordering is insignificant,
> smaller than 1%.

See, you said it was less than 1%, so why do it then?

Thanks,
Andrew Pinski
Jeff Law Oct. 4, 2019, 10:06 p.m. UTC | #8
On 9/19/19 2:33 AM, Martin Liška wrote:
> Hi.
> 
> Function reordering has been around for quite some time and a naive
> implementation was also part of my diploma thesis some time ago.
> Currently, the GCC can reorder function based on first execution, which
> happens with PGO and LTO of course. Known limitation is that the order
> is preserved only partially as various symbols go into different LTRANS partitions.
> 
> There has been some research in the area and I would point out the Facebook paper
> ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> in the GCC that does the same (in a proper way). First part of the enablement are patches
> to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> There's a snippet from the modified default linker script:
Funny, I was doing this via linker scripts circa '95; in fact that's why
we have -ffunction-sections :-)   We started with scripts which
post-processed profiling data to create linker scripts for ELF systems.
We had something for HPUX/SOM as well, but I can't remember what
mechanism we used; it may have been a gross-level sorting using the SOM
section sort key mechanism - it only had 128 or 256 keys, with a notable
amount of them reserved.

We had also built a linker with a basic level of interposition circa
1993 and explored various approaches to reordering executables.  I'd
just joined the group at the time and was responsible for wiring up
stuff on the OS side, but eventually got the "pleasure" of owning the
linker server.  A lot of the C3 algorithmic stuff looks similar to what
we did.

Anyway...

I don't see anything objectionable in here.  It's noted as an RFC.  Are
you interested in pushing this forward for gcc-10?

jeff
Jan Hubicka Oct. 6, 2019, 2:37 p.m. UTC | #9
> On 9/19/19 2:33 AM, Martin Liška wrote:
> > Hi.
> > 
> > Function reordering has been around for quite some time and a naive
> > implementation was also part of my diploma thesis some time ago.
> > Currently, the GCC can reorder function based on first execution, which
> > happens with PGO and LTO of course. Known limitation is that the order
> > is preserved only partially as various symbols go into different LTRANS partitions.
> > 
> > There has been some research in the area and I would point out the Facebook paper
> > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> > in the GCC that does the same (in a proper way). First part of the enablement are patches
> > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> > There's a snippet from the modified default linker script:
> Funny, I was doing this via linker scripts circa ~95, in fact that's why
> we have -ffunction-sections :-)   We started with scripts which
> post-processed profiling data to create linker scripts for ELF systems.
>  We had something for HPUX/SOM as well, but I can't remember what
> mechanism we used, it may have been a gross level sorting using the SOM
> section sort key mechanism - it only had 128 or 256 keys with a notable
> amount of them reserved.
> 
> We had also built a linker with a basic level of interposition circa
> 1993 and explored various approaches to reordering executables.  I'd
> just joined the group at the time and was responsible for wiring up
> stuff on the OS side, but eventually got the "pleasure" of owning the
> linker server.  A lot of the C3 algorithmic stuff looks similar to what
> we did.

For reference, I am attaching an early LTO version from circa 2010 :)
> 
> Anyway...
> 
> I don't see anything objectionable in here.  It's noted as an RFC.  Are
> you interested in pushing this forward for gcc-10?

I think the plan is to get some form of code layout pass into GCC 10.  I
will test Martin's version on Firefox and see if it has any effect
here. It is generally quite hard to evaluate those changes (and that is the
reason why I did not move forward with the version from 2010 - more
precisely, it kept falling off the shelf for about a decade).

If LLD has support for cgraph annotations and we support LLD, I think we
should add support for that, too - it will be very useful to compare
individual implementations.
I believe there is gold support (off-tree?) for that too, and something
similar is also supported by other toolchains like AIX?

One problem with the sections approach is that every section gets
aligned to the largest code alignment inside the section. Since our
alignments are (or should be) often cache-line based, we get circa 30 bytes of
waste for every function, which is quite a lot.

This is why we currently have a way to order functions when outputting them
and use that with FDO (using Martin's first-execution logic). This has the
drawback of making the functions flow in that order through late
optimizations and the RTL backend, and thus we lose IPA-RA and some
IP propagation (late pure/const/nothrow discovery).
So this approach has a drawback, too. It is why I was trying to push
myself to get GCC to use gas fragments :)

Anyway, all these issues can be solved incrementally - let's see how
Martin's patch works on Firefox and whether we can get it tested elsewhere, and
start from that.

I will take a look into the details.

Honza
> 
> jeff
> 

Index: tree-pass.h
===================================================================
*** tree-pass.h	(revision 164689)
--- tree-pass.h	(working copy)
*************** extern struct ipa_opt_pass_d pass_ipa_lt
*** 467,472 ****
--- 467,473 ----
  extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
  extern struct ipa_opt_pass_d pass_ipa_profile;
  extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
+ extern struct ipa_opt_pass_d pass_ipa_func_reorder;
  
  extern struct gimple_opt_pass pass_all_optimizations;
  extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
Index: cgraphunit.c
===================================================================
*** cgraphunit.c	(revision 164689)
--- cgraphunit.c	(working copy)
*************** cgraph_inline_p (struct cgraph_edge *e,
*** 1517,1522 ****
--- 1517,1529 ----
    return !e->inline_failed;
  }
  
+ static int
+ node_cmp (const void *pa, const void *pb)
+ {
+   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
+   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
+   return b->order - a->order;
+ }
  
  
  /* Expand all functions that must be output.
*************** cgraph_expand_all_functions (void)
*** 1546,1551 ****
--- 1553,1561 ----
      if (order[i]->process)
        order[new_order_pos++] = order[i];
  
+   if (flag_ipa_reorder)
+     qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
+ 
    for (i = new_order_pos - 1; i >= 0; i--)
      {
        node = order[i];
Index: tree-ssa-ccp.c
===================================================================
*** tree-ssa-ccp.c	(revision 164689)
--- tree-ssa-ccp.c	(working copy)
*************** ccp_fold_stmt (gimple_stmt_iterator *gsi
*** 2307,2312 ****
--- 2307,2324 ----
  		changed = true;
  	      }
  	  }
+ 	if (TREE_CODE (gimple_call_fn (stmt)) == OBJ_TYPE_REF)
+ 	  {
+ 	    tree expr = OBJ_TYPE_REF_EXPR (gimple_call_fn (stmt));
+ 	    expr = valueize_op (expr);
+ 	    if (TREE_CODE (expr) == ADDR_EXPR
+ 	        && TREE_CODE (TREE_OPERAND (expr, 0)) == FUNCTION_DECL)
+ 	     {
+ 	       gimple_call_set_fndecl (stmt, TREE_OPERAND (expr, 0));
+ 	       debug_gimple_stmt (stmt);
+ 	       changed = true;
+ 	     }
+ 	  }
  
  	return changed;
        }
Index: lto-cgraph.c
===================================================================
*** lto-cgraph.c	(revision 164689)
--- lto-cgraph.c	(working copy)
*************** lto_output_node (struct lto_simple_outpu
*** 466,471 ****
--- 466,473 ----
  
    if (tag == LTO_cgraph_analyzed_node)
      {
+       lto_output_uleb128_stream (ob->main_stream,
+ 				 node->order);
        lto_output_sleb128_stream (ob->main_stream,
  				 node->local.inline_summary.estimated_self_stack_size);
        lto_output_sleb128_stream (ob->main_stream,
*************** input_overwrite_node (struct lto_file_de
*** 927,932 ****
--- 929,935 ----
  		      struct cgraph_node *node,
  		      enum LTO_cgraph_tags tag,
  		      struct bitpack_d *bp,
+ 		      int order,
  		      unsigned int stack_size,
  		      unsigned int self_time,
  		      unsigned int time_inlining_benefit,
*************** input_overwrite_node (struct lto_file_de
*** 935,940 ****
--- 938,944 ----
  		      enum ld_plugin_symbol_resolution resolution)
  {
    node->aux = (void *) tag;
+   node->order = order;
    node->local.inline_summary.estimated_self_stack_size = stack_size;
    node->local.inline_summary.self_time = self_time;
    node->local.inline_summary.time_inlining_benefit = time_inlining_benefit;
*************** input_node (struct lto_file_decl_data *f
*** 1027,1032 ****
--- 1031,1037 ----
    unsigned long same_body_count = 0;
    int clone_ref;
    enum ld_plugin_symbol_resolution resolution;
+   int order = 0;
  
    clone_ref = lto_input_sleb128 (ib);
  
*************** input_node (struct lto_file_decl_data *f
*** 1045,1050 ****
--- 1050,1056 ----
  
    if (tag == LTO_cgraph_analyzed_node)
      {
+       order = lto_input_uleb128 (ib);
        stack_size = lto_input_sleb128 (ib);
        self_size = lto_input_sleb128 (ib);
        size_inlining_benefit = lto_input_sleb128 (ib);
*************** input_node (struct lto_file_decl_data *f
*** 1066,1072 ****
  
    bp = lto_input_bitpack (ib);
    resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
!   input_overwrite_node (file_data, node, tag, &bp, stack_size, self_time,
    			time_inlining_benefit, self_size,
  			size_inlining_benefit, resolution);
  
--- 1072,1078 ----
  
    bp = lto_input_bitpack (ib);
    resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
!   input_overwrite_node (file_data, node, tag, &bp, order, stack_size, self_time,
    			time_inlining_benefit, self_size,
  			size_inlining_benefit, resolution);
  
Index: gimple-fold.c
===================================================================
*** gimple-fold.c	(revision 164689)
--- gimple-fold.c	(working copy)
*************** gimple_fold_obj_type_ref (tree ref, tree
*** 1465,1470 ****
--- 1465,1480 ----
    tree known_binfo = known_type ? TYPE_BINFO (known_type) : NULL_TREE;
    tree binfo;
  
+   if (TREE_CODE (obj) == SSA_NAME)
+     {
+       gimple def_stmt = SSA_NAME_DEF_STMT (obj);
+       if (def_stmt
+ 	  && gimple_code (def_stmt) == GIMPLE_ASSIGN
+ 	  && (gimple_assign_single_p (def_stmt)
+ 	      || (gimple_assign_unary_nop_p (def_stmt))))
+ 	obj = gimple_assign_rhs1 (def_stmt);
+     }
+ 
    if (TREE_CODE (obj) == ADDR_EXPR)
      obj = TREE_OPERAND (obj, 0);
  
*************** fold_gimple_call (gimple_stmt_iterator *
*** 1513,1520 ****
           copying EH region info to the new node.  Easier to just do it
           here where we can just smash the call operand.  */
        callee = gimple_call_fn (stmt);
!       if (TREE_CODE (callee) == OBJ_TYPE_REF
!           && TREE_CODE (OBJ_TYPE_REF_OBJECT (callee)) == ADDR_EXPR)
          {
            tree t;
  
--- 1523,1529 ----
           copying EH region info to the new node.  Easier to just do it
           here where we can just smash the call operand.  */
        callee = gimple_call_fn (stmt);
!       if (TREE_CODE (callee) == OBJ_TYPE_REF)
          {
            tree t;
  
Index: common.opt
===================================================================
*** common.opt	(revision 164689)
--- common.opt	(working copy)
*************** fdse
*** 1107,1112 ****
--- 1107,1116 ----
  Common Var(flag_dse) Init(1) Optimization
  Use the RTL dead store elimination pass
  
+ freorder-functions
+ Common Report Var(flag_ipa_reorder) Init(1) Optimization
+ Reroder functions for better locality at execution time
+ 
  freschedule-modulo-scheduled-loops
  Common Report Var(flag_resched_modulo_sched) Optimization
  Enable/Disable the traditional scheduling in loops that already passed modulo scheduling
Index: ipa-reorder.c
===================================================================
*** ipa-reorder.c	(revision 0)
--- ipa-reorder.c	(revision 0)
***************
*** 0 ****
--- 1,567 ----
+ /* IPA function reordering pass.
+    Copyright (C) 2010 Free Software Foundation, Inc.
+    Contributed by Jan Hubicka  <jh@suse.cz>
+ 
+ This file is part of GCC.
+ 
+ GCC is free software; you can redistribute it and/or modify it under
+ the terms of the GNU General Public License as published by the Free
+ Software Foundation; either version 3, or (at your option) any later
+ version.
+ 
+ GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ for more details.
+ 
+ You should have received a copy of the GNU General Public License
+ along with GCC; see the file COPYING3.  If not see
+ <http://www.gnu.org/licenses/>.  */
+ 
+ #include "config.h"
+ #include "system.h"
+ #include "coretypes.h"
+ #include "tm.h"
+ #include "cgraph.h"
+ #include "tree-pass.h"
+ #include "timevar.h"
+ #include "gimple.h"
+ #include "ggc.h"
+ #include "flags.h"
+ #include "pointer-set.h"
+ #include "target.h"
+ #include "tree-iterator.h"
+ #include "fibheap.h"
+ 
+ struct priority
+ {
+   gcov_type priority;
+   struct partition *src_partition;
+   struct partition *dest_partition;
+   fibnode_t node;
+ };
+ 
+ typedef struct priority *priority_ptr;
+ 
+ DEF_VEC_P(priority_ptr);
+ DEF_VEC_ALLOC_P(priority_ptr,heap);
+ 
+ struct partition
+ {
+   int index;
+   VEC(cgraph_node_ptr, heap) *nodes;
+   VEC(priority_ptr, heap) *priorities;
+   VEC(priority_ptr, heap) *in_priorities;
+ };
+ 
+ typedef struct partition *partition_ptr;
+ 
+ DEF_VEC_P(partition_ptr);
+ DEF_VEC_ALLOC_P(partition_ptr,heap);
+ 
+ static bool
+ cgraph_node_will_be_output_p (struct cgraph_node *node)
+ {
+   return (node->analyzed && !node->global.inlined_to);
+ }
+ 
+ static void
+ account_callees (struct partition *part,
+ 		 VEC (partition_ptr, heap) *partitions,
+ 		 struct cgraph_node *node)
+ {
+   struct cgraph_edge *edge;
+   struct priority *p;
+   int i;
+   struct ipa_ref *ref;
+ 
+   for (edge = node->callees; edge; edge = edge->next_callee)
+     if (edge->inline_failed)
+       {
+ 	if (!edge->callee->aux)
+ 	  {
+ 	    struct partition *dest = VEC_index (partition_ptr, partitions,
+ 						edge->callee->uid);
+ 	    if (dest && dest != part)
+ 	      {
+ 		p = XCNEW (struct priority);
+ 		edge->callee->aux = p;
+ 		p->src_partition = part;
+ 		p->dest_partition = dest;
+ 		VEC_safe_push (priority_ptr, heap, part->priorities, p);
+ 		VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
+ 	      }
+ 	    else
+ 	      continue;
+ 	  }
+ 	else
+ 	  p = (struct priority *)edge->callee->aux;
+ 	if (edge->count)
+ 	  p->priority += edge->count + 1;
+ 	else
+ 	  {
+ 	    int mul = 1;
+ 	    switch (edge->caller->frequency)
+ 	      {
+ 	      case NODE_FREQUENCY_UNLIKELY_EXECUTED: mul = 0; break;
+ 	      case NODE_FREQUENCY_EXECUTED_ONCE: mul = 1; break;
+ 	      case NODE_FREQUENCY_NORMAL: mul = 10; break;
+ 	      case NODE_FREQUENCY_HOT: mul = 1000; break;
+ 	      }
+ 	    p->priority += edge->frequency * mul + 1;
+ 	  }
+       }
+     else
+       account_callees (part, partitions, edge->callee);
+ 
+   /* Account taking address of a function as possible call with a low priority.  */
+   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
+     if (ref->refered_type == IPA_REF_CGRAPH)
+       {
+ 	struct cgraph_node *node2 = ipa_ref_node (ref);
+ 	if (!node2->aux)
+ 	  {
+ 	    struct partition *dest = VEC_index (partition_ptr, partitions,
+ 						node2->uid);
+ 	    if (dest && dest != part)
+ 	      {
+ 		p = XCNEW (struct priority);
+ 		node2->aux = p;
+ 		p->src_partition = part;
+ 		p->dest_partition = dest;
+ 		VEC_safe_push (priority_ptr, heap, part->priorities, p);
+ 		VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
+ 	      }
+ 	    else
+ 	      continue;
+ 	  }
+ 	else
+ 	  p = (struct priority *)node2->aux;
+ 	p->priority++;
+       }
+ }
+ 
+ static void
+ clear_callees (struct cgraph_node *node)
+ {
+   struct cgraph_edge *edge;
+   int i;
+   struct ipa_ref *ref;
+ 
+   for (edge = node->callees; edge; edge = edge->next_callee)
+     if (edge->inline_failed)
+       edge->callee->aux = NULL;
+     else
+       clear_callees (edge->callee);
+   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
+     if (ref->refered_type == IPA_REF_CGRAPH)
+       ipa_ref_node (ref)->aux = NULL;
+ }
+ 
+ /* Compare two priority vector entries via indexes of destination
+    partitions.  */
+ 
+ static int
+ priority_cmp (const void *pa, const void *pb)
+ {
+   const struct priority *a = *(const struct priority * const *) pa;
+   const struct priority *b = *(const struct priority * const *) pb;
+   /* Priorities pointing to dead partitions are removed lazily,
+      so just avoid referencing dead memory.  */
+   if (!a->priority)
+     return ! b->priority ? 0 : -1;
+   if (!b->priority)
+     return 1;
+   return a->dest_partition->index - b->dest_partition->index;
+ }
+ 
+ /* Compare two priority vector entries via indexes of destination
+    partitions.  */
+ 
+ static int
+ in_priority_cmp (const void *pa, const void *pb)
+ {
+   const struct priority *a = *(const struct priority * const *) pa;
+   const struct priority *b = *(const struct priority * const *) pb;
+   /* Priorities pointing to dead partitions are removed lazily,
+      so just avoid referencing dead memory.  */
+   if (!a->priority)
+     return ! b->priority ? 0 : -1;
+   if (!b->priority)
+     return 1;
+   return a->src_partition->index - b->src_partition->index;
+ }
+ 
+ static void
+ delete_priority (fibheap_t heap, struct priority *p)
+ {
+   p->priority = 0;
+   p->src_partition = p->dest_partition = NULL;
+   if (p->node)
+     fibheap_delete_node (heap, p->node);
+   p->node = NULL;
+ }
+ 
+ static void
+ merge_partitions (VEC (partition_ptr, heap) *partitions,
+ 		  fibheap_t heap,
+ 		  struct partition *part1,
+ 		  struct partition *part2)
+ {
+   VEC (priority_ptr, heap) *priorities = NULL;
+   VEC (priority_ptr, heap) *in_priorities = NULL;
+   unsigned int i1 = 0, i2 = 0;
+   struct cgraph_node *node;
+   int i;
+ 
+   /* We keep priority lists ordered by indexes of destination partitions
+      to allow effective merging.  */
+   qsort (VEC_address (priority_ptr, part1->priorities),
+ 	 VEC_length (priority_ptr, part1->priorities),
+ 	 sizeof (priority_ptr),
+ 	 priority_cmp);
+   qsort (VEC_address (priority_ptr, part2->priorities),
+ 	 VEC_length (priority_ptr, part2->priorities),
+ 	 sizeof (priority_ptr),
+ 	 priority_cmp);
+   
+   /* Merge priority lists, prune out references of partition to itself.
+      Assume priority lists are ordered by indexes of destination partitions
+      and keep them so.  */
+   while (1)
+     {
+       struct priority *p1, *p2;
+ 
+       while (i1 < VEC_length (priority_ptr, part1->priorities)
+ 	     && !VEC_index (priority_ptr, part1->priorities, i1)->priority)
+ 	i1++;
+       while (i2 < VEC_length (priority_ptr, part2->priorities)
+ 	     && !VEC_index (priority_ptr, part2->priorities, i2)->priority)
+ 	i2++;
+ 
+       if (i1 == VEC_length (priority_ptr, part1->priorities)
+ 	  && i2 == VEC_length (priority_ptr, part2->priorities))
+ 	break;
+ 
+       if (i1 < VEC_length (priority_ptr, part1->priorities)
+ 	  && (i2 == VEC_length (priority_ptr, part2->priorities)
+ 	      || (VEC_index (priority_ptr, part1->priorities,
+ 			     i1)->dest_partition->index
+ 		  < VEC_index (priority_ptr, part2->priorities,
+ 			       i2)->dest_partition->index)))
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->priorities, i1);
+ 	  if (p1->dest_partition != part2)
+ 	    VEC_safe_push (priority_ptr, heap, priorities, p1);
+ 	  else
+ 	    delete_priority (heap, p1);
+ 	  i1++;
+ 	}
+       else if (i2 < VEC_length (priority_ptr, part2->priorities)
+ 	       && (i1 == VEC_length (priority_ptr, part1->priorities)
+ 		   || (VEC_index (priority_ptr, part1->priorities,
+ 				  i1)->dest_partition->index
+ 		       > VEC_index (priority_ptr, part2->priorities,
+ 				    i2)->dest_partition->index)))
+ 	{
+ 	  p2 = VEC_index (priority_ptr, part2->priorities, i2);
+ 	  if (p2->dest_partition != part1)
+ 	    {
+ 	      VEC_safe_push (priority_ptr, heap, priorities, p2);
+ 	      p2->src_partition = part1;
+ 	    }
+ 	  else
+ 	    delete_priority (heap, p2);
+ 	  i2++;
+ 	}
+       else
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->priorities, i1);
+ 	  p2 = VEC_index (priority_ptr, part2->priorities, i2);
+ 	  p1->priority += p2->priority;
+ 	  fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
+ 	  delete_priority (heap, p2);
+ 	  VEC_safe_push (priority_ptr, heap, priorities, p1);
+ 	  i1++;
+ 	  i2++;
+ 	}
+     }
+   VEC_free (priority_ptr, heap, part1->priorities);
+   part1->priorities = priorities;
+ 
+   qsort (VEC_address (priority_ptr, part1->in_priorities),
+ 	 VEC_length (priority_ptr, part1->in_priorities),
+ 	 sizeof (priority_ptr),
+ 	 in_priority_cmp);
+   qsort (VEC_address (priority_ptr, part2->in_priorities),
+ 	 VEC_length (priority_ptr, part2->in_priorities),
+ 	 sizeof (priority_ptr),
+ 	 in_priority_cmp);
+   i1 = 0;
+   i2 = 0;
+   while (1)
+     {
+       struct priority *p1, *p2;
+       while (i1 < VEC_length (priority_ptr, part1->in_priorities)
+ 	     && !VEC_index (priority_ptr, part1->in_priorities, i1)->priority)
+ 	i1++;
+       while (i2 < VEC_length (priority_ptr, part2->in_priorities)
+ 	     && !VEC_index (priority_ptr, part2->in_priorities, i2)->priority)
+ 	i2++;
+ 
+       if (i1 == VEC_length (priority_ptr, part1->in_priorities)
+ 	  && i2 == VEC_length (priority_ptr, part2->in_priorities))
+ 	break;
+ 
+       if (i1 < VEC_length (priority_ptr, part1->in_priorities)
+ 	  && (i2 == VEC_length (priority_ptr, part2->in_priorities)
+ 	      || (VEC_index (priority_ptr, part1->in_priorities,
+ 			     i1)->src_partition->index
+ 		  < VEC_index (priority_ptr, part2->in_priorities,
+ 			       i2)->src_partition->index)))
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
+ 	  if (p1->src_partition != part2)
+ 	    VEC_safe_push (priority_ptr, heap, in_priorities, p1);
+ 	  else
+ 	    delete_priority (heap, p1);
+ 	  i1++;
+ 	}
+       else if (i2 < VEC_length (priority_ptr, part2->in_priorities)
+ 	       && (i1 == VEC_length (priority_ptr, part1->in_priorities)
+ 		   || (VEC_index (priority_ptr, part1->in_priorities,
+ 				  i1)->src_partition->index
+ 		       > VEC_index (priority_ptr, part2->in_priorities,
+ 				    i2)->src_partition->index)))
+ 	{
+ 	  p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
+ 	  if (p2->src_partition != part1)
+ 	    {
+ 	      VEC_safe_push (priority_ptr, heap, in_priorities, p2);
+ 	      p2->dest_partition = part1;
+ 	    }
+ 	  else
+ 	    delete_priority (heap, p2);
+ 	  i2++;
+ 	}
+       else
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
+ 	  p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
+ 	  p1->priority += p2->priority;
+ 	  fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
+ 	  delete_priority (heap, p2);
+ 	  VEC_safe_push (priority_ptr, heap, in_priorities, p1);
+ 	  i1++;
+ 	  i2++;
+ 	}
+     }
+   VEC_free (priority_ptr, heap, part1->in_priorities);
+   part1->in_priorities = in_priorities;
+ 
+   for (i = 0; VEC_iterate (cgraph_node_ptr, part2->nodes, i, node); i++)
+     VEC_safe_push (cgraph_node_ptr, heap, part1->nodes, node);
+ 
+   VEC_replace (partition_ptr, partitions, part2->index, NULL);
+   VEC_free (priority_ptr, heap, part2->priorities);
+   VEC_free (priority_ptr, heap, part2->in_priorities);
+   VEC_free (cgraph_node_ptr, heap, part2->nodes);
+   free (part2);
+ }
+ 
+ static void
+ dump_partition (struct partition *part)
+ {
+   int i;
+   struct cgraph_node *node;
+   struct priority *p;
+   fprintf (dump_file, "  Partition %i:", part->index);
+   for (i = 0; VEC_iterate (cgraph_node_ptr, part->nodes, i, node); i++)
+     fprintf (dump_file, "  %s/%i", cgraph_node_name (node), node->uid);
+   fprintf (dump_file, "\n    priorities:");
+   for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
+     if (p->priority)
+       {
+ 	gcc_assert (p->src_partition == part);
+ 	gcc_assert (p->dest_partition != part);
+ 	fprintf (dump_file, "  %i:%i", p->dest_partition->index,
+ 		 (int)p->priority);
+       }
+   fprintf (dump_file, "\n");
+ }
+ 
+ static unsigned int
+ ipa_func_reorder (void)
+ {
+   struct cgraph_node *node;
+   VEC (partition_ptr, heap) *partitions = VEC_alloc (partition_ptr, heap, cgraph_max_uid);
+   int i;
+   struct partition *part, *first_part = NULL;
+   int freq;
+   fibheap_t heap;
+ 
+   if (!cgraph_max_uid)
+     return 0;
+ 
+   heap = fibheap_new ();
+   VEC_safe_grow_cleared (partition_ptr, heap, partitions, cgraph_max_uid);
+ 
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	part = XCNEW (struct partition);
+ 	part->index = node->uid;
+ 	VEC_safe_push (cgraph_node_ptr, heap, part->nodes, node);
+ 	VEC_replace (partition_ptr, partitions, node->uid, part);
+       }
+ 
+   if (dump_file)
+     fprintf (dump_file, "\n\nCreating partitions\n");
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	part = VEC_index (partition_ptr, partitions, node->uid);
+ 	account_callees (part, partitions, node);
+ 	clear_callees (node);
+ 	if (dump_file)
+ 	  dump_partition (part);
+       }
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	struct priority *p;
+ 	part = VEC_index (partition_ptr, partitions, node->uid);
+ 
+ 	for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
+ 	  p->node = fibheap_insert (heap, INT_MAX - p->priority, p);
+ 	if (dump_file)
+ 	  dump_partition (part);
+       }
+ 
+   if (dump_file)
+     fprintf (dump_file, "\n\nMerging by priorities\n");
+   while (!fibheap_empty (heap))
+     {
+       struct priority *p = (struct priority *)fibheap_extract_min (heap);
+       struct partition *part = p->src_partition;
+       p->node = NULL;
+       if (dump_file)
+ 	{
+ 	  fprintf (dump_file,
+ 	           "Concatenating partitions %i and %i, priority %i\n",
+ 		   p->src_partition->index,
+ 		   p->dest_partition->index,
+ 		   (int)p->priority);
+ 	  if (dump_file)
+ 	    dump_partition (p->src_partition);
+ 	  if (dump_file)
+ 	    dump_partition (p->dest_partition);
+ 	}
+       merge_partitions (partitions, heap, p->src_partition, p->dest_partition);
+       if (dump_file)
+ 	dump_partition (part);
+     }
+ 
+   /* We ran out of calls to merge.  Try to arrange remaining partitions
+      approximately in execution order: static constructors first, followed
+      by partition containing function main ()
+      followed by hot sections of the program.  */
+   if (dump_file)
+     fprintf (dump_file, "\n\nLooking for static constructors\n");
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     if (part && part != first_part
+ 	&& DECL_STATIC_CONSTRUCTOR (VEC_index (cgraph_node_ptr,
+ 					       part->nodes, 0)->decl))
+       {
+ 	if (dump_file)
+ 	  dump_partition (part);
+ 	if (!first_part)
+ 	 first_part = part;
+ 	else
+ 	 merge_partitions (partitions, heap, first_part, part);
+       }
+   if (dump_file)
+     fprintf (dump_file, "\n\nLooking for main\n");
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     if (part && part != first_part
+ 	&& MAIN_NAME_P (DECL_NAME (VEC_index (cgraph_node_ptr,
+ 					      part->nodes, 0)->decl)))
+       {
+ 	if (dump_file)
+ 	  dump_partition (part);
+ 	if (!first_part)
+ 	 first_part = part;
+ 	else
+ 	 merge_partitions (partitions, heap, first_part, part);
+       }
+   if (dump_file)
+     fprintf (dump_file, "\n\nMerging by frequency\n");
+   for (freq = NODE_FREQUENCY_HOT; freq >= NODE_FREQUENCY_UNLIKELY_EXECUTED;
+        freq--)
+     {
+       for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+ 	if (part && part != first_part
+ 	    && VEC_index (cgraph_node_ptr, part->nodes, 0)->frequency == freq)
+ 	  {
+ 	    if (dump_file)
+ 	      dump_partition (part);
+ 	    if (!first_part)
+ 	     first_part = part;
+ 	    else
+ 	     merge_partitions (partitions, heap, first_part, part);
+ 	  }
+     }
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     gcc_assert (!part || part == first_part);
+ 
+   fibheap_delete (heap);
+   if (!first_part)
+     return 0;
+   if (dump_file)
+     {
+       fprintf (dump_file, "\n\nFinal order\n");
+       dump_partition (first_part);
+     }
+ 
+   /* Store the resulting order and kill the single remaining partition.  */
+   for (i = 0; VEC_iterate (cgraph_node_ptr, first_part->nodes, i, node); i++)
+     node->order = i;
+   VEC_free (priority_ptr, heap, first_part->priorities);
+   VEC_free (cgraph_node_ptr, heap, first_part->nodes);
+   free (first_part);
+   return 0;
+ }
+ 
+ static bool
+ gate_ipa_func_reorder (void)
+ {
+   return optimize && flag_ipa_reorder && flag_toplevel_reorder;
+ }
+ 
+ struct ipa_opt_pass_d pass_ipa_func_reorder =
+ {
+  {
+   IPA_PASS,
+   "reorder",				/* name */
+   gate_ipa_func_reorder,		/* gate */
+   ipa_func_reorder,		        /* execute */
+   NULL,					/* sub */
+   NULL,					/* next */
+   0,					/* static_pass_number */
+   TV_CGRAPHOPT,			        /* tv_id */
+   0,	                                /* properties_required */
+   0,					/* properties_provided */
+   0,					/* properties_destroyed */
+   0,					/* todo_flags_start */
+   0                                     /* todo_flags_finish */
+  },
+  NULL,				        /* generate_summary */
+  NULL,					/* write_summary */
+  NULL,					/* read_summary */
+  NULL,					/* write_optimization_summary */
+  NULL,					/* read_optimization_summary */
+  NULL,					/* stmt_fixup */
+  0,					/* TODOs */
+  NULL,			                /* function_transform */
+  NULL					/* variable_transform */
+ };
Index: Makefile.in
===================================================================
*** Makefile.in	(revision 164689)
--- Makefile.in	(working copy)
*************** OBJS-archive = \
*** 1461,1466 ****
--- 1461,1467 ----
  	ipa-type-escape.o \
  	ipa-utils.o \
  	ipa.o \
+ 	ipa-reorder.o \
  	matrix-reorg.o \
  	prefix.o \
  	tree-inline.o \
*************** varpool.o : varpool.c $(CONFIG_H) $(SYST
*** 3012,3017 ****
--- 3013,3020 ----
     $(TREE_FLOW_H) gt-varpool.h
  ipa.o : ipa.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(CGRAPH_H) \
     $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
+ ipa-reorder.o : ipa-reorder.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
+    $(CGRAPH_H) $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
  ipa-prop.o : ipa-prop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
     langhooks.h $(GGC_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(DIAGNOSTIC_H) \
     $(TREE_FLOW_H) $(TM_H) $(TREE_PASS_H) $(FLAGS_H) $(TREE_H) \
Index: passes.c
===================================================================
*** passes.c	(revision 164689)
--- passes.c	(working copy)
*************** init_optimization_passes (void)
*** 818,823 ****
--- 818,824 ----
    NEXT_PASS (pass_ipa_type_escape);
    NEXT_PASS (pass_ipa_pta);
    NEXT_PASS (pass_ipa_struct_reorg);
+   NEXT_PASS (pass_ipa_func_reorder);
    *p = NULL;
  
    p = &all_lto_gen_passes;
Richard Biener Oct. 7, 2019, 7:01 a.m. UTC | #10
On Sun, Oct 6, 2019 at 4:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 9/19/19 2:33 AM, Martin Liška wrote:
> > > Hi.
> > >
> > > Function reordering has been around for quite some time and a naive
> > > implementation was also part of my diploma thesis some time ago.
> > > Currently, the GCC can reorder function based on first execution, which
> > > happens with PGO and LTO of course. Known limitation is that the order
> > > is preserved only partially as various symbols go into different LTRANS partitions.
> > >
> > > There has been some research in the area and I would point out the Facebook paper
> > > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> > > in the GCC that does the same (in a proper way). First part of the enablement are patches
> > > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> > > There's a snippet from the modified default linker script:
> > Funny, I was doing this via linker scripts circa ~95, in fact that's why
> > we have -ffunction-sections :-)   We started with scripts which
> > post-processed profiling data to create linker scripts for ELF systems.
> >  We had something for HPUX/SOM as well, but I can't remember what
> > mechanism we used, it may have been a gross level sorting using the SOM
> > section sort key mechanism - it only had 128 or 256 keys with a notable
> > amount of them reserved.
> >
> > We had also built a linker with a basic level of interposition circa
> > 1993 and explored various approaches to reordering executables.  I'd
> > just joined the group at the time and was responsible for wiring up
> > stuff on the OS side, but eventually got the "pleasure" of owning the
> > linker server.  A lot of the C3 algorithmic stuff looks similar to what
> > we did.
>
> For reference I am attaching early LTO version from circa 2010 :)
> >
> > Anyway...
> >
> > I don't see anything objectionable in here.  It's noted as an RFC.  Are
> > you interested in pushing this forward for gcc-10?
>
> I think it is plan to get some form of code layout pass into GCC10.  I
> will test Martin's version on Firefox and see if it have any effect
> here. It is generally quite hard to evaluate those changes (and it is
> reason why I did not moved forward with version form 2010 - more
> precisely it kept falling off the shelf for about a decade)
>
> If LLD has support for cgraph annotations and we support LLD, i think we
> should add support for that, too - it will be very useful to compare
> indvidiual implementations.
> I believe there is gold support (off tree?) for that too and something
> similar is also supported by other toolchains like AIX?
>
> One problem with the sections approach is that every section gets
> aligned to largest code alignment inside of the section. Since our
> alignments are (should be) often cache line based we get cca 30 bytes of
> waste for every function that is quite a lot.
>
> This is why we currently have way to order function when outputting them
> and use that with FDO (using Martin's first execution logic). This has
> drwarback of making the functions to flow in that order through late
> optimizations and RTL backend and thus we lose IPA-RA and some
> IP propagation (late pure/const/nothrow discovery).

But you can also fix that by the parallelization GSoC project approach,
decoupling things at RTL expansion IPA-wise and using output order
for the latter (or even do the fragments in GCC itself by refactoring the
output machinery to use function-local "strings", assembling the final
file later).  Refactoring output to handle multiple output "files" at the
same time might also help getting rid of that early-LTO-debug copying
stuff (now that it seems that linker-plugin extensions for partially claiming
files will never happen...)

> So this approach has a drawback, too. It is why i was trying to push
> myself to get gcc to use gas fragments :)
>
> Anyway, all these issues can be sovled incementally - lets see how
> Maritn's patch work on Firefox and if we can get it tested elsewhere and
> start from that.
>
> I will take a look into the details.
>
> Honza
> >
> > jeff
> >
>
> Index: tree-pass.h
> ===================================================================
> *** tree-pass.h (revision 164689)
> --- tree-pass.h (working copy)
> *************** extern struct ipa_opt_pass_d pass_ipa_lt
> *** 467,472 ****
> --- 467,473 ----
>   extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
>   extern struct ipa_opt_pass_d pass_ipa_profile;
>   extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
> + extern struct ipa_opt_pass_d pass_ipa_func_reorder;
>
>   extern struct gimple_opt_pass pass_all_optimizations;
>   extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
> Index: cgraphunit.c
> ===================================================================
> *** cgraphunit.c        (revision 164689)
> --- cgraphunit.c        (working copy)
> *************** cgraph_inline_p (struct cgraph_edge *e,
> *** 1517,1522 ****
> --- 1517,1529 ----
>     return !e->inline_failed;
>   }
>
> + static int
> + node_cmp (const void *pa, const void *pb)
> + {
> +   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
> +   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
> +   return b->order - a->order;
> + }
>
>
>   /* Expand all functions that must be output.
> *************** cgraph_expand_all_functions (void)
> *** 1546,1551 ****
> --- 1553,1561 ----
>       if (order[i]->process)
>         order[new_order_pos++] = order[i];
>
> +   if (flag_ipa_reorder)
> +     qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
> +
>     for (i = new_order_pos - 1; i >= 0; i--)
>       {
>         node = order[i];
> Index: tree-ssa-ccp.c
> ===================================================================
> *** tree-ssa-ccp.c      (revision 164689)
> --- tree-ssa-ccp.c      (working copy)
> *************** ccp_fold_stmt (gimple_stmt_iterator *gsi
> *** 2307,2312 ****
> --- 2307,2324 ----
>                 changed = true;
>               }
>           }
> +       if (TREE_CODE (gimple_call_fn (stmt)) == OBJ_TYPE_REF)
> +         {
> +           tree expr = OBJ_TYPE_REF_EXPR (gimple_call_fn (stmt));
> +           expr = valueize_op (expr);
> +           if (TREE_CODE (expr) == ADDR_EXPR
> +               && TREE_CODE (TREE_OPERAND (expr, 0)) == FUNCTION_DECL)
> +            {
> +              gimple_call_set_fndecl (stmt, TREE_OPERAND (expr, 0));
> +              debug_gimple_stmt (stmt);
> +              changed = true;
> +            }
> +         }
>
>         return changed;
>         }
> Index: lto-cgraph.c
> ===================================================================
> *** lto-cgraph.c        (revision 164689)
> --- lto-cgraph.c        (working copy)
> *************** lto_output_node (struct lto_simple_outpu
> *** 466,471 ****
> --- 466,473 ----
>
>     if (tag == LTO_cgraph_analyzed_node)
>       {
> +       lto_output_uleb128_stream (ob->main_stream,
> +                                node->order);
>         lto_output_sleb128_stream (ob->main_stream,
>                                  node->local.inline_summary.estimated_self_stack_size);
>         lto_output_sleb128_stream (ob->main_stream,
> *************** input_overwrite_node (struct lto_file_de
> *** 927,932 ****
> --- 929,935 ----
>                       struct cgraph_node *node,
>                       enum LTO_cgraph_tags tag,
>                       struct bitpack_d *bp,
> +                     int order,
>                       unsigned int stack_size,
>                       unsigned int self_time,
>                       unsigned int time_inlining_benefit,
> *************** input_overwrite_node (struct lto_file_de
> *** 935,940 ****
> --- 938,944 ----
>                       enum ld_plugin_symbol_resolution resolution)
>   {
>     node->aux = (void *) tag;
> +   node->order = order;
>     node->local.inline_summary.estimated_self_stack_size = stack_size;
>     node->local.inline_summary.self_time = self_time;
>     node->local.inline_summary.time_inlining_benefit = time_inlining_benefit;
> *************** input_node (struct lto_file_decl_data *f
> *** 1027,1032 ****
> --- 1031,1037 ----
>     unsigned long same_body_count = 0;
>     int clone_ref;
>     enum ld_plugin_symbol_resolution resolution;
> +   int order = 0;
>
>     clone_ref = lto_input_sleb128 (ib);
>
> *************** input_node (struct lto_file_decl_data *f
> *** 1045,1050 ****
> --- 1050,1056 ----
>
>     if (tag == LTO_cgraph_analyzed_node)
>       {
> +       order = lto_input_uleb128 (ib);
>         stack_size = lto_input_sleb128 (ib);
>         self_size = lto_input_sleb128 (ib);
>         size_inlining_benefit = lto_input_sleb128 (ib);
> *************** input_node (struct lto_file_decl_data *f
> *** 1066,1072 ****
>
>     bp = lto_input_bitpack (ib);
>     resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
> !   input_overwrite_node (file_data, node, tag, &bp, stack_size, self_time,
>                         time_inlining_benefit, self_size,
>                         size_inlining_benefit, resolution);
>
> --- 1072,1078 ----
>
>     bp = lto_input_bitpack (ib);
>     resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
> !   input_overwrite_node (file_data, node, tag, &bp, order, stack_size, self_time,
>                         time_inlining_benefit, self_size,
>                         size_inlining_benefit, resolution);
>
> Index: gimple-fold.c
> ===================================================================
> *** gimple-fold.c       (revision 164689)
> --- gimple-fold.c       (working copy)
> *************** gimple_fold_obj_type_ref (tree ref, tree
> *** 1465,1470 ****
> --- 1465,1480 ----
>     tree known_binfo = known_type ? TYPE_BINFO (known_type) : NULL_TREE;
>     tree binfo;
>
> +   if (TREE_CODE (obj) == SSA_NAME)
> +     {
> +       gimple def_stmt = SSA_NAME_DEF_STMT (obj);
> +       if (def_stmt
> +         && gimple_code (def_stmt) == GIMPLE_ASSIGN
> +         && (gimple_assign_single_p (def_stmt)
> +             || (gimple_assign_unary_nop_p (def_stmt))))
> +       obj = gimple_assign_rhs1 (def_stmt);
> +     }
> +
>     if (TREE_CODE (obj) == ADDR_EXPR)
>       obj = TREE_OPERAND (obj, 0);
>
> *************** fold_gimple_call (gimple_stmt_iterator *
> *** 1513,1520 ****
>            copying EH region info to the new node.  Easier to just do it
>            here where we can just smash the call operand.  */
>         callee = gimple_call_fn (stmt);
> !       if (TREE_CODE (callee) == OBJ_TYPE_REF
> !           && TREE_CODE (OBJ_TYPE_REF_OBJECT (callee)) == ADDR_EXPR)
>           {
>             tree t;
>
> --- 1523,1529 ----
>            copying EH region info to the new node.  Easier to just do it
>            here where we can just smash the call operand.  */
>         callee = gimple_call_fn (stmt);
> !       if (TREE_CODE (callee) == OBJ_TYPE_REF)
>           {
>             tree t;
>
> Index: common.opt
> ===================================================================
> *** common.opt  (revision 164689)
> --- common.opt  (working copy)
> *************** fdse
> *** 1107,1112 ****
> --- 1107,1116 ----
>   Common Var(flag_dse) Init(1) Optimization
>   Use the RTL dead store elimination pass
>
> + freorder-functions
> + Common Report Var(flag_ipa_reorder) Init(1) Optimization
> + Reorder functions for better locality at execution time
> +
>   freschedule-modulo-scheduled-loops
>   Common Report Var(flag_resched_modulo_sched) Optimization
>   Enable/Disable the traditional scheduling in loops that already passed modulo scheduling
> Index: ipa-reorder.c
> ===================================================================
> *** ipa-reorder.c       (revision 0)
> --- ipa-reorder.c       (revision 0)
> ***************
> *** 0 ****
> --- 1,567 ----
> + /* IPA function reordering pass.
> +    Copyright (C) 2010 Free Software Foundation, Inc.
> +    Contributed by Jan Hubicka  <jh@suse.cz>
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify it under
> + the terms of the GNU General Public License as published by the Free
> + Software Foundation; either version 3, or (at your option) any later
> + version.
> +
> + GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> + WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with GCC; see the file COPYING3.  If not see
> + <http://www.gnu.org/licenses/>.  */
> +
> + #include "config.h"
> + #include "system.h"
> + #include "coretypes.h"
> + #include "tm.h"
> + #include "cgraph.h"
> + #include "tree-pass.h"
> + #include "timevar.h"
> + #include "gimple.h"
> + #include "ggc.h"
> + #include "flags.h"
> + #include "pointer-set.h"
> + #include "target.h"
> + #include "tree-iterator.h"
> + #include "fibheap.h"
> +
> + struct priority
> + {
> +   gcov_type priority;
> +   struct partition *src_partition;
> +   struct partition *dest_partition;
> +   fibnode_t node;
> + };
> +
> + typedef struct priority *priority_ptr;
> +
> + DEF_VEC_P(priority_ptr);
> + DEF_VEC_ALLOC_P(priority_ptr,heap);
> +
> + struct partition
> + {
> +   int index;
> +   VEC(cgraph_node_ptr, heap) *nodes;
> +   VEC(priority_ptr, heap) *priorities;
> +   VEC(priority_ptr, heap) *in_priorities;
> + };
> +
> + typedef struct partition *partition_ptr;
> +
> + DEF_VEC_P(partition_ptr);
> + DEF_VEC_ALLOC_P(partition_ptr,heap);
> +
> + static bool
> + cgraph_node_will_be_output_p (struct cgraph_node *node)
> + {
> +   return (node->analyzed && !node->global.inlined_to);
> + }
> +
> + static void
> + account_callees (struct partition *part,
> +                VEC (partition_ptr, heap) *partitions,
> +                struct cgraph_node *node)
> + {
> +   struct cgraph_edge *edge;
> +   struct priority *p;
> +   int i;
> +   struct ipa_ref *ref;
> +
> +   for (edge = node->callees; edge; edge = edge->next_callee)
> +     if (edge->inline_failed)
> +       {
> +       if (!edge->callee->aux)
> +         {
> +           struct partition *dest = VEC_index (partition_ptr, partitions,
> +                                               edge->callee->uid);
> +           if (dest && dest != part)
> +             {
> +               p = XCNEW (struct priority);
> +               edge->callee->aux = p;
> +               p->src_partition = part;
> +               p->dest_partition = dest;
> +               VEC_safe_push (priority_ptr, heap, part->priorities, p);
> +               VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
> +             }
> +           else
> +             continue;
> +         }
> +       else
> +         p = (struct priority *)edge->callee->aux;
> +       if (edge->count)
> +         p->priority += edge->count + 1;
> +       else
> +         {
> +           int mul = 1;
> +           switch (edge->caller->frequency)
> +             {
> +             case NODE_FREQUENCY_UNLIKELY_EXECUTED: mul = 0; break;
> +             case NODE_FREQUENCY_EXECUTED_ONCE: mul = 1; break;
> +             case NODE_FREQUENCY_NORMAL: mul = 10; break;
> +             case NODE_FREQUENCY_HOT: mul = 1000; break;
> +             }
> +           p->priority += edge->frequency * mul + 1;
> +         }
> +       }
> +     else
> +       account_callees (part, partitions, edge->callee);
> +
> +   /* Account taking address of a function as possible call with a low priority.  */
> +   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
> +     if (ref->refered_type == IPA_REF_CGRAPH)
> +       {
> +       struct cgraph_node *node2 = ipa_ref_node (ref);
> +       if (!node2->aux)
> +         {
> +           struct partition *dest = VEC_index (partition_ptr, partitions,
> +                                               node2->uid);
> +           if (dest && dest != part)
> +             {
> +               p = XCNEW (struct priority);
> +               node2->aux = p;
> +               p->src_partition = part;
> +               p->dest_partition = dest;
> +               VEC_safe_push (priority_ptr, heap, part->priorities, p);
> +               VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
> +             }
> +           else
> +             continue;
> +         }
> +       else
> +         p = (struct priority *)node2->aux;
> +       p->priority++;
> +       }
> + }
> +
> + static void
> + clear_callees (struct cgraph_node *node)
> + {
> +   struct cgraph_edge *edge;
> +   int i;
> +   struct ipa_ref *ref;
> +
> +   for (edge = node->callees; edge; edge = edge->next_callee)
> +     if (edge->inline_failed)
> +       edge->callee->aux = NULL;
> +     else
> +       clear_callees (edge->callee);
> +   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
> +     if (ref->refered_type == IPA_REF_CGRAPH)
> +       ipa_ref_node (ref)->aux = NULL;
> + }
> +
> + /* Compare two priority vector entries via indexes of destination
> +    partitions.  */
> +
> + static int
> + priority_cmp (const void *pa, const void *pb)
> + {
> +   const struct priority *a = *(const struct priority * const *) pa;
> +   const struct priority *b = *(const struct priority * const *) pb;
> +   /* Priorities pointing to dead partitions are removed lazily,
> +      so just avoid referencing dead memory.  */
> +   if (!a->priority)
> +     return ! b->priority ? 0 : -1;
> +   if (!b->priority)
> +     return 1;
> +   return a->dest_partition->index - b->dest_partition->index;
> + }
> +
> + /* Compare two priority vector entries via indexes of source
> +    partitions.  */
> +
> + static int
> + in_priority_cmp (const void *pa, const void *pb)
> + {
> +   const struct priority *a = *(const struct priority * const *) pa;
> +   const struct priority *b = *(const struct priority * const *) pb;
> +   /* Priorities pointing to dead partitions are removed lazily,
> +      so just avoid referencing dead memory.  */
> +   if (!a->priority)
> +     return ! b->priority ? 0 : -1;
> +   if (!b->priority)
> +     return 1;
> +   return a->src_partition->index - b->src_partition->index;
> + }
> +
> + static void
> + delete_priority (fibheap_t heap, struct priority *p)
> + {
> +   p->priority = 0;
> +   p->src_partition = p->dest_partition = NULL;
> +   if (p->node)
> +     fibheap_delete_node (heap, p->node);
> +   p->node = NULL;
> + }
> +
> + static void
> + merge_partitions (VEC (partition_ptr, heap) *partitions,
> +                 fibheap_t heap,
> +                 struct partition *part1,
> +                 struct partition *part2)
> + {
> +   VEC (priority_ptr, heap) *priorities = NULL;
> +   VEC (priority_ptr, heap) *in_priorities = NULL;
> +   unsigned int i1 = 0, i2 = 0;
> +   struct cgraph_node *node;
> +   int i;
> +
> +   /* We keep priority lists ordered by indexes of destination partitions
> +      to allow effective merging.  */
> +   qsort (VEC_address (priority_ptr, part1->priorities),
> +        VEC_length (priority_ptr, part1->priorities),
> +        sizeof (priority_ptr),
> +        priority_cmp);
> +   qsort (VEC_address (priority_ptr, part2->priorities),
> +        VEC_length (priority_ptr, part2->priorities),
> +        sizeof (priority_ptr),
> +        priority_cmp);
> +
> +   /* Merge priority lists, prune out references of partition to itself.
> +      Assume priority lists are ordered by indexes of destination partitions
> +      and keep them so.  */
> +   while (1)
> +     {
> +       struct priority *p1, *p2;
> +
> +       while (i1 < VEC_length (priority_ptr, part1->priorities)
> +            && !VEC_index (priority_ptr, part1->priorities, i1)->priority)
> +       i1++;
> +       while (i2 < VEC_length (priority_ptr, part2->priorities)
> +            && !VEC_index (priority_ptr, part2->priorities, i2)->priority)
> +       i2++;
> +
> +       if (i1 == VEC_length (priority_ptr, part1->priorities)
> +         && i2 == VEC_length (priority_ptr, part2->priorities))
> +       break;
> +
> +       if (i1 < VEC_length (priority_ptr, part1->priorities)
> +         && (i2 == VEC_length (priority_ptr, part2->priorities)
> +             || (VEC_index (priority_ptr, part1->priorities,
> +                            i1)->dest_partition->index
> +                 < VEC_index (priority_ptr, part2->priorities,
> +                              i2)->dest_partition->index)))
> +       {
> +         p1 = VEC_index (priority_ptr, part1->priorities, i1);
> +         if (p1->dest_partition != part2)
> +           VEC_safe_push (priority_ptr, heap, priorities, p1);
> +         else
> +           delete_priority (heap, p1);
> +         i1++;
> +       }
> +       else if (i2 < VEC_length (priority_ptr, part2->priorities)
> +              && (i1 == VEC_length (priority_ptr, part1->priorities)
> +                  || (VEC_index (priority_ptr, part1->priorities,
> +                                 i1)->dest_partition->index
> +                      > VEC_index (priority_ptr, part2->priorities,
> +                                   i2)->dest_partition->index)))
> +       {
> +         p2 = VEC_index (priority_ptr, part2->priorities, i2);
> +         if (p2->dest_partition != part1)
> +           {
> +             VEC_safe_push (priority_ptr, heap, priorities, p2);
> +             p2->src_partition = part1;
> +           }
> +         else
> +           delete_priority (heap, p2);
> +         i2++;
> +       }
> +       else
> +       {
> +         p1 = VEC_index (priority_ptr, part1->priorities, i1);
> +         p2 = VEC_index (priority_ptr, part2->priorities, i2);
> +         p1->priority += p2->priority;
> +         fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
> +         delete_priority (heap, p2);
> +         VEC_safe_push (priority_ptr, heap, priorities, p1);
> +         i1++;
> +         i2++;
> +       }
> +     }
> +   VEC_free (priority_ptr, heap, part1->priorities);
> +   part1->priorities = priorities;
> +
> +   qsort (VEC_address (priority_ptr, part1->in_priorities),
> +        VEC_length (priority_ptr, part1->in_priorities),
> +        sizeof (priority_ptr),
> +        in_priority_cmp);
> +   qsort (VEC_address (priority_ptr, part2->in_priorities),
> +        VEC_length (priority_ptr, part2->in_priorities),
> +        sizeof (priority_ptr),
> +        in_priority_cmp);
> +   i1 = 0;
> +   i2 = 0;
> +   while (1)
> +     {
> +       struct priority *p1, *p2;
> +       while (i1 < VEC_length (priority_ptr, part1->in_priorities)
> +            && !VEC_index (priority_ptr, part1->in_priorities, i1)->priority)
> +       i1++;
> +       while (i2 < VEC_length (priority_ptr, part2->in_priorities)
> +            && !VEC_index (priority_ptr, part2->in_priorities, i2)->priority)
> +       i2++;
> +
> +       if (i1 == VEC_length (priority_ptr, part1->in_priorities)
> +         && i2 == VEC_length (priority_ptr, part2->in_priorities))
> +       break;
> +
> +       if (i1 < VEC_length (priority_ptr, part1->in_priorities)
> +         && (i2 == VEC_length (priority_ptr, part2->in_priorities)
> +             || (VEC_index (priority_ptr, part1->in_priorities,
> +                            i1)->src_partition->index
> +                 < VEC_index (priority_ptr, part2->in_priorities,
> +                              i2)->src_partition->index)))
> +       {
> +         p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
> +         if (p1->src_partition != part2)
> +           VEC_safe_push (priority_ptr, heap, in_priorities, p1);
> +         else
> +           delete_priority (heap, p1);
> +         i1++;
> +       }
> +       else if (i2 < VEC_length (priority_ptr, part2->in_priorities)
> +              && (i1 == VEC_length (priority_ptr, part1->in_priorities)
> +                  || (VEC_index (priority_ptr, part1->in_priorities,
> +                                 i1)->src_partition->index
> +                      > VEC_index (priority_ptr, part2->in_priorities,
> +                                   i2)->src_partition->index)))
> +       {
> +         p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
> +         if (p2->src_partition != part1)
> +           {
> +             VEC_safe_push (priority_ptr, heap, in_priorities, p2);
> +             p2->dest_partition = part1;
> +           }
> +         else
> +           delete_priority (heap, p2);
> +         i2++;
> +       }
> +       else
> +       {
> +         p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
> +         p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
> +         p1->priority += p2->priority;
> +         fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
> +         delete_priority (heap, p2);
> +         VEC_safe_push (priority_ptr, heap, in_priorities, p1);
> +         i1++;
> +         i2++;
> +       }
> +     }
> +   VEC_free (priority_ptr, heap, part1->in_priorities);
> +   part1->in_priorities = in_priorities;
> +
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, part2->nodes, i, node); i++)
> +     VEC_safe_push (cgraph_node_ptr, heap, part1->nodes, node);
> +
> +   VEC_replace (partition_ptr, partitions, part2->index, NULL);
> +   VEC_free (priority_ptr, heap, part2->priorities);
> +   VEC_free (priority_ptr, heap, part2->in_priorities);
> +   VEC_free (cgraph_node_ptr, heap, part2->nodes);
> +   free (part2);
> + }
> +
> + static void
> + dump_partition (struct partition *part)
> + {
> +   int i;
> +   struct cgraph_node *node;
> +   struct priority *p;
> +   fprintf (dump_file, "  Partition %i:", part->index);
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, part->nodes, i, node); i++)
> +     fprintf (dump_file, "  %s/%i", cgraph_node_name (node), node->uid);
> +   fprintf (dump_file, "\n    priorities:");
> +   for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
> +     if (p->priority)
> +       {
> +       gcc_assert (p->src_partition == part);
> +       gcc_assert (p->dest_partition != part);
> +       fprintf (dump_file, "  %i:%i", p->dest_partition->index,
> +                (int)p->priority);
> +       }
> +   fprintf (dump_file, "\n");
> + }
> +
> + static unsigned int
> + ipa_func_reorder (void)
> + {
> +   struct cgraph_node *node;
> +   VEC (partition_ptr, heap) *partitions = VEC_alloc (partition_ptr, heap, cgraph_max_uid);
> +   int i;
> +   struct partition *part, *first_part = NULL;
> +   int freq;
> +   fibheap_t heap;
> +
> +   if (!cgraph_max_uid)
> +     return 0;
> +
> +   heap = fibheap_new ();
> +   VEC_safe_grow_cleared (partition_ptr, heap, partitions, cgraph_max_uid);
> +
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       part = XCNEW (struct partition);
> +       part->index = node->uid;
> +       VEC_safe_push (cgraph_node_ptr, heap, part->nodes, node);
> +       VEC_replace (partition_ptr, partitions, node->uid, part);
> +       }
> +
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nCreating partitions\n");
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       part = VEC_index (partition_ptr, partitions, node->uid);
> +       account_callees (part, partitions, node);
> +       clear_callees (node);
> +       if (dump_file)
> +         dump_partition (part);
> +       }
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       struct priority *p;
> +       part = VEC_index (partition_ptr, partitions, node->uid);
> +
> +       for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
> +         p->node = fibheap_insert (heap, INT_MAX - p->priority, p);
> +       if (dump_file)
> +         dump_partition (part);
> +       }
> +
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nMerging by priorities\n");
> +   while (!fibheap_empty (heap))
> +     {
> +       struct priority *p = (struct priority *)fibheap_extract_min (heap);
> +       struct partition *part = p->src_partition;
> +       p->node = NULL;
> +       if (dump_file)
> +       {
> +         fprintf (dump_file,
> +                  "Concatenating partitions %i and %i, priority %i\n",
> +                  p->src_partition->index,
> +                  p->dest_partition->index,
> +                  (int)p->priority);
> +         if (dump_file)
> +           dump_partition (p->src_partition);
> +         if (dump_file)
> +           dump_partition (p->dest_partition);
> +       }
> +       merge_partitions (partitions, heap, p->src_partition, p->dest_partition);
> +       if (dump_file)
> +       dump_partition (part);
> +     }
> +
> +   /* We ran out of calls to merge.  Try to arrange remaining partitions
> +      approximately in execution order: static constructors first, followed
> +      by the partition containing function main (),
> +      followed by the hot sections of the program.  */
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nLooking for static constructors\n");
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     if (part && part != first_part
> +       && DECL_STATIC_CONSTRUCTOR (VEC_index (cgraph_node_ptr,
> +                                              part->nodes, 0)->decl))
> +       {
> +       if (dump_file)
> +         dump_partition (part);
> +       if (!first_part)
> +        first_part = part;
> +       else
> +        merge_partitions (partitions, heap, first_part, part);
> +       }
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nLooking for main\n");
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     if (part && part != first_part
> +       && MAIN_NAME_P (DECL_NAME (VEC_index (cgraph_node_ptr,
> +                                             part->nodes, 0)->decl)))
> +       {
> +       if (dump_file)
> +         dump_partition (part);
> +       if (!first_part)
> +        first_part = part;
> +       else
> +        merge_partitions (partitions, heap, first_part, part);
> +       }
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nMerging by frequency\n");
> +   for (freq = NODE_FREQUENCY_HOT; freq >= NODE_FREQUENCY_UNLIKELY_EXECUTED;
> +        freq--)
> +     {
> +       for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +       if (part && part != first_part
> +           && VEC_index (cgraph_node_ptr, part->nodes, 0)->frequency == freq)
> +         {
> +           if (dump_file)
> +             dump_partition (part);
> +           if (!first_part)
> +            first_part = part;
> +           else
> +            merge_partitions (partitions, heap, first_part, part);
> +         }
> +     }
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     gcc_assert (!part || part == first_part);
> +
> +   fibheap_delete (heap);
> +   if (!first_part)
> +     return 0;
> +   if (dump_file)
> +     {
> +       fprintf (dump_file, "\n\nFinal order\n");
> +       dump_partition (first_part);
> +     }
> +
> +   /* Store the resulting order and kill the single remaining partition.  */
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, first_part->nodes, i, node); i++)
> +     node->order = i;
> +   VEC_free (priority_ptr, heap, first_part->priorities);
> +   VEC_free (cgraph_node_ptr, heap, first_part->nodes);
> +   free (first_part);
> +   return 0;
> + }
> +
> + static bool
> + gate_ipa_func_reorder (void)
> + {
> +   return optimize && flag_ipa_reorder && flag_toplevel_reorder;
> + }
> +
> + struct ipa_opt_pass_d pass_ipa_func_reorder =
> + {
> +  {
> +   IPA_PASS,
> +   "reorder",                          /* name */
> +   gate_ipa_func_reorder,              /* gate */
> +   ipa_func_reorder,                   /* execute */
> +   NULL,                                       /* sub */
> +   NULL,                                       /* next */
> +   0,                                  /* static_pass_number */
> +   TV_CGRAPHOPT,                               /* tv_id */
> +   0,                                  /* properties_required */
> +   0,                                  /* properties_provided */
> +   0,                                  /* properties_destroyed */
> +   0,                                  /* todo_flags_start */
> +   0                                     /* todo_flags_finish */
> +  },
> +  NULL,                                        /* generate_summary */
> +  NULL,                                        /* write_summary */
> +  NULL,                                        /* read_summary */
> +  NULL,                                        /* write_optimization_summary */
> +  NULL,                                        /* read_optimization_summary */
> +  NULL,                                        /* stmt_fixup */
> +  0,                                   /* TODOs */
> +  NULL,                                        /* function_transform */
> +  NULL                                 /* variable_transform */
> + };
> Index: Makefile.in
> ===================================================================
> *** Makefile.in (revision 164689)
> --- Makefile.in (working copy)
> *************** OBJS-archive = \
> *** 1461,1466 ****
> --- 1461,1467 ----
>         ipa-type-escape.o \
>         ipa-utils.o \
>         ipa.o \
> +       ipa-reorder.o \
>         matrix-reorg.o \
>         prefix.o \
>         tree-inline.o \
> *************** varpool.o : varpool.c $(CONFIG_H) $(SYST
> *** 3012,3017 ****
> --- 3013,3020 ----
>      $(TREE_FLOW_H) gt-varpool.h
>   ipa.o : ipa.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(CGRAPH_H) \
>      $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
> + ipa-reorder.o : ipa-reorder.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
> +    $(CGRAPH_H) $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
>   ipa-prop.o : ipa-prop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
>      langhooks.h $(GGC_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(DIAGNOSTIC_H) \
>      $(TREE_FLOW_H) $(TM_H) $(TREE_PASS_H) $(FLAGS_H) $(TREE_H) \
> Index: passes.c
> ===================================================================
> *** passes.c    (revision 164689)
> --- passes.c    (working copy)
> *************** init_optimization_passes (void)
> *** 818,823 ****
> --- 818,824 ----
>     NEXT_PASS (pass_ipa_type_escape);
>     NEXT_PASS (pass_ipa_pta);
>     NEXT_PASS (pass_ipa_struct_reorg);
> +   NEXT_PASS (pass_ipa_func_reorder);
>     *p = NULL;
>
>     p = &all_lto_gen_passes;
Jan Hubicka Oct. 7, 2019, 9:33 a.m. UTC | #11
> > This is why we currently have a way to order functions when outputting
> > them and use that with FDO (using Martin's first-execution logic).  This
> > has the drawback of making the functions flow in that order through the
> > late optimizations and the RTL backend, and thus we lose IPA-RA and some
> > IP propagation (late pure/const/nothrow discovery).
> 
> But you can also fix that by the parallelization GSoC project approach,
> decoupling things at RTL expansion IPA-wise and using output order
> for the latter (or even do the fragments in GCC itself by refactoring the
> output machinery to use function-local "strings", assembling the final
> file later).  Refactoring output to handle multiple output "files" at the
> same time might also help getting rid of that early-LTO-debug copying
> stuff (now that it seems that linker-plugin extensions for partially claiming
> files will never happen...)

Multiple output files won't solve my concern about padding and
relaxation, but yep, parallelizing RTL will definitely need to introduce
a way to hold the final assembly and glue it together, so this problem
may get solved independently, which would be nice.

Honza
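
For readers who want the clustering step in isolation, here is a minimal
standalone sketch of the merge loop that sort_functions_by_c3 in the patch
below performs: call edges are processed from hottest to coldest and the
caller's and callee's clusters are merged while the combined size still fits
the threshold.  The toy function names, sizes and profile counts are invented
for illustration, and the union-find helper only stands in for the pass's
partition bookkeeping; it is a sketch, not the pass itself.

  /* Toy model of the call-chain-clustering merge loop.  */

  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <numeric>
  #include <string>
  #include <vector>

  struct cluster
  {
    std::vector<std::string> functions;
    int size;
  };

  struct call_edge
  {
    int caller, callee;   /* Indexes into the clusters vector.  */
    uint64_t count;       /* Profile count of caller -> callee calls.  */
  };

  int
  main ()
  {
    const int threshold = 1024;  /* Mirrors C3_CLUSTER_THRESHOLD.  */

    /* One singleton cluster per function (invented sizes).  */
    std::vector<cluster> clusters
      = { { { "main" }, 100 }, { { "parse" }, 400 }, { { "lex" }, 300 },
          { { "error" }, 50 }, { { "codegen" }, 900 } };

    /* Invented profile: parse -> lex is by far the hottest edge.  */
    std::vector<call_edge> edges
      = { { 1, 2, 90000 }, { 1, 4, 20000 }, { 0, 1, 5000 }, { 0, 3, 10 } };
    std::sort (edges.begin (), edges.end (),
               [] (const call_edge &a, const call_edge &b)
               { return a.count > b.count; });

    /* Union-find representatives stand in for merging cluster objects.  */
    std::vector<int> repr (clusters.size ());
    std::iota (repr.begin (), repr.end (), 0);
    auto find = [&repr] (int i)
      { while (repr[i] != i) i = repr[i]; return i; };

    for (const call_edge &e : edges)
      {
        int caller = find (e.caller), callee = find (e.callee);
        if (caller == callee
            || clusters[caller].size + clusters[callee].size > threshold)
          continue;
        /* Append the callee's functions after the caller's.  */
        clusters[caller].size += clusters[callee].size;
        for (const std::string &fn : clusters[callee].functions)
          clusters[caller].functions.push_back (fn);
        clusters[callee].functions.clear ();
        repr[callee] = caller;
      }

    /* Surviving multi-function clusters get consecutive sorted order,
       as in the .text.sorted.* assignment in the pass.  */
    int order = 1;
    for (const cluster &c : clusters)
      {
        if (c.functions.size () <= 1)
          continue;
        for (const std::string &fn : c.functions)
          std::cout << fn << " -> text_sorted_order " << order++ << "\n";
      }
    return 0;
  }

With the 1024-unit threshold everything that still fits keeps merging along
the hottest edges first, which is what keeps hot callers and callees on the
same pages; the oversized "codegen" cluster stays out and keeps no sorted
order.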

Patch
diff mbox series

From d9c81529ae2a66bd51c49d83e19edb58b19cbf97 Mon Sep 17 00:00:00 2001
From: Martin Liska <mliska@suse.cz>
Date: Thu, 5 Sep 2019 13:32:41 +0200
Subject: [PATCH] Add new ipa-reorder pass.

---
 gcc/Makefile.in         |   1 +
 gcc/cgraph.c            |   2 +
 gcc/cgraph.h            |   2 +
 gcc/cgraphclones.c      |   1 +
 gcc/common.opt          |  14 ++
 gcc/flag-types.h        |   8 +
 gcc/ipa-reorder.c       | 333 ++++++++++++++++++++++++++++++++++++++++
 gcc/lto-cgraph.c        |   2 +
 gcc/lto/lto-partition.c |  18 ---
 gcc/passes.def          |   1 +
 gcc/timevar.def         |   1 +
 gcc/tree-pass.h         |   1 +
 gcc/varasm.c            |   9 ++
 13 files changed, 375 insertions(+), 18 deletions(-)
 create mode 100644 gcc/ipa-reorder.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 152df9fa9b3..c0efef626c5 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1368,6 +1368,7 @@  OBJS = \
 	init-regs.o \
 	internal-fn.o \
 	ipa-cp.o \
+	ipa-reorder.o \
 	ipa-devirt.o \
 	ipa-fnsummary.o \
 	ipa-polymorphic-call.o \
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 843891e9e56..a0f350c26c0 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2041,6 +2041,8 @@  cgraph_node::dump (FILE *f)
     }
   if (tp_first_run > 0)
     fprintf (f, " first_run:%i", tp_first_run);
+  if (text_sorted_order > 0)
+    fprintf (f, " text_sorted_order:%i", text_sorted_order);
   if (origin)
     fprintf (f, " nested in:%s", origin->asm_name ());
   if (gimple_has_body_p (decl))
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 4c54210123a..24c0b2be321 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1446,6 +1446,8 @@  struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node
   unsigned int profile_id;
   /* Time profiler: first run of function.  */
   int tp_first_run;
+  /* Order in .text.sorted.* section.  */
+  int text_sorted_order;
 
   /* Set when decl is an abstract function pointed to by the
      ABSTRACT_DECL_ORIGIN of a reachable function.  */
diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
index fa753697c78..4d47b3e7574 100644
--- a/gcc/cgraphclones.c
+++ b/gcc/cgraphclones.c
@@ -462,6 +462,7 @@  cgraph_node::create_clone (tree new_decl, profile_count prof_count,
   new_node->rtl = rtl;
   new_node->frequency = frequency;
   new_node->tp_first_run = tp_first_run;
+  new_node->text_sorted_order = text_sorted_order;
   new_node->tm_clone = tm_clone;
   new_node->icf_merged = icf_merged;
   new_node->merged_comdat = merged_comdat;
diff --git a/gcc/common.opt b/gcc/common.opt
index 1b9e0f3c802..e68f28f6401 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2252,6 +2252,20 @@  freorder-functions
 Common Report Var(flag_reorder_functions) Optimization
 Reorder functions to improve code placement.
 
+freorder-functions-algorithm=
+Common Joined RejectNegative Enum(reorder_functions_algorithm) Var(flag_reorder_functions_algorithm) Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization
+-freorder-functions-algorithm=[first-run|call-chain-clustering]	Set the used function reordering algorithm.
+
+Enum
+Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) UnknownError(unknown function reordering algorithm %qs)
+
+EnumValue
+Enum(reorder_functions_algorithm) String(first-run) Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN)
+
+EnumValue
+Enum(reorder_functions_algorithm) String(call-chain-clustering) Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING)
+
+
 frerun-cse-after-loop
 Common Report Var(flag_rerun_cse_after_loop) Optimization
 Add a common subexpression elimination pass after loop optimizations.
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index a2103282d46..9658aa99ea5 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -138,6 +138,14 @@  enum reorder_blocks_algorithm
   REORDER_BLOCKS_ALGORITHM_STC
 };
 
+/* The algorithm used for function reordering.  */
+enum reorder_functions_algorithm
+{
+  REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN,
+  REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING
+};
+
+
 /* The algorithm used for the integrated register allocator (IRA).  */
 enum ira_algorithm
 {
diff --git a/gcc/ipa-reorder.c b/gcc/ipa-reorder.c
new file mode 100644
index 00000000000..d9ddc468968
--- /dev/null
+++ b/gcc/ipa-reorder.c
@@ -0,0 +1,333 @@ 
+/* Reorder functions based on profile.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "cgraph.h"
+#include "symbol-summary.h"
+#include "tree-vrp.h"
+#include "alloc-pool.h"
+#include "ipa-prop.h"
+#include "ipa-fnsummary.h"
+#include "fibonacci_heap.h"
+#include <limits>
+
+using namespace std;
+
+namespace {
+
+#define C3_CLUSTER_THRESHOLD 1024
+
+struct cluster_edge;
+
+/* A cluster is a set of functions considered by the C3 algorithm.  */
+
+struct cluster
+{
+  cluster (cgraph_node *node, int size, sreal time):
+    m_functions (), m_callers (), m_size (size), m_time (time)
+  {
+    m_functions.safe_push (node);
+  }
+
+  vec<cgraph_node *> m_functions;
+  hash_map <cluster *, cluster_edge *> m_callers;
+  int m_size;
+  sreal m_time;
+};
+
+/* A cluster edge is an oriented edge between two clusters.  */
+
+struct cluster_edge
+{
+  cluster_edge (cluster *caller, cluster *callee, uint32_t count):
+    m_caller (caller), m_callee (callee), m_count (count), m_heap_node (NULL)
+  {}
+
+
+  uint32_t inverted_count ()
+  {
+    return numeric_limits<uint32_t>::max () - m_count;
+  }
+
+  cluster *m_caller;
+  cluster *m_callee;
+  uint32_t m_count;
+  fibonacci_node<uint32_t, cluster_edge> *m_heap_node;
+};
+
+/* Sort functions based on the first execution of the function.  */
+
+static void
+sort_functions_by_first_run (void)
+{
+  cgraph_node *node;
+
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run && !node->alias)
+      node->text_sorted_order = node->tp_first_run;
+}
+
+/* Compare clusters by density after they are established.  */
+
+static int
+cluster_cmp (const void *a_p, const void *b_p)
+{
+  const cluster *a = *(cluster * const *)a_p;
+  const cluster *b = *(cluster * const *)b_p;
+
+  unsigned fncounta = a->m_functions.length ();
+  unsigned fncountb = b->m_functions.length ();
+  if (fncounta <= 1 || fncountb <= 1)
+    return fncountb - fncounta;
+
+  sreal r = b->m_time * a->m_size - a->m_time * b->m_size;
+  return (r < 0) ? -1 : ((r > 0) ? 1 : 0);
+}
+
+/* Sort functions based on call chain clustering, which is an algorithm
+   mentioned in the following article:
+   https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+   .  */
+
+static void
+sort_functions_by_c3 (void)
+{
+  cgraph_node *node;
+  auto_vec<cluster *> clusters;
+
+  /* Create a cluster for each function.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (!node->alias
+	&& !node->global.inlined_to)
+      {
+	ipa_fn_summary *summary = ipa_fn_summaries->get (node);
+	cluster *c = new cluster (node, summary->size, summary->time);
+	node->aux = c;
+	clusters.safe_push (c);
+      }
+
+  auto_vec<cluster_edge *> edges;
+
+  /* Insert edges between clusters that have a profile.  */
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cgraph_node *node = clusters[i]->m_functions[0];
+      for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+	if (cs->count.reliable_p ()
+	    && cs->count.to_gcov_type () > 0)
+	  {
+	    cluster *caller = (cluster *)cs->caller->aux;
+	    cluster *callee = (cluster *)cs->callee->aux;
+	    gcov_type count = cs->count.to_gcov_type ();
+
+	    cluster_edge **cedge = callee->m_callers.get (caller);
+	    if (cedge != NULL)
+	      (*cedge)->m_count += count;
+	    else
+	      {
+		cluster_edge *cedge = new cluster_edge (caller, callee, count);
+		edges.safe_push (cedge);
+		callee->m_callers.put (caller, cedge);
+	      }
+	  }
+    }
+
+  /* Now insert all created edges into a heap.  */
+  fibonacci_heap <uint32_t, cluster_edge> heap (0);
+
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cluster *c = clusters[i];
+      for (hash_map<cluster *, cluster_edge *>::iterator it
+	   = c->m_callers.begin (); it != c->m_callers.end (); ++it)
+	{
+	  cluster_edge *cedge = (*it).second;
+	  cedge->m_heap_node = heap.insert (cedge->inverted_count (), cedge);
+	}
+    }
+
+  while (!heap.empty ())
+    {
+      cluster_edge *cedge = heap.extract_min ();
+      cluster *caller = cedge->m_caller;
+      cluster *callee = cedge->m_callee;
+      cedge->m_heap_node = NULL;
+
+      if (caller == callee)
+	continue;
+      if (caller->m_size + callee->m_size <= C3_CLUSTER_THRESHOLD)
+	{
+	  caller->m_size += callee->m_size;
+	  caller->m_time += callee->m_time;
+
+	  /* Append all cgraph_nodes from callee to caller.  */
+	  for (unsigned i = 0; i < callee->m_functions.length (); i++)
+	    caller->m_functions.safe_push (callee->m_functions[i]);
+
+	  callee->m_functions.truncate (0);
+
+	  /* Iterate all cluster_edges of callee and add them to the caller.  */
+	  for (hash_map<cluster *, cluster_edge *>::iterator it
+	       = callee->m_callers.begin (); it != callee->m_callers.end ();
+	       ++it)
+	    {
+	      (*it).second->m_callee = caller;
+	      cluster_edge **ce = caller->m_callers.get ((*it).first);
+
+	      if (ce != NULL)
+		{
+		  (*ce)->m_count += (*it).second->m_count;
+		  if ((*ce)->m_heap_node != NULL)
+		    heap.decrease_key ((*ce)->m_heap_node,
+				       (*ce)->inverted_count ());
+		}
+	      else
+		caller->m_callers.put ((*it).first, (*it).second);
+	    }
+	}
+    }
+
+  /* Sort the candidate clusters.  */
+  clusters.qsort (cluster_cmp);
+
+  /* Dump clusters.  */
+  if (dump_file)
+    {
+      for (unsigned i = 0; i < clusters.length (); i++)
+	{
+	  cluster *c = clusters[i];
+	  if (c->m_functions.length () <= 1)
+	    continue;
+
+	  fprintf (dump_file, "Cluster %d with functions: %d, size: %d,"
+		   " density: %f\n", i, c->m_functions.length (), c->m_size,
+		   (c->m_time / c->m_size).to_double ());
+	  fprintf (dump_file, "  functions: ");
+	  for (unsigned j = 0; j < c->m_functions.length (); j++)
+	    fprintf (dump_file, "%s ", c->m_functions[j]->dump_name ());
+	  fprintf (dump_file, "\n");
+	}
+      fprintf (dump_file, "\n");
+    }
+
+  /* Assign .text.sorted.* section names.  */
+  int counter = 1;
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cluster *c = clusters[i];
+      if (c->m_functions.length () <= 1)
+	continue;
+
+      for (unsigned j = 0; j < c->m_functions.length (); j++)
+	{
+	  cgraph_node *node = c->m_functions[j];
+
+	  if (dump_file)
+	    fprintf (dump_file, "setting: %d for %s with size:%d\n",
+		     counter, node->dump_asm_name (),
+		     ipa_fn_summaries->get (node)->size);
+	  node->text_sorted_order = counter++;
+	}
+    }
+
+  /* Release memory.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (!node->alias)
+      node->aux = NULL;
+
+  for (unsigned i = 0; i < clusters.length (); i++)
+    delete clusters[i];
+
+  for (unsigned i = 0; i < edges.length (); i++)
+    delete edges[i];
+}
+
+/* The main function for function sorting.  */
+
+static unsigned int
+ipa_reorder (void)
+{
+  switch (flag_reorder_functions_algorithm)
+    {
+    case REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING:
+      sort_functions_by_c3 ();
+      break;
+    case REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN:
+      sort_functions_by_first_run ();
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  return 0;
+}
+
+const pass_data pass_data_ipa_reorder =
+{
+  IPA_PASS, /* type */
+  "reorder", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_REORDER, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_reorder : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_reorder (gcc::context *ctxt)
+    : ipa_opt_pass_d (pass_data_ipa_reorder, ctxt,
+		      NULL, /* generate_summary */
+		      NULL, /* write_summary */
+		      NULL, /* read_summary */
+		      NULL, /* write_optimization_summary */
+		      NULL, /* read_optimization_summary */
+		      NULL, /* stmt_fixup */
+		      0, /* function_transform_todo_flags_start */
+		      NULL, /* function_transform */
+		      NULL) /* variable_transform */
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *) { return ipa_reorder (); }
+
+}; // class pass_ipa_reorder
+
+bool
+pass_ipa_reorder::gate (function *)
+{
+  return flag_profile_reorder_functions && flag_profile_use && flag_wpa;
+}
+
+} // anon namespace
+
+ipa_opt_pass_d *
+make_pass_ipa_reorder (gcc::context *ctxt)
+{
+  return new pass_ipa_reorder (ctxt);
+}
diff --git a/gcc/lto-cgraph.c b/gcc/lto-cgraph.c
index bc0f0107333..0a424a2ad12 100644
--- a/gcc/lto-cgraph.c
+++ b/gcc/lto-cgraph.c
@@ -505,6 +505,7 @@  lto_output_node (struct lto_simple_output_block *ob, struct cgraph_node *node,
     section = "";
 
   streamer_write_hwi_stream (ob->main_stream, node->tp_first_run);
+  streamer_write_hwi_stream (ob->main_stream, node->text_sorted_order);
 
   bp = bitpack_create (ob->main_stream);
   bp_pack_value (&bp, node->local.local, 1);
@@ -1277,6 +1278,7 @@  input_node (struct lto_file_decl_data *file_data,
 		    "node with uid %d", node->get_uid ());
 
   node->tp_first_run = streamer_read_uhwi (ib);
+  node->text_sorted_order = streamer_read_uhwi (ib);
 
   bp = streamer_read_bitpack (ib);
 
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 6972e6e9ef3..8937e64ef72 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -381,24 +381,6 @@  node_cmp (const void *pa, const void *pb)
   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
 
-  /* Profile reorder flag enables function reordering based on first execution
-     of a function. All functions with profile are placed in ascending
-     order at the beginning.  */
-
-  if (flag_profile_reorder_functions)
-  {
-    /* Functions with time profile are sorted in ascending order.  */
-    if (a->tp_first_run && b->tp_first_run)
-      return a->tp_first_run != b->tp_first_run
-	? a->tp_first_run - b->tp_first_run
-        : a->order - b->order;
-
-    /* Functions with time profile are sorted before the functions
-       that do not have the profile.  */
-    if (a->tp_first_run || b->tp_first_run)
-      return b->tp_first_run - a->tp_first_run;
-  }
-
   return b->order - a->order;
 }
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 93879223a8b..6caf2ac5cda 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -153,6 +153,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_ipa_fn_summary);
   NEXT_PASS (pass_ipa_inline);
   NEXT_PASS (pass_ipa_pure_const);
+  NEXT_PASS (pass_ipa_reorder);
   NEXT_PASS (pass_ipa_free_fn_summary, false /* small_p */);
   NEXT_PASS (pass_ipa_reference);
   /* This pass needs to be scheduled after any IP code duplication.   */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 3551a433b1f..068c902424f 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -76,6 +76,7 @@  DEFTIMEVAR (TV_IPA_INHERITANCE       , "ipa inheritance graph")
 DEFTIMEVAR (TV_IPA_VIRTUAL_CALL      , "ipa virtual call target")
 DEFTIMEVAR (TV_IPA_DEVIRT	     , "ipa devirtualization")
 DEFTIMEVAR (TV_IPA_CONSTANT_PROP     , "ipa cp")
+DEFTIMEVAR (TV_IPA_REORDER	     , "ipa reorder")
 DEFTIMEVAR (TV_IPA_INLINING          , "ipa inlining heuristics")
 DEFTIMEVAR (TV_IPA_FNSPLIT           , "ipa function splitting")
 DEFTIMEVAR (TV_IPA_COMDATS	     , "ipa comdats")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 7106eba54af..10bc4cc125d 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -500,6 +500,7 @@  extern ipa_opt_pass_d *make_pass_ipa_inline (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_free_lang_data (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_free_fn_summary (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_cp (gcc::context *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_reorder (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_icf (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
diff --git a/gcc/varasm.c b/gcc/varasm.c
index a7c22523f9f..498b9ed59fd 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -601,6 +601,15 @@  default_function_section (tree decl, enum node_frequency freq,
   if (exit && freq != NODE_FREQUENCY_UNLIKELY_EXECUTED)
     return get_named_text_section (decl, ".text.exit", NULL);
 
+  cgraph_node *node = cgraph_node::get (decl);
+  if (node->text_sorted_order > 0 && freq != NODE_FREQUENCY_UNLIKELY_EXECUTED)
+    {
+      char section_name[64];
+      sprintf (section_name, ".text.sorted.%010d",
+	       node->text_sorted_order);
+      return get_named_text_section (decl, section_name, NULL);
+    }
+
   /* Group cold functions together, similarly for hot code.  */
   switch (freq)
     {
-- 
2.23.0
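
One detail worth spelling out about the varasm.c hunk: the %010d zero
padding keeps the lexicographic order of the generated section names
identical to the numeric text_sorted_order values, which is what the sorted
text output section relies on.  A minimal sketch of that property follows;
the order values are invented for illustration.

  /* Show that zero-padded .text.sorted.* names compare the same way
     lexicographically as their numeric order values.  */

  #include <algorithm>
  #include <cstdio>
  #include <string>
  #include <vector>

  int
  main ()
  {
    std::vector<int> orders = { 3, 12, 1, 100 };   /* Invented values.  */
    std::vector<std::string> sections;

    for (int order : orders)
      {
        char buf[64];
        std::snprintf (buf, sizeof buf, ".text.sorted.%010d", order);
        sections.push_back (buf);
      }

    /* A plain lexicographic sort, as a linker sorting input sections by
       name would do, recovers the numeric order 1, 3, 12, 100.  */
    std::sort (sections.begin (), sections.end ());
    for (const std::string &s : sections)
      std::printf ("%s\n", s.c_str ());
    return 0;
  }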