
[RFC] Add new ipa-reorder pass

Message ID 12f761ce-012d-f9be-ceef-6ad8de6324e8@suse.cz
State New
Series [RFC] Add new ipa-reorder pass

Commit Message

Martin Liška Sept. 19, 2019, 8:33 a.m. UTC
Hi.

Function reordering has been around for quite some time, and a naive
implementation was also part of my diploma thesis some time ago.
Currently, GCC can reorder functions based on first execution time, which
happens with PGO and LTO, of course. A known limitation is that the order
is preserved only partially, as various symbols go into different LTRANS partitions.

There has been some research in the area, and I would point out the Facebook paper
([1]) and the Sony presentation ([2]). Based on those, I decided to make a new implementation
in GCC that does the same (in a proper way). The first part of the enablement is a pair of patches
to ld.bfd and ld.gold that introduce a new section, .text.sorted, which is always sorted.
Here's a snippet from the modified default linker script:

  .text           :
  {
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(SORT(.text.sorted.*))
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf.em.  */
    *(.gnu.warning)
  }

The second part is the reordering pass, where I implemented both approaches:
the existing one (-freorder-functions-algorithm=first-run) and a new one
based on the Facebook paper (-freorder-functions-algorithm=call-chain-clustering).
The implementation is straightforward and utilizes the newly introduced text subsection.

Results:
I made a PGO LTO bootstrap of GCC with the patch and -freorder-functions-algorithm=call-chain-clustering.
The pass computes a sorting order for about 7500 functions, and of these about 7300 really end up in the desired
order. The remaining ones are functions that are later observed to be cold and are eventually put into the .text.unlikely
section. I verified the order with nm (sorted by address) and my -fdump-ipa-reorder dump. The aforementioned
sorted functions occupy 5.0 MB (out of 18.3 MB) of the final cc1plus binary. I made a simple benchmark:

./cc1plus /home/marxin/Programming/tramp3d/tramp3d-v4.ii -O2 -quiet

and ran it 20x before and after my patch:
run #020: old:15.033 s, new: 14.975 s, cmp: 99.61%

The numbers are stable and show a speed-up of about 0.4%. To verify that, I also measured iTLB misses
before:

18,475,504      iTLB-load-misses:u
and after:
13,779,347      iTLB-load-misses:u

Both these numbers are very stable across my measurements. That said, I believe the patch is correct
and can help with huge binaries that have a flat profile.

Notes and limitations:
- The call-chain-clustering algorithm requires fitting as many functions as possible into a page (4K).
  Based on my measurements, that should correspond to ~1000 GIMPLE statements (IPA inliner size). I can
  make it a param in the future.
- One needs modified binutils, and that would probably require configure detection. The only way
  I see is based on ld --version. I'm planning to make the binutils submission soon.
- Right now, I prefer to put functions into .text.sorted.* rather than .text.hot. That's required by the
  algorithm to place hot function calls next to each other.
- Observation: very hot calls are likely inlined, so the clustering reaches quite big functions.

Thoughts? I would definitely welcome any interesting measurement on a bigger load.

Martin

[1] https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
[2] https://llvm.org/devmtg/2018-10/slides/Spencer-Profile%20Guided%20Function%20Layout%20in%20LLVM%20and%20LLD.pdf

Comments

Martin Liška Sept. 24, 2019, 12:05 p.m. UTC | #1
On 9/19/19 10:33 AM, Martin Liška wrote:
> - One needs modified binutils and I that would probably require a configure detection. The only way
>   which I see is based on ld --version. I'm planning to make the binutils submission soon.

The patch submission link:
https://sourceware.org/ml/binutils/2019-09/msg00219.html
Evgeny Kudryashov Sept. 25, 2019, 4:36 p.m. UTC | #2
On 2019-09-19 11:33, Martin Liška wrote:
> Hi.
> 
> Function reordering has been around for quite some time and a naive
> implementation was also part of my diploma thesis some time ago.
> Currently, the GCC can reorder function based on first execution, which
> happens with PGO and LTO of course. Known limitation is that the order
> is preserved only partially as various symbols go into different
> LTRANS partitions.
> 
> There has been some research in the area and I would point out the
> Facebook paper
> ([1]) and Sony presentation ([2]). Based on that, I decided to make a
> new implementation
> in the GCC that does the same (in a proper way). First part of the
> enablement are patches
> to ld.bfd and ld.gold that come up with a new section .text.sorted,
> that is always sorted.
> 
> Thoughts? I would definitely welcome any interesting measurement on a
> bigger load.
> 
> Martin
> 

Hi, Martin!

Some time ago I tried to do the same but didn't go that far.

I also used the C3 algorithm, except that I somehow missed the fact that
ipa_fn_summary contains information about size and time. The linker
option --sort-section=name was used for prototyping. It, obviously,
sorts sections and allows placing functions in the desired order (by
putting them into named sections .text.sorted.NNNN, without patching
linkers or adjusting linker scripts). For testing my implementation I
used several benchmarks from SPEC2006: perlbench, sjeng and gobmk.
Unfortunately, no significant positive changes were obtained.

I've tested the proposed pass on perlbench with train and reference
input (PGO+LTO as a base) but couldn't obtain stable results (still
unclear whether for environment or perlbench-specific reasons).

Evgeny.
Martin Liška Sept. 26, 2019, 9:37 a.m. UTC | #3
On 9/25/19 6:36 PM, Evgeny Kudryashov wrote:
> On 2019-09-19 11:33, Martin Liška wrote:
>> Hi.
>>
>> Function reordering has been around for quite some time and a naive
>> implementation was also part of my diploma thesis some time ago.
>> Currently, the GCC can reorder function based on first execution, which
>> happens with PGO and LTO of course. Known limitation is that the order
>> is preserved only partially as various symbols go into different
>> LTRANS partitions.
>>
>> There has been some research in the area and I would point out the
>> Facebook paper
>> ([1]) and Sony presentation ([2]). Based on that, I decided to make a
>> new implementation
>> in the GCC that does the same (in a proper way). First part of the
>> enablement are patches
>> to ld.bfd and ld.gold that come up with a new section .text.sorted,
>> that is always sorted.
>>
>> Thoughts? I would definitely welcome any interesting measurement on a
>> bigger load.
>>
>> Martin
>>
> 
> Hi, Martin!
> 
> Some time ago I tried to do the same but didn't go that far.

Hello.

> 
> I also used the C3 algorithm, except for the fact that ipa_fn_summary contains information about size and time (somehow missed it).

Which is a key part of the algorithm, as one ideally wants not to cross a page-size boundary.


> The linker option --sort-section=name was used for prototyping. It, obviously, sorts sections and allows to place functions in the desired order (by placing them into named sections .text.sorted.NNNN, without patching linkers or adjusting linker scripts).

Yes, that can work, but having the new section explicitly sorted avoids having to pass an option to the linker.

> For testing my implementation I used several benchmarks from SPEC2006: perlbench, sjeng and gobmk. Unfortunately, no significant positive changes were obtained.
> 
> I've tested the proposed pass on perlbench with train and reference input (PGO+LTO as a base) but couldn't obtain stable results (still unclear environment or perlbench specific reasons).

Yes, it's quite difficult to measure anything. One needs a binary with a huge .text section and a test case that has a flat profile.
I was able to measure results on GCC itself.

Martin

> 
> Evgeny.
>
Fangrui Song Oct. 3, 2019, 4:52 a.m. UTC | #4
On 2019-09-24, Martin Liška wrote:
>On 9/19/19 10:33 AM, Martin Liška wrote:
>> - One needs modified binutils and I that would probably require a configure detection. The only way
>>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
>
>The patch submission link:
>https://sourceware.org/ml/binutils/2019-09/msg00219.html

Hi Martin,

I have a question about why .text.sorted.* are needed.

The Sony presentation (your [2] link) introduced a new section,
.llvm.call-graph-profile [3], to represent edges in the object files. The
linker (lld) collects all .llvm.call-graph-profile sections and does a
C3 layout. There is no need for the new .text.sorted.* sections.

[3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s

(Please CC me. I am not subscribed.)
Andrew Pinski Oct. 3, 2019, 7:21 a.m. UTC | #5
On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
>
> On 2019-09-24, Martin Liška wrote:
> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> - One needs modified binutils and I that would probably require a configure detection. The only way
> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
> >
> >The patch submission link:
> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
>
> Hi Martin,
>
> I have a question about why .text.sorted.* are needed.
>
> The Sony presentation (your [2] link) embedded a new section
> .llvm.call-graph-profile[3] to represent edges in the object files. The
> linker (lld) collects all .llvm.call-graph-profile sections and does a
> C3 layout. There is no need for new section type .text.sorted.*
>
> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
>
> (Please CC me. I am not subscribed.)

The idea of GCC's version is that the modification needed to the
linker is very small.  And even without patching the linker
script/linker, you don't need much that is special, and you don't need LD
to do much work at all.

Thanks,
Andrew
Fangrui Song Oct. 3, 2019, 7:45 a.m. UTC | #6
On 2019-10-03, Andrew Pinski wrote:
>On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
>>
>> On 2019-09-24, Martin Liška wrote:
>> >On 9/19/19 10:33 AM, Martin Liška wrote:
>> >> - One needs modified binutils and I that would probably require a configure detection. The only way
>> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
>> >
>> >The patch submission link:
>> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
>>
>> Hi Martin,
>>
>> I have a question about why .text.sorted.* are needed.
>>
>> The Sony presentation (your [2] link) embedded a new section
>> .llvm.call-graph-profile[3] to represent edges in the object files. The
>> linker (lld) collects all .llvm.call-graph-profile sections and does a
>> C3 layout. There is no need for new section type .text.sorted.*
>>
>> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
>>
>> (Please CC me. I am not subscribed.)
>
>The idea of GCC's version is to that the modification needed to the
>linker is very little.  And even without patching the linker
>script/linker, you don't need to much special and you don't need LD to
>do much work at all.

I am afraid this can be a large limitation. Then .text.sorted.* can
only a) reorder functions within a translation unit, or b) reorder all
functions when LTO is enabled.

b) is possible only if all .text.sorted.* sections can be numbered.

For the LLVM case, the call graph profile can be used without LTO. When both
ThinLTO+PGO are enabled, however, the additional performance improvement
offered by global call graph profile reordering is insignificant,
smaller than 1%.
Andrew Pinski Oct. 3, 2019, 8:08 a.m. UTC | #7
On Thu, Oct 3, 2019 at 12:46 AM Fangrui Song <i@maskray.me> wrote:
>
>
> On 2019-10-03, Andrew Pinski wrote:
> >On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song <i@maskray.me> wrote:
> >>
> >> On 2019-09-24, Martin Liška wrote:
> >> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> >> - One needs modified binutils and I that would probably require a configure detection. The only way
> >> >>   which I see is based on ld --version. I'm planning to make the binutils submission soon.
> >> >
> >> >The patch submission link:
> >> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
> >>
> >> Hi Martin,
> >>
> >> I have a question about why .text.sorted.* are needed.
> >>
> >> The Sony presentation (your [2] link) embedded a new section
> >> .llvm.call-graph-profile[3] to represent edges in the object files. The
> >> linker (lld) collects all .llvm.call-graph-profile sections and does a
> >> C3 layout. There is no need for new section type .text.sorted.*
> >>
> >> [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
> >>
> >> (Please CC me. I am not subscribed.)
> >
> >The idea of GCC's version is to that the modification needed to the
> >linker is very little.  And even without patching the linker
> >script/linker, you don't need to much special and you don't need LD to
> >do much work at all.
>
> I am afraid this can be a large limitation. Then .text.sorted.* can
> only a) reorder functions within a translation unit, b) or reorder all
> functions when LTO is enabled.

I don't think it is that limited.  In fact I think the other way
around is more limited, as you are then constrained by the call
graph data format.  This really reduces the complexity of the linker.
Linkers are already slow; why make them even slower?  With this patch
you only need to change the linker once, and you don't have
dependencies between the linker and compiler.

Thanks,
Andrew

>
> b) is possible only if all .text.sorted.* sections can be numbered.
>
> For the LLVM case, call graph profile can be used without LTO. When both
> ThinLTO+PGO are enabled, however, the additional performance improvement
> offered by the global call graph profile reordering is insignificant,
> smaller than 1%.

See, you said it was less than 1%, so why do it then?

Thanks,
Andrew Pinski
Jeff Law Oct. 4, 2019, 10:06 p.m. UTC | #8
On 9/19/19 2:33 AM, Martin Liška wrote:
> Hi.
> 
> Function reordering has been around for quite some time and a naive
> implementation was also part of my diploma thesis some time ago.
> Currently, the GCC can reorder function based on first execution, which
> happens with PGO and LTO of course. Known limitation is that the order
> is preserved only partially as various symbols go into different LTRANS partitions.
> 
> There has been some research in the area and I would point out the Facebook paper
> ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> in the GCC that does the same (in a proper way). First part of the enablement are patches
> to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> There's a snippet from the modified default linker script:
Funny, I was doing this via linker scripts circa '95; in fact, that's why
we have -ffunction-sections :-)   We started with scripts which
post-processed profiling data to create linker scripts for ELF systems.
We had something for HPUX/SOM as well, but I can't remember what
mechanism we used; it may have been a gross-level sorting using the SOM
section sort key mechanism - it only had 128 or 256 keys, with a notable
number of them reserved.

We had also built a linker with a basic level of interposition circa
1993 and explored various approaches to reordering executables.  I'd
just joined the group at the time and was responsible for wiring up
stuff on the OS side, but eventually got the "pleasure" of owning the
linker server.  A lot of the C3 algorithmic stuff looks similar to what
we did.

Anyway...

I don't see anything objectionable in here.  It's noted as an RFC.  Are
you interested in pushing this forward for gcc-10?

jeff
Jan Hubicka Oct. 6, 2019, 2:37 p.m. UTC | #9
> On 9/19/19 2:33 AM, Martin Liška wrote:
> > Hi.
> > 
> > Function reordering has been around for quite some time and a naive
> > implementation was also part of my diploma thesis some time ago.
> > Currently, the GCC can reorder function based on first execution, which
> > happens with PGO and LTO of course. Known limitation is that the order
> > is preserved only partially as various symbols go into different LTRANS partitions.
> > 
> > There has been some research in the area and I would point out the Facebook paper
> > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> > in the GCC that does the same (in a proper way). First part of the enablement are patches
> > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> > There's a snippet from the modified default linker script:
> Funny, I was doing this via linker scripts circa ~95, in fact that's why
> we have -ffunction-sections :-)   We started with scripts which
> post-processed profiling data to create linker scripts for ELF systems.
>  We had something for HPUX/SOM as well, but I can't remember what
> mechanism we used, it may have been a gross level sorting using the SOM
> section sort key mechanism - it only had 128 or 256 keys with a notable
> amount of them reserved.
> 
> We had also built a linker with a basic level of interposition circa
> 1993 and explored various approaches to reordering executables.  I'd
> just joined the group at the time and was responsible for wiring up
> stuff on the OS side, but eventually got the "pleasure" of owning the
> linker server.  A lot of the C3 algorithmic stuff looks similar to what
> we did.

For reference, I am attaching an early LTO version from circa 2010 :)
> 
> Anyway...
> 
> I don't see anything objectionable in here.  It's noted as an RFC.  Are
> you interested in pushing this forward for gcc-10?

I think the plan is to get some form of code layout pass into GCC 10.  I
will test Martin's version on Firefox and see if it has any effect
there. It is generally quite hard to evaluate these changes (and that is
the reason why I did not move forward with the version from 2010 - more
precisely, it kept falling off the shelf for about a decade).

If LLD has support for cgraph annotations and we support LLD, I think we
should add support for that, too - it will be very useful to compare
individual implementations.
I believe there is gold support (off-tree?) for that too, and something
similar is also supported by other toolchains like AIX?

One problem with the sections approach is that every section gets
aligned to the largest code alignment inside the section. Since our
alignments are (or should be) often cache-line based, we get circa 30 bytes
of waste for every function, which is quite a lot.

This is why we currently have a way to order functions when outputting
them and use that with FDO (using Martin's first-execution logic). This has
the drawback of making the functions flow in that order through late
optimizations and the RTL backend, and thus we lose IPA-RA and some
IP propagation (late pure/const/nothrow discovery).
So this approach has a drawback, too. It is why I was trying to push
myself to get GCC to use gas fragments :)

Anyway, all these issues can be solved incrementally - let's see how
Martin's patch works on Firefox, whether we can get it tested elsewhere, and
start from that.

I will take a look into the details.

Honza
> 
> jeff
> 

Index: tree-pass.h
===================================================================
*** tree-pass.h	(revision 164689)
--- tree-pass.h	(working copy)
*************** extern struct ipa_opt_pass_d pass_ipa_lt
*** 467,472 ****
--- 467,473 ----
  extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
  extern struct ipa_opt_pass_d pass_ipa_profile;
  extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
+ extern struct ipa_opt_pass_d pass_ipa_func_reorder;
  
  extern struct gimple_opt_pass pass_all_optimizations;
  extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
Index: cgraphunit.c
===================================================================
*** cgraphunit.c	(revision 164689)
--- cgraphunit.c	(working copy)
*************** cgraph_inline_p (struct cgraph_edge *e,
*** 1517,1522 ****
--- 1517,1529 ----
    return !e->inline_failed;
  }
  
+ static int
+ node_cmp (const void *pa, const void *pb)
+ {
+   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
+   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
+   return b->order - a->order;
+ }
  
  
  /* Expand all functions that must be output.
*************** cgraph_expand_all_functions (void)
*** 1546,1551 ****
--- 1553,1561 ----
      if (order[i]->process)
        order[new_order_pos++] = order[i];
  
+   if (flag_ipa_reorder)
+     qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
+ 
    for (i = new_order_pos - 1; i >= 0; i--)
      {
        node = order[i];
Index: tree-ssa-ccp.c
===================================================================
*** tree-ssa-ccp.c	(revision 164689)
--- tree-ssa-ccp.c	(working copy)
*************** ccp_fold_stmt (gimple_stmt_iterator *gsi
*** 2307,2312 ****
--- 2307,2324 ----
  		changed = true;
  	      }
  	  }
+ 	if (TREE_CODE (gimple_call_fn (stmt)) == OBJ_TYPE_REF)
+ 	  {
+ 	    tree expr = OBJ_TYPE_REF_EXPR (gimple_call_fn (stmt));
+ 	    expr = valueize_op (expr);
+ 	    if (TREE_CODE (expr) == ADDR_EXPR
+ 	        && TREE_CODE (TREE_OPERAND (expr, 0)) == FUNCTION_DECL)
+ 	     {
+ 	       gimple_call_set_fndecl (stmt, TREE_OPERAND (expr, 0));
+ 	       debug_gimple_stmt (stmt);
+ 	       changed = true;
+ 	     }
+ 	  }
  
  	return changed;
        }
Index: lto-cgraph.c
===================================================================
*** lto-cgraph.c	(revision 164689)
--- lto-cgraph.c	(working copy)
*************** lto_output_node (struct lto_simple_outpu
*** 466,471 ****
--- 466,473 ----
  
    if (tag == LTO_cgraph_analyzed_node)
      {
+       lto_output_uleb128_stream (ob->main_stream,
+ 				 node->order);
        lto_output_sleb128_stream (ob->main_stream,
  				 node->local.inline_summary.estimated_self_stack_size);
        lto_output_sleb128_stream (ob->main_stream,
*************** input_overwrite_node (struct lto_file_de
*** 927,932 ****
--- 929,935 ----
  		      struct cgraph_node *node,
  		      enum LTO_cgraph_tags tag,
  		      struct bitpack_d *bp,
+ 		      int order,
  		      unsigned int stack_size,
  		      unsigned int self_time,
  		      unsigned int time_inlining_benefit,
*************** input_overwrite_node (struct lto_file_de
*** 935,940 ****
--- 938,944 ----
  		      enum ld_plugin_symbol_resolution resolution)
  {
    node->aux = (void *) tag;
+   node->order = order;
    node->local.inline_summary.estimated_self_stack_size = stack_size;
    node->local.inline_summary.self_time = self_time;
    node->local.inline_summary.time_inlining_benefit = time_inlining_benefit;
*************** input_node (struct lto_file_decl_data *f
*** 1027,1032 ****
--- 1031,1037 ----
    unsigned long same_body_count = 0;
    int clone_ref;
    enum ld_plugin_symbol_resolution resolution;
+   int order = 0;
  
    clone_ref = lto_input_sleb128 (ib);
  
*************** input_node (struct lto_file_decl_data *f
*** 1045,1050 ****
--- 1050,1056 ----
  
    if (tag == LTO_cgraph_analyzed_node)
      {
+       order = lto_input_uleb128 (ib);
        stack_size = lto_input_sleb128 (ib);
        self_size = lto_input_sleb128 (ib);
        size_inlining_benefit = lto_input_sleb128 (ib);
*************** input_node (struct lto_file_decl_data *f
*** 1066,1072 ****
  
    bp = lto_input_bitpack (ib);
    resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
!   input_overwrite_node (file_data, node, tag, &bp, stack_size, self_time,
    			time_inlining_benefit, self_size,
  			size_inlining_benefit, resolution);
  
--- 1072,1078 ----
  
    bp = lto_input_bitpack (ib);
    resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
!   input_overwrite_node (file_data, node, tag, &bp, order, stack_size, self_time,
    			time_inlining_benefit, self_size,
  			size_inlining_benefit, resolution);
  
Index: gimple-fold.c
===================================================================
*** gimple-fold.c	(revision 164689)
--- gimple-fold.c	(working copy)
*************** gimple_fold_obj_type_ref (tree ref, tree
*** 1465,1470 ****
--- 1465,1480 ----
    tree known_binfo = known_type ? TYPE_BINFO (known_type) : NULL_TREE;
    tree binfo;
  
+   if (TREE_CODE (obj) == SSA_NAME)
+     {
+       gimple def_stmt = SSA_NAME_DEF_STMT (obj);
+       if (def_stmt
+ 	  && gimple_code (def_stmt) == GIMPLE_ASSIGN
+ 	  && (gimple_assign_single_p (def_stmt)
+ 	      || (gimple_assign_unary_nop_p (def_stmt))))
+ 	obj = gimple_assign_rhs1 (def_stmt);
+     }
+ 
    if (TREE_CODE (obj) == ADDR_EXPR)
      obj = TREE_OPERAND (obj, 0);
  
*************** fold_gimple_call (gimple_stmt_iterator *
*** 1513,1520 ****
           copying EH region info to the new node.  Easier to just do it
           here where we can just smash the call operand.  */
        callee = gimple_call_fn (stmt);
!       if (TREE_CODE (callee) == OBJ_TYPE_REF
!           && TREE_CODE (OBJ_TYPE_REF_OBJECT (callee)) == ADDR_EXPR)
          {
            tree t;
  
--- 1523,1529 ----
           copying EH region info to the new node.  Easier to just do it
           here where we can just smash the call operand.  */
        callee = gimple_call_fn (stmt);
!       if (TREE_CODE (callee) == OBJ_TYPE_REF)
          {
            tree t;
  
Index: common.opt
===================================================================
*** common.opt	(revision 164689)
--- common.opt	(working copy)
*************** fdse
*** 1107,1112 ****
--- 1107,1116 ----
  Common Var(flag_dse) Init(1) Optimization
  Use the RTL dead store elimination pass
  
+ freorder-functions
+ Common Report Var(flag_ipa_reorder) Init(1) Optimization
+ Reorder functions for better locality at execution time
+ 
  freschedule-modulo-scheduled-loops
  Common Report Var(flag_resched_modulo_sched) Optimization
  Enable/Disable the traditional scheduling in loops that already passed modulo scheduling
Index: ipa-reorder.c
===================================================================
*** ipa-reorder.c	(revision 0)
--- ipa-reorder.c	(revision 0)
***************
*** 0 ****
--- 1,567 ----
+ /* IPA function reordering pass.
+    Copyright (C) 2010 Free Software Foundation, Inc.
+    Contributed by Jan Hubicka  <jh@suse.cz>
+ 
+ This file is part of GCC.
+ 
+ GCC is free software; you can redistribute it and/or modify it under
+ the terms of the GNU General Public License as published by the Free
+ Software Foundation; either version 3, or (at your option) any later
+ version.
+ 
+ GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ for more details.
+ 
+ You should have received a copy of the GNU General Public License
+ along with GCC; see the file COPYING3.  If not see
+ <http://www.gnu.org/licenses/>.  */
+ 
+ #include "config.h"
+ #include "system.h"
+ #include "coretypes.h"
+ #include "tm.h"
+ #include "cgraph.h"
+ #include "tree-pass.h"
+ #include "timevar.h"
+ #include "gimple.h"
+ #include "ggc.h"
+ #include "flags.h"
+ #include "pointer-set.h"
+ #include "target.h"
+ #include "tree-iterator.h"
+ #include "fibheap.h"
+ 
+ struct priority
+ {
+   gcov_type priority;
+   struct partition *src_partition;
+   struct partition *dest_partition;
+   fibnode_t node;
+ };
+ 
+ typedef struct priority *priority_ptr;
+ 
+ DEF_VEC_P(priority_ptr);
+ DEF_VEC_ALLOC_P(priority_ptr,heap);
+ 
+ struct partition
+ {
+   int index;
+   VEC(cgraph_node_ptr, heap) *nodes;
+   VEC(priority_ptr, heap) *priorities;
+   VEC(priority_ptr, heap) *in_priorities;
+ };
+ 
+ typedef struct partition *partition_ptr;
+ 
+ DEF_VEC_P(partition_ptr);
+ DEF_VEC_ALLOC_P(partition_ptr,heap);
+ 
+ static bool
+ cgraph_node_will_be_output_p (struct cgraph_node *node)
+ {
+   return (node->analyzed && !node->global.inlined_to);
+ }
+ 
+ static void
+ account_callees (struct partition *part,
+ 		 VEC (partition_ptr, heap) *partitions,
+ 		 struct cgraph_node *node)
+ {
+   struct cgraph_edge *edge;
+   struct priority *p;
+   int i;
+   struct ipa_ref *ref;
+ 
+   for (edge = node->callees; edge; edge = edge->next_callee)
+     if (edge->inline_failed)
+       {
+ 	if (!edge->callee->aux)
+ 	  {
+ 	    struct partition *dest = VEC_index (partition_ptr, partitions,
+ 						edge->callee->uid);
+ 	    if (dest && dest != part)
+ 	      {
+ 		p = XCNEW (struct priority);
+ 		edge->callee->aux = p;
+ 		p->src_partition = part;
+ 		p->dest_partition = dest;
+ 		VEC_safe_push (priority_ptr, heap, part->priorities, p);
+ 		VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
+ 	      }
+ 	    else
+ 	      continue;
+ 	  }
+ 	else
+ 	  p = (struct priority *)edge->callee->aux;
+ 	if (edge->count)
+ 	  p->priority += edge->count + 1;
+ 	else
+ 	  {
+ 	    int mul = 1;
+ 	    switch (edge->caller->frequency)
+ 	      {
+ 	      case NODE_FREQUENCY_UNLIKELY_EXECUTED: mul = 0; break;
+ 	      case NODE_FREQUENCY_EXECUTED_ONCE: mul = 1; break;
+ 	      case NODE_FREQUENCY_NORMAL: mul = 10; break;
+ 	      case NODE_FREQUENCY_HOT: mul = 1000; break;
+ 	      }
+ 	    p->priority += edge->frequency * mul + 1;
+ 	  }
+       }
+     else
+       account_callees (part, partitions, edge->callee);
+ 
+   /* Account taking address of a function as possible call with a low priority.  */
+   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
+     if (ref->refered_type == IPA_REF_CGRAPH)
+       {
+ 	struct cgraph_node *node2 = ipa_ref_node (ref);
+ 	if (!node2->aux)
+ 	  {
+ 	    struct partition *dest = VEC_index (partition_ptr, partitions,
+ 						node2->uid);
+ 	    if (dest && dest != part)
+ 	      {
+ 		p = XCNEW (struct priority);
+ 		node2->aux = p;
+ 		p->src_partition = part;
+ 		p->dest_partition = dest;
+ 		VEC_safe_push (priority_ptr, heap, part->priorities, p);
+ 		VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
+ 	      }
+ 	    else
+ 	      continue;
+ 	  }
+ 	else
+ 	  p = (struct priority *)node2->aux;
+ 	p->priority++;
+       }
+ }
+ 
+ static void
+ clear_callees (struct cgraph_node *node)
+ {
+   struct cgraph_edge *edge;
+   int i;
+   struct ipa_ref *ref;
+ 
+   for (edge = node->callees; edge; edge = edge->next_callee)
+     if (edge->inline_failed)
+       edge->callee->aux = NULL;
+     else
+       clear_callees (edge->callee);
+   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
+     if (ref->refered_type == IPA_REF_CGRAPH)
+       ipa_ref_node (ref)->aux = NULL;
+ }
+ 
+ /* Compare two priority vector entries via indexes of destination
+    partitions.  */
+ 
+ static int
+ priority_cmp (const void *pa, const void *pb)
+ {
+   const struct priority *a = *(const struct priority * const *) pa;
+   const struct priority *b = *(const struct priority * const *) pb;
+   /* Priorities pointing to dead partitions are removed lazily,
+      so just avoid referencing dead memory.  */
+   if (!a->priority)
+     return ! b->priority ? 0 : -1;
+   if (!b->priority)
+     return 1;
+   return a->dest_partition->index - b->dest_partition->index;
+ }
+ 
+ /* Compare two priority vector entries via indexes of destination
+    partitions.  */
+ 
+ static int
+ in_priority_cmp (const void *pa, const void *pb)
+ {
+   const struct priority *a = *(const struct priority * const *) pa;
+   const struct priority *b = *(const struct priority * const *) pb;
+   /* Priorities pointing to dead partitions are removed lazily,
+      so just avoid referencing dead memory.  */
+   if (!a->priority)
+     return ! b->priority ? 0 : -1;
+   if (!b->priority)
+     return 1;
+   return a->src_partition->index - b->src_partition->index;
+ }
+ 
+ static void
+ delete_priority (fibheap_t heap, struct priority *p)
+ {
+   p->priority = 0;
+   p->src_partition = p->dest_partition = NULL;
+   if (p->node)
+     fibheap_delete_node (heap, p->node);
+   p->node = NULL;
+ }
+ 
+ static void
+ merge_partitions (VEC (partition_ptr, heap) *partitions,
+ 		  fibheap_t heap,
+ 		  struct partition *part1,
+ 		  struct partition *part2)
+ {
+   VEC (priority_ptr, heap) *priorities = NULL;
+   VEC (priority_ptr, heap) *in_priorities = NULL;
+   unsigned int i1 = 0, i2 = 0;
+   struct cgraph_node *node;
+   int i;
+ 
+   /* We keep priority lists ordered by indexes of destination partitions
+      to allow effective merging.  */
+   qsort (VEC_address (priority_ptr, part1->priorities),
+ 	 VEC_length (priority_ptr, part1->priorities),
+ 	 sizeof (priority_ptr),
+ 	 priority_cmp);
+   qsort (VEC_address (priority_ptr, part2->priorities),
+ 	 VEC_length (priority_ptr, part2->priorities),
+ 	 sizeof (priority_ptr),
+ 	 priority_cmp);
+   
+   /* Merge priority lists, prune out references of partition to itself.
+      Assume priority lists are ordered by indexes of destination partitions
+      and keep them so.  */
+   while (1)
+     {
+       struct priority *p1, *p2;
+ 
+       while (i1 < VEC_length (priority_ptr, part1->priorities)
+ 	     && !VEC_index (priority_ptr, part1->priorities, i1)->priority)
+ 	i1++;
+       while (i2 < VEC_length (priority_ptr, part2->priorities)
+ 	     && !VEC_index (priority_ptr, part2->priorities, i2)->priority)
+ 	i2++;
+ 
+       if (i1 == VEC_length (priority_ptr, part1->priorities)
+ 	  && i2 == VEC_length (priority_ptr, part2->priorities))
+ 	break;
+ 
+       if (i1 < VEC_length (priority_ptr, part1->priorities)
+ 	  && (i2 == VEC_length (priority_ptr, part2->priorities)
+ 	      || (VEC_index (priority_ptr, part1->priorities,
+ 			     i1)->dest_partition->index
+ 		  < VEC_index (priority_ptr, part2->priorities,
+ 			       i2)->dest_partition->index)))
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->priorities, i1);
+ 	  if (p1->dest_partition != part2)
+ 	    VEC_safe_push (priority_ptr, heap, priorities, p1);
+ 	  else
+ 	    delete_priority (heap, p1);
+ 	  i1++;
+ 	}
+       else if (i2 < VEC_length (priority_ptr, part2->priorities)
+ 	       && (i1 == VEC_length (priority_ptr, part1->priorities)
+ 		   || (VEC_index (priority_ptr, part1->priorities,
+ 				  i1)->dest_partition->index
+ 		       > VEC_index (priority_ptr, part2->priorities,
+ 				    i2)->dest_partition->index)))
+ 	{
+ 	  p2 = VEC_index (priority_ptr, part2->priorities, i2);
+ 	  if (p2->dest_partition != part1)
+ 	    {
+ 	      VEC_safe_push (priority_ptr, heap, priorities, p2);
+ 	      p2->src_partition = part1;
+ 	    }
+ 	  else
+ 	    delete_priority (heap, p2);
+ 	  i2++;
+ 	}
+       else
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->priorities, i1);
+ 	  p2 = VEC_index (priority_ptr, part2->priorities, i2);
+ 	  p1->priority += p2->priority;
+ 	  fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
+ 	  delete_priority (heap, p2);
+ 	  VEC_safe_push (priority_ptr, heap, priorities, p1);
+ 	  i1++;
+ 	  i2++;
+ 	}
+     }
+   VEC_free (priority_ptr, heap, part1->priorities);
+   part1->priorities = priorities;
+ 
+   qsort (VEC_address (priority_ptr, part1->in_priorities),
+ 	 VEC_length (priority_ptr, part1->in_priorities),
+ 	 sizeof (priority_ptr),
+ 	 in_priority_cmp);
+   qsort (VEC_address (priority_ptr, part2->in_priorities),
+ 	 VEC_length (priority_ptr, part2->in_priorities),
+ 	 sizeof (priority_ptr),
+ 	 in_priority_cmp);
+   i1 = 0;
+   i2 = 0;
+   while (1)
+     {
+       struct priority *p1, *p2;
+       while (i1 < VEC_length (priority_ptr, part1->in_priorities)
+ 	     && !VEC_index (priority_ptr, part1->in_priorities, i1)->priority)
+ 	i1++;
+       while (i2 < VEC_length (priority_ptr, part2->in_priorities)
+ 	     && !VEC_index (priority_ptr, part2->in_priorities, i2)->priority)
+ 	i2++;
+ 
+       if (i1 == VEC_length (priority_ptr, part1->in_priorities)
+ 	  && i2 == VEC_length (priority_ptr, part2->in_priorities))
+ 	break;
+ 
+       if (i1 < VEC_length (priority_ptr, part1->in_priorities)
+ 	  && (i2 == VEC_length (priority_ptr, part2->in_priorities)
+ 	      || (VEC_index (priority_ptr, part1->in_priorities,
+ 			     i1)->src_partition->index
+ 		  < VEC_index (priority_ptr, part2->in_priorities,
+ 			       i2)->src_partition->index)))
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
+ 	  if (p1->src_partition != part2)
+ 	    VEC_safe_push (priority_ptr, heap, in_priorities, p1);
+ 	  else
+ 	    delete_priority (heap, p1);
+ 	  i1++;
+ 	}
+       else if (i2 < VEC_length (priority_ptr, part2->in_priorities)
+ 	       && (i1 == VEC_length (priority_ptr, part1->in_priorities)
+ 		   || (VEC_index (priority_ptr, part1->in_priorities,
+ 				  i1)->src_partition->index
+ 		       > VEC_index (priority_ptr, part2->in_priorities,
+ 				    i2)->src_partition->index)))
+ 	{
+ 	  p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
+ 	  if (p2->src_partition != part1)
+ 	    {
+ 	      VEC_safe_push (priority_ptr, heap, in_priorities, p2);
+ 	      p2->dest_partition = part1;
+ 	    }
+ 	  else
+ 	    delete_priority (heap, p2);
+ 	  i2++;
+ 	}
+       else
+ 	{
+ 	  p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
+ 	  p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
+ 	  p1->priority += p2->priority;
+ 	  fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
+ 	  delete_priority (heap, p2);
+ 	  VEC_safe_push (priority_ptr, heap, in_priorities, p1);
+ 	  i1++;
+ 	  i2++;
+ 	}
+     }
+   VEC_free (priority_ptr, heap, part1->in_priorities);
+   part1->in_priorities = in_priorities;
+ 
+   for (i = 0; VEC_iterate (cgraph_node_ptr, part2->nodes, i, node); i++)
+     VEC_safe_push (cgraph_node_ptr, heap, part1->nodes, node);
+ 
+   VEC_replace (partition_ptr, partitions, part2->index, NULL);
+   VEC_free (priority_ptr, heap, part2->priorities);
+   VEC_free (priority_ptr, heap, part2->in_priorities);
+   VEC_free (cgraph_node_ptr, heap, part2->nodes);
+   free (part2);
+ }
+ 
+ static void
+ dump_partition (struct partition *part)
+ {
+   int i;
+   struct cgraph_node *node;
+   struct priority *p;
+   fprintf (dump_file, "  Partition %i:", part->index);
+   for (i = 0; VEC_iterate (cgraph_node_ptr, part->nodes, i, node); i++)
+     fprintf (dump_file, "  %s/%i", cgraph_node_name (node), node->uid);
+   fprintf (dump_file, "\n    priorities:");
+   for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
+     if (p->priority)
+       {
+ 	gcc_assert (p->src_partition == part);
+ 	gcc_assert (p->dest_partition != part);
+ 	fprintf (dump_file, "  %i:%i", p->dest_partition->index,
+ 		 (int)p->priority);
+       }
+   fprintf (dump_file, "\n");
+ }
+ 
+ static unsigned int
+ ipa_func_reorder (void)
+ {
+   struct cgraph_node *node;
+   VEC (partition_ptr, heap) *partitions = VEC_alloc (partition_ptr, heap, cgraph_max_uid);
+   int i;
+   struct partition *part, *first_part = NULL;
+   int freq;
+   fibheap_t heap;
+ 
+   if (!cgraph_max_uid)
+     return 0;
+ 
+   heap = fibheap_new ();
+   VEC_safe_grow_cleared (partition_ptr, heap, partitions, cgraph_max_uid);
+ 
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	part = XCNEW (struct partition);
+ 	part->index = node->uid;
+ 	VEC_safe_push (cgraph_node_ptr, heap, part->nodes, node);
+ 	VEC_replace (partition_ptr, partitions, node->uid, part);
+       }
+ 
+   if (dump_file)
+     fprintf (dump_file, "\n\nCreating partitions\n");
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	part = VEC_index (partition_ptr, partitions, node->uid);
+ 	account_callees (part, partitions, node);
+ 	clear_callees (node);
+ 	if (dump_file)
+ 	  dump_partition (part);
+       }
+   for (node = cgraph_nodes; node; node = node->next)
+     if (cgraph_node_will_be_output_p (node))
+       {
+ 	struct priority *p;
+ 	part = VEC_index (partition_ptr, partitions, node->uid);
+ 
+ 	for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
+ 	  p->node = fibheap_insert (heap, INT_MAX - p->priority, p);
+ 	if (dump_file)
+ 	  dump_partition (part);
+       }
+ 
+   if (dump_file)
+     fprintf (dump_file, "\n\nMerging by priorities\n");
+   while (!fibheap_empty (heap))
+     {
+       struct priority *p = (struct priority *)fibheap_extract_min (heap);
+       struct partition *part = p->src_partition;
+       p->node = NULL;
+       if (dump_file)
+ 	{
+ 	  fprintf (dump_file,
+ 	           "Concatenating partitions %i and %i, priority %i\n",
+ 		   p->src_partition->index,
+ 		   p->dest_partition->index,
+ 		   (int)p->priority);
+ 	  if (dump_file)
+ 	    dump_partition (p->src_partition);
+ 	  if (dump_file)
+ 	    dump_partition (p->dest_partition);
+ 	}
+       merge_partitions (partitions, heap, p->src_partition, p->dest_partition);
+       if (dump_file)
+ 	dump_partition (part);
+     }
+ 
+   /* We ran out of calls to merge.  Try to arrange remaining partitions
+      approximately in execution order: static constructors first, followed
+      by partition containing function main ()
+      followed by hot sections of the program.  */
+   if (dump_file)
+     fprintf (dump_file, "\n\nLooking for static constructors\n");
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     if (part && part != first_part
+ 	&& DECL_STATIC_CONSTRUCTOR (VEC_index (cgraph_node_ptr,
+ 					       part->nodes, 0)->decl))
+       {
+ 	if (dump_file)
+ 	  dump_partition (part);
+ 	if (!first_part)
+ 	 first_part = part;
+ 	else
+ 	 merge_partitions (partitions, heap, first_part, part);
+       }
+   if (dump_file)
+     fprintf (dump_file, "\n\nLooking for main\n");
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     if (part && part != first_part
+ 	&& MAIN_NAME_P (DECL_NAME (VEC_index (cgraph_node_ptr,
+ 					      part->nodes, 0)->decl)))
+       {
+ 	if (dump_file)
+ 	  dump_partition (part);
+ 	if (!first_part)
+ 	 first_part = part;
+ 	else
+ 	 merge_partitions (partitions, heap, first_part, part);
+       }
+   if (dump_file)
+     fprintf (dump_file, "\n\nMerging by frequency\n");
+   for (freq = NODE_FREQUENCY_HOT; freq >= NODE_FREQUENCY_UNLIKELY_EXECUTED;
+        freq--)
+     {
+       for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+ 	if (part && part != first_part
+ 	    && VEC_index (cgraph_node_ptr, part->nodes, 0)->frequency == freq)
+ 	  {
+ 	    if (dump_file)
+ 	      dump_partition (part);
+ 	    if (!first_part)
+ 	     first_part = part;
+ 	    else
+ 	     merge_partitions (partitions, heap, first_part, part);
+ 	  }
+     }
+   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
+     gcc_assert (!part || part == first_part);
+ 
+   fibheap_delete (heap);
+   if (!first_part)
+     return 0;
+   if (dump_file)
+     {
+       fprintf (dump_file, "\n\nFinal order\n");
+       dump_partition (first_part);
+     }
+ 
+   /* Store the resulting order and kill the single remaining partition.  */
+   for (i = 0; VEC_iterate (cgraph_node_ptr, first_part->nodes, i, node); i++)
+     node->order = i;
+   VEC_free (priority_ptr, heap, first_part->priorities);
+   VEC_free (cgraph_node_ptr, heap, first_part->nodes);
+   free (first_part);
+   return 0;
+ }
+ 
+ static bool
+ gate_ipa_func_reorder (void)
+ {
+   return optimize && flag_ipa_reorder && flag_toplevel_reorder;
+ }
+ 
+ struct ipa_opt_pass_d pass_ipa_func_reorder =
+ {
+  {
+   IPA_PASS,
+   "reorder",				/* name */
+   gate_ipa_func_reorder,		/* gate */
+   ipa_func_reorder,		        /* execute */
+   NULL,					/* sub */
+   NULL,					/* next */
+   0,					/* static_pass_number */
+   TV_CGRAPHOPT,			        /* tv_id */
+   0,	                                /* properties_required */
+   0,					/* properties_provided */
+   0,					/* properties_destroyed */
+   0,					/* todo_flags_start */
+   0                                     /* todo_flags_finish */
+  },
+  NULL,				        /* generate_summary */
+  NULL,					/* write_summary */
+  NULL,					/* read_summary */
+  NULL,					/* write_optimization_summary */
+  NULL,					/* read_optimization_summary */
+  NULL,					/* stmt_fixup */
+  0,					/* TODOs */
+  NULL,			                /* function_transform */
+  NULL					/* variable_transform */
+ };
Index: Makefile.in
===================================================================
*** Makefile.in	(revision 164689)
--- Makefile.in	(working copy)
*************** OBJS-archive = \
*** 1461,1466 ****
--- 1461,1467 ----
  	ipa-type-escape.o \
  	ipa-utils.o \
  	ipa.o \
+ 	ipa-reorder.o \
  	matrix-reorg.o \
  	prefix.o \
  	tree-inline.o \
*************** varpool.o : varpool.c $(CONFIG_H) $(SYST
*** 3012,3017 ****
--- 3013,3020 ----
     $(TREE_FLOW_H) gt-varpool.h
  ipa.o : ipa.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(CGRAPH_H) \
     $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
+ ipa-reorder.o : ipa-reorder.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
+    $(CGRAPH_H) $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
  ipa-prop.o : ipa-prop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
     langhooks.h $(GGC_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(DIAGNOSTIC_H) \
     $(TREE_FLOW_H) $(TM_H) $(TREE_PASS_H) $(FLAGS_H) $(TREE_H) \
Index: passes.c
===================================================================
*** passes.c	(revision 164689)
--- passes.c	(working copy)
*************** init_optimization_passes (void)
*** 818,823 ****
--- 818,824 ----
    NEXT_PASS (pass_ipa_type_escape);
    NEXT_PASS (pass_ipa_pta);
    NEXT_PASS (pass_ipa_struct_reorg);
+   NEXT_PASS (pass_ipa_func_reorder);
    *p = NULL;
  
    p = &all_lto_gen_passes;
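The core of the `ipa_func_reorder` pass above is a greedy merge loop: each cross-partition call edge carries a priority, the highest-priority pending edge is repeatedly extracted from a heap, and its source and destination partitions are concatenated. A rough standalone sketch of that loop follows; it is a simplified model only (the real pass uses GCC's fibheap and VEC APIs and merges the per-partition priority lists as it goes). Python's `heapq` is a min-heap, so negated priorities play the role of the `INT_MAX - priority` keys in the patch.

```python
import heapq

def merge_by_priority(funcs, edges):
    """Greedy partition merging driven by edge priority, hottest first.

    funcs: iterable of function names, edges: (src, dst, weight) tuples.
    Simplified model of the pass's main loop, not the actual GCC code."""
    part = {f: [f] for f in funcs}      # partition leader -> its functions
    leader = {f: f for f in funcs}      # function -> its partition leader
    heap = [(-w, src, dst) for src, dst, w in edges]
    heapq.heapify(heap)
    while heap:
        _, src, dst = heapq.heappop(heap)
        a, b = leader[src], leader[dst]
        if a == b:                      # edge now internal to a partition
            continue
        part[a] += part[b]              # concatenate, caller's chain first
        for f in part[b]:
            leader[f] = a
        del part[b]
    return part

parts = merge_by_priority(["main", "a", "b"],
                          [("main", "a", 100), ("a", "b", 50)])
print(parts)  # {'main': ['main', 'a', 'b']}
```

The surviving partition's node order is then what the pass stores into `node->order` for the later qsort in `cgraph_expand_all_functions`.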
Richard Biener Oct. 7, 2019, 7:01 a.m. UTC | #10
On Sun, Oct 6, 2019 at 4:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 9/19/19 2:33 AM, Martin Liška wrote:
> > > Hi.
> > >
> > > Function reordering has been around for quite some time and a naive
> > > implementation was also part of my diploma thesis some time ago.
> > > Currently, the GCC can reorder function based on first execution, which
> > > happens with PGO and LTO of course. Known limitation is that the order
> > > is preserved only partially as various symbols go into different LTRANS partitions.
> > >
> > > There has been some research in the area and I would point out the Facebook paper
> > > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation
> > > in the GCC that does the same (in a proper way). First part of the enablement are patches
> > > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted.
> > > There's a snippet from the modified default linker script:
> > Funny, I was doing this via linker scripts circa ~95, in fact that's why
> > we have -ffunction-sections :-)   We started with scripts which
> > post-processed profiling data to create linker scripts for ELF systems.
> >  We had something for HPUX/SOM as well, but I can't remember what
> > mechanism we used, it may have been a gross level sorting using the SOM
> > section sort key mechanism - it only had 128 or 256 keys with a notable
> > amount of them reserved.
> >
> > We had also built a linker with a basic level of interposition circa
> > 1993 and explored various approaches to reordering executables.  I'd
> > just joined the group at the time and was responsible for wiring up
> > stuff on the OS side, but eventually got the "pleasure" of owning the
> > linker server.  A lot of the C3 algorithmic stuff looks similar to what
> > we did.
>
> For reference I am attaching early LTO version from circa 2010 :)
> >
> > Anyway...
> >
> > I don't see anything objectionable in here.  It's noted as an RFC.  Are
> > you interested in pushing this forward for gcc-10?
>
> I think the plan is to get some form of code layout pass into GCC 10.  I
> will test Martin's version on Firefox and see if it has any effect
> here. It is generally quite hard to evaluate those changes (and that is
> the reason why I did not move forward with the version from 2010 - more
> precisely, it kept falling off the shelf for about a decade).
>
> If LLD has support for cgraph annotations and we support LLD, I think we
> should add support for that, too - it will be very useful to compare
> individual implementations.
> I believe there is gold support (off tree?) for that too and something
> similar is also supported by other toolchains like AIX?
>
> One problem with the sections approach is that every section gets
> aligned to the largest code alignment inside the section. Since our
> alignments are (or should be) often cache-line based, we get circa 30
> bytes of waste for every function, which is quite a lot.
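(The circa-30-byte figure follows from simple arithmetic; here is a rough back-of-the-envelope sketch, assuming 64-byte cache-line alignment and function end offsets uniformly distributed within a line, which is an idealization.)

```python
# Average per-function padding waste when every function must start on a
# cache-line boundary.  Assumes 64-byte lines and uniformly distributed
# function end offsets -- an idealized model, not measured data.
ALIGN = 64

def padding(end_offset):
    """Bytes of padding needed to reach the next ALIGN boundary."""
    return (-end_offset) % ALIGN

# Average over all possible end offsets within one cache line.
avg_waste = sum(padding(off) for off in range(ALIGN)) / ALIGN
print(avg_waste)  # 31.5 -- roughly the "circa 30 bytes" quoted above
```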
>
> This is why we currently have a way to order functions when outputting
> them and use that with FDO (using Martin's first-execution logic). This
> has the drawback of making the functions flow in that order through late
> optimizations and the RTL backend, and thus we lose IPA-RA and some
> IPA propagation (late pure/const/nothrow discovery).

But you can also fix that by the parallelization GSoC project approach,
decoupling things at RTL expansion IPA-wise and using output order
for the latter (or even do the fragments in GCC itself by refactoring the
output machinery to use function-local "strings", assembling the final
file later).  Refactoring output to handle multiple output "files" at the
same time might also help get rid of that early-LTO-debug copying
stuff (now that it seems that linker-plugin extensions for partially claiming
files will never happen...)

> So this approach has a drawback, too. It is why I was trying to push
> myself to get GCC to use gas fragments :)
>
> Anyway, all these issues can be solved incrementally - let's see how
> Martin's patch works on Firefox and whether we can get it tested
> elsewhere, and start from that.
>
> I will take a look into the details.
>
> Honza
> >
> > jeff
> >
>
> Index: tree-pass.h
> ===================================================================
> *** tree-pass.h (revision 164689)
> --- tree-pass.h (working copy)
> *************** extern struct ipa_opt_pass_d pass_ipa_lt
> *** 467,472 ****
> --- 467,473 ----
>   extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
>   extern struct ipa_opt_pass_d pass_ipa_profile;
>   extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
> + extern struct ipa_opt_pass_d pass_ipa_func_reorder;
>
>   extern struct gimple_opt_pass pass_all_optimizations;
>   extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
> Index: cgraphunit.c
> ===================================================================
> *** cgraphunit.c        (revision 164689)
> --- cgraphunit.c        (working copy)
> *************** cgraph_inline_p (struct cgraph_edge *e,
> *** 1517,1522 ****
> --- 1517,1529 ----
>     return !e->inline_failed;
>   }
>
> + static int
> + node_cmp (const void *pa, const void *pb)
> + {
> +   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
> +   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
> +   return b->order - a->order;
> + }
>
>
>   /* Expand all functions that must be output.
> *************** cgraph_expand_all_functions (void)
> *** 1546,1551 ****
> --- 1553,1561 ----
>       if (order[i]->process)
>         order[new_order_pos++] = order[i];
>
> +   if (flag_ipa_reorder)
> +     qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
> +
>     for (i = new_order_pos - 1; i >= 0; i--)
>       {
>         node = order[i];
> Index: tree-ssa-ccp.c
> ===================================================================
> *** tree-ssa-ccp.c      (revision 164689)
> --- tree-ssa-ccp.c      (working copy)
> *************** ccp_fold_stmt (gimple_stmt_iterator *gsi
> *** 2307,2312 ****
> --- 2307,2324 ----
>                 changed = true;
>               }
>           }
> +       if (TREE_CODE (gimple_call_fn (stmt)) == OBJ_TYPE_REF)
> +         {
> +           tree expr = OBJ_TYPE_REF_EXPR (gimple_call_fn (stmt));
> +           expr = valueize_op (expr);
> +           if (TREE_CODE (expr) == ADDR_EXPR
> +               && TREE_CODE (TREE_OPERAND (expr, 0)) == FUNCTION_DECL)
> +            {
> +              gimple_call_set_fndecl (stmt, TREE_OPERAND (expr, 0));
> +              debug_gimple_stmt (stmt);
> +              changed = true;
> +            }
> +         }
>
>         return changed;
>         }
> Index: lto-cgraph.c
> ===================================================================
> *** lto-cgraph.c        (revision 164689)
> --- lto-cgraph.c        (working copy)
> *************** lto_output_node (struct lto_simple_outpu
> *** 466,471 ****
> --- 466,473 ----
>
>     if (tag == LTO_cgraph_analyzed_node)
>       {
> +       lto_output_uleb128_stream (ob->main_stream,
> +                                node->order);
>         lto_output_sleb128_stream (ob->main_stream,
>                                  node->local.inline_summary.estimated_self_stack_size);
>         lto_output_sleb128_stream (ob->main_stream,
> *************** input_overwrite_node (struct lto_file_de
> *** 927,932 ****
> --- 929,935 ----
>                       struct cgraph_node *node,
>                       enum LTO_cgraph_tags tag,
>                       struct bitpack_d *bp,
> +                     int order,
>                       unsigned int stack_size,
>                       unsigned int self_time,
>                       unsigned int time_inlining_benefit,
> *************** input_overwrite_node (struct lto_file_de
> *** 935,940 ****
> --- 938,944 ----
>                       enum ld_plugin_symbol_resolution resolution)
>   {
>     node->aux = (void *) tag;
> +   node->order = order;
>     node->local.inline_summary.estimated_self_stack_size = stack_size;
>     node->local.inline_summary.self_time = self_time;
>     node->local.inline_summary.time_inlining_benefit = time_inlining_benefit;
> *************** input_node (struct lto_file_decl_data *f
> *** 1027,1032 ****
> --- 1031,1037 ----
>     unsigned long same_body_count = 0;
>     int clone_ref;
>     enum ld_plugin_symbol_resolution resolution;
> +   int order = 0;
>
>     clone_ref = lto_input_sleb128 (ib);
>
> *************** input_node (struct lto_file_decl_data *f
> *** 1045,1050 ****
> --- 1050,1056 ----
>
>     if (tag == LTO_cgraph_analyzed_node)
>       {
> +       order = lto_input_uleb128 (ib);
>         stack_size = lto_input_sleb128 (ib);
>         self_size = lto_input_sleb128 (ib);
>         size_inlining_benefit = lto_input_sleb128 (ib);
> *************** input_node (struct lto_file_decl_data *f
> *** 1066,1072 ****
>
>     bp = lto_input_bitpack (ib);
>     resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
> !   input_overwrite_node (file_data, node, tag, &bp, stack_size, self_time,
>                         time_inlining_benefit, self_size,
>                         size_inlining_benefit, resolution);
>
> --- 1072,1078 ----
>
>     bp = lto_input_bitpack (ib);
>     resolution = (enum ld_plugin_symbol_resolution)lto_input_uleb128 (ib);
> !   input_overwrite_node (file_data, node, tag, &bp, order, stack_size, self_time,
>                         time_inlining_benefit, self_size,
>                         size_inlining_benefit, resolution);
>
> Index: gimple-fold.c
> ===================================================================
> *** gimple-fold.c       (revision 164689)
> --- gimple-fold.c       (working copy)
> *************** gimple_fold_obj_type_ref (tree ref, tree
> *** 1465,1470 ****
> --- 1465,1480 ----
>     tree known_binfo = known_type ? TYPE_BINFO (known_type) : NULL_TREE;
>     tree binfo;
>
> +   if (TREE_CODE (obj) == SSA_NAME)
> +     {
> +       gimple def_stmt = SSA_NAME_DEF_STMT (obj);
> +       if (def_stmt
> +         && gimple_code (def_stmt) == GIMPLE_ASSIGN
> +         && (gimple_assign_single_p (def_stmt)
> +             || (gimple_assign_unary_nop_p (def_stmt))))
> +       obj = gimple_assign_rhs1 (def_stmt);
> +     }
> +
>     if (TREE_CODE (obj) == ADDR_EXPR)
>       obj = TREE_OPERAND (obj, 0);
>
> *************** fold_gimple_call (gimple_stmt_iterator *
> *** 1513,1520 ****
>            copying EH region info to the new node.  Easier to just do it
>            here where we can just smash the call operand.  */
>         callee = gimple_call_fn (stmt);
> !       if (TREE_CODE (callee) == OBJ_TYPE_REF
> !           && TREE_CODE (OBJ_TYPE_REF_OBJECT (callee)) == ADDR_EXPR)
>           {
>             tree t;
>
> --- 1523,1529 ----
>            copying EH region info to the new node.  Easier to just do it
>            here where we can just smash the call operand.  */
>         callee = gimple_call_fn (stmt);
> !       if (TREE_CODE (callee) == OBJ_TYPE_REF)
>           {
>             tree t;
>
> Index: common.opt
> ===================================================================
> *** common.opt  (revision 164689)
> --- common.opt  (working copy)
> *************** fdse
> *** 1107,1112 ****
> --- 1107,1116 ----
>   Common Var(flag_dse) Init(1) Optimization
>   Use the RTL dead store elimination pass
>
> + freorder-functions
> + Common Report Var(flag_ipa_reorder) Init(1) Optimization
> + Reorder functions for better locality at execution time
> +
>   freschedule-modulo-scheduled-loops
>   Common Report Var(flag_resched_modulo_sched) Optimization
>   Enable/Disable the traditional scheduling in loops that already passed modulo scheduling
> Index: ipa-reorder.c
> ===================================================================
> *** ipa-reorder.c       (revision 0)
> --- ipa-reorder.c       (revision 0)
> ***************
> *** 0 ****
> --- 1,567 ----
> + /* IPA function reordering pass.
> +    Copyright (C) 2010 Free Software Foundation, Inc.
> +    Contributed by Jan Hubicka  <jh@suse.cz>
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify it under
> + the terms of the GNU General Public License as published by the Free
> + Software Foundation; either version 3, or (at your option) any later
> + version.
> +
> + GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> + WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with GCC; see the file COPYING3.  If not see
> + <http://www.gnu.org/licenses/>.  */
> +
> + #include "config.h"
> + #include "system.h"
> + #include "coretypes.h"
> + #include "tm.h"
> + #include "cgraph.h"
> + #include "tree-pass.h"
> + #include "timevar.h"
> + #include "gimple.h"
> + #include "ggc.h"
> + #include "flags.h"
> + #include "pointer-set.h"
> + #include "target.h"
> + #include "tree-iterator.h"
> + #include "fibheap.h"
> +
> + struct priority
> + {
> +   gcov_type priority;
> +   struct partition *src_partition;
> +   struct partition *dest_partition;
> +   fibnode_t node;
> + };
> +
> + typedef struct priority *priority_ptr;
> +
> + DEF_VEC_P(priority_ptr);
> + DEF_VEC_ALLOC_P(priority_ptr,heap);
> +
> + struct partition
> + {
> +   int index;
> +   VEC(cgraph_node_ptr, heap) *nodes;
> +   VEC(priority_ptr, heap) *priorities;
> +   VEC(priority_ptr, heap) *in_priorities;
> + };
> +
> + typedef struct partition *partition_ptr;
> +
> + DEF_VEC_P(partition_ptr);
> + DEF_VEC_ALLOC_P(partition_ptr,heap);
> +
> + static bool
> + cgraph_node_will_be_output_p (struct cgraph_node *node)
> + {
> +   return (node->analyzed && !node->global.inlined_to);
> + }
> +
> + static void
> + account_callees (struct partition *part,
> +                VEC (partition_ptr, heap) *partitions,
> +                struct cgraph_node *node)
> + {
> +   struct cgraph_edge *edge;
> +   struct priority *p;
> +   int i;
> +   struct ipa_ref *ref;
> +
> +   for (edge = node->callees; edge; edge = edge->next_callee)
> +     if (edge->inline_failed)
> +       {
> +       if (!edge->callee->aux)
> +         {
> +           struct partition *dest = VEC_index (partition_ptr, partitions,
> +                                               edge->callee->uid);
> +           if (dest && dest != part)
> +             {
> +               p = XCNEW (struct priority);
> +               edge->callee->aux = p;
> +               p->src_partition = part;
> +               p->dest_partition = dest;
> +               VEC_safe_push (priority_ptr, heap, part->priorities, p);
> +               VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
> +             }
> +           else
> +             continue;
> +         }
> +       else
> +         p = (struct priority *)edge->callee->aux;
> +       if (edge->count)
> +         p->priority += edge->count + 1;
> +       else
> +         {
> +           int mul = 1;
> +           switch (edge->caller->frequency)
> +             {
> +             case NODE_FREQUENCY_UNLIKELY_EXECUTED: mul = 0; break;
> +             case NODE_FREQUENCY_EXECUTED_ONCE: mul = 1; break;
> +             case NODE_FREQUENCY_NORMAL: mul = 10; break;
> +             case NODE_FREQUENCY_HOT: mul = 1000; break;
> +             }
> +           p->priority += edge->frequency * mul + 1;
> +         }
> +       }
> +     else
> +       account_callees (part, partitions, edge->callee);
> +
> +   /* Account taking address of a function as possible call with a low priority.  */
> +   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
> +     if (ref->refered_type == IPA_REF_CGRAPH)
> +       {
> +       struct cgraph_node *node2 = ipa_ref_node (ref);
> +       if (!node2->aux)
> +         {
> +           struct partition *dest = VEC_index (partition_ptr, partitions,
> +                                               node2->uid);
> +           if (dest && dest != part)
> +             {
> +               p = XCNEW (struct priority);
> +               node2->aux = p;
> +               p->src_partition = part;
> +               p->dest_partition = dest;
> +               VEC_safe_push (priority_ptr, heap, part->priorities, p);
> +               VEC_safe_push (priority_ptr, heap, dest->in_priorities, p);
> +             }
> +           else
> +             continue;
> +         }
> +       else
> +         p = (struct priority *)node2->aux;
> +       p->priority++;
> +       }
> + }
> +
> + static void
> + clear_callees (struct cgraph_node *node)
> + {
> +   struct cgraph_edge *edge;
> +   int i;
> +   struct ipa_ref *ref;
> +
> +   for (edge = node->callees; edge; edge = edge->next_callee)
> +     if (edge->inline_failed)
> +       edge->callee->aux = NULL;
> +     else
> +       clear_callees (edge->callee);
> +   for (i = 0; ipa_ref_list_reference_iterate (&node->ref_list, i, ref); i++)
> +     if (ref->refered_type == IPA_REF_CGRAPH)
> +       ipa_ref_node (ref)->aux = NULL;
> + }
> +
> + /* Compare two priority vector entries via indexes of destination
> +    partitions.  */
> +
> + static int
> + priority_cmp (const void *pa, const void *pb)
> + {
> +   const struct priority *a = *(const struct priority * const *) pa;
> +   const struct priority *b = *(const struct priority * const *) pb;
> +   /* Priorities pointing to dead partitions are removed lazily,
> +      so just avoid referencing dead memory.  */
> +   if (!a->priority)
> +     return ! b->priority ? 0 : -1;
> +   if (!b->priority)
> +     return 1;
> +   return a->dest_partition->index - b->dest_partition->index;
> + }
> +
> + /* Compare two priority vector entries via indexes of source
> +    partitions.  */
> +
> + static int
> + in_priority_cmp (const void *pa, const void *pb)
> + {
> +   const struct priority *a = *(const struct priority * const *) pa;
> +   const struct priority *b = *(const struct priority * const *) pb;
> +   /* Priorities pointing to dead partitions are removed lazily,
> +      so just avoid referencing dead memory.  */
> +   if (!a->priority)
> +     return ! b->priority ? 0 : -1;
> +   if (!b->priority)
> +     return 1;
> +   return a->src_partition->index - b->src_partition->index;
> + }
> +
> + /* Mark priority P dead and remove it from HEAP.  */
> +
> + static void
> + delete_priority (fibheap_t heap, struct priority *p)
> + {
> +   p->priority = 0;
> +   p->src_partition = p->dest_partition = NULL;
> +   if (p->node)
> +     fibheap_delete_node (heap, p->node);
> +   p->node = NULL;
> + }
> +
> + /* Merge PART2 into PART1: combine priority lists, concatenate node
> +    lists and release PART2.  */
> +
> + static void
> + merge_partitions (VEC (partition_ptr, heap) *partitions,
> +                 fibheap_t heap,
> +                 struct partition *part1,
> +                 struct partition *part2)
> + {
> +   VEC (priority_ptr, heap) *priorities = NULL;
> +   VEC (priority_ptr, heap) *in_priorities = NULL;
> +   unsigned int i1 = 0, i2 = 0;
> +   struct cgraph_node *node;
> +   int i;
> +
> +   /* We keep priority lists ordered by indexes of destination partitions
> +      to allow effective merging.  */
> +   qsort (VEC_address (priority_ptr, part1->priorities),
> +        VEC_length (priority_ptr, part1->priorities),
> +        sizeof (priority_ptr),
> +        priority_cmp);
> +   qsort (VEC_address (priority_ptr, part2->priorities),
> +        VEC_length (priority_ptr, part2->priorities),
> +        sizeof (priority_ptr),
> +        priority_cmp);
> +
> +   /* Merge priority lists, prune out references of partition to itself.
> +      Assume priority lists are ordered by indexes of destination partitions
> +      and keep them so.  */
> +   while (1)
> +     {
> +       struct priority *p1, *p2;
> +
> +       while (i1 < VEC_length (priority_ptr, part1->priorities)
> +            && !VEC_index (priority_ptr, part1->priorities, i1)->priority)
> +       i1++;
> +       while (i2 < VEC_length (priority_ptr, part2->priorities)
> +            && !VEC_index (priority_ptr, part2->priorities, i2)->priority)
> +       i2++;
> +
> +       if (i1 == VEC_length (priority_ptr, part1->priorities)
> +         && i2 == VEC_length (priority_ptr, part2->priorities))
> +       break;
> +
> +       if (i1 < VEC_length (priority_ptr, part1->priorities)
> +         && (i2 == VEC_length (priority_ptr, part2->priorities)
> +             || (VEC_index (priority_ptr, part1->priorities,
> +                            i1)->dest_partition->index
> +                 < VEC_index (priority_ptr, part2->priorities,
> +                              i2)->dest_partition->index)))
> +       {
> +         p1 = VEC_index (priority_ptr, part1->priorities, i1);
> +         if (p1->dest_partition != part2)
> +           VEC_safe_push (priority_ptr, heap, priorities, p1);
> +         else
> +           delete_priority (heap, p1);
> +         i1++;
> +       }
> +       else if (i2 < VEC_length (priority_ptr, part2->priorities)
> +              && (i1 == VEC_length (priority_ptr, part1->priorities)
> +                  || (VEC_index (priority_ptr, part1->priorities,
> +                                 i1)->dest_partition->index
> +                      > VEC_index (priority_ptr, part2->priorities,
> +                                   i2)->dest_partition->index)))
> +       {
> +         p2 = VEC_index (priority_ptr, part2->priorities, i2);
> +         if (p2->dest_partition != part1)
> +           {
> +             VEC_safe_push (priority_ptr, heap, priorities, p2);
> +             p2->src_partition = part1;
> +           }
> +         else
> +           delete_priority (heap, p2);
> +         i2++;
> +       }
> +       else
> +       {
> +         p1 = VEC_index (priority_ptr, part1->priorities, i1);
> +         p2 = VEC_index (priority_ptr, part2->priorities, i2);
> +         p1->priority += p2->priority;
> +         fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
> +         delete_priority (heap, p2);
> +         VEC_safe_push (priority_ptr, heap, priorities, p1);
> +         i1++;
> +         i2++;
> +       }
> +     }
> +   VEC_free (priority_ptr, heap, part1->priorities);
> +   part1->priorities = priorities;
> +
> +   qsort (VEC_address (priority_ptr, part1->in_priorities),
> +        VEC_length (priority_ptr, part1->in_priorities),
> +        sizeof (priority_ptr),
> +        in_priority_cmp);
> +   qsort (VEC_address (priority_ptr, part2->in_priorities),
> +        VEC_length (priority_ptr, part2->in_priorities),
> +        sizeof (priority_ptr),
> +        in_priority_cmp);
> +   i1 = 0;
> +   i2 = 0;
> +   while (1)
> +     {
> +       struct priority *p1, *p2;
> +       while (i1 < VEC_length (priority_ptr, part1->in_priorities)
> +            && !VEC_index (priority_ptr, part1->in_priorities, i1)->priority)
> +       i1++;
> +       while (i2 < VEC_length (priority_ptr, part2->in_priorities)
> +            && !VEC_index (priority_ptr, part2->in_priorities, i2)->priority)
> +       i2++;
> +
> +       if (i1 == VEC_length (priority_ptr, part1->in_priorities)
> +         && i2 == VEC_length (priority_ptr, part2->in_priorities))
> +       break;
> +
> +       if (i1 < VEC_length (priority_ptr, part1->in_priorities)
> +         && (i2 == VEC_length (priority_ptr, part2->in_priorities)
> +             || (VEC_index (priority_ptr, part1->in_priorities,
> +                            i1)->src_partition->index
> +                 < VEC_index (priority_ptr, part2->in_priorities,
> +                              i2)->src_partition->index)))
> +       {
> +         p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
> +         if (p1->src_partition != part2)
> +           VEC_safe_push (priority_ptr, heap, in_priorities, p1);
> +         else
> +           delete_priority (heap, p1);
> +         i1++;
> +       }
> +       else if (i2 < VEC_length (priority_ptr, part2->in_priorities)
> +              && (i1 == VEC_length (priority_ptr, part1->in_priorities)
> +                  || (VEC_index (priority_ptr, part1->in_priorities,
> +                                 i1)->src_partition->index
> +                      > VEC_index (priority_ptr, part2->in_priorities,
> +                                   i2)->src_partition->index)))
> +       {
> +         p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
> +         if (p2->src_partition != part1)
> +           {
> +             VEC_safe_push (priority_ptr, heap, in_priorities, p2);
> +             p2->dest_partition = part1;
> +           }
> +         else
> +           delete_priority (heap, p2);
> +         i2++;
> +       }
> +       else
> +       {
> +         p1 = VEC_index (priority_ptr, part1->in_priorities, i1);
> +         p2 = VEC_index (priority_ptr, part2->in_priorities, i2);
> +         p1->priority += p2->priority;
> +         fibheap_replace_key (heap, p1->node, INT_MAX - p1->priority);
> +         delete_priority (heap, p2);
> +         VEC_safe_push (priority_ptr, heap, in_priorities, p1);
> +         i1++;
> +         i2++;
> +       }
> +     }
> +   VEC_free (priority_ptr, heap, part1->in_priorities);
> +   part1->in_priorities = in_priorities;
> +
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, part2->nodes, i, node); i++)
> +     VEC_safe_push (cgraph_node_ptr, heap, part1->nodes, node);
> +
> +   VEC_replace (partition_ptr, partitions, part2->index, NULL);
> +   VEC_free (priority_ptr, heap, part2->priorities);
> +   VEC_free (priority_ptr, heap, part2->in_priorities);
> +   VEC_free (cgraph_node_ptr, heap, part2->nodes);
> +   free (part2);
> + }
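
An aside for readers following the merge logic above: both priority lists are kept sorted by destination-partition index, duplicate edges have their weights summed, and edges pointing into the partition being merged are pruned. A minimal Python sketch of that invariant (illustrative names, not the patch's data structures):

```python
def merge_edge_lists(a, b, self_id):
    """Two-pointer merge of edge lists sorted by destination id.

    Mirrors merge_partitions: edges to the same destination are
    combined by summing weights; edges into the merged partition
    itself (self_id) are dropped."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i][0] < b[j][0]):
            dest, weight = a[i]
            i += 1
        elif i == len(a) or b[j][0] < a[i][0]:
            dest, weight = b[j]
            j += 1
        else:  # same destination in both lists: sum the weights
            dest, weight = a[i][0], a[i][1] + b[j][1]
            i += 1
            j += 1
        if dest != self_id:  # prune self-edges, as the patch does lazily
            out.append((dest, weight))
    return out
```

For example, merging [(1, 2), (3, 5)] with [(2, 1), (3, 4)] while absorbing partition 2 yields [(1, 2), (3, 9)].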
> +
> + /* Dump the contents of partition PART into the dump file.  */
> +
> + static void
> + dump_partition (struct partition *part)
> + {
> +   int i;
> +   struct cgraph_node *node;
> +   struct priority *p;
> +   fprintf (dump_file, "  Partition %i:", part->index);
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, part->nodes, i, node); i++)
> +     fprintf (dump_file, "  %s/%i", cgraph_node_name (node), node->uid);
> +   fprintf (dump_file, "\n    priorities:");
> +   for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
> +     if (p->priority)
> +       {
> +       gcc_assert (p->src_partition == part);
> +       gcc_assert (p->dest_partition != part);
> +       fprintf (dump_file, "  %i:%i", p->dest_partition->index,
> +                (int)p->priority);
> +       }
> +   fprintf (dump_file, "\n");
> + }
> +
> + /* Main entry point: greedily merge single-function partitions by
> +    call priorities and record the resulting output order.  */
> +
> + static unsigned int
> + ipa_func_reorder (void)
> + {
> +   struct cgraph_node *node;
> +   VEC (partition_ptr, heap) *partitions = VEC_alloc (partition_ptr, heap, cgraph_max_uid);
> +   int i;
> +   struct partition *part, *first_part = NULL;
> +   int freq;
> +   fibheap_t heap;
> +
> +   if (!cgraph_max_uid)
> +     return 0;
> +
> +   heap = fibheap_new ();
> +   VEC_safe_grow_cleared (partition_ptr, heap, partitions, cgraph_max_uid);
> +
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       part = XCNEW (struct partition);
> +       part->index = node->uid;
> +       VEC_safe_push (cgraph_node_ptr, heap, part->nodes, node);
> +       VEC_replace (partition_ptr, partitions, node->uid, part);
> +       }
> +
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nCreating partitions\n");
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       part = VEC_index (partition_ptr, partitions, node->uid);
> +       account_callees (part, partitions, node);
> +       clear_callees (node);
> +       if (dump_file)
> +         dump_partition (part);
> +       }
> +   for (node = cgraph_nodes; node; node = node->next)
> +     if (cgraph_node_will_be_output_p (node))
> +       {
> +       struct priority *p;
> +       part = VEC_index (partition_ptr, partitions, node->uid);
> +
> +       for (i = 0; VEC_iterate (priority_ptr, part->priorities, i, p); i++)
> +         p->node = fibheap_insert (heap, INT_MAX - p->priority, p);
> +       if (dump_file)
> +         dump_partition (part);
> +       }
> +
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nMerging by priorities\n");
> +   while (!fibheap_empty (heap))
> +     {
> +       struct priority *p = (struct priority *)fibheap_extract_min (heap);
> +       struct partition *part = p->src_partition;
> +       p->node = NULL;
> +       if (dump_file)
> +       {
> +         fprintf (dump_file,
> +                  "Concatenating partitions %i and %i, priority %i\n",
> +                  p->src_partition->index,
> +                  p->dest_partition->index,
> +                  (int)p->priority);
> +         dump_partition (p->src_partition);
> +         dump_partition (p->dest_partition);
> +       }
> +       merge_partitions (partitions, heap, p->src_partition, p->dest_partition);
> +       if (dump_file)
> +       dump_partition (part);
> +     }
> +
> +   /* We ran out of calls to merge.  Try to arrange remaining partitions
> +      approximately in execution order: static constructors first, followed
> +      by partition containing function main ()
> +      followed by hot sections of the program.  */
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nLooking for static constructors\n");
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     if (part && part != first_part
> +       && DECL_STATIC_CONSTRUCTOR (VEC_index (cgraph_node_ptr,
> +                                              part->nodes, 0)->decl))
> +       {
> +       if (dump_file)
> +         dump_partition (part);
> +       if (!first_part)
> +        first_part = part;
> +       else
> +        merge_partitions (partitions, heap, first_part, part);
> +       }
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nLooking for main\n");
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     if (part && part != first_part
> +       && MAIN_NAME_P (DECL_NAME (VEC_index (cgraph_node_ptr,
> +                                             part->nodes, 0)->decl)))
> +       {
> +       if (dump_file)
> +         dump_partition (part);
> +       if (!first_part)
> +        first_part = part;
> +       else
> +        merge_partitions (partitions, heap, first_part, part);
> +       }
> +   if (dump_file)
> +     fprintf (dump_file, "\n\nMerging by frequency\n");
> +   for (freq = NODE_FREQUENCY_HOT; freq >= NODE_FREQUENCY_UNLIKELY_EXECUTED;
> +        freq--)
> +     {
> +       for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +       if (part && part != first_part
> +           && VEC_index (cgraph_node_ptr, part->nodes, 0)->frequency == freq)
> +         {
> +           if (dump_file)
> +             dump_partition (part);
> +           if (!first_part)
> +            first_part = part;
> +           else
> +            merge_partitions (partitions, heap, first_part, part);
> +         }
> +     }
> +   for (i = 0; VEC_iterate (partition_ptr, partitions, i, part); i++)
> +     gcc_assert (!part || part == first_part);
> +
> +   fibheap_delete (heap);
> +   if (!first_part)
> +     return 0;
> +   if (dump_file)
> +     {
> +       fprintf (dump_file, "\n\nFinal order\n");
> +       dump_partition (first_part);
> +     }
> +
> +   /* Store the resulting order and kill the single remaining partition.  */
> +   for (i = 0; VEC_iterate (cgraph_node_ptr, first_part->nodes, i, node); i++)
> +     node->order = i;
> +   VEC_free (priority_ptr, heap, first_part->priorities);
> +   VEC_free (cgraph_node_ptr, heap, first_part->nodes);
> +   free (first_part);
> +   return 0;
> + }
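
For readers who want the big picture: the loop above is the greedy clustering step - repeatedly take the hottest remaining call edge off the heap and concatenate the two partitions it connects. A toy Python model of that idea (hypothetical names; the patch keys a Fibonacci min-heap with INT_MAX - priority to pop the hottest edge first, and the C3 paper additionally caps cluster sizes, which this sketch omits):

```python
def cluster(functions, edges):
    """Greedy call-chain clustering over (caller, callee, count) edges."""
    cluster_of = {f: i for i, f in enumerate(functions)}
    clusters = {i: [f] for i, f in enumerate(functions)}
    # Hottest edges first; the patch's min-heap keyed by
    # INT_MAX - priority pops edges in the same order.
    for caller, callee, count in sorted(edges, key=lambda e: -e[2]):
        a, b = cluster_of[caller], cluster_of[callee]
        if a == b:  # already in one cluster; nothing to merge
            continue
        clusters[a].extend(clusters[b])
        for f in clusters[b]:
            cluster_of[f] = a
        del clusters[b]
    # Concatenate the surviving clusters into one output order.
    return [f for c in clusters.values() for f in c]

order = cluster(["main", "foo", "bar", "baz"],
                [("main", "foo", 10), ("foo", "bar", 100), ("main", "baz", 1)])
# -> ['main', 'foo', 'bar', 'baz']: foo/bar merge first on the hottest edge
```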
> +
> + static bool
> + gate_ipa_func_reorder (void)
> + {
> +   return optimize && flag_ipa_reorder && flag_toplevel_reorder;
> + }
> +
> + struct ipa_opt_pass_d pass_ipa_func_reorder =
> + {
> +  {
> +   IPA_PASS,
> +   "reorder",                          /* name */
> +   gate_ipa_func_reorder,              /* gate */
> +   ipa_func_reorder,                   /* execute */
> +   NULL,                                       /* sub */
> +   NULL,                                       /* next */
> +   0,                                  /* static_pass_number */
> +   TV_CGRAPHOPT,                               /* tv_id */
> +   0,                                  /* properties_required */
> +   0,                                  /* properties_provided */
> +   0,                                  /* properties_destroyed */
> +   0,                                  /* todo_flags_start */
> +   0                                     /* todo_flags_finish */
> +  },
> +  NULL,                                        /* generate_summary */
> +  NULL,                                        /* write_summary */
> +  NULL,                                        /* read_summary */
> +  NULL,                                        /* write_optimization_summary */
> +  NULL,                                        /* read_optimization_summary */
> +  NULL,                                        /* stmt_fixup */
> +  0,                                   /* TODOs */
> +  NULL,                                        /* function_transform */
> +  NULL                                 /* variable_transform */
> + };
> Index: Makefile.in
> ===================================================================
> *** Makefile.in (revision 164689)
> --- Makefile.in (working copy)
> *************** OBJS-archive = \
> *** 1461,1466 ****
> --- 1461,1467 ----
>         ipa-type-escape.o \
>         ipa-utils.o \
>         ipa.o \
> +       ipa-reorder.o \
>         matrix-reorg.o \
>         prefix.o \
>         tree-inline.o \
> *************** varpool.o : varpool.c $(CONFIG_H) $(SYST
> *** 3012,3017 ****
> --- 3013,3020 ----
>      $(TREE_FLOW_H) gt-varpool.h
>   ipa.o : ipa.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(CGRAPH_H) \
>      $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
> + ipa-reorder.o : ipa-reorder.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
> +    $(CGRAPH_H) $(TREE_PASS_H) $(TIMEVAR_H) $(GIMPLE_H) $(GGC_H) pointer-set.h
>   ipa-prop.o : ipa-prop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
>      langhooks.h $(GGC_H) $(TARGET_H) $(CGRAPH_H) $(IPA_PROP_H) $(DIAGNOSTIC_H) \
>      $(TREE_FLOW_H) $(TM_H) $(TREE_PASS_H) $(FLAGS_H) $(TREE_H) \
> Index: passes.c
> ===================================================================
> *** passes.c    (revision 164689)
> --- passes.c    (working copy)
> *************** init_optimization_passes (void)
> *** 818,823 ****
> --- 818,824 ----
>     NEXT_PASS (pass_ipa_type_escape);
>     NEXT_PASS (pass_ipa_pta);
>     NEXT_PASS (pass_ipa_struct_reorg);
> +   NEXT_PASS (pass_ipa_func_reorder);
>     *p = NULL;
>
>     p = &all_lto_gen_passes;
Jan Hubicka Oct. 7, 2019, 9:33 a.m. UTC | #11
> > This is why we currently have a way to order functions when outputting them
> > and use that with FDO (using Martin's first-execution logic). This has
> > the drawback of making the functions flow in that order through late
> > optimizations and the RTL backend and thus we lose IPA-RA and some
> > IP propagation (late pure/const/nothrow discovery).
> 
> But you can also fix that by the parallelization GSoC project approach,
> decoupling things at RTL expansion IPA-wise and using output order
> for the latter (or even do the fragments in GCC itself by refactoring the
> output machinery to use function-local "strings", assembling the final
> file later).  Refactoring output to handle multiple output "files" at the
> same time might also help getting rid of that early-LTO-debug copying
> stuff (now that it seems that linker-plugin extensions for partially claiming
> files will never happen...)

Multiple output files won't solve my concern about padding and
relaxation, but yep, parallelizing RTL will definitely need to introduce
a way to hold final assembly and glue it together, so this problem may get
solved independently, which would be nice.

Honza
Martin Liška Nov. 25, 2019, 1:51 p.m. UTC | #12
Hello.

I'm sending v2 of the patch set based on the discussion I had with Honza.
Changes from previous version:
- I changed the type of the edge count from uint32_t to uint64_t.
- The algorithm traverses inline clones recursively.
- TDF_DUMP_DETAILS is supported and provides more information.
- I added a cgraph_node_cmp_by_text_sorted function that is used
   for symbol sorting in both cgraphunit.c and lto-cgraph.c.

I ran the same tests as before and get the following numbers for tramp3d w/ -O2:
Total runs: 15, before: 13.89, after: 13.81, cmp: 99.383%

and iTLB-load-misses:u goes from 15,450,298 to 11,633,215.

The corresponding binutils patch was partially accepted (for BFD) and
today I was given feedback on the ld.gold part.

Thoughts?
Thanks,
Martin
Martin Liška Nov. 29, 2019, 3:11 p.m. UTC | #13
Hi.

I'm sending v3 of the patch where I changed:
- function.cold sections are properly put into .text.unlikely and
   not into a .text.sorted.XYZ section

I've just finished measurements and I still have the original speed up
for tramp3d:
Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%

Thoughts?
Martin
Jan Hubicka Dec. 1, 2019, 10:45 a.m. UTC | #14
> Hi.
> 
> I'm sending v3 of the patch where I changed:
> - function.cold sections are properly put into .text.unlikely and
>   not into a .text.sorted.XYZ section
> 
> I've just finished measurements and I still have the original speed up
> for tramp3d:
> Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%

Hi,
I have updated binutils to current head on the Firefox testing patch and
ran an FDO build with first-run ordering and call-chain clustering.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1313e6a4d74ebff702afa7594684beb83c01d95f&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1

It seems there are no differences in performance. The two binaries can
be downloaded at

w/o patch:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
with call chain clustering.
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2

Since Firefox is quite sensitive to code size I would expect to be able
to measure some benefits here.  Any idea what may have gone wrong?
I checked that the binaries seem generally sane - out of the 58MB text
segment there is a 34MB cold section. It is possible that the system ld is
used instead of the provided one, but that would be weird.  I will try to
find a way to double-check that updating binutils really updated them for
GCC.

Honza
Jan Hubicka Dec. 1, 2019, 11:11 a.m. UTC | #15
> > Hi.
> > 
> > I'm sending v3 of the patch where I changed:
> > - function.cold sections are properly put into .text.unlikely and
> >   not into a .text.sorted.XYZ section
> > 
> > I've just finished measurements and I still have the original speed up
> > for tramp3d:
> > Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%
> 
> Hi,
> I have updated binutils to current head on the Firefox testing patch and
> ran an FDO build with first-run ordering and call-chain clustering.
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1313e6a4d74ebff702afa7594684beb83c01d95f&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1
> 
> It seems there are no differences in performance. The two binaries can
> be downloaded at
> 
> w/o patch:
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
> with call chain clustering.
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2
> 
> Since Firefox is quite sensitive to code size I would expect to be able
> to measure some benefits here.  Any idea what may have gone wrong?
> I checked that the binaries seem generally sane - out of the 58MB text
> segment there is a 34MB cold section. It is possible that the system ld is
> used instead of the provided one, but that would be weird.  I will try to
> find a way to double-check that updating binutils really updated them for
> GCC.
Linker used is
GNU ld (GNU Binutils) 2.33.50.20191130

Does it have all necessary support in?  Do I need something like
-ffunction-sections?

Honza
Jan Hubicka Dec. 1, 2019, 10:37 p.m. UTC | #16
Hi,
I was playing with it a bit more and built with
-fno-profile-reorder-functions.  

Here is -fno-profile-reorder-functions compared to first run 
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3d537be0cb37458e7928f69a37efb2a6d6b85eae&newProject=try&newRevision=4543abfa08870391544b56d16dfcd530dac0dc30&framework=1
2.2% improvement on the page rendering is outside the noise, but I would
hope for a bit more.

Here is -fno-profile-reorder-functions compared to clustering
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3d537be0cb37458e7928f69a37efb2a6d6b85eae&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1
I am not sure if it is noise.

Either Firefox's talos is not a good benchmark for code layout (which
would surprise me, since it is quite sensitive to code-size issues) or
there are some problems.

In general I think the patch is useful and mostly mainline-ready except
for details, but it would be good to have some more evidence that it
works as expected on large binaries besides tramp3d build time.  There
are a number of ways where things may go wrong, ranging from misupdated
profiles to problems with function splitting, comdats and other issues.

Were you able to benchmark some other benefits?  I remember we discussed
collecting traces from valgrind; perhaps we could check that they look
good?

Honza
Martin Liška Dec. 2, 2019, 8:19 a.m. UTC | #17
On 12/1/19 12:11 PM, Jan Hubicka wrote:
>>> Hi.
>>>
>>> I'm sending v3 of the patch where I changed:
>>> - function.cold sections are properly put into .text.unlikely and
>>>    not into a .text.sorted.XYZ section
>>>
>>> I've just finished measurements and I still have the original speed up
>>> for tramp3d:
>>> Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%
>>
>> Hi,
>> I have updated binutils to current head on the Firefox testing patch and
>> ran an FDO build with first-run ordering and call-chain clustering.
>> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1313e6a4d74ebff702afa7594684beb83c01d95f&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1
>>
>> It seems there are no differences in performance. The two binaries can
>> be downloaded at
>>
>> w/o patch:
>> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
>> with call chain clustering.
>> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2
>>
>> Since Firefox is quite sensitive to code size I would expect to be able
>> to measure some benefits here.  Any idea what may have gone wrong?
>> I checked that the binaries seem generally sane - out of the 58MB text
>> segment there is a 34MB cold section. It is possible that the system ld is
>> used instead of the provided one, but that would be weird.  I will try to
>> find a way to double-check that updating binutils really updated them for
>> GCC.
> Linker used is
> GNU ld (GNU Binutils) 2.33.50.20191130

Hello.

You can check linker support with:
$ ~/bin/binutils/bin/ld.bfd --verbose | grep text.sorted
     *(SORT(.text.sorted.*))

Is Firefox using BFD or GOLD for linking?

> 
> Does it have all necessary support in?  Do I need something like
> -ffunction-sections?

No, you don't need it.

Martin

> 
> Honza
>
Martin Liška Dec. 2, 2019, 8:20 a.m. UTC | #18
On 12/1/19 11:45 AM, Jan Hubicka wrote:
>> Hi.
>>
>> I'm sending v3 of the patch where I changed:
>> - function.cold sections are properly put into .text.unlikely and
>>    not into a .text.sorted.XYZ section
>>
>> I've just finished measurements and I still have the original speed up
>> for tramp3d:
>> Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%
> 
> Hi,
> I have updated binutils to current head on the Firefox testing patch and
> ran an FDO build with first-run ordering and call-chain clustering.
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1313e6a4d74ebff702afa7594684beb83c01d95f&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1

Hello.

Thank you for the testing.

> 
> It seems there are no differences in performance. The two binaries can
> be downloaded at
> 
> w/o patch:
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
> with call chain clustering.
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2

I would ideally need the output of -fdump-ipa-reorder, which prints the
sorted symbols, so that I can compare it with the resulting assembly.

> 
> Since Firefox is quite sensitive to code size I would expect to be able
> to measure some benefits here.  Any idea what may have gone wrong?

That's a pity.

> I checked that the binaries seem generally sane - out of the 58MB text
> segment there is a 34MB cold section. It is possible that the system ld is
> used instead of the provided one, but that would be weird.  I will try to
> find a way to double-check that updating binutils really updated them for
> GCC.

I can double-check once I have the dump file.

Martin

> 
> Honza
>
Martin Liška Dec. 2, 2019, 8:25 a.m. UTC | #19
On 12/1/19 11:37 PM, Jan Hubicka wrote:
> Hi,
> I was playing with it a bit more and built with
> -fno-profile-reorder-functions.
> 
> Here is -fno-profile-reorder-functions compared to first run
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3d537be0cb37458e7928f69a37efb2a6d6b85eae&newProject=try&newRevision=4543abfa08870391544b56d16dfcd530dac0dc30&framework=1
> 2.2% improvement on the page rendering is outside the noise, but I would
> hope for a bit more.
> 
> Here is -fno-profile-reorder-functions compared to clustering
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3d537be0cb37458e7928f69a37efb2a6d6b85eae&newProject=try&newRevision=1c2d53b10b042aaaac15edbe7bd26e2740641840&framework=1
> I am not sure if it is noise.
> 
> Either Firefox's talos is not a good benchmark for code layout (which
> would surprise me, since it is quite sensitive to code-size issues) or
> there are some problems.
> 
> In general I think the patch is useful and mostly mainline-ready except
> for details, but it would be good to have some more evidence that it
> works as expected on large binaries besides tramp3d build time.  There
> are a number of ways where things may go wrong, ranging from misupdated
> profiles to problems with function splitting, comdats and other issues.

Based on my testing, I was able to see the cc1plus binary really sorted
as seen by the pass, including various IPA clones that inherited the
order of their origins.

> 
> Were you able to benchmark some other benefits?

Unfortunately not.

> I remember we discussed
> collecting traces from valgrind; perhaps we could check that they look
> good?

I have some semi-working icegrind port here:
https://github.com/marxin/valgrind/tree/icegrind

It will probably need some extra work and it's terribly slow. But you can try
it.

I would first wait for the Firefox dump files and then we can discuss what
to do with the pass.

Martin

> 
> Honza
>
Martin Liška Dec. 9, 2019, 12:14 p.m. UTC | #20
Hello.

Based on a presentation that Sriraman Tallam gave at an LLVM conference:
https://www.youtube.com/watch?v=DySuXFGmB40

I made a heatmap based on executed instruction addresses. I used
$ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii
and
$ perf script -F time,ip,dso

I'm sending a link for my system GCC 9 (PGO+lean LTO bootstrap), and GCC 10 before and after my reorder
patch (also PGO+lean LTO bootstrap).

One can see quite significant clustering starting from 5s till the end of compilation.
Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS
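
A minimal sketch of the first step of such a heatmap: turning `perf script -F time,ip,dso` output into (time, address) samples for one binary. The line format below is an assumption based on typical perf output; real output has more variants (missing dsos, kernel samples) that a full implementation must handle.

```python
import re

# Assumed shape of `perf script -F time,ip,dso` lines, e.g.
#   "  1234.567890:      55d8a1b2c3d4 (/usr/bin/cc1plus)"
SAMPLE_RE = re.compile(r'\s*([\d.]+):\s+([0-9a-f]+)\s+\((.*)\)')

def samples_for(dso, text):
    """Return (time, address) pairs for samples that fall in DSO."""
    out = []
    for line in text.splitlines():
        m = SAMPLE_RE.match(line)
        if m and dso in m.group(3):
            out.append((float(m.group(1)), int(m.group(2), 16)))
    return out
```

The resulting pairs can then be binned into a 2D histogram (time on x, instruction address on y) to produce the clustering pictures.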

Martin
Martin Liška Dec. 9, 2019, 12:42 p.m. UTC | #21
On 12/9/19 1:14 PM, Martin Liška wrote:
> Hello.
> 
> Based on a presentation that Sriraman Tallam gave at an LLVM conference:
> https://www.youtube.com/watch?v=DySuXFGmB40
> 
> I made a heatmap based on executed instruction addresses. I used
> $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii
> and
> $ perf script -F time,ip,dso
> 
> I'm sending a link for my system GCC 9 (PGO+lean LTO bootstrap), and GCC 10 before and after my reorder
> patch (also PGO+lean LTO bootstrap).
> 
> One can see quite significant clustering starting from 5s till the end of compilation.
> Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS
> 
> Martin

For completeness, the heatmap was generated with the following script:
https://github.com/marxin/script-misc/blob/master/binary-heatmap.py

Martin
Jan Hubicka Dec. 9, 2019, 1:03 p.m. UTC | #22
> On 12/9/19 1:14 PM, Martin Liška wrote:
> > Hello.
> > 
> > Based on a presentation that Sriraman Tallam gave at an LLVM conference:
> > https://www.youtube.com/watch?v=DySuXFGmB40
> > 
> > I made a heatmap based on executed instruction addresses. I used
> > $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii
> > and
> > $ perf script -F time,ip,dso
> > 
> > I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before and after my reorder
> > patch (also PGO+lean LTO bootstrap).
> > 
> > One can see quite significant clustering starting from 5s till the end of compilation.
> > Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS
> > 
> > Martin
> 
> For the completeness, the heatmap was generated with the following script:
> https://github.com/marxin/script-misc/blob/master/binary-heatmap.py

Thanks,
this looks really useful, as we had almost no way to check code layout
ever since your systemtap script stopped working.

At first glance, the difference between gcc9 and gcc10 is explained
by the changes to profile updating: gcc9 makes very small cold
partitions compared to gcc10.  It is very nice that we have a way to
measure it.  I will also check whether some of the more important profile
update fixes make sense to backport to gcc9.

Over the weekend I did some fixes to tp_first_run reordering, so it may be
nice to update your tests, but I will try to run it myself.

In general one can see individual stages of compilation on the graph -
parsing, early lowering, early opts.  On bigger programs this should be
more visible.  I will give it a try.

Honza
Martin Liška Dec. 9, 2019, 1:16 p.m. UTC | #23
On 12/9/19 2:03 PM, Jan Hubicka wrote:
>> On 12/9/19 1:14 PM, Martin Liška wrote:
>>> Hello.
>>>
>>> Based on presentation that had Sriraman Tallam at a LLVM conference:
>>> https://www.youtube.com/watch?v=DySuXFGmB40
>>>
>>> I made a heatmap based on executed instruction addresses. I used
>>> $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii
>>> and
>>> $ perf script -F time,ip,dso
>>>
>>> I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before and after my reorder
>>> patch (also PGO+lean LTO bootstrap).
>>>
>>> One can see quite significant clustering starting from 5s till the end of compilation.
>>> Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS
>>>
>>> Martin
>>
>> For the completeness, the heatmap was generated with the following script:
>> https://github.com/marxin/script-misc/blob/master/binary-heatmap.py
> 
> Thanks,
> this looks really useful as we had almost no way to check code layout
> ever since you systemtap script stopped working.

Great, thanks.

> 
> On the first glance the difference between gcc9 and gcc10 is explained
> by the changes to profile updating. gcc9 makes very small cold
> partitions compared to gcc10.  It is very nice that we have a way to
> measure it. I will also check if some of the more important profiling
> update fixes makes sense to backport to gcc9.
> 
> Over weekend I did some fixes to tp reordreing, so it may be nice to
> update your tests, but I will try to run it myself.
> 
> In general one can see individual stages of compilation on the graph -
> parsing, early lowering, early opts.  On bigger programs this should be
> more visible.  I will give it a try.

You haven't replied to the question of whether we want to let ipa-reorder
into trunk, based on the images I sent for the GCC 10 PGO+LTO bootstrap?

Martin

> 
> Honza
>
Jan Hubicka Dec. 9, 2019, 1:42 p.m. UTC | #24
> > On the first glance the difference between gcc9 and gcc10 is explained
> > by the changes to profile updating. gcc9 makes very small cold
> > partitions compared to gcc10.  It is very nice that we have a way to
> > measure it. I will also check if some of the more important profiling
> > update fixes makes sense to backport to gcc9.
> > 
> > Over weekend I did some fixes to tp reordreing, so it may be nice to
> > update your tests, but I will try to run it myself.
> > 
> > In general one can see individual stages of compilation on the graph -
> > parsing, early lowering, early opts.  On bigger programs this should be
> > more visible.  I will give it a try.
> 
> You haven't replied to question whether we want to let ipa-reorder into
> trunk based on the sent images for GCC 10 PGO+LTO boostrap?

My concern is still the same - while I like the patch, I am worried that
we have only one example where it produces some benefit.  For this reason
I looked into tp_first_run issues this weekend and found and fixed some
issues / verified that the order now seems to be fine for the cc1 binary
and partly for Firefox.

We do run periodic benchmarks to keep optimizations in shape, but the code
layout stuff is out in the wild and no one is verifying that it works.  It
is now a nontrivial piece of code, and thus it is no big surprise that
there are a number of bugs.

I like the heatmap generator (but for some reason it generates empty
pngs for me; maybe things are just out of range, since the bounds are
hardcoded in your script).

I don't have much time today, but tomorrow I will try to get it
tested and send some comments on the patch.

Honza
Martin Liška Dec. 9, 2019, 1:44 p.m. UTC | #25
Hi.

I've just updated the script a bit and also added an address histogram:
https://drive.google.com/file/d/11s9R_JnEMohDE6ctqzsj092QD22HKXJI/view?usp=sharing

Martin
Jan Hubicka Dec. 10, 2019, 4:56 p.m. UTC | #26
> 	* Makefile.in: Add ipa-reorder.o.
> 	* cgraph.c (cgraph_node::dump): Dump
> 	text_sorted_order.
> 	(cgraph_node_cmp_by_text_sorted):
> 	New function that sorts functions based
> 	on text_sorted_order.
> 	* cgraph.h (cgraph_node): Add text_sorted_order.
> 	(cgraph_node_cmp_by_text_sorted): New.
> 	* cgraphclones.c (cgraph_node::create_clone):
> 	Clone also text_sorted_order.
> 	* cgraphunit.c (node_cmp): Remove.
> 	(expand_all_functions): Use new function
> 	cgraph_node_cmp_by_text_sorted.
> 	* common.opt: Add new option reorder_functions_algorithm.
> 	* flag-types.h (enum reorder_functions_algorithm):
> 	New enum.
> 	* ipa-reorder.c: New file.
> 	* lto-cgraph.c (lto_output_node): Stream in and out
> 	text_sorted_order.
> 	(input_node): Likewise.
> 	* passes.def: Add pass_ipa_reorder.
> 	* timevar.def (TV_IPA_REORDER): New.
> 	* tree-pass.h (make_pass_ipa_reorder): New.
> 	* varasm.c (default_function_section): Assign text.sorted.X
> 	section.
> 
> gcc/lto/ChangeLog:
> 
> 2019-11-25  Martin Liska  <mliska@suse.cz>
> 
> 	* lto-partition.c (node_cmp): Remove.
> 	(lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted.
> +/* Sort cgraph_nodes by text_sorted_order if available, or by order.  */
> +
> +int
> +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb)
> +{
> +  const cgraph_node *a = *(const cgraph_node * const *) pa;
> +  const cgraph_node *b = *(const cgraph_node * const *) pb;
> +
> +  /* Functions with text_sorted_order should be before these
> +     without profile.  */
> +  if (a->text_sorted_order == 0 || b->text_sorted_order == 0)
> +    return a->text_sorted_order - b->text_sorted_order;
> +
> +  return a->text_sorted_order != b->text_sorted_order
> +	 ? b->text_sorted_order - a->text_sorted_order
> +	 : b->order - a->order;
> +}

We now have tp_first_run_node_cmp, which needs to be updated and renamed
instead of adding a new comparator.
>    /* Time profiler: first run of function.  */
>    int tp_first_run;
> +  /* Order in .text.sorted.* section.  */
> +  int text_sorted_order;

tp_first_run probably should go into a summary, since it will live from
profiling to the ipa-reorder pass, where it will be copied to
text_sorted_order?  Or how do these two interface?
> diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
> index a79491e0b88..f624cbee185 100644
> --- a/gcc/cgraphclones.c
> +++ b/gcc/cgraphclones.c
> @@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count prof_count,
>    new_node->rtl = rtl;
>    new_node->frequency = frequency;
>    new_node->tp_first_run = tp_first_run;
> +  new_node->text_sorted_order = text_sorted_order;
Copy it also in duplicate_thunk_for_node and create_version_node.
I think we should refactor the copying code and share it in all places
where we duplicate a function.
> +freorder-functions-algorithm=
> +Common Joined RejectNegative Enum(reorder_functions_algorithm) Var(flag_reorder_functions_algorithm) Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization
> +-freorder-functions-algorithm=[first-run|call-chain-clustering]	Set the used function reordering algorithm.
> +
> +Enum
> +Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) UnknownError(unknown function reordering algorithm %qs)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(first-run) Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(call-chain-clustering) Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING)

I think we should also have "definition-order", which will sort by
node->order.  For testing we may also want to have "random", which will
do something stupid like sorting by a symbol name checksum, so we could
see how these compare.
> +/* The algorithm used for function reordering.  */
> +enum reorder_functions_algorithm
> +{
> +  REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN,
> +  REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING
> +};

Isn't this supposed to be generated by awk scripts?
> +
> +#define C3_CLUSTER_THRESHOLD 1024
A description is missing + perhaps this should be a PARAM?
> +
> +struct cluster_edge;
> +
> +/* Cluster is set of functions considered in C3 algorithm.  */
Perhaps a high-level overview of the C3 algorithm with a reference would
be nice.
> +/* Sort functions based of first execution of the function.  */
> +
> +static void
> +sort_functions_by_first_run (void)
> +{
> +  cgraph_node *node;
> +
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (node->tp_first_run && !node->alias)
> +      node->text_sorted_order = node->tp_first_run;

Why do you care about excluding aliases?  We also have thunks, where
tp_first_run is ignored.  I am just curious whether you got some ICE here.
> +}
> +
> +/* Compare clusters by density after that are established.  */
> +
> +static int
> +cluster_cmp (const void *a_p, const void *b_p)
> +{
> +  const cluster *a = *(cluster * const *)a_p;
> +  const cluster *b = *(cluster * const *)b_p;
> +
> +  unsigned fncounta = a->m_functions.length ();
> +  unsigned fncountb = b->m_functions.length ();
I think we still put an empty line after declarations.
> +  if (fncounta <= 1 || fncountb <= 1)
> +    return fncountb - fncounta;
> +
> +  sreal r = b->m_time * a->m_size - a->m_time * b->m_size;
> +  return (r < 0) ? -1 : ((r > 0) ? 1 : 0);
> +}
> +
> +/* Visit callgraph edge CS until we reach a real cgraph_node (not a clone).
> +   Record such edge to EDGES or traverse recursively.  */
> +
> +static void
> +visit_all_edges_for_caller (auto_vec<cluster_edge *> *edges,
> +			    cgraph_node *node, cgraph_edge *cs)
> +{
> +  if (cs->inline_failed)
> +    {
> +      gcov_type count;
> +      profile_count pcount = cs->count.ipa ();
Similarly here.
> +      /* A real call edge.  */
> +      if (!cs->callee->alias
> +	  && cs->callee->definition
> +	  && pcount.initialized_p ()
> +	  && (count = pcount.to_gcov_type ()) > 0)
> +	{
> +	  cluster *caller = (cluster *)node->aux;
> +	  cluster *callee = (cluster *)cs->callee->aux;
> +	  cluster_edge **cedge = callee->m_callers.get (caller);
> +
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file, "Adding edge:%s->%s:%" PRIu64 "\n",
> +		     node->dump_name (), cs->callee->dump_name (), count);
> +
> +	  if (cedge != NULL)
> +	    (*cedge)->m_count += count;
> +	  else
> +	    {
> +	      cluster_edge *cedge = new cluster_edge (caller, callee, count);
> +	      edges->safe_push (cedge);
> +	      callee->m_callers.put (caller, cedge);
> +	    }
> +	}
> +    }
> +  else
> +    {
> +      cgraph_node *clone = cs->callee;
> +      for (cgraph_edge *cs = clone->callees; cs; cs = cs->next_callee)
> +	visit_all_edges_for_caller (edges, node, cs);
> +    }
> +}

With FDO we have likely targets for indirect calls, which may or may not
be speculative calls at this moment.  We may want to put that info into
the summaries and use it here, but that is for an incremental change.

I think not seeing indirect call targets may be one of the main reasons
for clusters being mistakenly broken up.
> +
> +/* Sort functions based on call chain clustering, which is an algorithm
> +   mentioned in the following article:
> +   https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf

I would use a normal citation format - research.fb.com may change :)

Ottoni, Guilherme, and Bertrand Maher. "Optimizing function placement
for large-scale data-center applications." Proceedings of the 2017
International Symposium on Code Generation and Optimization. IEEE Press,
2017.
> +   .  */
> +
> +static void
> +sort_functions_by_c3 (void)
> +{
> +  cgraph_node *node;
> +  auto_vec<cluster *> clusters;
> +
> +  /* Create a cluster for each function.  */
Putting thunks into their own clusters will do no good, at least until we
learn to output them independently.  So I would add !node->thunk.
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (!node->alias
> +	&& !node->inlined_to)
> +      {
> +	if (dump_file && (dump_flags & TDF_DETAILS))
> +	  fprintf (dump_file, "Adding node:%s\n", node->dump_name ());
> +
> +	ipa_size_summary *size_summary = ipa_size_summaries->get (node);
> +	ipa_fn_summary *fn_summary = ipa_fn_summaries->get (node);
> +	cluster *c = new cluster (node, size_summary->size, fn_summary->time);
> +	node->aux = c;
> +	clusters.safe_push (c);
> +      }

Either you want to walk from edge->callee to the function_node, or you
want to insert all associated aliases and thunks into the cluster, I think.
> +
> +  auto_vec<cluster_edge *> edges;
> +
> +  /* Insert edges between clusters that have a profile.  */
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cgraph_node *node = clusters[i]->m_functions[0];
> +      for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
> +	visit_all_edges_for_caller (&edges, node, cs);
> +    }
> +
> +  /* Now insert all created edges into a heap.  */
> +  fibonacci_heap <uint64_t, cluster_edge> heap (0);
> +
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cluster *c = clusters[i];
> +      for (hash_map<cluster *, cluster_edge *>::iterator it
> +	   = c->m_callers.begin (); it != c->m_callers.end (); ++it)
> +	{
> +	  cluster_edge *cedge = (*it).second;
> +	  cedge->m_heap_node = heap.insert (cedge->inverted_count (), cedge);
> +	}
> +    }
> +
> +  while (!heap.empty ())
> +    {
> +      cluster_edge *cedge = heap.extract_min ();
> +      cluster *caller = cedge->m_caller;
> +      cluster *callee = cedge->m_callee;
> +      cedge->m_heap_node = NULL;
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, "Processing cluster edge: %p->%p, count: %"
> +		   PRIu64 "\n", (void *)caller, (void *)callee, cedge->m_count);
> +	  fprintf (dump_file, "  source functions (%u): ", caller->m_size);
> +	  for (unsigned j = 0; j < caller->m_functions.length (); j++)
> +	    fprintf (dump_file, "%s ", caller->m_functions[j]->dump_name ());
> +	  fprintf (dump_file, "\n  target functions (%u): ", callee->m_size);
> +	  for (unsigned j = 0; j < callee->m_functions.length (); j++)
> +	    fprintf (dump_file, "%s ", callee->m_functions[j]->dump_name ());
> +	  fprintf (dump_file, "\n");
> +	}
> +
> +      if (caller == callee)
> +	continue;
> +      if (caller->m_size + callee->m_size <= C3_CLUSTER_THRESHOLD)

Did you experiment with the cluster threshold value?  In a way, since we
have such a poor idea about size, perhaps we may just set it to infinity.
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file, "  (clusters merged)\n");
> +
> +	  caller->m_size += callee->m_size;
> +	  caller->m_time += callee->m_time;
> +
> +	  /* Append all cgraph_nodes from callee to caller.  */
> +	  for (unsigned i = 0; i < callee->m_functions.length (); i++)
> +	    caller->m_functions.safe_push (callee->m_functions[i]);

The right data structure for this is probably union-find.  Don't we have a
function to append one vector to another?
> +
> +	  callee->m_functions.truncate (0);

Don't we want to free it?
> +
> +  /* Sort the candidate clusters.  */
> +  clusters.qsort (cluster_cmp);

We still have info about what code is used at startup and what code is
likely/unlikely.

After inlining we may also have a lot of functions that have an IPA count
of 0 but still have tp_first_run set.  So I suppose we may want to
experiment with this incrementally.

Also, merging clusters at the end even over edges profiled zero times
could improve locality if the profile run is not representative.
> +  /* Assign .text.sorted.* section names.  */
I would say just text_sorted_order, since I think the sections are just
one way of arranging it.
> +  int counter = 1;
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cluster *c = clusters[i];
> +      if (c->m_functions.length () <= 1)
> +	continue;
> +
> +      for (unsigned j = 0; j < c->m_functions.length (); j++)
> +	{
> +	  cgraph_node *node = c->m_functions[j];
> +
> +	  if (dump_file)
> +	    fprintf (dump_file, "setting: %d for %s with size:%d\n",
> +		     counter, node->dump_asm_name (),
> +		     ipa_size_summaries->get (node)->size);
> +	  node->text_sorted_order = counter++;
> +	}
> +    }
> +
> +  /* Release memory.  */
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (!node->alias)
> +      node->aux = NULL;
> +
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    delete clusters[i];
> +
> +  for (unsigned i = 0; i < edges.length (); i++)
> +    delete edges[i];
> +}
> +
> +/* The main function for function sorting.  */
> +
> +static unsigned int
> +ipa_reorder (void)
> +{
> +  switch (flag_reorder_functions_algorithm)
> +    {
> +    case REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING:
> +      sort_functions_by_c3 ();
> +      break;
> +    case REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN:
> +      sort_functions_by_first_run ();
> +      break;
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  return 0;
> +}
> +
> +const pass_data pass_data_ipa_reorder =
> +{
> +  IPA_PASS, /* type */
> +  "reorder", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_IPA_REORDER, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_ipa_reorder : public ipa_opt_pass_d
> +{
> +public:
> +  pass_ipa_reorder (gcc::context *ctxt)
> +    : ipa_opt_pass_d (pass_data_ipa_reorder, ctxt,
> +		      NULL, /* generate_summary */
> +		      NULL, /* write_summary */
> +		      NULL, /* read_summary */
> +		      NULL, /* write_optimization_summary */
> +		      NULL, /* read_optimization_summary */
> +		      NULL, /* stmt_fixup */
> +		      0, /* function_transform_todo_flags_start */
> +		      NULL, /* function_transform */
> +		      NULL) /* variable_transform */
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *) { return ipa_reorder (); }
> +
> +}; // class pass_ipa_reorder
> +
> +bool
> +pass_ipa_reorder::gate (function *)
> +{
> +  return flag_profile_reorder_functions && flag_profile_use && flag_wpa;

There is nothing special about flag_wpa.  We can also run it
meaningfully with -flto-partitions=none or with -fwhole-program.

I think we may
 1) default to node->order if no FDO is available.
    I think it is better than our current completely backward ordering.
    We may tune clustering to beat node->order too.
 2) if FDO but no LTO is available, use tp_first_run, since the sections
    combine well across translation units.
    This should also apply to incremental LTO builds, but not to
    -fwhole-program builds.
 3) if FDO and non-incremental LTO or -fwhole-program is available, use
    clustering.
Moreover, with current experiences I think flag_profile_reorder_functions
and flag_reorder_functions_algorithm need to be Optimization in the
.opt file, and thus the code needs to be able to work meaningfully on
mixed units.  Something like placing

1) functions with tp_first_run
2) functions built with clustering
3) remaining functions

and always doing all three algorithms depending on the opt flags.  More
funnily, because the preferred logic is not known until link time, I guess
we want flag_reorder_functions_algorithm to have a REORDER_DEFAULT value,
which will make the above choice based on the presence of a profile
(tested by node->count.ipa ().nonzero_p ()), tp_first_run, in_lto_p, and
flag_whole_program.
> +  order.qsort (cgraph_node_cmp_by_text_sorted);
> +  noreorder.qsort (cgraph_node_cmp_by_text_sorted);

This also needs updating after my mainline fixes. We do not want to sort
noreorder this way.
> +  cgraph_node *node = cgraph_node::get (decl);
> +  if (freq != NODE_FREQUENCY_UNLIKELY_EXECUTED && node->text_sorted_order > 0)
> +    {
> +      char section_name[64];
> +      sprintf (section_name, ".text.sorted.%010d",
> +	       node->text_sorted_order);
> +      return get_named_text_section (decl, section_name, NULL);
> +    }
> +

I would still control this by a flag (again with a bit complicated
default).  First, there are non-GNU linkers, and second, it only works
well for one translation unit in the whole DSO unless we do time
profiling.  Also, when we go this way, we want to disable sorting in
cgraphunit.c so we do not lose late propagators for no benefit.  Finally,
I still think gas fragments would be useful here to save the extra cost
of alignments and also to be able to use short call/jmp instructions
across functions.  -ffunction-sections is not free.

We still have a problem with functions in a COMDAT group, right?  You
should be able to see them with -flto -fno-use-linker-plugin.

Honza
Andi Kleen Dec. 11, 2019, 8:54 p.m. UTC | #27
Martin Liška <mliska@suse.cz> writes:
>
> Notes and limitations:
> - The call-chain-clustering algorithm requires to fit as many as possible functions into page size (4K).
>   Based on my measurements that should correspond to ~1000 GIMPLE statements (IPA inliner size). I can
>   make it a param in the future.

That sounds inaccurate. I would assume e.g. integer code has a
quite different factor than floating point code. Perhaps it would
need something fancier.

If you use a static factor you probably would need to calibrate it
over a lot more code.

Anyways, I think it should be a param, or better even an option, because there's
a trend towards using 2MB code pages on x86. Linux has a number of ways to do
that today.

Also of course that's needed for other architectures.

-Andi
diff mbox series

Patch

From d9c81529ae2a66bd51c49d83e19edb58b19cbf97 Mon Sep 17 00:00:00 2001
From: Martin Liska <mliska@suse.cz>
Date: Thu, 5 Sep 2019 13:32:41 +0200
Subject: [PATCH] Add new ipa-reorder pass.

---
 gcc/Makefile.in         |   1 +
 gcc/cgraph.c            |   2 +
 gcc/cgraph.h            |   2 +
 gcc/cgraphclones.c      |   1 +
 gcc/common.opt          |  14 ++
 gcc/flag-types.h        |   8 +
 gcc/ipa-reorder.c       | 333 ++++++++++++++++++++++++++++++++++++++++
 gcc/lto-cgraph.c        |   2 +
 gcc/lto/lto-partition.c |  18 ---
 gcc/passes.def          |   1 +
 gcc/timevar.def         |   1 +
 gcc/tree-pass.h         |   1 +
 gcc/varasm.c            |   9 ++
 13 files changed, 375 insertions(+), 18 deletions(-)
 create mode 100644 gcc/ipa-reorder.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 152df9fa9b3..c0efef626c5 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1368,6 +1368,7 @@  OBJS = \
 	init-regs.o \
 	internal-fn.o \
 	ipa-cp.o \
+	ipa-reorder.o \
 	ipa-devirt.o \
 	ipa-fnsummary.o \
 	ipa-polymorphic-call.o \
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 843891e9e56..a0f350c26c0 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2041,6 +2041,8 @@  cgraph_node::dump (FILE *f)
     }
   if (tp_first_run > 0)
     fprintf (f, " first_run:%i", tp_first_run);
+  if (text_sorted_order > 0)
+    fprintf (f, " text_sorted_order:%i", text_sorted_order);
   if (origin)
     fprintf (f, " nested in:%s", origin->asm_name ());
   if (gimple_has_body_p (decl))
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 4c54210123a..24c0b2be321 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1446,6 +1446,8 @@  struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node
   unsigned int profile_id;
   /* Time profiler: first run of function.  */
   int tp_first_run;
+  /* Order in .text.sorted.* section.  */
+  int text_sorted_order;
 
   /* Set when decl is an abstract function pointed to by the
      ABSTRACT_DECL_ORIGIN of a reachable function.  */
diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
index fa753697c78..4d47b3e7574 100644
--- a/gcc/cgraphclones.c
+++ b/gcc/cgraphclones.c
@@ -462,6 +462,7 @@  cgraph_node::create_clone (tree new_decl, profile_count prof_count,
   new_node->rtl = rtl;
   new_node->frequency = frequency;
   new_node->tp_first_run = tp_first_run;
+  new_node->text_sorted_order = text_sorted_order;
   new_node->tm_clone = tm_clone;
   new_node->icf_merged = icf_merged;
   new_node->merged_comdat = merged_comdat;
diff --git a/gcc/common.opt b/gcc/common.opt
index 1b9e0f3c802..e68f28f6401 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2252,6 +2252,20 @@  freorder-functions
 Common Report Var(flag_reorder_functions) Optimization
 Reorder functions to improve code placement.
 
+freorder-functions-algorithm=
+Common Joined RejectNegative Enum(reorder_functions_algorithm) Var(flag_reorder_functions_algorithm) Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization
+-freorder-functions-algorithm=[first-run|call-chain-clustering]	Set the used function reordering algorithm.
+
+Enum
+Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) UnknownError(unknown function reordering algorithm %qs)
+
+EnumValue
+Enum(reorder_functions_algorithm) String(first-run) Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN)
+
+EnumValue
+Enum(reorder_functions_algorithm) String(call-chain-clustering) Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING)
+
+
 frerun-cse-after-loop
 Common Report Var(flag_rerun_cse_after_loop) Optimization
 Add a common subexpression elimination pass after loop optimizations.
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index a2103282d46..9658aa99ea5 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -138,6 +138,14 @@  enum reorder_blocks_algorithm
   REORDER_BLOCKS_ALGORITHM_STC
 };
 
+/* The algorithm used for function reordering.  */
+enum reorder_functions_algorithm
+{
+  REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN,
+  REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING
+};
+
+
 /* The algorithm used for the integrated register allocator (IRA).  */
 enum ira_algorithm
 {
diff --git a/gcc/ipa-reorder.c b/gcc/ipa-reorder.c
new file mode 100644
index 00000000000..d9ddc468968
--- /dev/null
+++ b/gcc/ipa-reorder.c
@@ -0,0 +1,333 @@ 
+/* Reorder functions based on profile.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "cgraph.h"
+#include "symbol-summary.h"
+#include "tree-vrp.h"
+#include "alloc-pool.h"
+#include "ipa-prop.h"
+#include "ipa-fnsummary.h"
+#include "fibonacci_heap.h"
+#include <limits>
+
+using namespace std;
+
+namespace {
+
+#define C3_CLUSTER_THRESHOLD 1024
+
+struct cluster_edge;
+
+/* Cluster is set of functions considered in C3 algorithm.  */
+
+struct cluster
+{
+  cluster (cgraph_node *node, int size, sreal time):
+    m_functions (), m_callers (), m_size (size), m_time (time)
+  {
+    m_functions.safe_push (node);
+  }
+
+  vec<cgraph_node *> m_functions;
+  hash_map <cluster *, cluster_edge *> m_callers;
+  int m_size;
+  sreal m_time;
+};
+
+/* Cluster edge is an oriented edge in between two clusters.  */
+
+struct cluster_edge
+{
+  cluster_edge (cluster *caller, cluster *callee, uint32_t count):
+    m_caller (caller), m_callee (callee), m_count (count), m_heap_node (NULL)
+  {}
+
+
+  uint32_t inverted_count ()
+  {
+    return numeric_limits<uint32_t>::max () - m_count;
+  }
+
+  cluster *m_caller;
+  cluster *m_callee;
+  uint32_t m_count;
+  fibonacci_node<uint32_t, cluster_edge> *m_heap_node;
+};
+
+/* Sort functions based on the first execution of the function.  */
+
+static void
+sort_functions_by_first_run (void)
+{
+  cgraph_node *node;
+
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run && !node->alias)
+      node->text_sorted_order = node->tp_first_run;
+}
+
+/* Compare clusters by density after they are established.  */
+
+static int
+cluster_cmp (const void *a_p, const void *b_p)
+{
+  const cluster *a = *(cluster * const *)a_p;
+  const cluster *b = *(cluster * const *)b_p;
+
+  unsigned fncounta = a->m_functions.length ();
+  unsigned fncountb = b->m_functions.length ();
+  if (fncounta <= 1 || fncountb <= 1)
+    return fncountb - fncounta;
+
+  sreal r = b->m_time * a->m_size - a->m_time * b->m_size;
+  return (r < 0) ? -1 : ((r > 0) ? 1 : 0);
+}
+
+/* Sort functions based on call chain clustering, which is an algorithm
+   described in the following article:
+   https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+   (Ottoni and Maher, CGO 2017).  */
+
+static void
+sort_functions_by_c3 (void)
+{
+  cgraph_node *node;
+  auto_vec<cluster *> clusters;
+
+  /* Create a cluster for each function.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (!node->alias
+	&& !node->global.inlined_to)
+      {
+	ipa_fn_summary *summary = ipa_fn_summaries->get (node);
+	cluster *c = new cluster (node, summary->size, summary->time);
+	node->aux = c;
+	clusters.safe_push (c);
+      }
+
+  auto_vec<cluster_edge *> edges;
+
+  /* Insert edges between clusters that have a profile.  */
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cgraph_node *node = clusters[i]->m_functions[0];
+      for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
+	if (cs->count.reliable_p ()
+	    && cs->count.to_gcov_type () > 0)
+	  {
+	    cluster *caller = (cluster *)cs->caller->aux;
+	    cluster *callee = (cluster *)cs->callee->aux;
+	    gcov_type count = cs->count.to_gcov_type ();
+
+	    cluster_edge **cedge = callee->m_callers.get (caller);
+	    if (cedge != NULL)
+	      (*cedge)->m_count += count;
+	    else
+	      {
+		cluster_edge *cedge = new cluster_edge (caller, callee, count);
+		edges.safe_push (cedge);
+		callee->m_callers.put (caller, cedge);
+	      }
+	  }
+    }
+
+  /* Now insert all created edges into a heap.  */
+  fibonacci_heap <uint32_t, cluster_edge> heap (0);
+
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cluster *c = clusters[i];
+      for (hash_map<cluster *, cluster_edge *>::iterator it
+	   = c->m_callers.begin (); it != c->m_callers.end (); ++it)
+	{
+	  cluster_edge *cedge = (*it).second;
+	  cedge->m_heap_node = heap.insert (cedge->inverted_count (), cedge);
+	}
+    }
+
+  while (!heap.empty ())
+    {
+      cluster_edge *cedge = heap.extract_min ();
+      cluster *caller = cedge->m_caller;
+      cluster *callee = cedge->m_callee;
+      cedge->m_heap_node = NULL;
+
+      if (caller == callee)
+	continue;
+      if (caller->m_size + callee->m_size <= C3_CLUSTER_THRESHOLD)
+	{
+	  caller->m_size += callee->m_size;
+	  caller->m_time += callee->m_time;
+
+	  /* Append all cgraph_nodes from callee to caller.  */
+	  for (unsigned i = 0; i < callee->m_functions.length (); i++)
+	    caller->m_functions.safe_push (callee->m_functions[i]);
+
+	  callee->m_functions.truncate (0);
+
+	  /* Redirect all cluster_edges of CALLEE to CALLER, merging the
+	     counts of edges that now share the same caller.  */
+	  for (hash_map<cluster *, cluster_edge *>::iterator it
+	       = callee->m_callers.begin (); it != callee->m_callers.end ();
+	       ++it)
+	    {
+	      (*it).second->m_callee = caller;
+	      cluster_edge **ce = caller->m_callers.get ((*it).first);
+
+	      if (ce != NULL)
+		{
+		  (*ce)->m_count += (*it).second->m_count;
+		  if ((*ce)->m_heap_node != NULL)
+		    heap.decrease_key ((*ce)->m_heap_node,
+				       (*ce)->inverted_count ());
+		}
+	      else
+		caller->m_callers.put ((*it).first, (*it).second);
+	    }
+	}
+    }
+
+  /* Sort the candidate clusters.  */
+  clusters.qsort (cluster_cmp);
+
+  /* Dump clusters.  */
+  if (dump_file)
+    {
+      for (unsigned i = 0; i < clusters.length (); i++)
+	{
+	  cluster *c = clusters[i];
+	  if (c->m_functions.length () <= 1)
+	    continue;
+
+	  fprintf (dump_file, "Cluster %u with functions: %u, size: %d,"
+		   " density: %f\n", i, c->m_functions.length (), c->m_size,
+		   (c->m_time / c->m_size).to_double ());
+	  fprintf (dump_file, "  functions: ");
+	  for (unsigned j = 0; j < c->m_functions.length (); j++)
+	    fprintf (dump_file, "%s ", c->m_functions[j]->dump_name ());
+	  fprintf (dump_file, "\n");
+	}
+      fprintf (dump_file, "\n");
+    }
+
+  /* Assign .text.sorted.* section names.  */
+  int counter = 1;
+  for (unsigned i = 0; i < clusters.length (); i++)
+    {
+      cluster *c = clusters[i];
+      if (c->m_functions.length () <= 1)
+	continue;
+
+      for (unsigned j = 0; j < c->m_functions.length (); j++)
+	{
+	  cgraph_node *node = c->m_functions[j];
+
+	  if (dump_file)
+	    fprintf (dump_file, "setting: %d for %s with size:%d\n",
+		     counter, node->dump_asm_name (),
+		     ipa_fn_summaries->get (node)->size);
+	  node->text_sorted_order = counter++;
+	}
+    }
+
+  /* Release memory.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (!node->alias)
+      node->aux = NULL;
+
+  for (unsigned i = 0; i < clusters.length (); i++)
+    delete clusters[i];
+
+  for (unsigned i = 0; i < edges.length (); i++)
+    delete edges[i];
+}
+
+/* The main function for function sorting.  */
+
+static unsigned int
+ipa_reorder (void)
+{
+  switch (flag_reorder_functions_algorithm)
+    {
+    case REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING:
+      sort_functions_by_c3 ();
+      break;
+    case REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN:
+      sort_functions_by_first_run ();
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  return 0;
+}
+
+const pass_data pass_data_ipa_reorder =
+{
+  IPA_PASS, /* type */
+  "reorder", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_IPA_REORDER, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_reorder : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_reorder (gcc::context *ctxt)
+    : ipa_opt_pass_d (pass_data_ipa_reorder, ctxt,
+		      NULL, /* generate_summary */
+		      NULL, /* write_summary */
+		      NULL, /* read_summary */
+		      NULL, /* write_optimization_summary */
+		      NULL, /* read_optimization_summary */
+		      NULL, /* stmt_fixup */
+		      0, /* function_transform_todo_flags_start */
+		      NULL, /* function_transform */
+		      NULL) /* variable_transform */
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *) { return ipa_reorder (); }
+
+}; // class pass_ipa_reorder
+
+bool
+pass_ipa_reorder::gate (function *)
+{
+  return flag_profile_reorder_functions && flag_profile_use && flag_wpa;
+}
+
+} // anon namespace
+
+ipa_opt_pass_d *
+make_pass_ipa_reorder (gcc::context *ctxt)
+{
+  return new pass_ipa_reorder (ctxt);
+}
diff --git a/gcc/lto-cgraph.c b/gcc/lto-cgraph.c
index bc0f0107333..0a424a2ad12 100644
--- a/gcc/lto-cgraph.c
+++ b/gcc/lto-cgraph.c
@@ -505,6 +505,7 @@  lto_output_node (struct lto_simple_output_block *ob, struct cgraph_node *node,
     section = "";
 
   streamer_write_hwi_stream (ob->main_stream, node->tp_first_run);
+  streamer_write_uhwi_stream (ob->main_stream, node->text_sorted_order);
 
   bp = bitpack_create (ob->main_stream);
   bp_pack_value (&bp, node->local.local, 1);
@@ -1277,6 +1278,7 @@  input_node (struct lto_file_decl_data *file_data,
 		    "node with uid %d", node->get_uid ());
 
   node->tp_first_run = streamer_read_uhwi (ib);
+  node->text_sorted_order = streamer_read_uhwi (ib);
 
   bp = streamer_read_bitpack (ib);
 
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 6972e6e9ef3..8937e64ef72 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -381,24 +381,6 @@  node_cmp (const void *pa, const void *pb)
   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
 
-  /* Profile reorder flag enables function reordering based on first execution
-     of a function. All functions with profile are placed in ascending
-     order at the beginning.  */
-
-  if (flag_profile_reorder_functions)
-  {
-    /* Functions with time profile are sorted in ascending order.  */
-    if (a->tp_first_run && b->tp_first_run)
-      return a->tp_first_run != b->tp_first_run
-	? a->tp_first_run - b->tp_first_run
-        : a->order - b->order;
-
-    /* Functions with time profile are sorted before the functions
-       that do not have the profile.  */
-    if (a->tp_first_run || b->tp_first_run)
-      return b->tp_first_run - a->tp_first_run;
-  }
-
   return b->order - a->order;
 }
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 93879223a8b..6caf2ac5cda 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -153,6 +153,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_ipa_fn_summary);
   NEXT_PASS (pass_ipa_inline);
   NEXT_PASS (pass_ipa_pure_const);
+  NEXT_PASS (pass_ipa_reorder);
   NEXT_PASS (pass_ipa_free_fn_summary, false /* small_p */);
   NEXT_PASS (pass_ipa_reference);
   /* This pass needs to be scheduled after any IP code duplication.   */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 3551a433b1f..068c902424f 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -76,6 +76,7 @@  DEFTIMEVAR (TV_IPA_INHERITANCE       , "ipa inheritance graph")
 DEFTIMEVAR (TV_IPA_VIRTUAL_CALL      , "ipa virtual call target")
 DEFTIMEVAR (TV_IPA_DEVIRT	     , "ipa devirtualization")
 DEFTIMEVAR (TV_IPA_CONSTANT_PROP     , "ipa cp")
+DEFTIMEVAR (TV_IPA_REORDER	     , "ipa reorder")
 DEFTIMEVAR (TV_IPA_INLINING          , "ipa inlining heuristics")
 DEFTIMEVAR (TV_IPA_FNSPLIT           , "ipa function splitting")
 DEFTIMEVAR (TV_IPA_COMDATS	     , "ipa comdats")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 7106eba54af..10bc4cc125d 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -500,6 +500,7 @@  extern ipa_opt_pass_d *make_pass_ipa_inline (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_free_lang_data (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_free_fn_summary (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_cp (gcc::context *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_reorder (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_icf (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_devirt (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_reference (gcc::context *ctxt);
diff --git a/gcc/varasm.c b/gcc/varasm.c
index a7c22523f9f..498b9ed59fd 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -601,6 +601,15 @@  default_function_section (tree decl, enum node_frequency freq,
   if (exit && freq != NODE_FREQUENCY_UNLIKELY_EXECUTED)
     return get_named_text_section (decl, ".text.exit", NULL);
 
+  cgraph_node *node = cgraph_node::get (decl);
+  if (node != NULL
+      && node->text_sorted_order > 0
+      && freq != NODE_FREQUENCY_UNLIKELY_EXECUTED)
+    {
+      char section_name[64];
+      sprintf (section_name, ".text.sorted.%010d",
+	       node->text_sorted_order);
+      return get_named_text_section (decl, section_name, NULL);
+    }
+
   /* Group cold functions together, similarly for hot code.  */
   switch (freq)
     {
-- 
2.23.0