diff mbox

Time profiler - phase 2

Message ID CAObPJ3OmMabfRtMFEFrUWddcYxnZ7VGeGE0CL=6bkQzdxF8vow@mail.gmail.com
State New
Headers show

Commit Message

Martin Liška Nov. 18, 2013, 7:41 a.m. UTC
Hello,
   there's new version of the patch. I wrote email to Cary to
negotiate how will implement gold's linker patch.

Thanks,
Martin

On 18 November 2013 00:37, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hello,
>>    there is a new version of the patch, I disabled the branch with
>> profile-generate. Could you please advise me how should I force to use
>> profile-reorder-functions just with enable LTO optimization?
>>
>> I also attach reordering results:
>> o gimp-reoder-latest.html (latest patch)
>> o gimp-reoder-without-fix.html (without the code in gcc/predict.c)
>>
>> Thank you,
>> Martin
>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>> index 5cb07b7..754f882 100644
>> --- a/gcc/ChangeLog
>> +++ b/gcc/ChangeLog
>> @@ -1,3 +1,13 @@
>> +2013-11-17  Martin Liska  <marxin.liska@gmail.com>
>> +         Jan Hubicka  <jh@suse.cz>
>> +
>> +     * cgraphunit.c (node_cmp): New function.
>> +     (expand_all_functions): Function ordering added.
>> +     * common.opt: New profile based function reordering flag introduced.
>> +     * lto-partition.c: Support for time profile added.
>> +     * lto.c: Likewise.
>> +     * predict.c (handle_missing_profiles): Time profile handled in
>> +       missing profiles.
>
> OK,
> thanks!
>> @@ -8645,7 +8645,7 @@ profile useful for later recompilation with profile feedback based
>>  optimization.  You must use @option{-fprofile-generate} both when
>>  compiling and when linking your program.
>>
>> -The following options are enabled: @code{-fprofile-arcs}, @code{-fprofile-values}, @code{-fvpt}.
>> +The following options are enabled: @code{-fprofile-arcs}, @code{-fprofile-values}, @code{-fprofile-reorder-functions}, @code{-fvpt}.
>>
>>  If @var{path} is specified, GCC looks at the @var{path} to find
>>  the profile feedback data files. See @option{-fprofile-dir}.
>
> Skip this change, it is only about the profiling options used.  I think it is enough to mention it later.
>> @@ -8933,6 +8933,14 @@ from profiling values of expressions for usage in optimizations.
>>
>>  Enabled with @option{-fprofile-generate} and @option{-fprofile-use}.
>>
>> +@item -fprofile-reoder-functions
>> +@opindex fprofile-reorder-functions
>> +Function reordering based on profile instrumentation collects
>> +first time of execution of a function and orders these functions
>> +in ascending order.
>> +
>> +Enabled with @option{-fprofile-generate} and @option{-fprofile-use}.
>
> Only with -fprofile-use.
>
> What happened with the plans for linker support? Perhaps we can implement the numbered sections by Carry's proposal and hope that binutils will catch up in next release?
>
> Honza

Comments

Jan Hubicka Nov. 18, 2013, 10:16 a.m. UTC | #1
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 5cb07b7..754f882 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,13 @@
> +2013-11-17  Martin Liska  <marxin.liska@gmail.com>
> +	    Jan Hubicka  <jh@suse.cz>
> +
> +	* cgraphunit.c (node_cmp): New function.
> +	(expand_all_functions): Function ordering added.
> +	* common.opt: New profile based function reordering flag introduced.
> +	* lto-partition.c: Support for time profile added.
> +	* lto.c: Likewise.
> +	* predict.c (handle_missing_profiles): Time profile handled in
> +	  missing profiles.

OK,
thanks!  Implementing the function section naming scheme would be easy and it would
enable us to do the reordering even w/o LTO that would be quite cool. Lets hope it gets
resolved soon.

Honza
Martin Liška Dec. 2, 2013, 11:37 p.m. UTC | #2
Hello,
   there are dumps for Inkscape, it looks very well.

Link: https://docs.google.com/file/d/0B0pisUJ80pO1Y0t1aEVBRlByR28/edit

There are few of functions that look like this (wpa cgraph):

_ZL13resync_activeP19_EgeSelectOneActionii/2604322 (resync_active)
@0x7f84af42cea0
  Type: function definition analyzed
  Visibility: prevailing_def_ironly
  References:
  Referring:
  Read from file: libinkscape.a
  Availability: local
  First run: 4422
  Function flags: executed 47x local
  Called by: ege_select_one_action_set_active_text/2604300 (0.34 per
call) (can throw external)
_ZL21commit_pending_changeP19_EgeSelectOneAction/2604327 (0.16 per
call) (can throw external)
_ZL34ege_select_one_action_set_propertyP8_GObjectjPK7_GValueP11_GParamSpec/2604316
(47x) (0.16 per call) (can throw external)
  Calls: _ZL13resync_activeP19_EgeSelectOneActionii.part.0/2604456
(10x) (0.21 per call) (can throw external)

_ZL13resync_activeP19_EgeSelectOneActionii.part.0/2604456
(_ZL13resync_activeP19_EgeSelectOneActionii.part.0) @0x7f84af42cd68
  Type: function definition analyzed
  Visibility: artificial
  References: _ZL7signals/2604291 (read)
  Referring:
  Read from file: libinkscape.a
  Availability: local
  First run: 0
  Function flags: executed 10x local
  Called by: _ZL13resync_activeP19_EgeSelectOneActionii/2604322 (10x)
(0.21 per call) (can throw external)

First function has a profile (position is correct according to
valgrind) and second not. Both of them comes from the same object
file. The problem is that the second one is called according to
valgrind. What does .part.X means, is it a part of function that was
separated to a different function? Is there any was these two profiles
could be merged?

Thank you,
Martin

On 27 November 2013 00:24, Martin Liška <marxin.liska@gmail.com> wrote:
> Hello,
>     present results reached for GIMP by the reordering pass. Important
> to notice, that I just used single '.text' section where all symbols
> are placed. As you can see, there just few functions that are not
> catched by the pass (3 of them are LTO clones, that I will find out).
> And 2 functions were not seen during -fprofile-generate run.
>
> In following days, I will prepare same dumps for Inkscape and Firefox.
>
> Martin
>
> On 18 November 2013 11:16, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>>> index 5cb07b7..754f882 100644
>>> --- a/gcc/ChangeLog
>>> +++ b/gcc/ChangeLog
>>> @@ -1,3 +1,13 @@
>>> +2013-11-17  Martin Liska  <marxin.liska@gmail.com>
>>> +         Jan Hubicka  <jh@suse.cz>
>>> +
>>> +     * cgraphunit.c (node_cmp): New function.
>>> +     (expand_all_functions): Function ordering added.
>>> +     * common.opt: New profile based function reordering flag introduced.
>>> +     * lto-partition.c: Support for time profile added.
>>> +     * lto.c: Likewise.
>>> +     * predict.c (handle_missing_profiles): Time profile handled in
>>> +       missing profiles.
>>
>> OK,
>> thanks!  Implementing the function section naming scheme would be easy and it would
>> enable us to do the reordering even w/o LTO that would be quite cool. Lets hope it gets
>> resolved soon.
>>
>> Honza
diff mbox

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 5cb07b7..754f882 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,13 @@ 
+2013-11-17  Martin Liska  <marxin.liska@gmail.com>
+	    Jan Hubicka  <jh@suse.cz>
+
+	* cgraphunit.c (node_cmp): New function.
+	(expand_all_functions): Function ordering added.
+	* common.opt: New profile based function reordering flag introduced.
+	* lto-partition.c: Support for time profile added.
+	* lto.c: Likewise.
+	* predict.c (handle_missing_profiles): Time profile handled in
+	  missing profiles.
 2013-11-16  Joern Rennecke  <joern.rennecke@embecosm.com>
 
 	* config/arc/arc.c (arc_predicate_delay_insns): New function.
diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index 8ab274b..ea722b8 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -1824,6 +1824,19 @@  expand_function (struct cgraph_node *node)
   ipa_remove_all_references (&node->ref_list);
 }
 
+/* Node comparer that is responsible for the order that corresponds
+   to time when a function was launched for the first time.  */
+
+static int
+node_cmp (const void *pa, const void *pb)
+{
+  const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
+  const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
+
+  return a->tp_first_run != b->tp_first_run
+	 ? b->tp_first_run - a->tp_first_run
+	 : b->order - a->order;
+}
 
 /* Expand all functions that must be output.
 
@@ -1835,11 +1848,14 @@  expand_function (struct cgraph_node *node)
    to use subsections to make the output functions appear in top-down
    order).  */
 
+
 static void
 expand_all_functions (void)
 {
   struct cgraph_node *node;
   struct cgraph_node **order = XCNEWVEC (struct cgraph_node *, cgraph_n_nodes);
+
+  unsigned int expanded_func_count = 0, profiled_func_count = 0;
   int order_pos, new_order_pos = 0;
   int i;
 
@@ -1852,19 +1868,35 @@  expand_all_functions (void)
     if (order[i]->process)
       order[new_order_pos++] = order[i];
 
+  if (flag_profile_reorder_functions)
+    qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
+
   for (i = new_order_pos - 1; i >= 0; i--)
     {
       node = order[i];
+
       if (node->process)
 	{
+     expanded_func_count++;
+     if(node->tp_first_run)
+       profiled_func_count++;
+
 	  node->process = 0;
 	  expand_function (node);
 	}
     }
+
+    if (in_lto_p && dump_file)
+      fprintf (dump_file, "Expanded functions with time profile (%s):%u/%u\n",
+               main_input_filename, profiled_func_count, expanded_func_count);
+
+  if (cgraph_dump_file && flag_profile_reorder_functions && in_lto_p)
+    fprintf (cgraph_dump_file, "Expanded functions with time profile:%u/%u\n",
+             profiled_func_count, expanded_func_count);
+
   cgraph_process_new_functions ();
 
   free (order);
-
 }
 
 /* This is used to sort the node types by the cgraph order number.  */
@@ -2189,6 +2221,7 @@  compile (void)
 #endif
 
   cgraph_state = CGRAPH_STATE_EXPANSION;
+
   if (!flag_toplevel_reorder)
     output_in_order ();
   else
diff --git a/gcc/common.opt b/gcc/common.opt
index d5971df..85d5c74 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1708,6 +1708,10 @@  fprofile-report
 Common Report Var(profile_report)
 Report on consistency of profile
 
+fprofile-reorder-functions
+Common Report Var(flag_profile_reorder_functions)
+Enable function reordering that improves code placement
+
 frandom-seed
 Common Var(common_deferred_options) Defer
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index ff4c2ee..e4972ab 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -393,7 +393,7 @@  Objective-C and Objective-C++ Dialects}.
 -fprefetch-loop-arrays -fprofile-report @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
--fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
+-fprofile-use -fprofile-use=@var{path} -fprofile-values -fprofile-reorder-functions @gol
 -freciprocal-math -free -frename-registers -freorder-blocks @gol
 -freorder-blocks-and-partition -freorder-functions @gol
 -frerun-cse-after-loop -freschedule-modulo-scheduled-loops @gol
@@ -8933,6 +8933,14 @@  from profiling values of expressions for usage in optimizations.
 
 Enabled with @option{-fprofile-generate} and @option{-fprofile-use}.
 
+@item -fprofile-reoder-functions
+@opindex fprofile-reorder-functions
+Function reordering based on profile instrumentation collects
+first time of execution of a function and orders these functions
+in ascending order.
+
+Enabled with @option{-fprofile-use}.
+
 @item -fvpt
 @opindex fvpt
 If combined with @option{-fprofile-arcs}, this option instructs the compiler
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 6a3d881..42e4aef 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -280,9 +280,11 @@  add_symbol_to_partition (ltrans_partition part, symtab_node *node)
 
      Be lax about comdats; they may or may not be duplicated and we may
      end up in need to duplicate keyed comdat because it has unkeyed alias.  */
+
   gcc_assert (get_symbol_class (node) == SYMBOL_DUPLICATE
 	      || DECL_COMDAT (node->decl)
 	      || !symbol_partitioned_p (node));
+
   add_symbol_to_partition_1 (part, node);
 }
 
@@ -395,6 +397,25 @@  node_cmp (const void *pa, const void *pb)
 {
   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
+
+  /* Profile reorder flag enables function reordering based on first execution
+     of a function. All functions with profile are placed in ascending
+     order at the beginning.  */
+
+  if (flag_profile_reorder_functions)
+  {
+    /* Functions with time profile are sorted in ascending order.  */
+    if (a->tp_first_run && b->tp_first_run)
+      return a->tp_first_run != b->tp_first_run
+	? a->tp_first_run - b->tp_first_run
+        : a->order - b->order;
+
+    /* Functions with time profile are sorted before the functions
+       that do not have the profile.  */
+    if (a->tp_first_run || b->tp_first_run)
+      return b->tp_first_run - a->tp_first_run;
+  }
+
   return b->order - a->order;
 }
 
@@ -449,7 +470,7 @@  void
 lto_balanced_map (void)
 {
   int n_nodes = 0;
-  int n_varpool_nodes = 0, varpool_pos = 0, best_varpool_pos = 0;
+  int n_varpool_nodes = 0, varpool_pos = 0;
   struct cgraph_node **order = XNEWVEC (struct cgraph_node *, cgraph_max_uid);
   struct varpool_node **varpool_order = NULL;
   int i;
@@ -481,10 +502,13 @@  lto_balanced_map (void)
      get better about minimizing the function bounday, but until that
      things works smoother if we order in source order.  */
   qsort (order, n_nodes, sizeof (struct cgraph_node *), node_cmp);
+
+  if (cgraph_dump_file)
+    for(i = 0; i < n_nodes; i++)
+      fprintf (cgraph_dump_file, "Balanced map symbol order:%s:%u\n", cgraph_node_asm_name (order[i]), order[i]->tp_first_run);
+
   if (!flag_toplevel_reorder)
     {
-      qsort (order, n_nodes, sizeof (struct cgraph_node *), node_cmp);
-
       FOR_EACH_VARIABLE (vnode)
 	if (get_symbol_class (vnode) == SYMBOL_PARTITION)
 	  n_varpool_nodes++;
@@ -683,7 +707,6 @@  lto_balanced_map (void)
 	  best_i = i;
 	  best_n_nodes = lto_symtab_encoder_size (partition->encoder);
 	  best_total_size = total_size;
-	  best_varpool_pos = varpool_pos;
 	}
       if (cgraph_dump_file)
 	fprintf (cgraph_dump_file, "Step %i: added %s/%i, size %i, cost %i/%i "
@@ -701,7 +724,6 @@  lto_balanced_map (void)
 		fprintf (cgraph_dump_file, "Unwinding %i insertions to step %i\n",
 			 i - best_i, best_i);
 	      undo_partition (partition, best_n_nodes);
-	      varpool_pos = best_varpool_pos;
 	    }
 	  i = best_i;
  	  /* When we are finished, avoid creating empty partition.  */
@@ -849,7 +871,7 @@  may_need_named_section_p (lto_symtab_encoder_t encoder, symtab_node *node)
    of the same name in partition ENCODER (or in whole compilation unit if
    ENCODER is NULL) and if so, mangle the statics.  Always mangle all
    conflicting statics, so we reduce changes of silently miscompiling
-   asm statemnets referring to them by symbol name.  */
+   asm statements referring to them by symbol name.  */
 
 static void
 rename_statics (lto_symtab_encoder_t encoder, symtab_node *node)
diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index 62856d0..2401c76 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -2453,9 +2453,12 @@  lto_wpa_write_files (void)
   /* Sort partitions by size so small ones are compiled last.
      FIXME: Even when not reordering we may want to output one list for parallel make
      and other for final link command.  */
-  ltrans_partitions.qsort (flag_toplevel_reorder
+
+  if (!flag_profile_reorder_functions || !flag_profile_use)
+    ltrans_partitions.qsort (flag_toplevel_reorder
 			   ? cmp_partitions_size
 			   : cmp_partitions_order);
+
   for (i = 0; i < n_sets; i++)
     {
       size_t len;
diff --git a/gcc/opts.c b/gcc/opts.c
index 3a939ac..b842543 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -1690,6 +1690,8 @@  common_handle_option (struct gcc_options *opts,
 	opts->x_flag_vect_cost_model = VECT_COST_MODEL_DYNAMIC;
       if (!opts_set->x_flag_tree_loop_distribute_patterns)
 	opts->x_flag_tree_loop_distribute_patterns = value;
+      if (!opts_set->x_flag_profile_reorder_functions)
+	opts->x_flag_profile_reorder_functions = value;
       /* Indirect call profiling should do all useful transformations
  	 speculative devirutalization does.  */
       if (!opts_set->x_flag_devirtualize_speculatively
diff --git a/gcc/predict.c b/gcc/predict.c
index 61cac52..39a0ef7 100644
--- a/gcc/predict.c
+++ b/gcc/predict.c
@@ -2840,12 +2840,24 @@  handle_missing_profiles (void)
     {
       struct cgraph_edge *e;
       gcov_type call_count = 0;
+      gcov_type max_tp_first_run = 0;
       struct function *fn = DECL_STRUCT_FUNCTION (node->decl);
 
       if (node->count)
         continue;
       for (e = node->callers; e; e = e->next_caller)
+      {
         call_count += e->count;
+
+	if (e->caller->tp_first_run > max_tp_first_run)
+	  max_tp_first_run = e->caller->tp_first_run;
+      }
+
+      /* If time profile is missing, let assign the maximum that comes from
+	 caller functions.  */
+      if (!node->tp_first_run && max_tp_first_run)
+	node->tp_first_run = max_tp_first_run + 1;
+
       if (call_count
           && fn && fn->cfg
           && (call_count * unlikely_count_fraction >= profile_info->runs))