
Handle more COMDAT profiling issues

Message ID 20140526054323.GD9540@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka May 26, 2014, 5:43 a.m. UTC
> This patch attempts to address the lost profile issue for COMDATs in
> more circumstances, exposed by function splitting.
> 
> My earlier patch handled the case where the comdat had 0 counts since
> the linker kept the copy in a different module. In that case we
> prevent the guessed frequencies on 0-count functions from being
> dropped by counts_to_freqs, and then later mark any that are reached via
> non-zero callgraph edges as guessed. Finally, when one such 0-count
> comdat is inlined the call count is propagated to the callee blocks
> using the guessed probabilities.
> 
> However, in this case, there was a comdat that had a very small
> non-zero count that was being inlined into a much hotter callsite. This
> could happen when there was a copy that was ipa-inlined
> in the profile gen compile, so the copy in that module gets some
> non-zero counts from the ipa-inlined instance, but the out-of-line
> copy was eliminated by the linker (selected from a different
> module). In this case the inliner was scaling the bb counts up quite a
> lot when inlining. The problem is that you most likely can't trust
> that the 0-count bbs in such a case are really not executed by the
> callsite it is being inlined into, since the counts are very small and
> correspond to a different callsite. In some internal C++ applications
> I am seeing that we execute out of the split cold portion of COMDATs
> for this reason.
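
For reference, the scaling described above is essentially what copy_cfg_body
does when copying the callee body into the caller (a simplified sketch;
callsite_count and bb_copy are placeholder names, not the actual variables):

  /* Scale factor: the callsite count relative to the callee's own
     recorded entry count.  With a tiny, unrelated callee profile this
     ratio is huge.  */
  gcov_type count_scale
    = GCOV_COMPUTE_SCALE (callsite_count,
                          ENTRY_BLOCK_PTR_FOR_FN (src_cfun)->count);

  /* Each copied block count is multiplied by that factor, so blocks
     that merely happened to be 0 in the kept copy stay 0 while the
     rest are blown up to the hot callsite's scale.  */
  bb_copy->count = apply_scale (bb->count, count_scale);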

With all the problems we have with COMDATs, it still seems to me that perhaps
the best approach would be to merge them at runtime, as we discussed some time
ago, and incrementally handle the problem of how to get callstack-sensitive
profiles.
> 
> This problem is more complicated to address than the 0-count instance,
> because we need the cgraph to determine which functions to drop the
> profile on, and at that point the estimated frequencies have already
> been overwritten by counts_to_freqs. To handle this broader case, I
> have changed the drop_profile routine to simply re-estimate the
> probabilities/frequencies (and translate these into counts scaled from
> the incoming call edge counts). This unfortunately necessitates
> rebuilding the cgraph, to propagate the new synthesized counts and
> avoid checking failures downstream. But it will only be rebuilt if we
> dropped any profiles. With this solution, some of the older approach
> can be removed (e.g. propagating counts using the guessed
> probabilities during inlining).
> 
> Patch is below. Bootstrapped and tested on x86-64-unknown-linux-gnu.
> Also tested on
> a profile-use build of SPEC cpu2006. Ok for trunk when stage 1 reopens?
> 
> Thanks,
> Teresa

2014-02-12  Teresa Johnson  <tejohnson@google.com>

	* graphite.c (graphite_finalize): Pass new parameter.
	* params.def (PARAM_MIN_CALLER_REESTIMATE_RATIO): New.
	* predict.c (tree_estimate_probability): New parameter.
	(tree_estimate_probability_worker): Renamed from
        tree_estimate_probability_driver.
	(tree_reestimate_probability): New function.
	(tree_estimate_probability_driver): Invoke
        tree_estimate_probability_worker.
	(freqs_to_counts): Move here from tree-inline.c.
	(drop_profile): Re-estimate profiles when dropping counts.
	(handle_missing_profiles): Drop for some non-zero functions as well,
        get counts from bbs to support invocation before cgraph rebuild.
	(counts_to_freqs): Remove code obviated by reestimation.
	* predict.h (tree_estimate_probability): Update declaration.
	* tree-inline.c (freqs_to_counts): Move to predict.c.
	(copy_cfg_body): Remove code obviated by reestimation.
	* tree-profile.c (tree_profiling): Invoke handle_missing_profiles
        before cgraph rebuild.

   connect_infinite_loops_to_exit ();
   /* We use loop_niter_by_eval, which requires that the loops have
      preheaders.  */
-  create_preheaders (CP_SIMPLE_PREHEADERS);
+  if (!redo)
+    create_preheaders (CP_SIMPLE_PREHEADERS);

So you disable preheader construction because this all happens at inlining
time, when we do not want to change BBs because the callgraph is already built,
right?  Are you sure we do have preheaders built at IPA time for all functions
reliably?  (It would make sense to do so, but I think loop analysis is done
rather randomly by IPA analysis passes, such as the inliner.)
Otherwise the SCEV infrastructure will explode, if I remember correctly.

+/* Force re-estimation of the probabilities, because the profile was
+   deemed insufficient.  */
+
+static void
+tree_reestimate_probability (void)
+{
+  basic_block bb;
+  edge e;
+  edge_iterator ei;
+
+  /* Need to reset the counts to ensure probabilities are recomputed.  */
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      bb->count = 0;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+        e->count = 0;
+    }
+  tree_estimate_probability_worker (true);

You also want to remove the recorded loop count estimates, since they are also
driven by the profile we chose not to trust.

@@ -2842,48 +2927,77 @@ handle_missing_profiles (void)
       gcov_type max_tp_first_run = 0;
       struct function *fn = DECL_STRUCT_FUNCTION (node->decl);
 
-      if (node->count)
-        continue;
       for (e = node->callers; e; e = e->next_caller)
       {
-        call_count += e->count;
+        call_count += cgraph_function_with_gimple_body_p (e->caller)
+                      ? gimple_bb (e->call_stmt)->count : 0;

I am quite confused by this change - why don't you want to trust edge counts?
Other parts of the patch seem quite reasonable, but I don't really follow here.

Honza

Comments

Teresa Johnson July 7, 2014, 6:31 p.m. UTC | #1
Finally had some time to pick this up again. See responses and
questions below. But it looks like you are commenting on a stale
version of the patch. See the update I sent in March:

https://gcc.gnu.org/ml/gcc-patches/2014-03/msg01176.html


On Sun, May 25, 2014 at 10:43 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> This patch attempts to address the lost profile issue for COMDATs in
>> more circumstances, exposed by function splitting.
>>
>> My earlier patch handled the case where the comdat had 0 counts since
>> the linker kept the copy in a different module. In that case we
>> prevent the guessed frequencies on 0-count functions from being
>> dropped by counts_to_freqs, and then later mark any that are reached via
>> non-zero callgraph edges as guessed. Finally, when one such 0-count
>> comdat is inlined the call count is propagated to the callee blocks
>> using the guessed probabilities.
>>
>> However, in this case, there was a comdat that had a very small
>> non-zero count that was being inlined into a much hotter callsite. This
>> could happen when there was a copy that was ipa-inlined
>> in the profile gen compile, so the copy in that module gets some
>> non-zero counts from the ipa-inlined instance, but the out-of-line
>> copy was eliminated by the linker (selected from a different
>> module). In this case the inliner was scaling the bb counts up quite a
>> lot when inlining. The problem is that you most likely can't trust
>> that the 0-count bbs in such a case are really not executed by the
>> callsite it is being inlined into, since the counts are very small and
>> correspond to a different callsite. In some internal C++ applications
>> I am seeing that we execute out of the split cold portion of COMDATs
>> for this reason.
>
> With all the problems we have with COMDATs, it still seems to me that perhaps
> the best approach would be to merge them at runtime, as we discussed some time
> ago, and incrementally handle the problem of how to get callstack-sensitive
> profiles.

I recently committed some functionality to a google branch to do some
COMDAT profile merging at runtime, but that was to handle all-zero
count copies and operates on top of the LIPO runtime infrastructure.
We don't do any fixup in non-zero count COMDATs to avoid losing
context-sensitive profiles.

>>
>> This problem is more complicated to address than the 0-count instance,
>> because we need the cgraph to determine which functions to drop the
>> profile on, and at that point the estimated frequencies have already
>> been overwritten by counts_to_freqs. To handle this broader case, I
>> have changed the drop_profile routine to simply re-estimate the
>> probabilities/frequencies (and translate these into counts scaled from
>> the incoming call edge counts). This unfortunately necessitates
>> rebuilding the cgraph, to propagate the new synthesized counts and
>> avoid checking failures downstream. But it will only be rebuilt if we
>> dropped any profiles. With this solution, some of the older approach
>> can be removed (e.g. propagating counts using the guessed
>> probabilities during inlining).
>>
>> Patch is below. Bootstrapped and tested on x86-64-unknown-linux-gnu.
>> Also tested on
>> a profile-use build of SPEC cpu2006. Ok for trunk when stage 1 reopens?
>>
>> Thanks,
>> Teresa
>
> 2014-02-12  Teresa Johnson  <tejohnson@google.com>
>
>         * graphite.c (graphite_finalize): Pass new parameter.
>         * params.def (PARAM_MIN_CALLER_REESTIMATE_RATIO): New.
>         * predict.c (tree_estimate_probability): New parameter.
>         (tree_estimate_probability_worker): Renamed from
>         tree_estimate_probability_driver.
>         (tree_reestimate_probability): New function.
>         (tree_estimate_probability_driver): Invoke
>         tree_estimate_probability_worker.
>         (freqs_to_counts): Move here from tree-inline.c.
>         (drop_profile): Re-estimate profiles when dropping counts.
>         (handle_missing_profiles): Drop for some non-zero functions as well,
>         get counts from bbs to support invocation before cgraph rebuild.
>         (counts_to_freqs): Remove code obviated by reestimation.
>         * predict.h (tree_estimate_probability): Update declaration.
>         * tree-inline.c (freqs_to_counts): Move to predict.c.
>         (copy_cfg_body): Remove code obviated by reestimation.
>         * tree-profile.c (tree_profiling): Invoke handle_missing_profiles
>         before cgraph rebuild.
>
> Index: params.def
> ===================================================================
> --- params.def  (revision 207436)
> +++ params.def  (working copy)
> @@ -44,6 +44,12 @@ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
>           "Maximal estimated outcome of branch considered predictable",
>           2, 0, 50)
>
> +DEFPARAM (PARAM_MIN_CALLER_REESTIMATE_RATIO,
> +         "min-caller-reestimate-ratio",
> +         "Minimum caller-to-callee node count ratio to force reestimated branch "
> +          "probabilities in callee (where 0 means only when callee count is 0)",
> +         10, 0, 0)
>
> OK, so if callee count is 10% of call site count, we do the recomputation?

Right.
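
Roughly, with the default of 10, the test amounts to something like the
following (a simplified sketch, not the exact hunk; profile_count here stands
for the count read from the callee's entry bb, since the cgraph counts are
not populated yet at this point):

  int ratio = PARAM_VALUE (PARAM_MIN_CALLER_REESTIMATE_RATIO);
  if (profile_count == 0
      || (ratio && call_count > profile_count * ratio))
    drop_profile (node, call_count);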


>
> @@ -2390,7 +2392,8 @@ void
>    connect_infinite_loops_to_exit ();
>    /* We use loop_niter_by_eval, which requires that the loops have
>       preheaders.  */
> -  create_preheaders (CP_SIMPLE_PREHEADERS);
> +  if (!redo)
> +    create_preheaders (CP_SIMPLE_PREHEADERS);
>
> So you disable preheader construction because this all happens at inlining
> time, when we do not want to change BBs because the callgraph is already built,
> right?  Are you sure we do have preheaders built at IPA time for all functions
> reliably?  (It would make sense to do so, but I think loop analysis is done
> rather randomly by IPA analysis passes, such as the inliner.)
> Otherwise the SCEV infrastructure will explode, if I remember correctly.

I can't remember why I guarded this, but it appears it was just
precautionary. I just changed the implementation to assert, before
invoking create_preheaders, that the loops already have preheaders when
redo is true, made the call to create_preheaders unconditional, and
added an assert afterwards that nothing changed when redo is true:

  gcc_assert (!redo || loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS));
  bool changed = create_preheaders (CP_SIMPLE_PREHEADERS);
  gcc_assert (!(redo && changed));

(I changed create_preheaders to return a bool indicating whether any
preheaders were created.)

Neither assert fired in either the gcc regression tests or the more
extensive tests I ran on internal C++ benchmarks. Should I change the
patch to make the call unconditional and add these asserts?

>
> +/* Force re-estimation of the probabilities, because the profile was
> +   deemed insufficient.  */
> +
> +static void
> +tree_reestimate_probability (void)
> +{
> +  basic_block bb;
> +  edge e;
> +  edge_iterator ei;
> +
> +  /* Need to reset the counts to ensure probabilities are recomputed.  */
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      bb->count = 0;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +        e->count = 0;
> +    }
> +  tree_estimate_probability_worker (true);
>
> You also want to remove the recorded loop count estimates, since they are also
> driven by the profile we chose not to trust.

Can you point me to the recorded loop count estimates you are
referring to? I poked around and didn't find them. The places I looked
used the edge counts to estimate, and seemed downstream of this.
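
If the reference is to the per-loop niter estimates recorded in struct loop
(cfgloop.h), then clearing them from tree_reestimate_probability would
presumably look something like this (an assumed, untested sketch):

  /* Forget niter estimates that may have been derived from the
     distrusted profile, so they get recomputed from the new counts.  */
  struct loop *loop;
  FOR_EACH_LOOP (loop, 0)
    loop->any_estimate = false;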

>
> @@ -2842,48 +2927,77 @@ handle_missing_profiles (void)
>        gcov_type max_tp_first_run = 0;
>        struct function *fn = DECL_STRUCT_FUNCTION (node->decl);
>
> -      if (node->count)
> -        continue;
>        for (e = node->callers; e; e = e->next_caller)
>        {
> -        call_count += e->count;
> +        call_count += cgraph_function_with_gimple_body_p (e->caller)
> +                      ? gimple_bb (e->call_stmt)->count : 0;
>
> I am quite confused by this change - why don't you want to trust edge counts?
> Other parts of the patch seem quite reasonable, but I don't really follow here.

It is because we are invoking this before rebuilding the cgraph. Here
is the comment to this effect that I added just above
handle_missing_profiles:

+   This is now invoked before rebuilding the cgraph after reading profile
+   counts, so the cgraph edge and node counts are still 0. Therefore we
+   need to look at the counts on the entry bbs and the call stmt bbs.
+   That way we can avoid needing to rebuild the cgraph again to reflect
+   the nodes with dropped profiles.  */
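
The callee's own count is read the same way, from its entry block rather than
from node->count, roughly (simplified sketch):

  /* cgraph node and edge counts are still 0 here, so read the profile
     directly from the CFG.  */
  gcov_type profile_count = ENTRY_BLOCK_PTR_FOR_FN (fn)->count;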

Thanks,
Teresa

>
> Honza

Patch

Index: params.def
===================================================================
--- params.def	(revision 207436)
+++ params.def	(working copy)
@@ -44,6 +44,12 @@  DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
 	  "Maximal estimated outcome of branch considered predictable",
 	  2, 0, 50)
 
+DEFPARAM (PARAM_MIN_CALLER_REESTIMATE_RATIO,
+	  "min-caller-reestimate-ratio",
+	  "Minimum caller-to-callee node count ratio to force reestimated branch "
+          "probabilities in callee (where 0 means only when callee count is 0)",
+	  10, 0, 0)

OK, so if callee count is 10% of call site count, we do the recomputation?

@@ -2390,7 +2392,8 @@  void