Patchwork Sanitize block partitioning under -freorder-blocks-and-partition

login
register
mail settings
Submitter Teresa Johnson
Date Aug. 30, 2013, 5 a.m.
Message ID <CAAe5K+XwXbpFJeqdcxj9Ua1XcBLpfs_32ogqmjUaLUxSS17p6w@mail.gmail.com>
Download mbox | patch
Permalink /patch/271119/
State New
Headers show

Comments

Teresa Johnson - Aug. 30, 2013, 5 a.m.
On Wed, Aug 28, 2013 at 11:20 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Wed, Aug 28, 2013 at 9:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi,
>> with Martin we did bit of progress on analyzing the problems.  We now have
>> COMDAT profile merging for FDO
>
>  Great! Is this the LTO merging you were talking about in an earlier
> message, or the gcov runtime fix (that would presumably not be
> lto-specific)?
>
>> and we also noticed that forks can make your
>> basic blocks appear never executed even though they are executed every run:
>> the fork is accounted as 3 independnet runs of the program.  First run
>> is until fork, the second 2 runs are both variant.
>>
>> I have patch to track this.  Moreover vforks seems to produce repeated
>> merging of results.
>
> Aha, does this explain the gimp issue as well?
>
>>
>> These two factors seems to help to riddle enought the firefox profiles
>> so it took us weeks to understand what happens.
>>> +         if (changed)
>>> +            {
>>> +              /* Edge forwarding in particular can cause hot blocks previously
>>> +                 reached by both hot and cold blocks to become dominated only
>>> +                 by cold blocks. This will cause the verification
>>> below to fail,
>>> +                 and lead to now cold code in the hot section. This is not easy
>>> +                 to detect and fix during edge forwarding, and in some cases
>>> +                 is only visible after newly unreachable blocks are deleted,
>>> +                 which will be done in fixup_partitions.  */
>>> +              fixup_partitions ();
>>
>> Is it really necessary to run this from internal loop of the cfgcleanup?
>
> The reason I added it here is that just below there is a call to
> verify_flow_info, and that will fail with the new verification.
>
>> It seems
>> you will play back and forth game where edge forwarding will remove your fallthru
>> and you will be re-adding it?
>
> fixup_partitions will not add new fall through edges. (It may invoke
> force_nonfallthru to do the opposite.) So there shouldn't be any
> ping-ponging effect.
>
>>
>> I would wait for cfgcleanup to finish its job (I don't really think it needs to be
>> iterative) and then do the fixup possibly cleaning up when the blocks was repoisitoined
>> (I suppose it is only case where the code above introduces new cfgcleanup oppurtunities).
>
> As noted above, I can't do this due to the call to verify_flow_info
> for each iteration. One option is to move both down outside the loop.
>
>>
>>> +      /* Walk the preds/succs and check if there is at least one already
>>> +         marked hot. Keep track of the most frequent pred/succ so that we
>>> +         can mark it hot if we don't find one.  */
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>>> +
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +
>>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>>> +          {
>>> +            found = true;
>>> +            break;
>>> +          }
>>> +          if (e->probability > highest_probability)
>>> +            highest_probability = e->probability;
>>
>> When doing predecestor walk if you have two predecestors, one executing 100000
>> times and getting to the block with probability 1%, you want to chose it over
>> block executing once and getting to you with probability 100%.
>>
>> You probably want to look for most likely predecestor here.  You need to look
>> for highest e->count and if they are all 0 for highest EDGE_FREQUENCY and for
>> maximal probability only if all fails?
>
> Yes, thanks, let me do that.

New patch that addresses this is included below.

Thanks,
Teresa

>
>>
>>> +        }
>>> +
>>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>>> +         case where multiple preds/succs have the same probability (e.g. a
>>> +         50-50 branch), then both will be adjusted.  */
>>> +      if (found)
>>> +        continue;
>>> +
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +          if (e->probability < highest_probability)
>>> +            continue;
>>
>> Again for predecestor walk you need to wind down the bit crazy logic described above.
>>> Index: predict.c
>>> ===================================================================
>>> --- predict.c   (revision 202021)
>>> +++ predict.c   (working copy)
>>> @@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
>>>    return false;
>>>  }
>>>
>>> +
>>> +/* Return true in case edge E is probably never executed.  */
>>> +
>>> +bool
>>> +probably_never_executed_edge_p (struct function *fun, edge e)
>>> +{
>>> +  gcc_checking_assert (fun);
>>> +  if (profile_info && flag_branch_probabilities)
>>> +    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>>> +  if ((!profile_info || !flag_branch_probabilities)
>>> +      && (cgraph_get_node (fun->decl)->frequency
>>> +         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>>> +    return true;
>>> +  return false;
>> Instead of duplicating the conditional, break out the tests into
>> probably_never_executed_count_p, like we have for maybe_hot_count_p.
>
> ok
>
>>
>> It would be nice to extend this to work w/o profile: probably_never_executed_edge_p
>> can return true for EH edges, setjmp edges and we can then walk bodies ignoring
>> EH/setjmp flagging blocks that are probably never executed enabling the partitioning
>> to do a job w/o profile.
>
> agreed. although I would prefer to leave that for a follow-on patch so
> it can be tuned a bit.
>
>>
>> Otherwise the patch look OK to me. Thanks for working on this. Do we have agreement on C++ way of mixing
>> declarations and code?
>
> According to http://gcc.gnu.org/wiki/CppConventions:
>
> "In new code variables which are used in a small scope should be
> defined at the point of first use, rather than at the top of the
> function. Variables which are used throughout the function may be
> defined at the top of the function, as in C."
>
> I think I am following that, but let me know if you see something that
> needs to be fixed.
>
> Thanks,
> Teresa
>
>>

2013-08-29  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_new_cold_bb): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (probably_never_executed_edge_p): Declare.
        (fixup_partitions): Ditto.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_hot_paths): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_hot_paths.
        * predict.c (probably_never_executed_edge_p): New routine.
        * cfg.c (check_bb_profile): Add partition insanity warnings.
Jan Hubicka - Aug. 30, 2013, 9:14 a.m.
> Great! Is this the LTO merging you were talking about in an earlier
> message, or the gcov runtime fix (that would presumably not be
> lto-specific)?

It is the LTO path - we need to merge profiles there anyway for his code unification
work.

> > I have patch to track this.  Moreover vforks seems to produce repeated
> > merging of results.
> 
> Aha, does this explain the gimp issue as well?

Not really - we still need to debug why we hit cold section so many times with
partitioning.  I sitll think easier approach will be to lock the cold section and
then start probably with testsuite (i.e. write script to compile the small testcases
with FDO + partitioning and see what crash by hitting cold section).

> >
> > Is it really necessary to run this from internal loop of the cfgcleanup?
> 
> The reason I added it here is that just below there is a call to
> verify_flow_info, and that will fail with the new verification.

Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
We can track this incrementally - I am not sure if we put it from the internal iteration
loop we would get anything substantial either.
Removing unreachable blocks twice is however ugly.

> +/* Ensure that all hot bbs are included in a hot path through the
> +   procedure. This is done by calling this function twice, once
> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
> +   to BBS_IN_HOT_PARTITION.  */
> +
> +static unsigned int
> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
> +                    vec<basic_block> *bbs_in_hot_partition)
> +{
> +  /* Callers check this.  */
> +  gcc_checking_assert (cold_bb_count);
> +
> +  /* Keep examining hot bbs while we still have some left to check
> +     and there are remaining cold bbs.  */
> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
> +  while (! hot_bbs_to_check.is_empty ()
> +         && cold_bb_count)
> +    {
> +      basic_block bb = hot_bbs_to_check.pop ();
> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
> +      edge e;
> +      edge_iterator ei;
> +      int highest_probability = 0;
> +      int highest_freq = 0;
> +      gcov_type highest_count = 0;
> +      bool found = false;
> +
> +      /* Walk the preds/succs and check if there is at least one already
> +         marked hot. Keep track of the most frequent pred/succ so that we
> +         can mark it hot if we don't find one.  */
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          basic_block reach_bb = walk_up ? e->src : e->dest;
> +
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +
> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
> +          {
> +            found = true;
> +            break;
> +          }
> +          /* The following loop will look for the hottest edge via
> +             the edge count, if it is non-zero, then fallback to the edge
> +             frequency and finally the edge probability.  */
> +          if (e->count > highest_count)
> +            highest_count = e->count;
> +          int edge_freq = EDGE_FREQUENCY (e);
> +          if (edge_freq > highest_freq)
> +            highest_freq = edge_freq;
> +          if (e->probability > highest_probability)
> +            highest_probability = e->probability;
> +        }
> +
> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
> +         then the most frequent pred (or succ) needs to be adjusted.  In the
> +         case where multiple preds/succs have the same frequency (e.g. a
> +         50-50 branch), then both will be adjusted.  */
> +      if (found)
> +        continue;
> +
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +          /* Select the hottest edge using the edge count, if it is non-zero,
> +             then fallback to the edge frequency and finally the edge
> +             probability.  */
> +          if (highest_count)
> +            {
> +              if (e->count < highest_count)
> +                continue;
> +            }
> +          else if (highest_freq)

The frequency condition needs to be done only when you walk predecestors - when
you walk down the edge probabilities are just fine.

The patch seems OK to me now.  I will make our FDO tester to use partitioning so we get
this benchmarked a bit.

Honza
Teresa Johnson - Aug. 30, 2013, 2:50 p.m.
Can someone review and ok the attached patch for trunk? It has been
bootstrapped and tested on x86-64-unknown-linux-gnu, and tested by
enabling -freorder-blocks-and-partition enabled for a
profiledbootstrap as well.

(Honza, see more responses inlined below. Rong, please see note below as well).

Thanks,
Teresa

On Fri, Aug 30, 2013 at 2:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Great! Is this the LTO merging you were talking about in an earlier
>> message, or the gcov runtime fix (that would presumably not be
>> lto-specific)?
>
> It is the LTO path - we need to merge profiles there anyway for his code unification
> work.

Rong - can you send a summary of the approach you are working on? Is
it LIPO-specific?

>
>> > I have patch to track this.  Moreover vforks seems to produce repeated
>> > merging of results.
>>
>> Aha, does this explain the gimp issue as well?
>
> Not really - we still need to debug why we hit cold section so many times with
> partitioning.  I sitll think easier approach will be to lock the cold section and
> then start probably with testsuite (i.e. write script to compile the small testcases
> with FDO + partitioning and see what crash by hitting cold section).

Ok, that is on my todo list.

>
>> >
>> > Is it really necessary to run this from internal loop of the cfgcleanup?
>>
>> The reason I added it here is that just below there is a call to
>> verify_flow_info, and that will fail with the new verification.
>
> Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
> We can track this incrementally - I am not sure if we put it from the internal iteration
> loop we would get anything substantial either.
> Removing unreachable blocks twice is however ugly.

When I was debugging the issue that led to this change I seemed to see
1-2 iterations typically. Although I haven't measured it
scientifically. It would be good to revisit that and see if we can
pull both parts out of the loop, but as a separate patch.

>
>> +/* Ensure that all hot bbs are included in a hot path through the
>> +   procedure. This is done by calling this function twice, once
>> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
>> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>> +   to BBS_IN_HOT_PARTITION.  */
>> +
>> +static unsigned int
>> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
>> +                    vec<basic_block> *bbs_in_hot_partition)
>> +{
>> +  /* Callers check this.  */
>> +  gcc_checking_assert (cold_bb_count);
>> +
>> +  /* Keep examining hot bbs while we still have some left to check
>> +     and there are remaining cold bbs.  */
>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>> +  while (! hot_bbs_to_check.is_empty ()
>> +         && cold_bb_count)
>> +    {
>> +      basic_block bb = hot_bbs_to_check.pop ();
>> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
>> +      edge e;
>> +      edge_iterator ei;
>> +      int highest_probability = 0;
>> +      int highest_freq = 0;
>> +      gcov_type highest_count = 0;
>> +      bool found = false;
>> +
>> +      /* Walk the preds/succs and check if there is at least one already
>> +         marked hot. Keep track of the most frequent pred/succ so that we
>> +         can mark it hot if we don't find one.  */
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>> +
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +
>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>> +          {
>> +            found = true;
>> +            break;
>> +          }
>> +          /* The following loop will look for the hottest edge via
>> +             the edge count, if it is non-zero, then fallback to the edge
>> +             frequency and finally the edge probability.  */
>> +          if (e->count > highest_count)
>> +            highest_count = e->count;
>> +          int edge_freq = EDGE_FREQUENCY (e);
>> +          if (edge_freq > highest_freq)
>> +            highest_freq = edge_freq;
>> +          if (e->probability > highest_probability)
>> +            highest_probability = e->probability;
>> +        }
>> +
>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>> +         case where multiple preds/succs have the same frequency (e.g. a
>> +         50-50 branch), then both will be adjusted.  */
>> +      if (found)
>> +        continue;
>> +
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +          /* Select the hottest edge using the edge count, if it is non-zero,
>> +             then fallback to the edge frequency and finally the edge
>> +             probability.  */
>> +          if (highest_count)
>> +            {
>> +              if (e->count < highest_count)
>> +                continue;
>> +            }
>> +          else if (highest_freq)
>
> The frequency condition needs to be done only when you walk predecestors - when
> you walk down the edge probabilities are just fine.

True. For simplicity I think it should be fine to leave as-is so there
isn't more special casing as the current approach works in both
directions.

>
> The patch seems OK to me now.  I will make our FDO tester to use partitioning so we get
> this benchmarked a bit.

Ok thanks.

>
> Honza
Jan Hubicka - Aug. 30, 2013, 3:26 p.m.
> >
> > The frequency condition needs to be done only when you walk predecestors - when
> > you walk down the edge probabilities are just fine.
> 
> True. For simplicity I think it should be fine to leave as-is so there
> isn't more special casing as the current approach works in both
> directions.

Yep, you are right. Frequencies are safe both directions.

I think this change belongs to profile feedback category, so the patch is OK.

Honza
Teresa Johnson - Aug. 30, 2013, 3:54 p.m.
On Fri, Aug 30, 2013 at 8:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > The frequency condition needs to be done only when you walk predecestors - when
>> > you walk down the edge probabilities are just fine.
>>
>> True. For simplicity I think it should be fine to leave as-is so there
>> isn't more special casing as the current approach works in both
>> directions.
>
> Yep, you are right. Frequencies are safe both directions.
>
> I think this change belongs to profile feedback category, so the patch is OK.

ok, thanks.

Teresa

>
> Honza
Rong Xu - Aug. 30, 2013, 9:53 p.m.
On Fri, Aug 30, 2013 at 7:50 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Can someone review and ok the attached patch for trunk? It has been
> bootstrapped and tested on x86-64-unknown-linux-gnu, and tested by
> enabling -freorder-blocks-and-partition enabled for a
> profiledbootstrap as well.
>
> (Honza, see more responses inlined below. Rong, please see note below as well).
>
> Thanks,
> Teresa
>
> On Fri, Aug 30, 2013 at 2:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> Great! Is this the LTO merging you were talking about in an earlier
>>> message, or the gcov runtime fix (that would presumably not be
>>> lto-specific)?
>>
>> It is the LTO path - we need to merge profiles there anyway for his code unification
>> work.
>
> Rong - can you send a summary of the approach you are working on? Is
> it LIPO-specific?

I'm also working to improve COMDAT handling in FDO/LIPO.  It's
applicable to regular FDO.
Our motivation case is different from the case talked here:
  COMDAT function F is defined in module A and is picked by linker as
the out-of-line copy in profile-gen phrase.
It gets all the counters. In profile-use compilation F may not be
emitted in A (like all the callsites are ipa-inlined in A),
We choose instance of F in module B as the out-of-line copy, and it's
not optimized.

We are seeing more of this kind of cases in LIPO due to multiple
COMDAT copies brought by the auxiliary modules.

Since COMDAT function may be inlined after instrumentation, multiple
copies of the counters make be co-exists
We want to differentiate inlined-copy counters or out-of-line-copy counters.
(In LIPO, we actually encourage the inline of COMDAT in IPA-inline to
get more context sensitive counters.)

Our current solution is to have another instrumentation only for
COMDAT functions.
* For each  comdat_key, we create a global var pointing to the
gcov_fn_info of the out-of-line copy.
* This global var is initialized by the instrumentation code placed in
the function entry, to the value of gcov_fn_info of current module.
   This is post IPA-inline instrumentation. So at most one
instrumentation (the one picked by linker) is executed.
* Expand gcov_fn_info to points to the global var.
In the run time, we can differentiate if the counter out-of-line copy
or inlined copy. We set a tag in gcda file accordingly.
(in the case of both out-of-line copy and inlined copy, we treat it as
out-of-line copy).

In the profile-use phrase, we use this info to resolve the decls of
COMDAT functions, and also make sure we only emit
the out-of-line copy of the COMDAT function (even if it's not
referenced in the module).

This has been done and we are testing it now.

I talked to Teresa about the her case some time back. It seems that
merging the counters is an easy change with the above patch because
we have the address of the out-of-line gcov_fn_info. We can do a
simple in-memory merge if the checksum matches. The concern is
that this may reduce the context sensitive information.

-Rong

>
>>
>>> > I have patch to track this.  Moreover vforks seems to produce repeated
>>> > merging of results.
>>>
>>> Aha, does this explain the gimp issue as well?
>>
>> Not really - we still need to debug why we hit cold section so many times with
>> partitioning.  I sitll think easier approach will be to lock the cold section and
>> then start probably with testsuite (i.e. write script to compile the small testcases
>> with FDO + partitioning and see what crash by hitting cold section).
>
> Ok, that is on my todo list.
>
>>
>>> >
>>> > Is it really necessary to run this from internal loop of the cfgcleanup?
>>>
>>> The reason I added it here is that just below there is a call to
>>> verify_flow_info, and that will fail with the new verification.
>>
>> Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
>> We can track this incrementally - I am not sure if we put it from the internal iteration
>> loop we would get anything substantial either.
>> Removing unreachable blocks twice is however ugly.
>
> When I was debugging the issue that led to this change I seemed to see
> 1-2 iterations typically. Although I haven't measured it
> scientifically. It would be good to revisit that and see if we can
> pull both parts out of the loop, but as a separate patch.
>
>>
>>> +/* Ensure that all hot bbs are included in a hot path through the
>>> +   procedure. This is done by calling this function twice, once
>>> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
>>> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
>>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>>> +   to BBS_IN_HOT_PARTITION.  */
>>> +
>>> +static unsigned int
>>> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
>>> +                    vec<basic_block> *bbs_in_hot_partition)
>>> +{
>>> +  /* Callers check this.  */
>>> +  gcc_checking_assert (cold_bb_count);
>>> +
>>> +  /* Keep examining hot bbs while we still have some left to check
>>> +     and there are remaining cold bbs.  */
>>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>>> +  while (! hot_bbs_to_check.is_empty ()
>>> +         && cold_bb_count)
>>> +    {
>>> +      basic_block bb = hot_bbs_to_check.pop ();
>>> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
>>> +      edge e;
>>> +      edge_iterator ei;
>>> +      int highest_probability = 0;
>>> +      int highest_freq = 0;
>>> +      gcov_type highest_count = 0;
>>> +      bool found = false;
>>> +
>>> +      /* Walk the preds/succs and check if there is at least one already
>>> +         marked hot. Keep track of the most frequent pred/succ so that we
>>> +         can mark it hot if we don't find one.  */
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>>> +
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +
>>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>>> +          {
>>> +            found = true;
>>> +            break;
>>> +          }
>>> +          /* The following loop will look for the hottest edge via
>>> +             the edge count, if it is non-zero, then fallback to the edge
>>> +             frequency and finally the edge probability.  */
>>> +          if (e->count > highest_count)
>>> +            highest_count = e->count;
>>> +          int edge_freq = EDGE_FREQUENCY (e);
>>> +          if (edge_freq > highest_freq)
>>> +            highest_freq = edge_freq;
>>> +          if (e->probability > highest_probability)
>>> +            highest_probability = e->probability;
>>> +        }
>>> +
>>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>>> +         case where multiple preds/succs have the same frequency (e.g. a
>>> +         50-50 branch), then both will be adjusted.  */
>>> +      if (found)
>>> +        continue;
>>> +
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +          /* Select the hottest edge using the edge count, if it is non-zero,
>>> +             then fallback to the edge frequency and finally the edge
>>> +             probability.  */
>>> +          if (highest_count)
>>> +            {
>>> +              if (e->count < highest_count)
>>> +                continue;
>>> +            }
>>> +          else if (highest_freq)
>>
>> The frequency condition needs to be done only when you walk predecestors - when
>> you walk down the edge probabilities are just fine.
>
> True. For simplicity I think it should be fine to leave as-is so there
> isn't more special casing as the current approach works in both
> directions.
>
>>
>> The patch seems OK to me now.  I will make our FDO tester to use partitioning so we get
>> this benchmarked a bit.
>
> Ok thanks.
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Patch

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 202021)
+++ cfgrtl.c    (working copy)
@@ -1358,6 +1358,43 @@  fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1996,6 +2033,14 @@  commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2190,6 +2235,101 @@  get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)",
son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2236,7 +2376,8 @@  verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2399,6 +2540,14 @@  rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2532,7 +2681,7 @@  rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2778,7 +2927,8 @@  rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 202021)
+++ basic-block.h       (working copy)
@@ -726,6 +726,7 @@  extern void compute_available (sbitmap *, sbitmap
 extern bool maybe_hot_bb_p (struct function *, const_basic_block);
 extern bool maybe_hot_edge_p (edge);
 extern bool probably_never_executed_bb_p (struct function *,
const_basic_block);
+extern bool probably_never_executed_edge_p (struct function *, edge);
 extern bool optimize_bb_for_size_p (const_basic_block);
 extern bool optimize_bb_for_speed_p (const_basic_block);
 extern bool optimize_edge_for_size_p (edge);
@@ -797,6 +798,7 @@  extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 202021)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@  try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification
below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 202021)
+++ bb-reorder.c        (working copy)
@@ -1444,27 +1444,157 @@  fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      int highest_freq = 0;
+      gcov_type highest_count = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+          {
+            found = true;
+            break;
+          }
+          /* The following loop will look for the hottest edge via
+             the edge count, if it is non-zero, then fallback to the edge
+             frequency and finally the edge probability.  */
+          if (e->count > highest_count)
+            highest_count = e->count;
+          int edge_freq = EDGE_FREQUENCY (e);
+          if (edge_freq > highest_freq)
+            highest_freq = edge_freq;
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same frequency (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          /* Select the hottest edge using the edge count, if it is non-zero,
+             then fallback to the edge frequency and finally the edge
+             probability.  */
+          if (highest_count)
+            {
+              if (e->count < highest_count)
+                continue;
+            }
+          else if (highest_freq)
+            {
+              if (EDGE_FREQUENCY (e) < highest_freq)
+                continue;
+            }
+          else if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb with an immediate dominator that is cold.
+             The dominator needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */

-static vec<edge>
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be
hot, because
+     it is entered fewer times than the number of profile training
runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */
Index: predict.c
===================================================================
--- predict.c   (revision 202021)
+++ predict.c   (working copy)
@@ -241,6 +241,22 @@  probably_never_executed_bb_p (struct function *fun
   return false;
 }

+
+/* Return true in case edge E is probably never executed.  */
+
+bool
+probably_never_executed_edge_p (struct function *fun, edge e)
+{
+  gcc_checking_assert (fun);
+  if (profile_info && flag_branch_probabilities)
+    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if ((!profile_info || !flag_branch_probabilities)
+      && (cgraph_get_node (fun->decl)->frequency
+         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
+    return true;
+  return false;
+}
+
 /* Return true if NODE should be optimized for size.  */

 bool
Index: cfg.c
===================================================================
--- cfg.c       (revision 202021)
+++ cfg.c       (working copy)
@@ -446,6 +446,21 @@  check_bb_profile (basic_block bb, FILE * file, int
                 (flags & TDF_COMMENT) ? ";; " : "", s_indent,
                 (int) lsum, (int) bb->count);
     }
+  if (BB_PARTITION (bb) == BB_COLD_PARTITION)
+    {
+      /* Warn about inconsistencies in the partitioning that are
+         currently caused by profile insanities created via optimization.  */
+      if (!probably_never_executed_bb_p (fun, bb))
+        fprintf (file, "%s%sBlock in cold partition with hot count\n",
+                 (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+      FOR_EACH_EDGE (e, ei, bb->preds)
+        {
+          if (!probably_never_executed_edge_p (fun, e))
+            fprintf (file,
+                     "%s%sBlock in cold partition with incoming hot edge\n",
+                     (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+        }
+    }
 }
 ^L
 void