AutoFDO patch for trunk

Hi, Honza,

Sorry for the delay. I just picked up the original patch, and updated
it with your comments.

I've addressed most of your comments. Something else to discuss inlined.

I had refactored the patch to make it much less intrusive. New patch
is attached (ChangeLog will be added in the final version of the
patch). The performance of the new patch is the same as the original
patch.

Thanks,
Dehao

>> Index: gcc/bb-reorder.c
>> ===================================================================
>> --- gcc/bb-reorder.c  (revision 210180)
>> +++ gcc/bb-reorder.c  (working copy)
>> @@ -2663,7 +2663,7 @@ pass_partition_blocks::gate (function *fun)
>>       user defined section attributes.  Don't call it if either case
>>       arises.  */
>>    return (flag_reorder_blocks_and_partition
>> -       && optimize
>> +       && optimize &&!flag_auto_profile
>
> Formatting error. Why we want !flag_auto_profile? Can't we just arrange
> flag_reorder_blocks_and_partition to default to false?
>
> In fact I would like to see flag_reorder_blocks_and_partition to offload obviously
> cold code (in partiuclar the EH cleanups) even without profile. That should work
> well with autofdo.

This will cause bzip2 performance to degrade 6%. I haven't had time to
triage the problem. Will investigate this later.

>> Index: gcc/coverage.c
>> ===================================================================
>> --- gcc/coverage.c    (revision 210180)
>> +++ gcc/coverage.c    (working copy)
>> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "intl.h"
>>  #include "filenames.h"
>>  #include "target.h"
>> +#include "auto-profile.h"
>>
>>  #include "gcov-io.h"
>>  #include "gcov-io.c"
>> @@ -1175,7 +1176,9 @@ coverage_init (const char *filename)
>>
>>    bbg_file_stamp = local_tick;
>>
>> -  if (flag_branch_probabilities)
>> +  if (flag_auto_profile)
>> +    init_auto_profile ();
>
> Perhaps initialization can happen from toplev.c instead?

This is actually reading of afdo profile. I've renamed the function to
make it consistent.

>> Index: gcc/cgraphclones.c
>> ===================================================================
>> --- gcc/cgraphclones.c        (revision 210180)
>> +++ gcc/cgraphclones.c        (working copy)
>> @@ -435,6 +435,11 @@ cgraph_clone_node (struct cgraph_node *n, tree dec
>>      }
>>    else
>>      count_scale = 0;
>> +  /* In AutoFDO, if edge count is larger than callee's entry block
>> +     count, we will not update the original callee because it may
>> +     mistakenly mark some hot function as cold.  */
>> +  if (flag_auto_profile && count >= n->count)
>> +    update_original = false;
>
> This looks like a hack. Lets go with it later - I see what you are shooting for here
> and non-autoFDO has kind of same problem, too.  Counts are not at all that reliable
> after some inlining.
>
> I wonder how this hack would fare for -fprofile-use

It does not affect -fprofile-use performance in my testing. However,
for AutoFDO, this happens much more often.

>> @@ -1286,6 +1285,9 @@ del_node_map (void)
>>  struct cgraph_node*
>>  find_func_by_profile_id (int profile_id)
>>  {
>> +  if (flag_auto_profile)
>> +    return cgraph_node_for_asm (
>> +     get_identifier ((const char *)(size_t)profile_id));
>
> I think with LTO we will need to make this machinery safe for symbol mangling.  Any plans on this?

No plan yet. Let's get AutoFDO in first, and spend some separate
effort to make AutoFDO+LTO work?

>> Index: gcc/ipa-inline.h
>> ===================================================================
>> --- gcc/ipa-inline.h  (revision 210180)
>> +++ gcc/ipa-inline.h  (working copy)
>> @@ -345,3 +345,5 @@ reset_edge_growth_cache (struct cgraph_edge *edge)
>>        edge_growth_cache[edge->uid] = zero;
>>      }
>>  }
>> +
>> +unsigned int early_inliner (function *fun);
>
> Can't this be handled by scheduling early inliner as subpass of autofdo or something similar?
> I would rather make this explicit in PM.

See my comment below in auto-profile.c

>> Index: gcc/auto-profile.c
>> ===================================================================
>> --- gcc/auto-profile.c        (revision 0)
>> +++ gcc/auto-profile.c        (revision 0)
>> +   After the above 3 phases, all profile is readily annotated on the GCC IR.
>> +   AutoFDO tries to reuse all FDO infrastructure as much as possible to make
>> +   use of the profile. E.g. it uses existing mechanism to calculate the basic
>> +   block/edge frequency, as well as the cgraph node/edge count.
>> +*/
>
> I suppose the phases can be made explicit to the PM, so early inliner can stay
> a pass.

Unfortunately, this is more complicated:

while(changed) {
  vpt();
  einline();
  annotate();
}

>> +/* Store a string array, indexed by string position in the array.  */
>> +class string_table {
>> +public:
>> +  static string_table *create ();
>> +
>> +  /* For a given string, returns its index.  */
>> +  int get_index (const char *name) const;
>> +
>> +  /* For a given decl, returns the index of the decl name.  */
>> +  int get_index_by_decl (tree decl) const;
>> +
>> +  /* For a given index, returns the string.  */
>> +  const char *get_name (int index) const;
>
> I wonder how these should work with LTO.  I suppose we can tell user to not LTO
> for train run, we should teach the table to ignore LTO generated suffixes as
> well as random seeds (like coverage code already does).
>
> What happens when two static functions have same name?

Shall we first focus on non-LTO version. And there could be separate
effort to make AutoFDO+LTO work?

For static functions, there will only be one copy in the profile. the
profile will be merged. I agree that this could potentially cause
problem. But OTOH, in a good practice, these functions should *not* be
hot functions. At least in our practice, we didn't see any hot
functions like this.

>
>> +/* Profile for all functions.  */
>> +class autofdo_source_profile {
>> +public:
>> +  static autofdo_source_profile *create ()
>> +    {
>> +      autofdo_source_profile *map = new autofdo_source_profile ();
>
> I think we still want empty lines here.  Otherwise I trust you on following GCC C++ coding standards :)
>> +      if (map->read ())
>> +     return map;
>> +      delete map;
>> +      return NULL;
>> +    }
>
>> +/* Helper functions.  */
>> +
>> +/* Return the original name of NAME: strip the suffix that starts
>> +   with '.'  */
>> +
>> +static const char *get_original_name (const char *name)
>> +{
>> +  char *ret = xstrdup (name);
>> +  char *find = strchr (ret, '.');
>> +  if (find != NULL)
>> +    *find = 0;
>> +  return ret;
>> +}
>
> Stripping all suffixes will likely make conflicts on static functions, right?
> Example like this will get you a conflict
> m()
> {
> void t(void)
> {
> }
> t();
> }
> m2()
> {
>   void t2(void)
> {
> }
> t2();
> }
>
> that is of course not that important given that nested functions are not top priority
> but it would be nice to handle it gratefuly.

I agree. My original intention is to make the function name handling
as simple as possible. The current impl is based on the asumption that
all important (hot) functions deserve a separate name (not just suffix
difference). We could definitely refine the function name handling in
the future patches.

> I would definitely merge logic here with logic in gcov that already knows how to skip
> random seeds and other cases that makes filenames to diverge...

Could you point me to the gcov logic that handles skipping function name suffix?

>> +
>> +/* Return the combined location, which is a 32bit integer in which
>> +   higher 16 bits stores the line offset of LOC to the start lineno
>> +   of DECL, The lower 16 bits stores the discrimnator.  */
>> +
>> +static unsigned
>> +get_combined_location (location_t loc, tree decl)
>> +{
>> +  return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) << 16);
>> +}
>
> What happens on overflows here? I guess they are possible - 65536 lines is not infinity
> these days.

If a function is longer than 65535 lines, then some lines may have
incorrect count attributed. But in pactice, we did not find any
function that is over 10K lines.

> The discriminator is added later?

Yes, we need to update this when discrminator support is ported upstream.

>
> Please review the file for the coding standards.

I'll do a final pass of the coding standard after all code logics are OK.

>> +
>> +static void
>> +fake_read_autofdo_module_profile ()
>
> Probably could get in only after LIPO?

Yes, this is only there to make sure that the profile format is
compatible with google branch. But after LIPO is upstream, we can add
the actual content here.

>> +{
>> +  /* Read in the module info.  */
>> +  gcov_read_unsigned ();
>> +
>> +  /* Skip the length of the section.  */
>> +  gcov_read_unsigned ();
>
> You probably want to check that length is 0 or so.

We already checked num_module is 0, whch is equivalent with checking
length is 0.

>> +/* If a basic block's count is known, and only one of its in/out edges' count
>> +   is unknown, its count can be calculated.
>> +   Meanwhile, if all of the in/out edges' counts are known, then the basic
>> +   block's unknown count can also be calculated.
>> +   IS_SUCC is true if out edges of a basic blocks are examined.
>> +   Return TRUE if any basic block/edge count is changed.  */
>
> Interesting profile cleanup.
> I wonder wy EDGE_ANNOTATED is exported now (i.e. why it is not maintained only withing
> auto-FDO). Do you have more use for it outside auto-FDO?
>
> I was thinking that having a flag about profile being reliable may be useful idea, but never
> really tried to implement it.

Good point, removed the ANNOTATED flags.

>> +
>> +/* Special propagation for circuit expressions. Because GCC translates
>> +   control flow into data flow for circuit expressions. E.g.
>> +   BB1:
>> +   if (a && b)
>> +     BB2
>> +   else
>> +     BB3
>> +
>> +   will be translated into:
>> +
>> +   BB1:
>> +     if (a)
>> +       goto BB.t1
>> +     else
>> +       goto BB.t3
>> +   BB.t1:
>> +     if (b)
>> +       goto BB.t2
>> +     else
>> +       goto BB.t3
>> +   BB.t2:
>> +     goto BB.t3
>> +   BB.t3:
>> +     tmp = PHI (0 (BB1), 0 (BB.t1), 1 (BB.t2)
>> +     if (tmp)
>> +       goto BB2
>> +     else
>> +       goto BB3
>> +
>> +   In this case, we need to propagate through PHI to determine the edge
>> +   count of BB1->BB.t1, BB.t1->BB.t2.  */
>
> So here you have known probabilities of BB1, BB2 and BB3 and you need to guess all the rest?
> Why it need so much of gimple matching?

This is basically because some basic blocks have no stmt so that we
can not get its count from source level profile. However, we can get
edge count through PHI nodes.

>> +  while (changed && i++ < 10)
>> +    {
>> +      changed = false;
>> +
>> +      if (afdo_propagate_edge (true))
>> +     changed = true;
>> +      if (afdo_propagate_edge (false))
>> +     changed = true;
>> +      afdo_propagate_circuit ();
>> +    }
>
> I wonder if there is any code to share with same (except for the circuit thing) code in profile.c
> and gcov.c? Profile.c also have the minimal flow algorithm for fixing broken profiles, perhaps
> it could be useful here?

The initial SampleFDO implementation uses MCF algorithm to calculate
edge counts (the current mcf.c actually comes from that effort).
However, from our performance tuning effort, we found that MCF is an
overkill for AutoFDO purpose, and it's extremely slow and hard to
debug. Thus we implemented this ad-hoc heuristic for propagation.

>> +  FOR_ALL_BB_FN (bb, cfun)
>> +    {
>> +      edge e;
>> +      edge_iterator ei;
>> +
>> +      FOR_EACH_EDGE (e, ei, bb->succs)
>> +     e->count =
>> +             (double) bb->count * e->probability / REG_BR_PROB_BASE;
>> +      bb->aux = NULL;
>> +    }
>> +
>> +  loop_optimizer_finalize ();
>> +  free_dominance_info (CDI_DOMINATORS);
>> +  free_dominance_info (CDI_POST_DOMINATORS);
>> +}
>
> I wonder if using the estimated profile to fill in the missing gaps would make sense?
> I.e. running the standard branch probabilities->freuqencies propagation code bug this
> time taking the determined counts as known values?

Not sure if I follow. Could you elaborate?

>> +
>> +/* Perform value profile transformation using AutoFDO profile. Add the
>> +   promoted stmts to PROMOTED_STMTS. Return TRUE if there is any
>> +   indirect call promoted.  */
>> +
>> +static bool
>> +afdo_vpt_for_early_inline (stmt_set *promoted_stmts)
>
> What exactly this function does? It inlines according to the inline decisions
> made at train run time and recorded to profile feedback?

As replied above, we need iterative vpt-einline to make sure the IR
looks exactly the same as profiling binary before annotation.

AutoFDO patch for trunk

Commit Message

Comments

Patch