Sanitize block partitioning under -freorder-blocks-and-partition

Message ID 20130831214614.GA12372@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Aug. 31, 2013, 9:46 p.m. UTC
Hi,
I ran my script on the execute testsuite and looked into a few testcases. The problem I found
was roundoff errors - i.e. when expanding a switch we set a 50% chance that the out-of-bound
value is above or below.  Both get rounded to 0, because the switch is executed once
and the value is below.

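To make the rounding concrete, here is a minimal standalone sketch of the
truncation, assuming plain integer scaling by REG_BR_PROB_BASE (the
variable names are illustrative only):

  #include <stdio.h>

  #define REG_BR_PROB_BASE 10000

  int main (void)
  {
    long long sw_count = 1;            /* switch executed once in training */
    int prob = REG_BR_PROB_BASE / 2;   /* 50% above, 50% below             */

    /* Truncating integer division sends both outgoing counts to 0.  */
    long long above = sw_count * prob / REG_BR_PROB_BASE;
    long long below = sw_count * (REG_BR_PROB_BASE - prob) / REG_BR_PROB_BASE;

    printf ("above = %lld, below = %lld\n", above, below);   /* 0 and 0 */
    return 0;
  }
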
Partly this can be fixed by making probably_never_executed consider frequencies when
counts are too coarse:

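Index: predict.c
===================================================================
--- predict.c   (revision 202133)
+++ predict.c   (working copy)
@@ -232,8 +232,22 @@ bool
 probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
 {
   gcc_checking_assert (fun);
-  if (profile_info && flag_branch_probabilities)
-    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if (profile_status_for_function (fun) == PROFILE_READ)
+    {
+      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+       return false;
+      if (!bb->frequency)
+       return true;
+      if (!ENTRY_BLOCK_PTR->frequency)
+       return false;
+      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
+       {
+         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
+                       ENTRY_BLOCK_PTR->frequency)
+                 < REG_BR_PROB_BASE / 4);
+       }
+      return true;
+    }
   if ((!profile_info || !flag_branch_probabilities)
       && (cgraph_get_node (fun->decl)->frequency
          == NODE_FREQUENCY_UNLIKELY_EXECUTED))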

In other cases it was mostly loop unrolling in combination with jump threading. So
I modified my script to separately report when the failure happens for a test trained
once and a test trained a hundred times.

FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c

FAIL1 is a failure after one train run, FAIL is a failure after 100 train runs.
We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
I think it is a kind of design problem: while implementing counts&frequencies
the idea was that small counts do not matter, so integer arithmetic is all
right.

I wonder if with the current C++ wonderland we can't simply switch count
to a better representation: either sreal or fixed point with capping
(the integer overflow issues are tiring, too).

Honza

Comments

Teresa Johnson Sept. 24, 2013, 5:28 a.m. UTC | #1
Hi Honza,

I am finally getting back to working on this after a few weeks of
working on some other priorities.

On Sat, Aug 31, 2013 at 2:46 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> I ran my script on the execute testsuite and looked into a few testcases. The problem I found
> was roundoff errors - i.e. when expanding a switch we set a 50% chance that the out-of-bound
> value is above or below.  Both get rounded to 0, because the switch is executed once
> and the value is below.
>
> Partly this can be fixed by making probably_never_executed consider frequencies when
> counts are too coarse:
>
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202133)
> +++ predict.c   (working copy)
> @@ -232,8 +232,22 @@ bool
>  probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>  {
>    gcc_checking_assert (fun);
> -  if (profile_info && flag_branch_probabilities)
> -    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
> +  if (profile_status_for_function (fun) == PROFILE_READ)
> +    {
> +      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> +       return false;
> +      if (!bb->frequency)
> +       return true;
> +      if (!ENTRY_BLOCK_PTR->frequency)
> +       return false;
> +      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> +       {
> +         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
> +                       ENTRY_BLOCK_PTR->frequency)
> +                 < REG_BR_PROB_BASE / 4);
> +       }
> +      return true;
> +    }
>    if ((!profile_info || !flag_branch_probabilities)
>        && (cgraph_get_node (fun->decl)->frequency
>           == NODE_FREQUENCY_UNLIKELY_EXECUTED))

Did you mean to commit the above change? I see that it went in as part
of r202258 but doesn't show up in the ChangeLog entry for that
revision.

>
> In other cases it was mostly loop unrolling in combination with jump threading. So
> I modified my script to separately report when the failure happens for a test trained
> once and a test trained a hundred times.

Thanks for the linker script. I reproduced your results. I looked at a
couple cases. The first was one that failed after 1 training run only
(20000910-2.c). It was due to jump threading, which you noted was a
problem. For this one I think we can handle it in the partitioning,
since there is an FDO insanity that we could probably treat more
conservatively when splitting.

I looked at one that failed after 100 as well (20031204-1.c). In this
case, it was due to expansion which was creating multiple branches/bbs
from a logical OR and guessing incorrectly on how to assign the
counts:

 if (octets == 4 && (*cp == ':' || *cp == '\0')) {

The (*cp == ':' || *cp == '\0') part looked like the following going
into RTL expansion:

  [20031204-1.c : 31:33] _29 = _28 == 58;
  [20031204-1.c : 31:33] _30 = _28 == 0;
  [20031204-1.c : 31:33] _31 = _29 | _30;
  [20031204-1.c : 31:18] if (_31 != 0)
    goto <bb 16>;
  else
    goto <bb 19>;

where the result of the OR was always true, so bb 16 had a count of
100 and bb 19 a count of 0. When it was expanded, the expanded version
of the above turned into 2 bbs with a branch in between. Both
comparisons were done in the first bb, but the first bb checked
whether the result of the *cp == '\0' compare was true, and if not
branched to the check for whether the *cp == ':' compare was true. It
gave the branch to the second check against ':' a count of 0, so that
bb got a count of 0 and was split out, and put the count of 100 on the
fall through assuming the compare with '\0' always evaluated to true.
In reality, this OR condition was always true because *cp was ':', not
'\0'. Therefore, the count of 0 on the second block with the check for
':' was incorrect, we ended up trying to execute it, and failed.

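In CFG terms, the expansion produced roughly the following (reconstructed
from the description above; the block names are illustrative):

  bb A [count 100]:  _29 = *cp == ':';  _30 = *cp == '\0';
                     if (_30) goto bb 16;   /* edge given count 100 */
                     else goto bb B;        /* edge given count 0   */
  bb B [count 0]:    if (_29) goto bb 16; else goto bb 19;

Since *cp was actually ':', execution always took the edge into bb B, the
supposedly never-executed block.
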
Presumably we had the correct profile data for both blocks, but the
accuracy was reduced when the OR was represented as a logical
computation with a single branch. We could change the expansion code
to do something different, e.g. treat as a 50-50 branch. But we would
still end up with integer truncation issues when there was a single
training run. But that could be dealt with conservatively in the
bbpart code as I suggested for the jump threading issue above. I.e. a
cold block with incoming non-cold edges conservatively not marked cold
for splitting.

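A minimal sketch of that conservative test (a hypothetical helper, not the
eventual patch):

  /* Do not mark BB cold for splitting if the profile claims it is never
     executed but some incoming edge is not cold - an FDO insanity
     typically produced by truncation or stale profile updates.  */
  static bool
  conservatively_cold_bb_p (struct function *fun, basic_block bb)
  {
    edge e;
    edge_iterator ei;

    if (!probably_never_executed_bb_p (fun, bb))
      return false;
    FOR_EACH_EDGE (e, ei, bb->preds)
      if (!probably_never_executed_edge_p (fun, e))
        return false;
    return true;
  }
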
>
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>
> FAIL1 is a failure after one train run, FAIL is a failure after 100 train runs.
> We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
> I think it is a kind of design problem: while implementing counts&frequencies
> the idea was that small counts do not matter, so integer arithmetic is all
> right.
>
> I wonder if with the current C++ wonderland we can't simply switch count
> to a better representation: either sreal or fixed point with capping
> (the integer overflow issues are tiring, too).

It also seems like we should be able to detect the profile insanities
caused by integer truncation and handle them conservatively. That
being said, I see some sreal uses already in the profile.c code, so
presumably we could use this for the counts as well if it turns out to
be necessary?

BTW, Rong also implemented his runtime patch to do the COMDAT profile
merging. However, that ended up having some issues that were solvable
but would have caused us to lose all context sensitivity from COMDATs
inlined during the profile-gen build. I am going to go back to solving
this in the profile-use phase as we discussed in the separate thread
on the COMDAT inlining patch I had been working on.

Thanks,
Teresa

>
> Honza
Jan Hubicka Sept. 24, 2013, 12:31 p.m. UTC | #2
> Hi Honza,
> 
> I am finally getting back to working on this after a few weeks of
> working on some other priorities.

I am also trying to return to this, so good timing ;)
Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
during startup with FDO (w/o partitioning). Firefox still does; I think
the problem is lost samples due to different linker decisions even with LTO
(i.e. the linker picks an object from a .a library at profile-generate time
that it never passes later).

I plan to look into that today.
> 
> Did you mean to commit the above change? I see that it went in as part
> of r202258 but doesn't show up in the ChangeLog entry for that
> revision.

Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
fix that.
> 
> >
> > In other cases it was mostly loop unrolling in combination with jump threading. So
> > I modified my script to separately report when the failure happens for a test trained
> > once and a test trained a hundred times.
> 
> Thanks for the linker script. I reproduced your results. I looked at a
> couple cases. The first was one that failed after 1 training run only
> (20000910-2.c). It was due to jump threading, which you noted was a
> problem. For this one I think we can handle it in the partitioning,
> since there is an FDO insanity that we could probably treat more
> conservatively when splitting.

We should fix the roundoff issues - when I was introducing the
frequency/probability/count system I made an assumption that parts of programs
with very low counts do not matter, since they are not part of the hot spot (and I
cared only about the hot spot).  Now we care about identifying unlikely
executed spots and we need to fix this.
> 
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
> 
>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
> 
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
> 
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
> 
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.
> 
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.
> 
> >
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> >
> > FAIL1 is a failure after one train run, FAIL is a failure after 100 train runs.
> > We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
> > I think it is a kind of design problem: while implementing counts&frequencies
> > the idea was that small counts do not matter, so integer arithmetic is all
> > right.
> >
> > I wonder if with the current C++ wonderland we can't simply switch count
> > to a better representation: either sreal or fixed point with capping
> > (the integer overflow issues are tiring, too).
> 
> It also seems like we should be able to detect the profile insanities
> caused by integer truncation and handle them conservatively. That
> being said, I see some sreal uses already in the profile.c code, so
> presumably we could use this for the counts as well if it turns out to
> be necessary?

Yes, I was thinking about this, too.  We would need to do some evaluation of
the compile time implications, but switching counts from gcov_type to a
profile_counter_t that is a typedef to sreal seems a sane idea.

We could switch the CFG code first.  There should not be many hot spots where
counts are involved.  We can offline the common calculations we already moved
to macros.

We will also need to invent a REG representation for them.  Now we have INT_LIST
for that; we may have an SREAL list and introduce SREAL as a valid RTX argument.
This can be done incrementally.
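
A rough sketch of the direction (profile_counter_t is a name from this
thread, not existing code):

  /* In a shared profile header:  */
  typedef sreal profile_counter_t;

  /* basic_block and edge would then carry  */
  profile_counter_t count;     /* instead of gcov_type count;  */

With sreal arithmetic, scaling a count of 1 by a 50% probability yields
0.5 rather than truncating to 0.
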
> 
> BTW, Rong also implemented his runtime patch to do the COMDAT profile
> merging. However, that ended up having some issues that were solvable
> but would have caused us to lose all context sensitivity from COMDATs
> inlined during the profile-gen build. I am going to go back to solving
> this in the profile-use phase as we discussed in the separate thread
> on the COMDAT inlining patch I had been working on.

Yes, let's move ahead with this, too.  I think I should dig out the change
that made frequencies be guessed again.

As for COMDAT merging, I would like to see the patch.  I am experimenting
now with a patch to also privatize COMDATs during -fprofile-generate to
avoid the problems with lost profiles mentioned above.

As for context sensitivity, one approach would be to have two sets of
counters for every comdat - one merged globally and one counting local
instances.  We can then privatize always and, at profile read-in stage,
just clone every comdat and have two instances - one for the offline copy
and one for inlining.

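A rough sketch of the counter layout this implies (the structure and
N_COUNTERS are hypothetical, not existing gcov code):

  struct comdat_profile
  {
    /* Merged across all units, as .gcda merging works today.  */
    gcov_type global_counters[N_COUNTERS];
    /* Profile of this unit's privatized instance only, keeping the
       context-sensitive profile for inlined uses.  */
    gcov_type local_counters[N_COUNTERS];
  };

At profile read-in, the offline clone would be annotated from the global
set and the clone kept for inlining from the local set.
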
This is not different from how I always wanted to handle GNU extern inlines
(they also have this issue - when you do not inline one, the unit does not see
any profile of it).

We can just tie the two functions together so the "inline" version stays prior
to inlining, and then have the linker redirect to the inline version instead of
the offline version in such cases.  It already knows how to skip aliases and
this is not terribly different from that.

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Sept. 24, 2013, 5:57 p.m. UTC | #3
> 
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
> 
>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
> 
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
> 
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
> 
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.

I see, we produce:
;; if (_26 != 0)  

(insn 94 93 95 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 107 [ D.2184 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
        (eq:QI (reg:CCZ 17 flags)
            (const_int 0 [0]))) a.c:31 -1
     (nil)) 
        
(insn 96 95 97 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 122 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(jump_insn 97 96 98 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 100)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
        (nil)))
     
(insn 98 97 99 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 108 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1 
     (nil)) 
     
(jump_insn 99 98 100 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 0)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
        (nil)))

(code_label 100 99 0 14 "" [0 uses])

That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"

First I think the logic of do_jump should really be moved to trees.  It is not
doing anything that cannot be adequately represented by gimple.

I am not that certain we want to move it before profiling though.
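
Expanding the OR as control flow already at the gimple level would look
roughly like this (a sketch only; the two branch probabilities could then
be derived from the real edge counts instead of being guessed):

  if (_28 == 58)
    goto <bb 16>;
  else
    goto <bb X>;

<bb X>:
  if (_28 == 0)
    goto <bb 16>;
  else
    goto <bb 19>;
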
> 
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the

Yep, but it is still better than what we have now - if the test above was
in a hot part of the program (i.e. not executed just once), we will end up
optimizing the second conditional for size.

So I think it is a do_jump bug to not distribute the probabilities across the two
conditionals introduced.
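
One consistent way to distribute them (a sketch, assuming the two introduced
tests are given equal weight): if the combined OR-branch was observed taken
with probability p, give each of the two jumps probability q such that

  p = q + (1 - q) * q = 1 - (1 - q)^2,   i.e.   q = 1 - sqrt(1 - p)

For the 100/0 counts above p is 1, so q is 1 and neither introduced block is
wrongly given a zero count, while the 6100/3900 values in the dump above
ignore the profile entirely.
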
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.

Yep, we can probably do that, but we ought to fix the individual cases
above, at least for a reasonable number of runs.

Will you look into the logic of do_jump or shall I try to dive in?

Honza
Teresa Johnson Sept. 24, 2013, 6:25 p.m. UTC | #4
On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>
> I see, we produce:
> ;; if (_26 != 0)
>
> (insn 94 93 95 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>         (eq:QI (reg:CCZ 17 flags)
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 96 95 97 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 97 96 98 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 100)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>         (nil)))
>
> (insn 98 97 99 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 99 98 100 (set (pc)
>         (if_then_else (eq (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 0)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>         (nil)))
>
> (code_label 100 99 0 14 "" [0 uses])
>
> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>
> First I think the logic of do_jump should really be moved to trees.  It is not
> doing anything that cannot be adequately represented by gimple.
>
> I am not that certain we want to move it before profiling though.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>
> Yep, but it is still better than what we have now - if the test above was
> in a hot part of the program (i.e. not executed just once), we will end up
> optimizing the second conditional for size.
>
> So I think it is a do_jump bug to not distribute the probabilities across the two
> conditionals introduced.
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>
> Yep, we can probably do that, but we ought to fix the individual cases
> above, at least for a reasonable number of runs.

I made this change and it removed a few of the failures.

I looked at another case that still failed with 1 train run but passed
with 100. It turned out to be another truncation issue exposed by RTL
expansion, where we created some control flow for a memset builtin
which was in a block with an execution count of 1. Some of the blocks
got frequencies less than half the original block, so the count was
rounded down or truncated to 0. I noticed in this case (as well
as the jump threading case I fixed by looking for non-zero incoming
edges in partitioning) that the bb frequency was non-zero.

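For reference, a minimal illustration of that rounding, assuming the usual
round-to-nearest scaling of a count by a frequency ratio:

  #define RDIV(X, Y) (((X) + (Y) / 2) / (Y))   /* as in basic-block.h */

  /* A block with count 1 and frequency 100 is split so one piece gets
     frequency 40; its scaled count becomes
       RDIV (1 * 40, 100) = (40 + 50) / 100 = 0
     so any piece with less than half the original frequency drops to
     a count of 0.  */
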
Why not just have probably_never_executed_bb_p simply return
false when bb->frequency is non-zero (right now it does the opposite -
returns true when bb->frequency is 0)? Making this change removed a
bunch of other failures. With this change as well, there are only 3
cases that still fail with 1 train run but pass with 100. Need to
look at those.

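A minimal sketch of that suggestion, applied on top of the PROFILE_READ
block from the earlier patch (a sketch, not a committed change):

  if (profile_status_for_function (fun) == PROFILE_READ)
    {
      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
        return false;
      /* Trust a non-zero frequency as evidence the block may be
         executed, even when the count was truncated to 0.  */
      return !bb->frequency;
    }
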
>
> Will you look into the logic of do_jump or shall I try to dive in?

I can take a look, but probably won't have a chance until late this
week. If you don't get to it before then I will see if I can figure
out why it is applying the branch probabilities this way.

Teresa

>
> Honza
Teresa Johnson Sept. 24, 2013, 6:28 p.m. UTC | #5
Rong - can you answer the questions below on the comdat patch?


On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
> Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
> during startup with FDO (w/o partitioning). Firefox still does; I think
> the problem is lost samples due to different linker decisions even with LTO
> (i.e. the linker picks an object from a .a library at profile-generate time
> that it never passes later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
> fix that.

Should the same fix be applied to probably_never_executed_edge_p?

>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump threading. So
>> > I modified my script to separately report when the failure happens for a test trained
>> > once and a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumption that parts of programs
> with very low counts do not matter, since they are not part of the hot spot (and I
> cared only about the hot spot).  Now we care about identifying unlikely
> executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is a failure after one train run, FAIL is a failure after 100 train runs.
>> > We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
>> > I think it is a kind of design problem: while implementing counts&frequencies
>> > the idea was that small counts do not matter, so integer arithmetic is all
>> > right.
>> >
>> > I wonder if with the current C++ wonderland we can't simply switch count
>> > to a better representation: either sreal or fixed point with capping
>> > (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too.  We would need to do some evaluation of
> the compile time implications, but switching counts from gcov_type to a
> profile_counter_t that is a typedef to sreal seems a sane idea.
>
> We could switch the CFG code first.  There should not be many hot spots where
> counts are involved.  We can offline the common calculations we already moved
> to macros.
>
> We will also need to invent a REG representation for them.  Now we have INT_LIST
> for that; we may have an SREAL list and introduce SREAL as a valid RTX argument.
> This can be done incrementally.
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable
>> but would have caused us to lose all context sensitivity from COMDATs
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too.  I think I should dig out the change
> that made frequencies be guessed again.

I think I have that and was building my patch on top of it.

Rong:

>
> As for COMDAT merging, I would like to see the patch.  I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid the problems with lost profiles mentioned above.
>
> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances.  We can then privatize always and, at profile read-in stage,
> just clone every comdat and have two instances - one for the offline copy
> and one for inlining.
>
> This is not different from how I always wanted to handle GNU extern inlines
> (they also have this issue - when you do not inline one, the unit does not see
> any profile of it).
>
> We can just tie the two functions together so the "inline" version stays prior
> to inlining, and then have the linker redirect to the inline version instead of
> the offline version in such cases.  It already knows how to skip aliases and
> this is not terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Sept. 25, 2013, 9:33 p.m. UTC | #6
On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> I looked at one that failed after 100 as well (20031204-1.c). In this
>>> case, it was due to expansion which was creating multiple branches/bbs
>>> from a logical OR and guessing incorrectly on how to assign the
>>> counts:
>>>
>>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>>
>>> The (*cp == ':' || *cp == '\0') part looked like the following going
>>> into RTL expansion:
>>>
>>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>>   [20031204-1.c : 31:18] if (_31 != 0)
>>>     goto <bb 16>;
>>>   else
>>>     goto <bb 19>;
>>>
>>> where the result of the OR was always true, so bb 16 had a count of
>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>>> of the above turned into 2 bbs with a branch in between. Both
>>> comparisons were done in the first bb, but the first bb checked
>>> whether the result of the *cp == '\0' compare was true, and if not
>>> branched to the check for whether the *cp == ':' compare was true. It
>>> gave the branch to the second check against ':' a count of 0, so that
>>> bb got a count of 0 and was split out, and put the count of 100 on the
>>> fall through assuming the compare with '\0' always evaluated to true.
>>> In reality, this OR condition was always true because *cp was ':', not
>>> '\0'. Therefore, the count of 0 on the second block with the check for
>>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> I see, we produce:
>> ;; if (_26 != 0)
>>
>> (insn 94 93 95 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>>         (eq:QI (reg:CCZ 17 flags)
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 96 95 97 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 97 96 98 (set (pc)
>>         (if_then_else (ne (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 100)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>>         (nil)))
>>
>> (insn 98 97 99 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 99 98 100 (set (pc)
>>         (if_then_else (eq (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 0)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>>         (nil)))
>>
>> (code_label 100 99 0 14 "" [0 uses])
>>
>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>>
>> First I think the logic of do_jump should really be moved to trees.  It is not
>> doing anything that cannot be adequately represented by gimple.
>>
>> I am not that certain we want to move it before profiling though.
>>>
>>> Presumably we had the correct profile data for both blocks, but the
>>> accuracy was reduced when the OR was represented as a logical
>>> computation with a single branch. We could change the expansion code
>>> to do something different, e.g. treat as a 50-50 branch. But we would
>>> still end up with integer truncation issues when there was a single
>>> training run. But that could be dealt with conservatively in the
>>
>> Yep, but it is still better than what we have now - if the test above was
>> in a hot part of the program (i.e. not executed just once), we will end up
>> optimizing the second conditional for size.
>>
>> So I think it is a do_jump bug to not distribute the probabilities across the two
>> conditionals introduced.
>>> bbpart code as I suggested for the jump threading issue above. I.e. a
>>> cold block with incoming non-cold edges conservatively not marked cold
>>> for splitting.
>>
>> Yep, we can probably do that, but we ought to fix the individual cases
>> above, at least for a reasonable number of runs.
>
> I made this change and it removed a few of the failures.
>
> I looked at another case that still failed with 1 train run but passed
> with 100. It turned out to be another truncation issue exposed by RTL
> expansion, where we created some control flow for a memset builtin
> which was in a block with an execution count of 1. Some of the blocks
> got frequencies less than half the original block, so the count was
> rounded down or truncated to 0. I noticed in this case (as well
> as the jump threading case I fixed by looking for non-zero incoming
> edges in partitioning) that the bb frequency was non-zero.
>
> Why not just have probably_never_executed_bb_p simply return
> false when bb->frequency is non-zero (right now it does the opposite -
> returns true when bb->frequency is 0)? Making this change removed a
> bunch of other failures. With this change as well, there are only 3
> cases that still fail with 1 train run but pass with 100. Need to
> look at those.

FYI, these turned out to be more jump threading issues. I am currently
working on getting the jump threading profile updates to work properly; I
think I'm pretty close. Haven't had a chance to look at do_jump yet.

Teresa

>
>>
>> Will you look into the logic of do_jump or shall I try to dive in?
>
> I can take a look, but probably won't have a chance until late this
> week. If you don't get to it before then I will see if I can figure
> out why it is applying the branch probabilities this way.
>
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Sept. 26, 2013, 5:41 a.m. UTC | #7
On Wed, Sep 25, 2013 at 2:33 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote:
>> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>
>>>> I looked at one that failed after 100 as well (20031204-1.c). In this
>>>> case, it was due to expansion which was creating multiple branches/bbs
>>>> from a logical OR and guessing incorrectly on how to assign the
>>>> counts:
>>>>
>>>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>>>
>>>> The (*cp == ':' || *cp == '\0') part looked like the following going
>>>> into RTL expansion:
>>>>
>>>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>>>   [20031204-1.c : 31:18] if (_31 != 0)
>>>>     goto <bb 16>;
>>>>   else
>>>>     goto <bb 19>;
>>>>
>>>> where the result of the OR was always true, so bb 16 had a count of
>>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>>>> of the above turned into 2 bbs with a branch in between. Both
>>>> comparisons were done in the first bb, but the first bb checked
>>>> whether the result of the *cp == '\0' compare was true, and if not
>>>> branched to the check for whether the *cp == ':' compare was true. It
>>>> gave the branch to the second check against ':' a count of 0, so that
>>>> bb got a count of 0 and was split out, and put the count of 100 on the
>>>> fall through assuming the compare with '\0' always evaluated to true.
>>>> In reality, this OR condition was always true because *cp was ':', not
>>>> '\0'. Therefore, the count of 0 on the second block with the check for
>>>> ':' was incorrect, we ended up trying to execute it, and failed.
>>>
>>> I see, we produce:
>>> ;; if (_26 != 0)
>>>
>>> (insn 94 93 95 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>>>         (eq:QI (reg:CCZ 17 flags)
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (insn 96 95 97 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (jump_insn 97 96 98 (set (pc)
>>>         (if_then_else (ne (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (label_ref 100)
>>>             (pc))) a.c:31 -1
>>>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>>>         (nil)))
>>>
>>> (insn 98 97 99 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (jump_insn 99 98 100 (set (pc)
>>>         (if_then_else (eq (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (label_ref 0)
>>>             (pc))) a.c:31 -1
>>>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>>>         (nil)))
>>>
>>> (code_label 100 99 0 14 "" [0 uses])
>>>
>>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>>>
>>> First I think the logic of do_jump should really be moved to trees.  It is not
>>> doing anything that cannot be adequately represented by gimple.
>>>
>>> I am not that certain we want to move it before profiling though.
>>>>
>>>> Presumably we had the correct profile data for both blocks, but the
>>>> accuracy was reduced when the OR was represented as a logical
>>>> computation with a single branch. We could change the expansion code
>>>> to do something different, e.g. treat as a 50-50 branch. But we would
>>>> still end up with integer truncation issues when there was a single
>>>> training run. But that could be dealt with conservatively in the
>>>
>>> Yep, but it is still better than what we have now - if the test above was
>>> in a hot part of the program (i.e. not executed just once), we will end up
>>> optimizing the second conditional for size.
>>>
>>> So I think it is a do_jump bug to not distribute the probabilities across the two
>>> conditionals introduced.
>>>> bbpart code as I suggested for the jump threading issue above. I.e. a
>>>> cold block with incoming non-cold edges conservatively not marked cold
>>>> for splitting.
>>>
>>> Yep, we can probably do that, but we ought to fix the individual cases
>>> above, at least for a reasonable number of runs.
>>
>> I made this change and it removed a few of the failures.
>>
>> I looked at another case that still failed with 1 train run but passed
>> with 100. It turned out to be another truncation issue exposed by RTL
>> expansion, where we created some control flow for a memset builtin
>> which was in a block with an execution count of 1. Some of the blocks
>> got frequencies less than half the original block, so the count was
>> rounded down or truncated to 0. I noticed in this case (as well
>> as the jump threading case I fixed by looking for non-zero incoming
>> edges in partitioning) that the bb frequency was non-zero.
>>
>> Why not just have probably_never_executed_bb_p simply return
>> false when bb->frequency is non-zero (right now it does the opposite -
>> returns true when bb->frequency is 0)? Making this change removed a
>> bunch of other failures. With this change as well, there are only 3
>> cases that still fail with 1 train run but pass with 100. Need to
>> look at those.
>
> FYI, these turned out to be more jump threading issues. I am currently
> working on getting the jump threading profile updates to work properly; I
> think I'm pretty close. Haven't had a chance to look at do_jump yet.

Correction, it was not the 3 tests I mentioned above, but a different
set of tests being affected by this jump threading issue. I have a
patch I am regression testing. But there are other profile insanities
being caused upstream of jump threading that I haven't tracked down.
So I will also test and send the patch to handle some of these in the
splitting/cold-code detection code.  It also contains the change to
probably_never_executed_bb_p that I mentioned above, to return false when
bb->frequency is non-zero.

Teresa

>
> Teresa
>
>>
>>>
>>> Will you look into the logic of do_jump or shall I try to dive in?
>>
>> I can take a look, but probably won't have a chance until late this
>> week. If you don't get to it before then I will see if I can figure
>> out why it is applying the branch probabilities this way.
>>
>> Teresa
>>
>>>
>>> Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Rong Xu Sept. 26, 2013, 7:25 p.m. UTC | #8
On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
> Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
> during startup with FDO (w/o partitioning). Firefox still does; I think
> the problem is lost samples due to different linker decisions even with LTO
> (i.e. the linker picks an object from a .a library at profile-generate time
> that it never passes later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
> fix that.
>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump threading. So
>> > I modified my script to separately report when the failure happens for a test trained
>> > once and a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumption that parts of programs
> with very low counts do not matter, since they are not part of the hot spot (and I
> cared only about the hot spot).  Now we care about identifying unlikely
> executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
>> > We should take a look at FAILs and see if there are bugs to fix. For FAIL1
>> > I think it is kind of design problem: while implementing counts&frequencies
>> > the idea was that small counts do not matter, so integer arithmetic is all
>> > right.
>> >
>> > I wonder if with the current C++ wonderland we can't simply switch counts
>> > to a better representation, either sreal or fixed point with capping
>> > (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too.  We would need to do some evaluation of
> compile time implications, but switching counts from gcov_type to a
> profile_counter_t that is a typedef for sreal seems like a sane idea.
>
> We could switch the CFG code first.  There should not be many hot spots where
> counts are involved.  We can offline the common calculations we have already
> moved to macros.
>
> We will also need to invent a REG representation for them.  Now we have INT_LIST;
> similarly we may have an SREAL list and introduce SREAL as a valid RTX argument.
> This can be done incrementally.
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable
>> but would have caused us to lose all context sensitivity from COMDATs
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too.  I think I should dig out the change
> that made frequencies be guessed again.
>
> As for COMDAT merging, i would like to see the patch.  I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid problems with lost profiles mentioned above.
>

Do you mean you privatize every COMDAT function in the profile-generate build?
We discussed this idea internally and thought it would not work for
large applications (like in Google) due to size.

> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances.  We can then privatize always and at profile read-in stage
> just clone every comdat and have two instances - one for the offline copy
> and one for inlining.
>

In my implementation, I also allow multiple sets of COMDAT profiles
co-existing in one compilation.
Due to the auxiliary modules in LIPO, I actually have more than two.

But I'm wondering how you determine which profile to use for each
call-site -- the inline decisions may not
be the same in the profile-generate and profile-use compilations.

> This is not different from how I always wanted to handle GNU extern inlines
> (those also have this issue - when you do not inline one, the unit does not see
> any profile for it).
>
> We can just tie the two functions together so the "inline" version stays around
> prior to inlining, and then have the linker redirect to the inline version
> instead of the offline version in such cases.  It already knows how to skip
> aliases and this is not terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Sept. 26, 2013, 9:54 p.m. UTC | #9
> > As for COMDAT merging, i would like to see the patch.  I am experimenting
> > now with a patch to also privatize COMDATs during -fprofile-generate to
> > avoid problems with lost profiles mentioned above.
> >
> 
> Do you mean you privatize every COMDAT function in the profile-generate build?
> We discussed this idea internally and thought it would not work for
> large applications (like in Google) due to size.

Yes, Martin and I plan to test this on Firefox.  In a way you already have all
the COMDAT functions unshared in the object files, so the resulting binary
should not be completely off limits.  But I do not have any quantitative
data yet, since we hit a bug in constant folding and devirtualization that I fixed
in the meantime and we have not re-run the tests since.

> 
> > As for context sensitivity, one approach would be to have two sets of
> > counters for every comdat - one merged globally and one counting local
> > instances.  We can then privatize always and at profile read-in stage
> > just clone every comdat and have two instances - one for the offline copy
> > and one for inlining.
> >
> 
> In my implementation, I also allow multiple sets of COMDAT profiles
> co-existing in one compilation.
> Due to the auxiliary modules in LIPO, I actually have more than two.

How do auxiliary modules work?
> 
> But I'm wondering how you determine which profile to use for each
> call-site -- the inline decisions may not
> be the same in the profile-generate and profile-use compilations.

My suggestion was to simply use the module-local profile for all inline sites
within the given module and the global profile for the offline copy of the
function (that one will, in case it survives linking, be shared across
all the modules anyway).
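
In rough code, the idea would be something like this (a sketch only;
comdat_profile, clone_node_for_inlining and annotate_counts are hypothetical
placeholder names, not existing GCC API):

  /* Keep two counter sets per comdat and, at profile read-in, clone the
     function so each copy carries one of them.  */
  struct comdat_profile
  {
    gcov_type *global_counts;  /* merged across all modules */
    gcov_type *local_counts;   /* this module's instances only */
  };

  static void
  apply_comdat_profile (struct cgraph_node *node, struct comdat_profile *prof)
  {
    /* The clone is what inline sites within this module consume.  */
    struct cgraph_node *inline_copy = clone_node_for_inlining (node);
    annotate_counts (inline_copy, prof->local_counts);
    /* The offline copy keeps the globally merged profile; if it survives
       linking it is shared across all modules anyway.  */
    annotate_counts (node, prof->global_counts);
  }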

I think this may work in the cases where e.g. the use of hash templates in one
module is very different (in average size) from another module.
I did not really put much effort into it - I currently worry primarily about
the cases where the profile is lost completely because it gets attached to a
function that does not survive final linking (or because we inline something
we did not inline at profile time).

As for context sensitivity, we may try to consider developing a more consistent
solution for this.  COMDAT functions are definitely not the only ones that may
exhibit context-sensitive behaviour.
One approach would be to always have multiple counters for each function and
hash based on backtraces collected by the indirect call profiling instrumentation.
In a way this is the same as path profiling, but that would definitely add quite
some overhead + we will need to think of a reasonable way to represent this
within the compiler.

How do you decide what functions you want to have multiple profiles for?

Honza
Jan Hubicka Sept. 26, 2013, 10:02 p.m. UTC | #10
> 
> Why not just have probably_never_executed_bb_p simply return
> false when bb->frequency is non-zero (right now it does the opposite -

We want to have frequencies guessed for functions that were not trained
in the profiling run (that was the patch I posted earlier that I think did not
go in yet).

Currently I return true only when the frequency indicates that the BB is executed
in less than 1/4th of all executions.  With the cases discussed I see we may need
to reduce this threshold.  In general I do not much like hard tests for 0, because
the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed to be changeable,
and we may want to make frequencies sreal, too.

I suppose we may introduce a --param for this.  You are also right that I should
update probably_never_executed_edge_p (I intended to, but obviously the code
ended up in mainline accidentally).

I however saw at least one case of jump threading where this trick did not
help: the jump threading update confused itself by scaling via counts rather
than frequencies and ended up dropping everything to 0. This makes it
more tempting to try going with sreals for those...

Honza

> returns true when bb->frequency is 0)? Making this change removed a
> bunch of other failures. With this change as well, there are only 3
> cases that still fail with 1 train run that pass with 100. Need to
> look at those.
> 
> >
> > Will you look into logic of do_jump or shall I try to dive in?
> 
> I can take a look, but probably won't have a chance until late this
> week. If you don't get to it before then I will see if I can figure
> out why it is applying the branch probabilities this way.
> 
> Teresa
> 
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Rong Xu Sept. 26, 2013, 10:26 p.m. UTC | #11
On Thu, Sep 26, 2013 at 2:54 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > As for COMDAT merging, i would like to see the patch.  I am experimenting
>> > now with a patch to also privatize COMDATs during -fprofile-generate to
>> > avoid problems with lost profiles mentioned above.
>> >
>>
>> Do you mean you privatize every COMDAT function in the profile-generate build?
>> We discussed this idea internally and thought it would not work for
>> large applications (like in Google) due to size.
>
> Yes, Martin and I plan to test this on Firefox.  In a way you already have all
> the COMDAT functions unshared in the object files, so the resulting binary
> should not be completely off limits.  But I do not have any quantitative
> data yet, since we hit a bug in constant folding and devirtualization that I fixed
> in the meantime and we have not re-run the tests since.

The linker removes a great number of duplicated copies, especially for
template functions.
We don't have quantitative numbers either, but I'll collect some soon.
>
>>
>> > As for context sensitivity, one approach would be to have two sets of
>> > counters for every comdat - one merged globally and one counting local
>> > instances.  We can then privatize always and at profile read-in stage
>> > just clone every comdat and have two instances - one for the offline copy
>> > and one for inlining.
>> >
>>
>> In my implementation, I also allow multiple sets of COMDAT profiles
>> co-existing in one compilation.
>> Due to the auxiliary modules in LIPO, I actually have more than two.
>
> How do auxiliary modules work?

It pulls in multiple profiles from other compilations, so there might be
multiple inlined profiles.

>>
>> But I'm wondering how you determine which profile to use for each
>> call-site -- the inline decisions may not
>> be the same in the profile-generate and profile-use compilations.
>
> My suggestion was to simply use the module-local profile for all inline sites
> within the given module and the global profile for the offline copy of the
> function (that one will, in case it survives linking, be shared across
> all the modules anyway).

For a simple example like:
callsite1 --> comdat_function_foo
callsite2 --> comdat_function_foo

callsite1 is inlined at profile-generate time, so it has its own inlined
profile counters.
callsite2 is not inlined, so its profile goes to the offline copy.
Let's say callsite1 is cold (0 counter) and callsite2 is hot. Using the
local profile (the cold one) for callsite2 will not be correct.

>
> I think this may work in the cases where e.g. the use of hash templates in one
> module is very different (in average size) from another module.
> I did not really put much effort into it - I currently worry primarily about
> the cases where the profile is lost completely because it gets attached to a
> function that does not survive final linking (or because we inline something
> we did not inline at profile time).
>
> As for context sensitivity, we may try to consider developing a more consistent
> solution for this.  COMDAT functions are definitely not the only ones that may
> exhibit context-sensitive behaviour.
> One approach would be to always have multiple counters for each function and
> hash based on backtraces collected by the indirect call profiling instrumentation.
> In a way this is the same as path profiling, but that would definitely add quite
> some overhead + we will need to think of a reasonable way to represent this
> within the compiler.
>
> How do you decide what functions you want to have multiple profiles for?

I do the instrumentation after ipa-inline for comdat functions, so I know
whether a callsite is inlined or not. In the profile-use phase, I also need
to provide the context (which module this is from) to pick the right profile.

>
> Honza
Teresa Johnson Sept. 27, 2013, 2:15 p.m. UTC | #12
On Thu, Sep 26, 2013 at 3:02 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Why not just have probably_never_executed_bb_p simply return
>> false when bb->frequency is non-zero (right now it does the opposite -
>
> We want to have frequencies guessed for functions that were not trained
> in the profiling run (that was the patch I posted earlier that I think did not
> go in yet).

Right, but for splitting and bb layout purposes, for these statically
guessed unprofiled functions we in fact don't want to do any splitting
or treat the bbs as never executed (which shouldn't be a change from
the status quo, since all the bbs in these functions are currently 0
weight; it's only when we inline in the case of comdats that they
appear colder than the surrounding code, but in fact we don't want
this).

The only other caller to probably_never_executed_bb_p is
compute_function_frequency, but in the case of statically guessed
functions they will have profile_status != PROFILE_READ and won't
invoke probably_never_executed_bb_p. But re-reading our most recent
exchange on the comdat profile issue, it sounds like you were
suggesting guessing profiles for all 0-weight functions early, then
dropping them from PROFILE_READ to PROFILE_GUESSED only once we
determine in ipa-inline that there is a potentially non-zero call path
to them. In that case with the change I describe above to
probably_never_executed_bb_p, the 0-weight functions with 0 calls to
them will incorrectly be marked as NODE_FREQUENCY_NORMAL, which would
be bad as they would not be size optimized or moved into the cold
section.

So it seems like we want different handling of these guessed
frequencies in compute_function_frequency and bb-reorder.c. Actually I
think we can handle this by checking if the function entry block has a
0 count. If so, then we just look at the bb counts and not the
frequencies for determining bb hotness as the frequencies would
presumably have been statically-guessed. This will ensure that the
cgraph node continues to be marked unlikely and size-optimized. If the
function entry block has a non-zero count, then we look at both the bb
count and the bb frequency - if they are both zero then the bb is
probably never executed, but if either is non-zero then we should
treat the block as possibly executed (which will come into play for
splitting and bb layout).
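
Roughly, the check I have in mind would look like this (an untested sketch of
the proposal; the helper name is made up, though ENTRY_BLOCK_PTR, bb->count
and bb->frequency are the existing fields):

  /* If the function entry was never executed in training, the frequencies
     are static guesses, so rely on counts alone; otherwise require both
     the count and the frequency to be zero before treating the block as
     never executed.  */
  static bool
  bb_probably_never_executed_p (const_basic_block bb)
  {
    if (ENTRY_BLOCK_PTR->count == 0)
      return bb->count == 0;
    return bb->count == 0 && bb->frequency == 0;
  }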

Teresa

>
> Currently I return true only when the frequency indicates that the BB is executed
> in less than 1/4th of all executions.  With the cases discussed I see we may need
> to reduce this threshold.  In general I do not much like hard tests for 0, because
> the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed to be changeable,
> and we may want to make frequencies sreal, too.
>
> I suppose we may introduce a --param for this.  You are also right that I should
> update probably_never_executed_edge_p (I intended to, but obviously the code
> ended up in mainline accidentally).
>
> I however saw at least one case of jump threading where this trick did not
> help: the jump threading update confused itself by scaling via counts rather
> than frequencies and ended up dropping everything to 0. This makes it
> more tempting to try going with sreals for those...
>
> Honza
>
>> returns true when bb->frequency is 0)? Making this change removed a
>> bunch of other failures. With this change as well, there are only 3
>> cases that still fail with 1 train run that pass with 100. Need to
>> look at those.
>>
>> >
>> > Will you look into logic of do_jump or shall I try to dive in?
>>
>> I can take a look, but probably won't have a chance until late this
>> week. If you don't get to it before then I will see if I can figure
>> out why it is applying the branch probabilities this way.
>>
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
diff mbox

Patch

Index: predict.c
===================================================================
--- predict.c	(revision 202133)
+++ predict.c	(working copy)
@@ -232,8 +232,22 @@  bool
 probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
 {
   gcc_checking_assert (fun);
-  if (profile_info && flag_branch_probabilities)
-    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if (profile_status_for_function (fun) == PROFILE_READ)
+    {
+      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+	return false;
+      if (!bb->frequency)
+	return true;
+      if (!ENTRY_BLOCK_PTR->frequency)
+	return false;
+      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
+	{
+	  return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
+		        ENTRY_BLOCK_PTR->frequency)
+		  < REG_BR_PROB_BASE / 4);
+	}
+      return true;
+    }
   if ((!profile_info || !flag_branch_probabilities)
       && (cgraph_get_node (fun->decl)->frequency
 	  == NODE_FREQUENCY_UNLIKELY_EXECUTED))
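
For reference, a worked example of the rounding in the count test above (my
arithmetic, not from the thread): with profile_info->runs == 100, a block with
bb->count == 13 gives (13 * 4 + 50) / 100 == 1 > 0 and is immediately ruled out
of the never-executed set, while bb->count == 12 gives (12 * 4 + 50) / 100 == 0
and falls through to the frequency checks.  In other words, the count test
alone rules a block in once it executes in roughly 1/8th of the training runs.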