Message ID: 20130831214614.GA12372@kam.mff.cuni.cz
State: New
Hi Honza,

I am finally getting back to working on this after a few weeks of working on some other priorities.

On Sat, Aug 31, 2013 at 2:46 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> I ran my script on the execute testsuite and looked into a few testcases. The
> problem I found was roundoff errors - i.e. when expanding a switch we assume
> a 50% chance that an out-of-bound value is above or below. Both get rounded
> to 0, because the switch is executed once and the value is below.
>
> Partly this can be fixed by making probably_never_executed consider
> frequencies when counts are too coarse:
>
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202133)
> +++ predict.c   (working copy)
> @@ -232,8 +232,22 @@ bool
>  probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>  {
>    gcc_checking_assert (fun);
> -  if (profile_info && flag_branch_probabilities)
> -    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
> +  if (profile_status_for_function (fun) == PROFILE_READ)
> +    {
> +      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> +       return false;
> +      if (!bb->frequency)
> +       return true;
> +      if (!ENTRY_BLOCK_PTR->frequency)
> +       return false;
> +      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> +       {
> +         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
> +                       ENTRY_BLOCK_PTR->frequency)
> +                 < REG_BR_PROB_BASE / 4);
> +       }
> +      return true;
> +    }
>    if ((!profile_info || !flag_branch_probabilities)
>        && (cgraph_get_node (fun->decl)->frequency
>           == NODE_FREQUENCY_UNLIKELY_EXECUTED))

Did you mean to commit the above change? I see that it went in as part of r202258 but doesn't show up in the ChangeLog entry for that revision.

> In other cases it was mostly loop unrolling in combination with jump
> threading. So I modified my script to separately report when a failure
> happens for a test trained once and a test trained a hundred times.
Thanks for the linker script. I reproduced your results. I looked at a couple of cases. The first was one that failed after 1 training run only (20000910-2.c). It was due to jump threading, which you noted was a problem. For this one I think we can handle it in the partitioning, since there is an FDO insanity that we could probably treat more conservatively when splitting.

I looked at one that failed after 100 as well (20031204-1.c). In this case, it was due to expansion, which was creating multiple branches/bbs from a logical OR and guessing incorrectly how to assign the counts:

  if (octets == 4 && (*cp == ':' || *cp == '\0')) {

The (*cp == ':' || *cp == '\0') part looked like the following going into RTL expansion:

  [20031204-1.c : 31:33] _29 = _28 == 58;
  [20031204-1.c : 31:33] _30 = _28 == 0;
  [20031204-1.c : 31:33] _31 = _29 | _30;
  [20031204-1.c : 31:18] if (_31 != 0)
    goto <bb 16>;
  else
    goto <bb 19>;

where the result of the OR was always true, so bb 16 had a count of 100 and bb 19 a count of 0. When it was expanded, the expanded version of the above turned into 2 bbs with a branch in between. Both comparisons were done in the first bb, but the first bb checked whether the result of the *cp == '\0' compare was true, and if not branched to the check for whether the *cp == ':' compare was true. It gave the branch to the second check against ':' a count of 0, so that bb got a count of 0 and was split out, and it put the count of 100 on the fall through, assuming the compare with '\0' always evaluated to true. In reality, this OR condition was always true because *cp was ':', not '\0'. Therefore, the count of 0 on the second block with the check for ':' was incorrect; we ended up trying to execute it, and failed.

Presumably we had the correct profile data for both blocks, but the accuracy was reduced when the OR was represented as a logical computation with a single branch. We could change the expansion code to do something different, e.g. treat it as a 50-50 branch.
But we would still end up with integer truncation issues when there was a single training run. That could be dealt with conservatively in the bbpart code, as I suggested for the jump threading issue above; i.e. a cold block with incoming non-cold edges is conservatively not marked cold for splitting.

>
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>
> FAIL1 is failure after one run, FAIL is failure after 100 train runs.
> We should take a look at the FAILs and see if there are bugs to fix. For
> FAIL1 I think it is a kind of design problem: while implementing
> counts&frequencies the idea was that small counts do not matter, so
> integer arithmetic is all right.
>
> I wonder if with the current C++ wonderland we can't simply switch count
> to a better representation. Either sreal or fixed point with capping
> (the integer overflow issues are tiring, too).

It also seems like we should be able to detect the profile insanities caused by integer truncation and handle them conservatively. That being said, I see some sreal uses already in the profile.c code, so presumably we could use this for the counts as well if it turns out to be necessary?

BTW, Rong also implemented his runtime patch to do the COMDAT profile merging. However, that ended up having some issues that were solvable but would have caused us to lose all context sensitivity from COMDATs inlined during the profile-gen build. I am going to go back to solving this in the profile-use phase, as we discussed in the separate thread on the COMDAT inlining patch I had been working on.

Thanks,
Teresa

> Honza
> Hi Honza,
>
> I am finally getting back to working on this after a few weeks of
> working on some other priorities.

I am also trying to return to this, so good timing ;)

Martin has got smaller C++ programs (Inkscape) to not touch the cold segment during startup with FDO (w/o partitioning). Firefox still does; I think the problem is lost samples due to different linker decisions even with LTO (i.e. the linker picks an object from a .a library at profile-generate time that it never passes later). I plan to look into that today.

> Did you mean to commit the above change? I see that it went in as part
> of r202258 but doesn't show up in the ChangeLog entry for that
> revision.

Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog. I will fix that.

> > In other cases it was mostly loop unrolling in combination with jump
> > threading. So I modified my script to separately report when a failure
> > happens for a test trained once and a test trained a hundred times.
>
> Thanks for the linker script. I reproduced your results. I looked at a
> couple of cases. The first was one that failed after 1 training run only
> (20000910-2.c). It was due to jump threading, which you noted was a
> problem. For this one I think we can handle it in the partitioning,
> since there is an FDO insanity that we could probably treat more
> conservatively when splitting.

We should fix the roundoff issues - when I was introducing the frequency/probability/count system I made an assumption that parts of programs with very low counts do not matter, since they are not part of the hot spot (and I cared only about the hot spot). Now we care about identifying unlikely executed spots and we need to fix this.

> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
>
>   if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
>
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
>
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.
>
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.
> >
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> >
> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
> > We should take a look at the FAILs and see if there are bugs to fix. For
> > FAIL1 I think it is a kind of design problem: while implementing
> > counts&frequencies the idea was that small counts do not matter, so
> > integer arithmetic is all right.
> >
> > I wonder if with the current C++ wonderland we can't simply switch count
> > to a better representation. Either sreal or fixed point with capping
> > (the integer overflow issues are tiring, too).
>
> It also seems like we should be able to detect the profile insanities
> caused by integer truncation and handle them conservatively. That
> being said, I see some sreal uses already in the profile.c code, so
> presumably we could use this for the counts as well if it turns out to
> be necessary?

Yes, I was thinking about this, too. We would need to do some evaluation of the compile-time implications, but switching counts from gcov_type to a profile_counter_t that is a typedef to sreal seems a sane idea.

We could switch the CFG code first; there should not be many hot spots where counts are involved, and we can offline the common calculations we already moved to macros.

We will also need to invent an RTL representation for them. Now we have INT_LIST for that; we may have an SREAL list and introduce SREAL as a valid RTX argument. This can be done incrementally.

> BTW, Rong also implemented his runtime patch to do the COMDAT profile
> merging. However, that ended up having some issues that were solvable
> but would have caused us to lose all context sensitivity from COMDATs
> inlined during the profile-gen build. I am going to go back to solving
> this in the profile-use phase as we discussed in the separate thread
> on the COMDAT inlining patch I had been working on.

Yes, let's move ahead with this, too. I think I should dig out the change that made frequencies be guessed again.

As for COMDAT merging, I would like to see the patch. I am experimenting now with a patch to also privatize COMDATs during -fprofile-generate to avoid the problems with lost profiles mentioned above.

As for context sensitivity, one approach would be to have two sets of counters for every comdat - one merged globally and one counting local instances. We can then privatize always, and at profile read-in just clone every comdat and have two instances - one for the offline copy and one for inlining.

This is not different from how I always wanted to handle GNU extern inlines (which also have this issue - when you do not inline one, the unit does not see any profile for it).

We can just tie the two functions together so the "inline" version stays prior to inlining, and then have the linker redirect to the inline version instead of the offline version in such cases. It already knows how to skip aliases, and this is not terribly different from that.

Honza

> Thanks,
> Teresa
>
> > Honza
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
>
>   if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
>
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
>
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.
I see, we produce:

;; if (_26 != 0)

(insn 94 93 95 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 107 [ D.2184 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
        (eq:QI (reg:CCZ 17 flags)
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(insn 96 95 97 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 122 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(jump_insn 97 96 98 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 100)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
        (nil)))

(insn 98 97 99 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 108 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(jump_insn 99 98 100 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 0)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
        (nil)))

(code_label 100 99 0 14 "" [0 uses])

That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)".

First, I think the logic of do_jump should really be moved to trees. It is not doing anything that cannot be adequately represented by gimple. I am not that certain we want to move it before profiling, though.

> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the

Yep, but it is still better than what we have now - if the test above was in a hot part of the program (i.e. not executed once), we will end up optimizing the second conditional for size.

So I think it is a do_jump bug to not distribute the probabilities across the two conditionals introduced.

> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.

Yep, we can probably do that, but we ought to fix the individual cases above, at least for a reasonable number of runs.

Will you look into the logic of do_jump or shall I try to dive in?

Honza
On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>   if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>
> I see, we produce:
>
> ;; if (_26 != 0)
>
> (insn 94 93 95 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>         (eq:QI (reg:CCZ 17 flags)
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 96 95 97 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 97 96 98 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 100)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>         (nil)))
>
> (insn 98 97 99 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 99 98 100 (set (pc)
>         (if_then_else (eq (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 0)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>         (nil)))
>
> (code_label 100 99 0 14 "" [0 uses])
>
> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)".
>
> First, I think the logic of do_jump should really be moved to trees. It is
> not doing anything that cannot be adequately represented by gimple.
>
> I am not that certain we want to move it before profiling though.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>
> Yep, but it is still better than what we have now - if the test above was
> in a hot part of the program (i.e. not executed once), we will end up
> optimizing the second conditional for size.
>
> So I think it is a do_jump bug to not distribute the probabilities across
> the two conditionals introduced.
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>
> Yep, we can probably do that, but we ought to fix the individual cases
> above at least for a reasonable number of runs.

I made this change and it removed a few of the failures.

I looked at another case that still failed with 1 train run but passed with 100. It turned out to be another truncation issue exposed by RTL expansion, where we created some control flow for a memset builtin which was in a block with an execution count of 1. Some of the blocks got frequencies less than half the original block, so the count was rounded down or truncated to 0.

I noticed that in this case (as well as the jump threading case I fixed by looking for non-zero incoming edges in partitioning) the bb frequency was non-zero. Why not just have probably_never_executed_bb_p simply return false when bb->frequency is non-zero (right now it does the opposite - it returns true when bb->frequency is 0)? Making this change removed a bunch of other failures. With this change as well, there are only 3 cases that still fail with 1 train run but pass with 100. I need to look at those.

> Will you look into the logic of do_jump or shall I try to dive in?

I can take a look, but probably won't have a chance until late this week. If you don't get to it before then, I will see if I can figure out why it is applying the branch probabilities this way.

Teresa

> Honza
Rong - can you answer the questions below on the comdat patch?

On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
>
> Martin has got smaller C++ programs (Inkscape) to not touch the cold
> segment during startup with FDO (w/o partitioning). Firefox still does; I
> think the problem is lost samples due to different linker decisions even
> with LTO (i.e. the linker picks an object from a .a library at
> profile-generate time that it never passes later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog. I
> will fix that.

Should the same fix be applied to probably_never_executed_edge_p?

>>
>> > In other cases it was mostly loop unrolling in combination with jump
>> > threading. So I modified my script to separately report when a failure
>> > happens for a test trained once and a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple of cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumption that parts of
> programs with very low counts do not matter, since they are not part of
> the hot spot (and I cared only about the hot spot). Now we care about
> identifying unlikely executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>   if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
>> > We should take a look at the FAILs and see if there are bugs to fix.
>> > For FAIL1 I think it is a kind of design problem: while implementing
>> > counts&frequencies the idea was that small counts do not matter, so
>> > integer arithmetic is all right.
>> >
>> > I wonder if with the current C++ wonderland we can't simply switch
>> > count to a better representation. Either sreal or fixed point with
>> > capping (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too. We would need to do some evaluation
> of the compile-time implications, but switching counts from gcov_type to
> a profile_counter_t that is a typedef to sreal seems a sane idea.
>
> We could switch the CFG code first; there should not be many hot spots
> where counts are involved, and we can offline the common calculations we
> already moved to macros.
>
> We will also need to invent an RTL representation for them. Now we have
> INT_LIST for that; we may have an SREAL list and introduce SREAL as a
> valid RTX argument. This can be done incrementally.
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable
>> but would have caused us to lose all context sensitivity from COMDATs
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too. I think I should dig out the change
> that made frequencies be guessed again.

I think I have that and was building my patch on top of it.

Rong:

> As for COMDAT merging, I would like to see the patch. I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid the problems with lost profiles mentioned above.
>
> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances. We can then privatize always, and at profile read-in just
> clone every comdat and have two instances - one for the offline copy and
> one for inlining.
>
> This is not different from how I always wanted to handle GNU extern
> inlines (which also have this issue - when you do not inline one, the
> unit does not see any profile for it).
>
> We can just tie the two functions together so the "inline" version stays
> prior to inlining, and then have the linker redirect to the inline
> version instead of the offline version in such cases. It already knows
> how to skip aliases, and this is not terribly different from that.
> > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote: > On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> >>> I looked at one that failed after 100 as well (20031204-1.c). In this >>> case, it was due to expansion which was creating multiple branches/bbs >>> from a logical OR and guessing incorrectly on how to assign the >>> counts: >>> >>> if (octets == 4 && (*cp == ':' || *cp == '\0')) { >>> >>> The (*cp == ':' || *cp == '\0') part looked like the following going >>> into RTL expansion: >>> >>> [20031204-1.c : 31:33] _29 = _28 == 58; >>> [20031204-1.c : 31:33] _30 = _28 == 0; >>> [20031204-1.c : 31:33] _31 = _29 | _30; >>> [20031204-1.c : 31:18] if (_31 != 0) >>> goto <bb 16>; >>> else >>> goto <bb 19>; >>> >>> where the result of the OR was always true, so bb 16 had a count of >>> 100 and bb 19 a count of 0. When it was expanded, the expanded version >>> of the above turned into 2 bbs with a branch in between. Both >>> comparisons were done in the first bb, but the first bb checked >>> whether the result of the *cp == '\0' compare was true, and if not >>> branched to the check for whether the *cp == ':' compare was true. It >>> gave the branch to the second check against ':' a count of 0, so that >>> bb got a count of 0 and was split out, and put the count of 100 on the >>> fall through assuming the compare with '\0' always evaluated to true. >>> In reality, this OR condition was always true because *cp was ':', not >>> '\0'. Therefore, the count of 0 on the second block with the check for >>> ':' was incorrect, we ended up trying to execute it, and failed. 
>> >> I see, we produce: >> ;; if (_26 != 0) >> >> (insn 94 93 95 (set (reg:CCZ 17 flags) >> (compare:CCZ (reg:QI 107 [ D.2184 ]) >> (const_int 0 [0]))) a.c:31 -1 >> (nil)) >> >> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ]) >> (eq:QI (reg:CCZ 17 flags) >> (const_int 0 [0]))) a.c:31 -1 >> (nil)) >> >> (insn 96 95 97 (set (reg:CCZ 17 flags) >> (compare:CCZ (reg:QI 122 [ D.2186 ]) >> (const_int 0 [0]))) a.c:31 -1 >> (nil)) >> >> (jump_insn 97 96 98 (set (pc) >> (if_then_else (ne (reg:CCZ 17 flags) >> (const_int 0 [0])) >> (label_ref 100) >> (pc))) a.c:31 -1 >> (expr_list:REG_BR_PROB (const_int 6100 [0x17d4]) >> (nil))) >> >> (insn 98 97 99 (set (reg:CCZ 17 flags) >> (compare:CCZ (reg:QI 108 [ D.2186 ]) >> (const_int 0 [0]))) a.c:31 -1 >> (nil)) >> >> (jump_insn 99 98 100 (set (pc) >> (if_then_else (eq (reg:CCZ 17 flags) >> (const_int 0 [0])) >> (label_ref 0) >> (pc))) a.c:31 -1 >> (expr_list:REG_BR_PROB (const_int 3900 [0xf3c]) >> (nil))) >> >> (code_label 100 99 0 14 "" [0 uses]) >> >> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)" >> >> First I think the logic of do_jump should really be moved to trees. It is not >> doing things that can not be adequately represented by gimple. >> >> I am not that certain we want to move it before profiling though. >>> >>> Presumably we had the correct profile data for both blocks, but the >>> accuracy was reduced when the OR was represented as a logical >>> computation with a single branch. We could change the expansion code >>> to do something different, e.g. treat as a 50-50 branch. But we would >>> still end up with integer truncation issues when there was a single >>> training run. But that could be dealt with conservatively in the >> >> Yep, but it is still better than what we have now - if the test above was >> in hot part of program (i.e. not executed once), we will end up optimizing >> the second conditional for size. 
>> >> So I think it is a do_jump bug to not distribute probabilities across the two >> conditionals introduced. >>> bbpart code as I suggested for the jump threading issue above. I.e. a >>> cold block with incoming non-cold edges conservatively not marked cold >>> for splitting. >> >> Yep, we can probably do that, but we ought to fix the individual cases >> above at least for a reasonable number of runs. > > I made this change and it removed a few of the failures. > > I looked at another case that still failed with 1 train run but passed > with 100. It turned out to be another truncation issue exposed by RTL > expansion, where we created some control flow for a memset builtin > which was in a block with an execution count of 1. Some of the blocks > got frequencies less than half the original block, so the count was > rounded down or truncated to 0. I noticed that in this case (as well > as the jump threading case I fixed by looking for non-zero incoming > edges in partitioning) that the bb frequency was non-zero. > > Why not just have probably_never_executed_bb_p simply return > false when bb->frequency is non-zero (right now it does the opposite - > returns true when bb->frequency is 0)? Making this change removed a > bunch of other failures. With this change as well, there are only 3 > cases that still fail with 1 train run that pass with 100. Need to > look at those. FYI, these turned out to be more jump threading issues. I am currently working on getting jump threading profile updates to work properly, I think I'm pretty close. Haven't had a chance to look at do_jump yet. Teresa > >> >> Will you look into the logic of do_jump or shall I try to dive in? > > I can take a look, but probably won't have a chance until late this > week. If you don't get to it before then I will see if I can figure > out why it is applying the branch probabilities this way. > > Teresa > >> >> Honza > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Wed, Sep 25, 2013 at 2:33 PM, Teresa Johnson <tejohnson@google.com> wrote: > On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote: >> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>>> >>>> I looked at one that failed after 100 as well (20031204-1.c). In this >>>> case, it was due to expansion which was creating multiple branches/bbs >>>> from a logical OR and guessing incorrectly on how to assign the >>>> counts: >>>> >>>> if (octets == 4 && (*cp == ':' || *cp == '\0')) { >>>> >>>> The (*cp == ':' || *cp == '\0') part looked like the following going >>>> into RTL expansion: >>>> >>>> [20031204-1.c : 31:33] _29 = _28 == 58; >>>> [20031204-1.c : 31:33] _30 = _28 == 0; >>>> [20031204-1.c : 31:33] _31 = _29 | _30; >>>> [20031204-1.c : 31:18] if (_31 != 0) >>>> goto <bb 16>; >>>> else >>>> goto <bb 19>; >>>> >>>> where the result of the OR was always true, so bb 16 had a count of >>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version >>>> of the above turned into 2 bbs with a branch in between. Both >>>> comparisons were done in the first bb, but the first bb checked >>>> whether the result of the *cp == '\0' compare was true, and if not >>>> branched to the check for whether the *cp == ':' compare was true. It >>>> gave the branch to the second check against ':' a count of 0, so that >>>> bb got a count of 0 and was split out, and put the count of 100 on the >>>> fall through assuming the compare with '\0' always evaluated to true. >>>> In reality, this OR condition was always true because *cp was ':', not >>>> '\0'. Therefore, the count of 0 on the second block with the check for >>>> ':' was incorrect, we ended up trying to execute it, and failed. 
>>> >>> I see, we produce: >>> ;; if (_26 != 0) >>> >>> (insn 94 93 95 (set (reg:CCZ 17 flags) >>> (compare:CCZ (reg:QI 107 [ D.2184 ]) >>> (const_int 0 [0]))) a.c:31 -1 >>> (nil)) >>> >>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ]) >>> (eq:QI (reg:CCZ 17 flags) >>> (const_int 0 [0]))) a.c:31 -1 >>> (nil)) >>> >>> (insn 96 95 97 (set (reg:CCZ 17 flags) >>> (compare:CCZ (reg:QI 122 [ D.2186 ]) >>> (const_int 0 [0]))) a.c:31 -1 >>> (nil)) >>> >>> (jump_insn 97 96 98 (set (pc) >>> (if_then_else (ne (reg:CCZ 17 flags) >>> (const_int 0 [0])) >>> (label_ref 100) >>> (pc))) a.c:31 -1 >>> (expr_list:REG_BR_PROB (const_int 6100 [0x17d4]) >>> (nil))) >>> >>> (insn 98 97 99 (set (reg:CCZ 17 flags) >>> (compare:CCZ (reg:QI 108 [ D.2186 ]) >>> (const_int 0 [0]))) a.c:31 -1 >>> (nil)) >>> >>> (jump_insn 99 98 100 (set (pc) >>> (if_then_else (eq (reg:CCZ 17 flags) >>> (const_int 0 [0])) >>> (label_ref 0) >>> (pc))) a.c:31 -1 >>> (expr_list:REG_BR_PROB (const_int 3900 [0xf3c]) >>> (nil))) >>> >>> (code_label 100 99 0 14 "" [0 uses]) >>> >>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)" >>> >>> First I think the logic of do_jump should really be moved to trees. It is not >>> doing things that can not be adequately represented by gimple. >>> >>> I am not that certain we want to move it before profiling though. >>>> >>>> Presumably we had the correct profile data for both blocks, but the >>>> accuracy was reduced when the OR was represented as a logical >>>> computation with a single branch. We could change the expansion code >>>> to do something different, e.g. treat as a 50-50 branch. But we would >>>> still end up with integer truncation issues when there was a single >>>> training run. But that could be dealt with conservatively in the >>> >>> Yep, but it is still better than what we have now - if the test above was >>> in hot part of program (i.e. not executed once), we will end up optimizing >>> the second conditional for size. 
>>> >>> So I think it is do_jump bug to not distribute probabilities across the two >>> conditoinals introduced. >>>> bbpart code as I suggested for the jump threading issue above. I.e. a >>>> cold block with incoming non-cold edges conservatively not marked cold >>>> for splitting. >>> >>> Yep, we can probably do that, but we ought to fix the individual cases >>> above at least for resonable number of runs. >> >> I made this change and it removed a few of the failures. >> >> I looked at another case that still failed with 1 train run but passed >> with 100. It turned out to be another truncation issue exposed by RTL >> expansion, where we created some control flow for a memset builtin >> which was in a block with an execution count of 1. Some of the blocks >> got frequencies less than half the original block, so the count was >> rounded down or truncated to 0. I noticed that in this case (as well >> as the jump threading case I fixed by looking for non-zero incoming >> edges in partitioning) that the bb frequency was non-zero. >> >> Why not just have probably_never_executed_bb_p return simply return >> false bb->frequency is non-zero (right now it does the opposite - >> returns true when bb->frequency is 0)? Making this change removed a >> bunch of other failures. With this change as well, there are only 3 >> cases that still fail with 1 train run that pass with 100. Need to >> look at those. > > FYI, These turned out to be more jump threading issues. I am currently > working on getting jump threading profile updates to work properly, I > think I'm pretty close. Haven't had a chance to look at do_jump yet. Correction, it was not the 3 tests I mentioned above, but a different set of tests being affected by this jump threading issue. I have a patch I am regression testing. But there are other profile insanities being caused upstream of jump threading that I haven't tracked down. 
So I will also test and send the patch to handle some of these in the splitting/cold cold detection code. It also contains the change to probably_never_executed_bb_p that I mention above to return false when bb->frequency is non-zero. Teresa > > Teresa > >> >>> >>> Will you look into logic of do_jump or shall I try to dive in? >> >> I can take a look, but probably won't have a chance until late this >> week. If you don't get to it before then I will see if I can figure >> out why it is applying the branch probabilities this way. >> >> Teresa >> >>> >>> Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Hi Honza, >> >> I am finally getting back to working on this after a few weeks of >> working on some other priorities. > > I am also trying to return to this, so good timing ;) > Martin has got smaller C++ programs (Inkscape) to not touch the cold segment > during startup with FDO (w/o partitioning). Firefox still does, I think > the problem is lost samples due to different linker decisions even with LTO. > (i.e. the linker picks an object from a .a library at profile-generate time that it never > passes later.). > > I plan to look into that today. >> >> Did you mean to commit the above change? I see that it went in as part >> of r202258 but doesn't show up in the ChangeLog entry for that >> revision. > > Yes, I meant to check it in, but did not mean to do so w/o a Changelog. I will > fix that. >> >> > >> > In other cases it was mostly loop unrolling in combination with jump threading. So >> > I modified my script to separately report when failure happens for test trained >> > once and test trained hundred times. >> >> Thanks for the linker script. I reproduced your results. I looked at a >> couple cases. The first was one that failed after 1 training run only >> (20000910-2.c). It was due to jump threading, which you noted was a >> problem. For this one I think we can handle it in the partitioning, >> since there is an FDO insanity that we could probably treat more >> conservatively when splitting. > > We should fix the roundoff issues - when I was introducing the > frequency/probability/count system I made an assumption that parts of programs > with very low counts do not matter, since they are not part of a hot spot (and I > cared only about the hot spot). Now we care about identifying unlikely > executed spots and we need to fix this. >> >> I looked at one that failed after 100 as well (20031204-1.c).
In this >> case, it was due to expansion which was creating multiple branches/bbs >> from a logical OR and guessing incorrectly on how to assign the >> counts: >> >> if (octets == 4 && (*cp == ':' || *cp == '\0')) { >> >> The (*cp == ':' || *cp == '\0') part looked like the following going >> into RTL expansion: >> >> [20031204-1.c : 31:33] _29 = _28 == 58; >> [20031204-1.c : 31:33] _30 = _28 == 0; >> [20031204-1.c : 31:33] _31 = _29 | _30; >> [20031204-1.c : 31:18] if (_31 != 0) >> goto <bb 16>; >> else >> goto <bb 19>; >> >> where the result of the OR was always true, so bb 16 had a count of >> 100 and bb 19 a count of 0. When it was expanded, the expanded version >> of the above turned into 2 bbs with a branch in between. Both >> comparisons were done in the first bb, but the first bb checked >> whether the result of the *cp == '\0' compare was true, and if not >> branched to the check for whether the *cp == ':' compare was true. It >> gave the branch to the second check against ':' a count of 0, so that >> bb got a count of 0 and was split out, and put the count of 100 on the >> fall through assuming the compare with '\0' always evaluated to true. >> In reality, this OR condition was always true because *cp was ':', not >> '\0'. Therefore, the count of 0 on the second block with the check for >> ':' was incorrect, we ended up trying to execute it, and failed. >> >> Presumably we had the correct profile data for both blocks, but the >> accuracy was reduced when the OR was represented as a logical >> computation with a single branch. We could change the expansion code >> to do something different, e.g. treat as a 50-50 branch. But we would >> still end up with integer truncation issues when there was a single >> training run. But that could be dealt with conservatively in the >> bbpart code as I suggested for the jump threading issue above. I.e. a >> cold block with incoming non-cold edges conservatively not marked cold >> for splitting. 
>> >> > >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c >> > FAIL 
/home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c >> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c >> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c >> > >> > FAIL1 is failure after one run, FIAL is failure after 100 train runs. >> > We should take look at FAILs and see if there are bugs to fix. For FAIL1 >> > I think it is kind of design problem: while implementing counts&frequencies >> > the idea was that small counts do not matter, so integer arithmetic is all >> > right. >> > >> > I wonder if with current C++ wonderland we can't simply switch count >> > to a better representation. Either sreal or fixedpoint with capping >> > (the integer overflow issues are tiring, too). >> >> It also seems like we should be able to detect the profile insanities >> caused by integer truncation and handle them conservatively. That >> being said, I see some sreal uses already in the profile.c code, so >> presumably we could use this for the counts as well if it turns out to >> be necessary? > > Yes, I was thinking about this, too. We would need to do some evaulation of > compile time implications but switching counts from gcov_type to > profile_counter_t that is typedef to sreal seems sane idea. > > We could switch CFG code first. there should not be many hot spots where > counts are involved. 
We can offline the common calculation we already moved to > macros that > > We will need to invent also REG representation for them. Now we have INT_LIST > for that we may have SREAL list and introduce SREAL as valid RTX argument. > This can be done incrementally. >> >> BTW, Rong also implemented his runtime patch to do the COMDAT profile >> merging. However, that ended up having some issues, that were solvable >> but would have caused us to lose all context sensitivity from COMDATS >> inlined during the profile-gen build. I am going to go back to solving >> this in the profile-use phase as we discussed in the separate thread >> on the COMDAT inlining patch I had been working on. > > Yes, lets move ahead with this, too. I think i should dig out the change > that made frequencies to be guessed again. > > As for COMDAT merging, i would like to see the patch. I am experimenting > now with a patch to also privatize COMDATs during -fprofile-generate to > avoid problems with lost profiles mentioned above. > Do you mean you privatize every COMDAT function in the profile-generate? We discussed this idea internally and we thought it would not work for large applications (like in google) due to size. > As for context sensitivity, one approach would be to have two sets of > counters for every comdat - one merged globally and one counting local > instances. We can then privatize always and at profile read in stage > just clone every comdat and have two instances - one for offline copy > and one for inlining. > In my implementation, I also allow multiple sets of COMDAT profile co-existing in one compilation. Due to the auxiliary modules in LIPO, I actually have more than two. But I'm wondering how do you determine which profile to use for each call-site -- the inline decision may not be the same for profile-generate and profile-use compilation. 
> This is not different from how I always wanted to handle GNU extern inlines > (that also do have this issue - when you do not inline it, the unit do not see > any profile of it). > > We can just tie the two functions together so "inline" version stay prior > inlining and then have linker to redirect to inline version instead of offline > version in such cases. It already knows how to skip aliases and this is not > terribly different from that. > > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> > As for COMDAT merging, i would like to see the patch. I am experimenting > > now with a patch to also privatize COMDATs during -fprofile-generate to > > avoid problems with lost profiles mentioned above. > > > > Do you mean you privatize every COMDAT function in the profile-generate? > We discussed this idea internally and we thought it would not work for > large applications (like in google) due to size. Yes, Martin and I plan to test this on firefox. In a way you already have all the COMDAT functions unshared in the object files, so the resulting binary should not be completely off the limits. But I do not have any quantitative data, yet, since we hit bug in constant folding and devirtualization I fixed in meantime but we did not re-run the tests yet. > > > As for context sensitivity, one approach would be to have two sets of > > counters for every comdat - one merged globally and one counting local > > instances. We can then privatize always and at profile read in stage > > just clone every comdat and have two instances - one for offline copy > > and one for inlining. > > > > In my implementation, I also allow multiple sets of COMDAT profile > co-existing in one compilation. > Due to the auxiliary modules in LIPO, I actually have more than two. How does auxiliary modules work? > > But I'm wondering how do you determine which profile to use for each > call-site -- the inline decision may not > be the same for profile-generate and profile-use compilation. My suggestion was to simply use the module local profile for all inline sites within the given module and the global profile for the offline copy of the function (that one will, in the case it survives linking, be shared across all the modules anyway). I think this may work in the cases where i.e. use of hash templates in one module is very different (in average size) from other module. 
I did not really put much effort into it - I currently worry primarily about the cases where the profile is lost completely since it gets attached to a function not surviving final linking (or because we inline something we did not inline at profile time). As for context sensitivity, we may try to consider developing a more consistent solution for this. COMDAT functions are definitely not the only ones that may exhibit context sensitive behaviour. One approach would be to always have multiple counters for each function and hash based on backtraces collected by indirect call profiling instrumentation. In a way this is the same as path profiling, but that would definitely add quite some overhead + we will need to think of a reasonable way to represent this within the compiler. How do you decide what functions you want to have multiple profiles for? Honza
> > Why not just have probably_never_executed_bb_p return simply return > false bb->frequency is non-zero (right now it does the opposite - We want to have frequencies guessed for functions that were not trained in the profiling run (that was the patch I posted earlier that I think did not go in, yet). Currently I return true when the frequency indicates that the BB is executed in less than 1/4th of all executions. With the cases discussed I see we may need to reduce this threshold. In general I do not much like hard tests for 0 because the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed to be changeable, and we may want to make frequencies sreal, too. I suppose we may introduce a --param for this. You are also right that I should update probably_never_executed_edge_p (I intended so, but obviously the code ended up in mainline accidentally). I however saw at least one case of jump threading where this trick did not help: the jump threading update confused itself by scaling via counts rather than frequencies and ended up with dropping everything to 0. This makes it more tempting to try to go with sreals for those.... Honza > returns true when bb->frequency is 0)? Making this change removed a > bunch of other failures. With this change as well, there are only 3 > cases that still fail with 1 train run that pass with 100. Need to > look at those. > > > > > Will you look into logic of do_jump or shall I try to dive in? > > I can take a look, but probably won't have a chance until late this > week. If you don't get to it before then I will see if I can figure > out why it is applying the branch probabilities this way. > > Teresa > > > > > Honza > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Thu, Sep 26, 2013 at 2:54 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> > As for COMDAT merging, i would like to see the patch. I am experimenting >> > now with a patch to also privatize COMDATs during -fprofile-generate to >> > avoid problems with lost profiles mentioned above. >> > >> >> Do you mean you privatize every COMDAT function in the profile-generate? >> We discussed this idea internally and we thought it would not work for >> large applications (like in google) due to size. > > Yes, Martin and I plan to test this on firefox. In a way you already have all > the COMDAT functions unshared in the object files, so the resulting binary > should not be completely off the limits. But I do not have any quantitative > data, yet, since we hit bug in constant folding and devirtualization I fixed in > meantime but we did not re-run the tests yet. LInker removes a great numbers of duplicated copies, esp for those template functions. We don't have a quantitative numbers either. But I'll collect some soon. > >> >> > As for context sensitivity, one approach would be to have two sets of >> > counters for every comdat - one merged globally and one counting local >> > instances. We can then privatize always and at profile read in stage >> > just clone every comdat and have two instances - one for offline copy >> > and one for inlining. >> > >> >> In my implementation, I also allow multiple sets of COMDAT profile >> co-existing in one compilation. >> Due to the auxiliary modules in LIPO, I actually have more than two. > > How does auxiliary modules work? It pulls in multiple profiles from other compilation. So there might be multiple inlined profiles. >> >> But I'm wondering how do you determine which profile to use for each >> call-site -- the inline decision may not >> be the same for profile-generate and profile-use compilation. 
>
> My suggestion was to simply use the module-local profile for all inline
> sites within the given module and the global profile for the offline copy
> of the function (that one will, in case it survives linking, be shared
> across all the modules anyway).

For a simple example like:

  callsite1 --> comdat_function_foo
  callsite2 --> comdat_function_foo

callsite1 is inlined in profile-generate, so it has its own inlined
profile counters. callsite2 is not inlined, and its profile goes to the
offline copy. Let's say callsite1 is cold (0 counter) and callsite2 is
hot. Using the local profile (the cold one) for callsite2 would not be
correct.

>
> I think this may work in cases where e.g. the use of hash templates in
> one module is very different (in average size) from another module.
> I did not really put much effort into it - I currently worry primarily
> about the cases where the profile is lost completely, since it gets
> attached to a function not surviving final linking (or because we inline
> something we did not inline at profile time).
>
> As for context sensitivity, we may try to consider developing a more
> consistent solution for this. COMDAT functions are definitely not the
> only ones that may exhibit context-sensitive behaviour.
> One approach would be to always have multiple counters for each function
> and hash based on backtraces collected by indirect call profiling
> instrumentation. In a way this is the same as path profiling, but that
> would definitely add quite some overhead + we will need to think of a
> reasonable way to represent this within the compiler.
>
> How do you decide what functions you want to have multiple profiles for?

I do the instrumentation after ipa-inline for comdat functions, so I know
whether a callsite is inlined or not. In the profile-use phase, I also
need to provide the context (which module this is from) to pick the right
profile.

>
> Honza
On Thu, Sep 26, 2013 at 3:02 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Why not just have probably_never_executed_bb_p simply return
>> false when bb->frequency is non-zero (right now it does the opposite -
>
> We want to have frequencies guessed for functions that were not trained
> in the profiling run (that was the patch I posted earlier that I think
> did not go in, yet).

Right, but for splitting and bb layout purposes, for these statically
guessed unprofiled functions we in fact don't want to do any splitting or
treat the bbs as never executed. (That shouldn't be a change from the
status quo, since all the bbs in these functions are currently 0 weight;
it's only when we inline in the case of comdats that they appear colder
than the surrounding code, but in fact we don't want this.) The only other
caller of probably_never_executed_bb_p is compute_function_frequency, but
in the case of statically guessed functions they will have profile_status
!= PROFILE_READ and won't invoke probably_never_executed_bb_p.

But re-reading our most recent exchange on the comdat profile issue, it
sounds like you were suggesting guessing profiles for all 0-weight
functions early, then dropping them from PROFILE_READ to PROFILE_GUESSED
only once we determine in ipa-inline that there is a potentially non-zero
call path to them. In that case, with the change I describe above to
probably_never_executed_bb_p, the 0-weight functions with 0 calls to them
would incorrectly be marked as NODE_FREQUENCY_NORMAL, which would be bad
as they would not be size-optimized or moved into the cold section. So it
seems like we want different handling of these guessed frequencies in
compute_function_frequency and bb-reorder.c.

Actually, I think we can handle this by checking whether the function
entry block has a 0 count. If so, then we just look at the bb counts and
not the frequencies for determining bb hotness, as the frequencies would
presumably have been statically guessed.
This will ensure that the cgraph node continues to be marked unlikely and
size-optimized. If the function entry block has a non-zero count, then we
look at both the bb count and the bb frequency - if they are both zero
then the bb is probably never executed, but if either is non-zero then we
should treat the block as possibly executed (which will come into play for
splitting and bb layout).

Teresa

>
> Currently I return true when the frequency indicates that the BB is
> executed in less than 1/4th of all executions. With the cases discussed
> I see we may need to reduce this threshold. In general I do not like
> hard tests for 0 much, because the meaning of 0 depends on
> REG_BR_PROB_BASE, which is supposed to be changeable, and we may want to
> make frequencies sreal, too.
>
> I suppose we may introduce a --param for this. You are also right that
> I should update probably_never_executed_edge_p (I intended to, but
> obviously the code ended up in mainline accidentally).
>
> I however saw at least one case of jump threading where this trick did
> not help: the jump threading update confused itself by scaling via
> counts rather than frequencies and ended up dropping everything to 0.
> This makes it more tempting to try to go with sreals for those....
>
> Honza
>
>> returns true when bb->frequency is 0)? Making this change removed a
>> bunch of other failures. With this change as well, there are only 3
>> cases that still fail with 1 train run that pass with 100. Need to
>> look at those.
>>
>> >
>> > Will you look into the logic of do_jump or shall I try to dive in?
>>
>> I can take a look, but probably won't have a chance until late this
>> week. If you don't get to it before then I will see if I can figure
>> out why it is applying the branch probabilities this way.
>>
>> Teresa
>>
>> >
>> > Honza
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Index: predict.c
===================================================================
--- predict.c	(revision 202133)
+++ predict.c	(working copy)
@@ -232,8 +232,22 @@ bool
 probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
 {
   gcc_checking_assert (fun);
-  if (profile_info && flag_branch_probabilities)
-    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if (profile_status_for_function (fun) == PROFILE_READ)
+    {
+      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+	return false;
+      if (!bb->frequency)
+	return true;
+      if (!ENTRY_BLOCK_PTR->frequency)
+	return false;
+      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
+	{
+	  return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
+			ENTRY_BLOCK_PTR->frequency)
+		  < REG_BR_PROB_BASE / 4);
+	}
+      return true;
+    }
   if ((!profile_info || !flag_branch_probabilities)
       && (cgraph_get_node (fun->decl)->frequency
 	  == NODE_FREQUENCY_UNLIKELY_EXECUTED))