Message ID: 20120816142137.7F5EA61414@tjsboxrox.mtv.corp.google.com
State: New
> +	{
> +	  cs_prg->num = cs_tprg->num;
> +	  /* Allocate the working set array for the merged summary.  */
> +	  if (ws_cnt)
> +	    {
> +	      cs_prg->working_set_count = ws_cnt;
> +	      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> +		  ws_cnt * sizeof (gcov_ws_info_t));
> +	    }
> +	}
> +      else if (cs_prg->num != cs_tprg->num
> +	       || ws_cnt != cs_prg->working_set_count)
> +	goto read_mismatch;
> +      /* Copy over this run's working set information if either this is
> +	 the first run, the total size of the profile (sum_all) is much
> +	 (50%) larger for this run (need to check this before updating
> +	 cs_prg->sum_all below), or this run has a larger working
> +	 set in the largest (99.99% of sum_all) bucket.  */
> +      if (ws_cnt
> +	  && (cs_prg->runs == 1
> +	      || (cs_tprg->sum_all
> +		  > (cs_prg->sum_all + cs_prg->sum_all / 2))
> +	      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> +		  > cs_prg->working_sets[ws_cnt - 1].num_counters)))
> +	memcpy (cs_prg->working_sets,
> +		cs_tprg->working_sets,
> +		ws_cnt * sizeof (gcov_ws_info_t));
>        cs_prg->sum_all += cs_tprg->sum_all;

Hmm, when you run e.g. a gcc bootstrap, where the program is run a couple hundred times, this will end up really inaccurate, since it will probably just store the histogram of the insn-attrtab compilation, right?

Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. do the accumulation loop in GCC instead of within libgcov)? By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...

Honza
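The copy-over rule in the hunk above can be isolated as a small predicate. The following is only a sketch with simplified stand-in types (gcov_ws_info_t reduced to two fields; function and parameter names hypothetical, not the actual libgcov code):

```c
#include <assert.h>

typedef long long gcov_type;

/* Simplified stand-in for a working set entry: the number of hot
   counters in this bucket and the smallest counter value among them.  */
typedef struct { unsigned num_counters; gcov_type min_counter; } ws_info;

/* Decide whether this run's working set should replace the merged one:
   it is the first run, this run's profile is much (50%) larger, or this
   run has a larger working set in the topmost bucket.  */
static int
should_copy_working_set (unsigned runs, gcov_type prg_sum_all,
                         gcov_type run_sum_all,
                         const ws_info *prg_ws, const ws_info *run_ws,
                         unsigned ws_cnt)
{
  return ws_cnt
         && (runs == 1
             || run_sum_all > prg_sum_all + prg_sum_all / 2
             || run_ws[ws_cnt - 1].num_counters
                > prg_ws[ws_cnt - 1].num_counters);
}
```

Note that, as Honza points out next, whichever run wins this test has its working set kept verbatim, so many short runs contribute nothing to the stored working set.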
On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > +	{
> > +	  cs_prg->num = cs_tprg->num;
> > +	  /* Allocate the working set array for the merged summary.  */
> > +	  if (ws_cnt)
> > +	    {
> > +	      cs_prg->working_set_count = ws_cnt;
> > +	      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> > +		  ws_cnt * sizeof (gcov_ws_info_t));
> > +	    }
> > +	}
> > +      else if (cs_prg->num != cs_tprg->num
> > +	       || ws_cnt != cs_prg->working_set_count)
> > +	goto read_mismatch;
> > +      /* Copy over this run's working set information if either this is
> > +	 the first run, the total size of the profile (sum_all) is much
> > +	 (50%) larger for this run (need to check this before updating
> > +	 cs_prg->sum_all below), or this run has a larger working
> > +	 set in the largest (99.99% of sum_all) bucket.  */
> > +      if (ws_cnt
> > +	  && (cs_prg->runs == 1
> > +	      || (cs_tprg->sum_all
> > +		  > (cs_prg->sum_all + cs_prg->sum_all / 2))
> > +	      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> > +		  > cs_prg->working_sets[ws_cnt - 1].num_counters)))
> > +	memcpy (cs_prg->working_sets,
> > +		cs_tprg->working_sets,
> > +		ws_cnt * sizeof (gcov_ws_info_t));
> >        cs_prg->sum_all += cs_tprg->sum_all;
>
> Hmm, when you run e.g. a gcc bootstrap, where the program is run a couple hundred times, this will end up really inaccurate, since it will probably just store the histogram of the insn-attrtab compilation, right?

Well, it should store the largest working set in BBs, or one that came from a much longer run. But I see the issue you are pointing out. The num_counters (the number of hot bbs) should be reasonable, as the total number of BBs is the same between all runs, and I want to show the largest or most dominant working set in terms of the number of hot bbs. But the min_bb_counter will become more and more inaccurate as the number of runs increases, since the counter values are accumulated.

I typed up a detailed email below on why getting this right would be difficult, but then I realized there might be a fairly simple accurate solution, which I'll describe first:

The only way I see to do this completely accurately is to take two passes through the existing gcda files that we are merging into: one to read in all the counter values and merge them into the counter values in the arrays from the current run (after which we can do the histogramming/working set computation accurately from the merged counters), and a second to rewrite them. In libgcov this doesn't seem like it would be too difficult to do, although it requires a little restructuring of the main merging loop and some special handling for buffered functions (which could increase the memory footprint a bit if there are many of these, since they will all need to be buffered across the iteration over all the gcda files).

The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.

> Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. do the accumulation loop in GCC instead of within libgcov)?

That doesn't completely solve the problem, unfortunately. The reason is that you don't know which histogram entry contains a particular block each run, and therefore you don't know how to compute the new combined histogram value and index for that bb counter. For example, a particular histogram index may have 5 counters (bbs) in it for one run and 2 counters (bbs) in it for a second run, so the question is how to compute the new entry for all of those bb counters, as the 5 bbs from the first run may or may not be a superset of the 2 from the second run.

You could assume that the bbs have the same relative order of hotness between runs, and combine the bb counters accordingly, but there is still some trickiness because you have to apportion the cumulative counter sum stored in the histogram entry between the new entries. For example, assume the highest indexed non-zero entries (the histogram buckets containing the largest counter values) in the two incoming histograms are:

histogram 1:

index 100: 4 counters, cumulative value X, min counter value minx
...

histogram 2:

index 100: 2 counters, cumulative value Y, min counter value miny
index 99: 3 counters, cumulative value W, min counter value minw
...

To merge, you could assume something like 2 counters with a new cumulative value (Y + X*2/4) and new min counter value minx+miny, which go into the merged histogram at the index corresponding to counter value minx+miny. And then 2 counters with a new cumulative value (W*2/3 + X*2/4) and new min counter value minx+minw, which go into the merged histogram at the index corresponding to counter value minw+minx. Etc... Not entirely accurate, although it might be a reasonable approximation, but it also requires a number of division operations during the merge in libgcov.

Another possibility, which might also provide a reasonable approximation, would be to scale the min_bb_counter values in the working sets by sum_all_merged/sum_all_orig, where sum_all_merged is the new sum_all and sum_all_orig is the sum_all from the summary whose working_set was chosen to be propagated to the new merged summary. This also requires some divisions at libgcov merge time, unless we save the original sum_all along with the working set in the summary and do the scaling at profile-use time.

> By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...

It is a bit more, as the histogram will have 64*4 = 256 entries for 64-bit counters, and each entry contains 2 gcov_type counter values and one unsigned int. The working set entries only have one gcov_type counter and one unsigned. So it could be close to 4x.

What do you think?

Thanks,
Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
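The bucket-apportioning merge Teresa sketches above can be written down concretely. The following is only an illustration of the approximation, under the stated assumption that counters keep the same relative hotness order across runs; all types and names are hypothetical, the sums are apportioned with integer division exactly as discussed, and leftover counters from the longer histogram are simply dropped:

```c
#include <assert.h>

typedef long long gcov_type;

/* One histogram bucket: how many counters fell into it, their summed
   value, and the smallest counter value among them.  */
typedef struct {
  unsigned num;
  gcov_type cum;
  gcov_type min;
} histo_bucket;

/* Merge two hottest-first bucket lists A (na buckets) and B (nb
   buckets).  Each step pairs up k counters, takes a proportional share
   of each source bucket's cumulative sum, and emits a merged bucket
   whose min counter is the sum of the two mins.  Returns the number of
   merged buckets written to OUT.  */
static unsigned
merge_histograms (const histo_bucket *a, unsigned na,
                  const histo_bucket *b, unsigned nb,
                  histo_bucket *out)
{
  unsigned i = 0, j = 0, n = 0;
  unsigned left_a = na ? a[0].num : 0, left_b = nb ? b[0].num : 0;
  while (i < na && j < nb)
    {
      unsigned k = left_a < left_b ? left_a : left_b;
      out[n].num = k;
      out[n].cum = a[i].cum * k / a[i].num + b[j].cum * k / b[j].num;
      out[n].min = a[i].min + b[j].min;
      n++;
      left_a -= k;
      left_b -= k;
      if (!left_a && ++i < na) left_a = a[i].num;
      if (!left_b && ++j < nb) left_b = b[j].num;
    }
  return n;
}
```

Running this on the worked example (X=400, minx=90; Y=300, miny=140; W=210, minw=60) reproduces the two merged buckets from the text: (2 counters, Y + X*2/4, minx+miny) and (2 counters, W*2/3 + X*2/4, minx+minw).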
On Sun, Aug 19, 2012 at 9:59 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> > +	{
>> > +	  cs_prg->num = cs_tprg->num;
>> > +	  /* Allocate the working set array for the merged summary.  */
>> > +	  if (ws_cnt)
>> > +	    {
>> > +	      cs_prg->working_set_count = ws_cnt;
>> > +	      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
>> > +		  ws_cnt * sizeof (gcov_ws_info_t));
>> > +	    }
>> > +	}
>> > +      else if (cs_prg->num != cs_tprg->num
>> > +	       || ws_cnt != cs_prg->working_set_count)
>> > +	goto read_mismatch;
>> > +      /* Copy over this run's working set information if either this is
>> > +	 the first run, the total size of the profile (sum_all) is much
>> > +	 (50%) larger for this run (need to check this before updating
>> > +	 cs_prg->sum_all below), or this run has a larger working
>> > +	 set in the largest (99.99% of sum_all) bucket.  */
>> > +      if (ws_cnt
>> > +	  && (cs_prg->runs == 1
>> > +	      || (cs_tprg->sum_all
>> > +		  > (cs_prg->sum_all + cs_prg->sum_all / 2))
>> > +	      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
>> > +		  > cs_prg->working_sets[ws_cnt - 1].num_counters)))
>> > +	memcpy (cs_prg->working_sets,
>> > +		cs_tprg->working_sets,
>> > +		ws_cnt * sizeof (gcov_ws_info_t));
>> >        cs_prg->sum_all += cs_tprg->sum_all;
>>
>> Hmm, when you run e.g. a gcc bootstrap, where the program is run a couple hundred times, this will end up really inaccurate, since it will probably just store the histogram of the insn-attrtab compilation, right?
>
> Well, it should store the largest working set in BBs, or one that came from a much longer run. But I see the issue you are pointing out. The num_counters (the number of hot bbs) should be reasonable, as the total number of BBs is the same between all runs, and I want to show the largest or most dominant working set in terms of the number of hot bbs. But the min_bb_counter will become more and more inaccurate as the number of runs increases, since the counter values are accumulated.
>
> I typed up a detailed email below on why getting this right would be difficult, but then I realized there might be a fairly simple accurate solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two passes through the existing gcda files that we are merging into: one to read in all the counter values and merge them into the counter values in the arrays from the current run (after which we can do the histogramming/working set computation accurately from the merged counters), and a second to rewrite them. In libgcov this doesn't seem like it would be too difficult to do, although it requires a little restructuring of the main merging loop and some special handling for buffered functions (which could increase the memory footprint a bit if there are many of these, since they will all need to be buffered across the iteration over all the gcda files).

The current way of doing summary merging produces imprecise data (e.g. sum_max). It computes the current execution's summary first, and then merges it with the previous run's summary data while the profile counters are merged. The right way is to merge the profile counters first, and then recompute the final summary based on the merged profile. The latter way allows merging of more advanced summary data such as the histogram here. It is also used in LIPO mode (for the dynamic callgraph build/merge -- kind of a summary itself). However, I think changing the way the summary is merged should probably be done in a different patch.

David

> The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.
>
>> Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. do the accumulation loop in GCC instead of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason is that you don't know which histogram entry contains a particular block each run, and therefore you don't know how to compute the new combined histogram value and index for that bb counter. For example, a particular histogram index may have 5 counters (bbs) in it for one run and 2 counters (bbs) in it for a second run, so the question is how to compute the new entry for all of those bb counters, as the 5 bbs from the first run may or may not be a superset of the 2 from the second run.
>
> You could assume that the bbs have the same relative order of hotness between runs, and combine the bb counters accordingly, but there is still some trickiness because you have to apportion the cumulative counter sum stored in the histogram entry between the new entries. For example, assume the highest indexed non-zero entries (the histogram buckets containing the largest counter values) in the two incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new cumulative value (Y + X*2/4) and new min counter value minx+miny, which go into the merged histogram at the index corresponding to counter value minx+miny. And then 2 counters with a new cumulative value (W*2/3 + X*2/4) and new min counter value minx+minw, which go into the merged histogram at the index corresponding to counter value minw+minx. Etc... Not entirely accurate, although it might be a reasonable approximation, but it also requires a number of division operations during the merge in libgcov.
>
> Another possibility, which might also provide a reasonable approximation, would be to scale the min_bb_counter values in the working sets by sum_all_merged/sum_all_orig, where sum_all_merged is the new sum_all and sum_all_orig is the sum_all from the summary whose working_set was chosen to be propagated to the new merged summary. This also requires some divisions at libgcov merge time, unless we save the original sum_all along with the working set in the summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for 64-bit counters, and each entry contains 2 gcov_type counter values and one unsigned int. The working set entries only have one gcov_type counter and one unsigned. So it could be close to 4x.
>
> What do you think?
>
> Thanks,
> Teresa
>
>> Honza
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
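The sum_all-ratio scaling alternative discussed above is simple enough to sketch. Again, the types and names are stand-ins rather than actual libgcov code, and the integer division loses precision exactly as the thread notes:

```c
#include <assert.h>

typedef long long gcov_type;

/* Simplified stand-in for a working set entry.  */
typedef struct { unsigned num_counters; gcov_type min_counter; } ws_info;

/* Scale the per-bucket minimum counter values of a propagated working
   set by the growth of sum_all.  SUM_ALL_ORIG is the sum_all of the run
   whose working set was kept; SUM_ALL_MERGED is the sum_all after all
   runs were merged.  This approximates how the kept run's counters have
   grown through merging, without rereading them.  */
static void
scale_working_set (ws_info *ws, unsigned ws_cnt,
                   gcov_type sum_all_merged, gcov_type sum_all_orig)
{
  unsigned i;
  for (i = 0; i < ws_cnt; i++)
    ws[i].min_counter = ws[i].min_counter * sum_all_merged / sum_all_orig;
}
```

As the thread suggests, the division could instead be deferred to profile-use time by streaming sum_all_orig alongside the working set.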
> Well, it should store the largest working set in BBs, or one that came from a much longer run. But I see the issue you are pointing out. The num_counters (the number of hot bbs) should be reasonable, as the total number of BBs is the same between all runs, and I want to show the largest or most dominant working set in terms of the number of hot bbs. But the min_bb_counter will become more and more inaccurate as the number of runs increases, since the counter values are accumulated.

Yes, and that is the one that we plan to use to determine hot/cold decisions on, right?

Note that there is no real 1-1 correspondence between BBs and counters. For each function there should be num_edges - num_bbs + 1 counters. What do you plan to use BB counts for?

> I typed up a detailed email below on why getting this right would be difficult, but then I realized there might be a fairly simple accurate solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two passes through the existing gcda files that we are merging into: one to read in all the counter values and merge them into the counter values in the arrays from the current run (after which we can do the histogramming/working set computation accurately from the merged counters), and a second to rewrite them. In libgcov this doesn't seem like it would be too difficult to do, although it requires a little restructuring of the main merging loop and some special handling for buffered functions (which could increase the memory footprint a bit if there are many of these, since they will all need to be buffered across the iteration over all the gcda files).
>
> The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.

This is something Nathan introduced years back. The idea was IMO to handle objects linked into multiple binaries more accurately. I am not sure whether the code really works, or worked at some point.

The idea, as I recall it, was to produce an overall checksum of all objects and have a different profile record for each combination.

This is not really useful for profile feedback, as you generate a single object file for all uses, but it might become useful for LTO, where you know which binary you are linking into. I am not really sure it is worth all the infrastructure needed to support it, though.

>> Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. do the accumulation loop in GCC instead of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason is that you don't know which histogram entry contains a particular block each run, and therefore you don't know how to compute the new combined histogram value and index for that bb counter. For example, a particular histogram index may have 5 counters (bbs) in it for one run and 2 counters (bbs) in it for a second run, so the question is how to compute the new entry for all of those bb counters, as the 5 bbs from the first run may or may not be a superset of the 2 from the second run.
>
> You could assume that the bbs have the same relative order of hotness between runs, and combine the bb counters accordingly, but there is still some trickiness because you have to apportion the cumulative counter sum stored in the histogram entry between the new entries. For example, assume the highest indexed non-zero entries (the histogram buckets containing the largest counter values) in the two incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new cumulative value (Y + X*2/4) and new min counter value minx+miny, which go into the merged histogram at the index corresponding to counter value minx+miny. And then 2 counters with a new cumulative value (W*2/3 + X*2/4) and new min counter value minx+minw, which go into the merged histogram at the index corresponding to counter value minw+minx. Etc... Not entirely accurate, although it might be a reasonable approximation, but it also requires a number of division operations during the merge in libgcov.
>
> Another possibility, which might also provide a reasonable approximation, would be to scale the min_bb_counter values in the working sets by sum_all_merged/sum_all_orig, where sum_all_merged is the new sum_all and sum_all_orig is the sum_all from the summary whose working_set was chosen to be propagated to the new merged summary. This also requires some divisions at libgcov merge time, unless we save the original sum_all along with the working set in the summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for 64-bit counters, and each entry contains 2 gcov_type counter values and one unsigned int. The working set entries only have one gcov_type counter and one unsigned. So it could be close to 4x.
>
> What do you think?

So we have options:

1) ignore the problem that summaries become inaccurate with many train runs, as we do now.

2) write the histogram only; do not track BB counts, or approximate them by scaling, and perhaps retire max_counter from the summary in favour of a histogram estimate. We will pay by more data being written (I think the histogram may actually compress pretty well by skipping zero entries, don't know) and by getting only an estimated max_count from the histogram, which should still be good enough in practice (max_count is used only for the hot/cold decisions, and those will be histogram based anyway).

3) do two-stage processing: read the data into memory first, produce the summaries, and write them out next.

2) seems appealing to me because it is simple, but there are limitations in what it can handle. 3) solves the precision problems, but how do we handle the locking and races? In a case like bootstrap, where GCCs are executed in parallel, we will end up with random results if the files get modified by another profiled run of GCC between read-in and write-out.

So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does it work for LIPO?

Honza

> Thanks,
> Teresa
>
>> Honza
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
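Teresa's earlier "close to 4x" size estimate, which option 2)'s "more data being written" refers to, can be checked with quick arithmetic. This assumes an 8-byte gcov_type and a 4-byte unsigned; the actual gcov on-disk encoding is in words and may differ, so this is an in-memory ballpark only:

```c
#include <assert.h>

/* Rough size comparison between streaming the full histogram and the
   128-entry working set array, per the figures quoted in the thread:
   256 histogram entries of (2 gcov_type + 1 unsigned) versus
   128 working set entries of (1 gcov_type + 1 unsigned).  */
enum {
  GCOV_TYPE_SIZE = 8,                /* assumed 64-bit counter */
  UNSIGNED_SIZE = 4,                 /* assumed 32-bit unsigned */
  HISTOGRAM_ENTRIES = 64 * 4,        /* 256 buckets for 64-bit counters */
  HISTOGRAM_ENTRY_SIZE = 2 * GCOV_TYPE_SIZE + UNSIGNED_SIZE,
  WORKING_SET_ENTRIES = 128,
  WORKING_SET_ENTRY_SIZE = GCOV_TYPE_SIZE + UNSIGNED_SIZE,
  HISTOGRAM_BYTES = HISTOGRAM_ENTRIES * HISTOGRAM_ENTRY_SIZE,
  WORKING_SET_BYTES = WORKING_SET_ENTRIES * WORKING_SET_ENTRY_SIZE
};
```

This gives 5120 bytes versus 1536 bytes, a factor of about 3.3x - consistent with "close to 4x", and a gap that sparse (non-zero-entry-only) encoding would shrink further.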
On Mon, Aug 20, 2012 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Well, it should store the largest working set in BBs, or one that came from a much longer run. But I see the issue you are pointing out. The num_counters (the number of hot bbs) should be reasonable, as the total number of BBs is the same between all runs, and I want to show the largest or most dominant working set in terms of the number of hot bbs. But the min_bb_counter will become more and more inaccurate as the number of runs increases, since the counter values are accumulated.
>
> Yes, and that is the one that we plan to use to determine hot/cold decisions on, right?

Yes, so I agree it is important to do something to update this as profiles are merged.

> Note that there is no real 1-1 correspondence between BBs and counters. For each function there should be num_edges - num_bbs + 1 counters. What do you plan to use BB counts for?

True, although what I am using this for is to get an idea of the size of the working set to help identify large icache footprints, in order to control code-size-increasing optimizations such as unrolling. The idea is that a huge number of hot counters indicates a huge number of hot bbs, which in turn indicates a huge number of hot instructions and therefore high icache pressure. We're using it in the google branch to control unrolling successfully.

>> I typed up a detailed email below on why getting this right would be difficult, but then I realized there might be a fairly simple accurate solution, which I'll describe first:
>>
>> The only way I see to do this completely accurately is to take two passes through the existing gcda files that we are merging into: one to read in all the counter values and merge them into the counter values in the arrays from the current run (after which we can do the histogramming/working set computation accurately from the merged counters), and a second to rewrite them. In libgcov this doesn't seem like it would be too difficult to do, although it requires a little restructuring of the main merging loop and some special handling for buffered functions (which could increase the memory footprint a bit if there are many of these, since they will all need to be buffered across the iteration over all the gcda files).
>>
>> The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.
>
> This is something Nathan introduced years back. The idea was IMO to handle objects linked into multiple binaries more accurately. I am not sure whether the code really works, or worked at some point.
>
> The idea, as I recall it, was to produce an overall checksum of all objects and have a different profile record for each combination.
>
> This is not really useful for profile feedback, as you generate a single object file for all uses, but it might become useful for LTO, where you know which binary you are linking into. I am not really sure it is worth all the infrastructure needed to support it, though.

Ok, so perhaps for the merging in coverage.c we can do something less accurate, and worry more about the libgcov merging.

>>> Why don't you simply write the histogram into the gcov file and not merge the values here (i.e. do the accumulation loop in GCC instead of within libgcov)?
>>
>> That doesn't completely solve the problem, unfortunately. The reason is that you don't know which histogram entry contains a particular block each run, and therefore you don't know how to compute the new combined histogram value and index for that bb counter. For example, a particular histogram index may have 5 counters (bbs) in it for one run and 2 counters (bbs) in it for a second run, so the question is how to compute the new entry for all of those bb counters, as the 5 bbs from the first run may or may not be a superset of the 2 from the second run.
>>
>> You could assume that the bbs have the same relative order of hotness between runs, and combine the bb counters accordingly, but there is still some trickiness because you have to apportion the cumulative counter sum stored in the histogram entry between the new entries. For example, assume the highest indexed non-zero entries (the histogram buckets containing the largest counter values) in the two incoming histograms are:
>>
>> histogram 1:
>>
>> index 100: 4 counters, cumulative value X, min counter value minx
>> ...
>>
>> histogram 2:
>>
>> index 100: 2 counters, cumulative value Y, min counter value miny
>> index 99: 3 counters, cumulative value W, min counter value minw
>> ...
>>
>> To merge, you could assume something like 2 counters with a new cumulative value (Y + X*2/4) and new min counter value minx+miny, which go into the merged histogram at the index corresponding to counter value minx+miny. And then 2 counters with a new cumulative value (W*2/3 + X*2/4) and new min counter value minx+minw, which go into the merged histogram at the index corresponding to counter value minw+minx. Etc... Not entirely accurate, although it might be a reasonable approximation, but it also requires a number of division operations during the merge in libgcov.
>>
>> Another possibility, which might also provide a reasonable approximation, would be to scale the min_bb_counter values in the working sets by sum_all_merged/sum_all_orig, where sum_all_merged is the new sum_all and sum_all_orig is the sum_all from the summary whose working_set was chosen to be propagated to the new merged summary. This also requires some divisions at libgcov merge time, unless we save the original sum_all along with the working set in the summary and do the scaling at profile-use time.
>>
>>> By default you are streaming 128 values, which is the same as needed to stream the histogram. I suppose we can have an environment variable to reduce the histogram size - I guess in smaller setups a smaller histogram will run just fine...
>>
>> It is a bit more, as the histogram will have 64*4 = 256 entries for 64-bit counters, and each entry contains 2 gcov_type counter values and one unsigned int. The working set entries only have one gcov_type counter and one unsigned. So it could be close to 4x.
>>
>> What do you think?
>
> So we have options:
>
> 1) ignore the problem that summaries become inaccurate with many train runs, as we do now.

Or 1b) do some sum_all based scaling of the min_bb_counter values, as I describe a couple of paragraphs above.

> 2) write the histogram only; do not track BB counts, or approximate them by scaling, and perhaps retire max_counter from the summary in favour of a histogram estimate.

By max_counter do you mean sum_max?

> We will pay by more data being written (I think the histogram may actually compress pretty well by skipping zero entries, don't know) and by getting only an estimated max_count from the histogram, which should still be good enough in practice (max_count is used only for the hot/cold decisions and those will be histogram based anyway).

Regarding the data being written, I think we can minimize this since the histogram should be pretty sparse, so store only the non-zero entries. The histogram indices can be recomputed on read from the min counter value stored in each entry.

> 3) do two-stage processing: read the data into memory first, produce the summaries, and write them out next.
>
> 2) seems appealing to me because it is simple, but there are limitations in what it can handle. 3) solves the precision problems, but how do we handle the locking and races? In a case like bootstrap, where GCCs are executed in parallel, we will end up with random results if the files get modified by another profiled run of GCC between read-in and write-out.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does it work for LIPO?

I think the locking issue is reduced for LIPO because each gcda file is read/merged/written in a single pass, as it is on trunk, and the lipo module info is written subsequently to a separate file. Although that could certainly get a little stale if there are many simultaneous profile runs merging the files.

To address this for the working set summary info, I think we would have to maintain the current per-gcda-file read/merge/write for the counters, but not write any working set summary info on the first pass, and then compute the working set summary and rewrite all the gcda files with it. The working set summary may get a little stale before it is written (just as in the lipo case for the module info), although at least the individual counters will not. Since the individual counters will get merged properly, there is a good chance that one of the final profile runs will compute an accurate working set as it finishes up.

If this approach seems feasible, then we could stick with the current approach of emitting the working set array in the summary, mitigating it somewhat by doing the sum_all based scaling of the counter values, and then in a follow-on patch restructure the merging code to delay the working set computation as described above. Otherwise, either go with the approach of scaling the counter values using the sum_all ratio, or encode/merge the histograms instead.

Teresa

> Honza
>> Thanks,
>> Teresa
>>
>>> Honza
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
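The delayed working-set computation Teresa proposes boils down to: once all counters are merged, sort them and walk from the hottest until a given fraction of sum_all is covered. A self-contained sketch of that computation for a single coverage threshold (hypothetical names; the real code would compute an array of buckets at fixed percentages, and would not need to re-sort per threshold):

```c
#include <assert.h>
#include <stdlib.h>

typedef long long gcov_type;

/* One working set entry: how many hot counters are needed to reach a
   coverage target, and the smallest counter value among them.  */
typedef struct { unsigned num_counters; gcov_type min_counter; } ws_info;

/* qsort comparator: descending counter values.  */
static int
cmp_desc (const void *p, const void *q)
{
  gcov_type a = *(const gcov_type *) p, b = *(const gcov_type *) q;
  return a < b ? 1 : a > b ? -1 : 0;
}

/* Compute one working set entry over the (already merged) COUNTERS:
   the number of hottest counters needed to cover PCT_TIMES_10 tenths
   of a percent of sum_all, and the min counter among them.
   Sorts COUNTERS in place.  */
static ws_info
compute_working_set (gcov_type *counters, unsigned n, unsigned pct_times_10)
{
  ws_info ws = { 0, 0 };
  gcov_type sum_all = 0, cum = 0, target;
  unsigned i;
  for (i = 0; i < n; i++)
    sum_all += counters[i];
  qsort (counters, n, sizeof (gcov_type), cmp_desc);
  target = sum_all * pct_times_10 / 1000;
  for (i = 0; i < n && cum < target; i++)
    {
      cum += counters[i];
      ws.num_counters = i + 1;
      ws.min_counter = counters[i];
    }
  return ws;
}
```

Because this runs after the counters are merged, the num_counters/min_counter pair is exact for the merged profile, which is precisely what the single-pass merge in the posted patch cannot guarantee.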
On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The summary merging in coverage.c confuses me a bit, as it seems to be handling the case where there are multiple program summaries in a single gcda file. When would this happen? It looks like the merge handling in libgcov should always produce a single program summary per gcda file.

AFAIU it merges existing profile data with new profile data from a second (or third, or ...) trial run.

Ciao!
Steven
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How well
> does it work for LIPO?
>

This (lack of locking, races) is not a new problem. There is no
synchronization in libgcov for profile update/merge at either the thread or
the process level. Thread-level data races lead to inconsistent counters, but
can be mostly corrected under -fprofile-correction and smoothing. There is
also a problem with false indirect call targets --- gcc has a mechanism to
filter those out.

Process-level synchronization problems can happen when two processes (running
the instrumented binary) exit at the same time. The updated/merged counters
from one process may be overwritten by another process -- this is true for
both counter data and summary data. Solution 3) does not introduce any new
problems.

thanks,

David

>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> If this approach seems like it is feasible, then we could stick with > the current approach of emitting the working set array in the summary, > mitigating it somewhat by doing the sum_all based scaling of the > counter values, then in a follow on patch restructure the merging code > to delay the working set computation as described above. +1 David > Teresa > >> >> Honza >>> >>> Thanks, >>> Teresa >>> >>> > >>> > Honza >>> >>> >>> >>> >>> -- >>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 8:35 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> The summary merging in coverage.c confuses me a bit as it seems to be >>> handling the case when there are multiple program summaries in a >>> single gcda file. When would this happen? It looks like the merge >>> handling in libgcov should always produce a single program summary per >>> gcda file. > > AFAIU it merges existing profile data with new profile data from a > second (or third, or ...) trial run. It looks like libgcov will always merge program summaries between different runs though. As far as I can tell, it will always either rewrite the whole gcda file with a single merged program summary, or abort the write if the merge was not successful. However, the comment above gcov_exit does talk about keeping program summaries separate for object files contained in different programs, which is what Honza was describing as the motivation for the coverage.c merging. But I don't see where multiple program summaries ever get written to the same gcda file - if the checksums are different it seems to abort the write. But the code in coverage.c is dealing with a single gcda file containing multiple program summaries. Is there code somewhere that will cause this to happen? Thanks, Teresa > > Ciao! > Steven
On Mon, Aug 20, 2012 at 7:44 PM, Teresa Johnson <tejohnson@google.com> wrote: > But > the code in coverage.c is dealing with a single gcda file containing > multiple program summaries. Is there code somewhere that will cause > this to happen? Not that I know of, no. (There are enhancement requests for this, see e.g. PR47618). Ciao! Steven
Xinliang David Li <davidxl@google.com> writes: > > Process level synchronization problems can happen when two processes > (running the instrumented binary) exit at the same time. The > updated/merged counters from one process may be overwritten by another > process -- this is true for both counter data and summary data. > Solution 3) does not introduce any new problems. You could just use lockf() ? -Andi
I was mistaken here -- gcov_open actually uses locking via the fcntl
interface -- so we do need to find a way to solve the summary update
synchronization problem when it is separated out of the per-file update loop.

David

On Mon, Aug 20, 2012 at 12:03 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Xinliang David Li <davidxl@google.com> writes:
>>
>> Process level synchronization problems can happen when two processes
>> (running the instrumented binary) exit at the same time. The
>> updated/merged counters from one process may be overwritten by another
>> process -- this is true for both counter data and summary data.
>> Solution 3) does not introduce any new problems.
>
> You could just use lockf() ?
>
> -Andi
> Xinliang David Li <davidxl@google.com> writes:
> >
> > Process level synchronization problems can happen when two processes
> > (running the instrumented binary) exit at the same time. The
> > updated/merged counters from one process may be overwritten by another
> > process -- this is true for both counter data and summary data.
> > Solution 3) does not introduce any new problems.
>
> You could just use lockf() ?

The issue here is holding locks for all the files (there can be many) versus
the limits on the number of locks & the possibilities for deadlocking (mind
that updating may happen in different orders on the same files for different
programs built from the same objects).

For David: there is no thread safety code in mainline for the counters. A
long time ago Zdenek implemented poor man's TLS for counters (before TLS was
invented) http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was
voted down as too memory expensive per thread. We could optionally do atomic
updates like ICC, or a combination of both as discussed in the thread. So far
no one has implemented it, since the coverage fixups seem to work well enough
in practice for multithreaded programs, where reproducibility does not seem
to be _that_ important.

For a GCC profiled bootstrap, however, I would like to see the output binary
be reproducible. We really ought to make profile updates safe for multiple
processes. Trashing a whole process run is worse than a race in an increment.
There is a good chance that one of the runs is more important than the others
and it will get trashed.

I do not think we have serious update problems in the summaries at the
moment. We lock individual files as we update them. The summary is simple
enough to be safe. sum_all is summed, max_all is the maximum over the
individual runs. Even when you combine multiple programs the summary will end
up the same. Everything except for max_all is ignored anyway.

Solution 2 (i.e. histogram streaming) will also have the property that it is
safe WRT multiple programs, just like sum_all.
Honza > > -Andi
On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Xinliang David Li <davidxl@google.com> writes: >> > >> > Process level synchronization problems can happen when two processes >> > (running the instrumented binary) exit at the same time. The >> > updated/merged counters from one process may be overwritten by another >> > process -- this is true for both counter data and summary data. >> > Solution 3) does not introduce any new problems. >> >> You could just use lockf() ? > > The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) > > For David: there is no thread safety code in mainline for the counters. > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > as too memory expensive per thread. We could optionaly do atomic updates like ICC > or combination of both as discussed in the thread. > So far no one implemented it since the coverage fixups seems to work well enough in > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > important. > > For GCC profiled bootstrap however I would like to see the output binary to be > reproducible. We realy ought to update profiles safe for multple processes. > Trashing whole process run is worse than doing race in increment. There is good > chance that one of runs is more important than others and it will get trashed. > > I do not think we do have serious update problems in the summaries at the moment. > We lock individual files as we update them. The summary is simple enough to be safe. > sum_all is summed, max_all is maximum over the individual runs. Even when you combine > mutiple programs the summary will end up same. 
Everything except for max_all is ignored > anyway. > > Solution 2 (i.e. histogram streaming) will also have the property that it is safe > WRT multiple programs, just like sum_all. I think the sum_all based scaling of the working set entries also has this property. What is your opinion on saving the histogram in the summary and merging histograms together as best as possible compared to the alternative of saving the working set information as now and scaling it up by the ratio between the new and old sum_all when merging? Thanks, Teresa > > Honza >> >> -Andi
> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> Xinliang David Li <davidxl@google.com> writes: > >> > > >> > Process level synchronization problems can happen when two processes > >> > (running the instrumented binary) exit at the same time. The > >> > updated/merged counters from one process may be overwritten by another > >> > process -- this is true for both counter data and summary data. > >> > Solution 3) does not introduce any new problems. > >> > >> You could just use lockf() ? > > > > The issue here is holding lock for all the files (that can be many) versus > > number of locks limits & possibilities for deadlocking (mind that updating > > may happen in different orders on the same files for different programs built > > from same objects) > > > > For David: there is no thread safety code in mainline for the counters. > > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > > as too memory expensive per thread. We could optionaly do atomic updates like ICC > > or combination of both as discussed in the thread. > > So far no one implemented it since the coverage fixups seems to work well enough in > > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > > important. > > > > For GCC profiled bootstrap however I would like to see the output binary to be > > reproducible. We realy ought to update profiles safe for multple processes. > > Trashing whole process run is worse than doing race in increment. There is good > > chance that one of runs is more important than others and it will get trashed. > > > > I do not think we do have serious update problems in the summaries at the moment. > > We lock individual files as we update them. The summary is simple enough to be safe. > > sum_all is summed, max_all is maximum over the individual runs. 
> > Even when you combine multiple programs the summary will end up the same.
> > Everything except for max_all is ignored anyway.
> >
> > Solution 2 (i.e. histogram streaming) will also have the property that
> > it is safe WRT multiple programs, just like sum_all.
>
> I think the sum_all based scaling of the working set entries also has
> this property. What is your opinion on saving the histogram in the

I think the scaling will have at least roundoff issues WRT different merging
orders.

> summary and merging histograms together as best as possible compared
> to the alternative of saving the working set information as now and
> scaling it up by the ratio between the new and old sum_all when
> merging?

So far I like this option best. But David seems to lean towards the third
option with whole-file locking. I see it may prove to be more extensible in
the future. At the moment I do not understand two things:

1) Why do we need info on the number of counters above a given threshold,
since the hot/cold decisions usually depend purely on the count cutoff?
Removing those would solve the merging issues with variant 2, and then it
would probably be a good solution.

2) Do we plan to add some features in the near future that will require
global locking anyway? I guess LIPO itself does not count, since it streams
its data into an independent file as you mentioned earlier, and locking the
LIPO file is not that hard. Does LIPO stream everything into that common
file, or does it use a combination of gcda files and a common summary?

What other stuff does Google plan to merge? (In general I would be curious
about merging plans WRT profile stuff, so we get more synchronized and
effective on getting patches in. We have about two months to get it done in
stage1, and it would be nice to get as much as possible. Obviously some of
the patches will need a bit of discussion, like this one. Hope you do not
find it frustrating; I actually think this is an important feature.)
I also realized today that the common value counters (used by switch,
indirect call and div/mod value profiling) are non-stable WRT different
merging orders (i.e. a parallel make in the train run). I do not think there
is an actual solution to that, except for not merging counter sections of
this type in libgcov and merging them in some canonical order at profile
feedback time. Perhaps we just want to live with this, since the discrepancy
here is small (i.e. these counters are quite rare and their outcome has just
a local effect on the final binary, unlike the global summaries/edge
counters).

Honza

>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> -Andi
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 10:29 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> Xinliang David Li <davidxl@google.com> writes: >> >> > >> >> > Process level synchronization problems can happen when two processes >> >> > (running the instrumented binary) exit at the same time. The >> >> > updated/merged counters from one process may be overwritten by another >> >> > process -- this is true for both counter data and summary data. >> >> > Solution 3) does not introduce any new problems. >> >> >> >> You could just use lockf() ? >> > >> > The issue here is holding lock for all the files (that can be many) versus >> > number of locks limits & possibilities for deadlocking (mind that updating >> > may happen in different orders on the same files for different programs built >> > from same objects) >> > >> > For David: there is no thread safety code in mainline for the counters. >> > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) >> > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down >> > as too memory expensive per thread. We could optionaly do atomic updates like ICC >> > or combination of both as discussed in the thread. >> > So far no one implemented it since the coverage fixups seems to work well enough in >> > pracitce for multithreaded programs where reproducibility do not seem to be _that_ >> > important. >> > >> > For GCC profiled bootstrap however I would like to see the output binary to be >> > reproducible. We realy ought to update profiles safe for multple processes. >> > Trashing whole process run is worse than doing race in increment. There is good >> > chance that one of runs is more important than others and it will get trashed. >> > >> > I do not think we do have serious update problems in the summaries at the moment. >> > We lock individual files as we update them. The summary is simple enough to be safe. 
>> > sum_all is summed, max_all is the maximum over the individual runs.
>> > Even when you combine multiple programs the summary will end up the
>> > same. Everything except for max_all is ignored anyway.
>> >
>> > Solution 2 (i.e. histogram streaming) will also have the property that
>> > it is safe WRT multiple programs, just like sum_all.
>>
>> I think the sum_all based scaling of the working set entries also has
>> this property. What is your opinion on saving the histogram in the
>
> I think the scaling will have at least roundoff issues WRT different
> merging orders.
>
>> summary and merging histograms together as best as possible compared
>> to the alternative of saving the working set information as now and
>> scaling it up by the ratio between the new and old sum_all when
>> merging?
>
> So far I like this option best. But David seems to lean towards the third
> option with whole-file locking. I see it may prove to be more extensible
> in the future. At the moment I do not understand two things:
> 1) Why do we need info on the number of counters above a given threshold,
> since the hot/cold decisions usually depend purely on the count cutoff?
> Removing those would solve the merging issues with variant 2, and then it
> would probably be a good solution.

This is useful for large applications with a long tail. The instruction
working set for those applications is very large, and the inliner and
unroller need to be aware of that so that good heuristics can be developed to
throttle aggressive code-bloating transformations. For the inliner, it is
kind of like the global budget, but more application dependent. In the long
run, we will collect more advanced FDO summaries regarding the working set --
it will be the working set size for each code region (locality region).

>
> 2) Do we plan to add some features in the near future that will require
> global locking anyway?
> I guess LIPO itself does not count, since it streams its data into an
> independent file as you mentioned earlier, and locking the LIPO file is
> not that hard. Does LIPO stream everything into that common file, or does
> it use a combination of gcda files and a common summary?

Actually, LIPO module grouping information is stored in the gcda files. It is
also stored in a separate .imports file (one per object) --- this is
primarily used by our build system for dependence information.

>
> What other stuff does Google plan to merge?
> (In general I would be curious about merging plans WRT profile stuff, so
> we get more synchronized and effective on getting patches in. We have
> about two months to get it done in stage1, and it would be nice to get as
> much as possible. Obviously some of the patches will need a bit of
> discussion, like this one. Hope you do not find it frustrating; I actually
> think this is an important feature.)

We plan to merge in the new LIPO implementation based on LTO streaming. Rong
Xu finished this in the 4.6-based compiler, and he needs to port it to 4.8.

thanks,

David

>
> I also realized today that the common value counters (used by switch,
> indirect call and div/mod value profiling) are non-stable WRT different
> merging orders (i.e. a parallel make in the train run). I do not think
> there is an actual solution to that, except for not merging counter
> sections of this type in libgcov and merging them in some canonical order
> at profile feedback time.
>
> This is useful for large applications with a long tail. The instruction
> working set for those applications is very large, and the inliner and
> unroller need to be aware of that so that good heuristics can be developed
> to throttle aggressive code-bloating transformations. For the inliner, it
> is kind of like the global budget, but more application dependent. In the
> long run, we will collect more advanced FDO summaries regarding the
> working set -- it will be the working set size for each code region
> (locality region).

I see, so you use it to estimate the size of the working set and the effect
of bloating optimizations on cache size. This sounds interesting. What are
your experiences with this?

What concerns me is that it is greatly inaccurate - you have no idea how many
instructions a given counter is guarding, and it can differ quite a lot.
Also, inlining/optimization makes working sets significantly different (by a
factor of 100 for tramp3d). But on the other hand, any solution at this level
will be greatly inaccurate. So I am curious how reliable the data you can get
from this is, and how you take it into account in the heuristics.

It seems to me that for this use, perhaps the simple logic in histogram
merging of maximizing the number of BBs for a given bucket will work well? It
is inaccurate, but we are working with greatly inaccurate data anyway. Except
for degenerate cases, the small and unimportant runs will have small BB
counts, while large runs will have larger counts, and those are the ones we
optimize for anyway.

>
> > 2) Do we plan to add some features in the near future that will require
> > global locking anyway?
> > I guess LIPO itself does not count, since it streams its data into an
> > independent file as you mentioned earlier, and locking the LIPO file is
> > not that hard.
> > Does LIPO stream everything into that common file, or does it use a
> > combination of gcda files and a common summary?
>
> Actually, LIPO module grouping information is stored in gcda files.
> It is also stored in a separate .imports file (one per object) ---
> this is primarily used by our build system for dependence information.

I see; getting LIPO safe WRT parallel updates will be fun. How does LIPO
behave on GCC bootstrap? (I.e. it does a lot more work in the libgcov module
per invocation, so I am curious if it is practically useful at all.)

With an LTO based solution a lot can probably be pushed to link time? Before
the actual GCC starts from the linker plugin, the LIPO module can read the
gcov CFGs from the gcda files and do all the merging/updating/CFG
construction that is currently performed at runtime, right?

>
> > What other stuff does Google plan to merge?
> > (In general I would be curious about merging plans WRT profile stuff, so
> > we get more synchronized and effective on getting patches in. We have
> > about two months to get it done in stage1, and it would be nice to get
> > as much as possible. Obviously some of the patches will need a bit of
> > discussion, like this one. Hope you do not find it frustrating; I
> > actually think this is an important feature.)
>
> We plan to merge in the new LIPO implementation based on LTO streaming.
> Rong Xu finished this in the 4.6-based compiler, and he needs to port it
> to 4.8.

Good. Looks like a lot of work ahead. It would be nice if we could perhaps
start by merging the libgcov infrastructure updates prior to the LIPO
changes. From what I saw on the LIPO branch some time ago, it has a lot of
stuff that is not exactly LIPO specific.

Honza

>
> thanks,
>
> David
>
> > I also realized today that the common value counters (used by switch,
> > indirect call and div/mod value profiling) are non-stable WRT different
> > merging orders (i.e. a parallel make in the train run). I do not think
> > there is an actual solution to that, except for not merging counter
> > sections of this type in libgcov and merging them in some canonical
> > order at profile feedback time.
Perhaps we just > > want to live with this, since the disprepancy here is small. (i.e. these > > counters are quite rare and their outcome has just local effect on the final > > binary, unlike the global summaries/edge counters). > > > > Honza > >> > >> Thanks, > >> Teresa > >> > >> > > >> > Honza > >> >> > >> >> -Andi > >> > >> > >> > >> -- > >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 11:33 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> This is useful for large applications with a long tail. The >> instruction working set for those applications are very large, and >> inliner and unroller need to be aware of that and good heuristics can >> be developed to throttle aggressive code bloat transformations. For >> inliner, it is kind of the like the global budget but more application >> dependent. In the long run, we will collect more advanced fdo summary >> regarding working set -- it will be working set size for each code >> region (locality region). > > I see, so you use it to estimate size of the working set and effect of bloating > optimizations on cache size. This sounds interesting. What are you experiences > with this? Teresa has done some tunings for the unroller so far. The inliner tuning is the next step. > > What concerns me that it is greatly inaccurate - you have no idea how many > instructions given counter is guarding and it can differ quite a lot. Also > inlining/optimization makes working sets significantly different (by factor of > 100 for tramp3d). The pre ipa-inline working set is the one that is needed for ipa inliner tuning. For post-ipa inline code increase transformations, some update is probably needed. >But on the ohter hand any solution at this level will be > greatly inaccurate. So I am curious how reliable data you can get from this? > How you take this into account for the heuristics? This effort is just the first step to allow good heuristics to develop. > > It seems to me that for this use perhaps the simple logic in histogram merging > maximizing the number of BBs for given bucket will work well? It is > inaccurate, but we are working with greatly inaccurate data anyway. > Except for degenerated cases, the small and unimportant runs will have small BB > counts, while large runs will have larger counts and those are ones we optimize > for anyway. 
The working set curve for each type of application contains lots of
information that can be mined. The inaccuracy can also be mitigated by more
data 'calibration'.

>
> > 2) Do we plan to add some features in the near future that will require
> > global locking anyway?
> > I guess LIPO itself does not count, since it streams its data into an
> > independent file as you mentioned earlier, and locking the LIPO file is
> > not that hard.
> > Does LIPO stream everything into that common file, or does it use a
> > combination of gcda files and a common summary?
>
> Actually, LIPO module grouping information is stored in gcda files.
> It is also stored in a separate .imports file (one per object) ---
> this is primarily used by our build system for dependence information.

> I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO
> behave on GCC bootstrap?

We have not tried gcc bootstrap with LIPO. Gcc compile time is not the main
problem for application builds -- the link time (for a debug build) is.

> (i.e. it does a lot more work in the libgcov module per invocation, so I
> am curious if it is practically useful at all).
>
> With an LTO based solution a lot can probably be pushed to link time?
> Before the actual GCC starts from the linker plugin, the LIPO module can
> read the gcov CFGs from the gcda files and do all the merging/updating/CFG
> construction that is currently performed at runtime, right?

The dynamic cgraph build and analysis is still done at runtime. However, with
the new implementation, the FE is no longer involved. The gcc driver is
modified to understand module grouping, and LTO is used to merge the streamed
output from aux modules.

David

>
> > What other stuff does Google plan to merge?
> > (In general I would be curious about merging plans WRT profile stuff, so
> > we get more synchronized and effective on getting patches in. We have
> > about two months to get it done in stage1 and it would be nice to get as
> > much as possible.
Obviously some of the patches will >> > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think >> > this is an important feature). >> >> We plan to merge in the new LIPO implementation based on LTO >> streaming. Rong Xu finished this in 4.6 based compiler, and he needs >> to port it to 4.8. > > Good. Looks like a lot of work ahead. It would be nice if we can perhaps start > by merging the libgcov infrastructure updates prior the LIPO changes. From > what I saw at LIPO branch some time ago it has a lot of stuff that is not > exactly LIPO specific. > > Honza >> >> >> thanks, >> >> David >> >> > >> > I also realized today that the common value counters (used by switch, indirect >> > call and div/mod value profiling) are non-stanble WRT different merging orders >> > (i.e. parallel make in train run). I do not think there is actual solution to >> > that except for not merging the counter section of this type in libgcov and >> > merge them in some canonical order at profile feedback time. Perhaps we just >> > want to live with this, since the disprepancy here is small. (i.e. these >> > counters are quite rare and their outcome has just local effect on the final >> > binary, unlike the global summaries/edge counters). >> > >> > Honza >> >> >> >> Thanks, >> >> Teresa >> >> >> >> > >> >> > Honza >> >> >> >> >> >> -Andi >> >> >> >> >> >> >> >> -- >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> Teresa has done some tunings for the unroller so far. The inliner
> tuning is the next step.
>
> >
> > What concerns me is that it is greatly inaccurate - you have no idea how
> > many instructions a given counter is guarding, and it can differ quite a
> > lot. Also inlining/optimization makes working sets significantly
> > different (by a factor of 100 for tramp3d).
>
> The pre ipa-inline working set is the one that is needed for ipa inliner
> tuning. For post-ipa-inline code increase transformations, some update is
> probably needed.
>
> > But on the other hand any solution at this level will be greatly
> > inaccurate. So I am curious how reliable the data you can get from this
> > is, and how you take it into account in the heuristics.
>
> This effort is just the first step to allow good heuristics to develop.
>
> >
> > It seems to me that for this use perhaps the simple logic in histogram
> > merging of maximizing the number of BBs for a given bucket will work
> > well? It is inaccurate, but we are working with greatly inaccurate data
> > anyway. Except for degenerate cases, the small and unimportant runs will
> > have small BB counts, while large runs will have larger counts, and
> > those are the ones we optimize for anyway.
>
> The working set curve for each type of application contains lots of
> information that can be mined. The inaccuracy can also be mitigated by
> more data 'calibration'.

Sure, I think I am leaning towards trying solution 2) with
maximizing-counter-count merging (it would probably make sense to rename it
from "BB count", since it is not really a BB count and is thus misleading),
and we will see how well it works in practice.

We get the benefit of far fewer issues with profile locking/unlocking, and we
lose a bit of precision on BB counts. I tend to believe that the error will
not be that important in practice. Another loss is more histogram streaming
into each gcda file, but with skipping of zero entries it should not be a
major overhead problem, I hope.
What do you think?

> >> > 2) Do we plan to add some features in the near future that will
> >> > require global locking anyway?
> >> > I guess LIPO itself does not count, since it streams its data into an
> >> > independent file as you mentioned earlier, and locking the LIPO file
> >> > is not that hard.
> >> > Does LIPO stream everything into that common file, or does it use a
> >> > combination of gcda files and a common summary?
> >>
> >> Actually, LIPO module grouping information is stored in gcda files.
> >> It is also stored in a separate .imports file (one per object) ---
> >> this is primarily used by our build system for dependence information.
> >
> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO
> > behave on GCC bootstrap?
>
> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
> main problem for application builds -- the link time (for a debug build)
> is.

I was primarily curious how LIPO's runtime analysis fares in a situation
where you do very many small train runs on a rather large app (sure, GCC is
small compared to Google's use cases ;).

> > (i.e. it does a lot more work in the libgcov module per invocation, so I
> > am curious if it is practically useful at all).
> >
> > With an LTO based solution a lot can probably be pushed to link time?
> > Before the actual GCC starts from the linker plugin, the LIPO module can
> > read the gcov CFGs from the gcda files and do all the
> > merging/updating/CFG construction that is currently performed at
> > runtime, right?
>
> The dynamic cgraph build and analysis is still done at runtime. However,
> with the new implementation, the FE is no longer involved. The gcc driver
> is modified to understand module grouping, and LTO is used to merge the
> streamed output from aux modules.

I see. Are there any fundamental reasons why it cannot be done at link time,
when all the gcda files are available? Why is the grouping not done inside
the linker plugin?

Honza

>
> David
> The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) lockf typically has a deadlock detector, and will error out. -Andi
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Teresa has done some tunings for the unroller so far. The inliner >> tuning is the next step. >> >> > >> > What concerns me that it is greatly inaccurate - you have no idea how many >> > instructions given counter is guarding and it can differ quite a lot. Also >> > inlining/optimization makes working sets significantly different (by factor of >> > 100 for tramp3d). >> >> The pre ipa-inline working set is the one that is needed for ipa >> inliner tuning. For post-ipa inline code increase transformations, >> some update is probably needed. >> >> >But on the ohter hand any solution at this level will be >> > greatly inaccurate. So I am curious how reliable data you can get from this? >> > How you take this into account for the heuristics? >> >> This effort is just the first step to allow good heuristics to develop. >> >> > >> > It seems to me that for this use perhaps the simple logic in histogram merging >> > maximizing the number of BBs for given bucket will work well? It is >> > inaccurate, but we are working with greatly inaccurate data anyway. >> > Except for degenerated cases, the small and unimportant runs will have small BB >> > counts, while large runs will have larger counts and those are ones we optimize >> > for anyway. >> >> The working set curve for each type of applications contains lots of >> information that can be mined. The inaccuracy can also be mitigated by >> more data 'calibration'. > > Sure, I think I am leaning towards trying the solution 2) with maximizing > counter count merging (probably it would make sense to rename it from BB count > since it is not really BB count and thus it is misleading) and we will see how > well it works in practice. > > We have benefits of much fewer issues with profile locking/unlocking and we > lose bit of precision on BB counts. I tend to believe that the error will not > be that important in practice. 
Another loss is more histogram streaming into > each gcda file, but with skiping zero entries it should not be major overhead > problem I hope. > > What do you think? >> >> >> >> >> >> >> > 2) Do we plan to add some features in near future that will anyway require global locking? >> >> > I guess LIPO itself does not count since it streams its data into independent file as you >> >> > mentioned earlier and locking LIPO file is not that hard. >> >> > Does LIPO stream everything into that common file, or does it use combination of gcda files >> >> > and common summary? >> >> >> >> Actually, LIPO module grouping information are stored in gcda files. >> >> It is also stored in a separate .imports file (one per object) --- >> >> this is primarily used by our build system for dependence information. >> > >> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave >> > on GCC bootstrap? >> >> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the >> main problem for application build -- the link time (for debug build) >> is. > > I was primarily curious how the LIPOs runtime analysis fare in the situation where > you do very many small train runs on rather large app (sure GCC is small to google's > use case ;). There will be a race, but as Teresa mentioned, there is a good chance that the process which finishes the merge last is also the final overrider of the LIPO summary data. >> >> > (i.e. it does a lot more work in the libgcov module per each >> > invocation, so I am curious if it is practically useful at all). >> > >> > With LTO based solution a lot can be probably pushed at link time? >> Before >> > actual GCC starts from the linker plugin, LIPO module can read gcov >> CFGs from >> > gcda files and do all the merging/updating/CFG constructions that is >> currently >> > performed at runtime, right? >> >> The dynamic cgraph build and analysis is still done at runtime. >> However, with the new implementation, FE is no longer involved. 
Gcc >> driver is modified to understand module grouping, and lto is used to >> merge the streamed output from aux modules. > > I see. Are there any fundamental reasons why it can not be done at link-time > when all gcda files are available? For build parallelism, the decision should be made as early as possible -- that is what makes LIPO 'light'. > Why the grouping is not done inside linker > plugin? It is not delayed to link time. In fact, the linker plugin is not even involved. David > > Honza >> >> >> David
> > I can go ahead with the histogram approach. There is some roundoff > > error from the working set scaling approach that can affect different > > merging orders as you note, although I think this only really affects the > > small counter values. The other place where saving/merging the histogram Do you have any intuition on why simple maximalization merging (that is safe wrt ordering) would be a bad idea? We care only about the working set size around the top of the histogram, and I would say that we should sort of optimize for the largest (in the number of blocks in the hot area) of the train runs. One way where things will get messed up is when the working set is about the same but the runs are of different size, so all the blocks get accounted into two different buckets. But in general I do not think there is a reasonably accurate way to merge the profiles without actually streaming out all counter IDs in every bucket, so perhaps this will work well enough. If not, we can in future introduce a per-program global summary file that will contain the counters to be merged accurately. > > would help is when the distribution of counter values in the histogram > > varies between runs, say for example, the hottest loop is much hotter in a > > subsequent run, but the rest of the counter values stay largely consistent. > > Note, however, that if the hotspots are different in different runs, then > > merging either the histogram or the working set will have issues. The only > > way to detect this is to recompute the histogram/working set from scratch > > from the merged counters. > > > > I wonder in practice, even when there are a lot of simultaneous runs going > > like in a gcc bootstrap, if we could get reasonably accurate summary > > recomputation without global locking. The reason is that as long as the > > actual counter updates are safe as they are now due to the individual file > > locking, the inaccuracy in the recomputed summary information will not grow > > over time, and there is a reasonable chance that the last run to complete > > and merge will recompute the summary from the final merged counter values > > and get it right (or only be off by a little bit if there are a couple of > > runs completing simultaneously at the end). But this can be investigated as > > a follow on to this patch. > > > > > The main concern is probably the build reproducibility in gcc bootstrap > with FDO. Hmm, you mean in the first pass update every file with new counters and in the second pass just update the summaries? OK, so I guess we went through 1) two pass updating with a race in between passes. 2) two pass updating with the first pass updating counters and the second having a race only for the summary update. (i.e. no races for counters) 3) two pass updating with flocking (and some way to handle detected deadlocks) 4) one pass updating with histogram merging + maximalization of working set. (we do not really need to scale the buckets, we can simply merge the histograms and then multiply them by nruns before comparing to actual counters.) This assumes that working sets of all runs are about the same, but should work reasonably in practice I think. I guess 3/4 are acceptable WRT bootstrap reproducibility. I have no experience with flocking a large number of files and portability of this solution i.e. to Windows. If you think that 2) would be too inaccurate in practice and 3) has a chance to be portable, we could go for this. It will solve the precision problems and will also work for LIPO summaries. I would be curious about the effect on profiledbootstrap time of this if you implement it. 
Honza
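Option 4 above — take the maximum of the per-bucket counter counts and sum the cumulative values — is simple enough to sketch directly (illustrative types only, not the real libgcov histogram structures):

```c
#include <stddef.h>

/* Hypothetical bucket record for the sketch below.  */
typedef struct
{
  unsigned num_counters;         /* counters landing in this bucket */
  unsigned long long cum_value;  /* sum of their values */
} merge_bucket;

/* Merge one run's histogram into the stored one: per bucket, keep the
   larger counter population and accumulate the cumulative value.
   This is order-independent, which is what makes it safe under
   parallel merging, at the cost of over-counting when counters moved
   between buckets across runs.  */
static void
merge_histogram_max (merge_bucket *stored, const merge_bucket *run, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      if (run[i].num_counters > stored[i].num_counters)
        stored[i].num_counters = run[i].num_counters;
      stored[i].cum_value += run[i].cum_value;
    }
}
```

Because max and addition are both commutative and associative, the result is the same for any interleaving of runs, unlike the scaling approach discussed later in the thread.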
On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> > I can go ahead with the histogram approach. There is some roundoff >> > error from the working set scaling approach that can affect different >> > merging orders as you note, although I think this only really affects the >> > small counter values. The other place where saving/merging the histogram > > Do you have any intuition on why simple maximalization merging (that is safe wrt > ordering) would be bad idea? When you say "maximalization merging" are you talking about the histogram merging approach I mentioned a few emails back (my response on Aug 19) where we assume the same relative order of hotness in the counters between runs, and accumulate the counter values in the histogram in that order? This would be inaccurate if different runs exercised different areas of the code, and thus the counters would be ordered in the histogram differently. > > We care only about working set size around top of the histogram and I would say For optimizations that care about the boundary between hot and cold such as code layout I think we will also care about the smaller values in the histogram (to have a good idea of what constitutes a cold block counter value). > that we should sort of optimize for the largest (in the number of blocks in hot > area) of the train runs. One way where things will get messed up is when the > working set is about the same but the runs are of different size so all the > blocks gets accounted into two different buckets. I'm not sure I understand the last sentence - is this something that would not get handled by merging the histogram entries as I described earlier? Or it sounds like you might have a different merging approach in mind? > > But in general I do not think there is resonably accurate way to merge the > profiles without actually streaming out all counter IDs in every bucket, so perhaps > this will work well enough. 
If not, we can in future introduce per-program global > summary file that will contain the counters to be merged acurately. Sounds good. > >> > would help is when the distribution of counter values in the histogram >> > varies between runs, say for example, the hottest loop is much hotter in a >> > subsequent run, but the rest of the counter values stay largely consistent. >> > Note, however, that if the hotspots are different in different runs, then >> > merging either the histogram or the working set will have issues. The only >> > way to detect this is to recompute the histogram/working set from scratch >> > from the merged counters. >> > >> > I wonder in practice, even when there are a lot of simultaneous runs going >> > like in a gcc bootstrap, if we could get reasonably accurate summary >> > recomputation without global locking. The reason is that as long as the >> > actual counter updates are safe as they are now due to the individual file >> > locking, the inaccuracy in the recomputed summary information will not grow >> > over time, and there is a reasonable chance that the last run to complete >> > and merge will recompute the summary from the final merged counter values >> > and get it right (or only be off by a little bit if there are a couple of >> > runs completing simultaneously at the end). But this can be investigated as >> > a follow on to this patch. >> > >> >> >> The main concern is probably the build reproducibility in gcc bootstrap >> with FDO. > > Hmm, you mean in the first pass update every file with new counters and in the second > pass just update the summaries? Right, that's what I had in mind (what you have described in #2 below). > OK, so I guess we went through > 1) two pass updating with race in between pases. > 2) two pass updating with first pass updating countes and second having race only for summary update. > (i.e. 
no races for counters) > 3) two pass updating with flocking (and some way to handle detected deadlocks) > 4) one pass updating with histogram merging + maximalization of working set. > (we do not realy need to scale the buckets, we can simply merge the histograms > and then mutiply them by nruns before comparing to actual counters. By merging the histograms (and accumulating the counter values stored there as we merge), I don't think we need to multiply the counter values by nruns, do we? > This assumes that working sets of all runs are about the same, but should work > resonably in practice I think. > > I guess 3/4 are acceptable WRT bootstrap reproducibility. > > I have no experience with flocking large number of files and portability of > this solution i.e. to Windows. If you think that 2) would be too inaccurate > in practice and 3) has chance to be portable, we could go for this. It will > solve the precision problems and will also work for LIPO summaries. > I would be curious about effect on profiledbootstrap time of this if you implement > it. I'm hoping that 2) will be accurate enough in practice, but it will need some investigation. Thanks, Teresa > > Honza
> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> > I can go ahead with the histogram approach. There is some roundoff > >> > error from the working set scaling approach that can affect different > >> > merging orders as you note, although I think this only really affects the > >> > small counter values. The other place where saving/merging the histogram > > > > Do you have any intuition on why simple maximalization merging (that is safe wrt > > ordering) would be bad idea? > > When you say "maximalization merging" are you talking about the > histogram merging approach I mentioned a few emails back (my response > on Aug 19) where we assume the same relative order of hotness in the > counters between runs, and accumulate the counter values in the > histogram in that order? I was speaking of BB (counter) counts only here and considering the simplest strategy: maximalize the BB counts in corresponding buckets and sum the cumulative values in the corresponding buckets. The strategy of scaling is more sensible, but it will also be sensitive to the order of train runs, i.e. it will give unstable results with parallel make. > > OK, so I guess we went through > > 1) two pass updating with race in between pases. > > 2) two pass updating with first pass updating countes and second having race only for summary update. > > (i.e. no races for counters) > > 3) two pass updating with flocking (and some way to handle detected deadlocks) > > 4) one pass updating with histogram merging + maximalization of working set. > > (we do not realy need to scale the buckets, we can simply merge the histograms > > and then mutiply them by nruns before comparing to actual counters. > > By merging the histograms (and accumulating the counter values stored > there as we merge), I don't think we need to multiply the counter > values by nruns, do we? With the simple approach we need, but... > > > This assumes that working sets of all runs are about the same, but should work > > resonably in practice I think. > > > > I guess 3/4 are acceptable WRT bootstrap reproducibility. > > > > I have no experience with flocking large number of files and portability of > > this solution i.e. to Windows. If you think that 2) would be too inaccurate > > in practice and 3) has chance to be portable, we could go for this. It will > > solve the precision problems and will also work for LIPO summaries. > > I would be curious about effect on profiledbootstrap time of this if you implement > > it. > > I'm hoping that 2) will be accurate enough in practice, but it will > need some investigation. Perhaps, weighing all the pros and cons, you are right here. I think getting results consistent across different orders of runs is more important than the possibility of a race on the last update, and 4) is probably getting quite a lot of GIGO issues. We can implement a more consistent locking mechanism incrementally if this turns out to be an issue. It is possible to actually detect the race: in the first round we update nruns. In the second round we can see if nruns has changed. If so, we have a problem since someone has already updated the file and will update the histograms. I would suggest skipping the update when nruns has changed, so the summary at least matches the counters in the current file accurately (even though it will be off for the whole program, since whoever updated the file may have not seen all our updates, just part of them). Just to reduce the chances that the trashed run is the most important one. Theoretically we could also make libgcov re-read all counters in this case and try another update, but I would not worry about that for the moment. We will still have problems with bootstrap: the summary of libbackend.a will depend on which of the compiler binaries executed last, since cc1 will have a different summary from cc1plus because the frontend is different. 
I would give extra points to a volunteer who changes libgcov to do multiple summaries for different programs as intended originally (if you search the archives in 2001-2003, somewhere there are patches that changed libgcov from not merging at all to merging always. I guess changing it to merge just when the program checksum matches is not that hard). Thanks, Honza
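The nruns-based race check Honza describes can be as small as this: remember the run count written in the first round, and skip the summary rewrite in the second round if it changed underneath us. A sketch with hypothetical names, not the actual libgcov types:

```c
/* Hypothetical program summary record; only the run count matters for
   the race check.  */
typedef struct
{
  unsigned runs;
} program_summary;

/* Returns 1 if this process may safely write its recomputed summary,
   0 if another updater bumped nruns between the two rounds, in which
   case the update is skipped so the on-disk summary stays consistent
   with some single updater's view of the counters.  */
static int
summary_update_safe (const program_summary *on_disk, unsigned runs_we_wrote)
{
  return on_disk->runs == runs_we_wrote;
}
```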
On Tue, Aug 21, 2012 at 11:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> > I can go ahead with the histogram approach. There is some roundoff >> >> > error from the working set scaling approach that can affect different >> >> > merging orders as you note, although I think this only really affects the >> >> > small counter values. The other place where saving/merging the histogram >> > >> > Do you have any intuition on why simple maximalization merging (that is safe wrt >> > ordering) would be bad idea? >> >> When you say "maximalization merging" are you talking about the >> histogram merging approach I mentioned a few emails back (my response >> on Aug 19) where we assume the same relative order of hotness in the >> counters between runs, and accumulate the counter values in the >> histogram in that order? > > I was speaking of BB (counter) counts only here and considering the simplest > strategy: maximalize the BB counts in corresponding buckets and sum the cummlative > values in the corresponding buckets. I'm concerned about inaccuracies arising when multiple runs have different sizes and therefore counters are in different buckets in the corresponding histograms. We would end up in a situation where the merged histogram has more "counters" in it than there are actual counters in the program. > > The strategy of scaling is more sensible, but it will also be sensitive to order > of train runs, i.e. it will give unstable results with parallel make. Regarding the approach of merging histograms by matching up counters in the same order and apportioning out the histogram cumulative values evenly between counters in the same bucket before merging, I don't think the ordering will have a huge effect. 
The reason is that counters we end up matching together and accumulating will stay in the same relative order, and the min counter values are simply summed to compute the new bucket index and new min counter value to record. For the cumulative value, since the incoming cumulative values will be apportioned evenly between the counters coming from the same bucket when computing the new cumulative value (which is then just another add), there could be some slight differences from different round off errors when histograms are merged in different orders. But I would expect that to result in only small differences in the cumulative values in the histogram. Is that acceptable? >> > OK, so I guess we went through >> > 1) two pass updating with race in between pases. >> > 2) two pass updating with first pass updating countes and second having race only for summary update. >> > (i.e. no races for counters) >> > 3) two pass updating with flocking (and some way to handle detected deadlocks) >> > 4) one pass updating with histogram merging + maximalization of working set. >> > (we do not realy need to scale the buckets, we can simply merge the histograms >> > and then mutiply them by nruns before comparing to actual counters. >> >> By merging the histograms (and accumulating the counter values stored >> there as we merge), I don't think we need to multiply the counter >> values by nruns, do we? > > With the simple approach we need, but... >> >> > This assumes that working sets of all runs are about the same, but should work >> > resonably in practice I think. >> > >> > I guess 3/4 are acceptable WRT bootstrap reproducibility. >> > >> > I have no experience with flocking large number of files and portability of >> > this solution i.e. to Windows. If you think that 2) would be too inaccurate >> > in practice and 3) has chance to be portable, we could go for this. It will >> > solve the precision problems and will also work for LIPO summaries. 
>> > I would be curious about effect on profiledbootstrap time of this if you implement >> > it. >> >> I'm hoping that 2) will be accurate enough in practice, but it will >> need some investigation. > > Perhaps weighting all pros/cons you are right here. I think getting results > consistent across different order of runs is more important than the > possibility of race on the last update and 4) is probably getting quite a lot > of GIGO issues. > > We can implement more consistent locking mechanizm incrementally if this turns > out to be issue. > > It is posible to actually detect the race: at first round we are updating the > nruns. In the second round we can see if nruns has changed. If so, we have > problem since someone has already updated the file and will update the > histograms. > > I would suggest skiping update when nruns has changes, so the summary at least > match the counters in current file accurately (even though it will be off for > the whole program since whoever updated the file may have not seen all our > updates, just part of them). Just to reduce changes that the trashed run is > the most important one. Theoretically we could also make libgcov to re-read all > counters in this case and try another update, but I would not worry about that > for the moment. Ok, I think that makes sense. If it is ok with you, I'd prefer to go with a 3-step approach: 1) Implement some form of histogram merging (one of the options we discussed above). (Your approach #4). 2) Implement approach #2 (restructure into 2 passes with counters being updated accurately and a possible race on the histogram computation). 3) Investigate and implement either approach #3 (some kind of global lock) or the race detection you describe above. > > We will still have problems with bootstrap: the summary of libbackend.a will > depend on what of compiler binaries has executed last, since cc1 will have > different summary from cc1plus becuase frontend is different. I would give > extra points to voluteer who changes libgcov to do multiple summaries for > different programs as intended originally (if you search archives in 2001-2003, > somewhere are patches that changes libgcov from not merging at all to merging > always. I guess changing it to merge just when program checksum match is not > that hard). In this case, I assume that currently the merge will abort because the # of counters, etc. don't match? Thanks, Teresa > > Thanks, > Honza
>
> I'm concerned about inaccuracies arising when multiple runs have
> different sizes and therefore counters are in different buckets in the
> corresponding histograms. We would end up in a situation where the
> merged histogram has more "counters" in it than there are actual
> counters in the program.

Yes, it is very inaccurate.

> >
> > The strategy of scaling is more sensible, but it will also be sensitive to
> > the order of train runs, i.e. it will give unstable results with parallel
> > make.
>
> Regarding the approach of merging histograms by matching up counters
> in the same order and apportioning out the histogram cumulative values
> evenly between counters in the same bucket before merging, I don't
> think the ordering will have a huge effect. The reason is that
> counters we end up matching together and accumulating will stay in the
> same relative order, and the min counter values are simply summed to
> compute the new bucket index and new min counter value to record. For
> the cumulative value, since the incoming cumulative values will be
> apportioned evenly between the counters coming from the same bucket
> when computing the new cumulative value (which is then just another
> add), there could be some slight differences from different round off
> errors when histograms are merged in different orders. But I would
> expect that to result in only small differences in the cumulative
> values in the histogram. Is that acceptable?

Yes, I think it will be a small difference and it will give reasonable
results in practice. I am only concerned about the reproducibility of the
bootstraps that...

> Ok, I think that makes sense.
>
> If it is ok with you, I'd prefer to go with a 3 step approach:
> 1) Implement some form of histogram merging (one of the options we
> discussed above). (Your approach #4).
> 2) Implement approach #2 (restructure into 2 passes with counters
> being updated accurately and a possible race on the histogram
> computation).
> 3) Investigate and implement either approach #3 (some kind of global
> lock) or the race detection you describe above.

Yes, this sounds like a good plan.

> >
> > We will still have problems with bootstrap: the summary of libbackend.a
> > will depend on which of the compiler binaries executed last, since cc1
> > will have a different summary from cc1plus because the frontend is
> > different. I would give extra points to a volunteer who changes libgcov to
> > do multiple summaries for different programs as intended originally (if
> > you search the archives in 2001-2003, somewhere there are patches that
> > change libgcov from not merging at all to merging always. I guess changing
> > it to merge just when the program checksum matches is not that hard).
>
> In this case, I assume that currently the merge will abort because the
> # counters, etc don't match?

We only abort when the number of counters in the current file does not match
(i.e. the gcda file is from a different compilation). We accept when the
overall program is different, i.e. libbackend.a is linked into multiple
binaries. Since the histogram summing is global across all profiled
libraries, we will end up with different histograms depending on which
program using a given library was executed last.

To get this stable, I think it is only possible to go with #3 (i.e.
program-wide global locking) and get different records for different
programs (i.e. compute a checksum of all objects libgcov sees and merge only
with matching checksums). In coverage.c we can then either choose the proper
record when doing LTO, knowing what program we are currently compiling, or
do histogram merging in your way but after sorting the entries into some
canonical order (i.e. increasing checksum).

So we can track this incrementally. I guess if you go with step 1) as you
describe, we can move ahead.
Another thing I have a patch for is to make LTO ipa-profile produce more
accurate histograms based on real instruction counts (here we can read in
all gcda files, turn them into CFG profiles, and do our own summary). I
suppose that once you merge in your patches I can update this, and we will
have two mechanisms, so we will also have data on how your implementation
diverges from the precise solution.

Thanks!
Honza
Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189420)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,252 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Structure used for each bucket of the log2 histogram of counter values.  */
+
+typedef struct
+{
+  /* The total number of BBs whose profile count falls within the bucket.  */
+  gcov_unsigned_t count;
+  /* The minimum profile count placed in this bucket.  */
+  gcov_type min_value;
+  /* This is the cumulative value of the profile counts in this histogram
+     bucket.  */
+  gcov_type cum_value;
+} bucket_type;
+
+/* For a log2 scale histogram with each range split into 4
+   linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2
+   ranges since the lowest 2 log2 values share the lowest 4 linear
+   sub-range (values 0 - 3).  */
+
+#define HISTOGRAM_SIZE (GCOV_TYPE_SIZE - 1) * 4
+
+/* Determine the index into histogram for VALUE.  */
+
+static unsigned
+histo_index(gcov_type value)
+{
+  gcov_type_unsigned v = (gcov_type_unsigned)value;
+  unsigned r = 0;
+  unsigned prev2bits = 0;
+
+  /* Find index into log2 scale histogram, where each of the log2
+     sized buckets is divided into 4 linear sub-buckets for better
+     focus in the higher buckets.  */
+
+  /* Find the place of the most-significant bit set.  */
+  if (v > 0)
+    r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v);
+
+  /* If at most the 2 least significant bits are set (value is
+     0 - 3) then that value is our index into the lowest set of
+     four buckets.  */
+  if (r < 2)
+    return (unsigned)value;
+
+  gcc_assert (r < 64);
+
+  /* Find the two next most significant bits to determine which
+     of the four linear sub-buckets to select.  */
+  prev2bits = (v >> (r - 2)) & 0x3;
+  /* Finally, compose the final bucket index from the log2 index and
+     the next 2 bits.  The minimum r value at this point is 2 since we
+     returned above if r was less than 2, so the minimum bucket at this
+     point is 4.  */
+  return (r - 1) * 4 + prev2bits;
+}
+
+/* Insert counter VALUE into HISTOGRAM.  */
+
+static void
+histogram_insert(bucket_type *histogram, gcov_type value)
+{
+  unsigned i;
+
+  i = histo_index(value);
+  gcc_assert (i < HISTOGRAM_SIZE);
+
+  histogram[i].count++;
+  histogram[i].cum_value += value;
+  if (value < histogram[i].min_value)
+    histogram[i].min_value = value;
+}
+
+/* Determines the working set statistics to place in the summary SUM.  This is
+   an array of information for a range of percentages of the total execution
+   count (sum_all), and includes the number of counters required to cover that
+   working set percentage and the minimum counter value in that working set.  */
+
+static void
+gcov_compute_working_set (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, ws_ix, ctr_info_ix, ix;
+  gcov_unsigned_t c_num, count;
+  int h_ix;
+  bucket_type histogram[HISTOGRAM_SIZE];
+  gcov_type *working_set_cum_values;
+  gcov_type ws_cum_hotness_incr;
+  gcov_type cum, tmp_cum;
+  char *ws_entries_str;
+  unsigned num_ws_entries;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
+    {
+      histogram[h_ix].count = 0;
+      histogram[h_ix].min_value = cs_ptr->run_max;
+      histogram[h_ix].cum_value = 0;
+    }
+
+#define MAX_WS_SUMMARY_ENTRIES 128
+  ws_entries_str = getenv ("GCOV_WORKING_SET_ENTRIES");
+  if (ws_entries_str && strlen (ws_entries_str))
+    {
+      num_ws_entries = atoi (ws_entries_str);
+      /* Limit the summary information to a reasonable amount.
+         The maximum value gives statistics for more than every 1%
+         of the sum_all, which should be sufficient granularity for
+         optimizations.  It also ensures that the last entry of 99.9%
+         is the largest value.  */
+      if (num_ws_entries > MAX_WS_SUMMARY_ENTRIES)
+        num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+    }
+  else
+    num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+
+  cs_ptr->working_set_count = num_ws_entries;
+
+  if (! num_ws_entries)
+    return;
+
+  /* Next fill in an array of the cumulative hotness values corresponding
+     to each working set summary entry we are going to compute below.  */
+  working_set_cum_values
+      = (gcov_type *) malloc (sizeof (gcov_type) * num_ws_entries);
+
+  /* Compute the amount of sum_all that the cumulative hotness grows
+     by in each successive working set entry, which depends on the
+     number of working set entries requested.  By default num_ws_entries
+     is a power-of-two so that divide becomes a shift.  */
+  ws_cum_hotness_incr = cs_ptr->sum_all / num_ws_entries;
+
+  /* Skip 0% statistics, which can be extrapolated from the
+     rest of the summary data.  */
+  cum = ws_cum_hotness_incr;
+  for (ws_ix = 0; ws_ix < num_ws_entries; ws_ix++, cum += ws_cum_hotness_incr)
+    working_set_cum_values[ws_ix] = cum;
+  /* The last summary entry is reserved for (roughly) 99.9% of the
+     working set.  Divide by 1024 so it becomes a shift, which gives
+     almost exactly 99.9%.  */
+  working_set_cum_values[num_ws_entries-1]
+      = cs_ptr->sum_all - cs_ptr->sum_all/1024;
+
+  /* Next, walk through all the per-object structures and record each of
+     the count values in histogram.  */
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++)
+        {
+          if (gi_ptr->merge[ix])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          for (ix = 0; ix < ci_ptr->num; ix++)
+            histogram_insert(histogram, ci_ptr->values[ix]);
+        }
+    }
+
+  /* Next, walk through the histogram in descending order of hotness
+     and compute the statistics for the working set summary array.
+     As histogram entries are accumulated, we check to see which
+     working set entries have had their expected cum_value reached
+     and fill them in, walking the working set entries in increasing
+     size of cum_value.  */
+  cs_ptr->working_sets = (gcov_ws_info_t *) malloc (
+      num_ws_entries * sizeof (gcov_ws_info_t));
+  ws_ix = 0; /* The current entry into the working set array.  */
+  cum = 0; /* The current accumulated counter sum.  */
+  count = 0; /* The current accumulated count of block counters.  */
+  for (h_ix = HISTOGRAM_SIZE - 1; h_ix > 0 && ws_ix < num_ws_entries; h_ix--)
+    {
+      /* If we haven't reached the required cumulative counter value for
+         the current working set percentage, simply accumulate this histogram
+         entry into the running sums and continue to the next histogram
+         entry.  */
+      if (cum + histogram[h_ix].cum_value < working_set_cum_values[ws_ix])
+        {
+          cum += histogram[h_ix].cum_value;
+          count += histogram[h_ix].count;
+          continue;
+        }
+
+      /* If adding the current histogram entry's cumulative counter value
+         causes us to exceed the current working set size, then estimate
+         how many of this histogram entry's counter values are required to
+         reach the working set size, and fill in working set entries
+         as we reach their expected cumulative value.  */
+      for (c_num = 0, tmp_cum = cum;
+           c_num < histogram[h_ix].count && ws_ix < num_ws_entries;
+           c_num++)
+        {
+          count++;
+          /* If we haven't reached the last histogram entry counter, add
+             in the minimum value again.  This will underestimate the
+             cumulative sum so far, because many of the counter values in this
+             entry may have been larger than the minimum.  We could add in the
+             average value every time, but that would require an expensive
+             divide operation.  */
+          if (c_num + 1 < histogram[h_ix].count)
+            tmp_cum += histogram[h_ix].min_value;
+          /* If we have reached the last histogram entry counter, then add
+             in the entire cumulative value.  */
+          else
+            tmp_cum = cum + histogram[h_ix].cum_value;
+
+          /* Next walk through successive working set entries and fill in
+             the statistics for any whose size we have reached by accumulating
+             this histogram counter.  */
+          while (tmp_cum >= working_set_cum_values[ws_ix]
+                 && ws_ix < num_ws_entries)
+            {
+              cs_ptr->working_sets[ws_ix].num_counters = count;
+              cs_ptr->working_sets[ws_ix].min_bb_counter
+                  = histogram[h_ix].min_value;
+              ws_ix++;
+            }
+        }
+      /* Finally, update the running cumulative value since we were
+         using a temporary above.  */
+      cum += histogram[h_ix].cum_value;
+    }
+  gcc_assert (ws_ix == num_ws_entries);
+  free (working_set_cum_values);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +593,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_working_set (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.  */
@@ -389,7 +636,7 @@ gcov_exit (void)
   /* Now merge each file.  */
   for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
     {
-      unsigned n_counts;
+      unsigned n_counts, ws_cnt;
       struct gcov_summary prg; /* summary for this object over all
                                   program.  */
       struct gcov_ctr_summary *cs_prg, *cs_tprg, *cs_all;
@@ -482,8 +729,6 @@ gcov_exit (void)
               f_ix--;
               length = gcov_read_unsigned ();
-              if (length != GCOV_TAG_SUMMARY_LENGTH)
-                goto read_mismatch;
               gcov_read_summary (&tmp);
               if ((error = gcov_is_error ()))
                 goto read_error;
@@ -597,8 +842,35 @@ gcov_exit (void)
 
           if (gi_ptr->merge[t_ix])
             {
+              ws_cnt = cs_tprg->working_set_count;
               if (!cs_prg->runs++)
-                cs_prg->num = cs_tprg->num;
+                {
+                  cs_prg->num = cs_tprg->num;
+                  /* Allocate the working set array for the merged summary.  */
+                  if (ws_cnt)
+                    {
+                      cs_prg->working_set_count = ws_cnt;
+                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
+                          ws_cnt * sizeof (gcov_ws_info_t));
+                    }
+                }
+              else if (cs_prg->num != cs_tprg->num
+                       || ws_cnt != cs_prg->working_set_count)
+                goto read_mismatch;
+              /* Copy over this run's working set information if either this is
+                 the first run, the total size of the profile (sum_all) is much
+                 (50%) larger for this run (need to check this before updating
+                 cs_prg->sum_all below), or this run has a larger working
+                 set in the largest (99.99% of sum_all) bucket.  */
+              if (ws_cnt
+                  && (cs_prg->runs == 1
+                      || (cs_tprg->sum_all
+                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
+                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
+                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
+                memcpy (cs_prg->working_sets,
+                        cs_tprg->working_sets,
+                        ws_cnt * sizeof (gcov_ws_info_t));
               cs_prg->sum_all += cs_tprg->sum_all;
               if (cs_prg->run_max < cs_tprg->run_max)
                 cs_prg->run_max = cs_tprg->run_max;
@@ -696,7 +968,11 @@ gcov_exit (void)
                    "profiling:%s:Overflow writing\n" :
                    "profiling:%s:Error writing\n",
                    gi_filename);
+
+      gcov_destroy_summary (&prg);
     }
+
+  gcov_destroy_summary (&this_prg);
 }
 
 /* Reset all counters to zero.
   */
Index: gcc/gcov.c
===================================================================
--- gcc/gcov.c	(revision 189924)
+++ gcc/gcov.c	(working copy)
@@ -1264,6 +1264,7 @@ read_count_file (function_t *fns)
 	  struct gcov_summary summary;
 	  gcov_read_summary (&summary);
 	  object_runs += summary.ctrs[GCOV_COUNTER_ARCS].runs;
+	  gcov_destroy_summary (&summary);
 	  program_count++;
 	}
       else if (tag == GCOV_TAG_FUNCTION && !length)
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189924)
+++ gcc/gcov-io.c	(working copy)
@@ -368,10 +368,13 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u
 GCOV_LINKAGE void
 gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix, ws_cnt;
   const struct gcov_ctr_summary *csum;
 
-  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH);
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE, ws_cnt = 0;
+       ix--; csum++)
+    ws_cnt += csum->working_set_count;
+  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(ws_cnt));
   gcov_write_unsigned (summary->checksum);
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
@@ -380,6 +383,12 @@ gcov_write_summary (gcov_unsigned_t tag, const str
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
       gcov_write_counter (csum->sum_max);
+      gcov_write_unsigned (csum->working_set_count);
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  gcov_write_unsigned (csum->working_sets[ws_ix].num_counters);
+	  gcov_write_counter (csum->working_sets[ws_ix].min_bb_counter);
+	}
     }
 }
 #endif /* IN_LIBGCOV */
@@ -488,7 +497,7 @@ gcov_read_string (void)
 GCOV_LINKAGE void
 gcov_read_summary (struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix;
   struct gcov_ctr_summary *csum;
 
   summary->checksum = gcov_read_unsigned ();
@@ -499,9 +508,35 @@ gcov_read_summary (struct gcov_summary *summary)
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
       csum->sum_max = gcov_read_counter ();
+      csum->working_set_count = gcov_read_unsigned ();
+      csum->working_sets
+#if IN_LIBGCOV
+	  = (gcov_ws_info_t *) malloc (csum->working_set_count *
+#else
+	  = (gcov_ws_info_t *) xmalloc (csum->working_set_count *
+#endif
+				       sizeof (gcov_ws_info_t));
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  csum->working_sets[ws_ix].num_counters = gcov_read_unsigned ();
+	  csum->working_sets[ws_ix].min_bb_counter = gcov_read_counter ();
+	}
     }
 }
 
+GCOV_LINKAGE void
+gcov_destroy_summary (struct gcov_summary *summary)
+{
+  unsigned ix;
+  struct gcov_ctr_summary *csum;
+
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
+    {
+      if (csum->working_set_count)
+	free (csum->working_sets);
+    }
+}
+
 #if !IN_LIBGCOV
 /* Reset to a known position.  BASE should have been obtained from
    gcov_position, LENGTH should be a record length.  */
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189924)
+++ gcc/gcov-io.h	(working copy)
@@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect
 	counts: header int64:count*
 	summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
 	count-summary:	int32:num int32:runs int64:sum
-			int64:max int64:sum_max
+			int64:max int64:sum_max working-set-summary
+	working-set-summary: int32:count working-sets*
+	working-sets: int32:num int64:min
 
    The ANNOUNCE_FUNCTION record is the same as that in the note file,
    but without the source location.  The COUNTS gives the
@@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (SI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (DI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #endif
 #else
 #if BITS_PER_UNIT == 16
@@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (HI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #endif
 #else
 typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI)));
 typedef unsigned gcov_position_t __attribute__ ((mode (QI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (QI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI)));
 #endif
 #endif
 #endif
@@ -213,8 +221,6 @@ typedef HOST_WIDEST_INT gcov_type;
 #if IN_GCOV > 0
 #include <sys/types.h>
 #endif
-#else /*!IN_GCOV */
-#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
 #endif
 
 #if defined (HOST_HAS_F_SETLKW)
@@ -225,6 +231,8 @@ typedef HOST_WIDEST_INT gcov_type;
 
 #endif /* !IN_LIBGCOV */
 
+#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
+
 /* In gcov we want function linkage to be static.  In the compiler we want
    it extern, so that they can be accessed from elsewhere.  In libgcov we
    need these functions to be extern, so prefix them with __gcov.  In
@@ -248,6 +256,7 @@ typedef HOST_WIDEST_INT gcov_type;
 #define gcov_read_unsigned __gcov_read_unsigned
 #define gcov_read_counter __gcov_read_counter
 #define gcov_read_summary __gcov_read_summary
+#define gcov_destroy_summary __gcov_destroy_summary
 
 /* Poison these, so they don't accidentally slip in.  */
 #pragma GCC poison gcov_write_string gcov_write_tag gcov_write_length
@@ -309,9 +318,10 @@ typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2)
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
-#define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+#define GCOV_TAG_SUMMARY_LENGTH(NUM)  \
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 3)
+
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS	0  /* Arc transitions.  */
 #define GCOV_COUNTERS_SUMMABLE	1  /* Counters which can be
@@ -389,6 +399,16 @@ typedef HOST_WIDEST_INT gcov_type;
 
 /* Structured records.  */
 
+/* Working set size statistics for a given percentage of the entire
+   profile (sum_all).  */
+typedef struct gcov_working_set_info
+{
+  /* Number of hot counters included in this working set.  */
+  gcov_unsigned_t num_counters;
+  /* Smallest counter included in this working set.  */
+  gcov_type min_bb_counter;
+} gcov_ws_info_t;
+
 /* Cumulative counter data.  */
 struct gcov_ctr_summary
 {
@@ -397,6 +417,8 @@ struct gcov_ctr_summary
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
   gcov_type sum_max;		/* sum of individual run max values.  */
+  gcov_unsigned_t working_set_count; /* number of working set infos.  */
+  gcov_ws_info_t *working_sets; /* array of working set info.  */
 };
 
 /* Object & program summary record.  */
@@ -552,6 +574,7 @@ static int gcov_is_error (void);
 GCOV_LINKAGE gcov_unsigned_t gcov_read_unsigned (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE gcov_type gcov_read_counter (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
+GCOV_LINKAGE void gcov_destroy_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
 
 #if IN_LIBGCOV
 /* Available only in libgcov */
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189924)
+++ gcc/coverage.c	(working copy)
@@ -172,6 +172,8 @@ htab_counts_entry_del (void *of)
   counts_entry_t *const entry = (counts_entry_t *) of;
 
   free (entry->counts);
+  if (entry->summary.working_sets)
+    free (entry->summary.working_sets);
   free (entry);
 }
 
@@ -247,12 +249,40 @@ read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+	      gcov_unsigned_t ws_cnt = sum.ctrs[ix].working_set_count;
+	      /* Allocate the working set array for a new summary.  */
+	      if (ws_cnt && new_summary)
+		{
+		  summary.ctrs[ix].working_set_count = ws_cnt;
+		  summary.ctrs[ix].working_sets
+		      = (gcov_ws_info_t *) xmalloc (ws_cnt
+						    * sizeof (gcov_ws_info_t));
+		}
+	      /* Copy over this summary's working set information if either
+		 summary is new, the total size of the profile (sum_all) is
+		 much (50%) larger for this summary (need to check this before
+		 updating sum_all below), or this summary has a larger working
+		 set in the largest (99.99% of sum_all) bucket.  */
+	      if (ws_cnt
+		  && ws_cnt == summary.ctrs[ix].working_set_count
+		  && (new_summary
+		      || (sum.ctrs[ix].sum_all
+			  > (summary.ctrs[ix].sum_all
+			     + summary.ctrs[ix].sum_all / 2))
+		      || (sum.ctrs[ix].working_sets[ws_cnt - 1].num_counters
+			  > summary.ctrs[ix].working_sets[ws_cnt - 1].num_counters)))
+		{
+		  memcpy (summary.ctrs[ix].working_sets,
+			  sum.ctrs[ix].working_sets,
+			  ws_cnt * sizeof (gcov_ws_info_t));
+		}
	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
 		summary.ctrs[ix].run_max = sum.ctrs[ix].run_max;
 	      summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max;
 	    }
+	  gcov_destroy_summary (&sum);
 	  new_summary = 0;
 	}
       else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident)
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189924)
+++ gcc/gcov-dump.c	(working copy)
@@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	     unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED)
 {
   struct gcov_summary summary;
-  unsigned ix;
+  unsigned ix, jx, pctinc, pct;
+  gcov_ws_info_t *ws_info;
 
   gcov_read_summary (&summary);
   printf (" checksum=0x%08x", summary.checksum);
@@ -465,5 +466,27 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	      (HOST_WIDEST_INT)summary.ctrs[ix].run_max);
       printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC,
 	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_max);
+      if (! summary.ctrs[ix].working_set_count)
+	continue;
+      printf ("\n");
+      print_prefix (filename, 0, 0);
+      printf ("\t\tworking set statistics:");
+      /* Multiply the percentage by 100 to avoid float.  */
+      pctinc = 100 * 100 / summary.ctrs[ix].working_set_count;
+      for (jx = 0, pct = pctinc; jx < summary.ctrs[ix].working_set_count;
+	   jx++, pct += pctinc)
+	{
+	  if (jx == summary.ctrs[ix].working_set_count - 1)
+	    pct = 9999;
+	  ws_info = &summary.ctrs[ix].working_sets[jx];
+	  printf ("\n");
+	  print_prefix (filename, 0, 0);
+	  /* Print out the percentage using int arithmetic to avoid float.  */
+	  printf ("\t\t%u.%u%%: num counts=%u, min counter="
+		  HOST_WIDEST_INT_PRINT_DEC,
+		  pct / 100, pct - pct / 100,
+		  ws_info->num_counters, (HOST_WIDEST_INT)ws_info->min_bb_counter);
+	}
     }
+  gcov_destroy_summary (&summary);
 }