Patchwork Add working-set size and hotness information to fdo summary (issue6465057)

Submitter Teresa Johnson
Date Aug. 16, 2012, 2:21 p.m.
Message ID <20120816142137.7F5EA61414@tjsboxrox.mtv.corp.google.com>
Permalink /patch/178003/
State New

Comments

Teresa Johnson - Aug. 16, 2012, 2:21 p.m.
This is a revision of my earlier patch to add working set information to
the gcov program summary, based on discussions between Honza, David and me,
and approved with changes by Honza in this earlier thread on the first patch:
  http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01412.html
I plan to commit this in a couple of days unless more revisions are
requested.

Changes from the earlier patch are the removal of the loop unroller changes
that utilized the new summary info, the feeding back of an array of working
set information for different cutoffs, and the use of a histogram to compute
the working set information, avoiding expensive copying and qsorting.

Patch description:

Patch to add new summary information on counter working set sizes to gcov
profiles. The function summaries now include an array of working set data,
where each data point corresponds to a percentage of the total sum_all of
the counters and includes the number of hot counters and the minimum counter
value included in that working set.

By default, there are 128 working set data points included in the summary,
but this is controlled at runtime with the GCOV_WORKING_SET_ENTRIES
environment variable. The first GCOV_WORKING_SET_ENTRIES - 1 entries
correspond to percentages 100.0/GCOV_WORKING_SET_ENTRIES through
100.0/GCOV_WORKING_SET_ENTRIES * (GCOV_WORKING_SET_ENTRIES - 1) of sum_all.
The last entry always corresponds to 99.99% of sum_all.
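As a concrete illustration, the cutoff scheme above can be sketched as follows (a minimal sketch; `NUM_ENTRIES` and `compute_cutoffs` are illustrative names, not the actual libgcov code):

```c
#include <stdint.h>

/* Illustrative sketch of the cutoff targets described above: entry i
   covers (i + 1)/NUM_ENTRIES of sum_all, except the last entry, which
   always covers 99.99% of sum_all.  */
#define NUM_ENTRIES 128

static void
compute_cutoffs (uint64_t sum_all, uint64_t cutoffs[NUM_ENTRIES])
{
  unsigned i;
  for (i = 0; i < NUM_ENTRIES - 1; i++)
    cutoffs[i] = sum_all / NUM_ENTRIES * (i + 1);
  /* 99.99% of sum_all, i.e. sum_all minus 0.01% of it.  */
  cutoffs[NUM_ENTRIES - 1] = sum_all - sum_all / 10000;
}
```

Each working set entry then records how many of the hottest counters are needed to reach its cutoff, and the minimum counter value among them.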

A log2-based histogram is used to compute the working set information,
to avoid the need to copy all counters into a large array and invoke
qsort.
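The bucketing idea can be sketched like this (a hypothetical sketch of a log2-based index function with four linear sub-buckets per power of two; not the actual histo_index implementation from the patch):

```c
#include <stdint.h>

/* Map a counter value to one of roughly 64*4 log2-based buckets:
   the power-of-two range gives the coarse index, and the next two
   bits below the top bit pick one of four linear sub-buckets.
   Illustrative only.  */
#define HISTO_SUB 4   /* linear sub-buckets per power of two */

static unsigned
histo_index (uint64_t value)
{
  unsigned bits = 0;
  uint64_t v = value;

  if (value < HISTO_SUB)   /* smallest values map directly */
    return (unsigned) value;

  while (v >>= 1)          /* compute floor(log2(value)) */
    bits++;

  return bits * HISTO_SUB
         + (unsigned) ((value >> (bits - 2)) & (HISTO_SUB - 1));
}
```

Counters are inserted into these buckets in one pass; the working set entries are then computed by walking the buckets from hottest to coldest, with no per-counter copy or sort.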

Bootstrapped and tested on x86_64-unknown-linux-gnu.

2012-08-16  Teresa Johnson  <tejohnson@google.com>

        * libgcc/libgcov.c (histo_index): New function.
        (histogram_insert): Ditto.
        (gcov_compute_working_set): Ditto.
        (gcov_exit): Invoke gcov_compute_working_set, and
        perform merging of new working set summary info.
        * gcc/gcov.c (read_count_file): Invoke gcov_destroy_summary to
        clean up allocated working set memory.
        * gcc/gcov-io.c (gcov_write_summary): Include length of working
        set summary info in GCOV_TAG_SUMMARY_LENGTH, and write out working
        set summary.
        (gcov_read_summary): Read working set summary info.
        (gcov_destroy_summary): New function.
        * gcc/gcov-io.h (gcov_type_unsigned): New type.
        (gcov_destroy_summary): Declare.
        (GCOV_TAG_SUMMARY_LENGTH): Update to include working set summary.
        (struct gcov_working_set_info): New structure.
        (struct gcov_ctr_summary): Include working set summary.
        * gcc/coverage.c (htab_counts_entry_del): Free working sets.
        (read_counts_file): Read and merge in working set summary info.
        * gcc/gcov-dump.c (tag_summary): Dump out working set summary info.


--
This patch is available for review at http://codereview.appspot.com/6465057
Jan Hubicka - Aug. 18, 2012, 8:19 a.m.
> +                {
> +                  cs_prg->num = cs_tprg->num;
> +                  /* Allocate the working set array for the merged summary.  */
> +                  if (ws_cnt)
> +                    {
> +                      cs_prg->working_set_count = ws_cnt;
> +                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> +                          ws_cnt * sizeof (gcov_ws_info_t));
> +                    }
> +                }
> +              else if (cs_prg->num != cs_tprg->num
> +                       || ws_cnt != cs_prg->working_set_count)
> +                goto read_mismatch;
> +              /* Copy over this run's working set information if either this is
> +                 the first run, the total size of the profile (sum_all) is much
> +                 (50%) larger for this run (need to check this before updating
> +                 cs_prg->sum_all below), or this run has a larger working
> +                 set in the largest (99.99% of sum_all) bucket.  */
> +              if (ws_cnt
> +                  && (cs_prg->runs == 1
> +                      || (cs_tprg->sum_all
> +                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
> +                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> +                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
> +                memcpy (cs_prg->working_sets,
> +                        cs_tprg->working_sets,
> +                        ws_cnt * sizeof (gcov_ws_info_t));
>  	      cs_prg->sum_all += cs_tprg->sum_all;

Hmm, when you run e.g. the GCC bootstrap, where the program is run a couple
hundred times, this will end up really inaccurate, since it will probably just
store the histogram of the insn-attrtab compilation, right?

Why don't you simply write the histogram into the gcov file and skip merging
the values here (i.e. do the accumulation loop in GCC instead of within libgcov)?
By default you are streaming 128 values, which is the same as needed to stream
the histogram. I suppose we can have an environment variable to reduce the
histogram size - I guess in smaller setups a smaller histogram will run just fine...

Honza
Teresa Johnson - Aug. 20, 2012, 4:59 a.m.
On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > +                {
> > +                  cs_prg->num = cs_tprg->num;
> > +                  /* Allocate the working set array for the merged summary.  */
> > +                  if (ws_cnt)
> > +                    {
> > +                      cs_prg->working_set_count = ws_cnt;
> > +                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> > +                          ws_cnt * sizeof (gcov_ws_info_t));
> > +                    }
> > +                }
> > +              else if (cs_prg->num != cs_tprg->num
> > +                       || ws_cnt != cs_prg->working_set_count)
> > +                goto read_mismatch;
> > +              /* Copy over this run's working set information if either this is
> > +                 the first run, the total size of the profile (sum_all) is much
> > +                 (50%) larger for this run (need to check this before updating
> > +                 cs_prg->sum_all below), or this run has a larger working
> > +                 set in the largest (99.99% of sum_all) bucket.  */
> > +              if (ws_cnt
> > +                  && (cs_prg->runs == 1
> > +                      || (cs_tprg->sum_all
> > +                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
> > +                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> > +                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
> > +                memcpy (cs_prg->working_sets,
> > +                        cs_tprg->working_sets,
> > +                        ws_cnt * sizeof (gcov_ws_info_t));
> >             cs_prg->sum_all += cs_tprg->sum_all;
>
> Hmm, when you run i.e. gcc bootstrap  where the program is run couple hundred
> times, this will end up really inaccurate, since it will probably just store
> the histogram of insn-attrtab compilation, right?


Well, it should store the largest working set in BBs, or one that came
from a much longer run. But I see the issue you are pointing out. The
num_counters (the number of hot bbs) should be reasonable, as the
total number of BBs is the same between all runs, and I want to show
the largest or most dominant working set in terms of the number of hot
bbs. But the min_bb_counter will become more and more inaccurate as
the number of runs increases, since the counter values are
accumulated.

I typed up a detailed email below on why getting this right would be
difficult, but then I realized there might be a fairly simple accurate
solution, which I'll describe first:

The only way I see to do this completely accurately is to take two
passes through the existing gcda files that we are merging into, one
to read in all the counter values and merge them into all the counter
values in the arrays from the current run (after which we can do the
histogramming/working set computation accurately from the merged
counters), and the second to rewrite them. In libgcov this doesn't
seem like it would be too difficult to do, although it is a little bit
of restructuring of the main merging loop and needs some special
handling for buffered functions (which could increase the memory
footprint a bit if there are many of these since they will all need to
be buffered across the iteration over all the gcda files).

The summary merging in coverage.c confuses me a bit as it seems to be
handling the case when there are multiple program summaries in a
single gcda file. When would this happen? It looks like the merge
handling in libgcov should always produce a single program summary per
gcda file.

>
>
> Why you don't simply write the histogram into gcov file and don't merge the values
> here (i.e. doing the cummulation loop in GCC instead of within libgcov)?

That doesn't completely solve the problem, unfortunately. The reason
is that you don't know which histogram entry contains a particular
block each run, and therefore you don't know how to compute the new
combined histogram value and index for that bb counter. For example, a
particular histogram index may have 5 counters (bbs) in it for one run
and 2 counters (bbs) in it for a second run, so the question is how to
compute the new entry for all of those bb counters, as the 5 bbs from
the first run may or may not be a superset of the 2 from the second
run. You could assume that the bbs have the same relative order of
hotness between runs, and combine the bb counters accordingly, but
there is still some trickiness because you have to apportion the
cumulative counter sum stored in the histogram entry between new
entries. For example, assume the highest indexed non-zero entries (the
histogram buckets containing the largest counter values) in the two
incoming histograms are:

histogram 1:

index 100: 4 counters, cumulative value X, min counter value minx
...

histogram 2:

index 100: 2 counters, cumulative value Y, min counter value miny
index 99: 3 counters, cumulative value W, min counter value minw
...

To merge, you could assume something like 2 counters with a new
cumulative value (Y + X*2/4), and new min counter value minx+miny,
that go into the merged histogram with the index corresponding to
counter value minx+miny. And then 2 counters have a new cumulative
value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
into the merged histogram with index corresponding to counter value
minw+minx. Etc... Not entirely accurate, although it might be a
reasonable approximation, but it also requires a number of division
operations during the merge in libgcov.
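To make the arithmetic above concrete, here is the same apportioning worked through with made-up numbers (purely illustrative; the values X, Y, W, minx, miny, minw are invented for the example):

```c
/* Numeric instance of the approximate merge described above:
   histogram 1 has 4 counters at index 100 (cumulative X, min minx);
   histogram 2 has 2 counters at index 100 (Y, miny) and 3 counters
   at index 99 (W, minw).  All values are made up.  */
static void
merge_example (long *cum1, long *min1, long *cum2, long *min2)
{
  long X = 400, minx = 90;
  long Y = 210, miny = 100;
  long W = 240, minw = 70;

  /* First merged entry: hist2's 2 counters at index 100 paired with
     2 of hist1's 4, apportioning X proportionally (X * 2/4).  */
  *cum1 = Y + X * 2 / 4;          /* 210 + 200 = 410 */
  *min1 = minx + miny;            /* 90 + 100 = 190  */

  /* Second merged entry: 2 of hist2's index-99 counters paired with
     hist1's remaining 2, apportioning W (W * 2/3) and X (X * 2/4).  */
  *cum2 = W * 2 / 3 + X * 2 / 4;  /* 160 + 200 = 360 */
  *min2 = minx + minw;            /* 90 + 70 = 160   */
}
```

Each merged entry then lands in the bucket indexed by its new minimum counter value, which is where the divisions in the libgcov merge come from.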

Another possibility, that might also provide a reasonable
approximation, would be to scale the min_bb_counter values in the
working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
is the new sum_all, and sum_all_orig is the sum_all from the summary
whose working_set was chosen to be propagated to the new merged
summary. This also requires some divisions at libgcov merge time,
unless we save the original sum_all along with the working set in the
summary and do the scaling at profile-use time.
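The scaling idea amounts to something like the following (a sketch under the stated assumption that one run's working set is kept and only its min counters are rescaled; names are illustrative):

```c
#include <stdint.h>

/* Rescale a working set entry's minimum counter by the ratio of the
   merged sum_all to the sum_all of the run whose working set was
   kept.  Multiply before dividing to preserve precision; a real
   implementation would need to guard against overflow for very
   large sums.  */
static uint64_t
scale_min_counter (uint64_t min_counter, uint64_t sum_all_merged,
                   uint64_t sum_all_orig)
{
  return min_counter * sum_all_merged / sum_all_orig;
}
```

Saving the original sum_all alongside the working set would let this division happen once at profile-use time instead of on every libgcov merge.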

> By default you are streaming 128 values that is the same as needed to stream the histogram.
> I suppose we can have environment variable to reduce the histogram size - I guess in smaller
> setups smaller histogram will run just fine...

It is a bit more, as the histogram will have 64*4 = 256 entries for
64-bit counters, and each entry contains 2 gcov_type counter values
and one unsigned int. The working set entries only have one gcov_type
counter and one unsigned. So it could be close to 4x.
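A back-of-envelope version of that comparison, assuming gcov_type is 8 bytes and unsigned is 4 bytes (assumptions, not measured sizes):

```c
/* Rough byte counts for the two representations discussed above,
   with gcov_type assumed 8 bytes and unsigned assumed 4 bytes.  */
static unsigned
histogram_bytes (void)
{
  /* 64 log2 ranges x 4 sub-buckets = 256 entries, each holding two
     gcov_type values and one unsigned: 256 x 20 bytes.  */
  return 64 * 4 * (2 * 8 + 4);
}

static unsigned
working_set_bytes (void)
{
  /* 128 entries, each one gcov_type and one unsigned: 128 x 12.  */
  return 128 * (8 + 4);
}
```

Under these assumptions the histogram is 5120 bytes versus 1536 for the working set array, roughly 3.3x, in the same ballpark as the "close to 4x" estimate above.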

What do you think?

Thanks,
Teresa

>
> Honza




--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Xinliang David Li - Aug. 20, 2012, 5:41 a.m.
On Sun, Aug 19, 2012 at 9:59 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> > +                {
>> > +                  cs_prg->num = cs_tprg->num;
>> > +                  /* Allocate the working set array for the merged summary.  */
>> > +                  if (ws_cnt)
>> > +                    {
>> > +                      cs_prg->working_set_count = ws_cnt;
>> > +                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
>> > +                          ws_cnt * sizeof (gcov_ws_info_t));
>> > +                    }
>> > +                }
>> > +              else if (cs_prg->num != cs_tprg->num
>> > +                       || ws_cnt != cs_prg->working_set_count)
>> > +                goto read_mismatch;
>> > +              /* Copy over this run's working set information if either this is
>> > +                 the first run, the total size of the profile (sum_all) is much
>> > +                 (50%) larger for this run (need to check this before updating
>> > +                 cs_prg->sum_all below), or this run has a larger working
>> > +                 set in the largest (99.99% of sum_all) bucket.  */
>> > +              if (ws_cnt
>> > +                  && (cs_prg->runs == 1
>> > +                      || (cs_tprg->sum_all
>> > +                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
>> > +                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
>> > +                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
>> > +                memcpy (cs_prg->working_sets,
>> > +                        cs_tprg->working_sets,
>> > +                        ws_cnt * sizeof (gcov_ws_info_t));
>> >             cs_prg->sum_all += cs_tprg->sum_all;
>>
>> Hmm, when you run i.e. gcc bootstrap  where the program is run couple hundred
>> times, this will end up really inaccurate, since it will probably just store
>> the histogram of insn-attrtab compilation, right?
>
>
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.
>
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these since they will all need to
> be buffered across the iteration over all the gcda files).
>

The current way of doing summary merging produces imprecise data (e.g.,
sum_max). It computes the current execution's summary first, and then
merges it with the previous run's summary data while the profile counters
are merged.  The right way is to merge the profile counters first, and then
recompute the final summary based on the merged profile.  The latter
way allows merging of more advanced summary data such as the histogram
here. It is also used in LIPO mode (for dynamic callgraph build/merge
-- kind of a summary itself).

However I think changing the way the summary is merged should probably be
done in a different patch.

David


> The summary merging in coverage.c confuses me a bit as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.
>
>>
>>
>> Why you don't simply write the histogram into gcov file and don't merge the values
>> here (i.e. doing the cummulation loop in GCC instead of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values that is the same as needed to stream the histogram.
>> I suppose we can have environment variable to reduce the histogram size - I guess in smaller
>> setups smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?
>
> Thanks,
> Teresa
>
>>
>> Honza
>
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka - Aug. 20, 2012, 9:48 a.m.
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.

Yes, and that is the one we plan to base hot/cold decisions on, right?

Note that there is no real 1-1 correspondence between BBs and counters.
For each function there should be num_edges - num_bbs + 1 counters.
What do you plan to use BB counts for?
> 
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
> 
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these since they will all need to
> be buffered across the iteration over all the gcda files).
> 
> The summary merging in coverage.c confuses me a bit as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.


This is something Nathan introduced years back. The idea was IMO to handle
objects linked into multiple binaries more accurately. I am not sure
if the code really works, or ever worked at some point.

The idea, as I recall it, was to produce an overall checksum of all objects and
have different profile records for each combination.

This is not really useful for profile feedback, as you generate a single object
file for all uses, but it might become useful for LTO, where you know which
binary you are linking into. I am not really sure it is worth all the
infrastructure needed to support this, though.
> 
> >
> >
> > Why you don't simply write the histogram into gcov file and don't merge the values
> > here (i.e. doing the cummulation loop in GCC instead of within libgcov)?
> 
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
> 
> histogram 1:
> 
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
> 
> histogram 2:
> 
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
> 
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
> 
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
> 
> > By default you are streaming 128 values that is the same as needed to stream the histogram.
> > I suppose we can have environment variable to reduce the histogram size - I guess in smaller
> > setups smaller histogram will run just fine...
> 
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
> 
> What do you think?

So we have options:
  1) ignore the problem that summaries become inaccurate with many train runs,
     as we do now.
  2) write the histogram only, do not track BB counts (or approximate them by
     scaling), and perhaps retire max_counter from the summary in favour of a
     histogram estimate.

     We will pay with more data being written (I think the histogram may actually
     compress pretty well by skipping zero entries, I don't know) and by getting
     only an estimated max_count from the histogram, which should still be good
     enough in practice (max_count is used only for the hot/cold decisions and
     those will be histogram based anyway).
  3) do two-stage processing: read the data into memory first, then produce the
     summaries and write them out.

2) seems appealing to me because it is simple, but there are limitations in
what it can handle.
3) solves the precision problems, but how do we handle the locking & races?  In a
case like bootstrap, where GCCs are executed in parallel, we will end up with
random results if the files get modified by another profiled run of GCC
between read-in and write-out.

So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does
it work for LIPO?

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> 
> 
> 
> 
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson - Aug. 20, 2012, 2:53 p.m.
On Mon, Aug 20, 2012 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Well, it should store the largest working set in BBs, or one that came
>> from a much longer run. But I see the issue you are pointing out. The
>> num_counters (the number of hot bbs) should be reasonable, as the
>> total number of BBs is the same between all runs, and I want to show
>> the largest or more dominant working set in terms of the number of hot
>> bbs. But the min_bb_counter will become more and more inaccurate as
>> the number of runs increases, since the counter values are
>> accumulated.
>
> Yes and that is the one that we plan to use to determine hot/cold decisions on,
> right?

Yes, so I agree it is important to do something to update this as
profiles are merged.

>
> Note that there is no really 1-1 corespondence in betwen BBs and counters.
> For each function the there should be num_edges-num_bbs+1 counters.
> What do you plan to use BB counts for?

True, although what I am using this for is to get an idea of the size
of the working set to help identify large icache footprints in order
to control code size increasing optimizations such as unrolling. The
idea is that a huge number of hot counters indicates a huge number of
hot bbs which in turn indicates a huge number of hot instructions and
therefore high icache pressure. We're using it in the google branch to
control unrolling successfully.

>>
>> I typed up a detailed email below on why getting this right would be
>> difficult, but then I realized there might be a fairly simple accurate
>> solution, which I'll describe first:
>>
>> The only way I see to do this completely accurately is to take two
>> passes through the existing gcda files that we are merging into, one
>> to read in all the counter values and merge them into all the counter
>> values in the arrays from the current run (after which we can do the
>> histogramming/working set computation accurately from the merged
>> counters), and the second to rewrite them. In libgcov this doesn't
>> seem like it would be too difficult to do, although it is a little bit
>> of restructuring of the main merging loop and needs some special
>> handling for buffered functions (which could increase the memory
>> footprint a bit if there are many of these since they will all need to
>> be buffered across the iteration over all the gcda files).
>>
>> The summary merging in coverage.c confuses me a bit as it seems to be
>> handling the case when there are multiple program summaries in a
>> single gcda file. When would this happen? It looks like the merge
>> handling in libgcov should always produce a single program summary per
>> gcda file.
>
>
> This is something Nathan introduced years back. The idea was IMO to handle
> more acurately objects linked into multiple binaries. I am not sure
> if the code really works or worked on some point.
>
> The idea, as I recall it, was to produce overall checksum of all objects and
> have different profile records for each combination.
>
> This is not really useful for profile feedback as you generate single object
> file for all uses, but it might become useful for LTO where you know into which
> binary you are linking to. I am not really sure it is worth all the infrastructure
> needed to support this though.

Ok, so perhaps for the merging in coverage.c we can do something less
accurate, and worry more about the libgcov merging.

>>
>> >
>> >
>> > Why you don't simply write the histogram into gcov file and don't merge the values
>> > here (i.e. doing the cummulation loop in GCC instead of within libgcov)?
>>
>> That doesn't completely solve the problem, unfortunately. The reason
>> is that you don't know which histogram entry contains a particular
>> block each run, and therefore you don't know how to compute the new
>> combined histogram value and index for that bb counter. For example, a
>> particular histogram index may have 5 counters (bbs) in it for one run
>> and 2 counters (bbs) in it for a second run, so the question is how to
>> compute the new entry for all of those bb counters, as the 5 bbs from
>> the first run may or may not be a superset of the 2 from the second
>> run. You could assume that the bbs have the same relative order of
>> hotness between runs, and combine the bb counters accordingly, but
>> there is still some trickiness because you have to apportion the
>> cumulative counter sum stored in the histogram entry between new
>> entries. For example, assume the highest indexed non-zero entries (the
>> histogram buckets containing the largest counter values) in the two
>> incoming histograms are:
>>
>> histogram 1:
>>
>> index 100: 4 counters, cumulative value X, min counter value minx
>> ...
>>
>> histogram 2:
>>
>> index 100: 2 counters, cumulative value Y, min counter value miny
>> index 99: 3 counters, cumulative value W, min counter value minw
>> ...
>>
>> To merge, you could assume something like 2 counters with a new
>> cumulative value (Y + X*2/4), and new min counter value minx+miny,
>> that go into the merged histogram with the index corresponding to
>> counter value minx+miny. And then 2 counters have a new cumulative
>> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
>> into the merged histogram with index corresponding to counter value
>> minw+minx. Etc... Not entirely accurate, although it might be a
>> reasonable approximation, but it also requires a number of division
>> operations during the merge in libgcov.
>>
>> Another possibility, that might also provide a reasonable
>> approximation, would be to scale the min_bb_counter values in the
>> working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
>> is the new sum_all, and sum_all_orig is the sum_all from the summary
>> whose working_set was chosen to be propagated to the new merged
>> summary. This also requires some divisions at libgcov merge time,
>> unless we save the original sum_all along with the working set in the
>> summary and do the scaling at profile-use time.
>>
>> > By default you are streaming 128 values, which is the same as is needed to stream the histogram.
>> > I suppose we can have an environment variable to reduce the histogram size - I guess in smaller
>> > setups a smaller histogram will run just fine...
>>
>> It is a bit more, as the histogram will have 64*4 = 256 entries for
>> 64-bit counters, and each entry contains 2 gcov_type counter values
>> and one unsigned int. The working set entries only have one gcov_type
>> counter and one unsigned. So it could be close to 4x.
>>
>> What do you think?
>
> So we have options
>   1) ignore the problem that summaries become inaccurate with many train runs
>      as we do now.

Or 1b) do some sum_all based scaling of the min_bb_counter values as I
describe a couple of paragraphs above.
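
For concreteness, the 1b scaling could be sketched roughly as below (the struct layout and field names are made up for illustration, not the actual summary format in the patch):

```c
#include <stdint.h>

typedef int64_t gcov_type;

/* Hypothetical working-set entry: the number of hot counters and the
   minimum counter value included in this percentage-of-sum_all bucket.
   The real summary layout in the patch may differ.  */
typedef struct
{
  unsigned num_counters;
  gcov_type min_counter;
} working_set_t;

/* Scale the min counter values of the propagated working set by the
   ratio of the merged sum_all to the sum_all of the summary the
   working set was taken from.  Plain integer division; the round-off
   is part of the approximation being discussed.  */
static void
scale_working_set (working_set_t *ws, unsigned n,
                   gcov_type sum_all_merged, gcov_type sum_all_orig)
{
  for (unsigned i = 0; i < n; i++)
    ws[i].min_counter = ws[i].min_counter * sum_all_merged / sum_all_orig;
}
```

Doing this at profile-use time (with the original sum_all saved alongside the working set) would avoid the divisions in libgcov at merge time.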

>   2) write histogram only, do not track BB counts or approximate them by scaling
>      and perhaps retire max_counter from the summary in favour of the histogram estimate

By max_counter do you mean sum_max?

>
>      We will pay by more data being written (I think the histogram may actually compress pretty
>      well by skipping zero entries, don't know) and by getting only an estimated max_count
>      from the histogram, which should still be good enough in practice (max_count is used only
>      for the hot/cold decisions and those will be histogram based anyway)

Regarding the data being written, I think we can minimize this since
the histogram should be pretty sparse. So store only non-zero entries.
The histogram indices can be recomputed on read from the min counter
value stored in the entry.
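
Since the bucket index would depend only on the counter value, it can indeed be recomputed on read from the stored min counter value. A rough sketch of such a log2-based index function, assuming four linear sub-buckets per power of two (64 * 4 = 256 buckets for 64-bit counters, matching the sizing mentioned above; this is an illustration, not the exact libgcov code):

```c
#include <stdint.h>

typedef int64_t gcov_type;

/* Hypothetical index function for a log2-based histogram with four
   linear sub-buckets per power of two.  The index depends only on the
   counter value, so a reader can reconstruct the bucket position of a
   streamed entry from the min counter value it stores.  */
static unsigned
histogram_index (gcov_type value)
{
  unsigned lg = 0;
  gcov_type v = value;

  while (v >>= 1)                /* lg = floor (log2 (value)), 0 for 0.  */
    lg++;
  if (lg < 2)
    return (unsigned) value;     /* Values 0-3 map to buckets 0-3.  */
  /* Four sub-buckets per power of two: use the two bits just below
     the leading one bit.  */
  return 4 * (lg - 1) + (unsigned) ((value >> (lg - 2)) & 3);
}
```

The function is monotone in the counter value, so skipping zero buckets on write loses no information.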

>   3) do two-stage processing, reading data into memory first, producing summaries,
>      and writing them next.
>
> 2) seems appealing to me because it is simple, but there are limitations in
> what it can handle.
> 3) solves precision problems, but how do we handle the locking & races?  In the
> case like bootstrap where GCCs are executed in parallel we will end up with
> random results if the files get modified by another profiled run of GCC in
> between read in and write out.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does
> it work for LIPO?

I think the locking issue is reduced for LIPO because each gcda file
is read/merged/written in a single pass as it is on trunk, and the
lipo module info is written subsequently to a separate file. Although
that could certainly get a little stale if there are many simultaneous
profile runs merging the files.

To address this for the working set summary info, I think we would
have to maintain the current per-gcdafile read/merge/write for the
counters but not write any working set summary info on the first pass.
Then compute the working set summary and rewrite all the gcda
files with it. The working set summary may get a little stale before
it is written (just as in the lipo case for the module info), although
at least the individual counters will not. Since the individual
counters will get merged properly, there may be a good chance that one
of the final profile runs will compute an accurate working set as it
is finishing up.

If this approach seems like it is feasible, then we could stick with
the current approach of emitting the working set array in the summary,
mitigating it somewhat by doing the sum_all based scaling of the
counter values, then in a follow-on patch restructure the merging code
to delay the working set computation as described above.

Otherwise, either go with the approach of scaling the counter values
using the sum_all ratio, or encode/merge the histograms instead.

Teresa

>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Steven Bosscher - Aug. 20, 2012, 3:35 p.m.
On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The summary merging in coverage.c confuses me a bit as it seems to be
>> handling the case when there are multiple program summaries in a
>> single gcda file. When would this happen? It looks like the merge
>> handling in libgcov should always produce a single program summary per
>> gcda file.

AFAIU it merges existing profile data with new profile data from a
second (or third, or ...) trial run.

Ciao!
Steven
Xinliang David Li - Aug. 20, 2012, 4:57 p.m.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does
> it work for LIPO?
>

This (lack of locking, races) is not a new problem. There is no
synchronization in libgcov for profile update/merge at either the thread or
process level. Thread-level data races lead to inconsistent counters,
but these can be mostly corrected under -fprofile-correction and smoothing.
There is also a problem with false indirect call targets --- gcc has a
mechanism to filter those out.

Process level synchronization problems can happen when two processes
(running the instrumented binary) exit at the same time. The
updated/merged counters from one process may be overwritten by another
process -- this is true for both counter data and summary data.
Solution 3) does not introduce any new problems.

thanks,

David



> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Xinliang David Li - Aug. 20, 2012, 5:04 p.m.
> If this approach seems like it is feasible, then we could stick with
> the current approach of emitting the working set array in the summary,
> mitigating it somewhat by doing the sum_all based scaling of the
> counter values, then in a follow-on patch restructure the merging code
> to delay the working set computation as described above.

+1

David

> Teresa
>
>>
>> Honza
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >
>>> > Honza
>>>
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson - Aug. 20, 2012, 5:44 p.m.
On Mon, Aug 20, 2012 at 8:35 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> The summary merging in coverage.c confuses me a bit as it seems to be
>>> handling the case when there are multiple program summaries in a
>>> single gcda file. When would this happen? It looks like the merge
>>> handling in libgcov should always produce a single program summary per
>>> gcda file.
>
> AFAIU it merges existing profile data with new profile data from a
> second (or third, or ...) trial run.

It looks like libgcov will always merge program summaries between
different runs though. As far as I can tell, it will always either
rewrite the whole gcda file with a single merged program summary, or
abort the write if the merge was not successful. However, the comment
above gcov_exit does talk about keeping program summaries separate for
object files contained in different programs, which is what Honza was
describing as the motivation for the coverage.c merging. But I don't
see where multiple program summaries ever get written to the same gcda
file - if the checksums are different it seems to abort the write. But
the code in coverage.c is dealing with a single gcda file containing
multiple program summaries. Is there code somewhere that will cause
this to happen?

Thanks,
Teresa

>
> Ciao!
> Steven
Steven Bosscher - Aug. 20, 2012, 5:51 p.m.
On Mon, Aug 20, 2012 at 7:44 PM, Teresa Johnson <tejohnson@google.com> wrote:
> But
> the code in coverage.c is dealing with a single gcda file containing
> multiple program summaries. Is there code somewhere that will cause
> this to happen?

Not that I know of, no.
(There are enhancement requests for this, see e.g. PR47618).

Ciao!
Steven
Andi Kleen - Aug. 20, 2012, 7:03 p.m.
Xinliang David Li <davidxl@google.com> writes:
>
> Process level synchronization problems can happen when two processes
> (running the instrumented binary) exit at the same time. The
> updated/merged counters from one process may be overwritten by another
> process -- this is true for both counter data and summary data.
> Solution 3) does not introduce any new problems.

You could just use lockf() ?

-Andi
Xinliang David Li - Aug. 20, 2012, 7:58 p.m.
I was mistaken here -- gcov_open actually uses locking via the fcntl
interface -- so we do need to find a way to solve the summary update
synchronization problem when it is separated out of the per-file update
loop.

David

On Mon, Aug 20, 2012 at 12:03 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Xinliang David Li <davidxl@google.com> writes:
>>
>> Process level synchronization problems can happen when two processes
>> (running the instrumented binary) exit at the same time. The
>> updated/merged counters from one process may be overwritten by another
>> process -- this is true for both counter data and summary data.
>> Solution 3) does not introduce any new problems.
>
> You could just use lockf() ?
>
> -Andi
Jan Hubicka - Aug. 21, 2012, 1:27 a.m.
> Xinliang David Li <davidxl@google.com> writes:
> >
> > Process level synchronization problems can happen when two processes
> > (running the instrumented binary) exit at the same time. The
> > updated/merged counters from one process may be overwritten by another
> > process -- this is true for both counter data and summary data.
> > Solution 3) does not introduce any new problems.
> 
> You could just use lockf() ?

The issue here is holding locks for all the files (there can be many) versus
limits on the number of locks & the possibility of deadlocking (mind that updates
may happen in different orders on the same files for different programs built
from the same objects)

For David: there is no thread safety code in mainline for the counters.
Long time ago Zdenek implemented poor man's TLS for counters (before TLS was invented)
http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down
as too memory expensive per thread. We could optionally do atomic updates like ICC
or a combination of both as discussed in the thread.
So far no one has implemented it since the coverage fixups seem to work well enough in
practice for multithreaded programs, where reproducibility does not seem to be _that_
important.

For GCC profiled bootstrap however I would like to see the output binary be
reproducible. We really ought to make profile updates safe for multiple processes.
Trashing a whole process run is worse than racing on an increment. There is a good
chance that one of the runs is more important than the others and it will get trashed.

I do not think we have serious update problems in the summaries at the moment.
We lock individual files as we update them. The summary is simple enough to be safe.
sum_all is summed, max_all is the maximum over the individual runs. Even when you combine
multiple programs the summary will end up the same. Everything except for max_all is ignored
anyway.

Solution 2 (i.e. histogram streaming) will also have the property that it is safe
WRT multiple programs, just like sum_all.

Honza
> 
> -Andi
Teresa Johnson - Aug. 21, 2012, 5:14 a.m.
On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Xinliang David Li <davidxl@google.com> writes:
>> >
>> > Process level synchronization problems can happen when two processes
>> > (running the instrumented binary) exit at the same time. The
>> > updated/merged counters from one process may be overwritten by another
>> > process -- this is true for both counter data and summary data.
>> > Solution 3) does not introduce any new problems.
>>
>> You could just use lockf() ?
>
> The issue here is holding locks for all the files (there can be many) versus
> limits on the number of locks & the possibility of deadlocking (mind that updates
> may happen in different orders on the same files for different programs built
> from the same objects)
>
> For David: there is no thread safety code in mainline for the counters.
> Long time ago Zdenek implemented poor man's TLS for counters (before TLS was invented)
> http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down
> as too memory expensive per thread. We could optionally do atomic updates like ICC
> or a combination of both as discussed in the thread.
> So far no one has implemented it since the coverage fixups seem to work well enough in
> practice for multithreaded programs, where reproducibility does not seem to be _that_
> important.
>
> For GCC profiled bootstrap however I would like to see the output binary be
> reproducible. We really ought to make profile updates safe for multiple processes.
> Trashing a whole process run is worse than racing on an increment. There is a good
> chance that one of the runs is more important than the others and it will get trashed.
>
> I do not think we have serious update problems in the summaries at the moment.
> We lock individual files as we update them. The summary is simple enough to be safe.
> sum_all is summed, max_all is the maximum over the individual runs. Even when you combine
> multiple programs the summary will end up the same. Everything except for max_all is ignored
> anyway.
>
> Solution 2 (i.e. histogram streaming) will also have the property that it is safe
> WRT multiple programs, just like sum_all.

I think the sum_all based scaling of the working set entries also has
this property. What is your opinion on saving the histogram in the
summary and merging histograms together as best as possible compared
to the alternative of saving the working set information as now and
scaling it up by the ratio between the new and old sum_all when
merging?

Thanks,
Teresa

>
> Honza
>>
>> -Andi
Jan Hubicka - Aug. 21, 2012, 5:29 a.m.
> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> Xinliang David Li <davidxl@google.com> writes:
> >> >
> >> > Process level synchronization problems can happen when two processes
> >> > (running the instrumented binary) exit at the same time. The
> >> > updated/merged counters from one process may be overwritten by another
> >> > process -- this is true for both counter data and summary data.
> >> > Solution 3) does not introduce any new problems.
> >>
> >> You could just use lockf() ?
> >
> > The issue here is holding locks for all the files (there can be many) versus
> > limits on the number of locks & the possibility of deadlocking (mind that updates
> > may happen in different orders on the same files for different programs built
> > from the same objects)
> >
> > For David: there is no thread safety code in mainline for the counters.
> > Long time ago Zdenek implemented poor man's TLS for counters (before TLS was invented)
> > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down
> > as too memory expensive per thread. We could optionally do atomic updates like ICC
> > or a combination of both as discussed in the thread.
> > So far no one has implemented it since the coverage fixups seem to work well enough in
> > practice for multithreaded programs, where reproducibility does not seem to be _that_
> > important.
> >
> > For GCC profiled bootstrap however I would like to see the output binary be
> > reproducible. We really ought to make profile updates safe for multiple processes.
> > Trashing a whole process run is worse than racing on an increment. There is a good
> > chance that one of the runs is more important than the others and it will get trashed.
> >
> > I do not think we have serious update problems in the summaries at the moment.
> > We lock individual files as we update them. The summary is simple enough to be safe.
> > sum_all is summed, max_all is the maximum over the individual runs. Even when you combine
> > multiple programs the summary will end up the same. Everything except for max_all is ignored
> > anyway.
> >
> > Solution 2 (i.e. histogram streaming) will also have the property that it is safe
> > WRT multiple programs, just like sum_all.
> 
> I think the sum_all based scaling of the working set entries also has
> this property. What is your opinion on saving the histogram in the

I think the scaling will have at least round-off issues WRT different merging orders.

> summary and merging histograms together as best as possible compared
> to the alternative of saving the working set information as now and
> scaling it up by the ratio between the new and old sum_all when
> merging?

So far I like this option best. But David seems to lean towards the third option with
whole file locking.  I see it may prove to be more extensible in the future.
At the moment I do not understand two things
 1) why do we need info on the number of counters above a given threshold, since the hot/cold
    decisions usually depend purely on the count cutoff.
    Removing those will solve merging issues with variant 2 and then it would probably
    be a good solution.
 2) Do we plan to add some features in the near future that will anyway require global locking?
    I guess LIPO itself does not count since it streams its data into independent file as you
    mentioned earlier and locking LIPO file is not that hard.
    Does LIPO stream everything into that common file, or does it use combination of gcda files
    and common summary?

    What other stuff Google plans to merge?
    (In general I would be curious about merging plans WRT profile stuff, so we get more
    synchronized and effective on getting patches in. We have about two months to get it done
    in stage1 and it would be nice to get as much as possible. Obviously some of the patches will
    need a bit of discussion like this one. Hope you do not find it frustrating, I actually think
    this is an important feature).

I also realized today that the common value counters (used by switch, indirect
call and div/mod value profiling) are non-stable WRT different merging orders
(i.e. parallel make in the train run).  I do not think there is an actual solution to
that except for not merging counter sections of this type in libgcov and
merging them in some canonical order at profile feedback time.  Perhaps we just
want to live with this, since the discrepancy here is small (i.e. these
counters are quite rare and their outcome has just a local effect on the final
binary, unlike the global summaries/edge counters).

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> >>
> >> -Andi
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Xinliang David Li - Aug. 21, 2012, 6:15 a.m.
On Mon, Aug 20, 2012 at 10:29 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> Xinliang David Li <davidxl@google.com> writes:
>> >> >
>> >> > Process level synchronization problems can happen when two processes
>> >> > (running the instrumented binary) exit at the same time. The
>> >> > updated/merged counters from one process may be overwritten by another
>> >> > process -- this is true for both counter data and summary data.
>> >> > Solution 3) does not introduce any new problems.
>> >>
>> >> You could just use lockf() ?
>> >
>> > The issue here is holding locks for all the files (there can be many) versus
>> > limits on the number of locks & the possibility of deadlocking (mind that updates
>> > may happen in different orders on the same files for different programs built
>> > from the same objects)
>> >
>> > For David: there is no thread safety code in mainline for the counters.
>> > Long time ago Zdenek implemented poor man's TLS for counters (before TLS was invented)
>> > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down
>> > as too memory expensive per thread. We could optionally do atomic updates like ICC
>> > or a combination of both as discussed in the thread.
>> > So far no one has implemented it since the coverage fixups seem to work well enough in
>> > practice for multithreaded programs, where reproducibility does not seem to be _that_
>> > important.
>> >
>> > For GCC profiled bootstrap however I would like to see the output binary be
>> > reproducible. We really ought to make profile updates safe for multiple processes.
>> > Trashing a whole process run is worse than racing on an increment. There is a good
>> > chance that one of the runs is more important than the others and it will get trashed.
>> >
>> > I do not think we have serious update problems in the summaries at the moment.
>> > We lock individual files as we update them. The summary is simple enough to be safe.
>> > sum_all is summed, max_all is the maximum over the individual runs. Even when you combine
>> > multiple programs the summary will end up the same. Everything except for max_all is ignored
>> > anyway.
>> >
>> > Solution 2 (i.e. histogram streaming) will also have the property that it is safe
>> > WRT multiple programs, just like sum_all.
>>
>> I think the sum_all based scaling of the working set entries also has
>> this property. What is your opinion on saving the histogram in the
>
> I think the scaling will have at least round-off issues WRT different merging orders.
>
>> summary and merging histograms together as best as possible compared
>> to the alternative of saving the working set information as now and
>> scaling it up by the ratio between the new and old sum_all when
>> merging?
>
> So far I like this option best. But David seems to lean towards the third option with
> whole file locking.  I see it may prove to be more extensible in the future.
> At the moment I do not understand two things
>  1) why do we need info on the number of counters above a given threshold, since the hot/cold
>     decisions usually depend purely on the count cutoff.
>     Removing those will solve merging issues with variant 2 and then it would probably
>     be a good solution.

This is useful for large applications with a long tail. The
instruction working set for those applications is very large, and
the inliner and unroller need to be aware of that so that good heuristics can
be developed to throttle aggressive code-bloating transformations. For
the inliner, it is kind of like the global budget but more application
dependent. In the long run, we will collect more advanced fdo summary
information regarding the working set -- it will be the working set size for each
code region (locality region).


>  2) Do we plan to add some features in near future that will anyway require global locking?
>     I guess LIPO itself does not count since it streams its data into independent file as you
>     mentioned earlier and locking LIPO file is not that hard.
>     Does LIPO stream everything into that common file, or does it use combination of gcda files
>     and common summary?

Actually, LIPO module grouping information is stored in gcda files.
It is also stored in a separate .imports file (one per object) ---
this is primarily used by our build system for dependence information.


>
>     What other stuff Google plans to merge?
>     (In general I would be curious about merging plans WRT profile stuff, so we get more
>     synchronized and effective on getting patches in. We have about two months to get it done
>     in stage1 and it would be nice to get as much as possible. Obviously some of the patches will
>     need a bit of discussion like this one. Hope you do not find it frustrating, I actually think
>     this is an important feature).

We plan to merge in the new LIPO implementation based on LTO
streaming. Rong Xu finished this in 4.6 based compiler, and he needs
to port it to 4.8.


thanks,

David

>
> I also realized today that the common value counters (used by switch, indirect
> call and div/mod value profiling) are non-stable WRT different merging orders
> (i.e.  parallel make in train run).  I do not think there is actual solution to
> that except for not merging the counter section of this type in libgcov and
> merge them in some canonical order at profile feedback time.  Perhaps we just
> want to live with this, since the discrepancy here is small. (i.e. these
> counters are quite rare and their outcome has just local effect on the final
> binary, unlike the global summaries/edge counters).
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> -Andi
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka - Aug. 21, 2012, 6:33 a.m.
> 
> This is useful for large applications with a long tail. The
> instruction working set for those applications is very large, and
> inliner and unroller need to be aware of that and good heuristics can
> be developed to throttle aggressive code bloat transformations. For
> inliner, it is kind of like the global budget but more application
> dependent. In the long run, we will collect more advanced fdo summary
> regarding working set -- it will be working set size for each code
> region (locality region).

I see, so you use it to estimate the size of the working set and the effect of bloating
optimizations on cache size. This sounds interesting. What are your experiences
with this?

What concerns me is that it is greatly inaccurate - you have no idea how many
instructions a given counter is guarding and it can differ quite a lot. Also
inlining/optimization makes working sets significantly different (by a factor of
100 for tramp3d). But on the other hand any solution at this level will be
greatly inaccurate. So I am curious how reliable the data you can get from this is.
How do you take this into account in the heuristics?

It seems to me that for this use perhaps the simple logic in histogram merging
of maximizing the number of BBs for a given bucket will work well?  It is
inaccurate, but we are working with greatly inaccurate data anyway.
Except for degenerate cases, the small and unimportant runs will have small BB
counts, while large runs will have larger counts and those are the ones we optimize
for anyway.
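
A rough sketch of that simple merge, keeping the per-bucket maximum of the BB counts while still summing the cumulative values (the bucket layout here is invented for illustration, not the actual gcov data structures):

```c
#include <stdint.h>

typedef int64_t gcov_type;

/* Hypothetical histogram bucket: number of counters (BBs) that fell
   into this bucket, and their cumulative counter value.  */
typedef struct
{
  unsigned num_counters;
  gcov_type cum_value;
} bucket_t;

/* Bucket-wise merge: for each bucket keep the larger number of
   counters seen in any run, and sum the cumulative values so the
   totals still reflect all runs.  Small, unimportant runs contribute
   small per-bucket counts and are naturally dominated by large runs.  */
static void
merge_histograms (bucket_t *dst, const bucket_t *src, unsigned nbuckets)
{
  for (unsigned i = 0; i < nbuckets; i++)
    {
      if (src[i].num_counters > dst[i].num_counters)
        dst[i].num_counters = src[i].num_counters;
      dst[i].cum_value += src[i].cum_value;
    }
}
```

Unlike the scaling variant, this merge is order-independent for the counts (max is associative and commutative), which sidesteps the round-off concerns above.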
> 
> 
> >  2) Do we plan to add some features in near future that will anyway require global locking?
> >     I guess LIPO itself does not count since it streams its data into independent file as you
> >     mentioned earlier and locking LIPO file is not that hard.
> >     Does LIPO stream everything into that common file, or does it use combination of gcda files
> >     and common summary?
> 
> Actually, LIPO module grouping information is stored in gcda files.
> It is also stored in a separate .imports file (one per object) ---
> this is primarily used by our build system for dependence information.

I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave
on GCC bootstrap? (i.e. it does a lot more work in the libgcov module per each
invocation, so I am curious if it is practically useful at all).

With an LTO-based solution a lot can probably be pushed to link time? Before
the actual GCC starts from the linker plugin, the LIPO module can read gcov CFGs from
gcda files and do all the merging/updating/CFG construction that is currently
performed at runtime, right?
> 
> 
> >
> >     What other stuff Google plans to merge?
> >     (In general I would be curious about merging plans WRT profile stuff, so we get more
> >     synchronized and effective on getting patches in. We have about two months to get it done
> >     in stage1 and it would be nice to get as much as possible. Obviously some of the patches will
> >     need a bit of discussion like this one. Hope you do not find it frustrating, I actually think
> >     this is an important feature).
> 
> We plan to merge in the new LIPO implementation based on LTO
> streaming. Rong Xu finished this in 4.6 based compiler, and he needs
> to port it to 4.8.

Good.  Looks like a lot of work ahead. It would be nice if we could perhaps start
by merging the libgcov infrastructure updates prior to the LIPO changes.  From
what I saw of the LIPO branch some time ago it has a lot of stuff that is not
exactly LIPO specific.

Honza
> 
> 
> thanks,
> 
> David
> 
> >
> > I also realized today that the common value counters (used by switch, indirect
> > call and div/mod value profiling) are non-stanble WRT different merging orders
> > (i.e.  parallel make in train run).  I do not think there is actual solution to
> > that except for not merging the counter section of this type in libgcov and
> > merge them in some canonical order at profile feedback time.  Perhaps we just
> > want to live with this, since the disprepancy here is small. (i.e. these
> > counters are quite rare and their outcome has just local effect on the final
> > binary, unlike the global summaries/edge counters).
> >
> > Honza
> >>
> >> Thanks,
> >> Teresa
> >>
> >> >
> >> > Honza
> >> >>
> >> >> -Andi
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Xinliang David Li - Aug. 21, 2012, 7:14 a.m.
On Mon, Aug 20, 2012 at 11:33 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> This is useful for large applications with a long tail. The
>> instruction working set for those applications is very large, and
>> inliner and unroller need to be aware of that and good heuristics can
>> be developed to throttle aggressive code bloat transformations. For
>> inliner, it is kind of like the global budget but more application
>> dependent. In the long run, we will collect more advanced fdo summary
>> regarding working set -- it will be working set size for each code
>> region (locality region).
>
> I see, so you use it to estimate size of the working set and effect of bloating
> optimizations on cache size. This sounds interesting. What are your experiences
> with this?

Teresa has done some tunings for the unroller so far. The inliner
tuning is the next step.

>
> What concerns me is that it is greatly inaccurate - you have no idea how many
> instructions given counter is guarding and it can differ quite a lot. Also
> inlining/optimization makes working sets significantly different (by factor of
> 100 for tramp3d).

The pre ipa-inline working set is the one that is needed for ipa
inliner tuning. For post-ipa inline code increase transformations,
some update is probably needed.

> But on the other hand any solution at this level will be
> greatly inaccurate. So I am curious how reliable data you can get from this?
> How you take this into account for the heuristics?

This effort is just the first step to allow good heuristics to develop.

>
> It seems to me that for this use perhaps the simple logic in histogram merging
> maximizing the number of BBs for a given bucket will work well?  It is
> inaccurate, but we are working with greatly inaccurate data anyway.
> Except for degenerate cases, the small and unimportant runs will have small BB
> counts, while large runs will have larger counts and those are ones we optimize
> for anyway.

The working set curve for each type of application contains lots of
information that can be mined. The inaccuracy can also be mitigated by
more data 'calibration'.

>>
>>
>> >  2) Do we plan to add some features in near future that will anyway require global locking?
>> >     I guess LIPO itself does not count since it streams its data into independent file as you
>> >     mentioned earlier and locking LIPO file is not that hard.
>> >     Does LIPO stream everything into that common file, or does it use combination of gcda files
>> >     and common summary?
>>
>> Actually, LIPO module grouping information are stored in gcda files.
>> It is also stored in a separate .imports file (one per object) ---
>> this is primarily used by our build system for dependence information.
>
> I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave
> on GCC bootstrap?

We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
main problem for application build -- the link time (for debug build)
is.

> (i.e. it does a lot more work in the libgcov module per each
> invocation, so I am curious if it is practically useful at all).
>
> With LTO based solution a lot can be probably pushed at link time? Before
> actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from
> gcda files and do all the merging/updating/CFG constructions that is currently
> performed at runtime, right?

The dynamic cgraph build and analysis is still done at runtime.
However, with the new implementation, the FE is no longer involved. The
gcc driver is modified to understand module grouping, and LTO is used to
merge the streamed output from aux modules.


David

>>
>>
>> >
>> >     What other stuff Google plans to merge?
>> >     (In general I would be curious about merging plans WRT profile stuff, so we get more
>> >     synchronized and effective on getting patches in. We have about two months to get it done
>> >     in stage1 and it would be nice to get as much as possible. Obviously some of the patches will
>> >     need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think
>> >     this is an important feature).
>>
>> We plan to merge in the new LIPO implementation based on LTO
>> streaming. Rong Xu finished this in 4.6 based compiler, and he needs
>> to port it to 4.8.
>
> Good.  Looks like a lot of work ahead. It would be nice if we can perhaps start
> by merging the libgcov infrastructure updates prior the LIPO changes.  From
> what I saw at LIPO branch some time ago it has a lot of stuff that is not
> exactly LIPO specific.
>
> Honza
>>
>>
>> thanks,
>>
>> David
>>
>> >
>> > I also realized today that the common value counters (used by switch, indirect
>> > call and div/mod value profiling) are non-stanble WRT different merging orders
>> > (i.e.  parallel make in train run).  I do not think there is actual solution to
>> > that except for not merging the counter section of this type in libgcov and
>> > merge them in some canonical order at profile feedback time.  Perhaps we just
>> > want to live with this, since the disprepancy here is small. (i.e. these
>> > counters are quite rare and their outcome has just local effect on the final
>> > binary, unlike the global summaries/edge counters).
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> >
>> >> > Honza
>> >> >>
>> >> >> -Andi
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka - Aug. 21, 2012, 7:34 a.m.
> Teresa has done some tunings for the unroller so far. The inliner
> tuning is the next step.
> 
> >
> > What concerns me that it is greatly inaccurate - you have no idea how many
> > instructions given counter is guarding and it can differ quite a lot. Also
> > inlining/optimization makes working sets significantly different (by factor of
> > 100 for tramp3d).
> 
> The pre ipa-inline working set is the one that is needed for ipa
> inliner tuning. For post-ipa inline code increase transformations,
> some update is probably needed.
> 
> >But on the ohter hand any solution at this level will be
> > greatly inaccurate. So I am curious how reliable data you can get from this?
> > How you take this into account for the heuristics?
> 
> This effort is just the first step to allow good heuristics to develop.
> 
> >
> > It seems to me that for this use perhaps the simple logic in histogram merging
> > maximizing the number of BBs for given bucket will work well?  It is
> > inaccurate, but we are working with greatly inaccurate data anyway.
> > Except for degenerated cases, the small and unimportant runs will have small BB
> > counts, while large runs will have larger counts and those are ones we optimize
> > for anyway.
> 
> The working set curve for each type of applications contains lots of
> information that can be mined. The inaccuracy can also be mitigated by
> more data 'calibration'.

Sure, I think I am leaning towards trying solution 2) with maximizing
counter count merging (it would probably make sense to rename it from "BB
count", since it is not really a BB count and the name is misleading), and
we will see how well it works in practice.

We get the benefit of far fewer issues with profile locking/unlocking, and we
lose a bit of precision on BB counts. I tend to believe that the error will not
be that important in practice. Another cost is more histogram streaming into
each gcda file, but with skipping of zero entries it should not be a major
overhead problem, I hope.

What do you think?
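To make the proposal concrete, the order-independent merging rule sketched above (take the maximum per-bucket counter count, sum the per-bucket cumulative values) could look roughly like this. The struct layout and names here are invented for illustration and are not the actual gcov data structures:

```c
#include <stdint.h>

/* One bucket of a log2-based counter histogram: bucket i roughly covers
   counter values in [2^i, 2^(i+1)).  Illustrative fields only.  */
struct hist_bucket
{
  unsigned num_counters;   /* how many counters fell into this bucket */
  uint64_t min_value;      /* smallest counter value seen in the bucket */
  uint64_t cum_value;      /* sum of all counter values in the bucket */
};

#define NUM_BUCKETS 64

/* Merge SRC into DST so the result does not depend on which train run
   merges last: maximize the counter count per bucket, sum the cumulative
   values, and keep the smallest nonzero minimum.  */
static void
merge_histograms (struct hist_bucket *dst, const struct hist_bucket *src)
{
  for (int i = 0; i < NUM_BUCKETS; i++)
    {
      if (src[i].num_counters > dst[i].num_counters)
        dst[i].num_counters = src[i].num_counters;
      dst[i].cum_value += src[i].cum_value;
      if (dst[i].min_value == 0
          || (src[i].min_value && src[i].min_value < dst[i].min_value))
        dst[i].min_value = src[i].min_value;
    }
}
```

The max-count rule loses information when two runs put the same blocks in different buckets, which is exactly the inaccuracy discussed in this thread.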
> 
> >>
> >>
> >> >  2) Do we plan to add some features in near future that will anyway require global locking?
> >> >     I guess LIPO itself does not count since it streams its data into independent file as you
> >> >     mentioned earlier and locking LIPO file is not that hard.
> >> >     Does LIPO stream everything into that common file, or does it use combination of gcda files
> >> >     and common summary?
> >>
> >> Actually, LIPO module grouping information are stored in gcda files.
> >> It is also stored in a separate .imports file (one per object) ---
> >> this is primarily used by our build system for dependence information.
> >
> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave
> > on GCC bootstrap?
> 
> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
> main problem for application build -- the link time (for debug build)
> is.

I was primarily curious how LIPO's runtime analysis fares in the situation where
you do very many small train runs on a rather large app (sure, GCC is small
compared to Google's use case ;).
> 
> > (i.e. it does a lot more work in the libgcov module per each
> > invocation, so I am curious if it is practically useful at all).
> >
> > With LTO based solution a lot can be probably pushed at link time? Before
> > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from
> > gcda files and do all the merging/updating/CFG constructions that is currently
> > performed at runtime, right?
> 
> The dynamic cgraph build and analysis is still done at runtime.
> However, with the new implementation, FE is no longer involved. Gcc
> driver is modified to understand module grouping, and lto is used to
> merge the streamed output from aux modules.

I see. Are there any fundamental reasons why it cannot be done at link time,
when all gcda files are available? Why is the grouping not done inside the
linker plugin?

Honza
> 
> 
> David
Andi Kleen - Aug. 21, 2012, 2:59 p.m.
> The issue here is holding lock for all the files (that can be many) versus
> number of locks limits & possibilities for deadlocking (mind that updating
> may happen in different orders on the same files for different programs built
> from same objects)

lockf typically has a deadlock detector, and will error out.

-Andi
Xinliang David Li - Aug. 21, 2012, 5:10 p.m.
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Teresa has done some tunings for the unroller so far. The inliner
>> tuning is the next step.
>>
>> >
>> > What concerns me that it is greatly inaccurate - you have no idea how many
>> > instructions given counter is guarding and it can differ quite a lot. Also
>> > inlining/optimization makes working sets significantly different (by factor of
>> > 100 for tramp3d).
>>
>> The pre ipa-inline working set is the one that is needed for ipa
>> inliner tuning. For post-ipa inline code increase transformations,
>> some update is probably needed.
>>
>> >But on the ohter hand any solution at this level will be
>> > greatly inaccurate. So I am curious how reliable data you can get from this?
>> > How you take this into account for the heuristics?
>>
>> This effort is just the first step to allow good heuristics to develop.
>>
>> >
>> > It seems to me that for this use perhaps the simple logic in histogram merging
>> > maximizing the number of BBs for given bucket will work well?  It is
>> > inaccurate, but we are working with greatly inaccurate data anyway.
>> > Except for degenerated cases, the small and unimportant runs will have small BB
>> > counts, while large runs will have larger counts and those are ones we optimize
>> > for anyway.
>>
>> The working set curve for each type of applications contains lots of
>> information that can be mined. The inaccuracy can also be mitigated by
>> more data 'calibration'.
>
> Sure, I think I am leaning towards trying the solution 2) with maximizing
> counter count merging (probably it would make sense to rename it from BB count
> since it is not really BB count and thus it is misleading) and we will see how
> well it works in practice.
>
> We have benefits of much fewer issues with profile locking/unlocking and we
> lose bit of precision on BB counts. I tend to believe that the error will not
> be that important in practice. Another loss is more histogram streaming into
> each gcda file, but with skiping zero entries it should not be major overhead
> problem I hope.
>
> What do you think?
>>
>> >>
>> >>
>> >> >  2) Do we plan to add some features in near future that will anyway require global locking?
>> >> >     I guess LIPO itself does not count since it streams its data into independent file as you
>> >> >     mentioned earlier and locking LIPO file is not that hard.
>> >> >     Does LIPO stream everything into that common file, or does it use combination of gcda files
>> >> >     and common summary?
>> >>
>> >> Actually, LIPO module grouping information are stored in gcda files.
>> >> It is also stored in a separate .imports file (one per object) ---
>> >> this is primarily used by our build system for dependence information.
>> >
>> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave
>> > on GCC bootstrap?
>>
>> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
>> main problem for application build -- the link time (for debug build)
>> is.
>
> I was primarily curious how the LIPOs runtime analysis fare in the situation where
> you do very many small train runs on rather large app (sure GCC is small to google's
> use case ;).


There will be a race, but as Teresa mentioned, there is a good chance
that the process which finishes the merge last is also the final
overrider of the LIPO summary data.


>>
>> > (i.e. it does a lot more work in the libgcov module per each
>> > invocation, so I am curious if it is practically useful at all).
>> >
>> > With LTO based solution a lot can be probably pushed at link time? Before
>> > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from
>> > gcda files and do all the merging/updating/CFG constructions that is currently
>> > performed at runtime, right?
>>
>> The dynamic cgraph build and analysis is still done at runtime.
>> However, with the new implementation, FE is no longer involved. Gcc
>> driver is modified to understand module grouping, and lto is used to
>> merge the streamed output from aux modules.
>
> I see. Are there any fundamental reasons why it can not be done at link-time
> when all gcda files are available?

For build parallelism, the decision should be made as early as
possible -- that is what makes LIPO 'light'.

> Why the grouping is not done inside linker
> plugin?

It is not delayed until link time. In fact the linker plugin is not even involved.

David


>
> Honza
>>
>>
>> David
Jan Hubicka - Aug. 22, 2012, 1:56 a.m.
> > I can go ahead with the histogram approach. There is some roundoff
> > error from the working set scaling approach that can affect different
> > merging orders as you note, although I think this only really affects the
> > small counter values. The other place where saving/merging the histogram

Do you have any intuition on why simple maximalization merging (which is safe wrt
ordering) would be a bad idea?

We care only about the working set size around the top of the histogram, and I
would say that we should optimize for the largest (in the number of blocks in the
hot area) of the train runs.  One way things will get messed up is when the
working set is about the same but the runs are of different sizes, so the blocks
get accounted into two different buckets.

But in general I do not think there is a reasonably accurate way to merge the
profiles without actually streaming out all counter IDs in every bucket, so perhaps
this will work well enough.  If not, we can in the future introduce a per-program
global summary file that will contain the counters to be merged accurately.

> > would help is when the distribution of counter values in the histogram
> > varies between runs, say for example, the hottest loop is much hotter in a
> > subsequent run, but the rest of the counter values stay largely consistent.
> > Note, however, that if the hotspots are different in different runs, then
> > merging either the histogram or the working set will have issues. The only
> > way to detect this is to recompute the histogram/working set from scratch
> > from the merged counters.
> >
> > I wonder in practice, even when there are a lot of simultaneous runs going
> > like in a gcc bootstrap, if we could get reasonably accurate summary
> > recomputation without global locking. The reason is that as long as the
> > actual counter updates are safe as they are now due to the individual file
> > locking, the inaccuracy in the recomputed summary information will not grow
> > over time, and there is a reasonable chance that the last run to complete
> > and merge will recompute the summary from the final merged counter values
> > and get it right (or only be off by a little bit if there are a couple of
> > runs completing simultaneously at the end). But this can be investigated as
> > a follow on to this patch.
> >
> 
> 
> The main concern is probably the build reproducibility in gcc bootstrap
> with FDO.

Hmm, you mean in the first pass update every file with new counters and in the second
pass just update the summaries?
OK, so I guess we went through
 1) two-pass updating with a race in between passes.
 2) two-pass updating with the first pass updating counters and the second having a race only for the summary update
    (i.e. no races for counters).
 3) two-pass updating with flocking (and some way to handle detected deadlocks).
 4) one-pass updating with histogram merging + maximalization of the working set
    (we do not really need to scale the buckets; we can simply merge the histograms
     and then multiply them by nruns before comparing to actual counters.
     This assumes that the working sets of all runs are about the same, but it
     should work reasonably in practice, I think).

I guess 3/4 are acceptable WRT bootstrap reproducibility.

I have no experience with flocking a large number of files, or with the
portability of this solution, e.g. to Windows.  If you think that 2) would be
too inaccurate in practice and 3) has a chance of being portable, we could go
for this.  It will solve the precision problems and will also work for LIPO
summaries.  I would be curious about the effect on profiledbootstrap time if
you implement it.

Honza
> 
> David
> 
> 
> 
> >
> > Thanks,
> > Teresa
> >
> > >
> >> > >>
> >> > >>
> >> > >> >  2) Do we plan to add some features in near future that will
> >> anyway require global locking?
> >> > >> >     I guess LIPO itself does not count since it streams its data
> >> into independent file as you
> >> > >> >     mentioned earlier and locking LIPO file is not that hard.
> >> > >> >     Does LIPO stream everything into that common file, or does it
> >> use combination of gcda files
> >> > >> >     and common summary?
> >> > >>
> >> > >> Actually, LIPO module grouping information are stored in gcda files.
> >> > >> It is also stored in a separate .imports file (one per object) ---
> >> > >> this is primarily used by our build system for dependence
> >> information.
> >> > >
> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does
> >> LIPO behave
> >> > > on GCC bootstrap?
> >> >
> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
> >> > main problem for application build -- the link time (for debug build)
> >> > is.
> >>
> >> I was primarily curious how the LIPOs runtime analysis fare in the
> >> situation where
> >> you do very many small train runs on rather large app (sure GCC is small
> >> to google's
> >> use case ;).
> >> >
> >> > > (i.e. it does a lot more work in the libgcov module per each
> >> > > invocation, so I am curious if it is practically useful at all).
> >> > >
> >> > > With LTO based solution a lot can be probably pushed at link time?
> >> Before
> >> > > actual GCC starts from the linker plugin, LIPO module can read gcov
> >> CFGs from
> >> > > gcda files and do all the merging/updating/CFG constructions that is
> >> currently
> >> > > performed at runtime, right?
> >> >
> >> > The dynamic cgraph build and analysis is still done at runtime.
> >> > However, with the new implementation, FE is no longer involved. Gcc
> >> > driver is modified to understand module grouping, and lto is used to
> >> > merge the streamed output from aux modules.
> >>
> >> I see. Are there any fundamental reasons why it can not be done at
> >> link-time
> >> when all gcda files are available? Why the grouping is not done inside
> >> linker
> >> plugin?
> >>
> >> Honza
> >> >
> >> >
> >> > David
> >>
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer |  tejohnson@google.com |  408-460-2413
> >
Teresa Johnson - Aug. 22, 2012, 5:33 a.m.
On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > I can go ahead with the histogram approach. There is some roundoff
>> > error from the working set scaling approach that can affect different
>> > merging orders as you note, although I think this only really affects the
>> > small counter values. The other place where saving/merging the histogram
>
> Do you have any intuition on why simple maximalization merging (that is safe wrt
> ordering) would be bad idea?

When you say "maximalization merging" are you talking about the
histogram merging approach I mentioned a few emails back (my response
on Aug 19) where we assume the same relative order of hotness in the
counters between runs, and accumulate the counter values in the
histogram in that order?

This would be inaccurate if different runs exercised different areas
of the code, and thus the counters would be ordered in the histogram
differently.

>
> We care only about working set size around top of the histogram and I would say

For optimizations that care about the boundary between hot and cold, such as
code layout, I think we will also care about the smaller values in the
histogram (to get a good idea of what constitutes a cold block counter value).
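To illustrate how such a hot/cold boundary could be read out of a log2-based histogram like the one in the patch, here is a rough sketch. It works at whole-bucket granularity (the real summary code interpolates more finely), and the struct and function names are invented for this example:

```c
#include <stdint.h>

/* Bucket i roughly covers counter values in [2^i, 2^(i+1)).
   Illustrative layout only, not the actual gcov structures.  */
struct hist_bucket { unsigned num_counters; uint64_t min_value; uint64_t cum_value; };
#define NUM_BUCKETS 64

/* One working-set data point: how many hot counters cover PCT percent of
   sum_all, and the smallest counter value among them.  Counters below
   min_counter are candidates for "cold" treatment.  */
struct working_set { unsigned num_hot_counters; uint64_t min_counter; };

static struct working_set
compute_working_set (const struct hist_bucket *h, uint64_t sum_all, double pct)
{
  struct working_set ws = { 0, 0 };
  uint64_t cutoff = (uint64_t) (sum_all * pct / 100.0), sum = 0;

  /* Walk from the hottest bucket down until the accumulated cumulative
     value reaches the requested fraction of sum_all.  */
  for (int i = NUM_BUCKETS - 1; i >= 0; i--)
    {
      if (!h[i].num_counters)
        continue;
      sum += h[i].cum_value;
      ws.num_hot_counters += h[i].num_counters;
      ws.min_counter = h[i].min_value;
      if (sum >= cutoff)
        break;
    }
  return ws;
}
```

At a low percentage only the hottest buckets are consumed; near 100% the walk reaches down into the small-valued buckets, which is why the accuracy of those buckets matters for the cold boundary.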

> that we should sort of optimize for the largest (in the number of blocks in hot
> area) of the train runs.  One way where things will get messed up is when the
> working set is about the same but the runs are of different size so all the
> blocks gets accounted into two different buckets.

I'm not sure I understand the last sentence - is this something that
would not get handled by merging the histogram entries as I described
earlier? Or do you have a different merging approach in mind?

>
> But in general I do not think there is resonably accurate way to merge the
> profiles without actually streaming out all counter IDs in every bucket, so perhaps
> this will work well enough.  If not, we can in future introduce per-program global
> summary file that will contain the counters to be merged acurately.

Sounds good.

>
>> > would help is when the distribution of counter values in the histogram
>> > varies between runs, say for example, the hottest loop is much hotter in a
>> > subsequent run, but the rest of the counter values stay largely consistent.
>> > Note, however, that if the hotspots are different in different runs, then
>> > merging either the histogram or the working set will have issues. The only
>> > way to detect this is to recompute the histogram/working set from scratch
>> > from the merged counters.
>> >
>> > I wonder in practice, even when there are a lot of simultaneous runs going
>> > like in a gcc bootstrap, if we could get reasonably accurate summary
>> > recomputation without global locking. The reason is that as long as the
>> > actual counter updates are safe as they are now due to the individual file
>> > locking, the inaccuracy in the recomputed summary information will not grow
>> > over time, and there is a reasonable chance that the last run to complete
>> > and merge will recompute the summary from the final merged counter values
>> > and get it right (or only be off by a little bit if there are a couple of
>> > runs completing simultaneously at the end). But this can be investigated as
>> > a follow on to this patch.
>> >
>>
>>
>> The main concern is probably the build reproducibility in gcc bootstrap
>> with FDO.
>
> Hmm, you mean in the first pass update every file with new counters and in the second
> pass just update the summaries?

Right, that's what I had in mind (what you have described in #2 below).

> OK, so I guess we went through
>  1) two pass updating with race in between pases.
>  2) two pass updating with first pass updating countes and second having race only for summary update.
>     (i.e. no races for counters)
>  3) two pass updating with flocking (and some way to handle detected deadlocks)
>  4) one pass updating with histogram merging + maximalization of working set.
>     (we do not realy need to scale the buckets, we can simply merge the histograms
>      and then mutiply them by nruns before comparing to actual counters.

By merging the histograms (and accumulating the counter values stored
there as we merge), I don't think we need to multiply the counter
values by nruns, do we?

>      This assumes that working sets of all runs are about the same, but should work
>      resonably in practice I think.
>
> I guess 3/4 are acceptable WRT bootstrap reproducibility.
>
> I have no experience with flocking large number of files and portability of
> this solution i.e.  to Windows.  If you think that 2) would be too inaccurate
> in practice and 3) has chance to be portable, we could go for this.  It will
> solve the precision problems and will also work for LIPO summaries.
> I would be curious about effect on profiledbootstrap time of this if you implement
> it.

I'm hoping that 2) will be accurate enough in practice, but it will
need some investigation.

Thanks,
Teresa

>
> Honza
>>
>> David
>>
>>
>>
>> >
>> > Thanks,
>> > Teresa
>> >
>> > >
>> >> > >>
>> >> > >>
>> >> > >> >  2) Do we plan to add some features in near future that will
>> >> anyway require global locking?
>> >> > >> >     I guess LIPO itself does not count since it streams its data
>> >> into independent file as you
>> >> > >> >     mentioned earlier and locking LIPO file is not that hard.
>> >> > >> >     Does LIPO stream everything into that common file, or does it
>> >> use combination of gcda files
>> >> > >> >     and common summary?
>> >> > >>
>> >> > >> Actually, LIPO module grouping information are stored in gcda files.
>> >> > >> It is also stored in a separate .imports file (one per object) ---
>> >> > >> this is primarily used by our build system for dependence
>> >> information.
>> >> > >
>> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does
>> >> LIPO behave
>> >> > > on GCC bootstrap?
>> >> >
>> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
>> >> > main problem for application build -- the link time (for debug build)
>> >> > is.
>> >>
>> >> I was primarily curious how the LIPOs runtime analysis fare in the
>> >> situation where
>> >> you do very many small train runs on rather large app (sure GCC is small
>> >> to google's
>> >> use case ;).
>> >> >
>> >> > > (i.e. it does a lot more work in the libgcov module per each
>> >> > > invocation, so I am curious if it is practically useful at all).
>> >> > >
>> >> > > With LTO based solution a lot can be probably pushed at link time?
>> >> Before
>> >> > > actual GCC starts from the linker plugin, LIPO module can read gcov
>> >> CFGs from
>> >> > > gcda files and do all the merging/updating/CFG constructions that is
>> >> currently
>> >> > > performed at runtime, right?
>> >> >
>> >> > The dynamic cgraph build and analysis is still done at runtime.
>> >> > However, with the new implementation, FE is no longer involved. Gcc
>> >> > driver is modified to understand module grouping, and lto is used to
>> >> > merge the streamed output from aux modules.
>> >>
>> >> I see. Are there any fundamental reasons why it can not be done at
>> >> link-time
>> >> when all gcda files are available? Why the grouping is not done inside
>> >> linker
>> >> plugin?
>> >>
>> >> Honza
>> >> >
>> >> >
>> >> > David
>> >>
>> >
>> >
>> >
>> > --
>> > Teresa Johnson | Software Engineer |  tejohnson@google.com |  408-460-2413
>> >
Jan Hubicka - Aug. 22, 2012, 6:18 a.m.
> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> > I can go ahead with the histogram approach. There is some roundoff
> >> > error from the working set scaling approach that can affect different
> >> > merging orders as you note, although I think this only really affects the
> >> > small counter values. The other place where saving/merging the histogram
> >
> > Do you have any intuition on why simple maximalization merging (that is safe wrt
> > ordering) would be bad idea?
> 
> When you say "maximalization merging" are you talking about the
> histogram merging approach I mentioned a few emails back (my response
> on Aug 19) where we assume the same relative order of hotness in the
> counters between runs, and accumulate the counter values in the
> histogram in that order?

I was speaking of BB (counter) counts only here and considering the simplest
strategy: maximalize the BB counts in corresponding buckets and sum the
cumulative values in the corresponding buckets.

The scaling strategy is more sensible, but it is also sensitive to the order
of train runs, i.e. it will give unstable results with parallel make.
> > OK, so I guess we went through
> >  1) two pass updating with race in between pases.
> >  2) two pass updating with first pass updating countes and second having race only for summary update.
> >     (i.e. no races for counters)
> >  3) two pass updating with flocking (and some way to handle detected deadlocks)
> >  4) one pass updating with histogram merging + maximalization of working set.
> >     (we do not realy need to scale the buckets, we can simply merge the histograms
> >      and then mutiply them by nruns before comparing to actual counters.
> 
> By merging the histograms (and accumulating the counter values stored
> there as we merge), I don't think we need to multiply the counter
> values by nruns, do we?

With the simple approach we need, but...
> 
> >      This assumes that working sets of all runs are about the same, but should work
> >      resonably in practice I think.
> >
> > I guess 3/4 are acceptable WRT bootstrap reproducibility.
> >
> > I have no experience with flocking large number of files and portability of
> > this solution i.e.  to Windows.  If you think that 2) would be too inaccurate
> > in practice and 3) has chance to be portable, we could go for this.  It will
> > solve the precision problems and will also work for LIPO summaries.
> > I would be curious about effect on profiledbootstrap time of this if you implement
> > it.
> 
> I'm hoping that 2) will be accurate enough in practice, but it will
> need some investigation.

Perhaps, weighing all the pros/cons, you are right here.  I think getting
results consistent across different orders of runs is more important than the
possibility of a race on the last update, and 4) probably runs into quite a lot
of GIGO issues.

We can implement a more consistent locking mechanism incrementally if this
turns out to be an issue.

It is actually possible to detect the race: in the first round we update
nruns. In the second round we can see whether nruns has changed.  If so, we
have a problem, since someone else has already updated the file and will
update the histograms.

I would suggest skipping the update when nruns has changed, so the summary at
least matches the counters in the current file accurately (even though it will
be off for the whole program, since whoever updated the file may not have seen
all our updates, just part of them).  This is just to reduce the chance that
the trashed run is the most important one. Theoretically we could also make
libgcov re-read all counters in this case and try another update, but I would
not worry about that for the moment.
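The nruns-based race check described above might look roughly like this; the struct and helper names are hypothetical, standing in for the real gcda bookkeeping:

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the relevant part of a gcda file's state.  */
struct gcda_state
{
  uint32_t nruns;   /* run count, bumped by every merging process */
  /* ... counters, summary ... */
};

/* Pass 1 merged the counters and bumped nruns to NRUNS_WE_WROTE.
   Pass 2 re-reads nruns before writing the summary: if it moved in
   between, another process merged after us and will write its own
   summary, so we skip ours to keep the summary consistent with the
   counters that process saw.  Returns 1 if we wrote the summary.  */
static int
update_summary_if_unraced (struct gcda_state *file, uint32_t nruns_we_wrote)
{
  if (file->nruns != nruns_we_wrote)
    {
      fprintf (stderr, "nruns changed (%u -> %u), skipping summary update\n",
               nruns_we_wrote, file->nruns);
      return 0;
    }
  /* ... recompute and write histogram/summary here ... */
  return 1;
}
```

As noted above, this only narrows the window: the last process to finish still decides the final summary, which is the price of avoiding global locking.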

We will still have problems with bootstrap: the summary of libbackend.a will
depend on which of the compiler binaries executed last, since cc1 will have a
different summary from cc1plus because the frontend is different.  I would give
extra points to a volunteer who changes libgcov to keep multiple summaries for
different programs, as intended originally (if you search the archives from
2001-2003, somewhere there are patches that changed libgcov from not merging at
all to merging always. I guess changing it to merge just when the program
checksums match is not that hard).

Thanks,
Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> >>
> >> David
> >>
> >>
> >>
> >> >
> >> > Thanks,
> >> > Teresa
> >> >
> >> > >
> >> >> > >>
> >> >> > >>
> >> >> > >> >  2) Do we plan to add some features in near future that will
> >> >> anyway require global locking?
> >> >> > >> >     I guess LIPO itself does not count since it streams its data
> >> >> into independent file as you
> >> >> > >> >     mentioned earlier and locking LIPO file is not that hard.
> >> >> > >> >     Does LIPO stream everything into that common file, or does it
> >> >> use combination of gcda files
> >> >> > >> >     and common summary?
> >> >> > >>
> >> >> > >> Actually, LIPO module grouping information are stored in gcda files.
> >> >> > >> It is also stored in a separate .imports file (one per object) ---
> >> >> > >> this is primarily used by our build system for dependence
> >> >> information.
> >> >> > >
> >> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does
> >> >> LIPO behave
> >> >> > > on GCC bootstrap?
> >> >> >
> >> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
> >> >> > main problem for application build -- the link time (for debug build)
> >> >> > is.
> >> >>
> >> >> I was primarily curious how the LIPOs runtime analysis fare in the
> >> >> situation where
> >> >> you do very many small train runs on rather large app (sure GCC is small
> >> >> to google's
> >> >> use case ;).
> >> >> >
> >> >> > > (i.e. it does a lot more work in the libgcov module per each
> >> >> > > invocation, so I am curious if it is practically useful at all).
> >> >> > >
> >> >> > > With LTO based solution a lot can be probably pushed at link time?
> >> >> Before
> >> >> > > actual GCC starts from the linker plugin, LIPO module can read gcov
> >> >> CFGs from
> >> >> > > gcda files and do all the merging/updating/CFG constructions that is
> >> >> currently
> >> >> > > performed at runtime, right?
> >> >> >
> >> >> > The dynamic cgraph build and analysis is still done at runtime.
> >> >> > However, with the new implementation, FE is no longer involved. Gcc
> >> >> > driver is modified to understand module grouping, and lto is used to
> >> >> > merge the streamed output from aux modules.
> >> >>
> >> >> I see. Are there any fundamental reasons why it can not be done at
> >> >> link-time
> >> >> when all gcda files are available? Why the grouping is not done inside
> >> >> linker
> >> >> plugin?
> >> >>
> >> >> Honza
> >> >> >
> >> >> >
> >> >> > David
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Teresa Johnson | Software Engineer |  tejohnson@google.com |  408-460-2413
> >> >
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson - Aug. 23, 2012, 2:07 a.m.
On Tue, Aug 21, 2012 at 11:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > I can go ahead with the histogram approach. There is some roundoff
>> >> > error from the working set scaling approach that can affect different
>> >> > merging orders as you note, although I think this only really affects the
>> >> > small counter values. The other place where saving/merging the histogram
>> >
>> > Do you have any intuition on why simple maximalization merging (that is safe wrt
>> > ordering) would be a bad idea?
>>
>> When you say "maximalization merging" are you talking about the
>> histogram merging approach I mentioned a few emails back (my response
>> on Aug 19) where we assume the same relative order of hotness in the
>> counters between runs, and accumulate the counter values in the
>> histogram in that order?
>
> I was speaking of BB (counter) counts only here and considering the simplest
> strategy: maximalize the BB counts in corresponding buckets and sum the cumulative
> values in the corresponding buckets.

I'm concerned about inaccuracies arising when multiple runs have
different sizes and therefore counters are in different buckets in the
corresponding histograms. We would end up in a situation where the
merged histogram has more "counters" in it than there are actual
counters in the program.

>
> The strategy of scaling is more sensible, but it will also be sensitive to order
> of train runs, i.e. it will give unstable results with parallel make.

Regarding the approach of merging histograms by matching up counters
in the same order and apportioning out the histogram cumulative values
evenly between counters in the same bucket before merging, I don't
think the ordering will have a huge effect. The reason is that
counters we end up matching together and accumulating will stay in the
same relative order, and the min counter values are simply summed to
compute the new bucket index and new min counter value to record. For
the cumulative value, since the incoming cumulative values will be
apportioned evenly between the counters coming from the same bucket
when computing the new cumulative value (which is then just another
add), there could be some slight differences from different round-off
errors when histograms are merged in different orders. But I would
expect that to result in only small differences in the cumulative
values in the histogram. Is that acceptable?

>> > OK, so I guess we went through
>> >  1) two pass updating with race in between passes.
>> >  2) two pass updating with first pass updating counters and second having race only for summary update.
>> >     (i.e. no races for counters)
>> >  3) two pass updating with flocking (and some way to handle detected deadlocks)
>> >  4) one pass updating with histogram merging + maximalization of working set.
>> >     (we do not really need to scale the buckets, we can simply merge the histograms
>> >      and then multiply them by nruns before comparing to actual counters.
>>
>> By merging the histograms (and accumulating the counter values stored
>> there as we merge), I don't think we need to multiply the counter
>> values by nruns, do we?
>
> With the simple approach we need, but...
>>
>> >      This assumes that working sets of all runs are about the same, but should work
>> >      reasonably in practice I think.
>> >
>> > I guess 3/4 are acceptable WRT bootstrap reproducibility.
>> >
>> > I have no experience with flocking a large number of files or with the portability of
>> > this solution, i.e. to Windows.  If you think that 2) would be too inaccurate
>> > in practice and 3) has chance to be portable, we could go for this.  It will
>> > solve the precision problems and will also work for LIPO summaries.
>> > I would be curious about the effect on profiledbootstrap time if you implement
>> > it.
>>
>> I'm hoping that 2) will be accurate enough in practice, but it will
>> need some investigation.
>
> Perhaps weighing all pros/cons, you are right here.  I think getting results
> consistent across different order of runs is more important than the
> possibility of race on the last update and 4) is probably getting quite a lot
> of GIGO issues.
>
> We can implement a more consistent locking mechanism incrementally if this turns
> out to be issue.
>
> It is possible to actually detect the race: in the first round we are updating
> the nruns. In the second round we can see if nruns has changed.  If so, we have
> a problem since someone has already updated the file and will update the
> histograms.
>
> I would suggest skipping the update when nruns has changed, so the summary at least
> matches the counters in the current file accurately (even though it will be off for
> the whole program since whoever updated the file may not have seen all our
> updates, just part of them).  Just to reduce the chance that the trashed run is
> the most important one. Theoretically we could also make libgcov re-read all
> counters in this case and try another update, but I would not worry about that
> for the moment.

Ok, I think that makes sense.

If it is ok with you, I'd prefer to go with a 3 step approach:
1) Implement some form of histogram merging (one of the options we
discussed above). (Your approach #4).
2) Implement approach #2 (restructure into 2 passes with counters
being updated accurately and a possible race on the histogram
computation).
3) Investigate and implement either approach #3 (some kind of global
lock) or the race detection you describe above.

>
> We will still have problems with bootstrap: the summary of libbackend.a will
> depend on which of the compiler binaries has executed last, since cc1 will have
> a different summary from cc1plus because the frontend is different.  I would give
> extra points to a volunteer who changes libgcov to do multiple summaries for
> different programs as intended originally (if you search the archives in 2001-2003,
> somewhere there are patches that change libgcov from not merging at all to merging
> always. I guess changing it to merge just when the program checksum matches is not
> that hard).

In this case, I assume that currently the merge will abort because the
number of counters, etc. don't match?

Thanks,
Teresa

Jan Hubicka - Aug. 23, 2012, 2:28 a.m.
> 
> I'm concerned about inaccuracies arising when multiple runs have
> different sizes and therefore counters are in different buckets in the
> corresponding histograms. We would end up in a situation where the
> merged histogram has more "counters" in it than there are actual
> counters in the program.

Yes, it is very inaccurate.
> 
> >
> > The strategy of scaling is more sensible, but it will also be sensitive to order
> > of train runs, i.e. it will give unstable results with parallel make.
> 
> Regarding the approach of merging histograms by matching up counters
> in the same order and apportioning out the histogram cumulative values
> evenly between counters in the same bucket before merging, I don't
> think the ordering will have a huge effect. The reason is that
> counters we end up matching together and accumulating will stay in the
> same relative order, and the min counter values are simply summed to
> compute the new bucket index and new min counter value to record. For
> the cumulative value, since the incoming cumulative values will be
> apportioned evenly between the counters coming from the same bucket
> when computing the new cumulative value (which is then just another
> add), there could be some slight differences from different round off
> errors when histograms are merged in different orders. But I would
> expect that to result in only small differences in the cumulative
> values in the histogram. Is that acceptable?

Yes, I think it will be a small difference and it will give reasonable results in practice.
I am only concerned about reproducibility of the bootstraps that...
> Ok, I think that makes sense.
> 
> If it is ok with you, I'd prefer to go with a 3 step approach:
> 1) Implement some form of histogram merging (one of the options we
> discussed above). (Your approach #4).
> 2) Implement approach #2 (restructure into 2 passes with counters
> being updated accurately and a possible race on the histogram
> computation).
> 3) Investigate and implement either approach #3 (some kind of global
> lock) or the race detection you describe above.

Yes, this sounds like good plan.
> 
> >
> > We will still have problems with bootstrap: the summary of libbackend.a will
> > depend on which of the compiler binaries has executed last, since cc1 will have
> > a different summary from cc1plus because the frontend is different.  I would give
> > extra points to a volunteer who changes libgcov to do multiple summaries for
> > different programs as intended originally (if you search the archives in 2001-2003,
> > somewhere there are patches that change libgcov from not merging at all to merging
> > always. I guess changing it to merge just when the program checksum matches is not
> > that hard).
> 
> In this case, I assume that currently the merge will abort because the
> number of counters, etc. don't match?

We only abort when the number of counters in the current file does not match (i.e. the
gcda file is from a different compilation). We accept it when the overall program is
different, i.e. when libbackend.a is linked into multiple binaries.
Since the histogram summing is global across all profiled libraries, we will end up with
different histograms depending on which program using the given library was executed last.

To get this stable I think it is only possible to go with #3 (i.e. program-wide
global locking) and get different records for different programs (i.e. do a
checksum of all objects libgcov sees and merge only with matching checksums).
In coverage.c we can then either choose the proper record when doing LTO and
knowing which program we are compiling, or do histogram merging your way
but after sorting the entries into some canonical order (i.e. increasing
checksum).

So we can track this incrementally.  I guess you can go with step 1) as you
describe and we can move ahead.

Another thing I have a patch for is to make LTO ipa-profile produce more
accurate histograms based on real instruction counts (here we can read in all
gcda files, turn them into CFG profiles, and do our own summary).  I suppose
that once you merge in your patches I can update this and we will have two
mechanisms, so we will also have data on how your implementation diverges from
the precise solution.

Thanks!
Honza

Patch

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189420)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,252 @@  gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Structure used for each bucket of the log2 histogram of counter values.  */
+
+typedef struct
+{
+  /* The total number of BBs whose profile count falls within the bucket.  */
+  gcov_unsigned_t count;
+  /* The minimum profile count placed in this bucket.  */
+  gcov_type min_value;
+  /* This is the cumulative value of the profile counts in this histogram
+     bucket.  */
+  gcov_type cum_value;
+} bucket_type;
+
+/* For a log2 scale histogram with each range split into 4
+   linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2
+   ranges since the lowest 2 log2 values share the lowest 4 linear
+   sub-range (values 0 - 3).  */
+
+#define HISTOGRAM_SIZE ((GCOV_TYPE_SIZE - 1) * 4)
+
+/* Determine the index into histogram for VALUE. */
+
+static unsigned
+histo_index(gcov_type value)
+{
+  gcov_type_unsigned v = (gcov_type_unsigned)value;
+  unsigned r = 0;
+  unsigned prev2bits = 0;
+
+  /* Find index into log2 scale histogram, where each of the log2
+     sized buckets is divided into 4 linear sub-buckets for better
+     focus in the higher buckets.  */
+
+  /* Find the place of the most-significant bit set.  */
+  if (v > 0)
+    r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v);
+
+  /* If at most the 2 least significant bits are set (value is
+     0 - 3) then that value is our index into the lowest set of
+     four buckets.  */
+  if (r < 2)
+    return (unsigned)value;
+
+  gcc_assert (r < 64);
+
+  /* Find the two next most significant bits to determine which
+     of the four linear sub-buckets to select.  */
+  prev2bits = (v >> (r - 2)) & 0x3;
+  /* Finally, compose the final bucket index from the log2 index and
+     the next 2 bits. The minimum r value at this point is 2 since we
+     returned above if r was 2 or more, so the minimum bucket at this
+     point is 4.  */
+  return (r - 1) * 4 + prev2bits;
+}
+
+/* Insert counter VALUE into HISTOGRAM.  */
+
+static void
+histogram_insert(bucket_type *histogram, gcov_type value)
+{
+  unsigned i;
+
+  i = histo_index(value);
+  gcc_assert (i < HISTOGRAM_SIZE);
+
+  histogram[i].count++;
+  histogram[i].cum_value += value;
+  if (value < histogram[i].min_value)
+    histogram[i].min_value = value;
+}
+
+/* Determines the working set statistics to place in the summary SUM. This is
+   an array of information for a range of percentages of the total execution
+   count (sum_all), and includes the number of counters required to cover that
+   working set percentage and the minimum counter value in that working set.  */
+
+static void
+gcov_compute_working_set (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, ws_ix, ctr_info_ix, ix;
+  gcov_unsigned_t c_num, count;
+  int h_ix;
+  bucket_type histogram[HISTOGRAM_SIZE];
+  gcov_type *working_set_cum_values;
+  gcov_type ws_cum_hotness_incr;
+  gcov_type cum, tmp_cum;
+  char *ws_entries_str;
+  unsigned num_ws_entries;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
+    {
+      histogram[h_ix].count = 0;
+      histogram[h_ix].min_value = cs_ptr->run_max;
+      histogram[h_ix].cum_value = 0;
+    }
+
+#define MAX_WS_SUMMARY_ENTRIES 128
+  ws_entries_str = getenv ("GCOV_WORKING_SET_ENTRIES");
+  if (ws_entries_str && strlen (ws_entries_str))
+    {
+      num_ws_entries = atoi (ws_entries_str);
+      /* Limit the summary information to a reasonable amount.
+         The maximum value gives statistics for more than every 1%
+         of the sum_all, which should be sufficient granularity for
+         optimizations. It also ensures that the last entry of 99.9%
+         is the largest value.  */
+      if (num_ws_entries > MAX_WS_SUMMARY_ENTRIES)
+        num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+    }
+  else
+    num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+
+  cs_ptr->working_set_count = num_ws_entries;
+
+  if (! num_ws_entries)
+    return;
+
+  /* Next fill in an array of the cumulative hotness values corresponding
+     to each working set summary entry we are going to compute below.  */
+  working_set_cum_values
+      = (gcov_type *) malloc (sizeof (gcov_type) * num_ws_entries);
+
+  /* Compute the amount of sum_all that the cumulative hotness grows
+     by in each successive working set entry, which depends on the
+     number of working set entries requested. By default num_ws_entries
+     is a power-of-two so that divide becomes a shift.  */
+  ws_cum_hotness_incr = cs_ptr->sum_all / num_ws_entries;
+
+  /* Skip 0% statistics, which can be extrapolated from the
+     rest of the summary data.  */
+  cum = ws_cum_hotness_incr;
+  for (ws_ix = 0; ws_ix < num_ws_entries; ws_ix++, cum += ws_cum_hotness_incr)
+    working_set_cum_values[ws_ix] = cum;
+  /* The last summary entry is reserved for (roughly) 99.9% of the
+     working set. Divide by 1024 so it becomes a shift, which gives
+     almost exactly 99.9%.  */
+  working_set_cum_values[num_ws_entries-1]
+      = cs_ptr->sum_all - cs_ptr->sum_all/1024;
+
+  /* Next, walk through all the per-object structures and record each of
+     the count values in histogram.  */
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++)
+        {
+          if (gi_ptr->merge[ix])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          for (ix = 0; ix < ci_ptr->num; ix++)
+            histogram_insert(histogram, ci_ptr->values[ix]);
+        }
+    }
+
+  /* Next, walk through the histogram in descending order of hotness
+     and compute the statistics for the working set summary array.
+     As histogram entries are accumulated, we check to see which
+     working set entries have had their expected cum_value reached
+     and fill them in, walking the working set entries in increasing
+     size of cum_value.  */
+  cs_ptr->working_sets = (gcov_ws_info_t *) malloc (
+      num_ws_entries * sizeof (gcov_ws_info_t));
+  ws_ix = 0; /* The current entry into the working set array.  */
+  cum = 0; /* The current accumulated counter sum.  */
+  count = 0; /* The current accumulated count of block counters.  */
+  for (h_ix = HISTOGRAM_SIZE - 1; h_ix > 0 && ws_ix < num_ws_entries; h_ix--)
+    {
+      /* If we haven't reached the required cumulative counter value for
+         the current working set percentage, simply accumulate this histogram
+         entry into the running sums and continue to the next histogram
+         entry.  */
+      if (cum + histogram[h_ix].cum_value < working_set_cum_values[ws_ix])
+        {
+          cum += histogram[h_ix].cum_value;
+          count += histogram[h_ix].count;
+          continue;
+        }
+
+      /* If adding the current histogram entry's cumulative counter value
+         causes us to exceed the current working set size, then estimate
+         how many of this histogram entry's counter values are required to
+         reach the working set size, and fill in working set entries
+         as we reach their expected cumulative value.  */
+      for (c_num = 0, tmp_cum = cum;
+           c_num < histogram[h_ix].count && ws_ix < num_ws_entries;
+           c_num++)
+        {
+          count++;
+          /* If we haven't reached the last histogram entry counter, add
+             in the minimum value again. This will underestimate the
+             cumulative sum so far, because many of the counter values in this
+             entry may have been larger than the minimum. We could add in the
+             average value every time, but that would require an expensive
+             divide operation.  */
+          if (c_num + 1 < histogram[h_ix].count)
+            tmp_cum += histogram[h_ix].min_value;
+          /* If we have reached the last histogram entry counter, then add
+             in the entire cumulative value.  */
+          else
+            tmp_cum = cum + histogram[h_ix].cum_value;
+
+          /* Next walk through successive working set entries and fill in
+            the statistics for any whose size we have reached by accumulating
+            this histogram counter.  */
+          while (ws_ix < num_ws_entries
+                 && tmp_cum >= working_set_cum_values[ws_ix])
+            {
+              cs_ptr->working_sets[ws_ix].num_counters = count;
+              cs_ptr->working_sets[ws_ix].min_bb_counter
+                  = histogram[h_ix].min_value;
+              ws_ix++;
+            }
+        }
+      /* Finally, update the running cumulative value since we were
+         using a temporary above.  */
+      cum += histogram[h_ix].cum_value;
+    }
+  gcc_assert (ws_ix == num_ws_entries);
+  free (working_set_cum_values);
+}
+
 /* Dump the coverage counts. We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum. We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +593,7 @@  gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_working_set (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified. */
@@ -389,7 +636,7 @@  gcov_exit (void)
   /* Now merge each file.  */
   for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
     {
-      unsigned n_counts;
+      unsigned n_counts, ws_cnt;
       struct gcov_summary prg; /* summary for this object over all
 				  program.  */
       struct gcov_ctr_summary *cs_prg, *cs_tprg, *cs_all;
@@ -482,8 +729,6 @@  gcov_exit (void)
 
 	      f_ix--;
 	      length = gcov_read_unsigned ();
-	      if (length != GCOV_TAG_SUMMARY_LENGTH)
-		goto read_mismatch;
 	      gcov_read_summary (&tmp);
 	      if ((error = gcov_is_error ()))
 		goto read_error;
@@ -597,8 +842,35 @@  gcov_exit (void)
 
 	  if (gi_ptr->merge[t_ix])
 	    {
+              ws_cnt = cs_tprg->working_set_count;
 	      if (!cs_prg->runs++)
-		cs_prg->num = cs_tprg->num;
+                {
+                  cs_prg->num = cs_tprg->num;
+                  /* Allocate the working set array for the merged summary.  */
+                  if (ws_cnt)
+                    {
+                      cs_prg->working_set_count = ws_cnt;
+                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
+                          ws_cnt * sizeof (gcov_ws_info_t));
+                    }
+                }
+              else if (cs_prg->num != cs_tprg->num
+                       || ws_cnt != cs_prg->working_set_count)
+                goto read_mismatch;
+              /* Copy over this run's working set information if either this is
+                 the first run, the total size of the profile (sum_all) is much
+                 (50%) larger for this run (need to check this before updating
+                 cs_prg->sum_all below), or this run has a larger working
+                 set in the largest (99.9% of sum_all) bucket.  */
+              if (ws_cnt
+                  && (cs_prg->runs == 1
+                      || (cs_tprg->sum_all
+                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
+                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
+                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
+                memcpy (cs_prg->working_sets,
+                        cs_tprg->working_sets,
+                        ws_cnt * sizeof (gcov_ws_info_t));
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
@@ -696,7 +968,11 @@  gcov_exit (void)
 		   "profiling:%s:Overflow writing\n" :
 		   "profiling:%s:Error writing\n",
 		   gi_filename);
+
+      gcov_destroy_summary (&prg);
     }
+
+    gcov_destroy_summary (&this_prg);
 }
 
 /* Reset all counters to zero.  */
Index: gcc/gcov.c
===================================================================
--- gcc/gcov.c	(revision 189924)
+++ gcc/gcov.c	(working copy)
@@ -1264,6 +1264,7 @@  read_count_file (function_t *fns)
 	  struct gcov_summary summary;
 	  gcov_read_summary (&summary);
 	  object_runs += summary.ctrs[GCOV_COUNTER_ARCS].runs;
+          gcov_destroy_summary (&summary);
 	  program_count++;
 	}
       else if (tag == GCOV_TAG_FUNCTION && !length)
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189924)
+++ gcc/gcov-io.c	(working copy)
@@ -368,10 +368,13 @@  gcov_write_tag_length (gcov_unsigned_t tag, gcov_u
 GCOV_LINKAGE void
 gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix, ws_cnt;
   const struct gcov_ctr_summary *csum;
 
-  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH);
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE, ws_cnt = 0;
+       ix--; csum++)
+    ws_cnt += csum->working_set_count;
+  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(ws_cnt));
   gcov_write_unsigned (summary->checksum);
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
@@ -380,6 +383,12 @@  gcov_write_summary (gcov_unsigned_t tag, const str
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
       gcov_write_counter (csum->sum_max);
+      gcov_write_unsigned (csum->working_set_count);
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+        {
+          gcov_write_unsigned (csum->working_sets[ws_ix].num_counters);
+          gcov_write_counter (csum->working_sets[ws_ix].min_bb_counter);
+        }
     }
 }
 #endif /* IN_LIBGCOV */
@@ -488,7 +497,7 @@  gcov_read_string (void)
 GCOV_LINKAGE void
 gcov_read_summary (struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix;
   struct gcov_ctr_summary *csum;
 
   summary->checksum = gcov_read_unsigned ();
@@ -499,9 +508,35 @@  gcov_read_summary (struct gcov_summary *summary)
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
       csum->sum_max = gcov_read_counter ();
+      csum->working_set_count = gcov_read_unsigned ();
+      csum->working_sets
+#if IN_LIBGCOV
+          = (gcov_ws_info_t *) malloc (csum->working_set_count *
+#else
+          = (gcov_ws_info_t *) xmalloc (csum->working_set_count *
+#endif
+                                        sizeof (gcov_ws_info_t));
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+        {
+          csum->working_sets[ws_ix].num_counters = gcov_read_unsigned ();
+          csum->working_sets[ws_ix].min_bb_counter = gcov_read_counter ();
+        }
     }
 }
 
+GCOV_LINKAGE void
+gcov_destroy_summary (struct gcov_summary *summary)
+{
+  unsigned ix;
+  struct gcov_ctr_summary *csum;
+
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
+    {
+      if (csum->working_set_count)
+        free (csum->working_sets);
+    }
+}
+
 #if !IN_LIBGCOV
 /* Reset to a known position.  BASE should have been obtained from
    gcov_position, LENGTH should be a record length.  */
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189924)
+++ gcc/gcov-io.h	(working copy)
@@ -139,7 +139,9 @@  see the files COPYING3 and COPYING.RUNTIME respect
 	counts: header int64:count*
 	summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
 	count-summary:	int32:num int32:runs int64:sum
-			int64:max int64:sum_max
+			int64:max int64:sum_max working-set-summary
+        working-set-summary: int32:count working-sets*
+        working-sets: int32:num int64:min
 
    The ANNOUNCE_FUNCTION record is the same as that in the note file,
    but without the source location.  The COUNTS gives the
@@ -171,8 +173,10 @@  typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (SI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (DI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #endif
 #else
 #if BITS_PER_UNIT == 16
@@ -180,16 +184,20 @@  typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (HI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #endif
 #else
 typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI)));
 typedef unsigned gcov_position_t __attribute__ ((mode (QI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (QI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI)));
 #endif
 #endif
 #endif
@@ -213,8 +221,6 @@  typedef HOST_WIDEST_INT gcov_type;
 #if IN_GCOV > 0
 #include <sys/types.h>
 #endif
-#else /*!IN_GCOV */
-#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
 #endif
 
 #if defined (HOST_HAS_F_SETLKW)
@@ -225,6 +231,8 @@  typedef HOST_WIDEST_INT gcov_type;
 
 #endif /* !IN_LIBGCOV */
 
+#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
+
 /* In gcov we want function linkage to be static.  In the compiler we want
    it extern, so that they can be accessed from elsewhere.  In libgcov we
    need these functions to be extern, so prefix them with __gcov.  In
@@ -248,6 +256,7 @@  typedef HOST_WIDEST_INT gcov_type;
 #define gcov_read_unsigned __gcov_read_unsigned
 #define gcov_read_counter __gcov_read_counter
 #define gcov_read_summary __gcov_read_summary
+#define gcov_destroy_summary __gcov_destroy_summary
 
 /* Poison these, so they don't accidentally slip in.  */
 #pragma GCC poison gcov_write_string gcov_write_tag gcov_write_length
@@ -309,9 +318,10 @@  typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2)
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
-#define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+#define GCOV_TAG_SUMMARY_LENGTH(NUM)  \
+        (1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 3)
 
+
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS 	0  /* Arc transitions.  */
 #define GCOV_COUNTERS_SUMMABLE	1  /* Counters which can be
@@ -389,6 +399,16 @@  typedef HOST_WIDEST_INT gcov_type;
 
 /* Structured records.  */
 
+/* Working set size statistics for a given percentage of the entire
+   profile (sum_all).  */
+typedef struct gcov_working_set_info
+{
+  /* Number of hot counters included in this working set.  */
+  gcov_unsigned_t num_counters;
+  /* Smallest counter included in this working set.  */
+  gcov_type min_bb_counter;
+} gcov_ws_info_t;
+
 /* Cumulative counter data.  */
 struct gcov_ctr_summary
 {
@@ -397,6 +417,8 @@  struct gcov_ctr_summary
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
   gcov_type sum_max;    	/* sum of individual run max values.  */
+  gcov_unsigned_t working_set_count; /* number of working set infos.  */
+  gcov_ws_info_t *working_sets; /* array of working set info.  */
 };
 
 /* Object & program summary record.  */
@@ -552,6 +574,7 @@  static int gcov_is_error (void);
 GCOV_LINKAGE gcov_unsigned_t gcov_read_unsigned (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE gcov_type gcov_read_counter (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
+GCOV_LINKAGE void gcov_destroy_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
 
 #if IN_LIBGCOV
 /* Available only in libgcov */
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189924)
+++ gcc/coverage.c	(working copy)
@@ -172,6 +172,8 @@  htab_counts_entry_del (void *of)
   counts_entry_t *const entry = (counts_entry_t *) of;
 
   free (entry->counts);
+  if (entry->summary.working_sets)
+    free (entry->summary.working_sets);
   free (entry);
 }
 
@@ -247,12 +249,40 @@  read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+              gcov_unsigned_t ws_cnt = sum.ctrs[ix].working_set_count;
+              /* Allocate the working set array for a new summary.  */
+              if (ws_cnt && new_summary)
+                {
+                  summary.ctrs[ix].working_set_count = ws_cnt;
+                  summary.ctrs[ix].working_sets
+                      = (gcov_ws_info_t *) xmalloc (ws_cnt
+                                                    * sizeof (gcov_ws_info_t));
+                }
+              /* Copy over this summary's working set information if this is
+                 the first summary read, if the total profile size (sum_all)
+                 is much (50%) larger for this summary (this must be checked
+                 before sum_all is updated below), or if this summary has a
+                 larger working set in the largest (99.99% of sum_all) bucket.  */
+              if (ws_cnt
+                  && ws_cnt == summary.ctrs[ix].working_set_count
+                  && (new_summary
+                      || (sum.ctrs[ix].sum_all
+                          > (summary.ctrs[ix].sum_all
+                             + summary.ctrs[ix].sum_all / 2))
+                      || (sum.ctrs[ix].working_sets[ws_cnt - 1].num_counters
+                          > summary.ctrs[ix].working_sets[ws_cnt - 1].num_counters)))
+                {
+                  memcpy (summary.ctrs[ix].working_sets,
+                          sum.ctrs[ix].working_sets,
+                          ws_cnt * sizeof (gcov_ws_info_t));
+                }
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
 		summary.ctrs[ix].run_max = sum.ctrs[ix].run_max;
 	      summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max;
 	    }
+          gcov_destroy_summary (&sum);
 	  new_summary = 0;
 	}
       else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident)
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189924)
+++ gcc/gcov-dump.c	(working copy)
@@ -447,7 +447,8 @@  tag_summary (const char *filename ATTRIBUTE_UNUSED
 	     unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED)
 {
   struct gcov_summary summary;
-  unsigned ix;
+  unsigned ix, jx, pctinc, pct;
+  gcov_ws_info_t *ws_info;
 
   gcov_read_summary (&summary);
   printf (" checksum=0x%08x", summary.checksum);
@@ -465,5 +466,27 @@  tag_summary (const char *filename ATTRIBUTE_UNUSED
 	      (HOST_WIDEST_INT)summary.ctrs[ix].run_max);
       printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC,
 	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_max);
+      if (! summary.ctrs[ix].working_set_count)
+        continue;
+      printf ("\n");
+      print_prefix (filename, 0, 0);
+      printf ("\t\tworking set statistics:");
+      /* Multiply the percentage by 100 to avoid float.  */
+      pctinc = 100 * 100 / summary.ctrs[ix].working_set_count;
+      for (jx = 0, pct = pctinc; jx < summary.ctrs[ix].working_set_count;
+           jx++, pct += pctinc)
+        {
+          if (jx == summary.ctrs[ix].working_set_count - 1)
+            pct = 9999;
+          ws_info = &summary.ctrs[ix].working_sets[jx];
+          printf ("\n");
+          print_prefix (filename, 0, 0);
+          /* Print the percentage using integer arithmetic to avoid float.  */
+          printf ("\t\t%u.%02u%%: num counts=%u, min counter="
+              HOST_WIDEST_INT_PRINT_DEC,
+              pct / 100, pct % 100,
+	      ws_info->num_counters, (HOST_WIDEST_INT)ws_info->min_bb_counter);
+        }
     }
+  gcov_destroy_summary (&summary);
 }