[PATCH] Add working-set size and hotness information to fdo summary (issue6465057)

Mon Aug 20 09:48:00 GMT 2012

> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.

Yes, and that is the one we plan to base the hot/cold decisions on,
right?

Note that there is no real 1-1 correspondence between BBs and counters.
For each function there should be num_edges-num_bbs+1 counters.
What do you plan to use BB counts for?
> 
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
> 
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these since they will all need to
> be buffered across the iteration over all the gcda files).
> 
> The summary merging in coverage.c confuses me a bit as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.

This is something Nathan introduced years back. The idea was IMO to handle
objects linked into multiple binaries more accurately. I am not sure
whether the code really works, or ever worked at some point.

The idea, as I recall it, was to produce an overall checksum of all objects and
have a different profile record for each combination.

This is not really useful for profile feedback, as you generate a single object
file for all uses, but it might become useful for LTO, where you know which
binary you are linking into. I am not really sure it is worth all the
infrastructure needed to support this, though.
> 
> >
> >
> > Why don't you simply write the histogram into the gcov file and not merge the values
> > here (i.e. do the cumulation loop in GCC instead of within libgcov)?
> 
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
> 
> histogram 1:
> 
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
> 
> histogram 2:
> 
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
> 
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
> 
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
> 
> > By default you are streaming 128 values, which is the same as what is needed to stream the histogram.
> > I suppose we can have an environment variable to reduce the histogram size - I guess in smaller
> > setups a smaller histogram will work just fine...
> 
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
> 
> What do you think?

So we have these options:
  1) ignore the problem that summaries become inaccurate with many train runs,
     as we do now.
  2) write the histogram only, do not track BB counts or approximate them by scaling,
     and perhaps retire max_counter from the summary in favour of a histogram estimate.

     We will pay with more data being written (I think the histogram may actually compress pretty
     well by skipping zero entries, I don't know) and by getting only an estimated max_count
     from the histogram, which should still be good enough in practice (max_count is used only
     for the hot/cold decisions and those will be histogram based anyway).
  3) do two-stage processing: read the data into memory first, produce the summaries,
     and then write them out.

2) seems appealing to me because it is simple, but there are limitations in
what it can handle.
3) solves the precision problems, but how do we handle the locking & races?  In
a case like bootstrap, where GCCs are executed in parallel, we will end up with
random results if the files get modified by another profiled run of GCC
between the read-in and the write-out.

So I definitely prefer 2 or 3 over 1. David has experience with 3. How well does
it work for LIPO?

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> 
> 
> 
> 
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413