[PATCH] Add working-set size and hotness information to fdo summary (issue6465057)

Mon Aug 20 05:41:00 GMT 2012

On Sun, Aug 19, 2012 at 9:59 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> > +                {
>> > +                  cs_prg->num = cs_tprg->num;
>> > +                  /* Allocate the working set array for the merged summary.  */
>> > +                  if (ws_cnt)
>> > +                    {
>> > +                      cs_prg->working_set_count = ws_cnt;
>> > +                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
>> > +                          ws_cnt * sizeof (gcov_ws_info_t));
>> > +                    }
>> > +                }
>> > +              else if (cs_prg->num != cs_tprg->num
>> > +                       || ws_cnt != cs_prg->working_set_count)
>> > +                goto read_mismatch;
>> > +              /* Copy over this run's working set information if either this is
>> > +                 the first run, the total size of the profile (sum_all) is much
>> > +                 (50%) larger for this run (need to check this before updating
>> > +                 cs_prg->sum_all below), or this run has a larger working
>> > +                 set in the largest (99.99% of sum_all) bucket.  */
>> > +              if (ws_cnt
>> > +                  && (cs_prg->runs == 1
>> > +                      || (cs_tprg->sum_all
>> > +                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
>> > +                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
>> > +                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
>> > +                memcpy (cs_prg->working_sets,
>> > +                        cs_tprg->working_sets,
>> > +                        ws_cnt * sizeof (gcov_ws_info_t));
>> >             cs_prg->sum_all += cs_tprg->sum_all;
>>
>> Hmm, when you run i.e. gcc bootstrap  where the program is run couple hundred
>> times, this will end up really inaccurate, since it will probably just store
>> the histogram of insn-attrtab compilation, right?
>
>
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.
>
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these since they will all need to
> be buffered across the iteration over all the gcda files).
>

The current way of doing summary merging produces imprecise data (e.g,
sum_max). It computes current execution's summary first, and then
merge with previous run's summary data while profile counters are
merged.  The right way is to merge profile counters first, and then
recompute the final summary based on the merged profile.   The later
way allows merging of more advanced summary data such as the histogram
here. It is also used in LIPO mode (for dynamic callgraph build/merge
-- kind of a summary itself).

However I think changing the way summary is merged should probably be
done in a different patch.

David

> The summary merging in coverage.c confuses me a bit as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.
>
>>
>>
>> Why you don't simply write the histogram into gcov file and don't merge the values
>> here (i.e. doing the cummulation loop in GCC instead of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by the sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values that is the same as needed to stream the histogram.
>> I suppose we can have environment variable to reduce the histogram size - I guess in smaller
>> setups smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?
>
> Thanks,
> Teresa
>
>>
>> Honza
>
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413