[PATCH] Add working-set size and hotness information to fdo summary (issue6465057)

Teresa Johnson tejohnson@google.com
Mon Aug 20 04:59:00 GMT 2012


On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > +                {
> > +                  cs_prg->num = cs_tprg->num;
> > +                  /* Allocate the working set array for the merged summary.  */
> > +                  if (ws_cnt)
> > +                    {
> > +                      cs_prg->working_set_count = ws_cnt;
> > +                      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> > +                          ws_cnt * sizeof (gcov_ws_info_t));
> > +                    }
> > +                }
> > +              else if (cs_prg->num != cs_tprg->num
> > +                       || ws_cnt != cs_prg->working_set_count)
> > +                goto read_mismatch;
> > +              /* Copy over this run's working set information if either this is
> > +                 the first run, the total size of the profile (sum_all) is much
> > +                 (50%) larger for this run (need to check this before updating
> > +                 cs_prg->sum_all below), or this run has a larger working
> > +                 set in the largest (99.99% of sum_all) bucket.  */
> > +              if (ws_cnt
> > +                  && (cs_prg->runs == 1
> > +                      || (cs_tprg->sum_all
> > +                          > (cs_prg->sum_all + cs_prg->sum_all / 2))
> > +                      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> > +                          > cs_prg->working_sets[ws_cnt - 1].num_counters)))
> > +                memcpy (cs_prg->working_sets,
> > +                        cs_tprg->working_sets,
> > +                        ws_cnt * sizeof (gcov_ws_info_t));
> >             cs_prg->sum_all += cs_tprg->sum_all;
>
> Hmm, when you run i.e. gcc bootstrap  where the program is run couple hundred
> times, this will end up really inaccurate, since it will probably just store
> the histogram of insn-attrtab compilation, right?


Well, it should store the largest working set in BBs, or one that came
from a much longer run. But I see the issue you are pointing out. The
num_counters (the number of hot bbs) should be reasonable, as the
total number of BBs is the same between all runs, and I want to show
the largest or most dominant working set in terms of the number of hot
bbs. But the min_bb_counter will become more and more inaccurate as
the number of runs increases, since the counter values in the gcda
files are accumulated across runs while the stored min_bb_counter
still reflects the single run whose working set was copied into the
merged summary.

I typed up a detailed email below on why getting this right would be
difficult, but then I realized there might be a fairly simple and
accurate solution, which I'll describe first:

The only way I see to do this completely accurately is to take two
passes through the existing gcda files that we are merging into: a
first pass to read in all the counter values and merge them into the
counter arrays from the current run (after which we can do the
histogramming/working-set computation accurately from the merged
counters), and a second pass to rewrite them. In libgcov this doesn't
seem like it would be too difficult to do, although it requires a
little restructuring of the main merging loop and some special
handling for buffered functions (which could increase the memory
footprint a bit if there are many of these, since they would all need
to be buffered across the iteration over all the gcda files).
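
To make that concrete, here is a rough sketch of the shape the
restructuring could take (the struct and the helpers below are
hypothetical stand-ins for the real libgcov data structures and
routines, just to illustrate the two passes):

struct gcda_file_state
{
  const char *filename;          /* path of the gcda file */
  long long *merged_counters;    /* this run's counters, after folding
                                    in the on-disk values */
  unsigned n_counters;
  struct gcda_file_state *next;
};

/* Hypothetical helpers: fold the on-disk counters of F into
   F->merged_counters; compute the histogram / working set over the
   merged counters of all files; write the merged counters and the
   updated program summary back out.  */
extern void read_and_merge_counters (struct gcda_file_state *f);
extern void compute_working_set (struct gcda_file_state *list);
extern void rewrite_gcda (struct gcda_file_state *f);

void
dump_profiles_two_pass (struct gcda_file_state *list)
{
  struct gcda_file_state *f;

  /* Pass 1: merge the existing on-disk counters into this run's
     counters, without writing anything yet.  */
  for (f = list; f; f = f->next)
    read_and_merge_counters (f);

  /* The histogram / working set is now computed from fully merged
     counter values, so the summary is accurate.  */
  compute_working_set (list);

  /* Pass 2: rewrite each gcda file with the merged counters and the
     accurate summary.  */
  for (f = list; f; f = f->next)
    rewrite_gcda (f);
}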

The summary merging in coverage.c confuses me a bit, as it seems to
handle the case where there are multiple program summaries in a
single gcda file. When would this happen? It looks like the merge
handling in libgcov should always produce a single program summary
per gcda file.

>
>
> Why you don't simply write the histogram into gcov file and don't merge the values
> here (i.e. doing the cummulation loop in GCC instead of within libgcov)?

That doesn't completely solve the problem, unfortunately. The reason
is that you don't know which histogram entry contains a particular
block each run, and therefore you don't know how to compute the new
combined histogram value and index for that bb counter. For example, a
particular histogram index may have 5 counters (bbs) in it for one run
and 2 counters (bbs) in it for a second run, so the question is how to
compute the new entry for all of those bb counters, as the 5 bbs from
the first run may or may not be a superset of the 2 from the second
run. You could assume that the bbs have the same relative order of
hotness between runs, and combine the bb counters accordingly, but
there is still some trickiness because you have to apportion the
cumulative counter sum stored in the histogram entry between new
entries. For example, assume the highest indexed non-zero entries (the
histogram buckets containing the largest counter values) in the two
incoming histograms are:

histogram 1:

index 100: 4 counters, cumulative value X, min counter value minx
...

histogram 2:

index 100: 2 counters, cumulative value Y, min counter value miny
index 99: 3 counters, cumulative value W, min counter value minw
...

To merge, you could assume something like 2 counters with a new
cumulative value (Y + X*2/4) and a new min counter value minx+miny,
which go into the merged histogram at the index corresponding to
counter value minx+miny. And then 2 more counters with a new
cumulative value (W*2/3 + X*2/4) and a new min counter value
minx+minw, which go into the merged histogram at the index
corresponding to counter value minw+minx. Etc. This is not entirely
accurate, although it might be a reasonable approximation, but it
also requires a number of division operations during the merge in
libgcov.
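
In code, one step of that apportioning might look roughly like the
following (a sketch only: the entry type is a made-up stand-in for
whatever the streamed histogram entry would be, and gcov_type is just
aliased to long long here):

typedef long long gcov_type;   /* stand-in for libgcov's gcov_type */

typedef struct
{
  unsigned num_counters;       /* number of bb counters in this bucket */
  gcov_type cum_value;         /* cumulative sum of those counters */
  gcov_type min_value;         /* smallest counter value in the bucket */
} hist_entry_t;

/* Merge N counters drawn from bucket SRC1 (holding TOTAL1 counters in
   all) with N counters drawn from bucket SRC2 (holding TOTAL2),
   assuming the bbs keep the same relative hotness order between runs.
   Each source bucket's cumulative value is apportioned proportionally,
   which is where the divisions come from.  The result would be placed
   at the histogram index corresponding to DST->min_value.  */
void
merge_bucket_slice (hist_entry_t *dst,
                    const hist_entry_t *src1, unsigned total1,
                    const hist_entry_t *src2, unsigned total2,
                    unsigned n)
{
  dst->num_counters = n;
  dst->cum_value = src1->cum_value * n / total1
                   + src2->cum_value * n / total2;
  dst->min_value = src1->min_value + src2->min_value;
}

With the example above, merge_bucket_slice (&out, &h1_100, 4, &h2_100,
2, 2) gives a cumulative value of Y + X*2/4 and a min of minx+miny,
matching the first step described.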

Another possibility, which might also provide a reasonable
approximation, would be to scale the min_bb_counter values in the
working sets by the ratio sum_all_merged/sum_all_orig, where
sum_all_merged is the new sum_all, and sum_all_orig is the sum_all
from the summary whose working_set was chosen to be propagated to the
new merged summary. This also requires some divisions at libgcov
merge time, unless we save the original sum_all along with the
working set in the summary and do the scaling at profile-use time.
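
As a sketch (reusing the gcov_ws_info_t type from the patch and
assuming it carries the min_bb_counter field discussed above; overflow
of the intermediate product is ignored for brevity):

/* Scale the min counter values of the propagated working set by the
   ratio of the merged sum_all to the sum_all of the run the working
   set was taken from.  SUM_ALL_ORIG must be nonzero.  */
static void
scale_working_set (gcov_ws_info_t *ws, unsigned ws_cnt,
                   gcov_type sum_all_merged, gcov_type sum_all_orig)
{
  unsigned i;

  for (i = 0; i < ws_cnt; i++)
    ws[i].min_bb_counter
      = ws[i].min_bb_counter * sum_all_merged / sum_all_orig;
}

If the original sum_all is streamed alongside the working set, the
same loop could run at profile-use time instead.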

> By default you are streaming 128 values that is the same as needed to stream the histogram.
> I suppose we can have environment variable to reduce the histogram size - I guess in smaller
> setups smaller histogram will run just fine...

It is a bit more, as the histogram will have 64*4 = 256 entries for
64-bit counters, and each entry contains 2 gcov_type counter values
and one unsigned int. The working set entries only have one gcov_type
counter and one unsigned. So it could be close to 4x.
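(For example, with an 8-byte gcov_type, a 4-byte unsigned int, and the
default of 128 working-set entries: 256 * (2*8 + 4) = 5120 bytes for
the histogram versus 128 * (8 + 4) = 1536 bytes for the working-set
array.)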

What do you think?

Thanks,
Teresa

>
> Honza




--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


