This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Add working-set size and hotness information to fdo summary (issue6465057)


On Tue, Aug 21, 2012 at 11:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > I can go ahead with the histogram approach. There is some roundoff
>> >> > error in the working-set scaling approach that can produce different
>> >> > results for different merging orders, as you note, although I think this
>> >> > only really affects the small counter values. The other place where
>> >> > saving/merging the histogram
>> >
>> > Do you have any intuition on why simple maximization merging (which is safe
>> > w.r.t. ordering) would be a bad idea?
>>
>> When you say "maximization merging", are you talking about the
>> histogram merging approach I mentioned a few emails back (my response
>> on Aug 19) where we assume the same relative order of hotness in the
>> counters between runs, and accumulate the counter values in the
>> histogram in that order?
>
> I was speaking of BB (counter) counts only here, and considering the simplest
> strategy: maximize the BB counts in the corresponding buckets and sum the
> cumulative values in the corresponding buckets.

I'm concerned about inaccuracies arising when multiple runs have
different sizes and therefore counters are in different buckets in the
corresponding histograms. We would end up in a situation where the
merged histogram has more "counters" in it than there are actual
counters in the program.
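
To make the concern concrete, here is a rough sketch of the bucket-wise
merge as I understand it (the struct, names, and bucket count below are
made-up placeholders, not the format from the patch):

/* Sketch only; the layout and bucket count are stand-ins.  */

#define SKETCH_NUM_BUCKETS 252

struct sketch_bucket
{
  unsigned num_counters;           /* counters whose value falls in bucket */
  unsigned long long min_value;    /* smallest counter value in bucket */
  unsigned long long cum_value;    /* sum of counter values in bucket */
};

/* Bucket-wise "maximization" merge: take the max of the counter counts
   and sum the cumulative values.  If the same BB counter lands in
   bucket 10 in one run but bucket 7 in another (because the runs differ
   in size), it is counted in both buckets after the merge, which is how
   the merged histogram ends up describing more counters than the
   program actually has.  */

static void
sketch_merge_max (struct sketch_bucket *dst, const struct sketch_bucket *src)
{
  for (unsigned i = 0; i < SKETCH_NUM_BUCKETS; i++)
    {
      if (src[i].num_counters > dst[i].num_counters)
        dst[i].num_counters = src[i].num_counters;
      if (dst[i].min_value == 0
          || (src[i].min_value && src[i].min_value < dst[i].min_value))
        dst[i].min_value = src[i].min_value;
      dst[i].cum_value += src[i].cum_value;
    }
}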

>
> The strategy of scaling is more sensible, but it will also be sensitive to the
> order of train runs, i.e. it will give unstable results with parallel make.

Regarding the approach of merging histograms by matching up counters
in the same order and apportioning the histogram cumulative values
evenly between counters in the same bucket before merging, I don't
think the ordering will have a huge effect. The reason is that the
counters we end up matching together and accumulating will stay in the
same relative order, and the min counter values are simply summed to
compute the new bucket index and new min counter value to record. For
the cumulative value, since the incoming cumulative values are
apportioned evenly between the counters coming from the same bucket
when computing the new cumulative value (which is then just another
add), there could be some slight differences from different round-off
errors when histograms are merged in different orders. But I would
expect that to result in only small differences in the cumulative
values in the histogram. Is that acceptable?
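
In code, the order-matching merge would look roughly like this (again a
sketch with made-up names; the real patch would operate on whatever we
end up streaming):

/* Each histogram is flattened into one entry per counter, hottest
   bucket first, with the bucket's cum_value apportioned evenly among
   its counters.  The k-th hottest entries of the two runs are then
   summed and re-inserted, so the result keeps one entry per actual
   counter.  */

#define SKETCH_NUM_BUCKETS 252

struct sketch_bucket
{
  unsigned num_counters;
  unsigned long long min_value;
  unsigned long long cum_value;
};

struct sketch_entry
{
  unsigned long long min_value;   /* bucket minimum recorded for counter */
  double cum_share;               /* even share of the bucket's cum_value */
};

/* Hypothetical: map a counter value back to its bucket index.  */
extern unsigned sketch_bucket_index (unsigned long long value);

/* Flatten HIST into OUT, hottest first.  OUT must have room for one
   entry per counter in HIST.  Returns the number of entries written.  */

static unsigned
sketch_flatten (const struct sketch_bucket *hist, struct sketch_entry *out)
{
  unsigned n = 0;
  for (int i = SKETCH_NUM_BUCKETS - 1; i >= 0; i--)
    for (unsigned j = 0; j < hist[i].num_counters; j++)
      {
        out[n].min_value = hist[i].min_value;
        out[n].cum_share
          = (double) hist[i].cum_value / hist[i].num_counters;
        n++;
      }
  return n;
}

/* Pair up the k-th hottest counter of each run, sum the min values and
   cumulative shares, and record the result in MERGED.  */

static void
sketch_merge_ordered (struct sketch_bucket *merged,
                      const struct sketch_entry *a, unsigned na,
                      const struct sketch_entry *b, unsigned nb)
{
  unsigned n = na < nb ? na : nb;
  for (unsigned k = 0; k < n; k++)
    {
      unsigned long long min_sum = a[k].min_value + b[k].min_value;
      unsigned long long cum_sum
        = (unsigned long long) (a[k].cum_share + b[k].cum_share);
      unsigned i = sketch_bucket_index (min_sum);
      merged[i].num_counters++;
      merged[i].cum_value += cum_sum;
      if (merged[i].min_value == 0 || min_sum < merged[i].min_value)
        merged[i].min_value = min_sum;
    }
  /* Any leftover entries (runs with different counter counts) would be
     inserted with their own values; omitted here.  */
}

The only order dependence left is the floating-point apportioning of
cum_value, which is where the small round-off differences mentioned
above would come from.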

>> > OK, so I guess we went through
>> >  1) two-pass updating with a race in between passes.
>> >  2) two-pass updating with the first pass updating counters and the second having a race only for the summary update.
>> >     (i.e. no races for counters)
>> >  3) two-pass updating with flocking (and some way to handle detected deadlocks)
>> >  4) one-pass updating with histogram merging + maximization of the working set.
>> >     (we do not really need to scale the buckets; we can simply merge the histograms
>> >      and then multiply them by nruns before comparing to actual counters.)
>>
>> By merging the histograms (and accumulating the counter values stored
>> there as we merge), I don't think we need to multiply the counter
>> values by nruns, do we?
>
> With the simple approach we need to, but...
>>
>> >      This assumes that the working sets of all runs are about the same, but it
>> >      should work reasonably well in practice, I think.
>> >
>> > I guess 3/4 are acceptable WRT bootstrap reproducibility.
>> >
>> > I have no experience with flocking a large number of files or with the
>> > portability of that solution, e.g. to Windows.  If you think that 2) would be
>> > too inaccurate in practice and 3) has a chance to be portable, we could go for
>> > it.  It will solve the precision problems and will also work for LIPO summaries.
>> > I would be curious about the effect on profiledbootstrap time if you implement
>> > it.
>>
>> I'm hoping that 2) will be accurate enough in practice, but it will
>> need some investigation.
>
> Perhaps, weighing all the pros/cons, you are right here.  I think getting results
> consistent across different orders of runs is more important than the possibility
> of a race on the last update, and 4) would probably run into quite a lot of GIGO
> issues.
>
> We can implement a more consistent locking mechanism incrementally if this turns
> out to be an issue.
>
> It is possible to actually detect the race: in the first round we update
> nruns.  In the second round we can see whether nruns has changed.  If so, we
> have a problem, since someone else has already updated the file and will update
> the histograms.
>
> I would suggest skipping the update when nruns has changed, so the summary at
> least matches the counters in the current file accurately (even though it will
> be off for the whole program, since whoever updated the file may not have seen
> all our updates, just part of them).  Just to reduce the chance that the trashed
> run is the most important one.  Theoretically we could also make libgcov re-read
> all the counters in this case and try another update, but I would not worry
> about that for the moment.

Ok, I think that makes sense.
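Just to confirm we mean the same thing, here is a rough sketch of the
check (the types and helpers below are placeholders, not the existing
libgcov interfaces):

/* Pass 1 merges/rewrites the counters and records the nruns value it
   wrote.  Pass 2 re-reads the summary and skips the histogram update
   if another process bumped nruns in the meantime, so the summary we
   write at least matches the counters currently in the file.  */

struct sketch_summary
{
  unsigned nruns;
  /* ... histogram and other summary data ... */
};

/* Hypothetical stand-ins for reading/writing the gcda summary.  */
extern int sketch_read_summary (const char *file, struct sketch_summary *s);
extern int sketch_write_summary (const char *file,
                                 const struct sketch_summary *s);

static int
sketch_update_summary (const char *file, unsigned nruns_from_pass1,
                       const struct sketch_summary *new_summary)
{
  struct sketch_summary cur;

  if (sketch_read_summary (file, &cur) != 0)
    return -1;

  if (cur.nruns != nruns_from_pass1)
    /* Someone else merged after our first pass; they will update the
       histogram themselves, so skip rather than write a summary that
       is inconsistent with the counters now in the file.  */
    return 0;

  return sketch_write_summary (file, new_summary);
}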

If it is ok with you, I'd prefer to go with a 3-step approach:
1) Implement some form of histogram merging (one of the options we
discussed above). (Your approach #4).
2) Implement approach #2 (restructure into 2 passes with counters
being updated accurately and a possible race on the histogram
computation).
3) Investigate and implement either approach #3 (some kind of global
lock) or the race detection you describe above.

>
> We will still have problems with bootstrap: the summary of libbackend.a will
> depend on which of the compiler binaries executed last, since cc1 will have a
> different summary from cc1plus because the frontend is different.  I would give
> extra points to a volunteer who changes libgcov to keep multiple summaries for
> different programs, as intended originally (if you search the archives from
> 2001-2003, somewhere there are patches that changed libgcov from not merging at
> all to merging always.  I guess changing it to merge just when the program
> checksums match is not that hard).

In this case, I assume that currently the merge will abort because the
# of counters, etc., don't match?
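
For the multiple-summaries idea, something roughly like the following
is what I picture (made-up names and a fixed slot count, purely to
illustrate):

/* Per-program summaries keyed by a program checksum (hypothetical
   layout, not the current gcda format).  cc1 and cc1plus would then
   merge into different slots instead of overwriting a single shared
   summary for libbackend.a's counters.  */

#define SKETCH_MAX_PROGRAMS 8   /* arbitrary, for illustration */

struct sketch_program_summary
{
  unsigned program_checksum;   /* 0 means the slot is unused */
  unsigned nruns;
  /* ... per-program histogram / summary data ... */
};

static struct sketch_program_summary *
sketch_find_summary (struct sketch_program_summary *slots, unsigned checksum)
{
  for (unsigned i = 0; i < SKETCH_MAX_PROGRAMS; i++)
    {
      if (slots[i].program_checksum == checksum)
        return &slots[i];                       /* checksums match: merge here */
      if (slots[i].program_checksum == 0)
        {
          slots[i].program_checksum = checksum; /* first run: claim the slot */
          return &slots[i];
        }
    }
  return 0;   /* no slot left; caller could skip the summary merge */
}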

Thanks,
Teresa

>
> Thanks,
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> >
>> >> > Thanks,
>> >> > Teresa
>> >> >
>> >> > >
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> >  2) Do we plan to add some features in the near future that will
>> >> >> > >> >     require global locking anyway?
>> >> >> > >> >     I guess LIPO itself does not count since it streams its data
>> >> >> > >> >     into an independent file as you mentioned earlier, and locking
>> >> >> > >> >     the LIPO file is not that hard.
>> >> >> > >> >     Does LIPO stream everything into that common file, or does it
>> >> >> > >> >     use a combination of gcda files and a common summary?
>> >> >> > >>
>> >> >> > >> Actually, LIPO module grouping information is stored in the gcda files.
>> >> >> > >> It is also stored in a separate .imports file (one per object) ---
>> >> >> > >> this is primarily used by our build system for dependence information.
>> >> >> > >
>> >> >> > > I see; getting LIPO safe WRT parallel updates will be fun.  How does
>> >> >> > > LIPO behave on GCC bootstrap?
>> >> >> >
>> >> >> > We have not tried GCC bootstrap with LIPO.  GCC compile time is not the
>> >> >> > main problem for application builds -- the link time (for debug builds)
>> >> >> > is.
>> >> >>
>> >> >> I was primarily curious how LIPO's runtime analysis fares in a situation
>> >> >> where you do very many small train runs on a rather large app (sure, GCC
>> >> >> is small compared to Google's use case ;).
>> >> >> >
>> >> >> > > (i.e. it does a lot more work in the libgcov module per invocation,
>> >> >> > > so I am curious if it is practically useful at all).
>> >> >> > >
>> >> >> > > With an LTO-based solution a lot can probably be pushed to link time?
>> >> >> > > Before the actual GCC starts from the linker plugin, the LIPO module
>> >> >> > > can read the gcov CFGs from the gcda files and do all the
>> >> >> > > merging/updating/CFG construction that is currently performed at
>> >> >> > > runtime, right?
>> >> >> >
>> >> >> > The dynamic cgraph build and analysis is still done at runtime.
>> >> >> > However, with the new implementation, the FE is no longer involved.  The
>> >> >> > gcc driver is modified to understand module grouping, and LTO is used to
>> >> >> > merge the streamed output from the aux modules.
>> >> >>
>> >> >> I see.  Are there any fundamental reasons why it cannot be done at link
>> >> >> time, when all gcda files are available?  Why is the grouping not done
>> >> >> inside the linker plugin?
>> >> >>
>> >> >> Honza
>> >> >> >
>> >> >> >
>> >> >> > David
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Teresa Johnson | Software Engineer |  tejohnson@google.com |  408-460-2413
>> >> >
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

