This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Maintain loop iteration count estimates
- From: Richard Biener <rguenther at suse dot de>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Fri, 10 Jun 2016 14:50:56 +0200 (CEST)
- Subject: Re: Maintain loop iteration count estimates
- Authentication-results: sourceware.org; auth=none
- References: <20160609095842 dot GA98613 at kam dot mff dot cuni dot cz> <alpine dot LSU dot 2 dot 11 dot 1606091317210 dot 1493 at t29 dot fhfr dot qr> <20160609123459 dot GA89386 at kam dot mff dot cuni dot cz> <alpine dot LSU dot 2 dot 11 dot 1606101206020 dot 13824 at t29 dot fhfr dot qr> <20160610124512 dot GA78024 at kam dot mff dot cuni dot cz>
On Fri, 10 Jun 2016, Jan Hubicka wrote:
> > > I always interpreted the estimated number of iterations to be the same as
> > > expected number of iterations and to be same as average. So it seems to be
> > > sane to feed the info from profile.
> > >
> > > I am thinking to add the histograms, yes. It is midly anoying to do so becuase
> > > one needs to intrstrument all exit edges out of loop. I guess we don't care
> > > much if histogram will get lost on abnormals.
> >
> > Can't we just instrument the loop header (and thus get "averages" for all
> > exits)?
>
> You need to know number of iterations, so either you save it into a global and
> flush into profile next time edge header->preheader is executed or you instrument
> exists. At least when one wants something more precise than average iteration count
> I think.
> >
> > > One option I am thinking about is to introduce counter that will take two
> > > parameters A,B. It will record linear histogram for values in range 0...A and
> > > logarithmic histogram for values greater than A capping by 2^B (so we don't
> > > need 64 counters for every loop). I think we are not really that interested in
> > > precise histograms for loops that iterate more than, say, 2^10 times. We only
> > > need to know that it loops a lot. We however care about low iteration counts
> > > to make good decision for peeling. This is still 26 counters per loop that is
> > > quite a lot.
> >
> > I think what we are interested in is "clusters" of niters. So yes,
> > having a histogram for low values (can we use HIST_TYPE_POW2 here?)
> > makes sense. Or simply sum all n <= 16 and all n > 16 values separately?
> > We seem to lack a histogram type where we can specify N differently-sized
> > bins.
>
> HIST_TYPE_POW2 measure how often value is exactly power of two and we don't
> have histogram counter at all, so we need to introduce one ;)
>
> We have HIST_TYPE_INTERVAL that is real linear histogram for given interval of values.
> >
> > > A lot cheaper alternative may be to simply measure loop peeling limit by
> > > having counter that counts how often loop exits in first
> > > PARAM_MAX_PEEL_TIMES iterations and second counter that measure what is
> > > the maximal number of iterations in this case (we really want likely
> > > maximal number of iterations but that seems harder to get). This will
> > > determine peeling limit which we can then store into loop structure.
> >
> > Sure.
> >
> > In my original loop value-profiling work the most interesting (for
> > performance) case was value-profiling variable strides and then
> > versioning for stride == 1. That triggered a lot with fortran
> > array descriptors though nowadays IPA-CP of aggregates plus LTO
> > might have solved this.
>
> ipa-cp is not very powerful on propagating those. Measuring strides is
> interesting idea. I forgot about your work on the loop histograms. Do you
> still have patches around?
No, but those weren't complete anyway as we didn't preserve loops
back in time. I think I only implemented versioning assuming the
stride would be one unconditionally to show the possible benefit
(it was for the 2006 Summit).
Richard.
> >
> > > Vectorizer/unroller/prefetcher and most of other classical loop opts
> > > care about the average being large enough so the current expected value
> > > seems to do the trick.
> >
> > Yes, though knowing that a significant fraction of times the iteration
> > count is one would also be useful.
>
> Yes, if we interval profile the low iteration counts, we will have exact
> info on 0,1...n intrations.
> >
> > > This does not let you to update profile very precisely after peeling (and also peeling
> > > done by vectorizer), but it needs only 2 counters per loop.
> > >
> > > What other passes, beside peeling, would immediately benefit from a
> > > historgram? I wonder if you can think of better scheme?
> >
> > Epilogue peeling in the vectorizer (for the remainder of iterations,
> > thus profiling N % vectorization-factor which means a value-profile
> > of niter & <low-bits>).
>
> Hmm, to predict that we will need to know the exact value modulo vectorization
> factor. This seems bit hard to measure an I am not sure how improtant it is.
> >
> > Peeling for alignment profitability (we do too much of that IMHO),
> > thus value-profiling of <read1-addr> | <read2-addr> | ...
> > and <write1-addr> | <write2-addr> | ... (not sure if that would
> > be fine-grained enough - we want to know if peeling for alignment
> > will likely make many accesses aligned, not just the one we peel for).
>
> Peeling for alingmnet can probably be predicted by the power of 2 predictor - that
> is already used by string routines. For that we need to identify which alignments
> we want to test at the vrp time and probably we can avoid alignmnet peeling in favour
> of versioning for loops that work on aligned data in practice.
>
> Honza
> >
> > > >
> > > > > Currently we use profile each time estimate_numbers_of_iterations_loop
> > > > > is called to recompute this value. This is not very safe because the
> > > > > profile may be misupdated. It seems safer to compute it at once and
> > > > > maintain thorough the compilation.
> > > > >
> > > > > Notice that I removed:
> > > > > - /* Force estimate compuation but leave any existing upper bound in place. */
> > > > > - loop->any_estimate = false;
> > > > > From beggining of estimate_numbers_of_iterations_loop. I can not make sense of
> > > > > this. Even w/o profile if we have estimate, we are better to maintain it because
> > > > > later we may not be able to derive it again.
> > > > > There seems to be no code that is forced by setting loop->any_estimate = true.
> > > > > Only code that cares seems to be record_niter_bound that only decreases existing
> > > > > estimates. THis seems sane procedure - we don't roll loops.
> > > > >
> > > > > Bootstrapped/regtested x86_64-linux, OK?
> > > >
> > > > Ok. Did you check what this does to SPEC with FDO?
> > >
> > > I didn't do full SPEC run with FDO, only tested tramp3d and xalancbmk I have
> > > readily available. Any regressions would however point to loop info updating
> > > bugs (or wrong use of the profile) so I will look into them if they appear.
> > > I am trying to benchmark firefox now.
> >
> > Ok.
> >
> > Thanks,
> > Richard.
>
>
--
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)