This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Loop peeling
- From: Jan Hubicka <hubicka at ucw dot cz>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Tejas Belagod <tejas dot belagod at arm dot com>, Evandro Menezes <e dot menezes at samsung dot com>, Jan Hubicka <hubicka at ucw dot cz>, GCC Development <gcc at gcc dot gnu dot org>
- Date: Wed, 29 Oct 2014 15:44:46 +0100
- Subject: Re: Loop peeling
- Authentication-results: sourceware.org; auth=none
- References: <033101cff2c7$96bff550$c43fdff0$ at samsung dot com> <CAFiYyc23OhPmV3DJa9z62DB5jwR1JWKowqQDKRyCbrBiKai0xA at mail dot gmail dot com> <5450D521 dot 1060500 at arm dot com> <CAFiYyc29-qAL_iwTfq4XCkM3Nps4+du5bSzezZyNNZAcwA-_cg at mail dot gmail dot com>
> On Wed, Oct 29, 2014 at 12:53 PM, Tejas Belagod <firstname.lastname@example.org> wrote:
> > On 29/10/14 09:32, Richard Biener wrote:
> >> On Tue, Oct 28, 2014 at 4:55 PM, Evandro Menezes <email@example.com>
> >> wrote:
> >>> While doing some benchmark flag mining on AArch64, I noticed that
> >>> -fpeel-loops often turned up as a mined option. As a matter of fact,
> >>> when using it always, even without FDO, it seemed to improve most
> >>> benchmarks and to leave almost all of the rest flat, with a barely
> >>> noticeable cost in code size. It seems to me that it might be safe
> >>> enough to be implied, perhaps at -O3. Is there any reason why this
> >>> never came into being?
> > Loop peeling is done by default on AArch64 unless, IIRC,
> > -fvect-cost-model=cheap is specified which switches it off. There was a
> > general thread on loop peeling around the same time last year
> > (https://gcc.gnu.org/ml/gcc/2013-11/msg00307.html) where Richard suggested
> > that peeling vs. non-peeling should be factored into the vector cost model
> > and is a more generic improvement.
> Oh, you are talking about the vectorizer prologue/epilogue loops, where we
> know a (low) upper bound on the number of iterations. I think that
> is enabled by default at -O3, as it is a "complete peeling" operation.
> Only regular peeling, which looks at the _estimated_ loop trip count
> (peeling that number of times), is guarded by -fpeel-loops.
This is something that has not been revisited since 2003, when loop peeling was
introduced. My basic intuition at that time was that, by default, it makes sense to
optimize for loops with large trip counts (i.e. unroll), but not for
loops with small trip counts. While the average trip count of a loop is about 6
iterations, the low-trip-count loops tend to be off the hot spot or hard to
identify. For that reason, loop peeling was done only with profile feedback or
with an explicit command-line option.
The 2003 benchmarks are here: http://www.ucw.cz/~hubicka/papers/amd64/node4.html
They show a 0.66% speedup for SPECfp (and a slowdown for SPECint), where almost all
of it comes from wupwise, which looks like an anomaly (wupwise tended to be rather
random; I do not recall whether I figured out the reason for the speedup at that time).
The code-size cost was about 0.07%.
There are definitely things that have changed in the last decade :). We have some
infrastructure to identify loops with low trip counts (and I added logic to
complete unrolling to "peel" those that are known to iterate just a few times),
and we are better at globally optimizing the peeled code.
What kind of data do you have about loop peeling helping performance in general?
Did you measure it with the new tree-level loop peeling pass?