This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Loop peeling


> On Wed, Oct 29, 2014 at 12:53 PM, Tejas Belagod <tejas.belagod@arm.com> wrote:
> > On 29/10/14 09:32, Richard Biener wrote:
> >>
> >> On Tue, Oct 28, 2014 at 4:55 PM, Evandro Menezes <e.menezes@samsung.com>
> >> wrote:
> >>>
> >>> While doing some benchmark flag mining on AArch64, I noticed that
> >>> -fpeel-loops was often among the mined options.  As a matter of fact,
> >>> when using it always, even without FDO, it seemed to improve most
> >>> benchmarks and to leave almost all of the rest flat, with a barely
> >>> noticeable cost in code size.  It seems to me that it might be safe
> >>> enough to be implied, perhaps at -O3.  Is there any reason why this
> >>> never came into being?
> >
> >
> > Loop peeling is done by default on AArch64 unless, IIRC,
> > -fvect-cost-model=cheap is specified, which switches it off.  There was
> > a general thread on loop peeling around the same time last year
> > (https://gcc.gnu.org/ml/gcc/2013-11/msg00307.html) where Richard
> > suggested that the peeling vs. non-peeling decision should be factored
> > into the vector cost model, which would be a more generic improvement.
> 
> Oh, you are talking about the vectorizer pro-/epilogue loops where we
> know a (low) upper bound for the number of iterations.  I think that
> is enabled by default at -O3, as it is a "complete peeling" operation.
> Only regular peeling, which looks at the _estimated_ loop trip count
> (peeling that number of times), is guarded by -fpeel-loops.
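
To make the distinction concrete, here is a minimal sketch (the loops and
function names are made up for illustration, not taken from any benchmark):

    /* Vectorizer epilogue: after vectorizing by a factor of 4, the scalar
       remainder loop below runs at most 3 times.  Because that upper bound
       is known at compile time, the loop can be completely peeled -- the
       transformation enabled by default at -O3.  */
    void add_tail (float *a, const float *b, int n)
    {
      for (int i = n & ~3; i < n; i++)
        a[i] += b[i];
    }

    /* Regular peeling: here no bound is known, only an _estimated_ trip
       count (e.g. from profile feedback).  Copying that many iterations
       in front of the loop is what -fpeel-loops guards.  */
    void add_all (float *a, const float *b, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] += b[i];
    }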

Regular peeling has not been revisited since 2003, when it was introduced.
My basic intuition at the time was that it makes sense to optimize loops
with large trip counts by default (i.e. unroll them) but not loops with
small trip counts: while the average trip count of a loop is about 6
iterations, loops with low trip counts tend to be off the hot spot or hard
to identify.  For that reason, loop peeling was done only with profile
feedback or with an explicit command-line option.
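
What the transformation does, as a rough sketch (a hypothetical example;
assume the profile says the loop usually runs about two iterations):

    /* Before peeling: */
    int sum_before (const int *a, int n)
    {
      int sum = 0;
      for (int i = 0; i < n; i++)
        sum += a[i];
      return sum;
    }

    /* After peeling two iterations: the peeled copies are exposed to the
       surrounding code for further optimization, and the loop handles
       only the (rarely executed) remaining iterations.  */
    int sum_after (const int *a, int n)
    {
      int sum = 0;
      if (n > 0)
        {
          sum += a[0];
          if (n > 1)
            {
              sum += a[1];
              for (int i = 2; i < n; i++)
                sum += a[i];
            }
        }
      return sum;
    }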

The 2003 benchmarks are here: http://www.ucw.cz/~hubicka/papers/amd64/node4.html
They list a 0.66% speedup for SPECfp (and a slowdown for SPECint), almost
all of which comes from wupwise, which looks like an anomaly (wupwise
tended to be rather random; I do not recall whether I figured out the
reason for the speedup at the time).  The code-size cost was about 0.07%.

There are definitely things that have changed in the last decade :).  We
have some infrastructure to identify loops with low trip counts (and I
added logic to complete unrolling to "peel" those that are known to
iterate just a few times), and we are better at globally optimizing the
peeled code.
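
The complete-unrolling case looks roughly like this (again a made-up
example):

    /* The trip count is provably 3, so instead of keeping the loop the
       unroller emits the three iterations as straight-line code, which
       the rest of the pipeline can then optimize globally:

         d[0] = s[0] * 2;
         d[1] = s[1] * 2;
         d[2] = s[2] * 2;  */
    void scale3 (int *d, const int *s)
    {
      for (int i = 0; i < 3; i++)
        d[i] = s[i] * 2;
    }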

What kind of data do you have about loop peeling helping performance in
general?  Did you measure it with the new tree-level loop peeling pass?

Honza

