[Bug tree-optimization/88760] GCC unrolling is suboptimal

Wed Jan 9 10:47:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
> 
> I'm attaching the full LLVM aarch64 output.
> 
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
> 
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

Well, the RTL loop unroller is not enabled by default at any
optimization level (unless you are using FDO).  There are also
related flags that are not enabled (-fsplit-ivs-in-unroller and
-fvariable-expansion-in-unroller).

The RTL loop unroller is simply not good at estimating the benefit
of unrolling (which is also why you usually see it unrolling
--param max-unroll-times times), and the tunables it has are
not very well tuned across targets.
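
For anyone who wants to look at the effect of these knobs themselves,
a minimal sketch follows.  The kernel and the file name kernel.c are
hypothetical, not taken from the testcase in this bug; only the
options named in this thread are used.

```c
#include <stddef.h>

/* Hypothetical reduction kernel, used only to give the unroller a
   simple counted loop to look at.  Not the testcase from this PR.  */
long sum_products(const int *a, const int *b, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)a[i] * b[i];
    return acc;
}
```

Comparing the assembly from "gcc -O2 -S kernel.c" with that from
"gcc -O2 -funroll-loops --param max-unroll-times=4 -S kernel.c"
should show whether the unroller fired and by how much.  What is
actually emitted depends on the GCC version and the target.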

Micha did quite extensive benchmarking (on x86_64) which shows that
the cases where unrolling is profitable are rare and the reason
is often hard to understand.

That's of course in the context of CPUs that have caches of
pre-decoded/fused/etc. instructions to optimize issue, which
makes peeled prologues expensive, as well as even more specialized
caches for small loops that avoid further frontend costs.
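
To make the "peeled prologue" cost concrete, here is a hand-written
sketch of the general shape of a loop unrolled by 4 when the trip
count is only known at run time.  This is an illustration written in
C under my own assumptions (the function name and the factor 4 are
made up), not the code GCC emits.

```c
#include <stddef.h>

/* Hand-unrolled by 4.  The first loop is the peeled prologue: it
   runs n % 4 iterations so that the remaining trip count is a
   multiple of 4.  Hypothetical example, not compiler output.  */
long sum_unrolled4(const int *a, size_t n)
{
    long acc = 0;
    size_t i = 0;
    size_t peel = n % 4;

    /* Peeled prologue: extra code that is executed at most three
       times but still occupies space in the instruction caches.  */
    for (; i < peel; i++)
        acc += a[i];

    /* Main body: n - peel is a multiple of 4, so no bounds check
       is needed between the four copies.  */
    for (; i < n; i += 4) {
        acc += a[i];
        acc += a[i + 1];
        acc += a[i + 2];
        acc += a[i + 3];
    }
    return acc;
}
```

The original loop was a handful of instructions; this version has two
loops and four copies of the body, which is the code growth that has
to be paid for by whatever the unrolled body saves.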

Not sure if arm archs have any of this.

I generally don't believe in unrolling as a separately profitable
transform.  Rather, unrolling could be done as part of another
transform (vectorization is the best example).  For something still
done on RTL, that would then include scheduling, which is where
the best cost estimates should be available (and if you do
this post-reload then you even have a very good idea of
register pressure).  This is also why I think a standalone
unrolling phase belongs on RTL, since I don't see a good way
of estimating cost/benefit on GIMPLE (see how difficult it is
to cost vectorization vs. non-vectorization there).