[Bug tree-optimization/88760] GCC unrolling is suboptimal

Wed Jan 9 13:59:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #7 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #6)
> On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
> > (In reply to Wilco from comment #4)
> > > (In reply to ktkachov from comment #2)
> > > > Created attachment 45386 [details]
> > > > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > > > 
> > > > I'm attaching the full LLVM aarch64 output.
> > > > 
> > > > The output you quoted is with -funroll-loops. If that's not given, GCC
> > > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > > > testing).
> > > > 
> > > > Is there anything we can do to make the default unrolling a bit more
> > > > aggressive?
> > > 
> > > I don't think the RTL unroller works at all. It doesn't have the right
> > > settings, and doesn't understand how to unroll, so we always get inefficient
> > > and bloated code.
> > > 
> > > To do unrolling correctly it has to be integrated at tree level - for
> > > example when vectorization isn't possible/beneficial, unrolling might still
> > > be a good idea.
> > 
> > To add some numbers to the conversation, the gain LLVM gets from default
> > unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.
> > 
> > This clearly shows there is huge potential from unrolling, *if* we can teach
> > GCC to unroll properly like LLVM. That means early unrolling, using good
> > default settings and using a trailing loop rather than inefficient peeling.
> 
> I don't see why this cannot be done on RTL where we have vastly more
> information of whether there are execution resources that can be
> used by unrolling.  Note we also want unrolling to interleave
> instructions to not rely on pre-reload scheduling which in turn means
> having a good eye on register pressure (again sth not very well handled
> on GIMPLE)

The main issue is that the other loop optimizations are done on tree, so
things like addressing-mode selection, loop-invariant motion and CSE run on
the non-unrolled version. When we then unroll in RTL, we end up with code that
is far from optimal. A typical unrolled loop starts like this:

        add     x13, x2, 1
        add     x14, x2, 2
        add     x11, x2, 3
        add     x10, x2, 4
        ldr     w30, [x4, x13, lsl 2]
        add     x9, x2, 5
        add     x5, x2, 6
        add     x12, x2, 7
        ldr     d23, [x3, x2, lsl 3]
        ... rest of unrolled loop

So basically it creates a new induction variable for every unrolled copy in
the loop. This often leads to spills, simply because of the many redundant
address-computation instructions. It also blocks scheduling between
iterations, since alias analysis doesn't appear to understand simple constant
differences between indices.
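
To make the difference concrete, here is a rough source-level sketch in C. It
is not the loop from the bug report: the function names, the dot-product body
and the unroll factor of 4 are all made up for illustration, and a compiler is
free to generate identical code for both unrolled variants. The first variant
mirrors the shape of the assembly above (one index per unrolled copy); the
second keeps a single induction variable and reaches the other copies through
constant offsets. Both use a trailing loop for the leftover iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the original rolled loop (hypothetical example body). */
long long dot_rolled(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);   /* caller must pass valid arrays */
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long long)a[i] * b[i];
    return sum;
}

/* Shape of the problematic output: every unrolled copy gets its own
   index, which costs an extra add and a live register per copy. */
long long dot_unrolled_many_ivs(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long long sum = 0;
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        size_t i1 = i + 1, i2 = i + 2, i3 = i + 3;   /* one index per copy */
        sum += (long long)a[i]  * b[i];
        sum += (long long)a[i1] * b[i1];
        sum += (long long)a[i2] * b[i2];
        sum += (long long)a[i3] * b[i3];
    }
    for (; i < n; i++)                /* trailing loop for the remainder */
        sum += (long long)a[i] * b[i];
    return sum;
}

/* Desired shape: one induction variable, with the other copies addressed
   at constant offsets that can fold into the load's immediate field. */
long long dot_unrolled_one_iv(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long long sum = 0;
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        const int *pa = a + i, *pb = b + i;          /* single base per array */
        sum += (long long)pa[0] * pb[0];
        sum += (long long)pa[1] * pb[1];
        sum += (long long)pa[2] * pb[2];
        sum += (long long)pa[3] * pb[3];
    }
    for (; i < n; i++)                /* trailing loop for the remainder */
        sum += (long long)a[i] * b[i];
    return sum;
}
```

In the second form the constant differences between the accesses are explicit,
which is the information that scheduling across the unrolled copies needs.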

So unrolling should definitely be done at a high level just like vectorization.