Inefficient loop unrolling.

Thu Jul 3 13:48:00 GMT 2008

Steven,
I just created a bug report. You should receive a CC'ed mail shortly.

I can see that these issues are solvable at the RTL level, but doing so
would require a lot of effort. The main optimization in the loop
unrolling pass, split iv, can shorten the dependence chain, but it does
not remove the extra ADDs or fix the alias issue. What is the main
reason that loop unrolling should belong at the RTL level? Is it
fundamental?
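
To make the split iv point concrete, here is a minimal source-level
sketch in C. It is only an illustration of the idea, not the unroller's
actual RTL output and not the test case in the bug report; the function
names, the unroll factor of 4, and the assumption that n is a multiple
of 4 are all made up for the example.

```c
#include <stddef.h>

/* Naive unrolling: every copy of the body bumps the same induction
   variable, so each index depends on the previous ADD (a serial chain). */
long sum_chained(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    while (i < n) {               /* assumes n is a multiple of 4 */
        s += a[i]; i = i + 1;
        s += a[i]; i = i + 1;
        s += a[i]; i = i + 1;
        s += a[i]; i = i + 1;
    }
    return s;
}

/* After splitting the induction variable: each copy's index is computed
   directly from the value at the top of the iteration, so the indices no
   longer depend on one another. The ADDs themselves are still there. */
long sum_split(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i += 4) {   /* assumes n is a multiple of 4 */
        size_t i1 = i + 1;
        size_t i2 = i + 2;
        size_t i3 = i + 3;
        s += a[i];
        s += a[i1];
        s += a[i2];
        s += a[i3];
    }
    return s;
}
```

The second form shortens the chain, but i1, i2 and i3 still cost an ADD
each unless they are folded into the addressing mode as constant
offsets.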

Cheers,
Bingfeng

-----Original Message-----
From: Steven Bosscher [mailto:stevenb.gcc@gmail.com] 
Sent: 02 July 2008 17:01
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Inefficient loop unrolling.

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with a manually unrolled loop, even for a very simple loop.
> The following are a simple loop and its manually unrolled version. I
> didn't apply any tricks to the manually unrolled one; it is an exact
> replication of the original loop body. I expected that, with
> -funroll-loops, the first version would produce code of similar
> quality to the second one. However, compiled with mainline GCC for an
> ARM target, the two functions produce very different results.
>
> The GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers, so extra ADD instructions are needed
> to increase the offset on each iteration. In contrast, the manually
> unrolled code makes efficient use of immediate offsets and needs only
> one ADD to adjust the base register at the end. Second, the alias
> (dependence) analysis is overly conservative. The LOAD instruction of
> the next unrolled iteration cannot be moved above the previous STORE
> instruction, even though they clearly do not alias. I suspect the
> failure of alias analysis is related to the first issue, the handling
> of base and offset addresses. The .sched2 file shows that the first
> loop body requires 57 cycles whereas the second one takes 50 cycles
> for arm9 (56 cycles vs. 34 cycles for XScale). It becomes even worse
> for our VLIW port due to the longer latency of MUL and load
> instructions and the inability to fill all slots (120 cycles vs. 20
> cycles).

Both issues should be solvable in RTL (where unrolling belongs, IMHO).
If you file a PR in bugzilla (with this test case, target, compiler
options, etc.), I promise I will analyze why we don't fold away the
ADDs, and why the scheduler doesn't glob the loads (maybe due to bad
alias analysis, but maybe something else is not working properly).
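
For readers without the test case at hand, the kind of loop pair under
discussion can be sketched in C roughly as follows. This is a
hypothetical example, not the code from the original mail or the PR:
the function names, the multiply-by-constant body, the unroll factor of
4, and the use of restrict to state that the arrays do not overlap are
all assumptions made for illustration.

```c
#include <stddef.h>

/* Rolled form: the loop one would hand to the compiler's unroller
   (for example with -funroll-loops). */
void scale_rolled(int *restrict dst, const int *restrict src,
                  int k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Manually unrolled by 4 (assumes n is a multiple of 4). Every access
   uses a constant offset from the base pointers, and the bases are
   adjusted once at the end of each iteration. */
void scale_unrolled4(int *restrict dst, const int *restrict src,
                     int k, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        dst[0] = src[0] * k;
        dst[1] = src[1] * k;
        dst[2] = src[2] * k;
        dst[3] = src[3] * k;
        dst += 4;
        src += 4;
    }
}
```

In the second form the four loads are independent of the four stores,
so a scheduler with precise enough alias information is free to group
the loads ahead of the stores.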

Gr.
Steven