[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
wdijkstr at arm dot com
gcc-bugzilla@gcc.gnu.org
Thu Oct 23 00:31:00 GMT 2014
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #19 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #16)
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized, however
> > LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
>
> You're right. LLVM produces:
>
> .LBB0_1: // %vector.body
> // =>This Inner Loop Header: Depth=1
> add x11, x9, x8
> add x12, x10, x8
> ldp q2, q3, [x11]
> ldp q4, q5, [x12]
> add x8, x8, #32 // =32
> fmla v0.2d, v2.2d, v4.2d
> fmla v1.2d, v3.2d, v5.2d
> cmp x8, #128, lsl #12 // =524288
> b.ne .LBB0_1
>
> And GCC:
>
> .L3:
> ldr q2, [x2, x0]
> add w1, w1, 1
> ldr q1, [x3, x0]
> cmp w1, w4
> add x0, x0, 16
> fmla v0.2d, v2.2d, v1.2d
> bcc .L3
>
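For reference, the difference between the two listings above can be sketched
in C. This is only an illustration: it assumes the kernel is a plain dot
product over doubles (suggested by the fmla on two loaded vectors, but not
confirmed here), and the names dot_single and dot_dual are made up for the
example.

```c
#include <stddef.h>

/* Hypothetical kernel (assumption): a plain dot product.
   Single accumulator: every multiply-add depends on the previous one. */
double dot_single(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled twice with two independent accumulators, so the two
   multiply-adds in the body do not depend on each other. This
   reassociates the floating-point sum, which is why a compiler may
   only do it on its own under -ffast-math (or -Ofast). */
double dot_dual(const double *a, const double *b, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                  /* odd tail element */
        sum0 += a[i] * b[i];
    return sum0 + sum1;
}
```

The second form is roughly the scalar shape of the LLVM loop, which keeps
separate accumulators in v0 and v1; the first corresponds to the GCC loop
with its single accumulator v0.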
> > I still don't see what this has to do with A57. You should open a generic
> > bug about GCC not applying basic loop optimizations with -O3 (in fact
> > limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.
>
> Note above that LLVM loads the registers in pairs, while GCC, when it
> unrolls the loop (more aggressively, BTW), loads each vector
> individually:
Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html
> .L3:
> ldr q28, [x15, x16]
> add x17, x16, 16
> ldr q29, [x14, x16]
> add x0, x16, 32
> ldr q30, [x15, x17]
> add x18, x16, 48
> ldr q31, [x14, x17]
> add x1, x16, 64
> ...
> fmla v27.2d, v28.2d, v29.2d
> ...
> fmla v27.2d, v30.2d, v31.2d
> ... # Rest of 8x unroll
> bcc .L3
>
> It also goes without saying that this code could benefit from the
> post-increment addressing mode.
Yes, I've noticed bad addressing like that, and fixes are in progress. It's an
issue in iv-opt: even without post-increment enabled, the obvious addressing
mode to use is an immediate offset.
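To illustrate what immediate-offset addressing would look like at the source
level, here is a minimal sketch of a 4x-unrolled loop. It assumes a dot
product over doubles, and dot_unrolled4 is a hypothetical name. It only shows
the intended access pattern; whether the generated code really uses base plus
immediate addressing still depends on iv-opt.

```c
#include <stddef.h>

/* Hypothetical 4x-unrolled dot product (assumption, for illustration).
   Every load in the main loop is "base pointer + constant offset", so
   only the two base pointers need updating once per iteration instead
   of computing a separate index register for each load. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const double *pa = a, *pb = b;
    const double *end = a + (n & ~(size_t)3);   /* whole groups of four */

    for (; pa != end; pa += 4, pb += 4) {
        s0 += pa[0] * pb[0];
        s1 += pa[1] * pb[1];
        s2 += pa[2] * pb[2];
        s3 += pa[3] * pb[3];
    }
    /* Remaining 0 to 3 elements. */
    for (const double *last = a + n; pa != last; pa++, pb++)
        s0 += *pa * *pb;

    return (s0 + s1) + (s2 + s3);
}
```

With this shape there are two address updates per iteration, rather than one
add per load as in the quoted 8x-unrolled loop.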