[Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations
e.menezes at samsung dot com
gcc-bugzilla@gcc.gnu.org
Thu Oct 9 21:56:00 GMT 2014
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Bug ID: 63503
Summary: [AArch64] A57 executes fused multiply-add poorly in some situations
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: e.menezes at samsung dot com
CC: spop at gcc dot gnu.org
Target: aarch64-*
Curious why Geekbench's {D,S}GEMM ran 8-9% slower when built with GCC than
when built with LLVM, I was baffled to find that the code emitted by GCC for
the innermost loop in the algorithm core is actually very good:
.L8:
        ldr     d2, [x8, w5, uxtw 3]
        ldr     d1, [x7, w5, uxtw 3]
        add     w5, w5, 1
        cmp     w5, w6
        fmadd   d0, d2, d1, d0
        bne     .L8
LLVM's code is not so neat:
.LBB0_10:
        ldr     d1, [x27, x22, lsl #3]
        ldr     d2, [x9, x22, lsl #3]
        fmul    d1, d1, d2
        fadd    d0, d0, d1
        add     w21, w21, #1
        add     x22, x22, #1
        cmp     w21, w24, uxtw
        b.ne    .LBB0_10
However, it runs faster.
Since both code sequences are semantically the same, apart from the
intermediate rounding that fmadd skips, methinks that the A57 microarchitecture
is performing tricks for discrete FP operations but not for fused multiply-add.
Whatever it is, it seems that fused multiply-add, and perhaps its cousins, is a
performance hit only when one depends on the result of a previous one, as in
this case, where each fmadd accumulates into the result of the fmadd from the
previous loop iteration.
I'll try to create a simple test-case, but, in the meantime, please chime in
with your thoughts.
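A minimal sketch of what such a test-case might look like, assuming the hot
loop is essentially a dot-product reduction. The function name `dot` and the
build flags mentioned afterwards are illustrative assumptions, not taken from
Geekbench:

```c
/* Hypothetical reduction kernel: every iteration adds a product into the
   same accumulator, so each update depends on the previous one, like the
   chain through d0 in the loops above. */
double dot(const double *a, const double *b, unsigned n)
{
    double acc = 0.0;

    for (unsigned i = 0; i < n; i++)
        acc += a[i] * b[i];  /* may be contracted into one fmadd, or left
                                as fmul + fadd, depending on -ffp-contract */
    return acc;
}
```

Built with `gcc -O2 -mcpu=cortex-a57`, the default `-ffp-contract=fast` should
let GCC contract the multiply and add into an fmadd like the one in .L8, while
adding `-ffp-contract=off` should keep a separate fmul and fadd, as in the LLVM
loop. Timing both builds over long arrays would show whether the loop-carried
fmadd alone accounts for the slowdown.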