[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
already5chosen at yahoo dot com
gcc-bugzilla@gcc.gnu.org
Fri Nov 25 13:19:37 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #16 from Michael_S <already5chosen at yahoo dot com> ---
On an unrelated note, why does the loop overhead use so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would write
it as:
sub %rax, %rdx
.L3:
vmovupd (%rdx,%rax), %ymm1
vmovupd 32(%rdx,%rax), %ymm0
vfmadd213pd 32(%rax), %ymm3, %ymm1
vfnmadd213pd (%rax), %ymm2, %ymm0
vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
vmovupd %ymm0, (%rax)
vmovupd %ymm1, 32(%rax)
addq $64, %rax
decl %esi
jnz .L3
The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
vs. 5 x86 instructions == 4 macro-ops in the gcc variant.
Also, in the gcc variant all memory accesses have a displacement, which makes
each of them 1 byte longer. In my variant only half of the accesses have a
displacement.
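For readers who prefer C to assembly, here is a rough sketch of the addressing idea only. It is not taken from the bug report: the function name axpy_one_index and the scalar update y[i] -= c * x[i] are hypothetical stand-ins for the vector loop. It assumes a flat address space in which converting a uintptr_t back to a pointer yields a usable pointer, which is implementation-defined in C but holds on mainstream x86-64 targets.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for the vector loop: y[i] -= c * x[i].
   Only y advances. x is reached through a constant byte offset from y,
   and the trip count lives in a separate down-counter. */
static void axpy_one_index(double *y, const double *x, double c, unsigned n)
{
    /* Byte distance between the two arrays. It is computed on integers
       because subtracting pointers into different objects is undefined
       in C. This mirrors "sub %rax, %rdx" before the loop. */
    uintptr_t diff = (uintptr_t)x - (uintptr_t)y;

    if (n == 0)
        return;

    do {
        /* Same role as the (%rdx,%rax) operand: moving base + fixed offset. */
        const double *xi = (const double *)((uintptr_t)y + diff);
        *y -= c * *xi;
        y++;                /* the single pointer increment */
    } while (--n != 0);     /* decrement and branch while non-zero */
}
```

Whether a compiler actually turns this into the three-instruction loop tail is not guaranteed; induction-variable optimization is free to rewrite the loop in another form.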
I think that in the past I have seen cases where gcc generated optimal or
near-optimal code sequences for loop overhead. I wonder why it cannot do so
here.