[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

Fri Nov 25 13:19:37 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #16 from Michael_S <already5chosen at yahoo dot com> ---
On unrelated note, why loop overhead uses so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would write
it as:
  sub %rax, %rdx
.L3:
  vmovupd   (%rdx,%rax), %ymm1
  vmovupd 32(%rdx,%rax), %ymm0
  vfmadd213pd    32(%rax), %ymm3, %ymm1
  vfnmadd213pd     (%rax), %ymm2, %ymm0
  vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
  vfnmadd231pd     (%rdx,%rax), %ymm2, %ymm1
  vmovupd %ymm0,   (%rax)
  vmovupd %ymm1, 32(%rax)
  addq    $64, %rax
  decl    %esi
  jb      .L3

The loop overhead in my variant is 3 x86 instructions==2 macro-ops,
vs 5 x86 instructions==4 macro-ops in gcc variant.
Also, in gcc variant all memory accesses have displacement that makes them
1 byte longer. In my variant only half of accesses have displacement.

I think, in the past I had seen cases where gcc generates optimal or
near-optimal
code sequences for loop overhead. I wonder why it can not do it here.