[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

Mon Sep 21 10:35:20 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Richard, though register moves are resolved by renaming, they still occupy a
uop in all stages except execution, and since renaming is one of the narrowest
points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number
of uops generally helps.

In Michael's case the actual memory address has two operands:

<       vmovapd %ymm1, %ymm10
<       vmovapd %ymm1, %ymm11
<       vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
<       vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
---
> 	vmovupd	(%rdx,%rax), %ymm10
> 	vmovupd	(%rcx,%rax), %ymm11
> 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> 	vfnmadd231pd	%ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
renaming (because otherwise there would be too many operands to handle). Hence
the original code has 4 uops after decoding, 6 uops before renaming, and the
transformed code has 4 uops before renaming. Execution handles 4 uops in both
cases.
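
The counts above can be tallied with a small sketch. It is only bookkeeping
under the assumptions stated in this comment: every instruction decodes to one
fused-domain uop, the indexed-address FMA splits into two uops before renaming,
and the register moves take no execution uop. The per-instruction tuples and
the RENAME_WIDTH constant are illustrative, not measured, and the cycle figure
is a rough lower bound for the rename stage alone.

```python
# Rename width quoted above for Intel: up to 4 uops per cycle.
RENAME_WIDTH = 4

# Each entry: (instruction, uops after decode, uops at rename, uops executed).
# These values encode the assumptions from the comment, not measurements.
original = [
    ("vmovapd %ymm1, %ymm10",                      1, 1, 0),  # move: renamed, not executed
    ("vmovapd %ymm1, %ymm11",                      1, 1, 0),
    ("vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10",    1, 2, 2),  # unlaminated: load + FMA
    ("vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11",    1, 2, 2),
]
transformed = [
    ("vmovupd (%rdx,%rax), %ymm10",                1, 1, 1),  # plain load
    ("vmovupd (%rcx,%rax), %ymm11",                1, 1, 1),
    ("vfnmadd231pd %ymm1, %ymm9, %ymm10",          1, 1, 1),  # register-only FMA
    ("vfnmadd231pd %ymm1, %ymm7, %ymm11",          1, 1, 1),
]

def totals(seq):
    """Return (decoded, renamed, executed) uop totals for a sequence."""
    decoded = sum(entry[1] for entry in seq)
    renamed = sum(entry[2] for entry in seq)
    executed = sum(entry[3] for entry in seq)
    return decoded, renamed, executed

def rename_cycles(seq):
    """Lower bound on cycles spent in rename, ignoring every other limit."""
    return totals(seq)[1] / RENAME_WIDTH

for name, seq in (("original", original), ("transformed", transformed)):
    print(name, totals(seq), rename_cycles(seq))
```

Under these assumptions the two sequences differ only in the renaming column,
which is where the narrow 4-wide stage applies.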

FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots