[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
amonakov at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Sep 21 10:35:20 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Richard, though register moves are resolved by renaming, they still occupy a
uop in all stages except execution, and since renaming is one of the narrowest
points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number
of uops generally helps.
In Michael's code the actual memory address has two operands (base and index):
< vmovapd %ymm1, %ymm10
< vmovapd %ymm1, %ymm11
< vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10
< vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11
---
> vmovupd (%rdx,%rax), %ymm10
> vmovupd (%rcx,%rax), %ymm11
> vfnmadd231pd %ymm1, %ymm9, %ymm10
> vfnmadd231pd %ymm1, %ymm7, %ymm11
The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
renaming (because otherwise there would be too many operands to handle). Hence
the original code has 4 uops after decoding, 6 uops before renaming, and the
transformed code has 4 uops before renaming. Execution handles 4 uops in both
cases.
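The counting above can be written out as a small tally. This is only a sketch of the arithmetic, not a pipeline simulator: the per-instruction uop counts are assumptions taken from the reasoning in this comment (reg-reg moves are eliminated at rename, an FMA with an indexed memory operand is unlaminated into two uops), and the names Insn, totals and min_rename_cycles are made up for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Insn:
    text: str
    decoded: int   # fused-domain uops after decoding
    renamed: int   # uops that take a rename slot (after unlamination)
    executed: int  # uops that need an execution port


# Assumed per-instruction counts, following the comment's reasoning.
MOV_REG = dict(decoded=1, renamed=1, executed=0)          # eliminated at rename
LOAD = dict(decoded=1, renamed=1, executed=1)
FMA_REG = dict(decoded=1, renamed=1, executed=1)
FMA_MEM_INDEXED = dict(decoded=1, renamed=2, executed=2)  # load + FMA

original = [
    Insn("vmovapd %ymm1, %ymm10", **MOV_REG),
    Insn("vmovapd %ymm1, %ymm11", **MOV_REG),
    Insn("vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10", **FMA_MEM_INDEXED),
    Insn("vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11", **FMA_MEM_INDEXED),
]

transformed = [
    Insn("vmovupd (%rdx,%rax), %ymm10", **LOAD),
    Insn("vmovupd (%rcx,%rax), %ymm11", **LOAD),
    Insn("vfnmadd231pd %ymm1, %ymm9, %ymm10", **FMA_REG),
    Insn("vfnmadd231pd %ymm1, %ymm7, %ymm11", **FMA_REG),
]

RENAME_WIDTH = 4  # uops/cycle, the figure quoted above for Intel


def totals(seq):
    """Return (decoded, renamed, executed) uop totals for a sequence."""
    return (
        sum(i.decoded for i in seq),
        sum(i.renamed for i in seq),
        sum(i.executed for i in seq),
    )


def min_rename_cycles(seq):
    """Lower bound on cycles spent in rename for this fragment alone."""
    return ceil(totals(seq)[1] / RENAME_WIDTH)


if __name__ == "__main__":
    for name, seq in (("original", original), ("transformed", transformed)):
        print(name, totals(seq), min_rename_cycles(seq))
```

Under these assumptions the two sequences differ only in the rename column (6 vs 4), which is where the bottleneck would show up; the real loop contains more instructions, so the per-fragment cycle bound is only indicative.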
FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes
Michael, you can probably measure it for yourself with
perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots