[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu Jan 27 07:42:49 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
      Component    |tree-optimization           |rtl-optimization
             CC    |                            |vmakarov at gcc dot gnu.org
       Keywords    |                            |ra
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I see a lot more GPR <-> XMM moves in the 'after' case:
1035 : 401c8b: vaddsd %xmm1,%xmm0,%xmm0
1953 : 401c8f: vmovq %rcx,%xmm1
305 : 401c94: vaddsd %xmm8,%xmm1,%xmm1
3076 : 401c99: vmovq %xmm0,%r14
590 : 401c9e: vmovq %r11,%xmm0
267 : 401ca3: vmovq %xmm1,%r8
136 : 401ca8: vmovq %rdx,%xmm1
448 : 401cad: vaddsd %xmm1,%xmm0,%xmm1
1703 : 401cb1: vmovq %xmm1,%r9 (*)
834 : 401cb6: vmovq %r8,%xmm1
1719 : 401cbb: vmovq %r9,%xmm0 (*)
2782 : 401cc0: vaddsd %xmm0,%xmm1,%xmm1
22135 : 401cc4: vmovsd %xmm1,%xmm1,%xmm0
1261 : 401cc8: vmovq %r14,%xmm1
646 : 401ccd: vaddsd %xmm0,%xmm1,%xmm0
18136 : 401cd1: vaddsd %xmm2,%xmm5,%xmm1
629 : 401cd5: vmovq %xmm1,%r8
142 : 401cda: vaddsd %xmm6,%xmm3,%xmm1
177 : 401cde: vmovq %xmm0,%r14
288 : 401ce3: vmovq %xmm1,%r9
177 : 401ce8: vmovq %r8,%xmm1
174 : 401ced: vmovq %r9,%xmm0
Those look like RA / spilling artifacts; IIRC Hongtao posted patches in this
area (to regcprop, I think?). The code above is definitely bad: for example,
the two instructions marked (*) swap %xmm0 and %xmm1 via %r9.
The function is LBM_performStreamCollide. The sinking pass does nothing wrong
per se; it moves unconditionally executed statements such as
- _948 = _861 + _867;
- _957 = _944 + _948;
- _912 = _861 + _873;
...
- _981 = _853 + _865;
- _989 = _977 + _981;
- _916 = _853 + _857;
- _924 = _912 + _916;
into a conditionally executed block. But that increases register pressure
by 5 FP regs (if I counted correctly) in that area. So this would be the
usual issue of GIMPLE transforms not being register-pressure aware.
-fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this
(though I think EBB scheduling cannot undo such movement).
In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0 S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
        (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
     (nil))
which suggests we could have reloaded %xmm5 directly from .LC10. But the
spilling to GPRs is already present after LRA, and cprop_hardreg doesn't do
anything bad either.
The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].
We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC,
and the interesting thing is that when I disable them I do see some spilling
to the stack, but also quite a few re-materialized constants (loads from .LC*,
as seen in the missed opportunity above).
It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec, and to find a
way to set costs so that IRA/LRA prefers re-materializing constants from the
constant pool over spilling to GPRs (if that's possible at all - Vlad?)