[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Thu Jan 27 07:42:49 GMT 2022


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |rtl-optimization
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|                            |ra

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I see a lot more GPR <-> XMM moves in the 'after' case:

    1035 :   401c8b: vaddsd %xmm1,%xmm0,%xmm0
    1953 :   401c8f: vmovq  %rcx,%xmm1
     305 :   401c94: vaddsd %xmm8,%xmm1,%xmm1
    3076 :   401c99: vmovq  %xmm0,%r14
     590 :   401c9e: vmovq  %r11,%xmm0
     267 :   401ca3: vmovq  %xmm1,%r8
     136 :   401ca8: vmovq  %rdx,%xmm1
     448 :   401cad: vaddsd %xmm1,%xmm0,%xmm1
    1703 :   401cb1: vmovq  %xmm1,%r9   (*)
     834 :   401cb6: vmovq  %r8,%xmm1
    1719 :   401cbb: vmovq  %r9,%xmm0   (*)
    2782 :   401cc0: vaddsd %xmm0,%xmm1,%xmm1
   22135 :   401cc4: vmovsd %xmm1,%xmm1,%xmm0
    1261 :   401cc8: vmovq  %r14,%xmm1
     646 :   401ccd: vaddsd %xmm0,%xmm1,%xmm0
   18136 :   401cd1: vaddsd %xmm2,%xmm5,%xmm1
     629 :   401cd5: vmovq  %xmm1,%r8
     142 :   401cda: vaddsd %xmm6,%xmm3,%xmm1
     177 :   401cde: vmovq  %xmm0,%r14
     288 :   401ce3: vmovq  %xmm1,%r9
     177 :   401ce8: vmovq  %r8,%xmm1
     174 :   401ced: vmovq  %r9,%xmm0

Those look like RA / spilling artifacts; IIRC Hongtao has been posting
patches in this area (to regcprop, I think?).  The above is definitely
bad: for example, the lines marked (*) swap %xmm0 and %xmm1 via %r9.
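
To spell out what the (*) pair does - a hand-written C reduction, mine
and not from the benchmark - the value round-trips through a GPR only
to end up in another XMM register:

/* Reduction of the (*) pattern: v is punned through a GPR for what is
   effectively an XMM-to-XMM copy; a single vmovapd %xmm1,%xmm0 would
   do instead.  */
static inline double
move_via_gpr (double v)
{
  unsigned long long bits;
  __builtin_memcpy (&bits, &v, sizeof bits);  /* vmovq %xmm1,%r9 */
  double r;
  __builtin_memcpy (&r, &bits, sizeof r);     /* vmovq %r9,%xmm0 */
  return r;
}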

The function is LBM_performStreamCollide.  The sinking pass itself does
nothing wrong; it moves the unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block.  But that increases register pressure
by 5 FP regs (if I counted correctly) in that area.  So this would be the
usual issue of GIMPLE transforms not being register-pressure aware.
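
A reduced illustration of the pressure effect (hand-written sketch, not
the actual lbm.c code): before sinking, only the sums are live across
the branch; after sinking, all of their operands are.

/* Before sinking: two values (s, t) live across the branch.  */
double
before_sink (double a, double b, double c, double d, int cond)
{
  double s = a + b;       /* computed unconditionally */
  double t = c + d;
  if (cond)
    return s * t;
  return 0.0;
}

/* After sinking: four values (a, b, c, d) live across the branch,
   so FP register pressure at the branch goes up.  */
double
after_sink (double a, double b, double c, double d, int cond)
{
  if (cond)
    {
      double s = a + b;
      double t = c + d;
      return s * t;
    }
  return 0.0;
}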

-fschedule-insns -fsched-pressure seems to mitigate this somewhat
(though I think EBB scheduling cannot undo such movement).

In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0  S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
         (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
      (nil))

which suggests we could have reloaded %xmm5 directly from .LC10.  But
the spilling to GPRs is already present after LRA, and cprop_hardreg
doesn't do anything bad either.
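
For reference, a hypothetical reduction of the constant-pool situation
(names and code mine, not from lbm.c):

/* c is materialized from the constant pool (.LC*).  When its live
   range crosses a region that needs all XMM registers, the RA can
   spill it - to the stack or, as seen above, to a GPR - or it can
   re-materialize it with a fresh vmovsd .LC*(%rip) load at the next
   use, which is what the missed reload above would amount to.  */
double
dot_with_const (const double *v, int n)
{
  const double c = 0.9939744999999999830464503247640095651;
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc += v[i] * c;      /* c wants to stay in an XMM reg here */
  return acc;
}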

The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].

We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC,
and the interesting thing is that when I disable both I do see some
spilling to the stack but also quite a few re-materialized constants
(loads from .LC*, as seen in the opportunity above).

It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find
a way to set up costs so that IRA/LRA prefer re-materializing constants
from the constant pool over spilling to GPRs (if that's possible at all -
Vlad?)
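
i.e. something along these lines (sketch of the invocation only; flags
as above, the SPEC harness details omitted):

gcc -Ofast -march=znver2 \
    -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec lbm.c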

