Bug 117562 - [15 Regression] 40% slowdown of 482.sphinx3 on Zen4, Zen5 since r15-5120-g9a62c149589103
Summary: [15 Regression] 40% slowdown of 482.sphinx3 on Zen4, Zen5 since r15-5120-g9a6...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 15.0
: P3 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks: spec
  Show dependency treegraph
 
Reported: 2024-11-13 15:02 UTC by Filip Kastl
Modified: 2024-11-27 01:25 UTC (History)
4 users (show)

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-11-19 00:00:00


Attachments
preprocessed source (19.82 KB, text/plain)
2024-11-21 13:55 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Filip Kastl 2024-11-13 15:02:16 UTC
As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=992.280.0

there was a 40% exec time slowdown of 482.sphinx3 SPEC 2006 benchmark.  I bisected the slowdown to r15-5120-g9a62c149589103.

The command line options are -Ofast -march=native -flto.  I've seen the slowdown on AMD Zen4 and Zen5 machines.  I've seen slowdowns without LTO too but those aren't as extreme (8 or 9%).
Comment 1 Richard Biener 2024-11-14 09:31:52 UTC
Needs some analysis, -mtune-crtl=^avx512_two_epilogues should help with that.
Comment 2 Hongtao Liu 2024-11-19 09:03:38 UTC
My guess there's a lower-tripcount(< 128bit vector) hot loop, avx512_two_epilogues only takes more cmp/jcc instructions but doesn't execute any real vector instructions.
Comment 3 Richard Biener 2024-11-19 09:17:10 UTC
I'll do actual benchmarking and see what happens during stage3.
Comment 4 Richard Biener 2024-11-21 13:53:16 UTC
 34.37%        419183  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
  17.28%        212804  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0

it's the inlined copy of vector_gautbl_eval_logs3 where we now seem to
hit vectorized code for

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
        }

never for the zmm version, few cycles on the ymm and now very many on
the xmm version

   772 │        vmovaps       -0xc0(%rbp),%xmm6                                                                                             ▒
     4 │        vmulpd        %xmm13,%xmm13,%xmm13                                                                                          ▒
       │        vmulpd        %xmm1,%xmm1,%xmm1                                                                                             ▒
102664 │        vfnmadd132pd  %xmm14,%xmm3,%xmm13                                                                                           ▒
     1 │        vcvtps2pd     %xmm6,%xmm3                                                                                                   ▒
       │      diff2 = x[i] - m2[i];                                                                                                         ▒
     3 │        vmovaps       -0xd0(%rbp),%xmm6                                                                                             ▒
       │      dval1 -= diff1 * diff1 * v1[i];                                                                                               ▒
 95472 │        vfnmadd132pd  %xmm1,%xmm13,%xmm3

We end up also applying basic-block
vectorization to the scalar loop via the stores after it:

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;
        if (dval2 < gautbl->distfloor)
            dval2 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
        score[r+1] = (int32)(f * dval2);

interestingly without LTO this specific instance doesn't behave that badly
but is faster with the three epilogues:

  12.63%        132184  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3                           ▒
  11.55%        119447  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3 

in turn a similar case elsewhere shows up:

  27.63%        285537  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] mgau_eval                                          ◆
  15.19%        158646  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] mgau_eval                                

that's basically exactly the same loop kernel.

What we can see here is

   130 │       vmovhps       %xmm4,0x60(%rsp)                                                                                               ▒
     5 │       vmovhps       %xmm2,0x70(%rsp)                                                                                               ▒
       │       vmovaps       0x70(%rsp),%xmm0                                                                                               ▒
 10552 │       vcvtps2pd     %xmm4,%xmm6                                                                                                    ▒
  1204 │       vcvtps2pd     %xmm2,%xmm3                                                                                                    ▒
       │       vcvtps2pd     %xmm0,%xmm2                                                                                                    ▒
  6872 │       vmovaps       0x60(%rsp),%xmm0                                                                                               ▒
  1598 │       vmulpd        %xmm3,%xmm3,%xmm3                                                                                              ▒
     5 │       vmulpd        %xmm2,%xmm2,%xmm2                                                                                              ▒
 15119 │       vfnmadd132pd  %xmm6,%xmm1,%xmm3                                                                                              ▒
    38 │       vcvtps2pd     %xmm0,%xmm1                                                                                                    ▒
  5088 │       vfnmadd132pd  %xmm2,%xmm3,%xmm1                                                                                              ▒
 33818 │       vunpckhpd     %xmm1,%xmm1,%xmm2                                                                                              ▒
 39938 │       vaddpd        %xmm1,%xmm2,%xmm2                                                                                              ▒
 37932 │       test          $0x3,%cl          

There is an odd spill/reload at 0x60(%rsp) which I can't see where it's coming from.

Could it be that V2SFmode is spilled as V2SFmode but reloaded as V4SFmode?!

#(insn:TI 1161 1175 1578 15 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
#            (parallel [
#                    (const_int 6 [0x6])
#                    (const_int 7 [0x7])
#                    (const_int 2 [0x2])
#                    (const_int 3 [0x3])
#                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
#     (expr_list:REG_DEAD (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566])
#        (nil)))
        vmovhps %xmm2, 112(%rsp)        # 1161  [c=4 l=8]  sse_movhlps/4
#(insn 1578 1161 1173 15 (set (reg:V4SF 20 xmm0 [853])
#        (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])) "cont_mgau.c":157:9 2354 {movv4sf_internal}
#     (nil))
        vmovaps 112(%rsp), %xmm0        # 1578  [c=10 l=8]  movv4sf_internal/3

Huh.  It looks like this is from a V4SF -> 2xV2DF extension via
vec_unpack_{hi,lo}_expr.

Originally this is

(insn 1161 1160 1162 58 (set (reg:V4SF 853)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
                (reg:V4SF 566 [ vect__811.229 ]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
        (nil)))

but LRA chooses the memory alternative for the destination:

      Choosing alt 4 in insn 1161:  (0) m  (1) 0  (2) v {sse_movhlps}
         Considering alt=0 of insn 1162:   (0) =v  (1) v
          overall=6,losers=1,rld_nregs=1
      Choosing alt 0 in insn 1162:  (0) =v  (1) v {sse2_cvtps2pd}
      Creating newreg=998 from oldreg=853, assigning class ALL_SSE_REGS to r998
 1162: r568:V2DF=float_extend(vec_select(r998:V4SF,parallel))
    Inserting insn reload before:
 1578: r998:V4SF=r853:V4SF


This is the following pattern:

(define_insn "sse_movhlps"
  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
    (vec_select:V4SF
      (vec_concat:V8SF
        (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
        (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
      (parallel [(const_int 6)
             (const_int 7)
             (const_int 2)
             (const_int 3)])))]
  "TARGET_SSE && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   movhlps\t{%2, %0|%0, %2}
   vmovhlps\t{%2, %1, %0|%0, %1, %2}
   movlps\t{%H2, %0|%0, %H2}
   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
   %vmovhps\t{%2, %0|%q0, %2}"
  [(set_attr "isa" "noavx,avx,noavx,avx,*")
   (set_attr "type" "ssemov2")
   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])

indeed the "mode" attr says V2SF for the memory (store) alternative but
this is for a V4SFmode.  Also LRA doesn't seem to understand that
the match_operand:1 should be the same memory.

(insn 1161 1160 1578 59 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (nil))

it's also odd that I don't see the spill to (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)(const_int 112 [0x70])) that LRA would need to generate for the input operand.

Clearly something is odd here and clearly this alternative is bad.
Comment 5 Richard Biener 2024-11-21 13:55:29 UTC
Created attachment 59662 [details]
preprocessed source

This is the mgau_eval from the non-LTO slowdown.  When you build with
-Ofast -g -fopt-info-vec -march=znver4 you'll see this vmovhlps + vmovaps
"spill"/reload pair.

I didn't yet attempt to reduce further (register pressure is needed for the spill).
Comment 6 Richard Biener 2024-11-21 14:10:39 UTC
Removing the alternative fixes the slowdown, both LTO and non-LTO, the multi-epilogue version is now faster.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 72acd5bde5e..0a3ee38f55a 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -11045,11 +11045,11 @@
 })
 
 (define_insn "sse_movhlps"
-  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
+  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v")
        (vec_select:V4SF
          (vec_concat:V8SF
-           (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
-           (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
+           (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v")
+           (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o"))
          (parallel [(const_int 6)
                     (const_int 7)
                     (const_int 2)
@@ -11059,12 +11059,11 @@
    movhlps\t{%2, %0|%0, %2}
    vmovhlps\t{%2, %1, %0|%0, %1, %2}
    movlps\t{%H2, %0|%0, %H2}
-   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
-   %vmovhps\t{%2, %0|%q0, %2}"
-  [(set_attr "isa" "noavx,avx,noavx,avx,*")
+   vmovlps\t{%H2, %1, %0|%0, %1, %H2}"
+  [(set_attr "isa" "noavx,avx,noavx,avx")
    (set_attr "type" "ssemov2")
-   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
-   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])
+   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
+   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
 
 (define_expand "sse_movlhps_exp"
   [(set (match_operand:V4SF 0 "nonimmediate_operand")
Comment 7 Hongtao Liu 2024-11-22 05:12:45 UTC
> Huh.  It looks like this is from a V4SF -> 2xV2DF extension via
> vec_unpack_{hi,lo}_expr.
> 
> Originally this is
> 
> (insn 1161 1160 1162 58 (set (reg:V4SF 853)
>         (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
>                 (reg:V4SF 566 [ vect__811.229 ]))
>             (parallel [
>                     (const_int 6 [0x6])
>                     (const_int 7 [0x7])
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                 ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
>      (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
>         (nil)))
> 
vec_unpacks_hi_v4sf create an unintialized (reg:V4SF 853), I guess it may confuse LRA to allocate a mem for it.

10413(define_expand "vec_unpacks_hi_v4sf"
10414  [(set (match_dup 2)
10415   (vec_select:V4SF
10416     (vec_concat:V8SF
10417       (match_dup 2)   ------------ here, use without initialization.
10418       (match_operand:V4SF 1 "vector_operand"))
10419     (parallel [(const_int 6) (const_int 7)
10420                (const_int 2) (const_int 3)])))
10421  (set (match_operand:V2DF 0 "register_operand")
10422   (float_extend:V2DF
10423     (vec_select:V2SF
10424       (match_dup 2)
10425       (parallel [(const_int 0) (const_int 1)]))))]
10426  "TARGET_SSE2"
10427  "operands[2] = gen_reg_rtx (V4SFmode);")



Explicitly initialize the operands[2] with 0, the spill is gone, Could you try below patch to see if it also fixed ths issue?

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index e50355f4839..e1ee688d5e4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -10500,7 +10500,10 @@ (define_expand "vec_unpacks_hi_v4sf"
        (match_dup 2)
        (parallel [(const_int 0) (const_int 1)]))))]
   "TARGET_SSE2"
-  "operands[2] = gen_reg_rtx (V4SFmode);")
+{
+  operands[2] = gen_reg_rtx (V4SFmode);
+  emit_move_insn (operands[2], CONST0_RTX (V4SFmode));
+})

 (define_expand "vec_unpacks_hi_v8sf"
   [(set (match_dup 2)
Comment 8 Hongtao Liu 2024-11-22 05:21:56 UTC
> vec_unpacks_hi_v4sf create an unintialized (reg:V4SF 853), I guess it may
> confuse LRA to allocate a mem for it.

For simple case
void
foo (double* a, float* b, int n)
{
    for (int i = 0; i != n; i++)
      a[i] = b[i];
}

RA works ok, there's no extra spill there.
Comment 9 Richard Biener 2024-11-22 08:01:10 UTC
(In reply to Hongtao Liu from comment #8)
> > vec_unpacks_hi_v4sf create an unintialized (reg:V4SF 853), I guess it may
> > confuse LRA to allocate a mem for it.
> 
> For simple case
> void
> foo (double* a, float* b, int n)
> {
>     for (int i = 0; i != n; i++)
>       a[i] = b[i];
> }
> 
> RA works ok, there's no extra spill there.

Yeah, it needs enough register pressure to not have the extra reg here.

I think the proposed patch in comment#7 might be good on its own as it
avoids a false dependence on prior register contents (if not optimizing
for size).

It does fix the benchmark regression as well.

I do wonder about the usefulness of the memory alternative on the
sse_movhlps pattern though, there's the sse_storehps pattern which
also models the store part more precisely as V2SFmode.  Is
sse_movhlps_exp ever invoked with a memory destination?

That said, if the memory alternative stays we might want to mark it with '$'
so it's never chosen when the then memory operand needs a reload?
Comment 10 Hongtao Liu 2024-11-22 08:26:55 UTC
> 
> I do wonder about the usefulness of the memory alternative on the
> sse_movhlps pattern though, there's the sse_storehps pattern which
> also models the store part more precisely as V2SFmode.  Is
> sse_movhlps_exp ever invoked with a memory destination?
> 

Like this?

typedef float v4sf __attribute__((vector_size(16)));
void
foo (v4sf a, v4sf* b)
{
    *b = __builtin_shufflevector (*b, a, 0, 1, 4, 5);
}


foo(float __vector(4), float __vector(4)*):
        movlps  QWORD PTR [rdi+8], xmm0     # 11      [c=4 l=3]  sse_movlhps/4
        ret       # 19        [c=0 l=1]  simple_return_internal
Comment 11 Richard Biener 2024-11-22 08:36:57 UTC
(In reply to Hongtao Liu from comment #10)
> > 
> > I do wonder about the usefulness of the memory alternative on the
> > sse_movhlps pattern though, there's the sse_storehps pattern which
> > also models the store part more precisely as V2SFmode.  Is
> > sse_movhlps_exp ever invoked with a memory destination?
> > 
> 
> Like this?
> 
> typedef float v4sf __attribute__((vector_size(16)));
> void
> foo (v4sf a, v4sf* b)
> {
>     *b = __builtin_shufflevector (*b, a, 0, 1, 4, 5);
> }
> 
> 
> foo(float __vector(4), float __vector(4)*):
>         movlps  QWORD PTR [rdi+8], xmm0     # 11      [c=4 l=3] 
> sse_movlhps/4
>         ret       # 19        [c=0 l=1]  simple_return_internal

Indeed.  Btw, I checked and

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 72acd5bde5e..6cd0d932bd9 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -11045,7 +11045,7 @@
 })
 
 (define_insn "sse_movhlps"
-  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
+  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,$m")
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")

does _not_ fix the regression (but maybe I did it wrong).
Comment 12 GCC Commits 2024-11-25 02:15:07 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:ba4cf2e296d8d5950c3d356fa6b6efcad00d0189

commit r15-5639-gba4cf2e296d8d5950c3d356fa6b6efcad00d0189
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
Comment 13 GCC Commits 2024-11-25 08:45:09 UTC
The releases/gcc-14 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:4a63cc6de77481878ec31e1e6ac30e22c50b063a

commit r14-10979-g4a63cc6de77481878ec31e1e6ac30e22c50b063a
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)
Comment 14 Richard Biener 2024-11-25 09:03:36 UTC
Fixed.
Comment 15 GCC Commits 2024-11-26 01:10:11 UTC
The releases/gcc-13 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:0eb8c19cb45fc004b7039fa22ff9021604d80dbc

commit r13-9216-g0eb8c19cb45fc004b7039fa22ff9021604d80dbc
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)
Comment 16 GCC Commits 2024-11-26 01:10:42 UTC
The releases/gcc-12 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:89a27cf6b1354cc80d834d71f7a3aa137d605e94

commit r12-10832-g89a27cf6b1354cc80d834d71f7a3aa137d605e94
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)