As seen here https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=992.280.0 there was a 40% exec time slowdown of the 482.sphinx3 SPEC 2006 benchmark.  I bisected the slowdown to r15-5120-g9a62c149589103.  The command line options are -Ofast -march=native -flto.  I've seen the slowdown on AMD Zen4 and Zen5 machines.  I've also seen slowdowns without LTO, but those aren't as extreme (8 or 9%).
Needs some analysis; -mtune-ctrl=^avx512_two_epilogues should help with that.
My guess is there's a low-trip-count (< 128-bit vector) hot loop, where avx512_two_epilogues only adds more cmp/jcc instructions but doesn't execute any real vector instructions.
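To make that guess concrete, the shape in mind would be something like the following (a hypothetical sketch, not code taken from sphinx3; the function name and the small-n assumption are made up):

/* Hypothetical sketch (not from sphinx3): a hot kernel whose trip count is
   small relative to the 512-bit vector width.  For small n the zmm main
   loop is never entered, so with an extra vector epilogue each call mostly
   pays for additional trip-count checks (cmp/jcc) before reaching the
   short epilogue that does the real work.  */
float
dot_short (const float *a, const float *b, int n)   /* n typically < 16 */
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}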
I'll do actual benchmarking and see what happens during stage3.
  34.37%        419183  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
  17.28%        212804  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0

It's the inlined copy of vector_gautbl_eval_logs3 where we now seem to hit vectorized code for

  for (i = 0; i < veclen; i++) {
    diff1 = x[i] - m1[i];
    dval1 -= diff1 * diff1 * v1[i];
    diff2 = x[i] - m2[i];
    dval2 -= diff2 * diff2 * v2[i];
  }

We never hit the zmm version, spend few cycles in the ymm version and now very many in the xmm version:

    772 │      vmovaps      -0xc0(%rbp),%xmm6
      4 │      vmulpd       %xmm13,%xmm13,%xmm13
        │      vmulpd       %xmm1,%xmm1,%xmm1
 102664 │      vfnmadd132pd %xmm14,%xmm3,%xmm13
      1 │      vcvtps2pd    %xmm6,%xmm3
        │    diff2 = x[i] - m2[i];
      3 │      vmovaps      -0xd0(%rbp),%xmm6
        │    dval1 -= diff1 * diff1 * v1[i];
  95472 │      vfnmadd132pd %xmm1,%xmm13,%xmm3

We also end up applying basic-block vectorization to the scalar loop via the stores after it:

  if (dval1 < gautbl->distfloor) dval1 = gautbl->distfloor;
  if (dval2 < gautbl->distfloor) dval2 = gautbl->distfloor;

  score[r] = (int32)(f * dval1);
  score[r+1] = (int32)(f * dval2);

Interestingly, without LTO this specific instance doesn't behave that badly but is faster with the three epilogues:

  12.63%        132184  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3
  11.55%        119447  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3

In turn a similar case shows up elsewhere:

  27.63%        285537  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] mgau_eval
  15.19%        158646  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] mgau_eval

That's basically exactly the same loop kernel.  What we can see here is

    130 │      vmovhps      %xmm4,0x60(%rsp)
      5 │      vmovhps      %xmm2,0x70(%rsp)
        │      vmovaps      0x70(%rsp),%xmm0
  10552 │      vcvtps2pd    %xmm4,%xmm6
   1204 │      vcvtps2pd    %xmm2,%xmm3
        │      vcvtps2pd    %xmm0,%xmm2
   6872 │      vmovaps      0x60(%rsp),%xmm0
   1598 │      vmulpd       %xmm3,%xmm3,%xmm3
      5 │      vmulpd       %xmm2,%xmm2,%xmm2
  15119 │      vfnmadd132pd %xmm6,%xmm1,%xmm3
     38 │      vcvtps2pd    %xmm0,%xmm1
   5088 │      vfnmadd132pd %xmm2,%xmm3,%xmm1
  33818 │      vunpckhpd    %xmm1,%xmm1,%xmm2
  39938 │      vaddpd       %xmm1,%xmm2,%xmm2
  37932 │      test         $0x3,%cl

There is an odd spill/reload at 0x60(%rsp) which I can't see where it's coming from.  Could it be that V2SFmode is spilled as V2SFmode but reloaded as V4SFmode?!

#(insn:TI 1161 1175 1578 15 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
#            (parallel [
#                    (const_int 6 [0x6])
#                    (const_int 7 [0x7])
#                    (const_int 2 [0x2])
#                    (const_int 3 [0x3])
#                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
#     (expr_list:REG_DEAD (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566])
#        (nil)))
        vmovhps %xmm2, 112(%rsp)        # 1161  [c=4 l=8]  sse_movhlps/4
#(insn 1578 1161 1173 15 (set (reg:V4SF 20 xmm0 [853])
#        (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])) "cont_mgau.c":157:9 2354 {movv4sf_internal}
#     (nil))
        vmovaps 112(%rsp), %xmm0        # 1578  [c=10 l=8]  movv4sf_internal/3

Huh.  It looks like this is from a V4SF -> 2xV2DF extension via vec_unpack_{hi,lo}_expr.
Originally this is

(insn 1161 1160 1162 58 (set (reg:V4SF 853)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
                (reg:V4SF 566 [ vect__811.229 ]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
        (nil)))

but LRA chooses the memory alternative for the destination:

         Choosing alt 4 in insn 1161:  (0) m  (1) 0  (2) v {sse_movhlps}
            Considering alt=0 of insn 1162:   (0) =v  (1) v
          overall=6,losers=1,rld_nregs=1
         Choosing alt 0 in insn 1162:  (0) =v  (1) v {sse2_cvtps2pd}
      Creating newreg=998 from oldreg=853, assigning class ALL_SSE_REGS to r998
 1162: r568:V2DF=float_extend(vec_select(r998:V4SF,parallel))
    Inserting insn reload before:
 1578: r998:V4SF=r853:V4SF

This is the following pattern:

(define_insn "sse_movhlps"
  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
            (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
          (parallel [(const_int 6)
                     (const_int 7)
                     (const_int 2)
                     (const_int 3)])))]
  "TARGET_SSE && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   movhlps\t{%2, %0|%0, %2}
   vmovhlps\t{%2, %1, %0|%0, %1, %2}
   movlps\t{%H2, %0|%0, %H2}
   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
   %vmovhps\t{%2, %0|%q0, %2}"
  [(set_attr "isa" "noavx,avx,noavx,avx,*")
   (set_attr "type" "ssemov2")
   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])

Indeed the "mode" attr says V2SF for the memory (store) alternative but this is for a V4SFmode.  Also LRA doesn't seem to understand that the match_operand:1 should be the same memory.

(insn 1161 1160 1578 59 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (nil))

It's also odd that I don't see the spill to (mem/c:V4SF (plus:DI (reg/f:DI 7 sp) (const_int 112 [0x70]))) that LRA would need to generate for the input operand.  Clearly something is odd here and clearly this alternative is bad.
Created attachment 59662 [details]
preprocessed source

This is the mgau_eval from the non-LTO slowdown.  When you build with -Ofast -g -fopt-info-vec -march=znver4 you'll see this vmovhlps + vmovaps "spill"/reload pair.  I didn't yet attempt to reduce further (register pressure is needed for the spill).
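For reference, the kernel in question has roughly the following shape (a hand-written sketch based on the snippets quoted earlier, not the attached or a reduced source; the types are guessed from the float-to-double converts in the assembly):

typedef int int32;

/* Sketch of the vector_gautbl_eval_logs3 / mgau_eval kernel shape: two
   double accumulators over float data, followed by the scalar flooring
   and the int stores that additionally get basic-block vectorized.  */
void
eval_pair_sketch (const float *x, const float *m1, const float *v1,
                  const float *m2, const float *v2, int veclen,
                  double distfloor, double f, int32 *score, int r)
{
  double dval1 = 0.0, dval2 = 0.0, diff1, diff2;
  int i;

  for (i = 0; i < veclen; i++)
    {
      diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];
      diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }

  if (dval1 < distfloor) dval1 = distfloor;
  if (dval2 < distfloor) dval2 = distfloor;

  score[r] = (int32) (f * dval1);
  score[r + 1] = (int32) (f * dval2);
}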
Removing the alternative fixes the slowdown, both LTO and non-LTO; the multi-epilogue version is now faster.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 72acd5bde5e..0a3ee38f55a 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -11045,11 +11045,11 @@
 })
 
 (define_insn "sse_movhlps"
-  [(set (match_operand:V4SF 0 "nonimmediate_operand" "=x,v,x,v,m")
+  [(set (match_operand:V4SF 0 "nonimmediate_operand" "=x,v,x,v")
        (vec_select:V4SF
          (vec_concat:V8SF
-           (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
-           (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
+           (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v")
+           (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o"))
          (parallel [(const_int 6)
                     (const_int 7)
                     (const_int 2)
@@ -11059,12 +11059,11 @@
    movhlps\t{%2, %0|%0, %2}
    vmovhlps\t{%2, %1, %0|%0, %1, %2}
    movlps\t{%H2, %0|%0, %H2}
-   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
-   %vmovhps\t{%2, %0|%q0, %2}"
-  [(set_attr "isa" "noavx,avx,noavx,avx,*")
+   vmovlps\t{%H2, %1, %0|%0, %1, %H2}"
+  [(set_attr "isa" "noavx,avx,noavx,avx")
    (set_attr "type" "ssemov2")
-   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
-   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])
+   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
+   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
 
 (define_expand "sse_movlhps_exp"
   [(set (match_operand:V4SF 0 "nonimmediate_operand")
> Huh.  It looks like this is from a V4SF -> 2xV2DF extension via
> vec_unpack_{hi,lo}_expr.
> 
> Originally this is
> 
> (insn 1161 1160 1162 58 (set (reg:V4SF 853)
>         (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
>                 (reg:V4SF 566 [ vect__811.229 ]))
>             (parallel [
>                     (const_int 6 [0x6])
>                     (const_int 7 [0x7])
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                 ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
>      (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
>         (nil)))

vec_unpacks_hi_v4sf creates an uninitialized (reg:V4SF 853); I guess it may confuse LRA into allocating a mem for it.

(define_expand "vec_unpacks_hi_v4sf"
  [(set (match_dup 2)
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_dup 2)            ------------ here, used without initialization
            (match_operand:V4SF 1 "vector_operand"))
          (parallel [(const_int 6) (const_int 7)
                     (const_int 2) (const_int 3)])))
   (set (match_operand:V2DF 0 "register_operand")
        (float_extend:V2DF
          (vec_select:V2SF
            (match_dup 2)
            (parallel [(const_int 0) (const_int 1)]))))]
  "TARGET_SSE2"
  "operands[2] = gen_reg_rtx (V4SFmode);")

Explicitly initializing operands[2] with 0 makes the spill go away.  Could you try the patch below and see if it also fixes the issue?

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index e50355f4839..e1ee688d5e4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -10500,7 +10500,10 @@ (define_expand "vec_unpacks_hi_v4sf"
             (match_dup 2)
             (parallel [(const_int 0) (const_int 1)]))))]
   "TARGET_SSE2"
-  "operands[2] = gen_reg_rtx (V4SFmode);")
+{
+  operands[2] = gen_reg_rtx (V4SFmode);
+  emit_move_insn (operands[2], CONST0_RTX (V4SFmode));
+})
 
 (define_expand "vec_unpacks_hi_v8sf"
   [(set (match_dup 2)
> vec_unpacks_hi_v4sf creates an uninitialized (reg:V4SF 853); I guess it may
> confuse LRA into allocating a mem for it.

For a simple case

void
foo (double* a, float* b, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i];
}

RA works ok; there's no extra spill there.
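An untested variant with more values live at once, intended to push register pressure up (whether it is enough to actually reproduce the spill hasn't been checked):

/* Untested sketch: several float-to-double conversions live at the same
   time, to raise register pressure compared to the simple case above.  */
void
foo2 (double *a, const float *b, const float *c, const float *d, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = (double) b[i] * (double) c[i]
           + (double) b[i] * (double) d[i]
           + (double) c[i] * (double) d[i];
}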
(In reply to Hongtao Liu from comment #8)
> > vec_unpacks_hi_v4sf creates an uninitialized (reg:V4SF 853); I guess it may
> > confuse LRA into allocating a mem for it.
> 
> For a simple case
> 
> void
> foo (double* a, float* b, int n)
> {
>   for (int i = 0; i != n; i++)
>     a[i] = b[i];
> }
> 
> RA works ok; there's no extra spill there.

Yeah, it needs enough register pressure so that there's no free register for the extra reg here.

I think the proposed patch in comment#7 might be good on its own as it avoids a false dependence on prior register contents (if not optimizing for size).  It does fix the benchmark regression as well.

I do wonder about the usefulness of the memory alternative on the sse_movhlps pattern though; there's the sse_storehps pattern which also models the store part more precisely as V2SFmode.  Is sse_movhlps_exp ever invoked with a memory destination?

That said, if the memory alternative stays we might want to mark it with '$' so it's never chosen when the memory operand would then need a reload?
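(As an aside, the plain high-half store on its own can be written with the _mm_storeh_pi intrinsic; the snippet below is only meant to illustrate the V2SF-sized store being discussed, and it hasn't been verified that it goes through sse_storehps specifically.)

#include <xmmintrin.h>

/* Illustration only: store the high 64 bits (two floats) of a vector,
   i.e. the operation the V2SF-sized store alternative is about.  */
void
store_high_half (__m64 *p, __m128 v)
{
  _mm_storeh_pi (p, v);
}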
> 
> I do wonder about the usefulness of the memory alternative on the
> sse_movhlps pattern though; there's the sse_storehps pattern which
> also models the store part more precisely as V2SFmode.  Is
> sse_movhlps_exp ever invoked with a memory destination?
> 

Like this?

typedef float v4sf __attribute__((vector_size(16)));
void
foo (v4sf a, v4sf* b)
{
  *b = __builtin_shufflevector (*b, a, 0, 1, 4, 5);
}

foo(float __vector(4), float __vector(4)*):
        movlps  QWORD PTR [rdi+8], xmm0         # 11    [c=4 l=3]  sse_movlhps/4
        ret                                     # 19    [c=0 l=1]  simple_return_internal
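An untested guess at the analogous memory-destination case for sse_movhlps itself (it may well end up matching a different pattern such as sse_storehps instead):

typedef float v4sf __attribute__((vector_size(16)));

/* Replace the low half of *b with the high half of a while keeping the
   high half of *b; this is the vec_select {6,7,2,3} shape of sse_movhlps
   with a memory destination.  Untested whether it matches sse_movhlps/4.  */
void
bar (v4sf a, v4sf *b)
{
  *b = __builtin_shufflevector (*b, a, 6, 7, 2, 3);
}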
(In reply to Hongtao Liu from comment #10)
> > 
> > I do wonder about the usefulness of the memory alternative on the
> > sse_movhlps pattern though; there's the sse_storehps pattern which
> > also models the store part more precisely as V2SFmode.  Is
> > sse_movhlps_exp ever invoked with a memory destination?
> > 
> 
> Like this?
> 
> typedef float v4sf __attribute__((vector_size(16)));
> void
> foo (v4sf a, v4sf* b)
> {
>   *b = __builtin_shufflevector (*b, a, 0, 1, 4, 5);
> }
> 
> foo(float __vector(4), float __vector(4)*):
>         movlps  QWORD PTR [rdi+8], xmm0         # 11    [c=4 l=3]  sse_movlhps/4
>         ret                                     # 19    [c=0 l=1]  simple_return_internal

Indeed.  Btw, I checked and

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 72acd5bde5e..6cd0d932bd9 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -11045,7 +11045,7 @@
 })
 
 (define_insn "sse_movhlps"
-  [(set (match_operand:V4SF 0 "nonimmediate_operand" "=x,v,x,v,m")
+  [(set (match_operand:V4SF 0 "nonimmediate_operand" "=x,v,x,v,$m")
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")

does _not_ fix the regression (but maybe I did it wrong).
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:ba4cf2e296d8d5950c3d356fa6b6efcad00d0189

commit r15-5639-gba4cf2e296d8d5950c3d356fa6b6efcad00d0189
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
The releases/gcc-14 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:4a63cc6de77481878ec31e1e6ac30e22c50b063a

commit r14-10979-g4a63cc6de77481878ec31e1e6ac30e22c50b063a
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)
Fixed.
The releases/gcc-13 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:0eb8c19cb45fc004b7039fa22ff9021604d80dbc

commit r13-9216-g0eb8c19cb45fc004b7039fa22ff9021604d80dbc
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)
The releases/gcc-12 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:89a27cf6b1354cc80d834d71f7a3aa137d605e94

commit r12-10832-g89a27cf6b1354cc80d834d71f7a3aa137d605e94
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 21 23:57:38 2024 -0800

    Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.
    
    It could cause weired spill in RA when register pressure is high.
    
    gcc/ChangeLog:
    
            PR target/117562
            * config/i386/sse.md (vec_unpacks_hi_v4sf): Initialize
            operands[2] with CONST0_RTX.
    
    (cherry picked from commit ba4cf2e296d8d5950c3d356fa6b6efcad00d0189)