I have bisected a 10% performance regression of the SPEC 2006 FP 433.milc benchmark on AMD Zen 2, when compiled with -Ofast -march=native -flto, to commit r12-3893-g6390c5047adb75.

See also: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=412.70.0&plot.1=289.70.0&

As Richi asked, I am filing this bug even though I cannot reproduce the regression on either an AMD Zen 3 machine or an Intel Cascade Lake one, because of the history of this benchmark's performance and because I know milc can be sensitive to conditions outside our control. OTOH, the regression reproduces reliably for me.

Some relevant perf data:

BEFORE:

# Samples: 585K of event 'cycles:u'
# Event count (approx.): 472738682838
#
# Overhead       Samples  Command          Shared Object           Symbol
# ........  ............  ...............  ......................  .........................................
#
    24.59%        140397  milc_peak.mine-  milc_peak.mine-lto-nat  [.] u_shift_fermion
    18.47%        105497  milc_peak.mine-  milc_peak.mine-lto-nat  [.] add_force_to_mom
    15.97%         96343  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_na
    15.29%         90027  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_nn
     5.55%         35114  milc_peak.mine-  milc_peak.mine-lto-nat  [.] path_product
     4.75%         27693  milc_peak.mine-  milc_peak.mine-lto-nat  [.] compute_gen_staple
     2.76%         16109  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_an
     2.42%         14255  milc_peak.mine-  milc_peak.mine-lto-nat  [.] imp_gauge_force.constprop.0
     2.02%         11561  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_adj_su3_mat_4vec

AFTER:

# Samples: 634K of event 'cycles:u'
# Event count (approx.): 513635733685
#
# Overhead       Samples  Command          Shared Object           Symbol
# ........  ............  ...............  ......................  .........................................
#
    24.04%        149010  milc_peak.mine-  milc_peak.mine-lto-nat  [.] add_force_to_mom
    23.76%        147370  milc_peak.mine-  milc_peak.mine-lto-nat  [.] u_shift_fermion
    14.19%         90929  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_nn
    14.14%         92912  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_na
     4.90%         33846  milc_peak.mine-  milc_peak.mine-lto-nat  [.] path_product
     3.89%         24621  milc_peak.mine-  milc_peak.mine-lto-nat  [.] mult_su3_an
     3.62%         22831  milc_peak.mine-  milc_peak.mine-lto-nat  [.] compute_gen_staple
     2.05%         13215  milc_peak.mine-  milc_peak.mine-lto-nat  [.] imp_gauge_force.constprop.0
So for some unknown reason the more vectorized version of the function is slower. Note that all BB vectorization in this function happens when triggered from the loop vectorizer on the if-converted loop body, after loop vectorization itself failed.

One difference is

-make_ahmat.c:37:13: missed: desired vector type conflicts with earlier one for _334 = _5->c[0].real;
-make_ahmat.c:37:13: note: removing SLP instance operations starting from: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im = _45;
+make_ahmat.c:37:13: note: vect_compute_data_ref_alignment:
+make_ahmat.c:37:13: note: can't force alignment of ref: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im

and the extra vectorization has live lanes that we think we cannot reliably place:

+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract

which will cause the scalar definitions to be retained (but they will not be costed as removed then), possibly causing redundant computations and resource competition.

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

Before the change, for a much smaller portion of the block, we had

make_ahmat.c:40:60: note: Cost model analysis for part in loop 1:
  Vector cost: 224
  Scalar cost: 328

where the scalar/vector cost ratio was better.
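For illustration only (the real decision is made inside the vectorizer's BB cost model; the names below are just a sketch, not GCC internals), here is the kind of comparison the dump shows, together with the resulting ratios: the after-change region wins by a thin margin (~0.93), the before-change region by a much larger one (~0.68).

/* Hypothetical sketch of the profitability check reflected in the dump:
   BB SLP is kept when the summed vector cost is below the summed cost of
   the scalar stmts it replaces.  Names are illustrative only.  */
#include <stdio.h>

int bb_slp_profitable_p (unsigned vector_cost, unsigned scalar_cost)
{
  return vector_cost < scalar_cost;
}

int main (void)
{
  /* After the change (su3_proj.c:44): 776 vs 836 -> ratio ~0.93.  */
  printf ("after:  profitable=%d ratio=%.2f\n",
          bb_slp_profitable_p (776, 836), 776.0 / 836.0);
  /* Before the change (make_ahmat.c:40): 224 vs 328 -> ratio ~0.68.  */
  printf ("before: profitable=%d ratio=%.2f\n",
          bb_slp_profitable_p (224, 328), 224.0 / 328.0);
  return 0;
}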
Created attachment 51609 [details]
testcase

This testcase reproduces the vectorization difference with REV and REV^ when using -Ofast -march=znver2 [-fno-early-inlining]. Trunk still shows similar behavior.

The issue is likely code like

  vect__325.61_64 = MEM <vector(2) double> [(double *)_80];
  vect__325.62_60 = MEM <vector(2) double> [(double *)_80 + 16B];
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.62_60, 64, 0>;

where it would have been better to emit two scalar loads and combine them with a CTOR (or, in asm, use movsd + movhpd). But we end up with

.L5:
        vmovupd (%rcx), %xmm3
        vmovupd 16(%rcx), %xmm2
...
        addq    $48, %rcx
...
        vunpckhpd       %xmm2, %xmm3, %xmm7
...
        vfnmadd132sd    -48(%rcx), %xmm8, %xmm15

also scattered around (but that's not GIMPLE's fault). The vectorizer generates

  vectp.60_65 = &_6->c[0].real;
  vect__325.61_64 = MEM <vector(2) double> [(double *)vectp.60_65];
  vectp.60_61 = vectp.60_65 + 16;
  vect__325.62_60 = MEM <vector(2) double> [(double *)vectp.60_61];
  vectp.60_57 = vectp.60_65 + 32;
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.64_53, 64, 64>;

which is of course entirely reasonable in some sense (that's 12 + 12 + 4 + 4 cost; two scalar loads plus a CTOR would cost 12 + 12 + 8, but we would likely still generate and cost the scalar extract). Eventually it would be better to pattern-match the permute and demote the vector loads to scalar...

We also end up with

  vect__259.20_179 = .FNMA (_182, vect__326.17_185, _196);
  vect__368.19_180 = .FMA (_182, vect__326.17_185, _196);
  _178 = VEC_PERM_EXPR <vect__368.19_180, vect__259.20_179, { 0, 5, 2, 7 }>;

and unhandled

add_force_to_mom.c:72:60: note:   node 0x3ed37e0 (max_nunits=4, refcnt=2)
add_force_to_mom.c:72:60: note:   op template: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:     stmt 0 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:     stmt 1 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:     stmt 2 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:     stmt 3 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:     load permutation { 2 2 4 4 }
....
add_force_to_mom.c:72:60: note:   ==> examining statement: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: missed:   BB vectorization with gaps at the end of a load is not supported
add_force_to_mom.c:53:16: missed:   not vectorized: relevant stmt not supported: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:   Building vector operands of 0x3ed37e0 from scalars instead
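A minimal, self-contained sketch (not the milc code; the function names are made up) contrasting the two code-generation strategies discussed above for gathering the lane pair { p[0], p[2] }: two overlapping vector loads plus a permute, versus two scalar loads combined into one vector, which typically assembles to movsd + movhpd.

/* Illustrative only; compile with e.g. -O2 -march=znver2.  */
#include <immintrin.h>

/* What the vectorizer currently emits: two vector loads + a { 0, 2 }
   permute (unpcklpd of the two low elements).  */
__m128d gather_via_permute (const double *p)
{
  __m128d lo = _mm_loadu_pd (p);      /* { p[0], p[1] } */
  __m128d hi = _mm_loadu_pd (p + 2);  /* { p[2], p[3] } */
  return _mm_unpacklo_pd (lo, hi);    /* { p[0], p[2] } */
}

/* The alternative discussed above: two scalar loads combined with a CTOR,
   which usually maps to movsd + movhpd.  */
__m128d gather_via_scalar_loads (const double *p)
{
  return _mm_set_pd (p[2], p[0]);     /* low lane p[0], high lane p[2] */
}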
So when we have BB vectorization and costs like

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

we decide it's worthwhile to vectorize. But that does not account for the fact that the scalar code represents multiple vector lanes that can possibly execute in parallel, while the vector code, by combining the lanes, makes them data dependent. So considering a VF of at least 2, the scalar code could in the ideal case run with a latency of 836/2 while the vector code would have a latency of 776. Of course the cost number we compute doesn't really resemble overall latency; I will see if we can improve that for GCC 13.

What I'm after is that it's maybe not a good idea to compare scalar vs. vector cost for BB vectorization in the way we do. Without any info on the dependences between the individually costed stmts it's hard to do better, of course, though the target could try simple adjustments based on available issue width and execution resources, for example assuming that two scalar ops can always issue & execute in parallel and thus dividing the scalar cost by two (for BB vectorization only, or for scalar stmts participating in SLP for loop vectorization, something not readily available though). There's always the possibility to special-case more constrained resources (integer/fp dividers, stores or shifts), but that already gets a bit complicated because we cost scalar stmts independently and thus the code doesn't know how many of them will be combined into a single lane. The alternative is to do the biasing on the vector side, but there you do not know the original scalar operation involved (consider patterns). So it might not be easy to adjust the current heuristics in a good way.
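A minimal sketch of the adjustment idea (purely hypothetical; none of these names are GCC internals or existing target hooks): scale the scalar cost by an assumed scalar issue width before comparing, to model independent scalar lanes executing in parallel.

/* Hypothetical illustration of the heuristic discussed above.  Assuming two
   independent scalar ops can always issue and execute in parallel, the scalar
   cost is halved before the comparison.  */
int
bb_slp_profitable_with_issue_width (unsigned vector_cost,
                                    unsigned scalar_cost,
                                    unsigned scalar_issue_width)
{
  /* Ideal-case scalar "latency" if the lanes are truly independent.  */
  unsigned adjusted_scalar_cost = scalar_cost / scalar_issue_width;
  return vector_cost < adjusted_scalar_cost;
}

/* With the numbers from the dump: 776 < 836 says "vectorize", but
   776 < 836 / 2 == 418 does not, which matches the observed regression.  */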
It's back to good performance.