As seen here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=467.50.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.50.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=992.50.0

there was a 6-11% exec time slowdown of the 416.gamess SPEC 2006 benchmark between commits r15-3465-gde3ca363811a39 and r15-3518-g2c4438d3915649 when run with -Ofast -march=native (optionally -flto) on AMD Zen3/4 machines.
Bisected to r15-3509-gd34cda72098867. Cc-ing richi.
I will eventually investigate.
Re-confirmed (comparing 14.2 against trunk on Zen4 with -Ofast -flto -march=native).

Samples: 1M of event 'cycles:Pu', Event count (approx.): 2401109021645
Overhead  Samples  Command          Shared Object                   Symbol
  12.03%   230087  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
  11.79%   224014  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  11.66%   222528  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
   8.44%   160676  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   8.09%   153197  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   6.27%   119537  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   5.89%   111667  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   5.21%    99376  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_
   3.02%    57506  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] genral_
   2.36%    44702  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] genral_
   1.62%    30954  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] zqout_
   1.56%    29806  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twoei_.constprop.2
   1.53%    29092  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twoei_.constprop.2
   1.40%    26663  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] zqout_

so the main thing is the usual suspect, the "triangular" loop

      MKL=0
      DO 10 MK=1,NOC
      DO 10 ML=1,MK
         MKL = MKL+1
         XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *      VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
         XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *      VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10 CONTINUE

where previously I massaged costing to have the loop _not_ vectorized but that doesn't work anymore it seems.
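For readers less used to fixed-form Fortran, here is a rough C transcription of the quoted loop nest. It is only a sketch: the function name, the explicit leading dimensions ldx/ldc and the IDX macro are hypothetical, and it assumes XPQKL and CO are column-major arrays whose leading dimension is only known at run time. With that assumption, the inner index ML (and MKL) selects a column, so consecutive inner iterations are a runtime stride apart, and the inner trip count is just MK (1, 2, ..., NOC).

```c
#include <assert.h>
#include <stddef.h>

/* Column-major (Fortran-style) offset of a(i,j), 1-based indices,
   leading dimension ld.  Hypothetical helper, not from the benchmark. */
#define IDX(i, j, ld) ((size_t)((j) - 1) * (size_t)(ld) + (size_t)((i) - 1))

/* Hypothetical C transcription of the quoted "triangular" loop. */
void triangular_update(double *xpqkl, int ldx,
                       const double *co, int ldc,
                       int noc, int mpq, int mrs,
                       int mp, int mq, int mr, int ms,
                       double val1, double val3)
{
    assert(noc >= 0 && ldx > 0 && ldc > 0);
    int mkl = 0;
    for (int mk = 1; mk <= noc; ++mk) {
        /* Inner trip count equals mk, so it is short and varies. */
        for (int ml = 1; ml <= mk; ++ml) {
            ++mkl;
            /* ml and mkl are column indices: successive iterations touch
               elements ldc (co) or ldx (xpqkl) doubles apart. */
            xpqkl[IDX(mpq, mkl, ldx)] +=
                val1 * (co[IDX(ms, mk, ldc)] * co[IDX(mr, ml, ldc)]
                        + co[IDX(ms, ml, ldc)] * co[IDX(mr, mk, ldc)]);
            xpqkl[IDX(mrs, mkl, ldx)] +=
                val3 * (co[IDX(mq, mk, ldc)] * co[IDX(mp, ml, ldc)]
                        + co[IDX(mq, ml, ldc)] * co[IDX(mp, mk, ldc)]);
        }
    }
}
```

Vectorizing over ml in this shape means building vectors from individual strided loads and storing results lane by lane, which is presumably why the loads show up as vec_construct and the stores as vec_to_scalar in the vectorizer's costing.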
So on x86 the cost model difference 14.2 vs trunk is

-(*co_271(D))[_95] 1 times vec_construct costs 792 in body
+(*co_271(D))[_95] 1 times vec_construct costs 88 in body

and similar for

-_103 1 times vec_to_scalar costs 72 in body
+_103 1 times vec_to_scalar costs 8 in body

r15-5565-gdbc38dd9e96a99 doesn't seem to fix this yet. The reason is that the cost hook for non-SLP considers VMAT_ELEMENTWISE with variable stride separately but not so VMAT_STRIDED_SLP with SLP. With SLP we don't get all the info we like (how we use lvectype/ltype vs. vectype).

For GCC 15 I'm going to emulate GCC 14 behavior here by special-casing single-lane SLP. For the future we want to let the backend know how many and what kind of loads we do for VMAT_STRIDED_SLP, that's something the cost hook doesn't get us yet.
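The effect of the heuristic can be illustrated with a toy model. This is not GCC's code: the enum, struct and function names are hypothetical, and the scale factor is passed in as a parameter. The value 9 used below is simply the ratio of the quoted costs (792/88 = 72/8 = 9). The sketch also assumes that the single-lane strided-SLP special case applies under the same variable-stride condition as the elementwise case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vectorizer's memory access types. */
enum access_kind { ACCESS_CONTIGUOUS, ACCESS_ELEMENTWISE, ACCESS_STRIDED_SLP };

/* Toy description of a load or store as seen by a cost hook. */
struct stmt_desc {
    enum access_kind access;
    bool variable_stride;  /* stride not a compile-time constant */
    unsigned slp_lanes;    /* 0 when the statement is not in an SLP node */
};

/* Decide whether the construct/extract cost should be scaled up.
   treat_single_lane_slp == false models the regressed behavior,
   true models the special case described for GCC 15. */
static bool wants_scaling(const struct stmt_desc *s, bool treat_single_lane_slp)
{
    if (!s->variable_stride)
        return false;
    if (s->access == ACCESS_ELEMENTWISE)
        return true;
    return treat_single_lane_slp
           && s->access == ACCESS_STRIDED_SLP
           && s->slp_lanes == 1;
}

/* Cost charged in the loop body for one vec_construct / vec_to_scalar. */
unsigned body_cost(const struct stmt_desc *s, unsigned base_cost,
                   unsigned scale, bool treat_single_lane_slp)
{
    assert(scale >= 1);
    return wants_scaling(s, treat_single_lane_slp) ? base_cost * scale
                                                   : base_cost;
}
```

With a base cost of 88 and a scale of 9, an elementwise access costs 792. A single-lane strided-SLP access costs 88 without the special case and 792 with it, which matches the numbers in the dump above.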
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:cd8db107b9bef73fd822ffb420f96ed2bc622a19

commit r15-5651-gcd8db107b9bef73fd822ffb420f96ed2bc622a19
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Nov 25 13:32:15 2024 +0100

    target/116760 - 416.gamess slowdown with SLP

    For the TWOTFF loop vectorization the backend scales constructor
    and vector extract cost to make higher VFs less profitable.  This
    heuristic currently fails to consider VMAT_STRIDED_SLP which we
    now get with single-lane SLP, causing a huge regression in SPEC 2k6
    416.gamess for the respective loop nest.

    The following fixes this, matching behavior to that of GCC 14
    by treating single-lane VMAT_STRIDED_SLP the same as VMAT_ELEMENTWISE.

            PR target/116760
            * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
            Scale vec_construct for single-lane VMAT_STRIDED_SLP the
            same as VMAT_ELEMENTWISE.
            * tree-vect-stmts.cc (vectorizable_store): Pass SLP node
            down to costing for vec_to_scalar for VMAT_STRIDED_SLP.
Fixed.