| Summary: | [15 Regression] 6-11% slowdown of 416.gamess on AMD Zen3 and Zen4 since r15-3509-gd34cda72098867 | ||
|---|---|---|---|
| Product: | gcc | Reporter: | Filip Kastl <pheeck> |
| Component: | tree-optimization | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | rguenth |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 15.0 | ||
| Target Milestone: | 15.0 | ||
| See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116761 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912 | ||
| Host: | x86_64-pc-linux-gnu | Target: | x86_64-pc-linux-gnu |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2024-11-25 00:00:00 |
| Bug Depends on: | |||
| Bug Blocks: | 26163 | ||
|
Description
Filip Kastl
2024-09-18 12:19:27 UTC
Bisected to r15-3509-gd34cda72098867. Cc-ing richi. I will eventually investigate.
Re-confirmed (comparing 14.2 against trunk on Zen4 with -Ofast -flto -march=native).
Samples: 1M of event 'cycles:Pu', Event count (approx.): 2401109021645
Overhead Samples Command Shared Object Symbol
12.03% 230087 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] twotff_
11.79% 224014 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] forms_
11.66% 222528 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] forms_
8.44% 160676 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] dirfck_
8.09% 153197 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] dirfck_
6.27% 119537 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] twotff_
5.89% 111667 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] xyzint_
5.21% 99376 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] xyzint_
3.02% 57506 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] genral_
2.36% 44702 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] genral_
1.62% 30954 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] zqout_
1.56% 29806 gamess_peak.amd gamess_peak.amd64-m64-gcc42-nn [.] twoei_.constprop.2
1.53% 29092 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] twoei_.constprop.2
1.40% 26663 gamess_base.amd gamess_base.amd64-m64-gcc42-nn [.] zqout_
so the main thing is the usual suspect, the "triangular" loop
      MKL=0
      DO 10 MK=1,NOC
      DO 10 ML=1,MK
      MKL = MKL+1
      XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *      VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
      XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *      VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   10 CONTINUE
where previously I massaged costing to have the loop _not_ vectorized, but that doesn't seem to work anymore.
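To make the shape of that loop nest concrete, here is a minimal Python sketch of its index arithmetic. It is only an illustration: the helper names and the `leading_dim` parameter are hypothetical, and it assumes the arrays are column-major (as in Fortran) with leading dimensions that are only known at run time, which this report does not show. Under those assumptions the inner `ML` loop has a trip count that varies from 1 to `NOC`, and both `CO(..,ML)` and `XPQKL(..,MKL)` step through memory with a stride of one full column per iteration.

```python
def triangular_pairs(noc):
    """Yield (mk, ml, mkl) in the same order as the Fortran loop nest (1-based)."""
    mkl = 0
    for mk in range(1, noc + 1):
        for ml in range(1, mk + 1):
            mkl += 1
            yield mk, ml, mkl


def packed_index(mk, ml):
    """Closed form of the running counter MKL for 1-based ml <= mk."""
    return mk * (mk - 1) // 2 + ml


def inner_trip_counts(noc):
    """Trip count of the inner ML loop for each value of MK."""
    return list(range(1, noc + 1))


def column_major_offset(row, col, leading_dim):
    """0-based element offset of the 1-based (row, col) in a column-major array.

    leading_dim is a hypothetical stand-in for the declared first dimension.
    """
    return (row - 1) + (col - 1) * leading_dim


if __name__ == "__main__":
    for mk, ml, mkl in triangular_pairs(3):
        print(mk, ml, mkl, packed_index(mk, ml))
```

Stepping the second index by one moves the offset by `leading_dim` elements, so a vectorized version of the inner loop cannot use contiguous vector loads or stores for these references. That fits the `vec_construct` and `vec_to_scalar` entries in the vectorizer cost dump for this loop.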
So on x86 the cost model difference 14.2 vs trunk is

-(*co_271(D))[_95] 1 times vec_construct costs 792 in body
+(*co_271(D))[_95] 1 times vec_construct costs 88 in body

and similar for

-_103 1 times vec_to_scalar costs 72 in body
+_103 1 times vec_to_scalar costs 8 in body

r15-5565-gdbc38dd9e96a99 doesn't seem to fix this yet.

The reason is that the cost hook for non-SLP considers VMAT_ELEMENTWISE with variable stride separately but not so VMAT_STRIDED_SLP with SLP. With SLP we don't get all the info we like (how we use lvectype/ltype vs. vectype). For GCC 15 I'm going to emulate GCC 14 behavior here by special-casing single-lane SLP. For the future we want to let the backend know how many and what kind of loads we do for VMAT_STRIDED_SLP, that's something the cost hook doesn't get us yet.

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:cd8db107b9bef73fd822ffb420f96ed2bc622a19

commit r15-5651-gcd8db107b9bef73fd822ffb420f96ed2bc622a19
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Nov 25 13:32:15 2024 +0100

    target/116760 - 416.gamess slowdown with SLP

    For the TWOTFF loop vectorization the backend scales constructor
    and vector extract cost to make higher VFs less profitable. This
    heuristic currently fails to consider VMAT_STRIDED_SLP which we now
    get with single-lane SLP, causing a huge regression in SPEC 2k6
    416.gamess for the respective loop nest.

    The following fixes this, matching behavior to that of GCC 14 by
    treating single-lane VMAT_STRIDED_SLP the same as VMAT_ELEMENTWISE.

            PR target/116760
            * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
            Scale vec_construct for single-lane VMAT_STRIDED_SLP the
            same as VMAT_ELEMENTWISE.
            * tree-vect-stmts.cc (vectorizable_store): Pass SLP node
            down to costing for vec_to_scalar for VMAT_STRIDED_SLP.

Fixed. |
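The cost numbers quoted in this report differ by exactly a factor of 9 (792 vs 88 for `vec_construct`, 72 vs 8 for `vec_to_scalar`). The following Python sketch is a toy model of that scaling, not the code in `i386.cc`. It assumes the penalty is `lanes + 1` for an 8-lane vector, which reproduces the factor of 9 but is not stated in the report. The function names and string labels are hypothetical and only mirror the access-type names used in the discussion.

```python
# Hypothetical labels; these strings only mirror the names used in the report.
ELEMENTWISE = "VMAT_ELEMENTWISE"
STRIDED_SLP = "VMAT_STRIDED_SLP"

# Assumption: 8 lanes per vector, chosen so that lanes + 1 gives the factor of 9.
LANES = 8


def treated_as_elementwise(access_type, slp_lanes, special_case_single_lane):
    """Decide whether the scaling penalty applies.

    slp_lanes is None for non-SLP statements. special_case_single_lane models
    the fix: single-lane strided SLP is handled like an elementwise access.
    """
    if access_type == ELEMENTWISE:
        return True
    if access_type == STRIDED_SLP and slp_lanes == 1:
        return special_case_single_lane
    return False


def stmt_cost(base_cost, vector_lanes, penalised):
    """Assumed scaling: (vector_lanes + 1) times the base cost when penalised."""
    return base_cost * (vector_lanes + 1) if penalised else base_cost


if __name__ == "__main__":
    for fixed in (False, True):
        penalised = treated_as_elementwise(STRIDED_SLP, 1, fixed)
        print(fixed, stmt_cost(88, LANES, penalised), stmt_cost(8, LANES, penalised))
```

With the special case disabled, the model yields the unscaled trunk costs (88 and 8). With it enabled, it yields the GCC 14 costs (792 and 72), which make the higher vectorization factor look unprofitable again for this loop.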