I can see x86_64 regressions of 525.x264_r and 538.imagick_r when built with plain -O2 (so generic march/mtune) and profile guided optimization (PGO), compared to GCC 11. The performance drop of 525.x264_r is about 11% on znver3 and 10% on Intel cascadelake. The performance drop of 538.imagick_r is about 6.4% on znver3. FWIW, I bisected both to commit r12-7319-g90d693bdc9d718:

    commit 90d693bdc9d71841f51d68826ffa5bd685d7f0bc
    Author: Richard Biener <rguenther@suse.de>
    Date:   Fri Feb 18 14:32:14 2022 +0100

        target/99881 - x86 vector cost of CTOR from integer regs

        This uses the now passed SLP node to the vectorizer costing hook
        to adjust vector construction costs for the cost of moving an
        integer component from a GPR to a vector register when that's
        required for building a vector from components. A crucial
        difference here is whether the component is loaded from memory
        or extracted from a vector register as in those cases no
        intermediate GPR is involved.

        The pr99881.c testcase can be Un-XFAILed with this patch, the
        pr91446.c testcase now produces scalar code which looks superior
        to me so I've adjusted it as well.

        2022-02-18  Richard Biener  <rguenther@suse.de>

                PR tree-optimization/104582
                PR target/99881
                * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
                Cost GPR to vector register moves for integer vector
                construction.

With PGO+LTO, the 538.imagick_r regression on znver3 is small (less than 3%), while the 525.x264_r regressions are smaller but still visible (9.4% and 7.1% on the two machines).
Confirmed with GCC 12.1 numbers.
I can see this again in my measurements from January 10, 2023. Trunk and GCC 12.2 are about 10% slower with PGO than GCC 11 with the same options. With both PGO and LTO they are (this time also) about 9% slower than GCC 11 with the same options, though in the latter case it is only 4% on zen2.
I have re-checked this year again (using master revision r14-7200-g95440171d0e615), this time on a high-frequency Zen3 CPU (EPYC 75F3). Run-time of 525.x264_r built with master with PGO and -O2 improved by 5.49% compared to GCC 13, so relative to GCC 11 the regression dropped to 4.2%. Run-time of 538.imagick_r compiled with the same options and master is 5.8% slower on this CPU than when compiling it with GCC 11. With both PGO and LTO, 525.x264_r is now only 2.8% slower than GCC 11. In case of 538.imagick_r the regression is 2.01% on this zen3 machine, but it is 7.49% on a zen4 machine :-/
Since this was a costing change I wonder if we identified the code change responsible and thus have a testcase? I realize that for maximum assurance one would need to have a debug counter for switching the patch on/off to have it apply more selectively (possibly per SLP attempt rather than per cost hook invocation which would be even more tricky to do). Feeding another parameter to the hook via a new flag in the vinfo might be possible (and set that from a dbg_cnt call) for example.
GCC 12.4 is being released, retargeting bugs to GCC 12.5.