On Zen2 based CPUs (and only on those; I have seen this neither on Zen3 nor on Intel Cascadelake, for example), 531.deepsjeng_r regressed by almost 10% when built with -Ofast -march=native, as can be seen on LNT:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.387.0

Fortunately, it does not happen with LTO. Given how specific it is, it may not be easy to diagnose or fix, but for what it is worth, I was able to bisect the big jump from October to:

cb153222404e2e149aa65a4b3139b09477551203 is the first bad commit
commit cb153222404e2e149aa65a4b3139b09477551203
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Wed Oct 20 13:37:29 2021 -0400

    Fold all statements in Ranger VRP.

    Until now, ranger VRP has only simplified statements with ranges.
    This patch enables us to fold all statements.

    gcc/
            * tree-vrp.c (rvrp_folder::fold_stmt): If simplification fails,
            try to fold anyway.

    gcc/testsuite/
            * gcc.dg/tree-ssa/vrp98.c: Disable evrp for vrp1 test.
            * gcc.dg/tree-ssa/vrp98-1.c: New.  Test for folding in evrp.
wow, that's a crazy change to get that kind of difference. All we are doing is invoking ::fold_stmt () on statements we can't simplify with ranges. I wonder if something is being simplified too early and messing some loop stuff up?
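For reference, the change amounts to roughly this (paraphrased from the commit, not a verbatim quote; the simplifier member name is approximate):

  /* rvrp_folder::fold_stmt in tree-vrp.c, sketched.  */
  bool
  rvrp_folder::fold_stmt (gimple_stmt_iterator *gsi)
  {
    bool ret = m_simplifier.simplify (gsi);
    /* New behavior: if range-based simplification failed, fall back to
       the generic statement folder instead of leaving the statement
       alone.  */
    if (!ret)
      ret = ::fold_stmt (gsi, follow_single_use_edges);
    return ret;
  }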
Created attachment 52278 [details]
patch which undoes the original change

I'm not suggesting we de-apply the original patch, but it can't be directly undone today, as there have been other changes on top of it. The attached patch turns off the "fold all statements" functionality on trunk today. You can use it to see if this is still the root of the problem.

My guess would be that we are now folding something that is causing a hot section of code to get processed differently by some other optimization. It would be useful to know what, I guess.
The patch did not change the run-time (by more than could be attributed to noise). I will take a *quick* look at what happened in October.
Despite spending much more time on this than I wanted, I was not able to find out anything really interesting. The function that slowed down significantly is feval (FWIW, perf annotation points to a conditional jump, depending on a comparison of 0x78(%rsp) to zero, as a new costly instruction).

I went back to the commit that introduced the regression and added a debug counter to switch between the old and new behavior. The single change responsible for the entire slowdown happened in the evrp pass when working on function positional_eval:

@@ -1946,7 +1948,7 @@
   _11 = _9 & _10;
   _95 = PopCount (_11);
   _96 = _95 * 15;
-  _104 = -_96;
+  _104 = _95 * -15;
   _13 = pawntt_84(D)->b_super_strong_square;
   _14 = s_85(D)->BitBoard[4];
   _15 = _13 & _14;

Neither _95 nor _96 has any further uses, and either way, a simple search in the dumps suggests that even in the "fast" case, the expression is folded to a multiplication by -15 later anyway.

From here on, the investigation is difficult: this change alters SSA numbering in later passes and the diffs are huge. Moreover, it also causes a change in inlining order (as reported by -fopt-info-optimized):

--- opt-fast	2022-02-01 17:17:50.928639947 +0100
+++ opt-slow	2022-02-01 17:18:07.284728740 +0100
@@ -4,4 +4,4 @@
 neval.cpp:1086:26: optimized: Inlined trapped_eval.constprop/209 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 172.599138 and size 156, net change of -25.
-neval.cpp:1067:22: optimized: Inlined void kingpressure_eval(state_t*, attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 216.190938 and size 314, net change of -31.
-neval.cpp:1081:20: optimized: Inlined void positional_eval(state_t*, pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 314.215938 and size 433, net change of -21.
+neval.cpp:1081:20: optimized: Inlined void positional_eval(state_t*, pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 269.624138 and size 274, net change of -21.
+neval.cpp:1067:22: optimized: Inlined void kingpressure_eval(state_t*, attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 313.215938 and size 432, net change of -31.
 neval.cpp:394:22: optimized: basic block part vectorized using 32 byte vectors

At the assembly level, register allocation, spilling, and scheduling are all somewhat different, again creating so many differences that I cannot tell what is going on from a simple diff.
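For concreteness, here is a minimal stand-in for the shape of the computation in the evrp hunk above (the function and variable names are invented; this is not the actual deepsjeng source):

  #include <stdint.h>
  #include <stdio.h>

  static int
  score (uint64_t occupied, uint64_t mask)
  {
    int count = __builtin_popcountll (occupied & mask); /* _95 = PopCount (_11) */
    int weight = count * 15;                            /* _96 = _95 * 15 */
    return -weight;   /* _104 = -_96, which evrp now folds to _95 * -15 */
  }

  int
  main (void)
  {
    printf ("%d\n", score (0xff00ff00ff00ff00ULL, 0x0f0f0f0f0f0f0f0fULL));
    return 0;
  }

The two forms are semantically identical, so the slowdown cannot come from this statement itself; the fold merely happens earlier, which changes SSA names downstream and apparently perturbs the later inlining and register-allocation decisions.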
This still exists, but it is a zen2 oddity. The zen3, zen4, and cascade-lake machines I looked at this month don't exhibit this behavior (or at least I don't see an obvious regression).