Bug 104125 - 531.deepsjeng_r regressed on Zen2 CPUs at -Ofast -march=native (without LTO) during GCC 12 development
Summary: 531.deepsjeng_r regressed on Zen2 CPUs at -Ofast -march=native (without LTO) ...
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 12.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: spec
 
Reported: 2022-01-19 17:48 UTC by Martin Jambor
Modified: 2023-01-18 16:12 UTC (History)
3 users

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
patch which undoes the original change (212 bytes, patch)
2022-01-24 18:26 UTC, Andrew Macleod

Description Martin Jambor 2022-01-19 17:48:51 UTC
On Zen2-based CPUs (and only on those; I have not seen this on Zen3
nor on Intel Cascadelake, for example), 531.deepsjeng_r regressed by
almost 10% when built with -Ofast -march=native, as can be seen on
LNT:

  https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.387.0

Fortunately, it does not happen with LTO.  Given how specific it is,
it may not be easy to diagnose or fix, but for what it is worth, I was
able to bisect the big jump from October to:

  cb153222404e2e149aa65a4b3139b09477551203 is the first bad commit
  commit cb153222404e2e149aa65a4b3139b09477551203
  Author: Andrew MacLeod <amacleod@redhat.com>
  Date:   Wed Oct 20 13:37:29 2021 -0400

    Fold all statements in Ranger VRP.
    
    Until now, ranger VRP has only simplified statements with ranges.  This patch
    enables us to fold all statements.
    
            gcc/
            * tree-vrp.c (rvrp_folder::fold_stmt): If simplification fails, try
            to fold anyway.
    
            gcc/testsuite/
            * gcc.dg/tree-ssa/vrp98.c: Disable evrp for vrp1 test.
            * gcc.dg/tree-ssa/vrp98-1.c: New. Test for folding in evrp.
Comment 1 Andrew Macleod 2022-01-19 18:00:46 UTC
Wow, that's a crazy change to get that kind of difference.  All we are doing is invoking ::fold_stmt () on statements we can't simplify with ranges.

I wonder if something is being simplified too early and messing some loop stuff up?
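
For reference, the change amounts to roughly the following shape (a
sketch only, based on the ChangeLog entry above; the member name and
the exact fallback call are assumptions, not the literal patch):

  bool
  rvrp_folder::fold_stmt (gimple_stmt_iterator *gsi)
  {
    /* Previous behavior: only range-based simplification was tried.  */
    if (m_simplifier.simplify (gsi))
      return true;
    /* New behavior: if simplification fails, fall back to the generic
       GIMPLE folder.  */
    return ::fold_stmt (gsi);
  }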
Comment 2 Andrew Macleod 2022-01-24 18:26:29 UTC
Created attachment 52278 [details]
patch which undoes the original change

I'm not suggesting we de-apply the original patch, but it can't be directly undone today as there have been other changes on top of the original.

The attached patch turns off the "fold all statements" functionality on trunk today.  You can use it to see if this is still the root of the problem.

My guess would be that we are now folding something that is causing a hot section of code to get processed differently by some other optimization.  It could be useful to know what, I guess.
Comment 3 Martin Jambor 2022-01-26 12:43:07 UTC
The patch did not change the run-time (by more than could be attributed to noise).  I will take a *quick* look at what happened in October.
Comment 4 Martin Jambor 2022-02-01 17:43:14 UTC
Despite spending much more time on this than I wanted I was not able
to find out anything really interesting.

The function that slowed down significantly is feval (FWIW, perf
annotation points to a conditional jump, depending on a comparison of
0x78(%rsp) to zero, as a new costly instruction).

I have gone back to the commit that introduced the regression and
added a debug counter to switch between the old and new behavior.  The
single change responsible for the entire slowdown happened in the evrp
pass when working on the function positional_eval:

@@ -1946,7 +1948,7 @@
   _11 = _9 & _10;
   _95 = PopCount (_11);
   _96 = _95 * 15;
-  _104 = -_96;
+  _104 = _95 * -15;
   _13 = pawntt_84(D)->b_super_strong_square;
   _14 = s_85(D)->BitBoard[4];
   _15 = _13 & _14;

Neither _95 nor _96 has any further uses, and either way, a simple
search in the dumps suggests that even in the "fast" case the
expression is folded to a multiplication by -15 later anyway.
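
At the source level the fold is just the usual simplification of a
negated product, i.e. something like this (illustration only; these
helpers are not taken from the benchmark sources):

  /* Illustration only, not deepsjeng code.  */
  int before (int pc) { return -(pc * 15); }   /* _96 = _95 * 15;  _104 = -_96; */
  int after  (int pc) { return pc * -15; }     /* _104 = _95 * -15;             */

Both forms compute the same value, so the fold itself should be
neutral; the measurable difference appears to come only from the
downstream effects described below.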

But from here the investigation is difficult: this change alters the
SSA numbering in later passes and the diffs are huge.  Moreover, it
also causes a change in inlining order (as reported by
-fopt-info-optimized):

--- opt-fast    2022-02-01 17:17:50.928639947 +0100
+++ opt-slow    2022-02-01 17:18:07.284728740 +0100
@@ -4,4 +4,4 @@
 neval.cpp:1086:26: optimized:  Inlined trapped_eval.constprop/209 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 172.599138 and size 156, net change of -25.
-neval.cpp:1067:22: optimized:  Inlined void kingpressure_eval(state_t*, attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 216.190938 and size 314, net change of -31.
-neval.cpp:1081:20: optimized:  Inlined void positional_eval(state_t*, pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 314.215938 and size 433, net change of -21.
+neval.cpp:1081:20: optimized:  Inlined void positional_eval(state_t*, pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 269.624138 and size 274, net change of -21.
+neval.cpp:1067:22: optimized:  Inlined void kingpressure_eval(state_t*, attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int, t_eval_comps*)/163 which now has time 313.215938 and size 432, net change of -31.
 neval.cpp:394:22: optimized: basic block part vectorized using 32 byte vectors

At the assembly level, register allocation, spilling and scheduling
are clearly somewhat different, again creating so many differences
that I cannot tell what is going on from a simple diff.
Comment 5 Martin Jambor 2023-01-18 16:12:43 UTC
This still exists, but it is a Zen2 oddity.  The Zen3, Zen4 and Cascade Lake machines I looked at this month don't exhibit this behavior (or at least I don't see an obvious regression).