In my own measurements, 507.cactuBSSN_r is about 9.4% slower on an AMD Zen CPU when compiled with GCC 9 with -Ofast and native march/mtune than when it is compiled with GCC 8. LNT currently even shows an 11.4% regression: https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch

I have done some bisecting and the slowdown happened in three steps. First, the benchmark slowed down by about 2% at some point before r262510, which I have not tracked down yet. Second, it then dived another 3% with r263874, but this seems to be a code-placement issue again, because the assembly of the functions that gained perf samples did not change in that revision, and perf-reported stalled-cycles-frontend went from 4.58% to 5.02%. However, the third regression was caused by the immediately following revision, r263875; the difference is 4.5% (7.5% when compared to the GCC 8 run-time), while perf-reported stalled-cycles-frontend was only 4.05%.

r263872 (good) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      238848.989836      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              92923      page-faults:u              #    0.389 K/sec
       758195547230      cycles:u                   #    3.174 GHz                      (83.33%)
        34727040659      stalled-cycles-frontend:u  #    4.58% frontend cycles idle     (83.33%)
        15457735869      stalled-cycles-backend:u   #    2.04% backend cycles idle      (83.33%)
      1225370192228      instructions:u             #    1.62  insn per cycle
                                                    #    0.03  stalled cycles per insn  (83.33%)
        23031544594      branches:u                 #   96.427 M/sec                    (83.34%)
           18985096      branch-misses:u            #    0.08% of all branches          (83.33%)

      239.158442295 seconds time elapsed

 # Event count (approx.): 758374775503
 #
 # Overhead  Samples  Command       Shared Object      Symbol
 # ........  .......  ............  .................  .........................................
 #
     40.51%   387505  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.34%   214782  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.42%    80594  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.40%    70897  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.77%    55393  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.99%    47952  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.98%    28573  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.44%    23623  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination

r263874 (worse) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      244036.523777      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              93013      page-faults:u              #    0.381 K/sec
       774757677736      cycles:u                   #    3.175 GHz                      (83.33%)
        38930288027      stalled-cycles-frontend:u  #    5.02% frontend cycles idle     (83.33%)
        15508961324      stalled-cycles-backend:u   #    2.00% backend cycles idle      (83.34%)
      1226167776333      instructions:u             #    1.58  insn per cycle
                                                    #    0.03  stalled cycles per insn  (83.33%)
        23218262947      branches:u                 #   95.143 M/sec                    (83.33%)
           18890390      branch-misses:u            #    0.08% of all branches          (83.33%)

      244.344340731 seconds time elapsed

 # Samples: 979K of event 'cycles'
 # Event count (approx.): 775138268715
 #
 # Overhead  Samples  Command       Shared Object      Symbol
 # ........  .......  ............  .................  .........................................
 #
     41.43%   404835  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.04%   216520  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.22%    80341  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.26%    71052  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.86%    57419  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.89%    48084  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.92%    28579  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.38%    23520  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination

r263875 (bad) perf stat and report (note that branch misses grew by 6%):

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      254984.828108      task-clock:u (msec)        #    0.999 CPUs utilized
                  0      context-switches:u         #    0.000 K/sec
                  0      cpu-migrations:u           #    0.000 K/sec
              92949      page-faults:u              #    0.365 K/sec
       809505457529      cycles:u                   #    3.175 GHz                      (83.33%)
        32784020923      stalled-cycles-frontend:u  #    4.05% frontend cycles idle     (83.33%)
        15658463714      stalled-cycles-backend:u   #    1.93% backend cycles idle      (83.33%)
      1225361873924      instructions:u             #    1.51  insn per cycle
                                                    #    0.03  stalled cycles per insn  (83.33%)
        23461309363      branches:u                 #   92.011 M/sec                    (83.34%)
           20152382      branch-misses:u            #    0.09% of all branches          (83.33%)

      255.313012246 seconds time elapsed

 # Event count (approx.): 812138555051
 #
 # Overhead  Samples  Command       Shared Object      Symbol
 # ........  .......  ............  .................  .........................................
 #
     37.54%   384512  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     27.51%   282987  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      7.80%    79887  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      6.86%    70384  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.73%    58878  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.66%    47990  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.79%    28638  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.28%    23615  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination

I did the bisecting on a machine with glibc 2.26, but the issue was detected on one with glibc 2.29.
So the issue is in ML_BSSN_Advect_Body (the other function rebounded). I will have a look.
Ugh. Cactus is really ugly code :/ For one, there's an invariant switch () in the innermost loop, expanded to a binary tree (slightly different split point GCC 8 vs. trunk); obviously unswitching cannot handle this. This is a general missed optimization precluding any vectorization attempt here. Then we spill the hell out of us because of the way the code is written. Other than that I don't see anything obvious here. It might be that trunk:

    5802:  83 fb 06              cmp    $0x6,%ebx
    5805:  0f 84 25 84 00 00     je     dc30 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xdc30>
    580b:  0f 8f cf 1d 00 00     jg     75e0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x75e0>
    5811:  83 fb 02              cmp    $0x2,%ebx
    5814:  0f 85 06 c0 ff ff     jne    1820 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x1820>

is worse for the branch predictor than the GCC 8 version:

    89ee:  0f 84 bc 64 00 00     je     eeb0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xeeb0>
    89f4:  0f 8e 96 45 00 00     jle    cf90 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xcf90>
    89fa:  8b b4 24 a8 08 00 00  mov    0x8a8(%rsp),%esi
    8a01:  83 fe 06              cmp    $0x6,%esi
    8a04:  0f 85 e6 8e ff ff     jne    18f0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x18f0>

(notice the "padding" reload). That is probably going to depend on final code layout again, of course. I recall reading that a third conditional jump in a fetch word requires an additional branch predictor slot or so. So it would be interesting to see whether the branch misses accumulate on that binary tree generated from the loop-invariant switch, where in theory they should all be totally predictable. That said, I'm not yet able to reproduce the slowdown but will try.
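To make the shape of the problem concrete, here is a minimal C++ sketch of the pattern described above: a switch on a loop-invariant finite-difference order inside the innermost loop. The function name, variable names and stencils are made up for illustration; they are not the actual Cactus code.

    /* Minimal sketch (hypothetical names and simplified stencils) of the
       pattern in ML_BSSN_Advect_Body: fdOrder never changes inside the loop,
       yet the switch is evaluated on every iteration.  After it is expanded
       into a binary compare/branch tree, loop unswitching cannot hoist it
       out, and the branchy body also defeats vectorization.  */
    void advect_sketch (int fdOrder, int n, double *rhs, const double *u)
    {
      for (int i = 4; i < n - 4; ++i)      /* innermost loop */
        {
          double d;
          switch (fdOrder)                 /* loop-invariant operand */
            {
            case 2: d = u[i + 1] - u[i - 1]; break;
            case 4: d = u[i + 2] - u[i - 2]; break;
            case 6: d = u[i + 3] - u[i - 3]; break;
            case 8: d = u[i + 4] - u[i - 4]; break;
            default: d = 0.0; break;
            }
          rhs[i] = d;
        }
    }

If the switch were hoisted out of the loop (one specialized loop per order), each body would be straight-line code amenable to vectorization, which is the missed optimization referred to above.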
Direct graph link to branch comparison: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=148.437.0&plot.1=59.437.0&plot.2=76.437.0&plot.3=33.437.0&
(In reply to Richard Biener from comment #2)
> Ugh. Cactus is really ugly code :/ For one there's an invariant switch ()
> in the innermost loop, expanded to a binary tree (slightly different split
> point GCC 8 vs. trunk), obviously unswitching cannot handle this.

Yes, the binary tree is a bit different, but it looks equally good to me:

GCC 8:

  if (fdOrder_15741 == 4)
    goto <bb 193>; [20.00%]
  else
    goto <bb 188>; [80.00%]

  <bb 188> [local count: 955630223]:
  if (fdOrder_15741 > 4)
    goto <bb 190>; [62.50%]
  else
    goto <bb 189>; [37.50%]

  <bb 189> [local count: 955630223]:
  if (fdOrder_15741 == 2)
    goto <bb 192>; [66.67%]
  else
    goto <bb 196>; [33.33%]

  <bb 190> [local count: 955630223]:
  if (fdOrder_15741 == 6)
    goto <bb 194>; [40.00%]
  else
    goto <bb 191>; [60.00%]

  <bb 191> [local count: 955630223]:
  if (fdOrder_15741 == 8)
    goto <bb 195>; [66.67%]
  else
    goto <bb 196>; [33.33%]

GCC 9:

  if (fdOrder_13024 == 6)
    goto <bb 194>; [20.00%]
  else
    goto <bb 188>; [80.00%]

  <bb 188> [local count: 955630224]:
  if (fdOrder_13024 > 6)
    goto <bb 191>; [37.50%]
  else
    goto <bb 189>; [62.50%]

  <bb 189> [local count: 955630224]:
  if (fdOrder_13024 == 2)
    goto <bb 192>; [40.00%]
  else
    goto <bb 190>; [60.00%]

  <bb 190> [local count: 955630224]:
  if (fdOrder_13024 == 4)
    goto <bb 193>; [100.00%]
  else
    goto <bb 196>; [0.00%]

  <bb 191> [local count: 955630224]:
  if (fdOrder_13024 == 8)
    goto <bb 195>; [66.67%]
  else
    goto <bb 196>; [33.33%]
CPU 2006 436.cactusADM also has an interesting history: https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/436_cactusADM_big.png
(In reply to Richard Biener from comment #5)
> CPU 2006 436.cactusADM also has an interesting history:
> https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/436_cactusADM_big.png

Compared to other benchmarks it is also quite noisy, especially in the timeframe of this regression.
Benchmarking r270408 on branch vs. trunk on Haswell doesn't show any regression for me. Will double-check with up-to-date CPU 2017 tree.
(In reply to Richard Biener from comment #7)
> Benchmarking r270408 on branch vs. trunk on Haswell doesn't show any
> regression for me. Will double-check with up-to-date CPU 2017 tree.

Confirmed.
I have only seen this when compiling with -march=native on Zen, but then it shows up even at -O2 (which I overlooked yesterday, and which is also confirmed by LNT).
We still regress. According to LNT, 8% on zen2:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=335.437.0&plot.1=309.437.0&plot.2=346.437.0&plot.3=276.437.0&plot.4=398.437.0&plot.5=417.437.0&plot.6=295.437.0&

12% on zen3:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=554.437.0&plot.1=539.437.0&plot.2=562.437.0&plot.3=493.437.0&plot.4=520.437.0&plot.5=508.437.0&plot.6=471.437.0&
(versions we regress against are represented by dots)

and 9.40% on zen1:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=148.437.0&plot.1=59.437.0&plot.2=76.437.0&plot.3=260.437.0&plot.4=361.437.0&plot.5=454.437.0&plot.6=33.437.0&

However, while my independent measurements confirmed the zen2 regression, I did not see the zen3 regression (I have not independently benchmarked zen1).
(In reply to Martin Jambor from comment #10)
> We still regress. According to LNT, 8% on zen2:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=335.437.0&plot.1=309.437.0&plot.2=346.437.0&plot.3=276.437.0&plot.4=398.437.0&plot.5=417.437.0&plot.6=295.437.0&
>
> 12% on zen3:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=554.437.0&plot.1=539.437.0&plot.2=562.437.0&plot.3=493.437.0&plot.4=520.437.0&plot.5=508.437.0&plot.6=471.437.0&
> (versions we regress against are represented by dots)
>
> and 9.40% on zen1:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=148.437.0&plot.1=59.437.0&plot.2=76.437.0&plot.3=260.437.0&plot.4=361.437.0&plot.5=454.437.0&plot.6=33.437.0&
>
> However, while my independent measurements confirmed the zen2 regression,
> I did not see the zen3 regression (I have not independently benchmarked zen1).

According to the first two links above (LNT no longer has a zen1 machine), the problem has been fixed over the GCC 13 development cycle.