Bug 90128 - 507.cactuBSSN_r is 9-11% slower at -Ofast and native march/tuning on Zen CPUs
Summary: 507.cactuBSSN_r is 9-11% slower at -Ofast and native march/tuning on Zen CPUs
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
Depends on:
Blocks: spec
Reported: 2019-04-17 11:24 UTC by Martin Jambor
Modified: 2023-09-20 20:11 UTC

See Also:
Host: x86_64-linux
Target: x86_64-linux
Known to work:
Known to fail:
Last reconfirmed: 2019-04-17 00:00:00


Description Martin Jambor 2019-04-17 11:24:21 UTC
In my own measurements, 507.cactuBSSN_r is about 9.4% slower on an AMD
Zen CPU when compiled with GCC 9 with -Ofast and native march/mtune
than when it is compiled with GCC 8.  LNT currently even shows an 11.4%
regression: https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch

I have done some bisecting and the slowdown happened in three steps.
First, the benchmark slowed by about 2% at some point before r262510
which I have not tracked down yet. Second, it slowed by a further 3%
with r263874, but this again seems to be a code-placement issue because
the assembly of the functions which gained perf samples did not
change in that revision, while the stalled-cycles-frontend reported by
perf went from 4.58% to 5.02%.

However, the third regression was caused by the immediately following
revision r263875: the difference is 4.5% (7.5% when compared to the GCC 8
run-time), while perf reported only 4.05% stalled-cycles-frontend.

r263872 (good) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     238848.989836      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             92923      page-faults:u             #    0.389 K/sec                  
      758195547230      cycles:u                  #    3.174 GHz                      (83.33%)
       34727040659      stalled-cycles-frontend:u #    4.58% frontend cycles idle     (83.33%)
       15457735869      stalled-cycles-backend:u  #    2.04% backend cycles idle      (83.33%)
     1225370192228      instructions:u            #    1.62  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23031544594      branches:u                #   96.427 M/sec                    (83.34%)
          18985096      branch-misses:u           #    0.08% of all branches          (83.33%)

     239.158442295 seconds time elapsed

 # Event count (approx.): 758374775503
 # Overhead    Samples  Command       Shared Object      Symbol                                                                                                                                          
 # ........  .........  ............  .................  .........................................
     40.51%     387505  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.34%     214782  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.42%      80594  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.40%      70897  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.77%      55393  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.99%      47952  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.98%      28573  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.44%      23623  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination

r263874 (worse) perf stat and report:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     244036.523777      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             93013      page-faults:u             #    0.381 K/sec                  
      774757677736      cycles:u                  #    3.175 GHz                      (83.33%)
       38930288027      stalled-cycles-frontend:u #    5.02% frontend cycles idle     (83.33%)
       15508961324      stalled-cycles-backend:u  #    2.00% backend cycles idle      (83.34%)
     1226167776333      instructions:u            #    1.58  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23218262947      branches:u                #   95.143 M/sec                    (83.33%)
          18890390      branch-misses:u           #    0.08% of all branches          (83.33%)

     244.344340731 seconds time elapsed

 # Samples: 979K of event 'cycles'
 # Event count (approx.): 775138268715
 # Overhead    Samples  Command       Shared Object      Symbol                                                                                                                                          
 # ........  .........  ............  .................  .........................................
     41.43%     404835  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     22.04%     216520  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      8.22%      80341  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      7.26%      71052  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.86%      57419  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.89%      48084  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.92%      28579  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.38%      23520  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination

r263875 (bad) perf stat and report (note that branch misses grew by 6%):

  Performance counter stats for 'numactl -C 0 -l specinvoke':

     254984.828108      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             92949      page-faults:u             #    0.365 K/sec                  
      809505457529      cycles:u                  #    3.175 GHz                      (83.33%)
       32784020923      stalled-cycles-frontend:u #    4.05% frontend cycles idle     (83.33%)
       15658463714      stalled-cycles-backend:u  #    1.93% backend cycles idle      (83.33%)
     1225361873924      instructions:u            #    1.51  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.33%)
       23461309363      branches:u                #   92.011 M/sec                    (83.34%)
          20152382      branch-misses:u           #    0.09% of all branches          (83.33%)

     255.313012246 seconds time elapsed

 # Event count (approx.): 812138555051
 # Overhead    Samples  Command       Shared Object      Symbol                                                                                                                                          
 # ........  .........  ............  .................  .........................................
     37.54%     384512  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_RHS_Body
     27.51%     282987  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_Advect_Body
      7.80%      79887  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_constraints_Body
      6.86%      70384  cactusBSSN_r  libm-2.26.so       __ieee754_exp_avx
      5.73%      58878  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBaseDtLapseShift_Body
      4.66%      47990  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_convertToADMBase_Body
      2.79%      28638  cactusBSSN_r  cactusBSSN_r_peak  ML_BSSN_InitRHS_Body
      2.28%      23615  cactusBSSN_r  cactusBSSN_r_peak  MoL_LinearCombination
I did the bisecting on a machine with glibc 2.26 but the issue was
detected on one with glibc 2.29.
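
For readers who want to recompute the deltas from the raw counters, here is a minimal sketch in C. The numbers are copied from the three perf stat outputs above; the struct and function names are made up for this illustration. From these counters alone, cycles grow by roughly 2.2% from r263872 to r263874 and by roughly 4.5% from r263874 to r263875, while IPC drops from about 1.62 to 1.58 to 1.51.

```c
#include <assert.h>
#include <stdio.h>

/* Raw user-space counters copied from the three perf stat runs above.
   The struct layout and names are illustrative only. */
struct perf_run {
    const char *rev;
    double cycles;
    double instructions;
    double branch_misses;
};

static const struct perf_run runs[3] = {
    { "r263872", 758195547230.0, 1225370192228.0, 18985096.0 },
    { "r263874", 774757677736.0, 1226167776333.0, 18890390.0 },
    { "r263875", 809505457529.0, 1225361873924.0, 20152382.0 },
};

/* Relative growth of `after` over `before`, in percent. */
double pct_growth(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Instructions retired per cycle. */
double ipc(const struct perf_run *r)
{
    return r->instructions / r->cycles;
}

/* Print cycle growth, branch-miss growth and IPC for each consecutive
   pair of revisions. */
void print_report(void)
{
    for (int i = 1; i < 3; i++) {
        const struct perf_run *a = &runs[i - 1], *b = &runs[i];
        printf("%s -> %s: cycles %+.2f%%, branch-misses %+.2f%%, "
               "IPC %.2f -> %.2f\n",
               a->rev, b->rev,
               pct_growth(a->cycles, b->cycles),
               pct_growth(a->branch_misses, b->branch_misses),
               ipc(a), ipc(b));
    }
}
```

Calling print_report() from a small main() prints one line per step. The instruction counts are nearly identical across the three revisions, so the extra cycles show up almost entirely as lower IPC.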
Comment 1 Richard Biener 2019-04-17 11:44:48 UTC
So the issue is in ML_BSSN_Advect_Body (the other function rebounded).  I will
have a look.
Comment 2 Richard Biener 2019-04-17 12:07:06 UTC
Ugh.  Cactus is really ugly code :/  For one, there's an invariant switch ()
in the innermost loop, expanded to a binary tree (slightly different split
point GCC 8 vs. trunk); obviously unswitching cannot handle this.  This is a
general missed optimization precluding any vectorization attempt here.  Then
we spill the hell out of us because of the way the code is written.  Other
than that I don't see anything obvious here.  It might be that trunk:

    5802:       83 fb 06                cmp    $0x6,%ebx
    5805:       0f 84 25 84 00 00       je     dc30 <_ZL19ML_BSSN_Advect_BodyPK4
    580b:       0f 8f cf 1d 00 00       jg     75e0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x75e0>
    5811:       83 fb 02                cmp    $0x2,%ebx
    5814:       0f 85 06 c0 ff ff       jne    1820 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x1820>

is worse for the branch predictor than the GCC 8 version:

    89ee:       0f 84 bc 64 00 00       je     eeb0 <_ZL19ML_BSSN_Advect_BodyPK4
    89f4:       0f 8e 96 45 00 00       jle    cf90 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xcf90>
    89fa:       8b b4 24 a8 08 00 00    mov    0x8a8(%rsp),%esi
    8a01:       83 fe 06                cmp    $0x6,%esi
    8a04:       0f 85 e6 8e ff ff       jne    18f0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x18f0>

(notice the "padding" reload).  That is probably going to depend on final
code layout again of course.  I recall reading a third conditional jump
in a fetch word requires an additional branch predictor slot or so.

So it would be interesting to see if the branch misses accumulate on
that binary tree generated from the loop invariant switch where in
theory those should be all totally predictable.

That said, I'm not yet able to reproduce the slowdown but will try.
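
To illustrate the code shape being described, here is a minimal sketch in C. It is not the Cactus source: the function names, the two stencils and the array layout are hypothetical, and only two of the switch cases are shown. The first function dispatches on a loop-invariant value in every iteration; the second is what hand-unswitching would look like, with the dispatch hoisted so that each loop body is branch-free and a candidate for vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stencils standing in for the per-order finite-difference
   code; they are not taken from the benchmark. */
static double stencil2(const double *u, size_t i)
{
    return 0.5 * (u[i + 1] - u[i - 1]);
}

static double stencil4(const double *u, size_t i)
{
    return (8.0 * (u[i + 1] - u[i - 1]) - (u[i + 2] - u[i - 2])) / 12.0;
}

/* Shape described above: fd_order never changes inside the loop, yet it
   is dispatched on in every iteration. */
void advect_switch_inside(double *out, const double *u, size_t n, int fd_order)
{
    for (size_t i = 2; i + 2 < n; i++) {
        switch (fd_order) {
        case 2:  out[i] = stencil2(u, i); break;
        case 4:  out[i] = stencil4(u, i); break;
        default: out[i] = 0.0;            break;
        }
    }
}

/* Hand-unswitched variant: the dispatch happens once, outside the loop,
   and every loop body is straight-line code. */
void advect_unswitched(double *out, const double *u, size_t n, int fd_order)
{
    switch (fd_order) {
    case 2:
        for (size_t i = 2; i + 2 < n; i++)
            out[i] = stencil2(u, i);
        break;
    case 4:
        for (size_t i = 2; i + 2 < n; i++)
            out[i] = stencil4(u, i);
        break;
    default:
        for (size_t i = 2; i + 2 < n; i++)
            out[i] = 0.0;
        break;
    }
}
```

Both functions compute the same values. The difference is only in how often the invariant condition is evaluated and in how much code sits inside the hot loop.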
Comment 3 Martin Liška 2019-04-17 12:21:22 UTC
Direct graph link to branch comparison:
Comment 4 Martin Liška 2019-04-17 12:35:50 UTC
(In reply to Richard Biener from comment #2)
> Ugh.  Cactus is really ugly code :/  For one there's an invariant switch ()
> in the innermost loop, expanded to a binary tree (slightly different split
> point
> GCC 8 vs. trunk), obviously unswitching cannot handle this.

Yes, the binary tree is a bit different, but equally good to me:

GCC 8:

  if (fdOrder_15741 == 4)
    goto <bb 193>; [20.00%]
  else
    goto <bb 188>; [80.00%]

  <bb 188> [local count: 955630223]:
  if (fdOrder_15741 > 4)
    goto <bb 190>; [62.50%]
  else
    goto <bb 189>; [37.50%]

  <bb 189> [local count: 955630223]:
  if (fdOrder_15741 == 2)
    goto <bb 192>; [66.67%]
  else
    goto <bb 196>; [33.33%]

  <bb 190> [local count: 955630223]:
  if (fdOrder_15741 == 6)
    goto <bb 194>; [40.00%]
  else
    goto <bb 191>; [60.00%]

  <bb 191> [local count: 955630223]:
  if (fdOrder_15741 == 8)
    goto <bb 195>; [66.67%]
  else
    goto <bb 196>; [33.33%]

GCC 9:

  if (fdOrder_13024 == 6)
    goto <bb 194>; [20.00%]
  else
    goto <bb 188>; [80.00%]

  <bb 188> [local count: 955630224]:
  if (fdOrder_13024 > 6)
    goto <bb 191>; [37.50%]
  else
    goto <bb 189>; [62.50%]

  <bb 189> [local count: 955630224]:
  if (fdOrder_13024 == 2)
    goto <bb 192>; [40.00%]
  else
    goto <bb 190>; [60.00%]

  <bb 190> [local count: 955630224]:
  if (fdOrder_13024 == 4)
    goto <bb 193>; [100.00%]
  else
    goto <bb 196>; [0.00%]

  <bb 191> [local count: 955630224]:
  if (fdOrder_13024 == 8)
    goto <bb 195>; [66.67%]
  else
    goto <bb 196>; [33.33%]
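
As a sanity check on "equally good", the two trees can be transcribed into C and compared. This is only a sketch: the function names are invented, the return value is the number of the target basic block from the dumps above, and the reading of bb 192 to bb 195 as the cases 2, 4, 6 and 8 with bb 196 as the default is an assumption based on the comparisons shown.

```c
#include <assert.h>

/* GCC 8 tree: split point at 4.  Returns the target basic block number
   and stores the number of comparisons executed in *cmps. */
int tree_gcc8(int fd_order, int *cmps)
{
    *cmps = 1;
    if (fd_order == 4)
        return 193;
    ++*cmps;
    if (fd_order > 4) {
        ++*cmps;
        if (fd_order == 6)
            return 194;
        ++*cmps;
        return fd_order == 8 ? 195 : 196;
    }
    ++*cmps;
    return fd_order == 2 ? 192 : 196;
}

/* GCC 9 tree: split point at 6. */
int tree_gcc9(int fd_order, int *cmps)
{
    *cmps = 1;
    if (fd_order == 6)
        return 194;
    ++*cmps;
    if (fd_order > 6) {
        ++*cmps;
        return fd_order == 8 ? 195 : 196;
    }
    ++*cmps;
    if (fd_order == 2)
        return 192;
    ++*cmps;
    return fd_order == 4 ? 193 : 196;
}
```

Both trees reach the same block for every input, and each needs 1, 3, 3 and 4 comparisons for the four handled values, only assigned to different values (4 is cheapest with the GCC 8 split, 6 with the GCC 9 split). Which tree is cheaper in practice therefore depends on the fdOrder value actually used at run time and on how the branches are laid out.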
Comment 5 Richard Biener 2019-04-17 12:48:54 UTC
CPU 2006 436.cactusADM also has an interesting history: https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/436_cactusADM_big.png
Comment 6 Richard Biener 2019-04-17 12:52:31 UTC
(In reply to Richard Biener from comment #5)
> CPU 2006 436.cactusADM also has an interesting history:
> https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/
> 436_cactusADM_big.png

compared to other benchmarks it is also quite noisy - esp. in the timeframe
of this regression.
Comment 7 Richard Biener 2019-04-17 13:12:25 UTC
Benchmarking r270408 on branch vs. trunk on Haswell doesn't show any regression
for me.  Will double-check with up-to-date CPU 2017 tree.
Comment 8 Richard Biener 2019-04-17 13:42:31 UTC
(In reply to Richard Biener from comment #7)
> Benchmarking r270408 on branch vs. trunk on Haswell doesn't show any
> regression
> for me.  Will double-check with up-to-date CPU 2017 tree.

Comment 9 Martin Jambor 2019-04-17 17:06:01 UTC
I have only seen this when compiling with -march=native on Zen, but even at -O2 (which I overlooked yesterday, and which is also confirmed by LNT).
Comment 10 Martin Jambor 2022-01-21 17:19:00 UTC
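We still regress, according to LNT 8% on zen2:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=335.437.0&plot.1=309.437.0&plot.2=346.437.0&plot.3=276.437.0&plot.4=398.437.0&plot.5=417.437.0&plot.6=295.437.0&

and 12% on zen3:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=554.437.0&plot.1=539.437.0&plot.2=562.437.0&plot.3=493.437.0&plot.4=520.437.0&plot.5=508.437.0&plot.6=471.437.0&
(versions we regress against are represented by dots)

and 9.40% on zen1:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=148.437.0&plot.1=59.437.0&plot.2=76.437.0&plot.3=260.437.0&plot.4=361.437.0&plot.5=454.437.0&plot.6=33.437.0&

However, while my independent measurements confirmed the zen2 regression, I did not see the zen3 regression (I have not independently benchmarked zen1).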
Comment 11 Martin Jambor 2023-01-18 17:17:45 UTC
(In reply to Martin Jambor from comment #10)
> We still regress, according to LNT 8% on zen2:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=335.437.0&plot.
> 1=309.437.0&plot.2=346.437.0&plot.3=276.437.0&plot.4=398.437.0&plot.5=417.
> 437.0&plot.6=295.437.0&
> and 12% on zen3:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=554.437.0&plot.
> 1=539.437.0&plot.2=562.437.0&plot.3=493.437.0&plot.4=520.437.0&plot.5=508.
> 437.0&plot.6=471.437.0&
> (versions we regress against are represented by dots)
> and 9.40% against zen1:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=148.437.0&plot.1=59.
> 437.0&plot.2=76.437.0&plot.3=260.437.0&plot.4=361.437.0&plot.5=454.437.
> 0&plot.6=33.437.0&
> However, while my independent measurements confirmed the zen2 regression, I
> did not see the zen3 regression (I have not independently benchmarked zen1).

According to the first two links above (LNT no longer has a zen1 machine), the problem has been fixed over the GCC 13 development cycle.