[Bug middle-end/90283] New: 519.lbm_r is 7%-10% slower with -Ofast -march=native and both LTO and PGO than with GCC 8

Mon Apr 29 17:45:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90283

            Bug ID: 90283
           Summary: 519.lbm_r is 7%-10% slower with -Ofast -march=native
                    and both LTO and PGO than with GCC 8
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

When I build 519.lbm_r with GCC 9 (specifically, r270364) using -Ofast
-march=native -mtune=native and both LTO and PGO, the binary is then
about 7%-10% slower than when built with GCC 8 and the same options.

I can see this on both an AMD Zen machine (10%) and an Intel Skylake
server (7%).

I have bisected the regression on the Zen machine, where it happened
in two steps.  The first one is r260348, which causes a 7% regression
on both the Zen and Intel server CPUs.  Because it affects both in a
similar way, I hope it is not another manifestation of PR 84200.

As far as profile data are concerned, in all cases 99% of run-time is
spent in function main.  Perf stat output is the following:

Fast (r260347) on Zen:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     157862.072201      task-clock:u (msec)       #    0.999 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
              4354      page-faults:u             #    0.028 K/sec              
      490921430199      cycles:u                  #    
        5942617830      stalled-cycles-frontend:u #    1.21% frontend cycles idle     (83.36%)
       11565687163      stalled-cycles-backend:u  #    2.36% backend cycles idle      (83.32%)
     1121945505076      instructions:u            #    2.29  insn per cycle
                                                  #    0.01  stalled cycles per insn  (83.32%)
       11591019938      branches:u                #   73.425 M/sec                    (83.36%)
          50878910      branch-misses:u           #    0.44% of all branches          (83.33%)

     158.013578100 seconds time elapsed

Slower (r260348) on Zen:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     166747.570030      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
              4354      page-faults:u             #    0.026 K/sec
      520147919104      cycles:u                  #    
        4619521659      stalled-cycles-frontend:u #    0.89% frontend cycles idle     (83.32%)
       11565577319      stalled-cycles-backend:u  #    2.22% backend cycles idle      (83.32%)
     1133497632829      instructions:u            #    2.18  insn per cycle
                                                  #    0.01  stalled cycles per insn  (83.36%)
       11583199072      branches:u                #   69.465 M/sec                    (83.33%)
          50821264      branch-misses:u           #    0.44% of all branches          (83.32%)

     166.898923990 seconds time elapsed
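
For reference, the relative change between the two runs above can be
derived directly from the raw counters.  The following is only a
minimal Python sketch: the names fast, slow and pct_change are
illustrative, and the figures are copied from the perf stat output
for r260347 and r260348.

```python
# Figures copied from the two perf stat runs above (Zen, r260347 vs r260348).
# The names `fast`, `slow` and `pct_change` are illustrative only.
fast = {
    "seconds": 158.013578100,
    "cycles": 490921430199,
    "instructions": 1121945505076,
}
slow = {
    "seconds": 166.898923990,
    "cycles": 520147919104,
    "instructions": 1133497632829,
}


def pct_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Relative change of each metric between the two revisions.
for key in fast:
    print(f"{key:>12}: {pct_change(fast[key], slow[key]):+.2f}%")

# Instructions per cycle, recomputed from the raw counts.
for name, run in (("r260347", fast), ("r260348", slow)):
    print(f"{name} IPC: {run['instructions'] / run['cycles']:.2f}")
```

Note that the cycle and instruction counters were multiplexed (the
percentages in parentheses), so perf reports scaled estimates and
values recomputed this way may differ slightly from the ratios perf
prints itself.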

The second performance drop on Zen happened at r265795, albeit only by
3%.  That revision does not seem to have any effect on the Intel CPU
and thus, given how weirdly the benchmark can sometimes behave, may
not be that interesting.

Just before the second drop (r265794):

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     165315.997872      task-clock:u (msec)       #    0.999 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
              4354      page-faults:u             #    0.026 K/sec              
      520201473687      cycles:u                  #    
        4890796962      stalled-cycles-frontend:u #    0.94% frontend cycles idle     (83.37%)
       11565134531      stalled-cycles-backend:u  #    2.22% backend cycles idle      (83.32%)
     1132849187518      instructions:u            #    2.18  insn per cycle
                                                  #    0.01  stalled cycles per insn  (83.31%)
       11591493304      branches:u                #   70.117 M/sec                    (83.37%)
          50879513      branch-misses:u           #    0.44% of all branches          (83.32%)

     165.498590592 seconds time elapsed

Second drop (r265795):

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     170908.963939      task-clock:u (msec)       #    0.999 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
              4430      page-faults:u             #    0.026 K/sec              
      539336426342      cycles:u                  #    
        3889378937      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.36%)
       11564727183      stalled-cycles-backend:u  #    2.14% backend cycles idle      (83.32%)
     1146203876321      instructions:u            #    2.13  insn per cycle
                                                  #    0.01  stalled cycles per insn  (83.31%)
       11589809180      branches:u                #   67.813 M/sec                    (83.37%)
          50679537      branch-misses:u           #    0.44% of all branches          (83.32%)

     171.089470855 seconds time elapsed
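
To see how much of the second drop comes from executing more
instructions and how much from lower IPC, the cycle ratio can be
split into those two factors, since cycles = instructions / IPC.
This is only a rough Python sketch: all names in it are illustrative,
and the figures are copied from the r265794 and r265795 runs above.

```python
# Figures copied from the perf stat runs above (Zen, r265794 vs r265795).
# All names here are illustrative only.
before = {"cycles": 520201473687, "instructions": 1132849187518}
after = {"cycles": 539336426342, "instructions": 1146203876321}


def ratio(key):
    """Ratio of a counter after the revision to its value before."""
    return after[key] / before[key]


def ipc(run):
    """Instructions per cycle recomputed from the raw counts."""
    return run["instructions"] / run["cycles"]


cycles_ratio = ratio("cycles")
insn_ratio = ratio("instructions")
ipc_ratio = ipc(after) / ipc(before)

# cycles = instructions / IPC, so the cycle ratio factors into an
# instruction-count part and an IPC part.
print(f"cycles:       x{cycles_ratio:.4f}")
print(f"instructions: x{insn_ratio:.4f}")
print(f"IPC:          x{ipc_ratio:.4f}")
print(f"insn / IPC:   x{insn_ratio / ipc_ratio:.4f}")
```

The counters were multiplexed, so the split is an estimate rather
than an exact attribution.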

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)