[Bug gcov-profile/90364] New: 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

jamborm at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Mon May 6 13:04:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

            Bug ID: 90364
           Summary: 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at
                    -Ofast and native march/mtune
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: gcov-profile
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

In my measurements on trunk r270639, profile-guided optimization
(PGO) makes 521.wrf_r from SPEC 2017 FPrate 9.5% slower when
compiling with -Ofast -march=native -mtune=native.  Even LTO+PGO is
7% slower than a build that uses neither LTO nor PGO.
My observations are consistent with data from LNT:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=33.548.0&plot.1=15.548.0&plot.2=12.548.0&plot.3=17.548.0&

Perf stat and report for the two runs are:

Non-PGO (fast):

     304790.490558      task-clock:u (msec)       #    0.994 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
            292908      page-faults:u             #    0.961 K/sec              
      962209421444      cycles:u                  #    
       24018297656      stalled-cycles-frontend:u #    2.50% frontend cycles idle     (83.35%)
      142992971234      stalled-cycles-backend:u  #   14.86% backend cycles idle      (83.33%)
     1792410646274      instructions:u            #    1.86  insn per cycle     
                                                  #    0.08  stalled cycles per insn  (83.34%)
      185705451528      branches:u                #  609.289 M/sec                    (83.34%)
        2087790818      branch-misses:u           #    1.12% of all branches          (83.35%)

     306.542849367 seconds time elapsed

 # Samples: 1M of event 'cycles'
 # Event count (approx.): 964214205064
 #
 # Overhead    Samples   Shared Object     Symbol                               
 # ........  .........   ...............   ..............................................................
 #
      7.02%      85562   libm-2.29.so      __logf_fma
      5.99%      72982   libm-2.29.so      __powf_fma
      5.44%      66794   wrf_r_peak.std    __module_advect_em_MOD_advect_scalar_pd
      5.21%      63576   libm-2.29.so      __atanf
      4.30%      52426   libmvec-2.29.so   _ZGVbN4v_expf_sse4
      4.04%      49253   wrf_r_peak.std    __module_mp_wsm5_MOD_wsm52d
      3.93%      47888   wrf_r_peak.std    __module_mp_wsm5_MOD_nislfv_rain_plm
      2.97%      36505   wrf_r_peak.std    __module_small_step_em_MOD_advance_uv
      2.67%      32786   wrf_r_peak.std    __module_small_step_em_MOD_advance_mu_t
      2.63%      32334   wrf_r_peak.std    __module_small_step_em_MOD_advance_w
      2.52%      30796   wrf_r_peak.std    __module_mp_wsm5_MOD_slope_wsm5
      2.52%      30948   wrf_r_peak.std    __module_advect_em_MOD_advect_scalar
      2.34%      28718   libc-2.29.so      __memset_avx2_unaligned_erms
      2.32%      28336   wrf_r_peak.std    __module_bl_ysu_MOD_ysu2d
      2.18%      26624   wrf_r_peak.std    psim_unstable
      2.09%      25667   libmvec-2.29.so   _ZGVbN4vv_powf_sse4
      2.08%      25418   libmvec-2.29.so   _ZGVbN4v_logf_sse4
      1.87%      22858   wrf_r_peak.std    psih_unstable
      1.65%      20244   wrf_r_peak.std    __module_big_step_utilities_em_MOD_phy_prep
      1.56%      19006   wrf_r_peak.std    __module_ra_rrtm_MOD_rtrn
      1.40%      17198   wrf_r_peak.std    __module_bc_MOD_set_physical_bc3d
      1.25%      15339   wrf_r_peak.std    __module_big_step_utilities_em_MOD_horizontal_diffusion
      1.22%      15029   libc-2.29.so      __memmove_avx_unaligned_erms
      1.22%      14833   libm-2.29.so      __expf_fma
      1.15%      14101   wrf_r_peak.std    __module_small_step_em_MOD_calc_p_rho
      1.08%      13312   wrf_r_peak.std    __module_big_step_utilities_em_MOD_horizontal_pressure_gradient
      1.00%      12345   wrf_r_peak.std    __module_big_step_utilities_em_MOD_rhs_ph


PGO (slow):

     325215.123075      task-clock:u (msec)       #    0.993 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
            302283      page-faults:u             #    0.929 K/sec              
     1026804177693      cycles:u                  #    3.157 GHz                      (83.33%)
       29812608056      stalled-cycles-frontend:u #    2.90% frontend cycles idle     (83.35%)
      126544641902      stalled-cycles-backend:u  #   12.32% backend cycles idle      (83.34%)
     1968104678527      instructions:u            #    1.92  insn per cycle     
                                                  #    0.06  stalled cycles per insn  (83.35%)
      199828338783      branches:u                #  614.450 M/sec                    (83.34%)
        2418851470      branch-misses:u           #    1.21% of all branches          (83.35%)

     327.574599867 seconds time elapsed
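
For a quick side-by-side reading of the two perf stat summaries, the
relative changes can be computed with a short Python sketch.  It uses
only the counter values quoted above; the dictionary layout and the
helper name pct_change are illustrative choices, not part of the
original measurements.

```python
# Counter values copied from the two perf stat summaries above.
runs = {
    "non-PGO": {
        "cycles": 962209421444,
        "instructions": 1792410646274,
        "branches": 185705451528,
        "branch_misses": 2087790818,
        "elapsed_s": 306.542849367,
    },
    "PGO": {
        "cycles": 1026804177693,
        "instructions": 1968104678527,
        "branches": 199828338783,
        "branch_misses": 2418851470,
        "elapsed_s": 327.574599867,
    },
}


def pct_change(metric):
    """Relative change of `metric` from the non-PGO run to the PGO run, in percent."""
    base = runs["non-PGO"][metric]
    new = runs["PGO"][metric]
    return 100.0 * (new - base) / base


def ipc(run_name):
    """Instructions retired per cycle for the named run."""
    run = runs[run_name]
    return run["instructions"] / run["cycles"]


for metric in ("elapsed_s", "cycles", "instructions", "branches", "branch_misses"):
    print(f"{metric:14s} {pct_change(metric):+6.2f} %")

for name in runs:
    print(f"{name}: IPC = {ipc(name):.2f}")
```

On these figures the PGO binary executes roughly 10% more
instructions and spends roughly 7% more cycles, at a slightly higher
IPC.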

 # Samples: 1M of event 'cycles'
 # Event count (approx.): 1029158853895
 #
 # Overhead    Samples   Shared Object    Symbol                                
 # ........  .........   ..............   .......................................................
 #
      9.94%     129149   libm-2.29.so     __powf_fma
      6.77%      87916   libm-2.29.so     __logf_fma
      6.22%      80774   wrf_r_peak.pgo   __module_mp_wsm5_MOD_nislfv_rain_plm
      5.50%      71494   wrf_r_peak.pgo   __module_mp_wsm5_MOD_wsm52d
      5.16%      67454   wrf_r_peak.pgo   __module_advect_em_MOD_advect_scalar_pd
      4.87%      63208   libm-2.29.so     __atanf
      4.13%      53689   libm-2.29.so     __expf_fma
      3.99%      51813   wrf_r_peak.pgo   __module_bl_ysu_MOD_ysu2d
      2.76%      36137   wrf_r_peak.pgo   __module_small_step_em_MOD_advance_uv
      2.51%      32915   wrf_r_peak.pgo   __module_small_step_em_MOD_advance_w
      2.30%      30061   wrf_r_peak.pgo   __module_advect_em_MOD_advect_scalar
      2.05%      26646   wrf_r_peak.pgo   __module_ra_rrtm_MOD_rtrn
      1.99%      26017   wrf_r_peak.pgo   __module_small_step_em_MOD_advance_mu_t
      1.93%      25130   wrf_r_peak.pgo   psim_unstable
      1.91%      24995   libc-2.29.so     __memset_avx2_unaligned_erms
      1.69%      21998   wrf_r_peak.pgo   psih_unstable
      1.41%      18434   wrf_r_peak.pgo   __module_bc_MOD_set_physical_bc3d
      1.25%      16375   wrf_r_peak.pgo   __module_big_step_utilities_em_MOD_phy_prep
      1.18%      15384   wrf_r_peak.pgo   __module_big_step_utilities_em_MOD_horizontal_diffusion
      1.04%      13570   wrf_r_peak.pgo   __module_small_step_em_MOD_calc_p_rho


Note that the calls to libmvec are gone with PGO.  However, those
calls could only be generated because the system I used had the
necessary Fortran include file.  IIUC the LNT worker did not have
that file until last week, and yet the regression can be seen in its
earlier data too, so the missing libmvec calls are unlikely to be the
whole explanation.
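
To see how the math-library samples shift between the two perf
report listings, they can be grouped by function family.  This is
only a rough Python sketch: the sample counts are copied from the
listings above, the helper names family and totals are made up for
illustration, and the comparison assumes that both runs were sampled
with the same period.

```python
from collections import Counter

# libm / libmvec sample counts copied from the two perf report listings.
# Only entries shown in the listings (about 1% overhead or more) are included.
non_pgo = {
    "__logf_fma": 85562,
    "__powf_fma": 72982,
    "__atanf": 63576,
    "_ZGVbN4v_expf_sse4": 52426,
    "_ZGVbN4vv_powf_sse4": 25667,
    "_ZGVbN4v_logf_sse4": 25418,
    "__expf_fma": 14833,
}
pgo = {
    "__powf_fma": 129149,
    "__logf_fma": 87916,
    "__atanf": 63208,
    "__expf_fma": 53689,
}


def family(symbol):
    """Map a scalar (libm) or vector (libmvec) symbol to its math function."""
    for name in ("powf", "logf", "expf", "atanf"):
        if name in symbol:
            return name
    raise ValueError(f"unexpected symbol: {symbol}")


def totals(samples):
    """Sum the sample counts per math function family."""
    result = Counter()
    for symbol, count in samples.items():
        result[family(symbol)] += count
    return dict(result)


before = totals(non_pgo)
after = totals(pgo)
for name in ("powf", "logf", "expf", "atanf"):
    print(f"{name:6s} non-PGO {before[name]:7d}   PGO {after[name]:7d}")
print(f"total  non-PGO {sum(before.values()):7d}   PGO {sum(after.values()):7d}")
```

Because entries below the listing cutoff are missing, the totals are
lower bounds rather than exact figures.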


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

