[Bug gcov-profile/90364] New: 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune
jamborm at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon May 6 13:04:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364
Bug ID: 90364
Summary: 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: gcov-profile
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux
In my measurements using trunk r270639, profile-guided optimization
(PGO) regresses the run time of 521.wrf_r from SPEC FPrate 2017 by
9.5% when compiling with -Ofast -march=native -mtune=native. Even
LTO+PGO is 7% slower than a build that uses neither.
My observations are consistent with data from LNT:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=33.548.0&plot.1=15.548.0&plot.2=12.548.0&plot.3=17.548.0&
Perf stat and report for the two runs are:
Non-PGO (fast):
    304790.490558  task-clock:u (msec)        #    0.994 CPUs utilized
                0  context-switches:u         #    0.000 K/sec
                0  cpu-migrations:u           #    0.000 K/sec
           292908  page-faults:u              #    0.961 K/sec
     962209421444  cycles:u                   #
      24018297656  stalled-cycles-frontend:u  #    2.50% frontend cycles idle   (83.35%)
     142992971234  stalled-cycles-backend:u   #   14.86% backend cycles idle    (83.33%)
    1792410646274  instructions:u             #    1.86 insn per cycle
                                              #    0.08 stalled cycles per insn (83.34%)
     185705451528  branches:u                 #  609.289 M/sec                  (83.34%)
       2087790818  branch-misses:u            #    1.12% of all branches        (83.35%)

    306.542849367 seconds time elapsed
# Samples: 1M of event 'cycles'
# Event count (approx.): 964214205064
#
# Overhead  Samples  Shared Object    Symbol
# ........  .......  ...............  ..............................................................
#
     7.02%    85562  libm-2.29.so     __logf_fma
     5.99%    72982  libm-2.29.so     __powf_fma
     5.44%    66794  wrf_r_peak.std   __module_advect_em_MOD_advect_scalar_pd
     5.21%    63576  libm-2.29.so     __atanf
     4.30%    52426  libmvec-2.29.so  _ZGVbN4v_expf_sse4
     4.04%    49253  wrf_r_peak.std   __module_mp_wsm5_MOD_wsm52d
     3.93%    47888  wrf_r_peak.std   __module_mp_wsm5_MOD_nislfv_rain_plm
     2.97%    36505  wrf_r_peak.std   __module_small_step_em_MOD_advance_uv
     2.67%    32786  wrf_r_peak.std   __module_small_step_em_MOD_advance_mu_t
     2.63%    32334  wrf_r_peak.std   __module_small_step_em_MOD_advance_w
     2.52%    30796  wrf_r_peak.std   __module_mp_wsm5_MOD_slope_wsm5
     2.52%    30948  wrf_r_peak.std   __module_advect_em_MOD_advect_scalar
     2.34%    28718  libc-2.29.so     __memset_avx2_unaligned_erms
     2.32%    28336  wrf_r_peak.std   __module_bl_ysu_MOD_ysu2d
     2.18%    26624  wrf_r_peak.std   psim_unstable
     2.09%    25667  libmvec-2.29.so  _ZGVbN4vv_powf_sse4
     2.08%    25418  libmvec-2.29.so  _ZGVbN4v_logf_sse4
     1.87%    22858  wrf_r_peak.std   psih_unstable
     1.65%    20244  wrf_r_peak.std   __module_big_step_utilities_em_MOD_phy_prep
     1.56%    19006  wrf_r_peak.std   __module_ra_rrtm_MOD_rtrn
     1.40%    17198  wrf_r_peak.std   __module_bc_MOD_set_physical_bc3d
     1.25%    15339  wrf_r_peak.std   __module_big_step_utilities_em_MOD_horizontal_diffusion
     1.22%    15029  libc-2.29.so     __memmove_avx_unaligned_erms
     1.22%    14833  libm-2.29.so     __expf_fma
     1.15%    14101  wrf_r_peak.std   __module_small_step_em_MOD_calc_p_rho
     1.08%    13312  wrf_r_peak.std   __module_big_step_utilities_em_MOD_horizontal_pressure_gradient
     1.00%    12345  wrf_r_peak.std   __module_big_step_utilities_em_MOD_rhs_ph
PGO (slow):
    325215.123075  task-clock:u (msec)        #    0.993 CPUs utilized
                0  context-switches:u         #    0.000 K/sec
                0  cpu-migrations:u           #    0.000 K/sec
           302283  page-faults:u              #    0.929 K/sec
    1026804177693  cycles:u                   #    3.157 GHz                    (83.33%)
      29812608056  stalled-cycles-frontend:u  #    2.90% frontend cycles idle   (83.35%)
     126544641902  stalled-cycles-backend:u   #   12.32% backend cycles idle    (83.34%)
    1968104678527  instructions:u             #    1.92 insn per cycle
                                              #    0.06 stalled cycles per insn (83.35%)
     199828338783  branches:u                 #  614.450 M/sec                  (83.34%)
       2418851470  branch-misses:u            #    1.21% of all branches        (83.35%)

    327.574599867 seconds time elapsed
# Samples: 1M of event 'cycles'
# Event count (approx.): 1029158853895
#
# Overhead  Samples  Shared Object   Symbol
# ........  .......  ..............  .......................................................
#
     9.94%   129149  libm-2.29.so    __powf_fma
     6.77%    87916  libm-2.29.so    __logf_fma
     6.22%    80774  wrf_r_peak.pgo  __module_mp_wsm5_MOD_nislfv_rain_plm
     5.50%    71494  wrf_r_peak.pgo  __module_mp_wsm5_MOD_wsm52d
     5.16%    67454  wrf_r_peak.pgo  __module_advect_em_MOD_advect_scalar_pd
     4.87%    63208  libm-2.29.so    __atanf
     4.13%    53689  libm-2.29.so    __expf_fma
     3.99%    51813  wrf_r_peak.pgo  __module_bl_ysu_MOD_ysu2d
     2.76%    36137  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_uv
     2.51%    32915  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_w
     2.30%    30061  wrf_r_peak.pgo  __module_advect_em_MOD_advect_scalar
     2.05%    26646  wrf_r_peak.pgo  __module_ra_rrtm_MOD_rtrn
     1.99%    26017  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_mu_t
     1.93%    25130  wrf_r_peak.pgo  psim_unstable
     1.91%    24995  libc-2.29.so    __memset_avx2_unaligned_erms
     1.69%    21998  wrf_r_peak.pgo  psih_unstable
     1.41%    18434  wrf_r_peak.pgo  __module_bc_MOD_set_physical_bc3d
     1.25%    16375  wrf_r_peak.pgo  __module_big_step_utilities_em_MOD_phy_prep
     1.18%    15384  wrf_r_peak.pgo  __module_big_step_utilities_em_MOD_horizontal_diffusion
     1.04%    13570  wrf_r_peak.pgo  __module_small_step_em_MOD_calc_p_rho
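
For a quick side-by-side view of the two perf stat listings, the
following Python sketch computes the relative change of a few counters
from the non-PGO run to the PGO run. The numbers are copied from the
listings; the dictionary layout and the helper name relative_change are
only illustrative. These are two single perf-instrumented runs, so the
percentages need not match the benchmark figure quoted at the top.

```python
# Counter values copied verbatim from the two perf stat listings.
non_pgo = {
    "cycles": 962209421444,
    "instructions": 1792410646274,
    "branches": 185705451528,
    "branch-misses": 2087790818,
    "seconds elapsed": 306.542849367,
}
pgo = {
    "cycles": 1026804177693,
    "instructions": 1968104678527,
    "branches": 199828338783,
    "branch-misses": 2418851470,
    "seconds elapsed": 327.574599867,
}


def relative_change(base, new):
    """Return the change from base to new as a percentage of base."""
    return (new - base) / base * 100.0


# Positive values mean the PGO run has the larger count.
changes = {name: relative_change(non_pgo[name], pgo[name]) for name in non_pgo}

for name, pct in changes.items():
    print(f"{name:16s} {pct:+6.2f}%")
```

In these two runs the PGO binary retires roughly 9.8% more instructions
and spends about 6.7% more cycles, with branch misses growing faster
than branches.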
Note that the calls to libmvec are gone with PGO. However, they could
only be generated because the system I used had the necessary Fortran
include file. IIUC the LNT worker did not have that file until last
week, and yet the regression can be seen in its earlier data too.
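
To put the libmvec observation into numbers, here is a rough Python
sketch that tallies the math-library samples from the two perf report
listings. It only covers symbols that appear in the listings (anything
below the report cut-off is missing), and sample counts from two
separate runs are only an approximate proxy for time. Treating every
symbol that starts with "_ZGV" (the vector function ABI prefix) as a
libmvec entry point is an assumption of this sketch.

```python
# Sample counts copied from the perf report listings (libm and libmvec only).
non_pgo = {
    "__logf_fma": 85562,
    "__powf_fma": 72982,
    "__atanf": 63576,
    "_ZGVbN4v_expf_sse4": 52426,
    "_ZGVbN4vv_powf_sse4": 25667,
    "_ZGVbN4v_logf_sse4": 25418,
    "__expf_fma": 14833,
}
pgo = {
    "__powf_fma": 129149,
    "__logf_fma": 87916,
    "__atanf": 63208,
    "__expf_fma": 53689,
}


def is_vector(symbol):
    # Assumption: vectorized libmvec variants carry the "_ZGV" prefix.
    return symbol.startswith("_ZGV")


def tally(samples):
    """Split a symbol -> samples mapping into (vector, scalar) totals."""
    vector = sum(n for s, n in samples.items() if is_vector(s))
    scalar = sum(n for s, n in samples.items() if not is_vector(s))
    return vector, scalar


vec_non_pgo, scalar_non_pgo = tally(non_pgo)
vec_pgo, scalar_pgo = tally(pgo)

print("non-PGO: vector", vec_non_pgo, "scalar", scalar_non_pgo)
print("PGO:     vector", vec_pgo, "scalar", scalar_pgo)
print("total:  ", vec_non_pgo + scalar_non_pgo, "vs", vec_pgo + scalar_pgo)
```

With the listed symbols, the combined math-library sample count is of
similar size in both runs; the samples mostly move from the vector
variants to the scalar __powf_fma and __expf_fma.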
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)