SPEC 2017 INTrate benchmark 505.mcf_r, compiled with -Ofast -march=native -mtune=native, is 6-7% slower when built with both PGO and LTO than when built with LTO alone. I have observed this on both AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs. The train run cannot be very bad, because without LTO, PGO improves run time by 15% on both systems. This is with master revision 26b3e568a60.

Profiling results (from the AMD CPU):

LTO:

Overhead       Samples  Shared Object    Symbol
........  ............  ...............  ......................
  39.53%        518450  mcf_r_peak.mine  spec_qsort.constprop.0
  22.13%        289745  mcf_r_peak.mine  master.constprop.0
  19.00%        248641  mcf_r_peak.mine  replace_weaker_arc
   9.37%        122669  mcf_r_peak.mine  main
   8.60%        112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

Overhead       Samples  Shared Object    Symbol
........  ............  ...............  ......................
  40.13%        562770  mcf_r_peak.mine  spec_qsort.constprop.0
  21.68%        303543  mcf_r_peak.mine  master.constprop.0
  18.24%        255236  mcf_r_peak.mine  replace_weaker_arc
  10.32%        144433  mcf_r_peak.mine  main
   8.07%        112775  mcf_r_peak.mine  arc_compare

I should perhaps note that we have patched qsort in the benchmark to work with strict aliasing even under LTO, but the performance gap is also there with -fno-strict-aliasing.
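For reference, outside the SPEC harness the two configurations correspond roughly to the GCC invocations sketched below. This is only an illustration of the flag combinations; the source globs, output names and the train-run invocation are placeholders, as the actual binaries are built and trained through the SPEC runcpu machinery.

  # LTO-only build
  gcc -Ofast -march=native -mtune=native -flto *.c -o mcf_r_lto

  # PGO+LTO: instrument, run the train workload, rebuild with the profile
  gcc -Ofast -march=native -mtune=native -flto -fprofile-generate *.c -o mcf_r_inst
  ./mcf_r_inst train.in        # placeholder for the SPEC train run; writes *.gcda data
  gcc -Ofast -march=native -mtune=native -flto -fprofile-use *.c -o mcf_r_pgo_lto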
Confirmed; this can be seen nicely in the LNT periodic benchmark results: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.347.0&plot.1=293.347.0&plot.2=287.347.0&plot.3=286.347.0
The profile looks inconclusive: the sample counts differ, but they increase fairly evenly across the symbols. The overall number of samples is missing - does that increase by the same 6-7%?
I did not save the reported total number of samples, but from the raw sample counts and the percentages it seems so:

  (562770 / 0.4013) / (518450 / 0.3953) = 1.069

Nevertheless, I did save separately obtained perf stat numbers, which show a similar increase (and the number of branches might be a clue):

LTO:

        326083.03 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
             8821      page-faults:u             #    0.027 K/sec
    1080945983089      cycles:u                                                     (83.33%)
      21883016095      stalled-cycles-frontend:u #    2.02% frontend cycles idle    (83.33%)
     435184347885      stalled-cycles-backend:u  #   40.26% backend cycles idle     (83.33%)
     847570680279      instructions:u            #    0.78  insn per cycle
                                                 #    0.51  stalled cycles per insn (83.34%)
     147428907202      branches:u                #  452.121 M/sec                   (83.33%)
      13395643229      branch-misses:u           #    9.09% of all branches         (83.33%)

      326.436794016 seconds time elapsed

      325.869528000 seconds user
        0.086873000 seconds sys

vs. PGO+LTO:

        347929.80 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
             8535      page-faults:u             #    0.025 K/sec
    1153803509197      cycles:u                                                     (83.33%)
      19911862620      stalled-cycles-frontend:u #    1.73% frontend cycles idle    (83.33%)
     476343319558      stalled-cycles-backend:u  #   41.28% backend cycles idle     (83.33%)
     894092414890      instructions:u            #    0.77  insn per cycle
                                                 #    0.53  stalled cycles per insn (83.33%)
     173999066006      branches:u                #  500.098 M/sec                   (83.33%)
      13698979291      branch-misses:u           #    7.87% of all branches         (83.34%)

      348.308607033 seconds time elapsed

      347.711752000 seconds user
        0.090975000 seconds sys
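For convenience, here are the PGO+LTO/LTO ratios of the main counters from the two perf stat runs above, computed with a small awk sketch (the constants are simply the numbers quoted above):

  awk 'BEGIN {
    printf "cycles        %.3f\n", 1153803509197 / 1080945983089;  # ~6.7% more cycles
    printf "instructions  %.3f\n",  894092414890 /  847570680279;  # ~5.5% more instructions
    printf "branches      %.3f\n",  173999066006 /  147428907202;  # ~18% more branches executed
    printf "branch-misses %.3f\n",   13698979291 /   13395643229;  # but only ~2.3% more misses
  }'

So the PGO+LTO binary executes roughly 18% more branches while the absolute number of branch misses stays almost flat, which is consistent with the lower 7.87% miss rate reported by perf stat.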
This was apparently fixed about a year ago. Thanks to the unknown hero who did that.