[Bug gcov-profile/94369] New: 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO

Fri Mar 27 19:39:30 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369

            Bug ID: 94369
           Summary: 505.mcf_r is 6-7% slower at -Ofast -march=native with
                    PGO+LTO than with just LTO
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: gcov-profile
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: marxin at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, compiled with -Ofast
-march=native -mtune=native, runs 6-7% slower when built with both PGO
and LTO than when built with LTO alone.  I have observed this on both
AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs.  The train run
is unlikely to be at fault, because without LTO, PGO improves run time
by 15% on both systems.  This is with master revision 26b3e568a60.

Profiling results (from an AMD CPU):

LTO:

  Overhead    Samples  Shared Object    Symbol                                 
  ........  .........  ...............  ........................

    39.53%     518450  mcf_r_peak.mine  spec_qsort.constprop.0
    22.13%     289745  mcf_r_peak.mine  master.constprop.0
    19.00%     248641  mcf_r_peak.mine  replace_weaker_arc
     9.37%     122669  mcf_r_peak.mine  main
     8.60%     112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

  Overhead    Samples  Shared Object    Symbol                                 
  ........  .........  ...............  .......................................

    40.13%     562770  mcf_r_peak.mine  spec_qsort.constprop.0
    21.68%     303543  mcf_r_peak.mine  master.constprop.0
    18.24%     255236  mcf_r_peak.mine  replace_weaker_arc
    10.32%     144433  mcf_r_peak.mine  main
     8.07%     112775  mcf_r_peak.mine  arc_compare
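
To make the two profiles easier to compare, here is a small sketch that
computes the per-symbol change in sample counts and a rough estimate of
the change in total samples.  It is not part of the original
measurements.  It assumes both runs were sampled with the same event
and frequency, so that sample counts are proportional to run time, and
that symbols with the same name in both builds correspond to the same
code.  The helper names (estimated_total, sample_growth) are made up
for this illustration.

```python
# (overhead %, samples) per symbol, copied from the two profiles above.
lto = {
    "spec_qsort.constprop.0": (39.53, 518450),
    "master.constprop.0": (22.13, 289745),
    "replace_weaker_arc": (19.00, 248641),
    "main": (9.37, 122669),
    "spec_qsort.constprop.1": (8.60, 112601),
}
pgo_lto = {
    "spec_qsort.constprop.0": (40.13, 562770),
    "master.constprop.0": (21.68, 303543),
    "replace_weaker_arc": (18.24, 255236),
    "main": (10.32, 144433),
    "arc_compare": (8.07, 112775),
}


def estimated_total(profile):
    """Estimate total samples from the hottest row: samples / (overhead / 100).

    Assumption: the overhead column is that row's share of all samples.
    """
    overhead, samples = max(profile.values(), key=lambda row: row[1])
    return samples / (overhead / 100.0)


def sample_growth(before, after):
    """Relative change in samples for symbols present in both profiles."""
    return {
        name: after[name][1] / before[name][1] - 1.0
        for name in before
        if name in after
    }


growth = sample_growth(lto, pgo_lto)
total_ratio = estimated_total(pgo_lto) / estimated_total(lto)

if __name__ == "__main__":
    for name, change in growth.items():
        print(f"{name:24s} {change:+.1%}")
    print(f"{'estimated total samples':24s} {total_ratio - 1.0:+.1%}")
```

Under those assumptions, the estimated total grows by roughly 7%, which
matches the slowdown reported for the AMD system.  Every symbol that
appears in both profiles collects more samples in the PGO+LTO build,
with main showing the largest relative increase.  The fifth row cannot
be compared directly, because it is spec_qsort.constprop.1 in one
profile and arc_compare in the other.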

Note that we have patched qsort in the benchmark so that it works with
strict aliasing even with LTO.  The performance gap is also present
with -fno-strict-aliasing.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)