SPEC 2017 INTrate benchmark 505.mcf_r, compiled with -Ofast -march=native -mtune=native, is 6-7% slower when built with both PGO and LTO than when built with LTO alone. I have observed this on both AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs. The train run cannot be very bad, because without LTO, PGO improves run time by 15% on both systems. This is with master revision 26b3e568a60.

Profiling results (from the AMD CPU):

LTO:

Overhead       Samples  Shared Object    Symbol
........  ............  ...............  ......................
  39.53%        518450  mcf_r_peak.mine  spec_qsort.constprop.0
  22.13%        289745  mcf_r_peak.mine  master.constprop.0
  19.00%        248641  mcf_r_peak.mine  replace_weaker_arc
   9.37%        122669  mcf_r_peak.mine  main
   8.60%        112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

Overhead       Samples  Shared Object    Symbol
........  ............  ...............  ......................
  40.13%        562770  mcf_r_peak.mine  spec_qsort.constprop.0
  21.68%        303543  mcf_r_peak.mine  master.constprop.0
  18.24%        255236  mcf_r_peak.mine  replace_weaker_arc
  10.32%        144433  mcf_r_peak.mine  main
   8.07%        112775  mcf_r_peak.mine  arc_compare

I should perhaps note that we have patched qsort in the benchmark to work with strict aliasing even under LTO, but the performance gap is also there with -fno-strict-aliasing.
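For reference, outside the SPEC harness the two configurations correspond roughly to the GCC invocations sketched below. This is only an illustration of the flag combinations; the source globs, output names and the train-run invocation are placeholders, as the actual binaries are built and trained through the SPEC runcpu machinery.

  # LTO-only build
  gcc -Ofast -march=native -mtune=native -flto *.c -o mcf_r_lto

  # PGO+LTO: instrument, run the train workload, rebuild with the profile
  gcc -Ofast -march=native -mtune=native -flto -fprofile-generate *.c -o mcf_r_inst
  ./mcf_r_inst train.in        # placeholder for the SPEC train run; writes *.gcda data
  gcc -Ofast -march=native -mtune=native -flto -fprofile-use *.c -o mcf_r_pgo_lto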
Confirmed; this can be seen nicely in the LNT periodic benchmark results: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.347.0&plot.1=293.347.0&plot.2=287.347.0&plot.3=286.347.0
The profile looks inconclusive: the sample counts differ, but they increase fairly evenly across the symbols. The overall number of samples is missing - does that increase by the same 6-7%?
I did not save the reported total number of samples, but from the raw sample counts and the percentages it seems so:

  (562770 / 0.4013) / (518450 / 0.3953) = 1.069

Nevertheless, I did save separately obtained perf stat numbers, which show a similar increase (and the number of branches might be a clue):

LTO:

        326083.03 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
             8821      page-faults:u             #    0.027 K/sec
    1080945983089      cycles:u                                                     (83.33%)
      21883016095      stalled-cycles-frontend:u #    2.02% frontend cycles idle    (83.33%)
     435184347885      stalled-cycles-backend:u  #   40.26% backend cycles idle     (83.33%)
     847570680279      instructions:u            #    0.78  insn per cycle
                                                 #    0.51  stalled cycles per insn (83.34%)
     147428907202      branches:u                #  452.121 M/sec                   (83.33%)
      13395643229      branch-misses:u           #    9.09% of all branches         (83.33%)

      326.436794016 seconds time elapsed

      325.869528000 seconds user
        0.086873000 seconds sys

vs. PGO+LTO:

        347929.80 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
             8535      page-faults:u             #    0.025 K/sec
    1153803509197      cycles:u                                                     (83.33%)
      19911862620      stalled-cycles-frontend:u #    1.73% frontend cycles idle    (83.33%)
     476343319558      stalled-cycles-backend:u  #   41.28% backend cycles idle     (83.33%)
     894092414890      instructions:u            #    0.77  insn per cycle
                                                 #    0.53  stalled cycles per insn (83.33%)
     173999066006      branches:u                #  500.098 M/sec                   (83.33%)
      13698979291      branch-misses:u           #    7.87% of all branches         (83.34%)

      348.308607033 seconds time elapsed

      347.711752000 seconds user
        0.090975000 seconds sys
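For convenience, here are the PGO+LTO/LTO ratios of the main counters from the two perf stat runs above, computed with a small awk sketch (the constants are simply the numbers quoted above):

  awk 'BEGIN {
    printf "cycles        %.3f\n", 1153803509197 / 1080945983089;  # ~6.7% more cycles
    printf "instructions  %.3f\n",  894092414890 /  847570680279;  # ~5.5% more instructions
    printf "branches      %.3f\n",  173999066006 /  147428907202;  # ~18% more branches executed
    printf "branch-misses %.3f\n",   13698979291 /   13395643229;  # but only ~2.3% more misses
  }'

So the PGO+LTO binary executes roughly 18% more branches while the absolute number of branch misses stays almost flat, which is consistent with the lower 7.87% miss rate reported by perf stat.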
This was apparently fixed about a year ago. Thanks to the unknown hero who did that.