Bug 84481 - [12/13/14/15 Regression] 429.mcf with -O2 regresses by ~6% and ~4%, depending on tuning, on Zen compared to GCC 7.2
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P2 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
Reported: 2018-02-20 13:52 UTC by Martin Jambor
Modified: 2024-07-19 13:01 UTC

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-04-11 00:00:00


Description Martin Jambor 2018-02-20 13:52:26 UTC
Our friends at AMD reported that, compared to gcc 7.2, 429.mcf is
about 6% slower when compiled with just -O2 with trunk and run on a
Zen CPU, which I can confirm.  I have managed to bisect this to Honza's
r255395.

The benchmark also regresses compared to gcc 7.2 with -O2
-march=native -mtune=native, by over 4% but I was not able to pin this
down to a single commit.
Comment 1 Jakub Jelinek 2018-05-02 10:09:34 UTC
GCC 8.1 has been released.
Comment 2 Martin Jambor 2018-06-22 13:18:54 UTC
Regarding the generic tuning issue, the difference comes down to the
order of the three instructions at offset 46 in the hottest loop below
(left is fast, right is slow, both along with their perf samples):

    38082423 |30:   sub    %rsi,%rcx             37881074 |30:   sub    %rsi,%rcx
    33361536 |      add    $0x1,%rax             29965960 |      add    $0x1,%rax
    14727831 |      mov    %rcx,(%rdx)           11839813 |      mov    %rcx,(%rdx)
   306224188 |      mov    0x10(%rdx),%rcx      280934119 |      mov    0x10(%rdx),%rcx
     7929159 |      test   %rcx,%rcx              3987929 |      test   %rcx,%rcx
    11735894 |      je     69                     5855925 |      je     69
             |43:   mov    %rcx,%rdx                      |43:   mov    %rcx,%rdx
   239584355 |46:   cmpl   $0x1,0x8(%rdx)       225344308 |46:   mov    0x18(%rdx),%rcx
 10777052578 |      mov    0x18(%rdx),%rcx    21488318830 |      mov    0x30(%rdx),%rsi
  4358414249 |      mov    0x30(%rdx),%rsi     6773073327 |      cmpl   $0x1,0x8(%rdx)
  4227512903 |      mov    (%rcx),%rcx         1386678856 |      mov    (%rcx),%rcx
  6128900849 |      mov    (%rsi),%rsi         6005737871 |      mov    (%rsi),%rsi
 220097857758|      jne    30                 263974962392|      jne    30
    74107789 |      add    %rsi,%rcx             47610508 |      add    %rsi,%rcx
    29107594 |      mov    %rcx,(%rdx)           31975201 |      mov    %rcx,(%rdx)
    28866535 |      mov    0x10(%rdx),%rcx       31974627 |      mov    0x10(%rdx),%rcx
     2996253 |      test   %rcx,%rcx              6035544 |      test   %rcx,%rcx
    37486332 |      jne    43                    24769958 |      jne    43
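
Read back into C, the loop in the listing above is roughly the following. This is only an illustrative sketch, not the 429.mcf source: the struct names, field names and the `walk` function are hypothetical, and the layout is chosen solely so that the field offsets match the displacements in the disassembly on an LP64 x86_64 target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: names and layout are assumptions made to match
   the displacements seen in the disassembly, nothing more. */
struct arc {
    int64_t cost;              /* 0x00: loaded by mov (%rsi),%rsi       */
};

struct node {
    int64_t      potential;    /* 0x00: stored by mov %rcx,(%rdx)       */
    int32_t      orientation;  /* 0x08: tested by cmpl $0x1,0x8(%rdx)   */
    struct node *child;        /* 0x10: next node of the walk           */
    struct node *pred;         /* 0x18: pred->potential via (%rcx)      */
    void        *unused[2];    /* 0x20, 0x28: not touched by the loop   */
    struct arc  *basic_arc;    /* 0x30: basic_arc->cost via (%rsi)      */
};

/* These hold on LP64 targets only. */
static_assert(offsetof(struct node, orientation) == 0x08, "orientation");
static_assert(offsetof(struct node, child)       == 0x10, "child");
static_assert(offsetof(struct node, pred)        == 0x18, "pred");
static_assert(offsetof(struct node, basic_arc)   == 0x30, "basic_arc");

enum { UP = 1 };

/* Walk the child chain and recompute every potential.  Returns how many
   nodes took the subtracting path (the block at label 30). */
long walk(struct node *n)
{
    long down = 0;
    for (; n != NULL; n = n->child) {
        /* The three loads from *n below (orientation, pred, basic_arc)
           are independent of each other, so the compiler may emit them
           in any order; that order is what differs between the columns. */
        if (n->orientation == UP) {
            n->potential = n->basic_arc->cost + n->pred->potential;
        } else {
            n->potential = n->pred->potential - n->basic_arc->cost;
            down++;
        }
    }
    return down;
}
```

Both columns of the listing compute the same thing as this loop; they differ only in whether the load at 0x8 is issued before or after the loads at 0x18 and 0x30.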
Comment 3 Martin Liška 2018-06-29 11:37:42 UTC
Interesting. Do I understand that correctly that it's due to increasing addresses of the 3 load instructions: 0x8(%rdx), 0x18(%rdx), 0x30(%rdx) vs. 0x18(%rdx) 0x30(%rdx) 0x8(%rdx) ?
Comment 4 Jan Hubicka 2018-06-29 11:41:05 UTC
Hi,
one more thing is that it would probably be possible to do conservative merging for VPT: test
only whether the same value wins in all runs that have a nonzero count. That should
be stable with respect to ordering.

Honza
Comment 5 Jakub Jelinek 2018-07-26 11:22:51 UTC
GCC 8.2 has been released.
Comment 6 Richard Biener 2018-12-20 11:11:20 UTC
What's the state on trunk?
Comment 7 Martin Jambor 2018-12-20 11:26:37 UTC
(In reply to Richard Biener from comment #6)
> What's the state on trunk?

I will have my own measurements only in January but according to
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch there
is still a 4% regression at -O2 and even over 8% at -Ofast on Zen.
Comment 8 Martin Jambor 2019-01-18 16:34:38 UTC
And even my own measurements show a 6% slowdown at both -O2 and -Ofast with generic march/tuning against GCC 7 and now also a 5% slowdown at -Ofast and native march/tuning against GCC 8.
Comment 9 Jakub Jelinek 2019-02-22 15:23:57 UTC
GCC 8.3 has been released.
Comment 10 Richard Biener 2019-04-11 08:44:33 UTC
(In reply to Martin Liška from comment #3)
> Interesting. Do I understand that correctly that it's due to increasing
> addresses of the 3 load instructions: 0x8(%rdx), 0x18(%rdx), 0x30(%rdx) vs.
> 0x18(%rdx) 0x30(%rdx) 0x8(%rdx) ?

I would guess that the hardware prefetcher might be sensitive to this.  But
note that depending on the frontend any two of the loads might issue in
parallel.

It seems this is some kind of list-walking so HW prefetching possibly
doesn't (and should not) trigger.

Anyways, it's probably a cache subsystem "issue".  Ordering memory
references might be an interesting post-reload scheduling heuristic
we could employ here.
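
As a toy illustration of such a heuristic (not GCC code: `struct mem_ref`, `order_mem_refs` and the register numbering are hypothetical), one could sort a group of mutually independent memory references by base register and then by ascending displacement. The sketch assumes the caller has already established that any permutation of the group is legal; a real scheduler would have to respect dependencies and latencies.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical description of one memory reference in a ready list. */
struct mem_ref {
    const char *insn;      /* textual form, for illustration only */
    int         base_reg;  /* made-up register number              */
    long        disp;      /* displacement off the base register   */
};

/* Order by base register first, then by ascending displacement. */
static int by_base_then_disp(const void *pa, const void *pb)
{
    const struct mem_ref *a = pa;
    const struct mem_ref *b = pb;
    if (a->base_reg != b->base_reg)
        return (a->base_reg > b->base_reg) - (a->base_reg < b->base_reg);
    return (a->disp > b->disp) - (a->disp < b->disp);
}

/* Reorder a group of independent references in place.  The caller must
   guarantee that the references do not depend on one another. */
void order_mem_refs(struct mem_ref *refs, size_t n)
{
    assert(refs != NULL || n == 0);
    if (n > 1)
        qsort(refs, n, sizeof *refs, by_base_then_disp);
}
```

Applied to the three references from the slow column of comment #2 (0x18, 0x30, 0x8 off %rdx), this yields 0x8, 0x18, 0x30, which is the order seen in the fast column. Whether that order is actually the cause of the difference is the open question in this thread.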
Comment 11 Jakub Jelinek 2020-03-04 09:51:43 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 12 Martin Jambor 2020-07-30 19:43:43 UTC
I can once again confirm the slowdown on a zen1-based machine (commit 6e1e0decc9e vs gcc 7.5) but it is not present on a zen2-based one.  I wonder whether the bug should be marked as WONTFIX.
Comment 13 Jakub Jelinek 2021-05-14 09:49:59 UTC
GCC 8 branch is being closed.
Comment 14 Richard Biener 2021-06-01 08:10:25 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 15 Richard Biener 2022-05-27 09:38:26 UTC
GCC 9 branch is being closed
Comment 16 Jakub Jelinek 2022-06-28 10:34:35 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 17 Martin Jambor 2023-01-31 10:56:39 UTC
Looking at LNT (and excluding machines which are no longer active), the worst regression is now 4% and that only at -O2 -Ofast.  Probably not a very high priority then (do we want to close this?).
Comment 18 Richard Biener 2023-07-07 10:33:19 UTC
GCC 10 branch is being closed.
Comment 19 Richard Biener 2024-07-19 13:01:47 UTC
GCC 11 branch is being closed.