Bug 90056 - 548.exchange2_r regressions on AMD Zen
Status: RESOLVED MOVED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: spec
Reported: 2019-04-12 11:49 UTC by Martin Jambor
Modified: 2020-03-27 23:33 UTC

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
perf annotate - Ofast native vs. Ofast native PGO (52.81 KB, application/x-bzip)
2019-04-15 11:51 UTC, Martin Liška

Description Martin Jambor 2019-04-12 11:49:04 UTC
As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
INTrate suite suffered a number of smaller regressions on AMD Zen
CPUs:

  - At -O2, it is 4.5% slower than when compiled with GCC 7
  - At -Ofast, it is 4.7% slower than when compiled with GCC 8
  - At -Ofast -march=native -mtune=native, this difference is 6.9%
  - At -Ofast and native tuning, it is 6% slower with PGO than
    without it.
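
The percentages above are relative slowdowns. A minimal sketch of the arithmetic, assuming they were derived from run times as new/old - 1 (the report does not state the exact formula); the function name and the timings below are hypothetical and chosen only for illustration:

```python
def slowdown_percent(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how much slower the candidate is than the baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline run time must be positive")
    return (candidate_seconds / baseline_seconds - 1.0) * 100.0


# Hypothetical run times, not measurements from this bug.
baseline = 200.0   # e.g. the build from the older compiler
candidate = 209.0  # e.g. the build from the newer compiler
print(f"{slowdown_percent(baseline, candidate):.1f}% slower")
```

With this convention a negative result would mean the candidate build is faster than the baseline.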

According to
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options, on a
different Ryzen CPU the last regression is 6.8%, and PGO+LTO is 8.2%
slower than plain native -Ofast.

Bisecting does not help much because the performance of the benchmark
has varied a lot.  For example in September there was no PGO
regression but only because the non-PGO executable was equally slow.

I only have data from February from an Intel machine.  There I saw
only the native -Ofast regression, and it might have gone away in the
meantime.
Comment 1 Martin Liška 2019-04-15 11:51:27 UTC
Created attachment 46169
perf annotate - Ofast native vs. Ofast native PGO

I'm attaching HTML and text perf annotate output for the -Ofast native and -Ofast native PGO builds. As can be seen, it's still the same story: high register pressure leads to spilling of some of the induction variables.

For these builds, the most significant difference is:

GOOD:

         :	                if(block(row, 4, i4) <= 0) cycle
    0.00 :	  41c660:       mov    (%r9),%r12d
    1.99 :	  41c663:       mov    %r11d,0x80(%rsp)
    0.11 :	  41c66b:       mov    %r11d,%edx
    0.02 :	  41c66e:       test   %r12d,%r12d
    0.15 :	  41c671:       jg     41c7b0 <__brute_force_MOD_digits_2+0xe00>
    0.01 :	  41c677:       inc    %r11
    0.64 :	  41c67a:       add    $0x144,%r9
    0.13 :	  41c681:       add    $0x144,%r8
    0.05 :	  41c688:       add    $0x144,%r10
         :	           do i4 = l(4), u(4)
    0.15 :	  41c68f:       cmp    %r11d,0x6c(%rsp)
    2.39 :	  41c694:       jge    41c660 <__brute_force_MOD_digits_2+0xcb0>
    0.00 :	  41c696:       mov    0x168(%rsp),%r10
    0.55 :	  41c69e:       mov    0x170(%rsp),%r9
    0.08 :	  41c6a6:       mov    0x178(%rsp),%r11
    0.05 :	  41c6ae:       mov    0x180(%rsp),%r8
         :	           block(row, 4:9, i3) = block(row, 4:9, i3) + 10

BAD:

         :	                if(block(row, 4, i4) <= 0) cycle
    0.05 :	  41a8b0:       mov    (%r11),%edi
    0.78 :	  41a8b3:       mov    %r10d,0x84(%rsp)
    0.04 :	  41a8bb:       mov    %r10d,%r13d
    0.01 :	  41a8be:       test   %edi,%edi
    0.26 :	  41a8c0:       jg     41aa10 <__brute_force_MOD_digits_2+0x1210>
    0.44 :	  41a8c6:       addq   $0x144,0x48(%rsp)
    4.04 :	  41a8cf:       addq   $0x144,0x58(%rsp)
    1.31 :	  41a8d8:       inc    %r10
    0.02 :	  41a8db:       add    $0x144,%r11
         :	           do i4 = l(4), u(4)
    0.01 :	  41a8e2:       cmp    %r10d,0x88(%rsp)
    0.25 :	  41a8ea:       jge    41a8b0 <__brute_force_MOD_digits_2+0x10b0>
         :	           block(row, 4:9, i3) = block(row, 4:9, i3) + 10
    0.03 :	  41a8ec:       mov    0xd0(%rsp),%r15
    0.27 :	  41a8f4:       addl   $0xa,-0xdc(%r15)
    0.20 :	  41a8fc:       addl   $0xa,-0xb8(%r15)
    0.01 :	  41a904:       addl   $0xa,-0x94(%r15)
    0.07 :	  41a90c:       addl   $0xa,-0x70(%r15)
    0.05 :	  41a911:       addl   $0xa,-0x4c(%r15)
    0.06 :	  41a916:       addl   $0xa,-0x28(%r15)

The benchmark is quite unpredictable, so I'm leaving this for now.
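
The constant 0x144 in both listings is the step of the address induction variables in the i4 loop. As a rough sketch, assuming `block` is a 9x9x9 array of 4-byte integers stored in column-major (Fortran) order, which is inferred from the stride and not stated in this report, advancing the last index by one moves the address by 9 * 9 * 4 = 324 = 0x144 bytes. The helper name below is hypothetical:

```python
def column_major_byte_strides(shape, element_size):
    """Byte stride of each index for a Fortran-style (column-major) array."""
    strides = []
    step = element_size
    for extent in shape:
        strides.append(step)  # stride of the current index
        step *= extent        # the next index steps over this whole dimension
    return strides


# Assumed layout: 9x9x9 array of 4-byte integers (an assumption, see above).
strides = column_major_byte_strides((9, 9, 9), 4)
print([hex(s) for s in strides])  # the last entry is the stride of the third index
```

In the GOOD listing all three `add $0x144` instructions target registers (%r9, %r8, %r10), whereas in the BAD listing two of them are `addq $0x144` on the stack slots 0x48(%rsp) and 0x58(%rsp), which is where the spilled induction variables show up.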
Comment 2 Martin Jambor 2020-03-27 20:43:22 UTC
(In reply to Martin Jambor from comment #0)
> As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
> INTrate suite suffered a number of smaller regressions on AMD Zen
> CPUs:
> 
>   - At -O2, it is 4.5% slower than when compiled with GCC 7

I am about to file a specific bug about exchange at -O2.

>   - At -Ofast, it is 4.7% slower than when compiled with GCC 8

This is no longer true.

>   - At -Ofast -march=native -mtune=native, this difference is 6.9%

Again, I will file a more specific bug about -Ofast -march=native in a
little while.

>   - At -Ofast and native tuning, it is 6% slower with PGO than
>     without it.

I can still see this in my measurements on a Zen1-based CPU but not in
those done on AMD Zen2 or Intel Cascade Lake.  So I am not sure if we
care.  I'll be happy to file a specific bug if we do.
Comment 3 Martin Jambor 2020-03-27 23:33:55 UTC
This bug has been replaced with more specific bugs for newer hardware: PR94373 and PR94375.