[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

Mon Apr 15 11:51:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 46169
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46169&action=edit
perf annotate - Ofast native vs. Ofast native PGO

I'm attaching perf annotate output (HTML and txt) for the Ofast native and
Ofast native PGO builds. As seen, it's still the same story: high register
pressure leads to spilling of some of the induction variables.

For these builds, the most significant difference is:

GOOD:

         :                      if(block(row, 4, i4) <= 0) cycle
    0.00 :        41c660:       mov    (%r9),%r12d
    1.99 :        41c663:       mov    %r11d,0x80(%rsp)
    0.11 :        41c66b:       mov    %r11d,%edx
    0.02 :        41c66e:       test   %r12d,%r12d
    0.15 :        41c671:       jg     41c7b0 <__brute_force_MOD_digits_2+0xe00>
    0.01 :        41c677:       inc    %r11
    0.64 :        41c67a:       add    $0x144,%r9
    0.13 :        41c681:       add    $0x144,%r8
    0.05 :        41c688:       add    $0x144,%r10
         :                 do i4 = l(4), u(4)
    0.15 :        41c68f:       cmp    %r11d,0x6c(%rsp)
    2.39 :        41c694:       jge    41c660 <__brute_force_MOD_digits_2+0xcb0>
    0.00 :        41c696:       mov    0x168(%rsp),%r10
    0.55 :        41c69e:       mov    0x170(%rsp),%r9
    0.08 :        41c6a6:       mov    0x178(%rsp),%r11
    0.05 :        41c6ae:       mov    0x180(%rsp),%r8
         :                 block(row, 4:9, i3) = block(row, 4:9, i3) + 10

BAD:

         :                      if(block(row, 4, i4) <= 0) cycle
    0.05 :        41a8b0:       mov    (%r11),%edi
    0.78 :        41a8b3:       mov    %r10d,0x84(%rsp)
    0.04 :        41a8bb:       mov    %r10d,%r13d
    0.01 :        41a8be:       test   %edi,%edi
    0.26 :        41a8c0:       jg     41aa10 <__brute_force_MOD_digits_2+0x1210>
    0.44 :        41a8c6:       addq   $0x144,0x48(%rsp)
    4.04 :        41a8cf:       addq   $0x144,0x58(%rsp)
    1.31 :        41a8d8:       inc    %r10
    0.02 :        41a8db:       add    $0x144,%r11
         :                 do i4 = l(4), u(4)
    0.01 :        41a8e2:       cmp    %r10d,0x88(%rsp)
    0.25 :        41a8ea:       jge    41a8b0 <__brute_force_MOD_digits_2+0x10b0>
         :                 block(row, 4:9, i3) = block(row, 4:9, i3) + 10
    0.03 :        41a8ec:       mov    0xd0(%rsp),%r15
    0.27 :        41a8f4:       addl   $0xa,-0xdc(%r15)
    0.20 :        41a8fc:       addl   $0xa,-0xb8(%r15)
    0.01 :        41a904:       addl   $0xa,-0x94(%r15)
    0.07 :        41a90c:       addl   $0xa,-0x70(%r15)
    0.05 :        41a911:       addl   $0xa,-0x4c(%r15)
    0.06 :        41a916:       addl   $0xa,-0x28(%r15)
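
For readers following the listings: the 0x144 added on every i4 iteration and
the 0x24 spacing between the addl displacements are consistent with pointers
walking a 9x9x9 array of 4-byte integers in Fortran column-major order. The
array shape is an assumption here (this comment does not state it), and the
function names below are hypothetical. A minimal C sketch of the address
arithmetic:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: block is declared as block(9, 9, 9) of 4-byte integers.
 * The shape is not stated in this comment; it is inferred from the
 * constants in the listings. */
enum { BLOCK_DIM = 9 };

/* Byte offset of block(row, col, plane) from the start of the array.
 * Indices are 1-based and the layout is column-major, as in Fortran:
 * row varies fastest, plane slowest. */
static ptrdiff_t block_offset(int row, int col, int plane)
{
    ptrdiff_t elems = (ptrdiff_t)(row - 1)
                    + (ptrdiff_t)BLOCK_DIM * (col - 1)
                    + (ptrdiff_t)BLOCK_DIM * BLOCK_DIM * (plane - 1);
    return elems * (ptrdiff_t)sizeof(int32_t);
}

/* Distance in bytes between block(r, c, p) and block(r, c, p + 1).
 * This is the step an induction pointer takes per i4 iteration. */
static ptrdiff_t plane_stride_bytes(void)
{
    return block_offset(1, 1, 2) - block_offset(1, 1, 1);
}

/* Distance in bytes between block(r, c, p) and block(r, c + 1, p).
 * This is the spacing between elements of block(row, 4:9, i3). */
static ptrdiff_t column_stride_bytes(void)
{
    return block_offset(1, 2, 1) - block_offset(1, 1, 1);
}

/* Checks the strides against the constants that appear in the listings. */
static void verify_strides(void)
{
    assert(plane_stride_bytes() == 0x144);  /* add $0x144,%r9 etc. */
    assert(column_stride_bytes() == 0x24);  /* -0xdc, -0xb8, ..., -0x28 */
}
```

Under that assumption, each `add $0x144` (or `addq $0x144,...(%rsp)` in the
BAD listing) advances one such pointer by a full 9x9 plane, and the six addl
instructions touch columns 4 through 9 of one row.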

The benchmark is quite unpredictable, so I'm leaving this for now.