As of revision 270053, the 548.exchange2_r benchmark from the SPEC 2017 INTrate suite suffered a number of smaller regressions on AMD Zen CPUs:

- At -O2, it is 4.5% slower than when compiled with GCC 7
- At -Ofast, it is 4.7% slower than when compiled with GCC 8
- At -Ofast -march=native -mtune=native, this difference is 6.9%
- At -Ofast and native tuning, it is 6% slower with PGO than without it.

According to https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options the last regression on a different Ryzen CPU is 6.8% and PGO+LTO is 8.2% slower than just native -Ofast.

Bisecting does not help much because the performance of the benchmark has varied a lot. For example, in September there was no PGO regression, but only because the non-PGO executable was equally slow.

I only have data from February from an Intel machine. There I only saw the native -Ofast regression, and it might have gone away in the meantime.
Created attachment 46169
perf annotate - Ofast native vs. Ofast native PGO

I'm attaching HTML and txt perf annotate output for the -Ofast native and -Ofast native PGO builds. As seen, it's still the same story: there is high register pressure, which leads to spilling of some of the induction variables. For these builds, the most significant difference is:

GOOD:
         :   if(block(row, 4, i4) <= 0) cycle
    0.00 :   41c660: mov (%r9),%r12d
    1.99 :   41c663: mov %r11d,0x80(%rsp)
    0.11 :   41c66b: mov %r11d,%edx
    0.02 :   41c66e: test %r12d,%r12d
    0.15 :   41c671: jg 41c7b0 <__brute_force_MOD_digits_2+0xe00>
    0.01 :   41c677: inc %r11
    0.64 :   41c67a: add $0x144,%r9
    0.13 :   41c681: add $0x144,%r8
    0.05 :   41c688: add $0x144,%r10
         :   do i4 = l(4), u(4)
    0.15 :   41c68f: cmp %r11d,0x6c(%rsp)
    2.39 :   41c694: jge 41c660 <__brute_force_MOD_digits_2+0xcb0>
    0.00 :   41c696: mov 0x168(%rsp),%r10
    0.55 :   41c69e: mov 0x170(%rsp),%r9
    0.08 :   41c6a6: mov 0x178(%rsp),%r11
    0.05 :   41c6ae: mov 0x180(%rsp),%r8
         :   block(row, 4:9, i3) = block(row, 4:9, i3) + 10

BAD:
         :   if(block(row, 4, i4) <= 0) cycle
    0.05 :   41a8b0: mov (%r11),%edi
    0.78 :   41a8b3: mov %r10d,0x84(%rsp)
    0.04 :   41a8bb: mov %r10d,%r13d
    0.01 :   41a8be: test %edi,%edi
    0.26 :   41a8c0: jg 41aa10 <__brute_force_MOD_digits_2+0x1210>
    0.44 :   41a8c6: addq $0x144,0x48(%rsp)
    4.04 :   41a8cf: addq $0x144,0x58(%rsp)
    1.31 :   41a8d8: inc %r10
    0.02 :   41a8db: add $0x144,%r11
         :   do i4 = l(4), u(4)
    0.01 :   41a8e2: cmp %r10d,0x88(%rsp)
    0.25 :   41a8ea: jge 41a8b0 <__brute_force_MOD_digits_2+0x10b0>
         :   block(row, 4:9, i3) = block(row, 4:9, i3) + 10
    0.03 :   41a8ec: mov 0xd0(%rsp),%r15
    0.27 :   41a8f4: addl $0xa,-0xdc(%r15)
    0.20 :   41a8fc: addl $0xa,-0xb8(%r15)
    0.01 :   41a904: addl $0xa,-0x94(%r15)
    0.07 :   41a90c: addl $0xa,-0x70(%r15)
    0.05 :   41a911: addl $0xa,-0x4c(%r15)
    0.06 :   41a916: addl $0xa,-0x28(%r15)

The benchmark is quite unpredictable, I'm leaving that for now.
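To make the spilling in the two excerpts above easier to see, here is a minimal sketch that counts how many instructions in each inner loop touch a stack slot, and which of them are in-place updates of a spilled value. The loop bodies are copied from the GOOD and BAD listings (instruction text only). The helper names `count_stack_refs` and `stack_updates` are hypothetical and are not part of perf or GCC.

```python
import re

# Inner-loop instructions copied from the perf annotate excerpts
# (sample percentages, addresses and symbol names dropped).
GOOD_LOOP = """\
mov (%r9),%r12d
mov %r11d,0x80(%rsp)
mov %r11d,%edx
test %r12d,%r12d
jg 41c7b0
inc %r11
add $0x144,%r9
add $0x144,%r8
add $0x144,%r10
cmp %r11d,0x6c(%rsp)
jge 41c660
"""

BAD_LOOP = """\
mov (%r11),%edi
mov %r10d,0x84(%rsp)
mov %r10d,%r13d
test %edi,%edi
jg 41aa10
addq $0x144,0x48(%rsp)
addq $0x144,0x58(%rsp)
inc %r10
add $0x144,%r11
cmp %r10d,0x88(%rsp)
jge 41a8b0
"""

# An operand addressed relative to the stack pointer, e.g. 0x80(%rsp).
STACK_OPERAND = re.compile(r"(?:-?0x[0-9a-f]+)?\(%rsp\)")

# An add whose destination is a stack slot, e.g. addq $0x144,0x48(%rsp).
# This is a read-modify-write on memory, i.e. a spilled variable being updated.
STACK_ADD = re.compile(r"add[lq]?\s+\$\w+,(?:-?0x[0-9a-f]+)?\(%rsp\)")


def count_stack_refs(listing):
    """Count instructions with a stack-slot operand (hypothetical helper)."""
    return sum(1 for line in listing.splitlines() if STACK_OPERAND.search(line))


def stack_updates(listing):
    """Return the instructions that add directly into a stack slot."""
    return [line for line in listing.splitlines() if STACK_ADD.match(line)]


if __name__ == "__main__":
    for name, loop in (("GOOD", GOOD_LOOP), ("BAD", BAD_LOOP)):
        print(name, "stack refs:", count_stack_refs(loop),
              "in-memory adds:", stack_updates(loop))
```

On these excerpts the GOOD loop has two stack references and no in-memory adds, while the BAD loop has four stack references, two of which are the `addq $0x144,...(%rsp)` updates. That matches the observation that the PGO build keeps two of the three strided pointers in memory instead of in registers. This only counts instructions; it says nothing about their actual cost.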
(In reply to Martin Jambor from comment #0)
> As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
> INTrate suite suffered a number of smaller regressions on AMD Zen
> CPUs:
>
> - At -O2, it is 4.5% slower than when compiled with GCC 7

I am about to file a specific bug about exchange at -O2.

> - At -Ofast, it is 4.7% slower than when compiled with GCC 8

This is no longer true.

> - At -Ofast -march=native -mtune=native, this difference is 6.9%

Again, I will file a more specific bug about -Ofast -march=native in a little while.

> - At -Ofast and native tuning, it is 6% slower with PGO than
>   without it.

I can still see this in my measurements on a Zen1-based CPU but not in those done on AMD Zen2 or Intel Cascade Lake. So I am not sure if we care. I'll be happy to file a specific bug if we do.
So this bug has been replaced with more specific bugs for newer hardware: PR94373 and PR94375.