Our friends at AMD reported that, compared to gcc 7.2, 429.mcf is about 6% slower when compiled with trunk at just -O2 and run on a Zen CPU, which I can confirm. I have managed to bisect this to Honza's r255395. The benchmark also regresses compared to gcc 7.2 with -O2 -march=native -mtune=native, by over 4%, but I was not able to pin this down to a single commit.
GCC 8.1 has been released.
Regarding the generic tuning issue, the difference comes down to the order of the three instructions at offset 46 in the hottest loop below (left is fast, right is slow, both along with their perf samples):

    38082423 |30: sub %rsi,%rcx             37881074 |30: sub %rsi,%rcx
    33361536 |    add $0x1,%rax             29965960 |    add $0x1,%rax
    14727831 |    mov %rcx,(%rdx)           11839813 |    mov %rcx,(%rdx)
   306224188 |    mov 0x10(%rdx),%rcx      280934119 |    mov 0x10(%rdx),%rcx
     7929159 |    test %rcx,%rcx             3987929 |    test %rcx,%rcx
    11735894 |    je 69                      5855925 |    je 69
             |43: mov %rcx,%rdx                      |43: mov %rcx,%rdx
   239584355 |46: cmpl $0x1,0x8(%rdx)      225344308 |46: mov 0x18(%rdx),%rcx
 10777052578 |    mov 0x18(%rdx),%rcx    21488318830 |    mov 0x30(%rdx),%rsi
  4358414249 |    mov 0x30(%rdx),%rsi     6773073327 |    cmpl $0x1,0x8(%rdx)
  4227512903 |    mov (%rcx),%rcx         1386678856 |    mov (%rcx),%rcx
  6128900849 |    mov (%rsi),%rsi         6005737871 |    mov (%rsi),%rsi
220097857758 |    jne 30                263974962392 |    jne 30
    74107789 |    add %rsi,%rcx             47610508 |    add %rsi,%rcx
    29107594 |    mov %rcx,(%rdx)           31975201 |    mov %rcx,(%rdx)
    28866535 |    mov 0x10(%rdx),%rcx       31974627 |    mov 0x10(%rdx),%rcx
     2996253 |    test %rcx,%rcx             6035544 |    test %rcx,%rcx
    37486332 |    jne 43                    24769958 |    jne 43
Interesting. Do I understand that correctly that it's due to increasing addresses of the 3 load instructions: 0x8(%rdx), 0x18(%rdx), 0x30(%rdx) vs. 0x18(%rdx) 0x30(%rdx) 0x8(%rdx) ?
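For readers who prefer source to disassembly, here is a rough C sketch of what the hot loop appears to do. It is reconstructed only from the displacements and arithmetic in the listing, so the struct layout, the field names (`src_a`, `src_b`, `pad`) and the function name `walk_chain` are all assumptions, not the benchmark's actual code. On an LP64 target the three fields read by the reordered instructions land at 0x8, 0x18 and 0x30.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout inferred from the disassembly; all names are guesses. */
struct node {
    int64_t potential;      /* 0x00: written by mov %rcx,(%rdx)            */
    int32_t orientation;    /* 0x08: tested by cmpl $0x1,0x8(%rdx)         */
    struct node *child;     /* 0x10: next node, loaded by mov 0x10(%rdx)   */
    const int64_t *src_a;   /* 0x18: pointer dereferenced into %rcx        */
    void *pad[2];           /* 0x20, 0x28: not touched by this loop        */
    const int64_t *src_b;   /* 0x30: pointer dereferenced into %rsi        */
};

/* Walks the chain through the 0x10 pointer. Returns how many nodes took
   the "orientation != 1" path, which is what add $0x1,%rax counts. */
static long walk_chain(struct node *n)
{
    long count = 0;
    while (n != NULL) {
        assert(n->src_a != NULL && n->src_b != NULL);
        int64_t a = *n->src_a;          /* mov 0x18(%rdx),%rcx; mov (%rcx),%rcx */
        int64_t b = *n->src_b;          /* mov 0x30(%rdx),%rsi; mov (%rsi),%rsi */
        if (n->orientation == 1) {
            n->potential = a + b;       /* add %rsi,%rcx */
        } else {
            n->potential = a - b;       /* sub %rsi,%rcx */
            count++;
        }
        n = n->child;
    }
    return count;
}
```

Every iteration touches a freshly loaded node at three different offsets and then dereferences two more pointers, so the loop is dominated by dependent loads rather than arithmetic.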
Hi, one more thing: it would probably be possible to do conservative merging for VPT, testing only whether the same value wins in all runs that have a nonzero count. That should be stable with respect to ordering. Honza
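The conservative merging idea can be illustrated with a small sketch. This is not GCC's profile-merging code; `struct run_profile` and `merge_conservative` are hypothetical names, and the sketch assumes each run has already been reduced to its single winning value and that value's count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-run summary: the value that won in that run and how
   often it was seen. A count of zero means the run has no data. */
struct run_profile {
    int64_t value;
    uint64_t count;
};

/* Returns 1 and stores the agreed value in *winner if every run with a
   nonzero count reports the same winning value. Returns 0 if the runs
   disagree or if no run has data. */
static int merge_conservative(const struct run_profile *runs, size_t n,
                              int64_t *winner)
{
    int have_value = 0;
    int64_t value = 0;

    assert(winner != NULL);
    for (size_t i = 0; i < n; i++) {
        if (runs[i].count == 0)
            continue;                   /* run without data: ignore it */
        if (!have_value) {
            value = runs[i].value;
            have_value = 1;
        } else if (runs[i].value != value) {
            return 0;                   /* disagreement: give up */
        }
    }
    if (!have_value)
        return 0;
    *winner = value;
    return 1;
}
```

Because the answer depends only on whether all non-empty runs agree, permuting the runs cannot change it, which is the ordering stability mentioned above.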
GCC 8.2 has been released.
What's the state on trunk?
(In reply to Richard Biener from comment #6)
> What's the state on trunk?

I will have my own measurements only in January, but according to https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch there is still a 4% regression at -O2 and even over 8% at -Ofast on Zen.
And even my own measurements show a 6% slowdown at both -O2 and -Ofast with generic march/tuning against GCC 7, and now also a 5% slowdown at -Ofast and native march/tuning against GCC 8.
GCC 8.3 has been released.
(In reply to Martin Liška from comment #3)
> Interesting. Do I understand that correctly that it's due to increasing
> addresses of the 3 load instructions: 0x8(%rdx), 0x18(%rdx), 0x30(%rdx) vs.
> 0x18(%rdx) 0x30(%rdx) 0x8(%rdx) ?

I would guess that the hardware prefetcher might be sensitive to this. But note that depending on the frontend any two of the loads might issue in parallel. It seems this is some kind of list-walking so HW prefetching possibly doesn't (and should not) trigger. Anyways, it's probably a cache subsystem "issue". Ordering memory references might be an interesting post-reload scheduling heuristic we could employ here.
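As a toy illustration of what ordering memory references could mean, the sketch below sorts a group of loads by base register and then by increasing displacement. It does not use any GCC scheduler interface; `struct load`, `cmp_load` and `order_loads` are hypothetical, and the sketch assumes the caller has already established that the loads are mutually independent and may be reordered freely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor for one memory read of the form disp(%base). */
struct load {
    int base_reg;   /* register number used as the base address      */
    int32_t disp;   /* displacement, e.g. 0x18 in 0x18(%rdx)          */
    int dest_reg;   /* destination register, used only to break ties  */
};

/* Three-way comparison that cannot overflow. */
static int cmp_long(long a, long b)
{
    return (a > b) - (a < b);
}

/* Orders by base register, then displacement, then destination. */
static int cmp_load(const void *pa, const void *pb)
{
    const struct load *a = pa;
    const struct load *b = pb;
    int c = cmp_long(a->base_reg, b->base_reg);
    if (c == 0)
        c = cmp_long(a->disp, b->disp);
    if (c == 0)
        c = cmp_long(a->dest_reg, b->dest_reg);
    return c;
}

/* Reorders a group of independent loads so that references through the
   same base register are issued in increasing address order. */
static void order_loads(struct load *loads, size_t n)
{
    assert(loads != NULL || n == 0);
    if (n > 1)
        qsort(loads, n, sizeof *loads, cmp_load);
}
```

Applied to the slow sequence from the listing (0x18, 0x30, 0x8 off the same base), this yields the 0x8, 0x18, 0x30 order of the fast variant. A real heuristic would also have to respect data dependencies and register pressure, which this sketch ignores.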
GCC 8.4.0 has been released, adjusting target milestone.
I can once again confirm the slowdown on a zen1-based machine (commit 6e1e0decc9e vs gcc 7.5) but it is not present on a zen2-based one. I wonder whether the bug should be marked as WONTFIX.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Looking at LNT (and excluding machines which are no longer active), the worst regression is now 4% and that only at -O2 -Ofast. Probably not a very high priority then (do we want to close this?).
GCC 10 branch is being closed.
GCC 11 branch is being closed.