Summary: | [12/13/14/15 Regression] 10-12% performance decrease in benchmark going from GCC8 to GCC9/GCC10 | ||
---|---|---|---|
Product: | gcc | Reporter: | Matt Bentley <mattreecebentley> |
Component: | ipa | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | crazylht, hjl.tools, hongyuw, hubicka, marxin |
Priority: | P2 | Keywords: | missed-optimization |
Version: | 9.5.0 | ||
Target Milestone: | 12.5 | ||
Host: | | Target: | |
Build: | | Known to work: | 7.5.0 |
Known to fail: | | Last reconfirmed: | 2020-08-24 00:00:00 |
Attachments: |
Compiler output
Demonstration of code which doesn't trigger the performance anomaly.
unincluded testcase |
Confirmed, started with r9-5763-g61a8637c8893a252:

after: 1794240.0
before: 1802710.0

Anyway, using PGO one can get to: 1488310.0

(In reply to Martin Liška from comment #1)
> after:
> 1794240.0
>
> before:
> 1802710.0

That's less than 1% of difference (with "after" better than "before"), not the 10% regression claimed, maybe there is another relevant commit?

(In reply to Marc Glisse from comment #2)
> (In reply to Martin Liška from comment #1)
> > after:
> > 1794240.0
> >
> > before:
> > 1802710.0
>
> That's less than 1% of difference (with "after" better than "before"), not
> the 10% regression claimed, maybe there is another relevant commit?

Sorry, I copied bad numbers:

after: 1806140.0
before: 1705630.0

which is ~6% regression.

(In reply to Marc Glisse from comment #2)
> (In reply to Martin Liška from comment #1)
> > after:
> > 1794240.0
> >
> > before:
> > 1802710.0
>
> That's less than 1% of difference (with "after" better than "before"), not
> the 10% regression claimed, maybe there is another relevant commit?

See the .ods spreadsheet in the zip for my results with same code.

If anyone out there is interested in working on this, I found the smallest change possible to create the same performance as GCC8- it is literally eliminating one branch possibility in one function (move-insert).

The branch in question questions whether there are existing memory blocks to re-use or if we need to create a new memory block. For example, if the reserve() function has been called there will be existing memory blocks to re-use. However the performance drop occurs whether or not there are memory blocks to reuse. The actual if statement is irrelevant. I have tested and can remove all instances of memory block storage (reserve(), erase()) and problem still exists if this one branch is still in insert.
I've attached the source files to demonstrate this above, including one plf_colony.h with the branch removed (renamed to plf_colony_fast.h), so you can see what difference there is. This code is all zlib license, free to share, but is early beta so don't redistribute please.

Thanks,
Matt

ps. For consistency I've also removed the non-move-insert and emplace instances of this branch statement, even though they won't be called by the benchmark code in a C++11-compliant compiler.

Created attachment 49278 [details]
Demonstration of code which doesn't trigger the performance anomaly.
plf_colony_fast.h does not trigger the problem, has one branch eliminated in each insert/emplace function.
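
To make the shape of that branch concrete, here is a minimal sketch of a block-based container whose insert either reuses a retained empty block or allocates a new one. It is not plf::colony's code: `toy_block_store`, `reserve_blocks`, `unused_block_count` and every other name here are hypothetical, chosen only to illustrate the kind of "reuse or allocate" decision described in the comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy container; NOT the plf::colony implementation.
template <class T, std::size_t BlockSize = 64>
class toy_block_store
{
    static_assert(BlockSize > 0, "a block must hold at least one element");

    struct block
    {
        std::vector<T> items;
        block() { items.reserve(BlockSize); }
    };

    std::vector<std::unique_ptr<block>> active_; // blocks that hold elements
    std::vector<std::unique_ptr<block>> unused_; // retained empty blocks
    std::size_t size_ = 0;

public:
    // Pre-allocate n empty blocks so that later inserts can reuse them.
    void reserve_blocks(std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i)
            unused_.push_back(std::make_unique<block>());
    }

    // Move-insert: when the last active block is full, pick the next block.
    void insert(T&& value)
    {
        if (active_.empty() || active_.back()->items.size() == BlockSize)
        {
            if (!unused_.empty())
            {
                // Reuse path: take a retained block instead of allocating.
                active_.push_back(std::move(unused_.back()));
                unused_.pop_back();
            }
            else
            {
                // Allocation path: no retained block is available.
                active_.push_back(std::make_unique<block>());
            }
        }
        assert(!active_.empty());
        active_.back()->items.push_back(std::move(value));
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t block_count() const { return active_.size(); }
    std::size_t unused_block_count() const { return unused_.size(); }
};
```

In this sketch, the "fast" variant would correspond to dropping the `if (!unused_.empty())` test and always taking the allocation path. Whether that matches what plf_colony_fast.h actually does is an assumption; the attached headers are the authoritative reference.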
Needs re-evaluation with GCC 11 / 12 to see if it's worth continuing to track this bug.

GCC 9 branch is being closed.

GCC 10.4 is being released, retargeting bugs to GCC 10.5.

Created attachment 53728 [details]
unincluded testcase
Runtimes with GCC 10 and GCC 12 are the same for me, but the benchmark completes very quickly.
The attached testcase is unincluded and compiles with GCC 7 up to trunk for me.
GCC 10 branch is being closed.

GCC 11 branch is being closed. |
Created attachment 49102 [details]
Compiler output

Have recently been working on a new version of the plf::colony container (plflib.org) and found GCC9 was giving ~10% worse performance on average in a given benchmark than GCC8. Further investigation found GCC10 was just as bad. The effect is repeatable across architectures - I've tested on xubuntu, windows running nuwen mingw, and on Core2 and Haswell CPUs, with and without -march=native specified.

Compiler flags are: -O2;-march=native;-std=c++17

Code presented is with an absolute minimum use-case - other benchmarks have not shown such strong performance differences - including both simpler and more complex tests. So I cannot reduce further, please do not ask me to do so.

The benchmark in question inserts into a container initially then iterates over container elements repeatedly, randomly erasing and/or inserting new elements.

Compilers/environments used:
Xubuntu 20: GCC8.4, GCC9.3, GCC10.0.1
Windows 7: Nuwen mingw GCC8.2, nuwen mingw GCC9.2

The attached code output is from the Xubuntu environment. Any questions let me know. I will help where I can, but my knowledge of assembly is limited.

Information on code components:
Nanotimer is a ~nanosecond-precision sub-timeslice cross-platform timer.
Colony is a bucket-array-like unordered sequence container.

The attached zip contains the build logs and compiler preprocessed outputs for GCC 8.4, 9.3 and 10.0.1

Thanks-
Mat
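
For readers without the attachment, the benchmark pattern described above (fill a container, then repeatedly iterate while randomly erasing and inserting) can be sketched roughly as follows. This is not the attached benchmark: it uses `std::list` as a stand-in for plf::colony and `std::chrono::steady_clock` in place of Nanotimer, and `run_churn`, `churn_result`, the erase probability and the fixed seed are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <random>

// Result of one run; the checksum keeps the iteration from being optimized away.
struct churn_result
{
    std::size_t final_size;
    unsigned long long checksum;
    double seconds;
};

// Hypothetical churn benchmark: std::list stands in for plf::colony.
churn_result run_churn(std::size_t initial, std::size_t passes,
                       unsigned erase_percent, unsigned seed = 42)
{
    assert(erase_percent <= 100);
    std::mt19937 rng(seed); // fixed seed so runs are comparable
    std::uniform_int_distribution<unsigned> roll(0, 99);

    // Initial insertion phase.
    std::list<int> c;
    for (std::size_t i = 0; i != initial; ++i)
        c.push_back(static_cast<int>(i));

    unsigned long long checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t p = 0; p != passes; ++p)
    {
        // Iterate over every element, erasing some at random.
        std::size_t erased = 0;
        for (auto it = c.begin(); it != c.end();)
        {
            checksum += static_cast<unsigned long long>(*it);
            if (roll(rng) < erase_percent)
            {
                it = c.erase(it);
                ++erased;
            }
            else
            {
                ++it;
            }
        }
        // Insert as many elements as were erased, keeping the size constant.
        for (std::size_t i = 0; i != erased; ++i)
            c.push_back(static_cast<int>(i));
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {c.size(), checksum, elapsed.count()};
}
```

Calling, for example, `run_churn(100000, 100, 1)` from a small `main` and printing `seconds` under each compiler gives a rough comparison. Re-inserting exactly as many elements as were erased is a simplification made here; the report only says that elements are randomly erased and/or inserted.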