Bug 96750 - 10-12% performance decrease in benchmark going from GCC8 to GCC9/GCC10
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: unknown
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-08-23 06:03 UTC by Matt Bentley
Modified: 2020-10-01 01:59 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-08-24 00:00:00


Attachments
Compiler output (603.46 KB, application/x-zip-compressed)
2020-08-23 06:03 UTC, Matt Bentley
Demonstration of code which doesn't trigger the performance anomaly. (46.01 KB, application/x-zip-compressed)
2020-09-27 23:35 UTC, Matt Bentley

Description Matt Bentley 2020-08-23 06:03:23 UTC
Created attachment 49102 [details]
Compiler output

I have recently been working on a new version of the plf::colony container (plflib.org) and found that GCC9 gives ~10% worse performance on average than GCC8 in one particular benchmark. Further investigation found that GCC10 is just as bad.

The effect is repeatable across platforms: I've tested on Xubuntu and on Windows running nuwen MinGW, on Core2 and Haswell CPUs, with and without -march=native specified.

Compiler flags are: -O2;-march=native;-std=c++17

The code presented is the absolute minimum use-case: other benchmarks, both simpler and more complex, have not shown such strong performance differences.
So I cannot reduce it further; please do not ask me to do so.

The benchmark in question first inserts into a container, then iterates over the container's elements repeatedly, randomly erasing and/or inserting new elements.
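
For readers without the attachment, the access pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: std::list stands in for plf::colony, and the function name `churn`, its parameters, and the ~1% erase/insert probabilities are all hypothetical, not taken from the attached benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <random>

// Hypothetical stand-in for the benchmark loop: std::list replaces
// plf::colony, and every name and probability here is illustrative only.
std::size_t churn(std::size_t initial_size, int passes, unsigned seed)
{
    assert(initial_size > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> percent(0, 99);

    // Initial insertion phase.
    std::list<int> container;
    for (std::size_t i = 0; i != initial_size; ++i)
        container.push_back(static_cast<int>(i));

    // Repeated iteration with random erasure and insertion.
    for (int pass = 0; pass != passes; ++pass)
    {
        for (auto it = container.begin(); it != container.end();)
        {
            const int roll = percent(rng);
            if (roll < 1)
            {
                it = container.erase(it);        // ~1% chance: erase current element
            }
            else
            {
                if (roll < 2)
                    container.insert(it, roll);  // ~1% chance: insert before current element
                ++it;
            }
        }
    }
    return container.size();
}
```

Timing a call such as `churn(100000, 100, 42)` under each compiler version would give the kind of comparison discussed in this report, though the real benchmark's numbers depend on colony's own insert and erase paths.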

Compilers/environments used:
Xubuntu 20: GCC8.4, GCC9.3, GCC10.0.1
Windows 7: Nuwen mingw GCC8.2, nuwen mingw GCC9.2

The attached code output is from the Xubuntu environment.

If you have any questions, let me know. I will help where I can, but my knowledge of assembly is limited.

Information on code components:
Nanotimer is a ~nanosecond-precision sub-timeslice cross-platform timer.
Colony is a bucket-array-like unordered sequence container.

The attached zip contains the build logs and compiler preprocessed outputs for GCC 8.4, 9.3 and 10.0.1

Thanks,
Matt
Comment 1 Martin Liška 2020-08-24 08:47:16 UTC
Confirmed, started with r9-5763-g61a8637c8893a252:

after:
1794240.0

before:
1802710.0

Anyway, using PGO one can get to:
1488310.0
Comment 2 Marc Glisse 2020-08-24 09:17:59 UTC
(In reply to Martin Liška from comment #1)
> after:
> 1794240.0
> 
> before:
> 1802710.0

That's less than 1% of difference (with "after" better than "before"), not the 10% regression claimed, maybe there is another relevant commit?
Comment 3 Martin Liška 2020-08-24 09:38:08 UTC
(In reply to Marc Glisse from comment #2)
> (In reply to Martin Liška from comment #1)
> > after:
> > 1794240.0
> > 
> > before:
> > 1802710.0
> 
> That's less than 1% of difference (with "after" better than "before"), not
> the 10% regression claimed, maybe there is another relevant commit?

Sorry, I copied the wrong numbers:

after:
1806140.0

before:
1705630.0

which is a ~6% regression.
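
The percentages in this exchange follow from a relative-difference calculation on the reported timings, where larger values are slower. A minimal sketch, with the helper name `regression_percent` being hypothetical:

```cpp
#include <cassert>

// Relative slowdown of `after` versus `before`, in percent.
// Positive means `after` is slower; negative means it is faster.
double regression_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

With the corrected figures, `regression_percent(1705630.0, 1806140.0)` is about 5.9, matching the ~6% above. With the mistakenly copied figures from comment 1, `regression_percent(1802710.0, 1794240.0)` is about -0.47, which is the sub-1% improvement noted in comment 2.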
Comment 4 Matt Bentley 2020-08-24 23:21:24 UTC
(In reply to Marc Glisse from comment #2)
> (In reply to Martin Liška from comment #1)
> > after:
> > 1794240.0
> > 
> > before:
> > 1802710.0
> 
> That's less than 1% of difference (with "after" better than "before"), not
> the 10% regression claimed, maybe there is another relevant commit?

See the .ods spreadsheet in the zip for my results with the same code.
Comment 5 Matt Bentley 2020-09-27 23:34:42 UTC
If anyone out there is interested in working on this, I have found the smallest possible change that restores GCC8-level performance:
it is literally eliminating one branch possibility in one function (move-insert).

The branch in question checks whether there are existing memory blocks to re-use or whether we need to create a new memory block. For example, if the reserve() function has been called, there will be existing memory blocks to re-use.

However, the performance drop occurs whether or not there are memory blocks to re-use. The outcome of the if statement is irrelevant. I have tested this: I can remove all sources of stored memory blocks (reserve(), erase()) and the problem still exists as long as this one branch remains in insert.
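
To make the shape of that branch concrete, here is a toy sketch of a block-based container whose insert either re-uses a stored block or allocates a new one. It is not plf::colony's implementation: the class `toy_block_container`, its members, and the fixed block size are all hypothetical and exist only to illustrate the kind of branch being discussed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy bucket-array container; NOT plf::colony's actual code.
template <typename T, std::size_t BlockSize = 4>
class toy_block_container
{
    static_assert(BlockSize > 0, "blocks must hold at least one element");

    struct block { std::vector<T> items; };

    std::vector<std::unique_ptr<block>> active_;  // blocks currently holding elements
    std::vector<std::unique_ptr<block>> unused_;  // stored blocks available for re-use

    static std::unique_ptr<block> make_block()
    {
        auto b = std::make_unique<block>();
        b->items.reserve(BlockSize);
        return b;
    }

public:
    // Rough analogue of reserve(): pre-allocate blocks for later re-use.
    void reserve_blocks(std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i)
            unused_.push_back(make_block());
    }

    void insert(T&& value)
    {
        if (active_.empty() || active_.back()->items.size() == BlockSize)
        {
            if (!unused_.empty())  // the kind of branch discussed above
            {
                active_.push_back(std::move(unused_.back()));
                unused_.pop_back();
            }
            else
            {
                active_.push_back(make_block());
            }
        }
        assert(active_.back()->items.size() < BlockSize);
        active_.back()->items.push_back(std::move(value));
    }

    std::size_t size() const
    {
        std::size_t total = 0;
        for (const auto& b : active_)
            total += b->items.size();
        return total;
    }

    std::size_t block_count() const { return active_.size(); }
    std::size_t unused_block_count() const { return unused_.size(); }
};
```

In this sketch, removing the `if (!unused_.empty())` test and always calling `make_block()` corresponds to the "branch eliminated" variant. The report's observation is that the slowdown appears merely from the branch being present in insert, even when no stored blocks ever exist.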

I've attached the source files to demonstrate this, including one plf_colony.h with the branch removed (renamed to plf_colony_fast.h), so you can see what difference there is.
This code is all under the zlib license and free to share, but it is an early beta, so please don't redistribute it.

Thanks,
Matt

P.S. For consistency I've also removed the non-move-insert and emplace instances of this branch statement, even though they won't be called by the benchmark code in a C++11-compliant compiler.
Comment 6 Matt Bentley 2020-09-27 23:35:57 UTC
Created attachment 49278 [details]
Demonstration of code which doesn't trigger the performance anomaly.

plf_colony_fast.h, which has one branch eliminated in each insert/emplace function, does not trigger the problem.