Bug 56493 - [4.8/4.9/5 Regression] Performance regression in google dense hashmap
Summary: [4.8/4.9/5 Regression] Performance regression in google dense hashmap
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++ (show other bugs)
Version: 4.7.1
: P3 normal
Target Milestone: 4.8.4
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-01 13:10 UTC by Andrew Gallagher
Modified: 2014-12-10 12:34 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 4.8.3, 4.9.2
Last reconfirmed: 2014-12-03 00:00:00


Attachments
Sample program using dense hashmap (187 bytes, application/octet-stream)
2013-03-01 13:11 UTC, Andrew Gallagher
Details
Preprocessed test program source (321.26 KB, application/zip)
2013-03-01 22:09 UTC, Andrew Gallagher
Details
Simple test case (212 bytes, application/octet-stream)
2013-04-13 01:37 UTC, Christopher Berner
Details
gcc49-pr56493.patch (786 bytes, patch)
2013-06-16 10:25 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Gallagher 2013-03-01 13:10:08 UTC
When compared to gcc-4.6.2, gcc-4.7.1 shows about a 10-20% performance regression when inserting into a google dense hashmap in a tight loop.  A small sample program is attached and was compiled with '-O3' and on a linux x86_64 box (using -std=gnu++0x, as required by sparsehash).  Potentially related, the generated binary under gcc-4.7 is also over 50% bigger.
Comment 1 Andrew Gallagher 2013-03-01 13:11:30 UTC
Created attachment 29561 [details]
Sample program using dense hashmap
Comment 2 Andrew Gallagher 2013-03-01 13:12:41 UTC
I also see this same perf regression on trunk (as of r195725).
Comment 3 Jonathan Wakely 2013-03-01 14:29:24 UTC
Instead of expecting everyone who looks at this bug to find and download the dense hash map code please provide preprocessed source as requested in the bug reporting guidelines that you were prominently asked to read before you entered this bug report: http://gcc.gnu.org/bugs/
Comment 4 Andrew Gallagher 2013-03-01 22:09:10 UTC
Created attachment 29565 [details]
Preprocessed test program source
Comment 5 Andrew Gallagher 2013-03-01 22:10:14 UTC
Ah, sorry.  Just attached preprocessed sources now (which had to be compressed to it in the 1K size limit).
Comment 6 Jonathan Wakely 2013-03-02 00:56:53 UTC
thanks
Comment 7 Andrew Gallagher 2013-03-09 04:17:39 UTC
I was able to bisect this to http://gcc.gnu.org/viewcvs?view=revision&revision=172430.  I haven't had time to dig in further, but that rev looks like it has plenty of inlining changes, which makes sense.
Comment 8 Jonathan Wakely 2013-03-09 10:57:06 UTC
Thanks for bisecting it
Comment 9 Christopher Berner 2013-04-13 01:37:27 UTC
Created attachment 29864 [details]
Simple test case

Here's an even simpler test case. It seems that there's a fairly serious regression with inlining. Even adding the attribute always_inline doesn't seem to restore performance. However, manually inlining the code with a macro restores performance to gcc 4.6 level
Comment 10 Andrew Gallagher 2013-04-13 01:44:27 UTC
I did several attempts at bisecting this.  Whenever I (hackily) reverted the change which caused the regression, it just popped up a few hundred revs later.  This seems to be gcc assigning new weights to functions, which determines whether the functions is early inlined or not.  So I *think* this is really an early inlining v. late inlining issue, and we happened to get lucky with the weights that 4.6 selected (I don't think there is a really effective way for gcc to predict how subsequent optimizations will/can benefit by early inlining).

Also, as far as I can see, there is no way to force early inlining (other than switching to macros).
Comment 11 Jakub Jelinek 2013-06-15 07:12:07 UTC
The reduced testcase has been slowed down by r176072, aka PR45437.  If you want to speed it up by source code changes, q += (unsigned int) f(i); would do instead of q += f(i);, or you can do q = q + f(i);.
Before that bugfix, the C++ FE produced
(void) (q = (int) ((unsigned int) f (i) + (unsigned int) q))
which is generally wrong (say if q was addressable and f could modify it),
after it we generate
(void) (q = TARGET_EXPR <D.2078, f (i)>;, (int) ((long unsigned int) q + D.2078);)
The question is why the narrowing of the operation (to be done on unsigned int rather than long unsigned int) isn't performed here, that is something we do right now solely in the FE.  There are PRs (PR45397 and PR47477) about lack of type demotion (and promotion) on GIMPLE, but don't know what the current state of that is.
Comment 12 jim 2013-06-16 00:03:51 UTC
Thank you, Jakub.
Comment 13 Jakub Jelinek 2013-06-16 10:25:14 UTC
Created attachment 30312 [details]
gcc49-pr56493.patch

Untested patch that in the FE restores the performance on the simplified testcase.  Haven't tried the original.  Of course, having this optimization be performed in GIMPLE will be much more worthwhile, but likely not backportable.
Comment 14 Uroš Bizjak 2014-12-03 09:40:36 UTC
According to Comment #0 and Comment #9, this PR should be confirmed as a  regression from 4.6.
Comment 15 Jakub Jelinek 2014-12-03 09:56:45 UTC
We are using that #c13 patch successfully for almost 18 months now in Fedora.
As type promotion/demotion patch is not (sadly) going to happen for GCC 5, I still think it would be worthwhile to apply it temporarily, until type promotion/demotion is written.
Comment 16 Richard Biener 2014-12-03 10:13:11 UTC
(In reply to Jakub Jelinek from comment #15)
> We are using that #c13 patch successfully for almost 18 months now in Fedora.
> As type promotion/demotion patch is not (sadly) going to happen for GCC 5, I
> still think it would be worthwhile to apply it temporarily, until type
> promotion/demotion is written.

Works for me.
Comment 17 Richard Biener 2014-12-03 10:20:48 UTC
Btw, I think that

(int) ((long unsigned int) q + D.2078)

to

(int) ((unsigned int) q + (unsigned int) D.2078)

doesn't look like an always profitable pattern on GIMPLE (more stmts).  Also
take into consideration what targets PROMOTE_MODE (mode-of-q/D.2078) do
and prefer types according to that.

The vectorizer prefers single type sizes throughout computations to avoid
re-packing.
Comment 18 Jakub Jelinek 2014-12-03 10:34:20 UTC
(In reply to Richard Biener from comment #17)
> Btw, I think that
> 
> (int) ((long unsigned int) q + D.2078)
> 
> to
> 
> (int) ((unsigned int) q + (unsigned int) D.2078)
> 
> doesn't look like an always profitable pattern on GIMPLE (more stmts).  Also
> take into consideration what targets PROMOTE_MODE (mode-of-q/D.2078) do
> and prefer types according to that.
> 
> The vectorizer prefers single type sizes throughout computations to avoid
> re-packing.

Which is why the proposal was to do demotion early and promote back depending on machine description (and, perhaps on separate IL copy, in between ifcvt and vectorizer also promote to decrease re-packing).  The demotion would help with avoiding unnecessary computations (e.g. those affecting just the high bits), canonicalizing the IL and in some cases allowing bigger parts of computations with the same type bitsizes, and promotion would be an attempt to help with sub-word arithmetics on word only arithmetics targets, etc.
Dunno why Kai has stopped working on that :(.
Comment 19 Richard Biener 2014-12-03 11:26:42 UTC
(In reply to Jakub Jelinek from comment #18)
> (In reply to Richard Biener from comment #17)
> > Btw, I think that
> > 
> > (int) ((long unsigned int) q + D.2078)
> > 
> > to
> > 
> > (int) ((unsigned int) q + (unsigned int) D.2078)
> > 
> > doesn't look like an always profitable pattern on GIMPLE (more stmts).  Also
> > take into consideration what targets PROMOTE_MODE (mode-of-q/D.2078) do
> > and prefer types according to that.
> > 
> > The vectorizer prefers single type sizes throughout computations to avoid
> > re-packing.
> 
> Which is why the proposal was to do demotion early and promote back
> depending on machine description (and, perhaps on separate IL copy, in
> between ifcvt and vectorizer also promote to decrease re-packing).  The
> demotion would help with avoiding unnecessary computations (e.g. those
> affecting just the high bits), canonicalizing the IL and in some cases
> allowing bigger parts of computations with the same type bitsizes, and
> promotion would be an attempt to help with sub-word arithmetics on word only
> arithmetics targets, etc.
> Dunno why Kai has stopped working on that :(.

Because I pushed back hard.  He ended up with multiple passes here and there,
and mostly motivated things by better folding with less conversions in the IL.

I also think you can't really separate demotion and promotion (demotion
is just making range information a little bit more "explicit" in the IL,
and thus IMHO tied to VRP if it runs).  Promotion (which can of course also
be effective demotion if legal) is to make things aligned with the target.
Comment 20 Jakub Jelinek 2014-12-04 09:47:18 UTC
Author: jakub
Date: Thu Dec  4 09:46:45 2014
New Revision: 218345

URL: https://gcc.gnu.org/viewcvs?rev=218345&root=gcc&view=rev
Log:
	PR c++/56493
	* convert.c (convert_to_real, convert_to_expr, convert_to_complex):
	Handle COMPOUND_EXPR.

	* c-c++-common/pr56493.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/pr56493.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/convert.c
    trunk/gcc/testsuite/ChangeLog
Comment 21 Jakub Jelinek 2014-12-04 09:48:25 UTC
Author: jakub
Date: Thu Dec  4 09:47:54 2014
New Revision: 218346

URL: https://gcc.gnu.org/viewcvs?rev=218346&root=gcc&view=rev
Log:
	PR c++/56493
	* convert.c (convert_to_real, convert_to_expr, convert_to_complex):
	Handle COMPOUND_EXPR.

	* c-c++-common/pr56493.c: New test.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/c-c++-common/pr56493.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/convert.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
Comment 22 Jakub Jelinek 2014-12-04 09:49:27 UTC
Author: jakub
Date: Thu Dec  4 09:48:54 2014
New Revision: 218347

URL: https://gcc.gnu.org/viewcvs?rev=218347&root=gcc&view=rev
Log:
	PR c++/56493
	* convert.c (convert_to_real, convert_to_expr, convert_to_complex):
	Handle COMPOUND_EXPR.

	* c-c++-common/pr56493.c: New test.

Added:
    branches/gcc-4_8-branch/gcc/testsuite/c-c++-common/pr56493.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/convert.c
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
Comment 23 Richard Biener 2014-12-10 12:34:27 UTC
Fixed.