The following piece of code shows a 3x performance regression from 4.6.2 to 4.7.1. The cause appears to be a change in libstdc++. The code is compiled with -O2 -std=c++0x. -O3 doesn't make much difference; neither does the call to reserve. The attachment contains profiling done with google-perftools. It looks like the major cost comes from rehashing. Is anyone aware of this issue, or am I the first to report it?

#include <unordered_map>
using namespace std;

int main() {
  const int N = 10000000;
  unordered_map<int, int> m;
  m.reserve(2 * N);
  for (int i = 0; i < N; i++) {
    m[i] = i;
  }
}

Timing:

[hidden]$ time ./a-4.6.2.out
real    0m1.029s
user    0m0.787s
sys     0m0.239s
[hidden]$ time ./a-4.7.1.out
real    0m3.364s
user    0m2.596s
sys     0m0.761s
Created attachment 27861 [details] Profiling of gcc-4.7.1 using google-perftools
Created attachment 27862 [details] Profiling of gcc-4.6.2 using google-perftools
Francois, can you have a look? Thanks!
I wonder, anyway, whether the apparent slowdown is just an artifact of the different handling of the load factor in the reworked unordered containers we have been delivering since 4.7.0. I would suggest the submitter experiment a bit with max_load_factor and check whether, when 4.6.x seems faster at insertion time, the actual load factor is higher (in which case searching would be slower).
@Paolo Carlini: can you talk more about how to experiment with max_load_factor? As long as I use the same max_load_factor for 4.6 and 4.7, I can still see the performance difference (3x slower is the best result):

max_load_factor   gcc-4.6.2   gcc-4.7.1
0.2               1.600s      7.650s
0.5               1.125s      4.251s
1.0               1.004s      3.378s
2.0               0.914s      20.471s
5.0               0.917s      29.132s
In some cases 4.6.x handled max_load_factor incorrectly. Thus the idea isn't to compare 4.6.x to 4.7.x with the same max_load_factor (I don't think that's useful); instead, actually measure the load factor(s) and see whether, for some values of max_load_factor, the actual load factors differ between 4.6.x and 4.7.x. In any case, measure the actual load factor while the map grows.
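The measurement suggested above - tracking the actual load factor while the map grows - can be sketched as follows. This is a hypothetical helper, not code from the report; the function name is mine:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Collect the load factor observed right after each rehash while
// inserting n distinct keys. Comparing the resulting sequences between
// 4.6.x and 4.7.x shows whether the two versions really run at the
// same actual load factor for a given max_load_factor.
std::vector<float> load_factors_after_rehash(int n)
{
    std::unordered_map<int, int> m;
    std::size_t buckets = m.bucket_count();
    std::vector<float> out;
    for (int i = 0; i < n; ++i)
    {
        m[i] = i;
        if (m.bucket_count() != buckets)   // a rehash just happened
        {
            out.push_back(m.load_factor());
            buckets = m.bucket_count();
        }
    }
    return out;
}
```

Right after each rehash the load factor should sit at or below max_load_factor (1.0 by default), so the two library versions can be compared point by point.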
@Paolo Carlini: the problem is, with different max_load_factor values in the range [0.2-5], the *best* result of 4.7.1 is still 2x slower than the *worst* of 4.6.2. I printed out the average load factor during insertion. It looks like 4.7.1's load factor stays very close to max_load_factor, while the average for 4.6.2 is about 1/4 of max_load_factor. But what conclusion can we draw from this result?
I just stumbled over this bug while searching for something related to the max load factor (it seems that since 4.7 the hashtable also shrinks when you set the load factor higher, which is unfortunate, at least for me). I changed the testcase to count the number of rehashes (by checking the bucket count before and after each insert) and found:

gcc 4.6 without reserve: 21
gcc 4.6 with reserve: 1
gcc 4.7 without reserve: 155
gcc 4.7 with reserve: 160

Then I had a look in callgrind, and most of the time seems to be spent rehashing. So I would assume that once the 4.7 version gets the number of rehashes down to where it was, we also get the speed back to where it was. I would say that the reserve behaviour is definitely broken, though, as it even increases the number of rehashes. Personally, I would just never have the hashtable shrink on insert and/or on setting the load factor, only on explicit rehash() calls...
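The counting approach described above (watching bucket_count() across inserts) can be reproduced with a small helper; the function name and the 2 * n reserve factor mirror the original testcase but are otherwise my own:

```cpp
#include <cstddef>
#include <unordered_map>

// Count how many times the table rehashes while inserting n keys,
// optionally calling reserve(2 * n) first, as in the original testcase.
// A rehash is detected as a change in bucket_count() across an insert.
int count_rehashes(int n, bool use_reserve)
{
    std::unordered_map<int, int> m;
    if (use_reserve)
        m.reserve(2 * n);
    int rehashes = 0;
    std::size_t buckets = m.bucket_count();
    for (int i = 0; i < n; ++i)
    {
        m[i] = i;
        if (m.bucket_count() != buckets)   // bucket count changed => rehash
        {
            ++rehashes;
            buckets = m.bucket_count();
        }
    }
    return rehashes;
}
```

With a correctly working reserve, the first variant should report zero rehashes; without reserve, the count depends directly on the growth policy being discussed in this PR.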
I confirm that the reserve method is broken. To correctly handle the size hint that can be given through the hashtable constructor, I set _M_rehash_policy._M_prev_resize to 0, just to avoid the container shrinking until it reaches the targeted size. The same thing should be done in the rehash method, which is called from reserve. I will submit a patch soon. Regarding shrinking on insert calls: it can easily be avoided by calling reserve (normally), so I considered it better to offer a container with symmetric behavior. I will also, I think, reconsider the growth factor. In one of my earlier hashtable redesign proposals the number of empty buckets was a performance problem. With the current implementation it is no longer a problem, so we could grow a little faster.
A patch is available here: http://gcc.gnu.org/ml/libstdc++/2012-07/msg00051.html Submitter and interested people can give it a try before it goes in.
Author: fdumont
Date: Wed Jul 25 19:32:48 2012
New Revision: 189863

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189863
Log:
2012-07-25  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h
	(_Hashtable<>::_Hashtable(_InputIterator, _InputIterator,
	size_type, ...): Remove std::max usage to guarantee that hashtable
	state is consistent with hash policy state.
	(_Hashtable<>::rehash): Likewise. Set _M_prev_resize to 0 to avoid
	the hashtable to be shrinking on next insertion.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_multiset/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_map/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_multimap/modifiers/reserve.cc: New.

Added:
    trunk/libstdc++-v3/testsuite/23_containers/unordered_map/modifiers/reserve.cc
    trunk/libstdc++-v3/testsuite/23_containers/unordered_multimap/modifiers/reserve.cc
    trunk/libstdc++-v3/testsuite/23_containers/unordered_multiset/modifiers/reserve.cc
    trunk/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/hashtable.h
I can confirm that reserve now works as I would expect (causing no further rehashes). However, the number of rehashes done in the testcase is still 155 (needing 4.5s), while gcc 4.6 only does 21 (needing 1.5s). Monitoring the bucket counts on resizes, it seems that gcc 4.8 is now much more fine-grained than gcc 4.6. I am unsure whether this is expected, deliberate and/or a good idea.
Author: fdumont
Date: Thu Jul 26 12:31:50 2012
New Revision: 189889

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189889
Log:
2012-07-26  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h
	(_Hashtable<>::_Hashtable(_InputIterator, _InputIterator,
	size_type, ...): Remove std::max usage to guarantee that hashtable
	state is consistent with hash policy state.
	(_Hashtable<>::rehash): Likewise. Set _M_prev_resize to 0 to avoid
	the hashtable shrinking on next insertion.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_multiset/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_map/modifiers/reserve.cc: New.
	* testsuite/23_containers/unordered_multimap/modifiers/reserve.cc: New.

Added:
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_map/modifiers/reserve.cc
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multimap/modifiers/reserve.cc
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multiset/modifiers/reserve.cc
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
Modified:
    branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable.h
In any case, please add the testcase showing 4.5s vs 1.5s.
Tried the patch and just as Dennis Lubert pointed out, the number of rehashes is not decreased. Is there any plan to fix this issue?
In my tests, with this patch, 4.7.1 is about 10% slower than 4.6 ... a vast improvement but certainly not parity.

./bench46        1.75s user 0.82s system 99% cpu  2.577 total
./bench47        8.01s user 2.78s system 99% cpu 10.800 total
./bench47+patch  1.95s user 0.80s system 99% cpu  2.764 total
Because of more rehashing, unrelated to reserve, I suppose?
I couldn't say. I don't understand the issue, I'm just reporting results and deploying packages for my fellow devs.
Testcase please.
Are you talking to me? 'cause I was providing results for the patch already committed to svn, using the code in this very bug description.
I haven't touched the growth speed for the moment. I prefer to fix the Standard conformity of reserve first. Now I can restore the 4.6 growth speed, as it seems to be a reasonably correct one. Note that being only 3x slower with so many more rehash operations doesn't seem so bad! Of course, growing faster will also induce higher memory consumption, which is not shown by the simple benchmark proposed here.
I am on vacation so I don't have the testcase at hand, but it is the same as likan posted in the original bug report, minus the reserve. The main difference is that without reserve I see 21 rehashes in gcc 4.6 and 155 in gcc 4.7. I added some code that, after each insert, tests whether the bucket count has changed; I think that should be trivial to add yourselves without me providing it.
Author: fdumont
Date: Sun Jul 29 16:44:18 2012
New Revision: 189938

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189938
Log:
2012-07-29  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable_policy.h
	(_Prime_rehash_policy::_M_next_bkt): Add a growth factor set to 2
	to boost growth in the number of buckets.
	* testsuite/performance/23_containers/insert/unordered_set.cc: New.

Added:
    trunk/libstdc++-v3/testsuite/performance/23_containers/insert/unordered_set.cc
Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/hashtable_policy.h
Author: fdumont
Date: Sun Jul 29 17:06:21 2012
New Revision: 189941

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189941
Log:
2012-07-29  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable_policy.h
	(_Prime_rehash_policy::_M_next_bkt): Add a growth factor set to 2
	to boost growth in the number of buckets.
	* testsuite/performance/23_containers/insert/unordered_set.cc: New.

Added:
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/performance/23_containers/insert/unordered_set.cc
Modified:
    branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable_policy.h
Fixed.
I'm afraid this doesn't quite do it. I still observe a >60% slowdown going from gcc-4.4 to gcc-4.7, with this fix already in, specifically for insert(). It's a 320,000-member table I am dealing with here. I can make 4.7 as fast as 4.4 by preemptively reserving what I know (for this test) to be the maximum size I need, but measured resident memory shoots up (not unexpectedly). And resident memory of the 4.7 build was already higher than 4.4, so I don't think this can be the answer here. Playing with the load factor gave a minor speedup at 0.4 (from 1.0), but did not reach 4.4 performance. Other load factors (lower than 0.4 and higher than 1.0) are even slower. Is there more specific information available about the tradeoff numbers that made GNU pick this new implementation?
I can only recommend profiling (by gprof or other means).
I should clarify that I was pointed to this problem (with insert) by profiling and for us nothing pops up as faster (or smaller for that matter). Hence the question.
Indeed, if we have to do something about that, we need to see those profiles. That's my point. Otherwise, blindly, it could be anything, i.e., not necessarily the rehashes which are the topic of this specific PR.
I am going to take a look at the 4.4 implementation to see if I can explain the difference. But while waiting for that, can you confirm that without the reserve the number of rehashes is similar to what you had with version 4.4? Regarding the resident memory, I can't remember any overhead with the new implementation, but once again, checking the 4.4 implementation will point me to it. Do you see it with the code originally posted, that is to say an unordered_map<int, int>? For information, the major reason to review the implementation was Standard conformity: the erase method had to return the iterator following the erased one.
The case I reported is in a large system. I did an isolated test case which sees the time shoot up from 32.54 seconds in gcc-4.4 to 42.86s in gcc-4.7, and back down to 32.27s when using gcc-4.7 with the tr1 hashtables. As you can see, there also is a reasonably catastrophic increase in minor page faults.

gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1), extra flags:
meh: 1635550270 4081940 288998848
0:32.54 32.54 real 30.50 user 1.97 sys 99% CPU 0/1394855 faults
   text    data     bss     dec     hex filename
   5456     688 3840032 3846176  3ab020 sdriver

gcc version 4.7.2 (Debian 4.7.2-4), extra flags:
meh: 1635550270 4081940 288998848
0:42.86 42.86 real 40.21 user 2.56 sys 99% CPU 0/2101681 faults
   text    data     bss     dec     hex filename
   6822     760 3840032 3847614  3ab5be sdriver

gcc version 4.7.2 (Debian 4.7.2-4), extra flags: -DCRA_USE_TR1
meh: 1635550270 4081940 288998848
0:32.27 32.27 real 30.29 user 1.92 sys 99% CPU 0/1394853 faults
   text    data     bss     dec     hex filename
   5426     736 3840032 3846194  3ab032 sdriver

Profiles for the three cases: http://www.cons.org/gcc-hash-slowdown/
Source code for the test case: http://www.cons.org/gcc-hash-slowdown/reproduce-slow-hash.tar.gz

Please feel free to copy things if you want to append them to this PR. My original case, with a worse slowdown, is in a complicated system and driven from Lisp. For no other reason than Murphy's law, the profiler in there does not penetrate the layers of C++ frames for this run, although it normally does, so I'm not able to break up insert() into its parts right now. I have no reason to believe it is different from this test case, probably just a slower hash. I have seen gcc-4.7 + tr1 hashtable faster than gcc-4.4 both in the large system and in the test case, so I'm pretty confident we are looking at the same thing here. I'm fixing that profiler; more later. Hope this helps.
Thanks. Francois, can you please investigate this issue further? In fact, if the slowdown goes away with a preliminary reserve, it should be possible to handle the problem rather easily, by tweaking the growth policy or something. (We should be a bit patient and consider that, vs 4.4.x, the implementation is almost completely different.)
I fear that this performance issue is a normal drawback of the major enhancement for PR 41975. Before this change the hashtable data model was like a std::vector<std::forward_list<>>. During an insert operation it was easy to find out whether the inserted element was already in the targeted bucket by simply checking for equality against each element of the bucket's forward_list. The code was something like:

for (auto it = bucket.begin(); it != bucket.end(); ++it)
  if (comp(*it, element))
    // The element already exists
    return false;
// Ok, no equivalent element
return true;

After the change, the data model became a combination of a std::forward_list containing all elements and a std::vector<std::forward_list<>::iterator> representing the buckets. To check whether an element is already in a bucket, we still need to compare for equality with each element of the bucket, but the elements of the bucket are harder to identify. There is no bucket end anymore, so we must also check that the element we are about to compare is still in the bucket. The code became something like:

// Pre: bucket not empty.
for (auto it = buckets[n];;)
  {
    if (comp(*it, element))
      // The element already exists
      return false;
    ++it;
    if (it == end() || bucket(it) != n)
      // We are not in bucket n anymore
      break;
  }
// Ok, no equivalent element
return true;

The new design helps iterating through the container, because even if it is almost empty we simply iterate within the forward_list. In the previous implementation we needed to iterate within a bucket's forward_list and then jump over all the empty buckets, which could be very expensive. Try running testsuite/performance/23_containers/insert_erase/41975.cc; you can easily tweak it to run with the tr1 implementation, and you will notice the difference. This test also shows the insert overhead, even if with a perfect hasher like the one used in this test the impact is very limited.
To make bucket(it) not too expensive, the new implementation caches the hash code most of the time, which explains the additional memory. You can ask for it not to be cached, but you might have to qualify your hasher with noexcept; static assertions will tell you so. So for the moment I can't see a way to restore tr1 insert performance without impacting erase performance. In my opinion, a Standard implementation needs to behave correctly in all use cases, not perform very well in one use case but very badly in another.
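As a sketch of the noexcept qualification mentioned above (the struct and function names here are mine, and whether the hash code is actually cached also depends on other implementation-internal traits):

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

// A hasher whose call operator is marked noexcept; this qualification is
// what the static assertions mentioned above check for before libstdc++
// will store elements without a cached hash code.
struct IntHash
{
    std::size_t operator()(int x) const noexcept
    { return std::hash<int>()(x); }
};

// Fill a set using the noexcept-qualified hasher and return its size.
std::size_t fill_with_noexcept_hasher(int n)
{
    std::unordered_set<int, IntHash> s;
    for (int i = 0; i < n; ++i)
        s.insert(i);
    return s.size();
}
```

The container behaves identically either way; the qualification only affects whether each node carries a copy of the hash code, i.e. the memory overhead discussed in this comment.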
Well, I haven't looked into this issue in detail, but it looks like everyone is about the same speed if you put a foo.{reserve or rehash}(n) before the inserts. Unfortunately, it doesn't look as simple as the new implementation calling for a rehash more often than the old, because it doesn't. So I don't know whether the slowness is because rehash is now a lot slower, or because insert is slower, but only when there aren't a huge number of empty buckets. I'll also note that libc++'s implementation (seen here: http://llvm.org/svn/llvm-project/libcxx/trunk/) appears to get the same speed as the old libstdc++ implementation, while appearing to have approximately the same bucket data structure as the new libstdc++ implementation. That says to me that it should be *possible* somehow to make the new version fast.
Ok, let's reopen this with an adjusted Summary.
It seems that this commit doesn't fully fix the issue. If you call rehash multiple times with the same size, the second call to rehash resets _M_prev_resize to a non-zero value in _M_next_bkt(). Here is a sample program that shows this behavior:

#include <stdio.h>
#include <unordered_map>

int main(void)
{
    std::unordered_map<int, int> myMap;
    myMap.rehash(4000000);
    myMap.rehash(4000000);
    unsigned long long buckets = myMap.bucket_count();
    int i = 0;
    while (i < 2000000000) {
        myMap.insert(std::make_pair(i, 0));
        ++i;
        if (buckets != myMap.bucket_count()) {
            printf("buckets %llu -> %llu\n", buckets,
                   (unsigned long long)myMap.bucket_count());
            buckets = myMap.bucket_count();
        }
    }
    return 0;
}

(In reply to comment #13)
> Author: fdumont
> Date: Thu Jul 26 12:31:50 2012
> New Revision: 189889
>
> URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189889
> Log:
> 2012-07-26  François Dumont  <fdumont@gcc.gnu.org>
>
> 	PR libstdc++/54075
> 	* include/bits/hashtable.h
> 	(_Hashtable<>::_Hashtable(_InputIterator, _InputIterator,
> 	size_type, ...): Remove std::max usage to guarantee that hashtable
> 	state is consistent with hash policy state.
> 	(_Hashtable<>::rehash): Likewise. Set _M_prev_resize to 0 to avoid
> 	the hashtable shrinking on next insertion.
> 	* testsuite/23_containers/unordered_set/modifiers/reserve.cc: New.
> 	* testsuite/23_containers/unordered_multiset/modifiers/reserve.cc: New.
> 	* testsuite/23_containers/unordered_map/modifiers/reserve.cc: New.
> 	* testsuite/23_containers/unordered_multimap/modifiers/reserve.cc: New.
> Added:
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_map/modifiers/reserve.cc
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multimap/modifiers/reserve.cc
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multiset/modifiers/reserve.cc
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
> Modified:
> branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
> branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable.h
Francois, can you please look further into this, possibly basing it on the new testcase? Thanks!
Sure, I will. However, I don't expect this problem to have any relation to the performance subject of this PR.
Ok thanks. I guess depending on the complexity of the fixes we can apply some only to mainline first and reconsider the 4_7 branch later. Please do your best to work on both issues: we just entered Stage 3 thus no new features from now on, we are all concentrated on bug fixes until the release.
Here is the patch to fix the redundant rehash/reserve issue.

2012-11-07  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
	policy state if no rehash.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc
	(test02): New.

I had prepared and tested it in the 4.7 branch but I can apply the same on trunk.

Ok to commit? If so, where?

François

On 11/06/2012 10:33 PM, paolo.carlini at oracle dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54075
>
> --- Comment #39 from Paolo Carlini <paolo.carlini at oracle dot com> 2012-11-06 21:33:57 UTC ---
> Ok thanks. I guess depending on the complexity of the fixes we can apply some
> only to mainline first and reconsider the 4_7 branch later. Please do your best
> to work on both issues: we just entered Stage 3 thus no new features from now
> on, we are all concentrated on bug fixes until the release.
On 7 November 2012 22:02, François Dumont wrote: > > Ok to commit ? If so, where ? That patch is OK for trunk and 4.7, thanks.
On 11/08/2012 01:58 AM, Jonathan Wakely wrote:
> On 7 November 2012 22:02, François Dumont wrote:
>> Ok to commit ? If so, where ?
> That patch is OK for trunk and 4.7, thanks.

... I was having a look at this code - the patch per se is straightforward enough and can in any case go in now - and something is puzzling me a lot: we always compute things, in _M_next_bkt etc., in terms of __grown_n, thus __n * 2, until the final _M_rehash call. On the other hand, the old-old code for rehash didn't use _M_growth_factor in these computations; it just literally enforced the post-conditions of the Standard. Are we sure we aren't, so to speak, rehashing too much? For example, when the load factor is very low and doesn't count, it looks like a current rehash(n) accomplishes the same as an old rehash(n * 2)?!? Something seems wrong; can you double check that? In any case, the comments before _M_next_bkt need fixing.

Thanks,
Paolo.
On 11/08/2012 02:56 AM, Paolo Carlini wrote:
> On the other hand, the old-old code for rehash didn't use
> _M_growth_factor in these computations, it just literally enforced the
> post-conditions of the Standard. Are we sure we aren't so to speak
> rehashing too much? For example, when the load factor is very low and
> doesn't count, it looks like a current rehash(n) accomplishes the same
> of an old rehash(n * 2)?!? Something seems wrong, can you double check
> that? In any case the comments before _M_next_bkt would need fixing.

... in other terms, I really think that the only uses of _S_growth_factor should return to being inside _M_need_rehash, because that's the function called by the inserts, when the container must automatically grow the number of buckets. Elsewhere - the constructors, rehash - there should be no need to know about _S_growth_factor.

Paolo.
Author: fdumont
Date: Thu Nov 8 20:06:00 2012
New Revision: 193335

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193335
Log:
2012-11-08  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
	policy state if no rehash.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc
	(test02): New.

Modified:
    branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
    branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable.h
    branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
Created attachment 28638 [details] hashtable_rehash.patch

Author: fdumont
Date: Thu Nov 8 20:16:04 2012
New Revision: 193339

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193339
Log:
2012-11-08  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
	policy state if no rehash.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc
	(test02): New.

Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/bits/hashtable.h
    trunk/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
Created attachment 28640 [details] performance.patch

Attached patch applied to trunk and 4.7 branch.

2012-11-08  François Dumont  <fdumont@gcc.gnu.org>

	PR libstdc++/54075
	* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
	policy state if no rehash.
	* testsuite/23_containers/unordered_set/modifiers/reserve.cc
	(test02): New.

François

On 11/08/2012 01:58 AM, Jonathan Wakely wrote:
> On 7 November 2012 22:02, François Dumont wrote:
>> Ok to commit ? If so, where ?
> That patch is OK for trunk and 4.7, thanks.
On 11/08/2012 03:25 AM, Paolo Carlini wrote:
> ... in other terms, I really think that the only uses of
> _S_growth_factor should return to be inside _M_need_rehash, because
> that's the function called by the inserts, when the container must
> automatically grow the number of buckets. Elsewhere, like the
> constructors, rehash, should not need to know about _S_growth_factor.
>
> Paolo.

I haven't yet considered the following emails, but based on those good remarks I have made the attached patch. Surprisingly, it seems to have a good impact on performance, even if before it testsuite/performance/23_containers/insert/unordered_set.cc was showing that the new implementation was better. I have also started to adapt the tests so that it's possible to check unordered performance with the std or std::tr1 implementations. Can I generalize it to the other tests? If so, is it a good approach?

François
It looks like this history is missing an end.
This performance issue is a result of fixing https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41975. That fix resulted in many more modulo operations and thus expensive divisions. I plan to commit an alternative hash policy using a power-of-2 number of buckets, so that the modulo is trivial. Benchmarks show that performance is better, but still not at the level of the tr1 implementation on the operations you are interested in. So it is difficult to close this ticket, because the performance regression is still there, but it might stay this way for a long time.
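The "trivial modulo" idea behind the planned power-of-2 policy can be sketched as follows; this illustrates the arithmetic only and is not the actual libstdc++ policy code (the function names are mine):

```cpp
#include <cstddef>

// General case: mapping a hash code to a bucket index takes an integer
// division/modulo - the expensive operation mentioned above.
std::size_t bucket_index_mod(std::size_t hash, std::size_t n_buckets)
{
    return hash % n_buckets;
}

// With a power-of-2 bucket count the same mapping reduces to a single
// bit mask. Precondition: n_buckets is a power of two.
std::size_t bucket_index_mask(std::size_t hash, std::size_t n_buckets)
{
    return hash & (n_buckets - 1);
}
```

The tradeoff is that a power-of-2 table relies entirely on the hash function to spread the low bits, whereas a prime bucket count mixes in all bits through the division.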