This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
- From: "justin at fathomdb dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 01 Sep 2011 17:33:32 +0000
- Subject: [Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
- Auto-submitted: auto-generated
- References: <bug-50257-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257
--- Comment #6 from Justin SB <justin at fathomdb dot com> 2011-09-01 17:33:32 UTC ---
Thanks so much for the help. I created a test case (attached) and tried out
the patch (the second version); it is a big improvement.
Here are the timings:
g++ -g -O3 --std=c++0x test.cc
Without patch:
user 0m13.201s
With patch:
user 0m10.237s
---
The special-casing is obviously a bit of a hack, so I wondered if we could do
better with a linear scan when the value is below a threshold (which would have
a much nicer memory-access pattern). It was a small improvement, but nowhere
near what special-casing the value 10 achieves:
With linear scan for small n:
user 0m12.561s
I therefore believe the real win comes from the special case allowing the
compiler to optimize away the entire __prime_list lookup: the function is
inlined and the value of n is known at compile time.
---
There is a comment in hashtable_policy:
// XXX This is a hack. There's no good reason for any of
// _Prime_rehash_policy's member functions to be inline.
However, I think that (with the patch) this is no longer true. There's a big
performance win from avoiding __prime_list entirely, so at least the special
case should remain inline.
---
Regarding your malloc comment: I'm using tcmalloc, which is a bit faster than
the default allocator; that's probably why the lookup showed up. The patch
still knocks the same ~3 seconds off the time:
g++ -g -O3 --std=c++0x test.cc -ltcmalloc
Without patch, with -ltcmalloc
user 0m10.173s
With patch, with -ltcmalloc
user 0m7.288s
---
In terms of the (non-cumulative) CPU cycles spent in the lookup vs. new/delete:
With -ltcmalloc:
Without patch: 44% iter() / 48% memory
With patch: 34% iter() / 63% memory
With stock malloc:
Without patch: 38% iter() / 58% memory
With patch: 18% iter() / 79% memory
So memory is indeed a big cost, but because I was using tcmalloc I could see
that the constructor was surprisingly expensive (about as much time was spent
in the constructor as in allocation or deallocation).
---
TL;DR: The patch works for me, and I think it's a significant win, because it
allows the compiler to optimize __prime_list away entirely.