This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
- From: "justin at fathomdb dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 01 Sep 2011 17:33:32 +0000
- Subject: [Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
- Auto-submitted: auto-generated
- References: <bug-50257-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257
--- Comment #6 from Justin SB <justin at fathomdb dot com> 2011-09-01 17:33:32 UTC ---
Thanks so much for the help. I created a test case (attached) and tried out
the patch (the second version); it is a big improvement.
Here are the timings:
g++ -g -O3 --std=c++0x test.cc
Without patch:
user 0m13.201s
With patch:
user 0m10.237s
---
The special-casing is obviously a bit of a hack, so I wondered if we could do
better with a linear scan when the value is below a threshold (which would have
a much nicer memory-access pattern). It was a small improvement, but nowhere
near what special-casing the value 10 achieves:
With linear scan for small n:
user 0m12.561s
I therefore believe the real win comes from the special case allowing the
compiler to optimize away the entire __prime_list lookup: the function is
inlined and the value of n is known at compile time.
---
There is a comment in hashtable_policy:
// XXX This is a hack. There's no good reason for any of
// _Prime_rehash_policy's member functions to be inline.
However, I think that (with the patch) this is no longer true. There's a big
performance win from avoiding __prime_list entirely, so at least the special
case should remain inline.
---
Regarding your malloc comment: I'm using tcmalloc, which is a bit faster than
the default allocator; that's probably why the lookup showed up. The patch
still knocks the same ~3 seconds off the time:
g++ -g -O3 --std=c++0x test.cc -ltcmalloc
Without patch, with -ltcmalloc
user 0m10.173s
With patch, with -ltcmalloc
user 0m7.288s
---
In terms of the (non-cumulative) CPU cycles spent in the lookup vs. new/delete:
With -ltcmalloc:
Without patch: 44% iter() / 48% memory
With patch: 34% iter() / 63% memory
With stock malloc:
Without patch: 38% iter() / 58% memory
With patch: 18% iter() / 79% memory
So memory is indeed a big cost, but because I was using tcmalloc I could see
that the constructor was surprisingly expensive (about as much time was spent
in the constructor as in allocation or deallocation).
---
TL;DR: The patch works for me, and I think it's a significant win, because it
allows the compiler to optimize __prime_list away entirely.