This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #6 from Justin SB <justin at fathomdb dot com> 2011-09-01 17:33:32 UTC ---
Thanks so much for the help.  I created a test case (attached) and tried out
the patch (the second version); it is a big improvement.
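
The attached test case isn't shown inline; the following is a sketch of the
kind of workload it measures (not the attached file itself): many short-lived
unordered_maps whose default construction asks the rehash policy for an
initial bucket count.

  #include <unordered_map>
  #include <cstddef>

  int main()
  {
    std::size_t sum = 0;
    // Each default construction requests an initial bucket count, which
    // (unpatched) means a search over the large __prime_list.
    for (int i = 0; i < 10000000; ++i)
    {
      std::unordered_map<int, int> m;
      m[i] = i;
      sum += m.size();   // keep the maps observable so the loop isn't removed
    }
    return sum == 0;     // use the accumulated value
  }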

Here are the timings:
g++ -g -O3 --std=c++0x test.cc

Without patch:
user    0m13.201s

With patch:
user    0m10.237s

---

The special casing is obviously a bit of a hack, so I wondered whether we could
do better by doing a linear scan when the value is below a threshold (which
would have a much nicer memory-access pattern).  That was a small improvement,
but nowhere near what special-casing the value 10 gives:

With linear scan for small n:
user    0m12.561s
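
Roughly what the linear-scan variant looked like (a sketch with illustrative
names and a made-up threshold, not the actual libstdc++ code):

  #include <algorithm>
  #include <cstddef>
  #include <iostream>

  // Stand-in for the front of the real __prime_list; the real table is much
  // longer, so this sketch assumes n never exceeds the last entry.
  static const std::size_t primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 };
  static const std::size_t nprimes = sizeof(primes) / sizeof(primes[0]);
  static const std::size_t linear_limit = 5;   // hypothetical threshold

  std::size_t next_bucket(std::size_t n)
  {
    if (n <= primes[linear_limit - 1])
    {
      // Small n: a short linear scan touches only the start of the table.
      for (std::size_t i = 0; i < linear_limit; ++i)
        if (primes[i] >= n)
          return primes[i];
    }
    // Otherwise: the usual binary search over the whole table.
    return *std::lower_bound(primes, primes + nprimes, n);
  }

  int main() { std::cout << next_bucket(10) << '\n'; }   // prints 11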

I therefore believe the real win comes from the special case allowing the
compiler to optimize away the entire lookup in __prime_list, since the
function is inlined and the value of n is known.
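
To illustrate the mechanism (a sketch of the idea, not the actual patch):
when the bucket-count hint is the default constant and the policy function is
inline, the early-out folds to a constant and the table is never touched.

  #include <cstddef>

  inline std::size_t next_bkt(std::size_t n)
  {
    if (n <= 11)   // hypothetical special case covering the default hint
      return 11;   // no access to __prime_list on this path
    // ... otherwise fall back to searching the prime table (omitted) ...
    return n;      // placeholder so the sketch compiles
  }

  int main()
  {
    // The argument is the literal 10, so once next_bkt is inlined the whole
    // call folds to the constant 11 at compile time.
    return next_bkt(10) == 11 ? 0 : 1;
  }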

---

There is a comment in hashtable_policy:

  // XXX This is a hack.  There's no good reason for any of
  // _Prime_rehash_policy's member functions to be inline.

However, I think that (with the patch) this is no longer true.  There's a big
performance win from avoiding touching __prime_list at all, so at least the
special case should remain inline.

---

Regarding your malloc comment: I'm using tcmalloc, which is a bit faster than
the default allocator and is probably why the lookup showed up so prominently;
the patch still knocks the same ~3 seconds off the time:
g++ -g -O3 --std=c++0x test.cc -ltcmalloc

Without patch, with -ltcmalloc
user    0m10.173s

With patch, with -ltcmalloc
user    0m7.288s


---

In terms of the (non-cumulative) CPU cycles spent in the lookup vs. new/delete:

With -ltcmalloc:
Without patch: 44% iter() / 48% memory
With patch:    34% iter() / 63% memory

With stock malloc:
Without patch: 38% iter() / 58% memory
With patch:    18% iter() / 79% memory

So memory is indeed a big cost, but because I was using tcmalloc I could see
that the constructor was surprisingly expensive (about as much time was spent
in the constructor as in allocation or deallocation).

---

TL;DR: The patch works for me, and I think it's a significant win because it
allows the compiler to optimize __prime_list away entirely.

