Created attachment 35192
Small benchmark for our unordered_map change

Hi,

we have been using std::unordered_map with a pointer as the key in one of our applications, and analysis showed that the find() function is one of two performance bottlenecks. Further analysis showed that about 40% of the total application runtime is spent in a single x86 divq instruction coming from std::__detail::_Mod_range_hashing. We think that always using a modulo operation (translated to the x86 divq instruction) is suboptimal, and we have attached a simple example to show the benefit of replacing the modulo operation with masking.

Example code (attachment)
-------------------------
We specialized the _Hashtable template to insert our own implementation of __detail::_Mod_range_hashing. The attached code should only be considered a demo of the possible performance increase, not a good solution.

Benchmark
---------
The example does 50,000,000 emplace and 50,000,000 find operations on an unordered_map. The test system is an Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz using gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6).

Here are the performance results for the current implementation:

$ g++ -Wall -Wextra -O3 -std=c++11 umap_test.cpp && ./a.out
runtime(s) emplace = 3.09947
runtime(s) find = 6.67535

Here is our optimization:

$ g++ -Wall -Wextra -O3 -std=c++11 -DLESSDIV umap_test.cpp && ./a.out
runtime(s) emplace = 2.21004
runtime(s) find = 2.77398

Related work
------------
Facebook's folly uses an approach similar to ours, but relies on a fixed bucket count. libcxx uses masking to compute the bucket number only if the number of buckets is a power of two.

Getting the change upstream
---------------------------
If there is any interest we would be happy to help out, but we are afraid that it requires an ABI change, as we must store a mask for every unordered_map (unless using libcxx's approach).
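For readers without the attachment, the replacement described in this report can be sketched as follows. This is not the attached benchmark code: the functor names mod_range_hashing and mask_range_hashing are hypothetical, and the sketch only assumes that the range-hashing step maps a hash value and a bucket count to a bucket index.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a modulo-based range-hashing functor.
// Valid for any bucket count n > 0, but needs an integer division.
struct mod_range_hashing {
    std::size_t operator()(std::size_t hash, std::size_t n) const noexcept {
        return hash % n;
    }
};

// Hypothetical masking variant. Only valid when n is a power of two,
// in which case hash & (n - 1) equals hash % n.
struct mask_range_hashing {
    std::size_t operator()(std::size_t hash, std::size_t n) const noexcept {
        assert(n != 0 && (n & (n - 1)) == 0);  // precondition: power of two
        return hash & (n - 1);
    }
};
```

For a bucket count of 1024 both functors return the same index; for a bucket count such as 1000 only the modulo variant is applicable.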
An ABI change is not an option, although an alternative functor could be provided as an optional extension. There was a related thread a year ago starting at https://gcc.gnu.org/ml/libstdc++/2014-03/msg00024.html
Thanks for the link. I am not sure there is really any benefit in using libdivide instead of masking. I'll attach a first version of a patch in which the functor stores the mask. Any comments are welcome; I am not familiar with the library.

Another possible solution would be to allow the number of buckets to be a power of two, as one can easily compute the mask for such cases. This could be triggered by the user explicitly calling rehash() with a power of two as the parameter. Any later increase in the number of buckets would then only go to another power of two. _Mod_range_hashing could check if the number of buckets is a power of two and use masking in that case. This would not require an ABI change. Any chance of getting such a change upstream?

As far as I can see, there seems to be no easy way to have the unordered_map use our folding functor instead of _Mod_range_hashing, or am I missing something?
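The power-of-two idea from this comment could look roughly like the sketch below. It is an illustration only, not the patch mentioned above and not libstdc++ code; the helper names is_power_of_two, bucket_index and next_power_of_two are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// True if n is a non-zero power of two.
inline bool is_power_of_two(std::size_t n) noexcept {
    return n != 0 && (n & (n - 1)) == 0;
}

// Bucket selection without stored state: mask when the bucket count is a
// power of two, otherwise fall back to the modulo.
inline std::size_t bucket_index(std::size_t hash, std::size_t n) noexcept {
    assert(n != 0);
    return is_power_of_two(n) ? (hash & (n - 1)) : (hash % n);
}

// Growth rule for the power-of-two mode: smallest power of two >= n.
// Assumes n does not exceed the largest power of two representable
// in std::size_t.
inline std::size_t next_power_of_two(std::size_t n) noexcept {
    std::size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```

Because the check is derived from the bucket count itself, no mask has to be stored in the container. The cost is one extra test and branch per lookup.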
(In reply to Jens Breitbart from comment #2)
> Another possible solution would be to allow the number of buckets to be a
> power of two, as one can easily compute the mask for such cases. This could
> be triggered by the user explicitly calling rehash() with a power of two as
> the parameter. Increasing the number of buckets would only increase to
> another power of two. _Mod_range_hashing could check if the number of
> buckets is a power of two and use masking in that case. This would not
> require an ABI change.

That sounds promising, and worth pursuing.

> Any chance of getting such a change upstream?

I don't see why not. However, unless you have a GCC copyright assignment on file, or plan to get one (immediately, since it can take a while), it's better *not* to give us a patch, because we can't use it anyway and there can be no danger of using your code if we don't see it!

> As far as I can see, there
> seems to be no easy way to have the unordered_map use our folding functor
> instead of _Mod_range_hashing or am I missing something?

I think you would need to use the _Hashtable class template directly, rather than via std::unordered_map. In theory that allows you to re-use the internals with different policies, but in practice it's not very easy.
Currently, the only implemented policy uses primes from a hard-coded list for the number of buckets. This makes it easy to precompute (and hard-code in the library) anything that may help to speed up the modulo computation.

With a number of buckets that is a power of 2, the modulo computation becomes trivial (masking). However, the simplistic specialization of std::hash for pointers in libstdc++ means that every double* hashes to a multiple of 8, so masking with a power-of-two bucket count always leaves the low 3 bits of the bucket index at zero. We would therefore need to add some scrambling somewhere to avoid leaving most buckets empty in unordered_set<double*>.
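The empty-bucket effect can be reproduced with a small sketch. The function identity_hash models the pointer hash as described in this comment (the address used directly as the hash value). The function folded_hash is a hypothetical and deliberately weak xor-shift scrambler chosen only for illustration; a real implementation would likely want a stronger mixer. The sketch assumes 8-byte doubles and a flat address space.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Models the pointer hash described above: the address is the hash.
inline std::size_t identity_hash(const void* p) noexcept {
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Hypothetical scrambling step: fold the address bits above the 8-byte
// alignment down onto the low bits before masking.
inline std::size_t folded_hash(const void* p) noexcept {
    const std::uintptr_t x = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>(x ^ (x >> 3));
}

// Counts how many of n buckets (n a power of two) are hit when the
// addresses of `count` consecutive doubles are hashed and masked.
template <typename Hash>
std::size_t buckets_used(const double* first, std::size_t count,
                         std::size_t n, Hash hash) {
    std::vector<bool> used(n, false);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t b = hash(first + i) & (n - 1);
        if (!used[b]) {
            used[b] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

For 1024 consecutive doubles and 1024 buckets, identity_hash reaches only 128 buckets (one in eight), while folded_hash reaches all 1024 for this particular access pattern.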