This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2

From: François Dumont <frs dot dumont at gmail dot com>
To: Paolo Carlini <paolo dot carlini at oracle dot com>
Cc: Jonathan Wakely <jwakely dot gcc at gmail dot com>, "libstdc++ at gcc dot gnu dot org" <libstdc++ at gcc dot gnu dot org>, gcc-patches <gcc-patches at gcc dot gnu dot org>
Date: Fri, 16 Nov 2012 22:58:12 +0100
Subject: Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
References: <bug-54075-19885@http.gcc.gnu.org/bugzilla/> <bug-54075-19885-8bKDxmpStr@http.gcc.gnu.org/bugzilla/> <509ADA7E.7050504@gmail.com> <CAH6eHdRHCefZq1FQF64nJ04-OMgMBb8TeUa5y9Lvi-2H=xE0Ww@mail.gmail.com> <509B1130.7050401@oracle.com> <509B1834.5080106@oracle.com> <509C21B9.3070703@gmail.com> <509CE00A.1040704@oracle.com> <50A2BE5B.3030809@gmail.com> <50A37882.9010405@oracle.com>

Attached patch applied.

2012-11-16 François Dumont <fdumont@gcc.gnu.org>

    * include/bits/hashtable_policy.h (_Prime_rehash_policy): Remove
    automatic shrink.
    (_Prime_rehash_policy::_M_bkt_for_elements): Do not call
    _M_next_bkt anymore.
    (_Prime_rehash_policy::_M_next_bkt): Move usage of
    _S_growth_factor ...
    (_Prime_rehash_policy::_M_need_rehash): ... here.
    * include/bits/hashtable.h (_Hashtable<>): Adapt.
    * testsuite/performance/23_containers/insert_erase/41975.cc: Add
    _USE_TR1 to force build using std::tr1 container.
    * testsuite/performance/23_containers/insert/unordered_set.cc:
    Likewise.
    * testsuite/performance/23_containers/insert/54075.cc: New.

Note that I have used std::random_device in 54075 test. I also preferred to test both std and std::tr1 containers in a single run without relying on a macro check. As long as we try to align std containers performance on tr1 ones it is better maybe later we could limit it to test only std:: container per default.

I have also added several use cases and here is the result:

54075.cc std::tr1::unordered_set without hash code cached 300000 Foo insertions 10r 9u 1s 13765616mem 0pf 54075.cc std::tr1::unordered_set without hash code cached 10 times insertion of 300000 Foo 21r 21u 0s 0mem 0pf 54075.cc std::tr1::unordered_set with hash code cached 300000 Foo insertions 15r 14u 0s 18561984mem 0pf 54075.cc std::tr1::unordered_set with hash code cached 10 times insertion of 300000 Foo 23r 23u 0s 0mem 0pf 54075.cc std::tr1::unordered_multiset without hash code cached 300000 Foo insertions 9r 8u 1s 13761968mem 0pf 54075.cc std::tr1::unordered_multiset without hash code cached 10 times insertion of 300000 Foo 113r 107u 7s 126686848mem 0pf 54075.cc std::tr1::unordered_multiset with hash code cached 300000 Foo insertions 100r 100u 0s 18561984mem 0pf 54075.cc std::tr1::unordered_multiset with hash code cached 10 times insertion of 300000 Foo 111r 107u 4s 174686864mem 0pf 54075.cc std::unordered_set without hash code cached 300000 Foo insertions 9r 8u 1s 13761936mem 0pf 54075.cc std::unordered_set without hash code cached 10 times insertion of 300000 Foo 34r 34u 0s 0mem 0pf 54075.cc std::unordered_set with hash code cached 300000 Foo insertions 10r 10u 1s 18562000mem 0pf 54075.cc std::unordered_set with hash code cached 10 times insertion of 300000 Foo 37r 37u 0s 0mem 0pf 54075.cc std::unordered_multiset without hash code cached 300000 Foo insertions 8r 9u 0s 13761936mem 0pf 54075.cc std::unordered_multiset without hash code cached 10 times insertion of 300000 Foo 34r 34u 0s 0mem 0pf 54075.cc std::unordered_multiset with hash code cached 300000 Foo insertions 10r 10u 0s 18562000mem 0pf 54075.cc std::unordered_multiset with hash code cached 10 times insertion of 300000 Foo 37r 37u 0s 0mem 0pf

There is one surprising result; std::tr1::unordered_multiset is quite bad, I haven't check yet why and it is not my priority, I take it as a good news for the moment. The other important result is this one: 54075.cc std::tr1::unordered_set without hash code cached 10 times insertion of 300000 Foo 21r 21u 0s 0mem 0pf 54075.cc std::unordered_set without hash code cached 10 times insertion of 300000 Foo 34r 34u 0s 0mem 0pf

We can see that inserting the same elements again, that is to say detecting the collisions, is slower in the new implementation. It is the problem I had already signaled in bugzilla entry. In the new implementation when we need to look for bucket elements we have to check the hash code of each element to see if we are still in the same bucket, in the former there was a nullptr to signal the end of the bucket. I think I am going to try to add an empty node to mark the end of the bucket. It will introduce a memory tradeoff of one empty node for each non-empty bucket but it will also limit the impact of caching the hash code or node because we won't need to use it so often.

François

On 11/14/2012 11:54 AM, Paolo Carlini wrote:

Hi,

On 11/13/2012 10:40 PM, François Dumont wrote:
2012-11-13 François Dumont <fdumont@gcc.gnu.org>
    * include/bits/hashtable_policy.h (_Prime_rehash_policy): Remove
    automatic shrink.
    (_Prime_rehash_policy::_M_bkt_for_elements): Do not call
    _M_next_bkt anymore.
    (_Prime_rehash_policy::_M_next_bkt): Move usage of
    _S_growth_factor ...
    (_Prime_rehash_policy::_M_need_rehash): ... here.
    * include/bits/hashtable.h (_Hashtable<>): Adapt.
Tested under linux x86_64, normal and debug modes.
Patch is Ok with me, please wait another day or two for comments and then apply it.

You can also add the performance testcase, of course, note however that ::random isn't a standard C function, thus we shouldn't use it unconditionally, either manage with std::rand from <cstdlib> or do something completely different (maybe even <random> since you are writing C++11 code anyway!). I would also recommend extending quite a bit the runtime, 10x would be still completely sensible (but I guess, without using much more memory)

Paolo.

Index: include/bits/hashtable.h
===================================================================
--- include/bits/hashtable.h	(revision 193575)
+++ include/bits/hashtable.h	(working copy)
@@ -806,11 +806,6 @@
       _M_rehash_policy()
     {
       _M_bucket_count = _M_rehash_policy._M_next_bkt(__bucket_hint);
-
-      // We don't want the rehash policy to ask for the hashtable to
-      // shrink on the first insertion so we need to reset its
-      // previous resize level.
-      _M_rehash_policy._M_prev_resize = 0;
       _M_buckets = _M_allocate_buckets(_M_bucket_count);
     }
 
@@ -834,16 +829,12 @@
 	_M_element_count(0),
 	_M_rehash_policy()
       {
+	auto __nb_elems = __detail::__distance_fw(__f, __l);
 	_M_bucket_count =
-	  _M_rehash_policy._M_bkt_for_elements(__detail::__distance_fw(__f,
-								       __l));
-	if (_M_bucket_count <= __bucket_hint)
-	  _M_bucket_count = _M_rehash_policy._M_next_bkt(__bucket_hint);
+	  _M_rehash_policy._M_next_bkt(
+	    std::max(_M_rehash_policy._M_bkt_for_elements(__nb_elems),
+		     __bucket_hint));
 
-	// We don't want the rehash policy to ask for the hashtable to
-	// shrink on the first insertion so we need to reset its
-	// previous resize level.
-	_M_rehash_policy._M_prev_resize = 0;
 	_M_buckets = _M_allocate_buckets(_M_bucket_count);
 	__try
 	  {
@@ -990,6 +981,7 @@
     __rehash_policy(const _RehashPolicy& __pol)
     {
       size_type __n_bkt = __pol._M_bkt_for_elements(_M_element_count);
+      __n_bkt = __pol._M_next_bkt(__n_bkt);
       if (__n_bkt != _M_bucket_count)
 	_M_rehash(__n_bkt, _M_rehash_policy._M_state());
       _M_rehash_policy = __pol;
@@ -1641,19 +1633,12 @@
     {
       const __rehash_state& __saved_state = _M_rehash_policy._M_state();
       std::size_t __buckets
-	= _M_rehash_policy._M_bkt_for_elements(_M_element_count + 1);
-      if (__buckets <= __n)
-	__buckets = _M_rehash_policy._M_next_bkt(__n);
+	= std::max(_M_rehash_policy._M_bkt_for_elements(_M_element_count + 1),
+		   __n);
+      __buckets = _M_rehash_policy._M_next_bkt(__buckets);
 
       if (__buckets != _M_bucket_count)
-	{
-	  _M_rehash(__buckets, __saved_state);
-
-	  // We don't want the rehash policy to ask for the hashtable to shrink
-	  // on the next insertion so we need to reset its previous resize
-	  // level.
-	  _M_rehash_policy._M_prev_resize = 0;
-	}
+	_M_rehash(__buckets, __saved_state);
       else
 	// No rehash, restore previous state to keep a consistent state.
 	_M_rehash_policy._M_reset(__saved_state);
Index: include/bits/hashtable_policy.h
===================================================================
--- include/bits/hashtable_policy.h	(revision 193575)
+++ include/bits/hashtable_policy.h	(working copy)
@@ -358,7 +358,7 @@
   struct _Prime_rehash_policy
   {
     _Prime_rehash_policy(float __z = 1.0)
-    : _M_max_load_factor(__z), _M_prev_resize(0), _M_next_resize(0) { }
+    : _M_max_load_factor(__z), _M_next_resize(0) { }
 
     float
     max_load_factor() const noexcept
@@ -380,25 +380,21 @@
     _M_need_rehash(std::size_t __n_bkt, std::size_t __n_elt,
 		   std::size_t __n_ins) const;
 
-    typedef std::pair<std::size_t, std::size_t> _State;
+    typedef std::size_t _State;
 
     _State
     _M_state() const
-    { return std::make_pair(_M_prev_resize, _M_next_resize); }
+    { return _M_next_resize; }
 
     void
-    _M_reset(const _State& __state)
-    {
-      _M_prev_resize = __state.first;
-      _M_next_resize = __state.second;
-    }
+    _M_reset(_State __state)
+    { _M_next_resize = __state; }
 
     enum { _S_n_primes = sizeof(unsigned long) != 8 ? 256 : 256 + 48 };
 
     static const std::size_t _S_growth_factor = 2;
 
     float                _M_max_load_factor;
-    mutable std::size_t  _M_prev_resize;
     mutable std::size_t  _M_next_resize;
   };
 
@@ -417,35 +413,28 @@
     static const unsigned char __fast_bkt[12]
       = { 2, 2, 2, 3, 5, 5, 7, 7, 11, 11, 11, 11 };
 
-    const std::size_t __grown_n = __n * _S_growth_factor;
-    if (__grown_n <= 11)
+    if (__n <= 11)
       {
-	_M_prev_resize = 0;
 	_M_next_resize
-	  = __builtin_ceil(__fast_bkt[__grown_n]
+	  = __builtin_ceil(__fast_bkt[__n]
 			   * (long double)_M_max_load_factor);
-	return __fast_bkt[__grown_n];
+	return __fast_bkt[__n];
       }
 
     const unsigned long* __next_bkt
       = std::lower_bound(__prime_list + 5, __prime_list + _S_n_primes,
-			 __grown_n);
-    const unsigned long* __prev_bkt
-      = std::lower_bound(__prime_list + 1, __next_bkt, __n / _S_growth_factor);
-
-    _M_prev_resize
-      = __builtin_floor(*(__prev_bkt - 1) * (long double)_M_max_load_factor);
+			 __n);
     _M_next_resize
       = __builtin_ceil(*__next_bkt * (long double)_M_max_load_factor);
     return *__next_bkt;
   }
 
-  // Return the smallest prime p such that alpha p >= n, where alpha
+  // Return the smallest integer p such that alpha p >= n, where alpha
   // is the load factor.
   inline std::size_t
   _Prime_rehash_policy::
   _M_bkt_for_elements(std::size_t __n) const
-  { return _M_next_bkt(__builtin_ceil(__n / (long double)_M_max_load_factor)); }
+  { return __builtin_ceil(__n / (long double)_M_max_load_factor); }
 
   // Finds the smallest prime p such that alpha p > __n_elt + __n_ins.
   // If p > __n_bkt, return make_pair(true, p); otherwise return
@@ -467,7 +456,8 @@
 				 / (long double)_M_max_load_factor;
 	if (__min_bkts >= __n_bkt)
 	  return std::make_pair(true,
-				_M_next_bkt(__builtin_floor(__min_bkts) + 1));
+	    _M_next_bkt(std::max<std::size_t>(__builtin_floor(__min_bkts) + 1,
+					      __n_bkt * _S_growth_factor)));
 	else
 	  {
 	    _M_next_resize
@@ -475,13 +465,6 @@
 	    return std::make_pair(false, 0);
 	  }
       }
-    else if (__n_elt + __n_ins < _M_prev_resize)
-      {
-	long double __min_bkts = (__n_elt + __n_ins)
-				 / (long double)_M_max_load_factor;
-	return std::make_pair(true,
-			      _M_next_bkt(__builtin_floor(__min_bkts) + 1));
-      }
     else
       return std::make_pair(false, 0);
   }
Index: testsuite/performance/23_containers/insert_erase/41975.cc
===================================================================
--- testsuite/performance/23_containers/insert_erase/41975.cc	(revision 193575)
+++ testsuite/performance/23_containers/insert_erase/41975.cc	(working copy)
@@ -18,7 +18,11 @@
 // <http://www.gnu.org/licenses/>.
 
 #include <sstream>
-#include <unordered_set>
+#ifdef _USE_TR1
+#  include <tr1/unordered_set>
+#else
+#  include <unordered_set>
+#endif
 #include <testsuite_performance.h>
 
 namespace
@@ -40,11 +44,17 @@
       const int nb = 200000;
       start_counters(time, resource);
 
+#ifdef _USE_TR1
+      std::tr1::__unordered_set<int, std::hash<int>, std::equal_to<int>,
+				std::allocator<int>,
+			       	use_cache> us;
+#else
       std::__uset_hashtable<int, std::hash<int>, std::equal_to<int>,
 			    std::allocator<int>,
 			    std::__uset_traits<use_cache>> us;
+#endif
       for (int i = 0; i != nb; ++i)
-	us.insert(i);
+	  us.insert(i);
 
       stop_counters(time, resource);
       ostr.str("");
@@ -126,10 +136,17 @@
 
       start_counters(time, resource);
 
+#ifdef _USE_TR1
+      std::tr1::__unordered_set<std::string, std::hash<std::string>,
+				std::equal_to<std::string>,
+				std::allocator<std::string>,
+				use_cache> us;
+#else
       std::__uset_hashtable<std::string, std::hash<std::string>,
 			    std::equal_to<std::string>,
 			    std::allocator<std::string>,
 			    std::__uset_traits<use_cache>> us;
+#endif
       for (int i = 0; i != nb; ++i)
 	us.insert(strs[i]);
 
Index: testsuite/performance/23_containers/insert/unordered_set.cc
===================================================================
--- testsuite/performance/23_containers/insert/unordered_set.cc	(revision 193575)
+++ testsuite/performance/23_containers/insert/unordered_set.cc	(working copy)
@@ -17,8 +17,17 @@
 
 // { dg-options "-std=c++11" }
 
-#include <unordered_set>
 #include <testsuite_performance.h>
+#include <sstream>
+#ifdef _USE_TR1
+#  include <tr1/unordered_set>
+using namespace std::tr1;
+const char* ns = "std::tr1::";
+#else
+#  include<unordered_set>
+using namespace std;
+const char* ns = "std::";
+#endif
 
 int main()
 {
@@ -29,14 +38,16 @@
 
   const int sz = 10000000;
 
-  std::unordered_set<int> s;
+  unordered_set<int> s;
   start_counters(time, resource);
 
   for (int i = 0; i != sz ; ++i)
     s.insert(i);
 
   stop_counters(time, resource);
-  report_performance(__FILE__, "unordered_set 10000000 insertions",
+  std::ostringstream ostr;
+  ostr << ns << "unordered_set " << sz << " insertions";
+  report_performance(__FILE__, ostr.str().c_str(),
 		     time, resource);
   return 0;
 }
Index: testsuite/performance/23_containers/insert/54075.cc
===================================================================
--- testsuite/performance/23_containers/insert/54075.cc	(revision 0)
+++ testsuite/performance/23_containers/insert/54075.cc	(revision 0)
@@ -0,0 +1,132 @@
+// Copyright (C) 2012 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-options "-std=c++11" }
+
+#include <testsuite_performance.h>
+#include <random>
+#include <sstream>
+#include <tr1/unordered_set>
+#include<unordered_set>
+
+struct Foo
+{
+  typedef std::random_device::result_type _Type;
+  _Type bar;
+  _Type baz;
+  _Type meh;
+
+  void
+  init(std::random_device& randev)
+  {
+    bar = randev();
+    baz = randev();
+    meh = randev();
+  }
+
+  std::size_t
+  hash() const noexcept
+  { return std::size_t(bar ^ baz ^ meh); }
+
+  inline bool
+  operator==(const Foo& other) const
+  { return other.bar == bar && other.baz == baz && other.meh == meh; }
+};
+
+struct HashFunction
+{
+  template<typename T>
+    std::size_t operator()(const T& t) const noexcept
+    { return t.hash(); }
+};
+
+template<typename _ContType>
+  void bench(const char* container_desc)
+  {
+    using namespace __gnu_test;
+
+    time_counter time;
+    resource_counter resource;
+
+    const int sz = 300000;
+
+    Foo foos[sz];
+    {
+      std::random_device randev;
+      for (int i = 0; i != sz; ++i)
+	foos[i].init(randev);
+    }
+
+    _ContType s;
+    start_counters(time, resource);
+
+    for (int i = 0; i != sz ; ++i)
+      s.insert(foos[i]);
+
+    stop_counters(time, resource);
+    std::ostringstream ostr;
+    ostr << container_desc << sz << " Foo insertions";
+    report_performance(__FILE__, ostr.str().c_str(), time, resource);
+
+    // Try to insert again to check performance of collision detection
+    
+    const int nb_loop = 10;
+    start_counters(time, resource);
+
+    for (int j = 0; j != nb_loop; ++j)
+      for (int i = 0; i != sz; ++i)
+	s.insert(foos[i]);
+
+    stop_counters(time, resource);
+    ostr.str("");
+    ostr << container_desc << nb_loop << " times insertion of "
+	 << sz << " Foo";
+    report_performance(__FILE__, ostr.str().c_str(), time, resource);
+  }
+
+template<bool cache>
+  using __tr1_uset = std::tr1::__unordered_set<Foo, HashFunction,
+					       std::equal_to<Foo>,
+					       std::allocator<Foo>,
+					       cache>;
+template<bool cache>
+  using __tr1_umset = std::tr1::__unordered_multiset<Foo, HashFunction,
+						     std::equal_to<Foo>,
+						     std::allocator<Foo>,
+						     cache>;
+template<bool cache>
+  using __uset = std::__uset_hashtable<Foo, HashFunction,
+				       std::equal_to<Foo>,
+				       std::allocator<Foo>,
+				       std::__uset_traits<cache>>;
+template<bool cache>
+  using __umset = std::__umset_hashtable<Foo, HashFunction,
+					 std::equal_to<Foo>,
+					 std::allocator<Foo>,
+					 std::__uset_traits<cache>>;
+
+int main()
+{
+  bench<__tr1_uset<false>>("std::tr1::unordered_set without hash code cached ");
+  bench<__tr1_uset<true>>("std::tr1::unordered_set with hash code cached ");
+  bench<__tr1_umset<false>>("std::tr1::unordered_multiset without hash code cached ");
+  bench<__tr1_umset<true>>("std::tr1::unordered_multiset with hash code cached ");
+  bench<__uset<false>>("std::unordered_set without hash code cached ");
+  bench<__uset<true>>("std::unordered_set with hash code cached ");
+  bench<__umset<false>>("std::unordered_multiset without hash code cached ");
+  bench<__umset<true>>("std::unordered_multiset with hash code cached ");
+}

Follow-Ups:
- Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
  - From: Paolo Carlini

References:
- Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
  - From: FranÃois Dumont
- Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
  - From: Jonathan Wakely
- Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
  - From: François Dumont
- Re: [Bug libstdc++/54075] [4.7.1] unordered_map insert still slower than 4.6.2
  - From: Paolo Carlini

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]