
[Bug,libstdc++/54075,4.7.1] unordered_map insert still slower than 4.6.2

Message ID 50A2BE5B.3030809@gmail.com
State New

Commit Message

François Dumont Nov. 13, 2012, 9:40 p.m. UTC
Here is the proposal to remove the shrinking feature from the hash
policy. I have also considered your remark regarding the usage of
lower_bound, so _M_bkt_for_elements doesn't call _M_next_bkt (which
calls lower_bound) anymore. For 2 of the 3 call sites it was only a
source of redundant lower_bound invocations; at the last one I now call
_M_next_bkt explicitly.
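
For readers following along, here is a minimal standalone sketch of that
split, assuming a shortened prime table and a fixed max load factor of
1.0; the real policy uses the full __prime_list and __fast_bkt tables in
hashtable_policy.h, and all names below are illustrative rather than the
library's:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <iterator>

    // Stand-ins for the real tables and policy data in hashtable_policy.h.
    static const unsigned long prime_list[] =
      { 2, 3, 5, 7, 11, 13, 17, 23, 29, 37, 53, 97, 193, 389, 769 };
    static const long double max_load_factor = 1.0L;

    // After the patch: only the minimum bucket count for n elements, with no
    // rounding to a prime (and therefore no lower_bound call).
    std::size_t bkt_for_elements(std::size_t n)
    { return std::ceil(n / max_load_factor); }

    // Rounding up to a prime now happens in a single place.
    std::size_t next_bkt(std::size_t n)
    {
      // Assumes n does not exceed the last table entry.
      const unsigned long* p =
        std::lower_bound(std::begin(prime_list), std::end(prime_list), n);
      return *p;
    }

    // Call sites combine the two explicitly, e.g. for the range constructor:
    //   bucket_count = next_bkt(std::max(bkt_for_elements(nb_elems), bucket_hint));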

2012-11-13  François Dumont  <fdumont@gcc.gnu.org>

     * include/bits/hashtable_policy.h (_Prime_rehash_policy): Remove
     automatic shrink.
     (_Prime_rehash_policy::_M_bkt_for_elements): Do not call
     _M_next_bkt anymore.
     (_Prime_rehash_policy::_M_next_bkt): Move usage of
     _S_growth_factor ...
     (_Prime_rehash_policy::_M_need_rehash): ... here.
     * include/bits/hashtable.h (_Hashtable<>): Adapt.

Tested under linux x86_64, normal and debug modes.

Regarding performance, I have made a small evolution of the 54075.cc
test proposed last time. It now checks performance with and without
caching of the hash code. The results are:

54075.cc  std::unordered_set      300000 Foo insertions without cache   9r   9u   0s  13765616mem   0pf
54075.cc  std::unordered_set      300000 Foo insertions with cache     14r  13u   0s  18562064mem   0pf
54075.cc  std::tr1::unordered_set 300000 Foo insertions without cache   9r   8u   1s  13765616mem   0pf
54075.cc  std::tr1::unordered_set 300000 Foo insertions with cache     14r  13u   0s  18561952mem   0pf
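
For reference, a minimal sketch of the kind of timing comparison behind
these numbers, assuming an illustrative Foo key and hash functor; the
actual 54075.cc testcase uses the libstdc++ performance-test harness and
its own counters, so everything below is an approximation, not the test
itself:

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <tr1/unordered_set>
    #include <unordered_set>

    struct Foo
    {
      int i;
      bool operator==(const Foo& other) const { return i == other.i; }
    };

    struct FooHash
    {
      std::size_t operator()(const Foo& f) const { return std::hash<int>()(f.i); }
    };

    // Time n insertions into the given set type and return milliseconds.
    template<typename Set>
    long insert_ms(std::size_t n)
    {
      Set s;
      auto start = std::chrono::steady_clock::now();
      for (std::size_t i = 0; i != n; ++i)
        s.insert(Foo{static_cast<int>(i)});
      auto stop = std::chrono::steady_clock::now();
      return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main()
    {
      const std::size_t n = 300000;
      std::cout << "std::unordered_set      "
                << insert_ms<std::unordered_set<Foo, FooHash>>(n) << "ms\n"
                << "std::tr1::unordered_set "
                << insert_ms<std::tr1::unordered_set<Foo, FooHash>>(n) << "ms\n";
    }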

     So the performance difference in this case seems to come only from
whether the hash code is cached. In the reported use case the default
behavior of std::unordered_set is to cache hash codes while
std::tr1::unordered_set does not. We should perhaps review the default
behavior regarding caching of the hash code: perhaps cache it if the
hash functor can throw and not cache it otherwise, though it is not easy
to tell what's best.

François


On 11/09/2012 11:50 AM, Paolo Carlini wrote:
> Hi again,
>
> +    // To get previous bound we use _S_growth * 2 to avoid 
> oscillations in the
> +    // number of buckets when looping on insertion/removal of elements.
>      const unsigned long* __prev_bkt
> -      = std::lower_bound(__prime_list + 1, __next_bkt, __n / 
> _S_growth_factor);
> +      = std::lower_bound(__prime_list + 1, __next_bkt,
> +             __n / _S_growth_factor / _S_growth_factor);
>
> Looks like here you are dividing by _S_growth_factor ^ 2? Is that
> intended? But anyway, in my opinion the very _M_prev_resize idea of
> mine (thus rehash shrinking, right?) is proving quite fragile, and we
> also got negative comments from users about shrinking, which is new.
> Before pursuing it further, I think we should double check what the
> other implementations do: are they also shrinking? Because otherwise
> we could, at least for the time being, remove the related bits and
> save ourselves many headaches...
>
> Paolo.
>

Comments

Paolo Carlini Nov. 13, 2012, 10:53 p.m. UTC | #1
Hi,

On 11/13/2012 10:40 PM, François Dumont wrote:
> Here is the proposal to remove shrinking feature from hash policy. I 
> have also considered your remark regarding usage of lower_bound so 
> _M_bkt_for_elements doesn't call _M_next_bkt (calling lower_bound) 
> anymore. For 2 of the 3 calls it was only a source of redundant 
> lower_bound invocations, in the last case I call _M_next_bkt explicitly.
>
> 2012-11-13  François Dumont  <fdumont@gcc.gnu.org>
>
>     * include/bits/hashtable_policy.h (_Prime_rehash_policy): Remove
>     automatic shrink.
>     (_Prime_rehash_policy::_M_bkt_for_elements): Do not call
>     _M_next_bkt anymore.
>     (_Prime_rehash_policy::_M_next_bkt): Move usage of
>     _S_growth_factor ...
>     (_Prime_rehash_policy::_M_need_rehash): ... here.
>     * include/bits/hashtable.h (_Hashtable<>): Adapt.
>
> Tested under linux x86_64, normal and debug modes.
Thanks. At first blush the patch looks good, but please give us a few
days to analyze its details; we don't want to make mistakes for 4.8.
> Regarding performance, I have done a small evolution of the 54075.cc 
> test proposed last time. It is now checking performance with and 
> without cache of hash code. Result is:
>
> 54075.cc                     std::unordered_set 300000 Foo insertions 
> without cache       9r    9u    0s 13765616mem    0pf
> 54075.cc                     std::unordered_set 300000 Foo insertions 
> with cache      14r   13u    0s 18562064mem    0pf
> 54075.cc                     std::tr1::unordered_set 300000 Foo 
> insertions without cache       9r    8u    1s 13765616mem    0pf
> 54075.cc                     std::tr1::unordered_set 300000 Foo 
> insertions with cache      14r   13u    0s 18561952mem    0pf
>
>     So the difference of performance in this case only seems to come 
> from caching the hash code or not. In reported use case default 
> behavior of std::unordered_set is to cache hash codes and 
> std::tr1::unordered_set not to cache it. We should perhaps review 
> default behavior regarding caching the hash code. Perhaps cache it if 
> the hash functor can throw and not cache it otherwise, not easy to 
> find out what's best to do.
Ah good. I think we have finally nailed the core performance issue. And,
as it turns out, I'm a bit confused about the logic we have in place now
for the defaults: can you please summarize what we are doing and what
the trade-offs are (leaving out the technicalities having to do with the
final types)? I think the most interesting cases are these three:

     1- std::hash<int>
     2- std::hash<std::string>
     3- user_defined_hash<xxx> which cannot throw

In the first case we should normally not cache; in the second, from a
performance point of view it would be better to cache (from the
exception-safety point of view we could do either, because
std::hash<std::string> doesn't throw anyway); the third case is rather
tricky because, like the std::string case, from the exception-safety
point of view we could do either, so it's purely a performance issue. Do
I understand correctly that currently we handle 2- and 3- above in the
same way, thus we cache? It seems to me that whereas that kind of
default makes a lot of sense for std::string, it doesn't necessarily
make sense for everything else, and such a default makes suboptimal use
of the knowledge we have via __is_noexcept_hash that the functor doesn't
throw. That seems instead a sort of user hint not to cache! Given the
unfortunate situation that the user has no way to explicitly pick a
behavior when instantiating the container, we can imagine that he can
anyway provide a strong if indirect hint by decorating the call operator
with noexcept or not. We could even document that as part of our
implementation-defined behavior. How does it sound? Do we have a way to
figure out what other implementations are doing? Outside std::hash, it
should be pretty easy to instantiate with a special functor which
internally keeps a counter... if we have evidence that the other best
implementations don't cache for 3- we should definitely do the same.

To summarize, my intuitions are (again, leaving out the final
technicalities):

     a- std::hash specializations for scalar types -> no cache
     b- std::hash specialization for std::string (or maybe everything
else, for simplicity) -> cache
     c- user-defined functor -> cache or not based on __is_noexcept_hash

Jon?

Thanks!
Paolo.
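
For context, __is_noexcept_hash amounts to asking whether invoking the
hash functor on a key is declared noexcept. A minimal sketch of such a
trait (the real one lives in bits/hashtable_policy.h with internal
naming; the two functors below are made up purely for illustration):

    #include <cstddef>
    #include <string>
    #include <type_traits>
    #include <utility>

    // true iff calling Hash on a const Key lvalue is declared noexcept.
    template<typename Key, typename Hash>
    struct is_noexcept_hash
      : std::integral_constant<bool,
          noexcept(std::declval<const Hash&>()(std::declval<const Key&>()))>
    { };

    struct ThrowingStringHash
    { std::size_t operator()(const std::string&) const; };

    struct NothrowStringHash
    { std::size_t operator()(const std::string&) const noexcept; };

    static_assert(!is_noexcept_hash<std::string, ThrowingStringHash>::value,
                  "may throw: caching is the safe choice");
    static_assert( is_noexcept_hash<std::string, NothrowStringHash>::value,
                   "cannot throw: caching is purely a performance question");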
Paolo Carlini Nov. 13, 2012, 11:11 p.m. UTC | #2
On 11/13/2012 11:53 PM, Paolo Carlini wrote:
> To summarize my intuitions are (again, leaving out the final 
> technicalities)
>
>     a- std::hash specializations for scalar types -> no cache
>     b- std::hash specialization for std::string (or maybe
> everything else, for simplicity) -> cache
>     c- user-defined functor -> cache or not based on __is_noexcept_hash
Alternatively, if we want to stress the consistency of our behavior,
just a single rule: __is_noexcept_hash. That means I expect that b-
above would normally be penalized performance-wise, but we can document
the behavior: the user can simply instantiate with a user-defined hash
functor which doesn't have noexcept on the call operator and just
forwards to std::hash<std::string>, thus switching the container to
caching.

Paolo.
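
A minimal sketch of the opt-in Paolo describes, assuming a
noexcept-driven caching rule: a thin user-defined functor that forwards
to std::hash<std::string> but deliberately leaves noexcept off the call
operator, so the container would fall back to caching. The names are
illustrative:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_set>

    // Deliberately *not* noexcept: under a rule keyed on __is_noexcept_hash
    // this signals the container to keep caching the hash code, while still
    // reusing the standard string hash.
    struct CachingStringHash
    {
      std::size_t operator()(const std::string& s) const
      { return std::hash<std::string>()(s); }
    };

    using cached_string_set = std::unordered_set<std::string, CachingStringHash>;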
Paolo Carlini Nov. 14, 2012, 10:54 a.m. UTC | #3
Hi,

On 11/13/2012 10:40 PM, François Dumont wrote:
> 2012-11-13 François Dumont  <fdumont@gcc.gnu.org>
>
>     * include/bits/hashtable_policy.h (_Prime_rehash_policy): Remove
>     automatic shrink.
>     (_Prime_rehash_policy::_M_bkt_for_elements): Do not call
>     _M_next_bkt anymore.
>     (_Prime_rehash_policy::_M_next_bkt): Move usage of
>     _S_growth_factor ...
>     (_Prime_rehash_policy::_M_need_rehash): ... here.
>     * include/bits/hashtable.h (_Hashtable<>): Adapt.
>
> Tested under linux x86_64, normal and debug modes.
Patch is Ok with me, please wait another day or two for comments and 
then apply it.

You can also add the performance testcase, of course. Note however that
::random isn't a standard C function, so we shouldn't use it
unconditionally: either manage with std::rand from <cstdlib> or do
something completely different (maybe even <random>, since you are
writing C++11 code anyway!). I would also recommend extending the
runtime quite a bit; 10x would still be completely sensible (but, I
guess, without using much more memory).

Paolo.
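
A minimal sketch of the <random> route Paolo suggests for generating the
testcase's input, assuming uniformly distributed ints are sufficient;
the seed and value range below are illustrative:

    #include <cstddef>
    #include <random>
    #include <vector>

    // Generate n pseudo-random ints without relying on the non-standard ::random.
    std::vector<int> make_input(std::size_t n)
    {
      std::mt19937 gen(1u);                              // fixed seed for reproducible runs
      std::uniform_int_distribution<int> dist(0, 1023);  // value range is illustrative
      std::vector<int> values;
      values.reserve(n);
      for (std::size_t i = 0; i != n; ++i)
        values.push_back(dist(gen));
      return values;
    }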
François Dumont Nov. 14, 2012, 9:27 p.m. UTC | #4
On 11/13/2012 11:53 PM, Paolo Carlini wrote:
>> Regarding performance, I have done a small evolution of the 54075.cc
>> test proposed last time. It is now checking performance with and
>> without cache of hash code. Result is:
>>
>> 54075.cc                     std::unordered_set 300000 Foo insertions 
>> without cache       9r    9u    0s 13765616mem    0pf
>> 54075.cc                     std::unordered_set 300000 Foo insertions 
>> with cache      14r   13u    0s 18562064mem    0pf
>> 54075.cc                     std::tr1::unordered_set 300000 Foo 
>> insertions without cache       9r    8u    1s 13765616mem    0pf
>> 54075.cc                     std::tr1::unordered_set 300000 Foo 
>> insertions with cache      14r   13u    0s 18561952mem    0pf
>>
>>     So the difference of performance in this case only seems to come 
>> from caching the hash code or not. In reported use case default 
>> behavior of std::unordered_set is to cache hash codes and 
>> std::tr1::unordered_set not to cache it. We should perhaps review 
>> default behavior regarding caching the hash code. Perhaps cache it if 
>> the hash functor can throw and not cache it otherwise, not easy to 
>> find out what's best to do.
> Ah good. I think we finally have nailed the core performance issue. 
> And, as it turns out, I'm a bit confused about the logic we have in 
> place now for the defaults: can you please summarize what we are doing 
> and which are the trade offs (leaving out the technicalities having to 
> do with the final types)?
We do not cache if the following conditions are all met:
- the key type is an integral type
- the hash functor is empty and not final
- the hash functor doesn't throw

As you can see, we cache in most cases.

> I think the most interesting are three:
>
>     1- std::hash<int>
>     2- std::hash<std::string>
>     3- user_defined_hash<xxx> which cannot throw
>
> In the first we should normally not cache; in the second, from a 
> performance point of view (from the exception safety point of view we 
> could do both, because std::hash<std::string> doesn't throw anyway) it 
> would be better to cache; the third case is rather tricky, because, 
> like the case of std::string, from the exception safety point of view 
> we could do both, thus it's purely a performance issue. Do I 
> understand correctly that currently we handle 2- and 3- above in the 
> same way, thus we cache?
Yes, because the key types are not integral.
> It seems to me that whereas that kind of default makes a lot of sense 
> for std::string, doesn't necessarily make sense for everything else, 
> and it seems to me that such kind of default makes a suboptimal use of 
> the knowledge we have via __is_noexcept_hash that the functor doesn't 
> throw. That seems instead a sort of user-hint to not cache! Given the 
> unfortunate situation that the user has no way to explicitly pick a 
> behavior when instantiating the container, we can imagine that he can 
> anyway provide a strong if indirect hint by decorating or not with 
> noexcept the call operator. We could even document that as part of our 
> implementation defined behavior. How does it sound? Do we have a way 
> to figure out what other implementations are doing? Outside std::hash, 
> it should be pretty easy to instantiate with a special functor which 
> internally keeps a counter... if we have evidence that the other best 
> implementations don't cache for 3- we should definitely do the same.
>
> To summarize my intuitions are (again, leaving out the final 
> technicalities)
>
>     a- std::hash specializations for scalar types -> no cache
>     b- std::hash specialization for std::string (or maybe everything 
> else, for simplicity) -> cache
>     c- user defined functor -> cache or not basing on __is_noexcept_hash
     I don't understand why we would make a distinction between a
std::hash specialization and a user-defined functor, especially because
a hash specialization can throw. I like the idea of caching based on
noexcept, as it's the only way users can tweak this behavior. Of course
it will mean that we will need to check for std::string explicitly.

François
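
Putting the three conditions François lists above together, a rough
approximation of the no-cache decision as described. This is not the
literal library code; in particular C++11 has no std::is_final, so the
"not final" part relies on a compiler builtin in the real code and is
only noted in a comment here:

    #include <type_traits>
    #include <utility>

    // Rough approximation of the conditions listed above; the real check also
    // requires the functor not to be final, which needs a compiler builtin in
    // C++11 and is omitted here.
    template<typename Key, typename Hash>
    struct hash_code_not_cached
      : std::integral_constant<bool,
          std::is_integral<Key>::value              // integral key type
          && std::is_empty<Hash>::value             // empty hash functor
          && noexcept(std::declval<const Hash&>()
                        (std::declval<const Key&>()))>  // non-throwing functor
    { };

    // E.g. an int key with the default std::hash<int> typically satisfies all
    // three conditions, so the hash code is not cached; a std::string key
    // fails the first one, so the hash code is cached.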
Paolo Carlini Nov. 14, 2012, 10:48 p.m. UTC | #5
Hi,

On 11/14/2012 10:27 PM, François Dumont wrote:
> We do not cache if the following conditions are all met:
> - key type is an integral
> - hash functor is empty and not final
> - hash functor doesn't throw
Can somebody remind me why *exactly* we have a condition having to do
with the empty-ness of the functor? I don't really understand it (and
it's always tricky vs final). Otherwise I think we are coming to the
conclusion that we could simply keep only the last condition, possibly
with the string (and wstring, etc.) special cases for the std::hash
specializations, plus documentation explaining that the user can direct
the behavior wrt caching via noexcept on the call operator.

Paolo.

Patch

Index: include/bits/hashtable.h
===================================================================
--- include/bits/hashtable.h	(revision 193484)
+++ include/bits/hashtable.h	(working copy)
@@ -806,11 +806,6 @@ 
       _M_rehash_policy()
     {
       _M_bucket_count = _M_rehash_policy._M_next_bkt(__bucket_hint);
-
-      // We don't want the rehash policy to ask for the hashtable to
-      // shrink on the first insertion so we need to reset its
-      // previous resize level.
-      _M_rehash_policy._M_prev_resize = 0;
       _M_buckets = _M_allocate_buckets(_M_bucket_count);
     }
 
@@ -834,16 +829,12 @@ 
 	_M_element_count(0),
 	_M_rehash_policy()
       {
+	auto __nb_elems = __detail::__distance_fw(__f, __l);
 	_M_bucket_count =
-	  _M_rehash_policy._M_bkt_for_elements(__detail::__distance_fw(__f,
-								       __l));
-	if (_M_bucket_count <= __bucket_hint)
-	  _M_bucket_count = _M_rehash_policy._M_next_bkt(__bucket_hint);
+	  _M_rehash_policy._M_next_bkt(
+	    std::max(_M_rehash_policy._M_bkt_for_elements(__nb_elems),
+		     __bucket_hint));
 
-	// We don't want the rehash policy to ask for the hashtable to
-	// shrink on the first insertion so we need to reset its
-	// previous resize level.
-	_M_rehash_policy._M_prev_resize = 0;
 	_M_buckets = _M_allocate_buckets(_M_bucket_count);
 	__try
 	  {
@@ -990,6 +981,7 @@ 
     __rehash_policy(const _RehashPolicy& __pol)
     {
       size_type __n_bkt = __pol._M_bkt_for_elements(_M_element_count);
+      __n_bkt = __pol._M_next_bkt(__n_bkt);
       if (__n_bkt != _M_bucket_count)
 	_M_rehash(__n_bkt, _M_rehash_policy._M_state());
       _M_rehash_policy = __pol;
@@ -1641,19 +1633,12 @@ 
     {
       const __rehash_state& __saved_state = _M_rehash_policy._M_state();
       std::size_t __buckets
-	= _M_rehash_policy._M_bkt_for_elements(_M_element_count + 1);
-      if (__buckets <= __n)
-	__buckets = _M_rehash_policy._M_next_bkt(__n);
+	= std::max(_M_rehash_policy._M_bkt_for_elements(_M_element_count + 1),
+		   __n);
+      __buckets = _M_rehash_policy._M_next_bkt(__buckets);
 
       if (__buckets != _M_bucket_count)
-	{
-	  _M_rehash(__buckets, __saved_state);
-
-	  // We don't want the rehash policy to ask for the hashtable to shrink
-	  // on the next insertion so we need to reset its previous resize
-	  // level.
-	  _M_rehash_policy._M_prev_resize = 0;
-	}
+	_M_rehash(__buckets, __saved_state);
       else
 	// No rehash, restore previous state to keep a consistent state.
 	_M_rehash_policy._M_reset(__saved_state);
Index: include/bits/hashtable_policy.h
===================================================================
--- include/bits/hashtable_policy.h	(revision 193484)
+++ include/bits/hashtable_policy.h	(working copy)
@@ -358,7 +358,7 @@ 
   struct _Prime_rehash_policy
   {
     _Prime_rehash_policy(float __z = 1.0)
-    : _M_max_load_factor(__z), _M_prev_resize(0), _M_next_resize(0) { }
+    : _M_max_load_factor(__z), _M_next_resize(0) { }
 
     float
     max_load_factor() const noexcept
@@ -380,25 +380,21 @@ 
     _M_need_rehash(std::size_t __n_bkt, std::size_t __n_elt,
 		   std::size_t __n_ins) const;
 
-    typedef std::pair<std::size_t, std::size_t> _State;
+    typedef std::size_t _State;
 
     _State
     _M_state() const
-    { return std::make_pair(_M_prev_resize, _M_next_resize); }
+    { return _M_next_resize; }
 
     void
-    _M_reset(const _State& __state)
-    {
-      _M_prev_resize = __state.first;
-      _M_next_resize = __state.second;
-    }
+    _M_reset(_State __state)
+    { _M_next_resize = __state; }
 
     enum { _S_n_primes = sizeof(unsigned long) != 8 ? 256 : 256 + 48 };
 
     static const std::size_t _S_growth_factor = 2;
 
     float                _M_max_load_factor;
-    mutable std::size_t  _M_prev_resize;
     mutable std::size_t  _M_next_resize;
   };
 
@@ -417,35 +413,28 @@ 
     static const unsigned char __fast_bkt[12]
       = { 2, 2, 2, 3, 5, 5, 7, 7, 11, 11, 11, 11 };
 
-    const std::size_t __grown_n = __n * _S_growth_factor;
-    if (__grown_n <= 11)
+    if (__n <= 11)
       {
-	_M_prev_resize = 0;
 	_M_next_resize
-	  = __builtin_ceil(__fast_bkt[__grown_n]
+	  = __builtin_ceil(__fast_bkt[__n]
 			   * (long double)_M_max_load_factor);
-	return __fast_bkt[__grown_n];
+	return __fast_bkt[__n];
       }
 
     const unsigned long* __next_bkt
       = std::lower_bound(__prime_list + 5, __prime_list + _S_n_primes,
-			 __grown_n);
-    const unsigned long* __prev_bkt
-      = std::lower_bound(__prime_list + 1, __next_bkt, __n / _S_growth_factor);
-
-    _M_prev_resize
-      = __builtin_floor(*(__prev_bkt - 1) * (long double)_M_max_load_factor);
+			 __n);
     _M_next_resize
       = __builtin_ceil(*__next_bkt * (long double)_M_max_load_factor);
     return *__next_bkt;
   }
 
-  // Return the smallest prime p such that alpha p >= n, where alpha
+  // Return the smallest integer p such that alpha p >= n, where alpha
   // is the load factor.
   inline std::size_t
   _Prime_rehash_policy::
   _M_bkt_for_elements(std::size_t __n) const
-  { return _M_next_bkt(__builtin_ceil(__n / (long double)_M_max_load_factor)); }
+  { return __builtin_ceil(__n / (long double)_M_max_load_factor); }
 
   // Finds the smallest prime p such that alpha p > __n_elt + __n_ins.
   // If p > __n_bkt, return make_pair(true, p); otherwise return
@@ -467,7 +456,8 @@ 
 				 / (long double)_M_max_load_factor;
 	if (__min_bkts >= __n_bkt)
 	  return std::make_pair(true,
-				_M_next_bkt(__builtin_floor(__min_bkts) + 1));
+	    _M_next_bkt(std::max<std::size_t>(__builtin_floor(__min_bkts) + 1,
+					      __n_bkt * _S_growth_factor)));
 	else
 	  {
 	    _M_next_resize
@@ -475,13 +465,6 @@ 
 	    return std::make_pair(false, 0);
 	  }
       }
-    else if (__n_elt + __n_ins < _M_prev_resize)
-      {
-	long double __min_bkts = (__n_elt + __n_ins)
-				 / (long double)_M_max_load_factor;
-	return std::make_pair(true,
-			      _M_next_bkt(__builtin_floor(__min_bkts) + 1));
-      }
     else
       return std::make_pair(false, 0);
   }