Hash-Based Skewed-Distribution Random-Integer find Find Timing Test

Description

This test inserts a number of values with a markedly non-uniform i.i.d. integer keys into a container, then performs a series of finds using find . It measures the average time for find as a function of the number of values in the containers. The keys are generated as follows. First, a uniform integer is created; it is then shifted left 8 bits.

(The test was executed with hash_zlob_random_int_find_timing_test 200 200 2100)

Purpose

The test checks the effect of different range-hashing functions and trigger policies (see Design::Associative Containers::Hash-Based Containers::Hash Policies and Design::Associative Containers::Hash-Based Containers::Resize Policies).

Results

Figures NHG, NHM, and NHL show the results for various hash-based associative-containers in g++, MSVC++, and local, respectively.

no image
NHG: Skewed-distribution random int find timing test using find - g++

In the above figure, the names in the legends have the following meaning:

  1. cc_hash_mask_exp_nea_lc_1div8_1div1_nsth_map- cc_hash_table with Comb_Hash_Fn = direct_mask_range_hashing , and Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_exponential_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/1
  2. gp_hash_mod_quadp_prime_nea_lc_1div8_1div2_nsth_map- gp_hash_table with Comb_Hash_Fn = direct_mod_range_hashing , Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_prime_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/2, and Probe_Fn = quadratic_probe_fn
  3. n_hash_map_ncah- std::tr1::unordered_map with cache_hash_code = false
  4. cc_hash_mod_prime_nea_lc_1div8_1div1_nsth_map- cc_hash_table with Comb_Hash_Fn = direct_mod_range_hashing , and Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_prime_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/1
no image
NHM: Skewed-distribution random int find timing test using find - msvc++

In the above figure, the names in the legends have the following meaning:

  1. n_hash_map_ncah- stdext::hash_map
  2. cc_hash_mask_exp_nea_lc_1div8_1div1_nsth_map- cc_hash_table with Comb_Hash_Fn = direct_mask_range_hashing , and Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_exponential_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/1
  3. gp_hash_mod_quadp_prime_nea_lc_1div8_1div2_nsth_map- gp_hash_table with Comb_Hash_Fn = direct_mod_range_hashing , Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_prime_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/2, and Probe_Fn = quadratic_probe_fn
  4. cc_hash_mod_prime_nea_lc_1div8_1div1_nsth_map- cc_hash_table with Comb_Hash_Fn = direct_mod_range_hashing , and Resize_Policy = hash_standard_resize_policy with Size_Policy = hash_prime_size_policy , and Trigger_Policy = hash_load_check_resize_trigger with αmin = 1/8 and αmax = 1/1
no image
NHL: Skewed-distribution random int find timing test using find - local

Observations

In this setting, the keys' distribution is so skewed that the unerlying hash-table type affects performance marginally. (This is in contrast with Hash-Based Text find Find Timing Test , Hash-Based Random-Integer find Find Timing Test , Hash-Based Random-Integer Subscript Find Timing Test and Hash-Based Random-Integer Subscript Insert Timing Test .)

The range-hashing scheme affects performance dramatically. A mask-based range-hashing scheme effectively maps all values into the same bucket. Access degenerates into a search within an unordered linked-list. In Figures NHG and NHM , it should be noted that std::tr1::unordered_map and stdext::hash_map are hard-wired currently to mod-based and mask-based schemes, respectively.

When observing the settings of this test, it is apparent that the keys' distribution is far from natural. One might ask if the test is not contrived to show that, in some cases, mod-based range hashing does better than mask-based range hashing. This is, in fact just the case. We did not encounter a more natural case in which mod-based range hashing is better. In our opnion, real-life key distributions are handled better with an appropriate hash function and a mask-based range-hashing function. (shift_mask.cc shows an example of handling this a-priori known skewed distribution with a mask-based range-hashing function). If hash performance is bad, a Χ2 test can be used to check how to transform it into a more uniform distribution.

For this reason, pb_ds's default range-hashing function is mask-based.

Observations::Hash-Based Container Types summarizes some observations on hash-based containers; Observations::Hash-Based Containers' Policies summarizes some observations on hash-based containers' policies.