This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
On Tue, Dec 04, 2001 at 06:29:10AM +0000, Neil Booth wrote: > > > Just yesterday I found an old letter from someone suggesting I try just > > > > hash = hash * 9 + ch > > > > for the hash step. If you have a harness set up to do these tests > > already, mind trying that one? > > I think that might have been one they tested. ... OK you persuaded > me. [Digs around]. Here's the results with my data with one test > harness a guy posted (someone else had a larger set of hashes but > didn't post his harness): [snip] Finally I get back to this. I augmented the harness to indicate occupancy, mean, std.dev as well as max. (Occupancy replaces "used" as it's easier to compare across runs with different table sizes.) Then I tried various table sizes with the provided hashes, gcc's hash, the "*9" hash, and one more - the "linear congruential" hash used by Daniel Philips' ext2 indexed directories (the approach is similar to FNV). The data set was all the identifier-like strings found in all the source code I have lying around - I made no attempt to exclude words in comments or other junk; just every string that matched /[a-zA-Z_][a-zA-Z0-9_]*/. There are 552610 of them after sort -u. Results are attached, as is the revised harness. Mail me if you want the data set, it's 2.2MB bzipped. The conclusion I draw is that the perl, gcc, fnv, and dphil hashes are all about tied for bucket distribution, on identifiers, and perform well on power-of-two size tables. (Which is good, since hashtable.c uses a power-of-two table.) CRC is decent but only if given a prime table size. lennart and nine are fast, but you pay for it with longer chains and lower occupancy - lennart in particular consistently has a mean bucket size of ~8, not ~1 the way all the other good ones do. mouse and dumb are just not worth mentioning. It would be interesting to attempt to simulate hashtable.c's rehashing algorithm and see what kind of figures result. zw
There are 552610 unique items-which-look-like-identifiers (/[a-zA-Z_][a-zA-Z0-9_]*/) in my source code collection. 'gcc' is htab-hash-string from libiberty/hashtab.c. 'nine' is the simple multiply-by-nine hash. 'dphil' is Daniel Philips' linear congruential hash, from his ext2-directory-index patch. 524288: largest power of two smaller than N dumb occ 1.0% mean 108.23 dev 131.16 max 656 time 0.308sec mouse occ 44.2% mean 2.39 dev 6.76 max 229 time 0.488sec crc occ 58.6% mean 1.80 dev 1.71 max 135 time 0.423sec lennart occ 12.5% mean 8.43 dev 2.89 max 23 time 0.388sec nine occ 63.8% mean 1.65 dev 0.93 max 12 time 0.376sec perl occ 64.9% mean 1.62 dev 0.85 max 9 time 0.417sec gcc occ 65.1% mean 1.62 dev 0.84 max 9 time 0.422sec fnv occ 65.2% mean 1.62 dev 0.84 max 9 time 0.458sec dphil occ 65.1% mean 1.62 dev 0.84 max 9 time 0.537sec 1048576: smallest power of two larger than N dumb occ 0.5% mean 108.23 dev 131.16 max 656 time 0.308sec mouse occ 25.6% mean 2.06 dev 6.26 max 229 time 0.493sec crc occ 36.4% mean 1.45 dev 1.35 max 135 time 0.430sec lennart occ 6.2% mean 8.43 dev 2.89 max 23 time 0.388sec nine occ 39.8% mean 1.32 dev 0.66 max 11 time 0.385sec perl occ 40.8% mean 1.29 dev 0.57 max 7 time 0.423sec gcc occ 40.9% mean 1.29 dev 0.56 max 7 time 0.429sec fnv occ 41.0% mean 1.29 dev 0.56 max 7 time 0.465sec dphil occ 40.9% mean 1.29 dev 0.56 max 6 time 0.559sec 552611: smallest prime larger than N dumb occ 0.9% mean 108.23 dev 131.16 max 656 time 0.308sec mouse occ 47.4% mean 2.11 dev 3.58 max 113 time 0.494sec lennart occ 11.9% mean 8.43 dev 2.89 max 23 time 0.398sec nine occ 61.9% mean 1.62 dev 0.91 max 12 time 0.376sec perl occ 63.0% mean 1.59 dev 0.82 max 9 time 0.417sec gcc occ 63.1% mean 1.58 dev 0.82 max 9 time 0.422sec fnv occ 63.2% mean 1.58 dev 0.81 max 9 time 0.469sec dphil occ 63.2% mean 1.58 dev 0.81 max 9 time 0.537sec crc occ 63.0% mean 1.59 dev 0.82 max 8 time 0.436sec 1048759: smallest prime larger than 1048756 dumb occ 0.5% mean 108.23 dev 131.16 max 656 time 0.308sec mouse occ 28.6% mean 1.85 dev 3.30 max 113 time 0.498sec lennart occ 6.2% mean 8.43 dev 2.89 max 23 time 0.388sec nine occ 39.8% mean 1.32 dev 0.66 max 12 time 0.382sec gcc occ 40.9% mean 1.29 dev 0.56 max 8 time 0.429sec fnv occ 40.9% mean 1.29 dev 0.56 max 8 time 0.465sec crc occ 40.7% mean 1.29 dev 0.57 max 7 time 0.435sec perl occ 40.8% mean 1.29 dev 0.56 max 7 time 0.423sec dphil occ 40.9% mean 1.29 dev 0.56 max 6 time 0.545sec 10541534: N log2 N dumb occ 0.0% mean 108.23 dev 131.16 max 656 time 0.308sec mouse occ 3.3% mean 1.59 dev 3.01 max 113 time 0.515sec lennart occ 0.6% mean 8.43 dev 2.89 max 23 time 0.389sec nine occ 4.9% mean 1.07 dev 0.36 max 10 time 0.399sec perl occ 5.1% mean 1.03 dev 0.19 max 5 time 0.441sec gcc occ 5.1% mean 1.03 dev 0.18 max 5 time 0.447sec crc occ 5.1% mean 1.03 dev 0.19 max 5 time 0.455sec dphil occ 5.1% mean 1.03 dev 0.16 max 4 time 0.566sec fnv occ 5.1% mean 1.03 dev 0.16 max 3 time 0.486sec
Attachment:
hashtest.tar.gz
Description: Binary data
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |