This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PATCH: Add objc-lang.c, further cleanup - TAKE TWO


On Tue, Dec 04, 2001 at 06:29:10AM +0000, Neil Booth wrote:
>  
> > Just yesterday I found an old letter from someone suggesting I try just
> > 
> >    hash = hash * 9 + ch
> > 
> > for the hash step.  If you have a harness set up to do these tests
> > already, mind trying that one?
> 
> I think that might have been one they tested.  ... OK you persuaded
> me.  [Digs around].  Here's the results with my data with one test
> harness a guy posted (someone else had a larger set of hashes but
> didn't post his harness):
[snip]

Finally I get back to this.

I augmented the harness to indicate occupancy, mean, std.dev as well
as max.  (Occupancy replaces "used" as it's easier to compare across
runs with different table sizes.)  Then I tried various table sizes
with the provided hashes, gcc's hash, the "*9" hash, and one more -
the "linear congruential" hash used by Daniel Philips' ext2 indexed
directories (the approach is similar to FNV).

The data set was all the identifier-like strings found in all the
source code I have lying around - I made no attempt to exclude
words in comments or other junk; just every string that matched
/[a-zA-Z_][a-zA-Z0-9_]*/.  There are 552610 of them after sort -u.

Results are attached, as is the revised harness.  Mail me if you want
the data set, it's 2.2MB bzipped.

The conclusion I draw is that the perl, gcc, fnv, and dphil hashes are
all about tied for bucket distribution, on identifiers, and perform
well on power-of-two size tables.  (Which is good, since hashtable.c
uses a power-of-two table.)  CRC is decent but only if given a prime
table size.

lennart and nine are fast, but you pay for it with longer chains and
lower occupancy - lennart in particular consistently has a mean bucket
size of ~8, not ~1 the way all the other good ones do.  mouse and dumb
are just not worth mentioning.

It would be interesting to attempt to simulate hashtable.c's rehashing
algorithm and see what kind of figures result.

zw
There are 552610 unique items-which-look-like-identifiers
(/[a-zA-Z_][a-zA-Z0-9_]*/) in my source code collection.

'gcc' is htab-hash-string from libiberty/hashtab.c.
'nine' is the simple multiply-by-nine hash.
'dphil' is Daniel Philips' linear congruential hash, from his
ext2-directory-index patch.

524288: largest power of two smaller than N
dumb         occ   1.0%   mean 108.23  dev 131.16  max  656   time 0.308sec
mouse        occ  44.2%   mean   2.39  dev   6.76  max  229   time 0.488sec
crc          occ  58.6%   mean   1.80  dev   1.71  max  135   time 0.423sec
lennart      occ  12.5%   mean   8.43  dev   2.89  max   23   time 0.388sec
nine         occ  63.8%   mean   1.65  dev   0.93  max   12   time 0.376sec
perl         occ  64.9%   mean   1.62  dev   0.85  max    9   time 0.417sec
gcc          occ  65.1%   mean   1.62  dev   0.84  max    9   time 0.422sec
fnv          occ  65.2%   mean   1.62  dev   0.84  max    9   time 0.458sec
dphil        occ  65.1%   mean   1.62  dev   0.84  max    9   time 0.537sec

1048576: smallest power of two larger than N
dumb         occ   0.5%   mean 108.23  dev 131.16  max  656   time 0.308sec
mouse        occ  25.6%   mean   2.06  dev   6.26  max  229   time 0.493sec
crc          occ  36.4%   mean   1.45  dev   1.35  max  135   time 0.430sec
lennart      occ   6.2%   mean   8.43  dev   2.89  max   23   time 0.388sec
nine         occ  39.8%   mean   1.32  dev   0.66  max   11   time 0.385sec
perl         occ  40.8%   mean   1.29  dev   0.57  max    7   time 0.423sec
gcc          occ  40.9%   mean   1.29  dev   0.56  max    7   time 0.429sec
fnv          occ  41.0%   mean   1.29  dev   0.56  max    7   time 0.465sec
dphil        occ  40.9%   mean   1.29  dev   0.56  max    6   time 0.559sec

552611: smallest prime larger than N
dumb         occ   0.9%   mean 108.23  dev 131.16  max  656   time 0.308sec
mouse        occ  47.4%   mean   2.11  dev   3.58  max  113   time 0.494sec
lennart      occ  11.9%   mean   8.43  dev   2.89  max   23   time 0.398sec
nine         occ  61.9%   mean   1.62  dev   0.91  max   12   time 0.376sec
perl         occ  63.0%   mean   1.59  dev   0.82  max    9   time 0.417sec
gcc          occ  63.1%   mean   1.58  dev   0.82  max    9   time 0.422sec
fnv          occ  63.2%   mean   1.58  dev   0.81  max    9   time 0.469sec
dphil        occ  63.2%   mean   1.58  dev   0.81  max    9   time 0.537sec
crc          occ  63.0%   mean   1.59  dev   0.82  max    8   time 0.436sec

1048759: smallest prime larger than 1048756
dumb         occ   0.5%   mean 108.23  dev 131.16  max  656   time 0.308sec
mouse        occ  28.6%   mean   1.85  dev   3.30  max  113   time 0.498sec
lennart      occ   6.2%   mean   8.43  dev   2.89  max   23   time 0.388sec
nine         occ  39.8%   mean   1.32  dev   0.66  max   12   time 0.382sec
gcc          occ  40.9%   mean   1.29  dev   0.56  max    8   time 0.429sec
fnv          occ  40.9%   mean   1.29  dev   0.56  max    8   time 0.465sec
crc          occ  40.7%   mean   1.29  dev   0.57  max    7   time 0.435sec
perl         occ  40.8%   mean   1.29  dev   0.56  max    7   time 0.423sec
dphil        occ  40.9%   mean   1.29  dev   0.56  max    6   time 0.545sec

10541534: N log2 N
dumb         occ   0.0%   mean 108.23  dev 131.16  max  656   time 0.308sec
mouse        occ   3.3%   mean   1.59  dev   3.01  max  113   time 0.515sec
lennart      occ   0.6%   mean   8.43  dev   2.89  max   23   time 0.389sec
nine         occ   4.9%   mean   1.07  dev   0.36  max   10   time 0.399sec
perl         occ   5.1%   mean   1.03  dev   0.19  max    5   time 0.441sec
gcc          occ   5.1%   mean   1.03  dev   0.18  max    5   time 0.447sec
crc          occ   5.1%   mean   1.03  dev   0.19  max    5   time 0.455sec
dphil        occ   5.1%   mean   1.03  dev   0.16  max    4   time 0.566sec
fnv          occ   5.1%   mean   1.03  dev   0.16  max    3   time 0.486sec

Attachment: hashtest.tar.gz
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]