This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Hashing of "switch/case" selections
- To: GCC general mailing list <gcc at gcc dot gnu dot org>
- Subject: Hashing of "switch/case" selections
- From: Andy Walker <jawalker at stl dot quik dot com>
- Date: Wed, 27 Dec 2000 22:43:40 -0600
I have been doing some research on this. Quick definitions for those
who are not Sorting and Searching Wizards: Hash indicies: unique
preHashed index values that each have a meaning. A Hash function: a
function applied to a value (usually an integer, or rarely, a character
string) to get a result. A unique Hash: a Hash function that returns a
unique result for every unique input. A perfect Hash: similar to a
unique Hash, but not as stringent; a Hash function that returns a unique
result for every Hash index (note that invalid, ie non-index values may,
and often do, return the same result as a valid Hash index). A minimal,
perfect Hash: a Hash function that returns "n" sequential values for
all "n" valid Hash indicies, not necessarily in the same order.. A
near-minimal, perfect Hash: as close as I can get to the the previous
definition.
Unique Hashes are usually too much to ask for. I have not found any
references that begin to suggest how to design and code such a beast.
Therefore, the perfect Hash is the minimum acceptable function. The
first consequence of this approach is that a table of the valid indicies
must be maintained for verification, to distinguish the invalid values
from the valid indicies. If there is a table, then a high and low test
must be performed to guarantee that the Hash result falls in the range
of the table, before the verification test. If we use a multiplication
for the Hash function, I expect the multiply instruction, plus
associated register loads, to be at least the equivalent of three test
and conditional branch instruction pairs. (no divide instructions -- I
am allergic to those).
A binary tree search using only five test and conditional branch pairs
plus a final jump table/verification table lookup and comparision, will
compile more quickly, will run at least as fast, and will take up less
space than the simplest general Hash function.. If there are 32 or
fewer cases, binary search is king. Note that this is totally unrelated
to the spread of the index values.
I have recently changed jobs, but at my old job I did a rough and
unscientific analysis of a modestly sized "C/C++" library. I found
rougly one-hundred-three instances of "switch" with cases. Ninety-two
of them had fewer than 32 cases, and most of those were sixteen cases or
less. Of the remaining nine, all but one were handling ASCII characters
as integers for parsing functions.
Just for peace of mind, the present tool (binary tree) handles most
situations in the most efficient manner, IMHO. Also, none of the
algorithms I found could guarantee a near-minimal, perfect Hash in
linear time. A few wild stabs in the dark often bring success. When
they do not, the binary tree is the reliable backup plan.
I am still working on "switch" for 256 or fewer cases, and more than 256
cases. The algorithm I am analysing is a multiplication with truncation
(/256), and mulitiplication then squaring with truncation (/256). When
I get some reliable timing numbers, I will post them here for further
discussion.
My experience as a programmer is that programs are almost never written
with switches to more than a few hundred cases. I personally have never
seen one. More than that, and the programmer finds a more efficient,
more readable, more maintainable solution, and he can do so because he
has a deep knowledge of the problem that is unavailable to the
compiler. My seat-of-pants estimate is that 400 cases is a good, maybe
even high, maximum for Hash evaluations. A program containing a switch
statement with more than 400 cases in it probably has enough other
severe design problems that it will never be worth running anyway.
Andy