> One very glaring way of looking at these flaws is that the old
> hash function only distinguishes between 2551 different two-byte
> sequences, despite there being 65536 of them; this is less than
> 4%. Conversely, for *any* pair of neighbouring bytes of a key,
> there are on average 24 other pairs of bytes that could be
> substituted without changing the value of the old hash function.
> Yet another way of looking at it (though probably hard to
> formalise): the old hash function only keeps about 3.3 bits
> of "randomness" for every 8 bits of key fed in to it!
This is true. I knew this at the time, and it was an intentional
(and actually correct!) choice. It turns out that typical strings
don't actually have 8 bits of information per byte: for example,
most strings that get hashed either consist of purely alphabetic
stuff or sequences of digits. Even in the case of alphabetic
strings, some characters are much more likely than others, so
the space of strings doesn't even grow by a factor of 26 with
each additional character that's added. For example, how many
2-character words and numbers are there in the English language?
I'll bet it's a lot less than 2551.
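(For the record, the 2551 figure is easy to derive: after two bytes
the old function has computed 9*c1 + c2, and with each byte in 0..255
the largest possible value is 9*255 + 255 = 2550, so there are at
most 2551 distinct results for the 65536 possible byte pairs.
65536/2551 is roughly 25.7 pairs per hash value, hence the "on
average 24 other pairs" above.)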
Suppose instead of the current calculation "result = result*9 + c"
we used "result = result*256 + c". The problem with this is
that result grows way too quickly, so after processing just a few
characters the first character in the string has been shifted
into the high-order bits of result. We only use the low-order
bits as the hash bucket index, so this means that only the last
few characters of the string influence that index. As a result,
there isn't enough entropy/information in the low-order bits
that are actually used, so more hash collisions result.
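To make the contrast concrete, here is a minimal sketch of the two
schemes (hypothetical helper names, not the actual Tcl source; the
1024-bucket table is just for illustration):

    #include <stdio.h>

    /* The current scheme: result = result*9 + c. */
    static unsigned int
    HashBy9(const char *s)
    {
        unsigned int result = 0;
        while (*s) {
            result = result*9 + (unsigned char) *s++;
        }
        return result;
    }

    /* The multiply-by-256 scheme: each new character shifts everything
     * already in result left by 8 bits, so with a 32-bit result only
     * the last four characters survive at all, and a bucket index taken
     * from the low-order bits sees even fewer of them. */
    static unsigned int
    HashBy256(const char *s)
    {
        unsigned int result = 0;
        while (*s) {
            result = result*256 + (unsigned char) *s++;
        }
        return result;
    }

    int
    main(void)
    {
        unsigned int mask = 1024 - 1;   /* 1024 buckets */

        /* These keys differ only in their 4th-from-last character, so
         * under *256 only the last characters reach the low 10 bits
         * and the two keys are guaranteed to share a bucket ... */
        printf("*256: %u %u\n", HashBy256("abcdefgh") & mask,
                                HashBy256("abcdXfgh") & mask);
        /* ... while *9 keeps the difference within the bits we use. */
        printf("*9:   %u %u\n", HashBy9("abcdefgh") & mask,
                                HashBy9("abcdXfgh") & mask);
        return 0;
    }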
Ideally we'd like the magnitude of result to grow at the same rate
as the overall table size, so that the useful information in the
hash value falls into the bits that we actually use. I picked
the constant 9 after a lot of experimentation with different
multipliers; constants that are either larger or smaller cause
worse hash table performance.
It's possible that the proposed new algorithm is better than the
original Tcl one; it looks like it might use a more complex
approach that tends to spread the information from each character
more quickly across all of the bits of the hash value, whereas
the multiply-by-9 approach does this only gradually.
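For comparison, here is FNV-1a, a well-known 32-bit hash in that
fast-mixing style.  To be clear, this is only a representative
example of the approach, not necessarily the algorithm actually
proposed:

    /* FNV-1a: XOR each byte in, then multiply by a large prime that
     * diffuses the byte's bits across the whole word right away. */
    static unsigned int
    Fnv1a(const char *s)
    {
        unsigned int hash = 2166136261u;    /* FNV offset basis */
        while (*s) {
            hash ^= (unsigned char) *s++;   /* fold the byte in */
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }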
However, BE VERY CAREFUL ABOUT HASH FUNCTIONS. Picking a good
hash function seems simple but is actually very complex, with
lots of subtle gotchas. Many people think they know good hash
functions, but most of those functions don't actually behave very
well, and few people do extensive tests on data that is
representative of how the hash
functions will actually be used. Hash functions that work well
on some kinds of data may not work well on other kinds. I got
started in this because I wrote an application using an
"off-the-shelf" hash function and discovered that the application
performed very poorly at large scale. I eventually discovered
that it was only using a small portion of the hash bucket array
because the hash function did not successfully spread the hash
values across the entire array; there were so many bucket
collisions that the entire application performed badly. I ended
up with the current function after a lot of experiments.
So, if you decide to change the hash function, make sure you measure very carefully
on representative datasets, and make sure you've studied the
properties of hashing functions enough to understand exactly
how yours is working and why it is better.
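As a sketch of what such a measurement can look like, here is a small
bucket-distribution test; it pairs with either of the hash sketches
above, and the table size is a placeholder.  The keys must come from
the real application; synthetic data proves nothing:

    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 1024               /* placeholder table size */

    /* Tally how a hash function spreads keys across buckets and report
     * the longest chain.  On representative data a good function keeps
     * the longest chain near the average load; a bad one piles keys
     * into a handful of buckets, as described above. */
    void
    TestDistribution(unsigned int (*hash)(const char *),
                     const char **keys, int nkeys)
    {
        int counts[NBUCKETS];
        int i, max = 0;

        memset(counts, 0, sizeof(counts));
        for (i = 0; i < nkeys; i++) {
            counts[hash(keys[i]) & (NBUCKETS - 1)]++;
        }
        for (i = 0; i < NBUCKETS; i++) {
            if (counts[i] > max) {
                max = counts[i];
            }
        }
        printf("%d keys, %d buckets, longest chain %d, average %.3f\n",
               nkeys, NBUCKETS, max, (double) nkeys / NBUCKETS);
    }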
-John-