| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: John Ousterhout <ous...@pacbell.net>
Sun, 14 Feb 2010 20:56:57 -0800
> One very glaring way of looking at these flaws is that the old> hash  function only distinguishes between 2551 different two-byte > sequences, despite there being 65536 of them; this is less than> 4%. Conversely, for *any* pair of neighbouring bytes of a key,> there are on average 24 other pairs of bytes that could be> substituted without changing the value of the old hash function.> Yet another way of looking at it (though probably hard to> formalise): the old hash function only keeps about 3.3 bits> of "randomness" for every 8 bits of key fed in to it!

This is true. I knew this at the time, and it was an intentional
(and actually correct!) choice.  It turns out that typical strings
don't actually have 8 bits of information per byte: for example,
most strings that get hashed either consist of purely alphabetic
stuff or sequences of digits.  Even in the case of alphabetic
strings, some characters are much more likely than others, so
the space of strings doesn't even grow by a factor of 26 with
each additional character that's added.  For example, how many
2-character words and numbers are there in the English language?
I'll bet it's a lot less than 2551.

Suppose instead of the current calculation "result = result*9 + c"
we used "result = result*256 + c".  The problem with this is
that result grows way too quickly, so after processing just a few
characters the first character in the string has been shifted
into the high-order bits of result.  We only use the low-order
bits as the hash bucket index, so this means that only the last
few characters of the string influence that index.  As a result,
there isn't enough entropy/information in the low-order bits
that are actually used, so more hash collisions result.

Ideally we'd like the magnitude of result to grow at the same rate
as the overall table size, so that the useful information in the
hash value falls into the bits that we actually use.  I picked
the constant 9 after a lot of experimentation with different hash
values.  Constants that are either larger or smaller cause
worse hash table performance.

It's possible that the proposed new algorithm is better than the
original Tcl one; it looks like it might use a more complex
approach that tends to spread the information from each character
more quickly across all of the bits of the hash value, whereas
the multiply-by-9 approach only gradually does this.

However, BE VERY CAREFUL ABOUT HASH FUNCTIONS.  Picking a good
hash function seems simple but is actually very complex, with
lots of subtle gotchas. Many people think they know good functions,
but most of them don't actually behave very well and few people
do extensive tests on data that is representative of how the hash
functions will actually be used.  Hash functions that work well
on some kinds of data may not work well on other kinds.  I got
started in this because I wrote an application using an
"off-the-shelf" hash function and discovered that the application
performed very poorly at large scale.  I eventually discovered
that it was only using a small portion of the hash bucket array
because the hash function did not successfully spread the hash
values across the entire array; there were so many bucket
collisions that the entire application performed badly.  I ended
up with the current function after a lot of experiments.

So, if you decide to change, make sure you measure very carefully
on representative datasets, and make sure you've studied the
properties of hashing functions enough to understand exactly
how yours is working and why it is better.

-John-


------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread