> One very glaring way of looking at these flaws is that the old
> hash function only distinguishes between 2551 different two-byte
> sequences, despite there being 65536 of them; this is less than
> 4%. Conversely, for *any* pair of neighbouring bytes of a key,
> there are on average 24 other pairs of bytes that could be
> substituted without changing the value of the old hash function.
> Yet another way of looking at it (though probably hard to
> formalise): the old hash function only keeps about 3.3 bits
> of "randomness" for every 8 bits of key fed in to it!
This is true. I knew this at the time, and it was an intentional
(and actually correct!) choice. It turns out that typical strings
don't actually have 8 bits of information per byte: for example,
most strings that get hashed either consist of purely alphabetic
stuff or sequences of digits. Even in the case of alphabetic
strings, some characters are much more likely than others, so
the space of strings doesn't even grow by a factor of 26 with
each additional character that's added. For example, how many
2-character words and numbers are there in the English language?
I'll bet it's a lot less than 2551.
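(For the record, the 2551 figure is easy to derive: after two bytes
the old function has computed 9*c1 + c2, and with each byte in 0..255
the largest possible value is 9*255 + 255 = 2550, so there are at
most 2551 distinct results for the 65536 possible byte pairs.
65536/2551 is roughly 25.7 pairs per hash value, hence the "on
average 24 other pairs" above.)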
Suppose instead of the current calculation "result = result*9 + c"
we used "result = result*256 + c". The problem with this is
that result grows way too quickly, so after processing just a few
characters the first character in the string has been shifted
into the high-order bits of result. We only use the low-order
bits as the hash bucket index, so this means that only the last
few characters of the string influence that index. As a result,
there isn't enough entropy/information in the low-order bits
that are actually used, so more hash collisions result.
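To make the contrast concrete, here is a minimal sketch of the two
schemes (hypothetical helper names, not the actual Tcl source; the
1024-bucket table is just for illustration):

    #include <stdio.h>

    /* The current scheme: result = result*9 + c. */
    static unsigned int
    HashBy9(const char *s)
    {
        unsigned int result = 0;
        while (*s) {
            result = result*9 + (unsigned char) *s++;
        }
        return result;
    }

    /* The multiply-by-256 scheme: each new character shifts everything
     * already in result left by 8 bits, so with a 32-bit result only
     * the last four characters survive at all, and a bucket index taken
     * from the low-order bits sees even fewer of them. */
    static unsigned int
    HashBy256(const char *s)
    {
        unsigned int result = 0;
        while (*s) {
            result = result*256 + (unsigned char) *s++;
        }
        return result;
    }

    int
    main(void)
    {
        unsigned int mask = 1024 - 1;   /* 1024 buckets */

        /* These keys differ only in their 4th-from-last character, so
         * under *256 only the last characters reach the low 10 bits
         * and the two keys are guaranteed to share a bucket ... */
        printf("*256: %u %u\n", HashBy256("abcdefgh") & mask,
                                HashBy256("abcdXfgh") & mask);
        /* ... while *9 keeps the difference within the bits we use. */
        printf("*9:   %u %u\n", HashBy9("abcdefgh") & mask,
                                HashBy9("abcdXfgh") & mask);
        return 0;
    }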
Ideally we'd like the magnitude of result to grow at the same rate
as the overall table size, so that the useful information in the
hash value falls into the bits that we actually use. I picked
the constant 9 after a lot of experimentation with different
multipliers; constants that are either larger or smaller cause
worse hash table performance.
It's possible that the proposed new algorithm is better than the
original Tcl one; it looks like it might use a more complex
approach that tends to spread the information from each character
more quickly across all of the bits of the hash value, whereas
the multiply-by-9 approach does this only gradually.
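For comparison, here is FNV-1a, a well-known 32-bit hash in that
fast-mixing style.  To be clear, this is only a representative
example of the approach, not necessarily the algorithm actually
proposed:

    /* FNV-1a: XOR each byte in, then multiply by a large prime that
     * diffuses the byte's bits across the whole word right away. */
    static unsigned int
    Fnv1a(const char *s)
    {
        unsigned int hash = 2166136261u;    /* FNV offset basis */
        while (*s) {
            hash ^= (unsigned char) *s++;   /* fold the byte in */
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }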
However, BE VERY CAREFUL ABOUT HASH FUNCTIONS. Picking a good
hash function seems simple but is actually very complex, with
lots of subtle gotchas. Many people think they know good hash
functions, but most of those functions don't actually behave very
well, and few people do extensive tests on data that is
representative of how the hash
functions will actually be used. Hash functions that work well
on some kinds of data may not work well on other kinds. I got
started in this because I wrote an application using an
"off-the-shelf" hash function and discovered that the application
performed very poorly at large scale. I eventually discovered
that it was only using a small portion of the hash bucket array
because the hash function did not successfully spread the hash
values across the entire array; there were so many bucket
collisions that the entire application performed badly. I ended
up with the current function after a lot of experiments.
So, if you decide to change the hash function, make sure you measure very carefully
on representative datasets, and make sure you've studied the
properties of hashing functions enough to understand exactly
how yours is working and why it is better.
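As a sketch of what such a measurement can look like, here is a small
bucket-distribution test; it pairs with either of the hash sketches
above, and the table size is a placeholder.  The keys must come from
the real application; synthetic data proves nothing:

    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 1024               /* placeholder table size */

    /* Tally how a hash function spreads keys across buckets and report
     * the longest chain.  On representative data a good function keeps
     * the longest chain near the average load; a bad one piles keys
     * into a handful of buckets, as described above. */
    void
    TestDistribution(unsigned int (*hash)(const char *),
                     const char **keys, int nkeys)
    {
        int counts[NBUCKETS];
        int i, max = 0;

        memset(counts, 0, sizeof(counts));
        for (i = 0; i < nkeys; i++) {
            counts[hash(keys[i]) & (NBUCKETS - 1)]++;
        }
        for (i = 0; i < NBUCKETS; i++) {
            if (counts[i] > max) {
                max = counts[i];
            }
        }
        printf("%d keys, %d buckets, longest chain %d, average %.3f\n",
               nkeys, NBUCKETS, max, (double) nkeys / NBUCKETS);
    }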
-John-