| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>
Mon, 15 Feb 2010 15:14:07 +0000
On 15/02/2010 04:56, John Ousterhout wrote:
> It's possible that the proposed new algorithm is better than the> original Tcl one; it looks like it might use a more complex> approach that tends to spread the information from each character> more quickly across all of the bits of the hash value, whereas> the multiply-by-9 approach only gradually does this.

The original Tcl algorithm is not bad for most keys. It both keeps bits
low down and spreads them up, and it uses operations that don't lose too
much information (i.e., multiplication and addition). The FNV algorithm
is remarkably similar in structure (so advantageous in terms of ease of
transition) but uses a different multiplier that spreads the information
more thoroughly and uses xor rather than addition (though we all know
those are actually closely related operations).

The main issue with the old hash algorithm was that it was too easy to
attack it and turn hash table accesses from O(1) to O(n**2). A
literature search (not very easy; this is not a heavily studied area)
indicates that the FNV hash is quite a bit more resistant; attacks would
require using much longer keys and would be more likely to hit some
other limit.

The other advantage of the FNV algorithm is that it is well-known,
reasonably well-studied, and does a reasonably good job with producing
keys for the space of "short words"; I'm not really the guy saying that
it is quite good, a number of others are (URLs elsewhere in this
thread). I don't know of anyone other than Tcl using the old hash
algorithm (though related variants have been studied and found lacking).

I'm tempted by Bob Jenkins's lookup3 hash, but it's big, complex, and
really wants to know the length of the string first so implementing it
efficiently for string keys is awkward[*]. We could use it for Tcl_Obj
keys, but I'm currently keener on using the same algorithm for both
types of user-visible keys. (I'm also a bit hesitant because of the
complexity of lookup3; it's non-trivial to code.)

[Good advice elided]
> So, if you decide to change, make sure you measure very carefully> on representative datasets, and make sure you've studied the> properties of hashing functions enough to understand exactly> how yours is working and why it is better.

On non-malicious datasets, Tcl's original algorithm works pretty well.
It's the case when someone's deliberately trying to provoke bad
behaviour that concerned me. FNV seems to do better (and has reasonable
distribution behaviour on the non-malicious datasets I've tried it with,
such as the contents of /usr/share/dict/words and the first million
cardinal numbers).

As a side note, changing the hash has flushed out a small number of
places in Tcl where we had code that actually did depend on the order of
hashing. These places were bugs waiting to happen...

Donal.
[* Maintaining binary and source compatibility was a cast-iron
    requirement for this change, so storing the length of a string key
    was not possible; there's nowhere in a Tcl_HashEntry for it to go
    and offsets of fields are baked into user code because of macros. ]

begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread