On 15/02/2010 04:56, John Ousterhout wrote:
> It's possible that the proposed new algorithm is better than the> original Tcl one; it looks like it might use a more complex> approach that tends to spread the information from each character> more quickly across all of the bits of the hash value, whereas> the multiply-by-9 approach only gradually does this.
The original Tcl algorithm is not bad for most keys. It both keeps bits
low down and spreads them up, and it uses operations that don't lose too
much information (i.e., multiplication and addition). The FNV algorithm
is remarkably similar in structure (so advantageous in terms of ease of
transition) but uses a different multiplier that spreads the information
more thoroughly and uses xor rather than addition (though we all know
those are actually closely related operations).
The main issue with the old hash algorithm was that it was too easy to
attack it and turn hash table accesses from O(1) to O(n**2). A
literature search (not very easy; this is not a heavily studied area)
indicates that the FNV hash is quite a bit more resistant; attacks would
require using much longer keys and would be more likely to hit some
other limit.
The other advantage of the FNV algorithm is that it is well-known,
reasonably well-studied, and does a reasonably good job with producing
keys for the space of "short words"; I'm not really the guy saying that
it is quite good, a number of others are (URLs elsewhere in this
thread). I don't know of anyone other than Tcl using the old hash
algorithm (though related variants have been studied and found lacking).
I'm tempted by Bob Jenkins's lookup3 hash, but it's big, complex, and
really wants to know the length of the string first so implementing it
efficiently for string keys is awkward[*]. We could use it for Tcl_Obj
keys, but I'm currently keener on using the same algorithm for both
types of user-visible keys. (I'm also a bit hesitant because of the
complexity of lookup3; it's non-trivial to code.)
[Good advice elided]
> So, if you decide to change, make sure you measure very carefully> on representative datasets, and make sure you've studied the> properties of hashing functions enough to understand exactly> how yours is working and why it is better.
On non-malicious datasets, Tcl's original algorithm works pretty well.
It's the case when someone's deliberately trying to provoke bad
behaviour that concerned me. FNV seems to do better (and has reasonable
distribution behaviour on the non-malicious datasets I've tried it with,
such as the contents of /usr/share/dict/words and the first million
cardinal numbers).
As a side note, changing the hash has flushed out a small number of
places in Tcl where we had code that actually did depend on the order of
hashing. These places were bugs waiting to happen...
Donal.
[* Maintaining binary and source compatibility was a cast-iron
requirement for this change, so storing the length of a string key
was not possible; there's nowhere in a Tcl_HashEntry for it to go
and offsets of fields are baked into user code because of macros. ]
begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard
------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core