| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>
Sun, 14 Feb 2010 23:54:36 +0000
On 14/02/2010 16:58, Gustaf Neumann wrote:
> Am 11.02.10 01:08, schrieb Alexandre Ferrieux:>> Since it appears that timings are tricky,> Apparently. It turns out that in the both mini-benchmarks (the two> "foo" functions) the hash function for string keys (HashStringKey())> is not called at all (!) in the timed loop (e.g. Tcl variables are> compiled locals).

That's why you have to use array variables there. Though that also
doesn't call HashStringKey; it calls the equivalent for Tcl_Obj keys.

> Funny enough, just by changing the string hash function one can> measure different performance values (maybe due to processor level> caching or similar external effects).

There's a lot of tricky bits to measuring the performance. That's why I
switched to using a direct test (and had to tweak that a lot too in the
end so as to stop the gcc optimizer for making me test something other
than what I really wanted!) with the result, after much experimentation,
that the FNV hash function is slower than Tcl's original hash function.
Not that it's a bottleneck for any existing code.

The other metrics we can use here relate to the differences of
distribution of hash keys, which can have quite a substantial impact on
the overall hash *table* performance. With general data, the performance
seems to be about the same; it's difficult to construct an example with
ordinary keys where the difference is really worth mentioning. With
"malicious" data it's different though; http://wiki.tcl.tk/5164
describes how the old function (mis)behaves, and with FNV that part is
unremarkable. (There should be sets of keys which are a problem for FNV
too, but I've not constructed any. I have found[*] a reference to some 
code for generating collisions against FNV, <URL:http://www.team5150.com
/~andrew/blog/2007/03/hash_algorithm_attacks.html>, but I'm not sure
about the practicality of any attack based on the methods described
there. I also don't know what the effect is of hashing over utf-8
instead of something 8-bit clean.)

> - For small hash tables, only a few bits of the computed  hash>      values are used. For the initial size, these are only  the last two>      bits. By every hash table resize (RebuildTable) two  more bits are>      added. It is not clear how the properties of hash functions>      propagate, when only a small fraction (e.g. 2, 4, 6 etc. bits) of>      the computed hash value are actually used.

Tends not to be an issue because if anyone tries to exploit, the table
gets rebuilt and more bits of the hash are used. A way to defeat the
whole problem would be to take the whole hash value mod some number that
is coprime with the hash table size, but mod is a fairly expensive
operation. It's only really a pressing issue when people are generating
many keys with an identical full hash value (by default, Tcl does a
comparison of the full hashcode before proceeding to a key comparison).
As noted before, that's really depressingly easy with Tcl's old hash
function.

> It looks to me as if the original Tcl hash function has quite> good properties for the task it is used.

So long as people are not using it in a setting where an attacker
controls the keys, that's true. If an attacker can dictate what values
are being hashed (true in a number of cases like HTTP header parsing)
then you've got a significant performance problem.

> In order to make hashing faster in Tcl, it is most likely more> promising to look at the Tcl-hash machinery altogether.

That's a swamp I've preferred to stay out of, because there are many
constraints due to the requirement to maintain high degrees of binary
compatibility. Reworking the hash code in depth is probably an activity
for 9.0 and not before.

The other alternative is Bob Jenkins's lookup3, but that's big, scary,
and needs to know the length of the data being hashed (which either
means a strlen() every time for strings, or forces us to break our
binary-compatibility guarantees).

Donal.
[* _This_ thread is one of the top 10 hits that I find when I google for
    "fnv hash collision generating", which is a bit self-referential. ]

begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread