| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>
Mon, 08 Feb 2010 11:21:26 +0000
On 08/02/2010 06:40, Eric Melski wrote:
> I'm not much involved in the development of Tcl anymore, but shouldn't a> change of this magnitude require a TIP?  The consequences of changing> this algorithm are significant, yet you present this work as a fait> accompli, with the implicit assumption that it is simply understood that> the existing Tcl hash is bad and in need of replacement.  Do you have> data to support that conclusion?  What are the specific scenarios in> which the existing hash is inadequate?

The primary scenarios involved were that the hash code, which is known
to be part of the critical path for a lot of Tcl programs, is both
slower than it might be (I'll return to the analysis of that later) and
particularly easy to exploit in nasty ways. In particular, it was
exceptionally simple to generate large numbers of strings with the
property that they have the same full hash code. This made it trivial to
exploit in situations such as processing HTTP headers; I can remember
discussing this in depth back at the Tcl conference in 2003 (in the
"Flintstones Cave" bar). While it is of course clear that there is no
way to prevent that completely without using a cryptographic hash
function (slow!) or extra per-hash randomization state (not compatible
with existing code) using a stronger hash function that performs more
mixing of the bits is a benefit, especially as table sizes increase.

The advantage of the FNV algorithm is that it does plenty of mixing and
uses a multiply to do it. That latter part is an advantage because it
turns out that on modern hardware (I've checked x86, SPARC and ARM[*])
there is a built-in multiply instruction, making the operation fast.
This means that there's no real advantage to using a shift-based
combiner; a multiply will do a pretty good job at a fraction of the
cost. (There are some issues with the fact that it doesn't mix the bits
downwards, but that's not a disaster with tables of non-trivial size.)
When I've measured the speed of the hashing, I've got these results:

   Using FNV hashing instead of Tcl's traditional algorithm with this
   code:

     proc foo {} {
         set x(foobar) [set x(barfoo) [set x(ballyhoo) 1]]
         return
     }
     time { foo } 10000000

   Over the long-term, that does 3 hash lookups of fairly short strings
   (representative length?) per iteration. Specifically, it does one for
   each array element set. (The overall lookup of the command name gets
   cached in the 'foo' Tcl_Obj, the array name is a local variable and
   so is handled by index, the entries are unset at the end of the run
   without any hash lookups, and no traces are involved at all).

     Old: 1.9507125463 microseconds per iteration
     New: 1.5804703108 microseconds per iteration

   That's a consistent ~19% percent faster. By comparison (the builds
   I'm testing with aren't quite synchronized in all aspects) with this
   code:

     proc foo {} {set x [set y [set z 1]]; return}
     time { foo } 10000000

   That is, with identical operations except for using all local
   variables (there's no penalty for any length of variable name at that
   point; all go to LVT accesses) then I get these differences:

     Old: 0.8757411546 microseconds per iteration
     New: 0.9598806448000001 microseconds per iteration

   Don't know why there is a slowdown there; it seems to be something
   other than hashing. I get the pattern reproduced when I use other
   tests too, for example this (which has more non-hash operations):

     proc foo {} {for {set i 0} {$i<1000000} {incr i} {
        set x(foobar) [set x(barfoo) [set x(ballyhoo) 1]]
     }}
     time foo
     # Old:252616/New:233188 ms/iter

   vs.

     proc foo {} {for {set i 0} {$i<1000000} {incr i} {
        set x [set y [set z 1]]
     }}
     time foo
     # Old:102212/New:168359 ms/iter

   If anything, that would lead me to conclude that the improvement in
   hashing is better than the raw figures would suggest, but I can't
   prove that.

All the above were tested with the default optimizing build on OSX.

> This statement does not give me any confidence either in your analysis> of the deficiencies of the existing hash or the advantages of the new> hash.  At the very least, it seems you should be able to demonstrate the> superiority of FNV over the existing hash for the specific problem> scenarios that prompted you to change the hash in the first place.

See the above. As noted previously, there's no practical way right now
to prevent hash collision attacks at the moment while still maintaining
binary compatibility (I can't store a randomization factor in the hash
table because the structure size needs to remain the same as it
currently is in already-built extensions) but the FNV hash does perform
better in practice.

> Bob Jenkins has spent a lot of time thinking about and designing hash> algorithms.  Did you look at his web page here:>> 	http://burtleburtle.net/bob/hash/index.html>> Or his article in Dr. Dobbs Journal, reprinted and expanded here:>> 	http://burtleburtle.net/bob/hash/doobs.html>> There are some (albeit brief) comments on FNV in the latter.

I did read that article prior to selecting the FNV algorithm. I'll admit
to having some concerns about the FNV hash function as used, but it does
offer some decent performance for short words (the Tcl case!) where we
also can't guarantee alignment. And the FNV hash has one huge advantage
over Bob's lookup3 hash: it's actually simple to implement (and to do so
correctly). Bob's is definitely *not* simple and requires scanning the
string twice when it is a normal zero-terminated C string (one of the
cases supported by Tcl's hash table system). Again, changing the
definition of a Tcl_HashEntry to work around that problem would break
binary compatibility (Tcl_GetHashKey is a macro in tcl.h, and
effectively fixes the size of a hash entry for the 8.* series).

The changes I've made are purely internal to Tcl. No public interface is
impacted at all. The only behavioural difference is an alteration to
iteration order, but that's the whole point. Moreover, robust code never
relied on iteration order anyway; it's *never* been guaranteed to be
constant[**] as the hash is manipulated.

It's probably a good idea for me to distil this into a TIP, even if just
an informational one. The hash algorithm is an implementation detail,
but it is an important detail that deserves to be documented.

Donal.
[* What other processor architectures do we care about in 2010? ]
[** Except for dictionaries, and they use a different mechanism that is
     wholly independent. ]

begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experience hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread