On 08/02/2010 06:40, Eric Melski wrote:
> I'm not much involved in the development of Tcl anymore, but shouldn't a> change of this magnitude require a TIP? The consequences of changing> this algorithm are significant, yet you present this work as a fait> accompli, with the implicit assumption that it is simply understood that> the existing Tcl hash is bad and in need of replacement. Do you have> data to support that conclusion? What are the specific scenarios in> which the existing hash is inadequate?
The primary scenarios involved were that the hash code, which is known
to be part of the critical path for a lot of Tcl programs, is both
slower than it might be (I'll return to the analysis of that later) and
particularly easy to exploit in nasty ways. In particular, it was
exceptionally simple to generate large numbers of strings with the
property that they have the same full hash code. This made it trivial to
exploit in situations such as processing HTTP headers; I can remember
discussing this in depth back at the Tcl conference in 2003 (in the
"Flintstones Cave" bar). While it is of course clear that there is no
way to prevent that completely without using a cryptographic hash
function (slow!) or extra per-hash randomization state (not compatible
with existing code) using a stronger hash function that performs more
mixing of the bits is a benefit, especially as table sizes increase.
The advantage of the FNV algorithm is that it does plenty of mixing and
uses a multiply to do it. That latter part is an advantage because it
turns out that on modern hardware (I've checked x86, SPARC and ARM[*])
there is a built-in multiply instruction, making the operation fast.
This means that there's no real advantage to using a shift-based
combiner; a multiply will do a pretty good job at a fraction of the
cost. (There are some issues with the fact that it doesn't mix the bits
downwards, but that's not a disaster with tables of non-trivial size.)
When I've measured the speed of the hashing, I've got these results:
Using FNV hashing instead of Tcl's traditional algorithm with this
code:
proc foo {} {
set x(foobar) [set x(barfoo) [set x(ballyhoo) 1]]
return
}
time { foo } 10000000
Over the long-term, that does 3 hash lookups of fairly short strings
(representative length?) per iteration. Specifically, it does one for
each array element set. (The overall lookup of the command name gets
cached in the 'foo' Tcl_Obj, the array name is a local variable and
so is handled by index, the entries are unset at the end of the run
without any hash lookups, and no traces are involved at all).
Old: 1.9507125463 microseconds per iteration
New: 1.5804703108 microseconds per iteration
That's a consistent ~19% percent faster. By comparison (the builds
I'm testing with aren't quite synchronized in all aspects) with this
code:
proc foo {} {set x [set y [set z 1]]; return}
time { foo } 10000000
That is, with identical operations except for using all local
variables (there's no penalty for any length of variable name at that
point; all go to LVT accesses) then I get these differences:
Old: 0.8757411546 microseconds per iteration
New: 0.9598806448000001 microseconds per iteration
Don't know why there is a slowdown there; it seems to be something
other than hashing. I get the pattern reproduced when I use other
tests too, for example this (which has more non-hash operations):
proc foo {} {for {set i 0} {$i<1000000} {incr i} {
set x(foobar) [set x(barfoo) [set x(ballyhoo) 1]]
}}
time foo
# Old:252616/New:233188 ms/iter
vs.
proc foo {} {for {set i 0} {$i<1000000} {incr i} {
set x [set y [set z 1]]
}}
time foo
# Old:102212/New:168359 ms/iter
If anything, that would lead me to conclude that the improvement in
hashing is better than the raw figures would suggest, but I can't
prove that.
All the above were tested with the default optimizing build on OSX.
> This statement does not give me any confidence either in your analysis> of the deficiencies of the existing hash or the advantages of the new> hash. At the very least, it seems you should be able to demonstrate the> superiority of FNV over the existing hash for the specific problem> scenarios that prompted you to change the hash in the first place.
See the above. As noted previously, there's no practical way right now
to prevent hash collision attacks at the moment while still maintaining
binary compatibility (I can't store a randomization factor in the hash
table because the structure size needs to remain the same as it
currently is in already-built extensions) but the FNV hash does perform
better in practice.
> Bob Jenkins has spent a lot of time thinking about and designing hash> algorithms. Did you look at his web page here:>> http://burtleburtle.net/bob/hash/index.html>> Or his article in Dr. Dobbs Journal, reprinted and expanded here:>> http://burtleburtle.net/bob/hash/doobs.html>> There are some (albeit brief) comments on FNV in the latter.
I did read that article prior to selecting the FNV algorithm. I'll admit
to having some concerns about the FNV hash function as used, but it does
offer some decent performance for short words (the Tcl case!) where we
also can't guarantee alignment. And the FNV hash has one huge advantage
over Bob's lookup3 hash: it's actually simple to implement (and to do so
correctly). Bob's is definitely *not* simple and requires scanning the
string twice when it is a normal zero-terminated C string (one of the
cases supported by Tcl's hash table system). Again, changing the
definition of a Tcl_HashEntry to work around that problem would break
binary compatibility (Tcl_GetHashKey is a macro in tcl.h, and
effectively fixes the size of a hash entry for the 8.* series).
The changes I've made are purely internal to Tcl. No public interface is
impacted at all. The only behavioural difference is an alteration to
iteration order, but that's the whole point. Moreover, robust code never
relied on iteration order anyway; it's *never* been guaranteed to be
constant[**] as the hash is manipulated.
It's probably a good idea for me to distil this into a TIP, even if just
an informational one. The hash algorithm is an implementation detail,
but it is an important detail that deserves to be documented.
Donal.
[* What other processor architectures do we care about in 2010? ]
[** Except for dictionaries, and they use a different mechanism that is
wholly independent. ]
begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard
------------------------------------------------------------------------------
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experience hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core