| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Tom Jackson <tom....@gmail.com>
Tue, 16 Feb 2010 01:30:41 -0800
2010/2/15 Lars Hellström <Lars...@residenset.net>:
> Tom Jackson skrev:>>>> Hey,>>>> As far as I can tell this entire operation and the logic used to>> justify it is totally messed up.>>>> Unless I'm missing some additional code, one of the basic>> justifications for this change is completely wrong.>>>> That justification is the wiki entry which shows FNV somehow creates a>> hash key distribution that outsmarts the test program "collide". From>> my reading of the collide program it was designed to find collisions.>> Unfortunately the code seems to find X candidates which have the same>> hash value and then create an array.>>>> Guess what! This trick works for any hash function. As far as  I can>> tell it works for the new FNV hash and one which I tried the djb hash.>> Yes, but it is extremely simple for the classic Tcl hash, because:>> 1. It has the property that if $x and $y are two equal length strings which> hash to the same 32-bit value, then $a$x$b and $a$y$b will be a pair of> strings which hash to the same value too, for any strings $a and $b.>> 2. There are very many pairs of strings as short as 2 characters which hash> to the same 32-bit value.>> FNV escapes 1 by using ^ to add in the characters, and 2 by using a> multiplier larger than 255. It's probably still possible to find strings $x> and $y such that FNV($x) = FNV($y), but I suspect they'd need to be of> length at least 5, and chances are pretty good that FNV($a$x) != FNV($a$y)> (naively I'd say 255/256, but there could be strings $x and $y which are> less sensitive to the prefix $a).>> With the FNV hash function, an HTTP header parser can to some extent> randomise the results of an attacker by picking a $salt prefix and using> $salt$header as hash entry for data that has to do with the the $header> prefix. With the classic hash function, that has no effect whatsoever; an> attack works just as well no matter what $salt is chosen.

Here is a short testing script which compares the hash function in Tcl
(internal) as well as three external hash functions.

I use the collide proc to generate the array keys and display the
stats for the internal bucket distribution.

Then the keys are fed into the other procs and the resultant hash
values are saved.

The hash values are reduced modulo some prime number close to the
number of buckets used by the internal hash array.

Here's what I noticed:

The original code (current Tcl hash code) is slow to compute [collide
15], almost three minutes. Just changing the internal hash function
sped this up two orders of magnitude: just a few seconds. I guess this
is the result of the collisions?

The internal bucket assignment is far worse of a problem than the hash
function itself. With the exception of hash collisions, which
obviously go into the same bucket no matter what,  the bucket
assignment code causes even more collisions.

The external hash functions perform similarly when buckets are
determined by their modulus some prime number. The current Tcl hash
function also does okay with powers of two, but FNV and the DJB hash
do not. When using the prime method, the distributions are all much
better than the internal bucket assignment.

The FNV hash, when used as the internal hash function somehow resists
most of the bucket assignment problems. The results are still worse
than the external FNV using a prime modulus.

Maybe there is a reason to go to all the trouble of finding a good
hash function only to throw away most of the randomness in the higher
order bits. A prime modulus would maintain this randomness as much as
possible.

# Note aj => ajr and ba => bas in collide:

proc collide {level {accum ""}} {
    global a
    if {$level} {
        incr level -1
        collide $level "ajr$accum"
        collide $level "bas$accum"
    } else {
        set a($accum) 1
    }
}
puts [time {collide 10}]

puts [array statistics a]

set names [array names a]

# Change modulus for different size arrays:
# Examples:
# 19 251 521 1021 2063 4889 10007 16567
namespace eval ::hash {
    variable modulus 1021
}

proc ::hash::hashString { string } {
    variable modulus
    set hash 5381
    foreach char [split $string ""] {
	scan $char %c char
	set hash [expr {(($hash  << 5) + $hash) ^ $char}]
	set hash [string trimleft [string range $hash end-12 end] 0]
    }
    return [list $string [expr {$hash % $modulus}] $hash]
}

proc ::hash::hashString2 { string } {

    set hash 0
    variable modulus
    foreach char [split $string ""] {
	scan $char %c char
	set hash [expr {($hash + ($hash << 3 ) + $char)}]
	set hash [string trimleft [string range $hash end-12 end] 0]
    }
    return [list $string [expr {$hash % $modulus}] $hash]
}

proc ::hash::hashString3 { string } {

    variable modulus
    set hashPrime 1000193
    set hash [expr 0x811c9dc5]
    foreach char [split $string ""] {
	scan $char %c char
	set hash [expr {($hash * $hashPrime ) ^ $char}]
	set hash [string trimleft [string range $hash end-12 end] 0]
    }

    return [list $string [expr {$hash % $modulus}] $hash]

}

foreach word $names {

    set hash1 [::hash::hashString $word]
    set hash2 [::hash::hashString2 $word]
    set hash3 [::hash::hashString3 $word]

    incr hashmod1([lindex $hash1 1])
    incr hashmod2([lindex $hash2 1])
    incr hashmod3([lindex $hash3 1])

}
set desc1 "DJB hash from cdb"
set desc2 "Current Tcl hash"
set desc3 "FNV Hash"

foreach array {1 2 3} {

    foreach {mod count} [array get hashmod$array] {
	incr bucket${array}($count)
    }
}

if {0} {
foreach array {1 2 3} {
    puts "hashmod$array [set desc$array]"
    foreach name [lsort -integer [array names hashmod$array]] {
	puts "$name => [set hashmod${array}($name)]"
    }
}
}

puts "Modulus $::hash::modulus (hash%modulus = bucket)"

foreach array {1 2 3} {
    puts "Bucket Fill hashmod$array [set desc$array]"
    set total 0
    set usedBuckets 0
    foreach name [lsort -integer [array names bucket$array]] {
	set buckets [expr {[set bucket${array}($name)] * $name}]
	incr total $buckets
	incr usedBuckets [set bucket${array}($name)]
	puts "buckets with $name = [set bucket${array}($name)]  total $total"
    }
    puts "buckets with 0 = [expr $::hash::modulus - $usedBuckets]\n\n"
}

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread