On 17/02/2010 01:36, Tom Jackson wrote:
> No shit! The cost is extreme.
I've been probing the cost of using various hash functions (I've done
the comparison of the classic algorithm against FNV and lookup3) when
used in Tcl and the attached script contains preliminary results for the
two keysets:
all dictionary words
numbers from 0 to 500k
The tests basically check the cost of doing 9 hash lookups per key;
array lookups are close to as pure a test of this as we can generate in
Tcl. (The cost of the "machinery" is about a third of the total time.)
The cost of FNV over these (presumably near normal) keys is 3–10% over
the classic algorithm. The cost of lookup3 is 10-30% over the classic
algorithm. Other key-sets might have other costs.
Note that the only thing that is changing between the runs is the
contents of the function TclHashObjKey in tclObj.c, and the change was
done by altering #ifdefs only. Everything else is just Tcl HEAD (plus a
compiled in lookup3.c; I'm not retyping *that* code!)
> The problem here is that in order to get good distribution to all> bits, you have to use a near cryptographic quality hash. A> cryptographic hash can be easily truncated to generate near random> bucket indexes. This is the desired goal, but it is much more easily> achieved most of the time with a single modulus operation which takes> into account all bits. Which is easier: one modulus operation or a> near-cryptographic hash which can be safely truncated?
Mod is quite an expensive operation, at least on this machine. Applying
a mod operation (mod a large prime) to the result of the classic hash
before returning nearly takes the cost of hashing up to equivalence with
FNV.
> The system only needs to be broken once. All that is required is a set> of keys (maybe a few thousand) which hash to the same value. After> that the entire algorithm is broken (assuming the goal is to poison> the bucket code with collisions). Assuming a 32 bit key-space, it> should be easy to find at least one set of 10000 strings which share> the same hash value.>> Do I care to find such a set? NO! But this is a triviality for> cryptographers.
Remember, cryptographers are mathematicians. They have a definition of
"trivial" which doesn't match that used by software engineers. I can
remember a number of acquaintances state that providing examples once an
existence proof was done was trivial. Can't say that I really bought
their argument... :-)
In terms of hash attacks, they only have real fangs (given that we don't
conflate hashcode equivalence with equality) if it is possible to make
either:
a) A large number of *short* keys with the same hashcode, or
b) A substantial number of keys with the same hashcode as some
particular key from a set not nominated by the attacker (typically
defined by the communication protocol)
The aim of the first one is to swamp the hash table with totally
irrelevant things. The aim of the second is to retard required lookups.
> I know there is a huge bias against AOLserver, so I digress to Apache.> How does Apache handle headers?>>> From http://thomas.eibner.dk/apache/table.html (Apache table API):>> "Tables in Apache works almost like hashes in Perl. The main> difference between the two is that you can have two of the same keys> in the Apache table API. The table lookups are done in a> case-insensitive manner.">> So for everyone who has some problem with AOLserver, maybe note that> headers in Apache are handled very similarly. More specifically they> are not handled as associative arrays or dictionaries.
Checking the source to Apache (strictly the APR) I see that they're
using hashing inside their implementation of tables. OK, they're using a
truly trivial hash, but I suppose it's better than what they had before
(a single linear list).
Check for yourself at:
http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c
> All I can say is that the more I investigate the original Tcl hash,> the more my respect grows. It seems to perform better most of the> time.
It's very very fast, but weak. With non-malicious inputs, its weakness
doesn't really matter all that much and its speed is nice. With
malicious inputs, which are easy to create, it's no better than a linear
list (plus some memory costs). The alternative hash functions that are
recommended in the literature (FNV, lookup3) are slower, but harder to
push into the ill-performing case (in the case of lookup3, it is
believed to require effort nearly equivalent to just scanning through
all possible inputs, which isn't a threat; FNV is faster but weaker).
Donal.
set f [open /usr/share/dict/words]
set words [split [read $f] \n]
close $f
for {set i 0} {$i<=500000} {incr i} {lappend numbers $i;set j " $i "}
proc hashtest list {
global a
foreach key $list {
set a($key) 1
}
set total 0
foreach key $list {
expr {
$a($key) + $a($key) + $a($key) + $a($key) +
$a($key) + $a($key) + $a($key) + $a($key)
}
}
return $total
}
proc hashdist list {
global a
catch {unset a}
foreach key $list {
set a($key) 1
}
puts [array statistics a]
array unset a *
}
hashdist $words
puts [time {hashtest $words} 5]
hashdist $numbers
puts [time {hashtest $numbers} 5]
if 0 {
Classic: {
234937 entries in table, 262144 buckets
number of buckets with 0 entries: 107652
number of buckets with 1 entries: 95174
number of buckets with 2 entries: 42648
number of buckets with 3 entries: 13018
number of buckets with 4 entries: 2949
number of buckets with 5 entries: 614
number of buckets with 6 entries: 77
number of buckets with 7 entries: 11
number of buckets with 8 entries: 1
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 0
average search distance for entry: 1.5
369312.3556 microseconds per iteration
500001 entries in table, 262144 buckets
number of buckets with 0 entries: 15311
number of buckets with 1 entries: 102375
number of buckets with 2 entries: 81077
number of buckets with 3 entries: 35085
number of buckets with 4 entries: 18459
number of buckets with 5 entries: 5181
number of buckets with 6 entries: 3238
number of buckets with 7 entries: 697
number of buckets with 8 entries: 476
number of buckets with 9 entries: 131
number of buckets with 10 or more entries: 114
average search distance for entry: 1.9
703947.1742 microseconds per iteration
}
FNV: {
234937 entries in table, 262144 buckets
number of buckets with 0 entries: 107077
number of buckets with 1 entries: 95715
number of buckets with 2 entries: 43045
number of buckets with 3 entries: 12844
number of buckets with 4 entries: 2830
number of buckets with 5 entries: 535
number of buckets with 6 entries: 84
number of buckets with 7 entries: 11
number of buckets with 8 entries: 3
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 0
average search distance for entry: 1.4
380998.4686 microseconds per iteration
500001 entries in table, 262144 buckets
number of buckets with 0 entries: 36550
number of buckets with 1 entries: 74485
number of buckets with 2 entries: 73121
number of buckets with 3 entries: 46636
number of buckets with 4 entries: 21084
number of buckets with 5 entries: 7384
number of buckets with 6 entries: 2214
number of buckets with 7 entries: 546
number of buckets with 8 entries: 112
number of buckets with 9 entries: 12
number of buckets with 10 or more entries: 0
average search distance for entry: 1.9
763999.966 microseconds per iteration
}
lookup3: {
234937 entries in table, 262144 buckets
number of buckets with 0 entries: 107080
number of buckets with 1 entries: 95815
number of buckets with 2 entries: 42869
number of buckets with 3 entries: 12837
number of buckets with 4 entries: 2930
number of buckets with 5 entries: 534
number of buckets with 6 entries: 71
number of buckets with 7 entries: 7
number of buckets with 8 entries: 1
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 0
average search distance for entry: 1.4
397791.67579999997 microseconds per iteration
500001 entries in table, 262144 buckets
number of buckets with 0 entries: 38954
number of buckets with 1 entries: 74118
number of buckets with 2 entries: 71067
number of buckets with 3 entries: 44755
number of buckets with 4 entries: 21524
number of buckets with 5 entries: 8171
number of buckets with 6 entries: 2629
number of buckets with 7 entries: 701
number of buckets with 8 entries: 183
number of buckets with 9 entries: 33
number of buckets with 10 or more entries: 9
average search distance for entry: 2.0
916809.2008 microseconds per iteration
}
}
begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard
------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core