| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>
Wed, 17 Feb 2010 11:56:45 +0000
On 17/02/2010 01:36, Tom Jackson wrote:
> No shit! The cost is extreme.

I've been probing the cost of using various hash functions (I've done
the comparison of the classic algorithm against FNV and lookup3) when
used in Tcl and the attached script contains preliminary results for the
two keysets:
   all dictionary words
   numbers from 0 to 500k
The tests basically check the cost of doing 9 hash lookups per key;
array lookups are close to as pure a test of this as we can generate in
Tcl. (The cost of the "machinery" is about a third of the total time.)

The cost of FNV over these (presumably near normal) keys is 3–10% over
the classic algorithm. The cost of lookup3 is 10-30% over the classic
algorithm. Other key-sets might have other costs.

Note that the only thing that is changing between the runs is the
contents of the function TclHashObjKey in tclObj.c, and the change was
done by altering #ifdefs only. Everything else is just Tcl HEAD (plus a
compiled in lookup3.c; I'm not retyping *that* code!)

> The problem here is that in order to get good distribution to all> bits, you have to use a near cryptographic quality hash. A> cryptographic hash can be easily truncated to generate near random> bucket indexes. This is the desired goal, but it is much more easily> achieved most of the time with a single modulus operation which takes> into account all bits. Which is easier: one modulus operation or a> near-cryptographic hash which can be safely truncated?

Mod is quite an expensive operation, at least on this machine. Applying
a mod operation (mod a large prime) to the result of the classic hash
before returning nearly takes the cost of hashing up to equivalence with
FNV.

> The system only needs to be broken once. All that is required is a set> of keys (maybe a few thousand) which hash to the same value. After> that the entire algorithm is broken (assuming the goal is to poison> the bucket code with collisions). Assuming a 32 bit key-space, it> should be easy to find at least one set of 10000 strings which share> the same hash value.>> Do I care to find such a set? NO! But this is a triviality for> cryptographers.

Remember, cryptographers are mathematicians. They have a definition of
"trivial" which doesn't match that used by software engineers. I can
remember a number of acquaintances state that providing examples once an
existence proof was done was trivial. Can't say that I really bought
their argument... :-)

In terms of hash attacks, they only have real fangs (given that we don't
conflate hashcode equivalence with equality) if it is possible to make
either:

  a) A large number of *short* keys with the same hashcode, or
  b) A substantial number of keys with the same hashcode as some
     particular key from a set not nominated by the attacker (typically
     defined by the communication protocol)

The aim of the first one is to swamp the hash table with totally
irrelevant things. The aim of the second is to retard required lookups.

> I know there is a huge bias against AOLserver, so I digress to Apache.> How does Apache handle headers?>>> From http://thomas.eibner.dk/apache/table.html (Apache table API):>> "Tables in Apache works almost like hashes in Perl. The main> difference between the two is that you can have two of the same keys> in the Apache table API. The table lookups are done in a> case-insensitive manner.">> So for everyone who has some problem with AOLserver, maybe note that> headers in Apache are handled very similarly. More specifically they> are not handled as associative arrays or dictionaries.

Checking the source to Apache (strictly the APR) I see that they're
using hashing inside their implementation of tables. OK, they're using a
truly trivial hash, but I suppose it's better than what they had before
(a single linear list).

Check for yourself at:
   http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c

> All I can say is that the more I investigate the original Tcl hash,> the more my respect grows. It seems to perform better most of the> time.

It's very very fast, but weak. With non-malicious inputs, its weakness
doesn't really matter all that much and its speed is nice. With
malicious inputs, which are easy to create, it's no better than a linear
list (plus some memory costs). The alternative hash functions that are
recommended in the literature (FNV, lookup3) are slower, but harder to
push into the ill-performing case (in the case of lookup3, it is
believed to require effort nearly equivalent to just scanning through
all possible inputs, which isn't a threat; FNV is faster but weaker).

Donal.

set f [open /usr/share/dict/words]
set words [split [read $f] \n]
close $f

for {set i 0} {$i<=500000} {incr i} {lappend numbers $i;set j " $i "}

proc hashtest list {
    global a
    foreach key $list {
	set a($key) 1
    }
    set total 0
    foreach key $list {
	expr {
	    $a($key) + $a($key) + $a($key) + $a($key) +
	    $a($key) + $a($key) + $a($key) + $a($key)
	}
    }
    return $total
}

proc hashdist list {
    global a
    catch {unset a}
    foreach key $list {
	set a($key) 1
    }
    puts [array statistics a]
    array unset a *
}

hashdist $words
puts [time {hashtest $words} 5]
hashdist $numbers
puts [time {hashtest $numbers} 5]

if 0 {
    Classic: {
	234937 entries in table, 262144 buckets
	number of buckets with 0 entries: 107652
	number of buckets with 1 entries: 95174
	number of buckets with 2 entries: 42648
	number of buckets with 3 entries: 13018
	number of buckets with 4 entries: 2949
	number of buckets with 5 entries: 614
	number of buckets with 6 entries: 77
	number of buckets with 7 entries: 11
	number of buckets with 8 entries: 1
	number of buckets with 9 entries: 0
	number of buckets with 10 or more entries: 0
	average search distance for entry: 1.5
	369312.3556 microseconds per iteration
	500001 entries in table, 262144 buckets
	number of buckets with 0 entries: 15311
	number of buckets with 1 entries: 102375
	number of buckets with 2 entries: 81077
	number of buckets with 3 entries: 35085
	number of buckets with 4 entries: 18459
	number of buckets with 5 entries: 5181
	number of buckets with 6 entries: 3238
	number of buckets with 7 entries: 697
	number of buckets with 8 entries: 476
	number of buckets with 9 entries: 131
	number of buckets with 10 or more entries: 114
	average search distance for entry: 1.9
	703947.1742 microseconds per iteration
    }
    FNV: {
	234937 entries in table, 262144 buckets
	number of buckets with 0 entries: 107077
	number of buckets with 1 entries: 95715
	number of buckets with 2 entries: 43045
	number of buckets with 3 entries: 12844
	number of buckets with 4 entries: 2830
	number of buckets with 5 entries: 535
	number of buckets with 6 entries: 84
	number of buckets with 7 entries: 11
	number of buckets with 8 entries: 3
	number of buckets with 9 entries: 0
	number of buckets with 10 or more entries: 0
	average search distance for entry: 1.4
	380998.4686 microseconds per iteration
	500001 entries in table, 262144 buckets
	number of buckets with 0 entries: 36550
	number of buckets with 1 entries: 74485
	number of buckets with 2 entries: 73121
	number of buckets with 3 entries: 46636
	number of buckets with 4 entries: 21084
	number of buckets with 5 entries: 7384
	number of buckets with 6 entries: 2214
	number of buckets with 7 entries: 546
	number of buckets with 8 entries: 112
	number of buckets with 9 entries: 12
	number of buckets with 10 or more entries: 0
	average search distance for entry: 1.9
	763999.966 microseconds per iteration
    }
    lookup3: {
	234937 entries in table, 262144 buckets
	number of buckets with 0 entries: 107080
	number of buckets with 1 entries: 95815
	number of buckets with 2 entries: 42869
	number of buckets with 3 entries: 12837
	number of buckets with 4 entries: 2930
	number of buckets with 5 entries: 534
	number of buckets with 6 entries: 71
	number of buckets with 7 entries: 7
	number of buckets with 8 entries: 1
	number of buckets with 9 entries: 0
	number of buckets with 10 or more entries: 0
	average search distance for entry: 1.4
	397791.67579999997 microseconds per iteration
	500001 entries in table, 262144 buckets
	number of buckets with 0 entries: 38954
	number of buckets with 1 entries: 74118
	number of buckets with 2 entries: 71067
	number of buckets with 3 entries: 44755
	number of buckets with 4 entries: 21524
	number of buckets with 5 entries: 8171
	number of buckets with 6 entries: 2629
	number of buckets with 7 entries: 701
	number of buckets with 8 entries: 183
	number of buckets with 9 entries: 33
	number of buckets with 10 or more entries: 9
	average search distance for entry: 2.0
	916809.2008 microseconds per iteration
    }
}
begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread