| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Gustaf Neumann <neum...@wu-wien.ac.at>
Wed, 17 Feb 2010 12:04:56 +0100
Am 16.02.10 17:45, schrieb Donal K. Fellows:
>> How bad is this in practice? This "attack" is not more than a bad>> performance in a pathological case, or?> It's exactly a performance issue. The issue is exactly that it is far> too easy to trigger this worst-case behaviour. 
my point is: for an "attacker", it is still very easy to attack the runtime
behavior of the existing bucket distribution algorithm
with FNV (or some other  hash function).

Get http://alice.wu-wien.ac.at:8000/xowiki/download/file/keys
an try with a seeded FNV (i used 0x811c9dc5 as seed).

    set f [open /tmp/keys r]; set keys [read $f]; close $f
    foreach key [split $keys] { set x($key) 1}

 > 55917 entries in table, 65536 buckets> number of buckets with 0 entries: 65528> number of buckets with 1 entries: 4> number of buckets with 10 or more entries: 4> average search distance for entry: 6989.14

you see, most buckets are empty, the avg is horrible.
Certainly, other hash-functions need a different set
of keys for the "attack".
>> Concerning FNV: for string lengths less then two characters, FNV and>> the classic Tcl function return the same hash value (just a side>> note).> Only because I'd made a mistake. :-) The FNV algorithm uses a baseline> value which means that the hashcodes are different.
i should have spotted this as well! I fixed this on my implementation.
The overall effects are very little, the distribution pattern of the 
"collide"
test is not changed by using a seed value.
> If the hashcodes are the same, the problem happens. 
Of course. The bucket distribution algorithm based on least significant
bits can't provide an even distribution when hash values
are identical. (Tom, a modulo function does not help either
in this case).
> If the hashcodes are> only partially the same (to some number of LSBs) then small hash tables> have a problem, but things will go away over some size.
No, the problem won't go away, if "attacked" (just the likleyhood
for accidentially happening might go down). The point is, even if
a hash-algorithm would produce no duplicates (which is impossible),
one can get an strongly unbalanced bucket distribution
(as long the number of buckets is significant smaller than
the number of hash values). This kind of attack is not hard at all.

 > Switching to a wholly new map system has the disadvantage of needing> really heavy testing. .... Changing the hash function> has much smaller "engineering impact".

Of course. Especially, when going towards e.g. linear hashing
(using more sophisticated bucket splitting). On the other hand,
this area is well understood and is used in many systems today.

 > The main benefit of going to a balanced tree is that it provides a> strong upper bound on the cost of mapping a key to a value for any> particular size of map.

well, it changes worst-case linear to worst-case logarithmic.

 > Whether that's as low a cost as for Tcl's hash> tables in the normal case, well, I don't know. For a balanced binary> tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash> tables, we're talking 2 or 3 (by your figures).

guess, you got me wrong. The mentioned option was not to
replace hashing by trees in general (imho worst engineering option)
but to replace the linear search in the bucket list by a balanced
tree, such that a bucket list with 32k elements requires instead
of 16k comparisons ~15 on average.

This approach would require less engineering than going towards e.g.
linear hashing. It would be a remedy (safety net) against all
mentioned hash attacks.

 > As noted before, lookup3 is probably the best hash.

i would consider Murmur 2,0 as a serious contender.
It is simpler and apparently faster than lookp3 (also newer),
with excellent properties (see statistics at
http://murmurhash.googlepages.com/)

 > IIRC, some of the others make alignment assumptions> AIUI, murmur is one of the hashes with this problem

For lookup3 and murmur exist version for aligned and unaligned
data. Part of the length of the code of lookp3 is that it contains
essentially 3 versions (reading 32-bit chunks, 16-bit chunks and
8-bit chunks). The 32-bit chunk version can most probably
lead to unaligned access, when the keys are not on word boundaries.

     const uint32_t *k = (const uint32_t *)key;         /* read 32-bit 
chunks */

The exactly same is true for murmur and lookup3, both support
char and word chunks for input.

 > > PS: i have the suspicion that the avg search length reported by> > "array statistics" is not correct>> I've not had problems with (admittedly trivial) testing. You're> measuring at a point when there are about two times as many values as> buckets, so perfect loading will produce two values per bucket.

i would think, that the average search length per entry
should be calculated via
     sum(entries(bucket)^2)/numEntries
for unsuccessful searches, for successful searches the half
(every insert requires an unsuccessful search).

-gustaf neumann


------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread