Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Gustaf Neumann <neum...@wu-wien.ac.at>

Wed, 17 Feb 2010 12:04:56 +0100

Am 16.02.10 17:45, schrieb Donal K. Fellows:
>> How bad is this in practice? This "attack" is not more than a bad>> performance in a pathological case, or?> It's exactly a performance issue. The issue is exactly that it is far> too easy to trigger this worst-case behaviour. 
my point is: for an "attacker", it is still very easy to attack the runtime
behavior of the existing bucket distribution algorithm
with FNV (or some other  hash function).

Get http://alice.wu-wien.ac.at:8000/xowiki/download/file/keys
an try with a seeded FNV (i used 0x811c9dc5 as seed).

    set f [open /tmp/keys r]; set keys [read $f]; close $f
    foreach key [split $keys] { set x($key) 1}

 > 55917 entries in table, 65536 buckets> number of buckets with 0 entries: 65528> number of buckets with 1 entries: 4> number of buckets with 10 or more entries: 4> average search distance for entry: 6989.14

you see, most buckets are empty, the avg is horrible.
Certainly, other hash-functions need a different set
of keys for the "attack".
>> Concerning FNV: for string lengths less then two characters, FNV and>> the classic Tcl function return the same hash value (just a side>> note).> Only because I'd made a mistake. :-) The FNV algorithm uses a baseline> value which means that the hashcodes are different.
i should have spotted this as well! I fixed this on my implementation.
The overall effects are very little, the distribution pattern of the 
"collide"
test is not changed by using a seed value.
> If the hashcodes are the same, the problem happens. 
Of course. The bucket distribution algorithm based on least significant
bits can't provide an even distribution when hash values
are identical. (Tom, a modulo function does not help either
in this case).
> If the hashcodes are> only partially the same (to some number of LSBs) then small hash tables> have a problem, but things will go away over some size.
No, the problem won't go away, if "attacked" (just the likleyhood
for accidentially happening might go down). The point is, even if
a hash-algorithm would produce no duplicates (which is impossible),
one can get an strongly unbalanced bucket distribution
(as long the number of buckets is significant smaller than
the number of hash values). This kind of attack is not hard at all.

 > Switching to a wholly new map system has the disadvantage of needing> really heavy testing. .... Changing the hash function> has much smaller "engineering impact".

Of course. Especially, when going towards e.g. linear hashing
(using more sophisticated bucket splitting). On the other hand,
this area is well understood and is used in many systems today.

 > The main benefit of going to a balanced tree is that it provides a> strong upper bound on the cost of mapping a key to a value for any> particular size of map.

well, it changes worst-case linear to worst-case logarithmic.

 > Whether that's as low a cost as for Tcl's hash> tables in the normal case, well, I don't know. For a balanced binary> tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash> tables, we're talking 2 or 3 (by your figures).

guess, you got me wrong. The mentioned option was not to
replace hashing by trees in general (imho worst engineering option)
but to replace the linear search in the bucket list by a balanced
tree, such that a bucket list with 32k elements requires instead
of 16k comparisons ~15 on average.

This approach would require less engineering than going towards e.g.
linear hashing. It would be a remedy (safety net) against all
mentioned hash attacks.

 > As noted before, lookup3 is probably the best hash.

i would consider Murmur 2,0 as a serious contender.
It is simpler and apparently faster than lookp3 (also newer),
with excellent properties (see statistics at
http://murmurhash.googlepages.com/)

 > IIRC, some of the others make alignment assumptions> AIUI, murmur is one of the hashes with this problem

For lookup3 and murmur exist version for aligned and unaligned
data. Part of the length of the code of lookp3 is that it contains
essentially 3 versions (reading 32-bit chunks, 16-bit chunks and
8-bit chunks). The 32-bit chunk version can most probably
lead to unaligned access, when the keys are not on word boundaries.

     const uint32_t *k = (const uint32_t *)key;         /* read 32-bit 
chunks */

The exactly same is true for murmur and lookup3, both support
char and word chunks for input.

 > > PS: i have the suspicion that the avg search length reported by> > "array statistics" is not correct>> I've not had problems with (admittedly trivial) testing. You're> measuring at a point when there are about two times as many values as> buckets, so perfect loading will produce two values per bucket.

i would think, that the average search length per entry
should be calculated via
     sum(entries(bucket)^2)/numEntries
for unsuccessful searches, for successful searches the half
(every insert requires an unsuccessful search).

-gustaf neumann


------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
[TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 07, 2010 09:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:40 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 11:21 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 09, 2010 01:47 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 02:42 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 01:05 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 03:48 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 11:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 09, 2010 04:15 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 10, 2010 12:38 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellström	Feb 14, 2010 10:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 03:20 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 10:02 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 15, 2010 08:55 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellstr�m	Feb 15, 2010 11:53 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:01 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 09, 2010 11:35 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 11, 2010 12:08 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 12, 2010 04:57 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 14, 2010 04:58 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 14, 2010 11:54 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 16, 2010 03:25 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 04:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:04 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Kristoffer Lawson	Feb 17, 2010 11:18 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 06:30 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:44 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:49 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:36 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	John Ousterhout	Feb 15, 2010 04:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 03:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 04:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 18, 2010 09:31 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 06:41 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 18, 2010 09:23 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 18, 2010 12:23 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 01:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 17, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 09:30 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Joe English	Feb 08, 2010 07:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 09:56 pm

◄ Messages in this thread ►

Previous post: Re: [TCLCORE] Tcl_SetResult function signature

Next post: Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

Subscribe to the tcl-core RSS feed

Accounts

List Archives

Feedback & Information

ActiveState

© 2019 ActiveState Software Inc. All rights reserved. ActiveState®, Komodo®, ActiveState Perl Dev Kit®, ActiveState Tcl Dev Kit®, ActivePerl®, ActivePython®, and ActiveTcl® are registered trademarks of ActiveState. All other marks are property of their respective owners.