On 16.02.10 17:45, Donal K. Fellows wrote:
>> How bad is this in practice? This "attack" is not more than a bad
>> performance in a pathological case, or?
> It's exactly a performance issue. The issue is exactly that it is far
> too easy to trigger this worst-case behaviour.
My point is: for an "attacker" it is still very easy to attack the
runtime behavior of the existing bucket distribution algorithm
with FNV (or any other hash function).
Get http://alice.wu-wien.ac.at:8000/xowiki/download/file/keys
and try with a seeded FNV (I used 0x811c9dc5 as seed):
set f [open /tmp/keys r]; set keys [read $f]; close $f
foreach key [split $keys] { set x($key) 1 }
> 55917 entries in table, 65536 buckets
> number of buckets with 0 entries: 65528
> number of buckets with 1 entries: 4
> number of buckets with 10 or more entries: 4
> average search distance for entry: 6989.14
You see, most buckets are empty, and the average search distance is
horrible.
Certainly, other hash functions need a different set of keys for the
"attack".
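For concreteness, the seeded FNV I used is essentially the following
sketch (32-bit FNV-1a with the standard constants; the function name
is only illustrative, this is not the actual patch):

#include <stdint.h>
#include <stddef.h>

/* Seeded 32-bit FNV-1a: 0x811c9dc5 is the standard offset basis,
 * 0x01000193 (16777619) the FNV-32 prime. */
static uint32_t
Fnv1a32(const char *key, size_t len, uint32_t seed)
{
    uint32_t hash = seed;
    size_t i;

    for (i = 0; i < len; i++) {
        hash ^= (unsigned char) key[i];
        hash *= 0x01000193u;
    }
    return hash;
}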
>> Concerning FNV: for string lengths less than two characters, FNV and
>> the classic Tcl function return the same hash value (just a side
>> note).
> Only because I'd made a mistake. :-) The FNV algorithm uses a baseline
> value which means that the hashcodes are different.
I should have spotted this as well! I fixed it in my implementation.
The overall effect is very small; the distribution pattern of the
"collide" test does not change when a seed value is used.
> If the hashcodes are the same, the problem happens.
Of course. The bucket distribution algorithm based on least significant
bits can't provide an even distribution when the hash values
are identical. (Tom, a modulo function does not help in this
case either.)
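To make this concrete, the mapping I am talking about is essentially
the following (illustrative sketch, not Tcl's actual source; for a
power-of-two table, mask and modulo even compute the same index):

#include <stdint.h>

/* Map a hash value to one of nBuckets buckets (nBuckets a power of
 * two).  Identical hash values always land in the same bucket,
 * whether we mask the LSBs or take the modulo. */
static unsigned
BucketIndex(uint32_t hash, unsigned nBuckets)
{
    return hash & (nBuckets - 1);    /* == hash % nBuckets here */
}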
> If the hashcodes are
> only partially the same (to some number of LSBs) then small hash tables
> have a problem, but things will go away over some size.
No, the problem won't go away if "attacked" (only the likelihood of
it happening accidentally might go down). The point is: even if a
hash algorithm produced no duplicates (which is impossible), one can
still get a strongly unbalanced bucket distribution, as long as the
number of buckets is significantly smaller than the number of
possible hash values. This kind of attack is not hard at all.
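A sketch of the filtering step such an attack can use (hypothetical
helper): keep only keys whose hash value shows a fixed pattern in its
low bits; all surviving keys then share one bucket in any table with
up to 2^kBits buckets, although their full hash values are distinct.

#include <stdint.h>

/* Accept a key iff its hash has the given pattern in the low kBits
 * bits (kBits < 32); all accepted keys collide in tables with up to
 * (1 << kBits) buckets. */
static int
CollidesInSmallTables(uint32_t hash, unsigned kBits, uint32_t pattern)
{
    return (hash & ((1u << kBits) - 1)) == pattern;
}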
> Switching to a wholly new map system has the disadvantage of needing
> really heavy testing. .... Changing the hash function
> has much smaller "engineering impact".
Of course, especially when going towards e.g. linear hashing
(using more sophisticated bucket splitting). On the other hand,
this area is well understood and is used in many systems today.
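For illustration, the address computation of linear hashing looks
roughly like this (a textbook sketch, not a proposal for concrete
code); buckets are split one at a time, and level/next record how far
the current splitting round has progressed:

#include <stdint.h>

/* Linear hashing address function: use h mod 2^level, unless that
 * bucket was already split in this round, then use h mod 2^(level+1). */
static unsigned
LinearHashAddress(uint32_t hash, unsigned level, unsigned next)
{
    unsigned idx = hash & ((1u << level) - 1);

    if (idx < next) {
        idx = hash & ((1u << (level + 1)) - 1);
    }
    return idx;
}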
> The main benefit of going to a balanced tree is that it provides a
> strong upper bound on the cost of mapping a key to a value for any
> particular size of map.
Well, it changes the worst case from linear to logarithmic.
> Whether that's as low a cost as for Tcl's hash
> tables in the normal case, well, I don't know. For a balanced binary
> tree with 32k elements, I'd expect ~15 comparisons per lookup. With hash
> tables, we're talking 2 or 3 (by your figures).
I guess you got me wrong. The mentioned option was not to replace
hashing by trees in general (imho the worst engineering option), but
to replace the linear search in the bucket list by a balanced tree,
such that a bucket list with 32k elements requires ~15 comparisons on
average instead of 16k.
This approach would require less engineering than going towards e.g.
linear hashing, and it would be a remedy (safety net) against all the
mentioned hash attacks.
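To make this option concrete, a bucket entry could look roughly like
this (purely illustrative names, not Tcl's actual Tcl_HashEntry):
ordering is by the full hash value first, with the key string as
tie-breaker.

#include <stdint.h>

/* Sketch of a bucket entry if the per-bucket chain were a balanced
 * (e.g. red-black) tree instead of a linked list. */
typedef struct TreeEntry {
    struct TreeEntry *left, *right, *parent;
    int red;                  /* red-black colour bit */
    uint32_t fullHash;        /* primary comparison key */
    const char *key;          /* tie-breaker via strcmp() */
    void *clientData;
} TreeEntry;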
> As noted before, lookup3 is probably the best hash.
I would consider MurmurHash 2.0 a serious contender. It is simpler
and apparently faster than lookup3 (and also newer), with excellent
statistical properties (see the statistics at
http://murmurhash.googlepages.com/).
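From memory, the core of MurmurHash 2.0 is essentially the following
(check Appleby's published source before trusting exact hash values;
note the direct 32-bit loads, which relate to the alignment point
below):

#include <stdint.h>

/* MurmurHash2 core, reproduced from memory as a sketch. */
static uint32_t
Murmur2(const void *key, int len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t) len;
    const unsigned char *data = (const unsigned char *) key;

    while (len >= 4) {
        uint32_t k = *(const uint32_t *) data;  /* unaligned on some CPUs */

        k *= m; k ^= k >> r; k *= m;
        h *= m; h ^= k;
        data += 4; len -= 4;
    }
    switch (len) {                   /* trailing 1-3 bytes */
    case 3: h ^= data[2] << 16;      /* fall through */
    case 2: h ^= data[1] << 8;       /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }
    h ^= h >> 13; h *= m; h ^= h >> 15;
    return h;
}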
> IIRC, some of the others make alignment assumptions
> AIUI, murmur is one of the hashes with this problem
For both lookup3 and murmur, versions exist for aligned and unaligned
data. Part of the length of the lookup3 code is that it contains
essentially three versions (reading 32-bit chunks, 16-bit chunks, and
8-bit chunks). The 32-bit chunk version can most probably lead to
unaligned access when the keys are not on word boundaries:

const uint32_t *k = (const uint32_t *)key;  /* read 32-bit chunks */

Exactly the same is true for murmur: both murmur and lookup3 support
char and word chunks for their input.
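The usual portable workaround is to read the chunks via memcpy, which
compilers turn into a single load where unaligned access is safe and
into byte loads elsewhere (a sketch):

#include <stdint.h>
#include <string.h>

/* Read a 32-bit chunk without assuming word alignment. */
static uint32_t
Load32(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}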
>> PS: i have the suspicion that the avg search length reported by
>> "array statistics" is not correct
> I've not had problems with (admittedly trivial) testing. You're
> measuring at a point when there are about two times as many values as
> buckets, so perfect loading will produce two values per bucket.
I would think that the average search length per entry should be
calculated for unsuccessful searches as

sum(entries(bucket)^2) / numEntries

and for successful searches as half of that (every insert requires an
unsuccessful search).
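In code, the calculation I have in mind would be something like this
(hypothetical helper, computing the unsuccessful-search average from
per-bucket entry counts):

/* Average unsuccessful search length: an unsuccessful search landing
 * in a bucket with n entries scans all n of them, and a bucket is hit
 * with probability proportional to its entry count. */
static double
AvgUnsuccessfulSearch(const unsigned *count, unsigned nBuckets,
                      unsigned nEntries)
{
    double sum = 0.0;
    unsigned i;

    for (i = 0; i < nBuckets; i++) {
        sum += (double) count[i] * count[i];
    }
    return sum / nEntries;
}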
-gustaf neumann