
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Tom Jackson <tom....@gmail.com>
Mon, 15 Feb 2010 12:55:33 -0800
Hey,

As far as I can tell, this entire operation and the logic used to
justify it are totally messed up.

Unless I'm missing some additional code, one of the basic
justifications for this change is completely wrong.

That justification is the wiki entry which shows FNV somehow creating
a hash key distribution that outsmarts the test program "collide".
From my reading, the collide program was designed to find collisions.
Unfortunately, all the code seems to do is find X candidate strings
that share the same hash value and then create an array from them.

Guess what: this trick works for any hash function. As far as I can
tell it works against the new FNV hash, and against the djb hash,
which I also tried.
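
To make that concrete, here is a minimal sketch (mine, not the wiki's
program; the FNV constants are the commonly published 32-bit ones and
the bucket count is invented) of the generic trick: once any hash is
reduced to a small number of buckets, plain enumeration collects as
many same-bucket keys as you want. Swap in djb or the old
multiply-by-9 hash and the search works exactly the same way; with
more enumeration you can insist on identical full hash values rather
than just identical buckets.

    #include <stdio.h>

    #define NBUCKETS 1024    /* pretend the table has 1024 buckets */
    #define TARGET   0       /* the bucket we want to flood */

    /* 32-bit FNV-1 core, using the published offset basis and prime. */
    static unsigned int fnv(const char *s)
    {
        unsigned int h = 2166136261u;
        for (; *s; s++) {
            h = (h * 16777619u) ^ (unsigned char) *s;
        }
        return h;
    }

    int main(void)
    {
        char key[5] = "aaaa";
        long found = 0;

        /* Enumerate every 4-letter lowercase key; count the ones that
         * land in the target bucket (expect roughly 26^4/1024 = 446). */
        for (key[0] = 'a'; key[0] <= 'z'; key[0]++)
            for (key[1] = 'a'; key[1] <= 'z'; key[1]++)
                for (key[2] = 'a'; key[2] <= 'z'; key[2]++)
                    for (key[3] = 'a'; key[3] <= 'z'; key[3]++)
                        if (fnv(key) % NBUCKETS == TARGET)
                            found++;

        printf("%ld keys land in bucket %d of %d\n",
               found, TARGET, NBUCKETS);
        return 0;
    }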

Then I made changes to the "collide" procedure. Apparently the result
is very sensitive to the data prepended to the string being hashed.

Regardless of these results, the best distribution seems to be
achieved with the djb hash:

    /* djb-style hash core (the xor variant): result = result*33 ^ c,
     * where string points at the key being hashed. */
    unsigned int result = 5381;
    int c;

    for (c = *string++; c; c = *string++) {
        result = ((result << 5) + result) ^ c;
    }

More interesting to me is why the similar function in tclLiteral.c was
not changed:

line 914:    result += (result<<3) + bytes[i];
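
For reference, the two string-hash cores being compared in this
thread look roughly like the sketch below. The first loop is the
classic Tcl form (the same result*9 + c recurrence as the
tclLiteral.c line above); the second is the 32-bit FNV core with the
constants published by Fowler/Noll/Vo. I have not checked whether the
committed patch uses FNV-1 (multiply, then xor) or FNV-1a (xor, then
multiply), so take the ordering as an assumption.

    /* Classic Tcl string hash: result = result*9 + c. */
    unsigned int hashClassic(const char *s)
    {
        unsigned int result = 0;
        for (; *s; s++) {
            result += (result << 3) + (unsigned char) *s;
        }
        return result;
    }

    /* 32-bit FNV-1: the same multiply-then-combine shape, but with a
     * larger multiplier, a non-zero starting value, and xor instead
     * of addition. */
    unsigned int hashFNV1(const char *s)
    {
        unsigned int result = 2166136261u;
        for (; *s; s++) {
            result = (result * 16777619u) ^ (unsigned char) *s;
        }
        return result;
    }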

But many words have been thrown around about cryptography, safety,
whatever. That really makes no sense here. The structure of the hash,
which is weighted toward low memory usage, leaves the door wide open
to collisions; the whole design creates that potential.

You could compare this to a secure hash which maps into an enormous
keyspace. The measure of security is based upon the difficulty of
finding a collision with a given string.

So basically all the pro-change arguments rest on work that has not
been shown. My tests show no difference between the hashes when the
collide proc is used as given, but if you change the collide proc,
the current hash seems to work very well.

For instance, change the collide proc to this:

proc collide {level {accum ""}} {
    global a
    if {$level} {
        incr level -1
        collide $level "jb$accum"
        collide $level "ba$accum"
    } else {
        set a($accum) 1
    }
}

One interesting change is that the run time goes way down, so maybe
there is something special about the original proc. Whatever that
special feature is, it does not reduce the security of the hash; for
that you would need a generic method for finding a collision with a
particular string.
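
One possible explanation, resting on an assumption I have not
verified about the wiki's original proc: if its two prepended blocks
were chosen to collide under the classic multiply-by-9 hash, every
generated key chains into the same bucket and insertion time blows
up; the "jb"/"ba" blocks above hash to different values under that
hash, so the keys spread out and the array builds quickly. A trivial
check of the two blocks:

    #include <stdio.h>

    /* Classic Tcl string hash: result = result*9 + c. */
    static unsigned int hashClassic(const char *s)
    {
        unsigned int result = 0;
        for (; *s; s++) {
            result += (result << 3) + (unsigned char) *s;
        }
        return result;
    }

    int main(void)
    {
        printf("jb -> %u\n", hashClassic("jb"));   /* prints 1052 */
        printf("ba -> %u\n", hashClassic("ba"));   /* prints 979  */
        return 0;
    }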

My recommendation: demonstrate a real vulnerability in the original
hash before introducing a fix. So far, no such vulnerability has been
demonstrated.

Basically, a cryptographic hash maps an effectively infinite keyspace
into a fixed but very large hashspace. If the hashspace is scaled
down to the number of keys actually in use, you can imagine that the
cryptographic property is lost, assuming it ever existed.
Cryptography is the modern equivalent of a leprechaun tying ribbons
around every tree in order to disguise the tree hiding the treasure;
scaled-down hash functions cut down most of the trees, making the
treasure much easier to find.

Regardless of any problems with the current hash, why does anyone
care if some idiot chooses an array to store HTTP headers? Should the
Tcl language protect everyone who doesn't realize that user data
needs to be validated, or at least not trusted?

Of course, I'm on record as being opposed to using an array to handle
HTTP headers; only the hard-headed would keep using arrays to
represent HTTP headers, or keep creating local variables
automatically from user data. Instead, what you do is create a
representation of the user data and then query it using known keys.
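
As a hedged sketch of that last point (the function names are mine,
not from any real extension; it only uses the public dict API that
has been in Tcl's C headers since 8.5): keep the user-supplied
headers as data inside a dict value and only ever look up names the
application already knows, so hostile input never mints array
elements or variables.

    #include <tcl.h>

    /* Usage: Tcl_Obj *headers = Tcl_NewDictObj();
     * keep the dict unshared while filling it. */

    /* Store one header as plain data in the dict; nothing is created
     * in any namespace, so a hostile header name stays inert. */
    static void StoreHeader(Tcl_Obj *headers, const char *name,
                            const char *value)
    {
        Tcl_DictObjPut(NULL, headers, Tcl_NewStringObj(name, -1),
                       Tcl_NewStringObj(value, -1));
    }

    /* Query the representation with a key the application knows
     * about; returns NULL if the header was not supplied. */
    static const char *GetHeader(Tcl_Obj *headers, const char *name)
    {
        Tcl_Obj *key = Tcl_NewStringObj(name, -1);
        Tcl_Obj *value = NULL;

        Tcl_IncrRefCount(key);
        Tcl_DictObjGet(NULL, headers, key, &value);
        Tcl_DecrRefCount(key);
        return (value != NULL) ? Tcl_GetString(value) : NULL;
    }

(At the script level the same idea is simply a dict value queried
with [dict get] or [dict exists] for known names.)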

But of course the problem is that the core developers want everything:
unconscious security.

Many years ago I learned a simple lesson: everyone who came before me
knew what they were doing, and if I intend to change their work, I
must first understand the choices they made. I don't see that
understanding here.

tom jackson

On Mon, Feb 15, 2010 at 7:14 AM, Donal K. Fellows
<dona...@manchester.ac.uk> wrote:
> On 15/02/2010 04:56, John Ousterhout wrote:
>>
>> It's possible that the proposed new algorithm is better than the
>> original Tcl one; it looks like it might use a more complex
>> approach that tends to spread the information from each character
>> more quickly across all of the bits of the hash value, whereas
>> the multiply-by-9 approach only gradually does this.
>
> The original Tcl algorithm is not bad for most keys. It both keeps bits
> low down and spreads them up, and it uses operations that don't lose too
> much information (i.e., multiplication and addition). The FNV algorithm
> is remarkably similar in structure (so advantageous in terms of ease of
> transition) but uses a different multiplier that spreads the information
> more thoroughly and uses xor rather than addition (though we all know
> those are actually closely related operations).
>
> The main issue with the old hash algorithm was that it was too easy to
> attack it and turn hash table accesses from O(1) to O(n**2). A
> literature search (not very easy; this is not a heavily studied area)
> indicates that the FNV hash is quite a bit more resistant; attacks would
> require using much longer keys and would be more likely to hit some
> other limit.
>
> The other advantage of the FNV algorithm is that it is well-known,
> reasonably well-studied, and does a reasonably good job with producing
> keys for the space of "short words"; I'm not really the guy saying that
> it is quite good, a number of others are (URLs elsewhere in this
> thread). I don't know of anyone other than Tcl using the old hash
> algorithm (though related variants have been studied and found lacking).
>
> I'm tempted by Bob Jenkins's lookup3 hash, but it's big, complex, and
> really wants to know the length of the string first so implementing it
> efficiently for string keys is awkward[*]. We could use it for Tcl_Obj
> keys, but I'm currently keener on using the same algorithm for both
> types of user-visible keys. (I'm also a bit hesitant because of the
> complexity of lookup3; it's non-trivial to code.)
>
> [Good advice elided]
>>
>> So, if you decide to change, make sure you measure very carefully
>> on representative datasets, and make sure you've studied the
>> properties of hashing functions enough to understand exactly
>> how yours is working and why it is better.
>
> On non-malicious datasets, Tcl's original algorithm works pretty well.
> It's the case when someone's deliberately trying to provoke bad
> behaviour that concerned me. FNV seems to do better (and has reasonable
> distribution behaviour on the non-malicious datasets I've tried it with,
> such as the contents of /usr/share/dict/words and the first million
> cardinal numbers).
>
> As a side note, changing the hash has flushed out a small number of
> places in Tcl where we had code that actually did depend on the order of
> hashing. These places were bugs waiting to happen...
>
> Donal.
> [* Maintaining binary and source compatibility was a cast-iron
>    requirement for this change, so storing the length of a string key
>    was not possible; there's nowhere in a Tcl_HashEntry for it to go
>    and offsets of fields are baked into user code because of macros. ]

