| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Tom Jackson <tom....@gmail.com>
Tue, 16 Feb 2010 17:36:59 -0800
On Tue, Feb 16, 2010 at 2:44 PM, Donal K. Fellows
<dona...@manchester.ac.uk> wrote:
> On 16/02/2010 18:30, Tom Jackson wrote:>>>> Try out my Tcl hash functions in my previous email. The problem is not>> the hash function, it is generally the method of transforming the hash>> value into a bucket index. The better performing hashes shown in your>> post are the result of a lot of randomness in the low order bits. This>> is to be expected of a hash which mixes all the bits in with the new>> bits.>>>> If the entire hash value was used, it could make a lot of sense to>> increase the sophistication of the hash function, but it would be much>> easier to just take a prime modulus of hash value just once. This>> would take into account all the bits.>> Actually, if we were using lookup3 then the entropy is distributed> pretty evenly across the entire key. It's a very sophisticated hash> function that's resistant to a lot of odd abuses.

No shit! The cost is extreme.

The problem here is that in order to get good distribution to all
bits, you have to use a near cryptographic quality hash. A
cryptographic hash can be easily truncated to generate near random
bucket indexes. This is the desired goal, but it is much more easily
achieved most of the time with a single modulus operation which takes
into account all bits. Which is easier: one modulus operation or a
near-cryptographic hash which can be safely truncated?

Usually the goal of cryptography is to force the attacker to perform a
huge number of calculations. The requirement of using a
near-cryptographic quality hash everywhere turns the tables on this
system.

The system only needs to be broken once. All that is required is a set
of keys (maybe a few thousand) which hash to the same value. After
that the entire algorithm is broken (assuming the goal is to poison
the bucket code with collisions). Assuming a 32 bit key-space, it
should be easy to find at least one set of 10000 strings which share
the same hash value.

Do I care to find such a set? NO! But this is a triviality for cryptographers.

Instead of focusing on making it hard for complete idiots like me to
generate 10000 keys which hash to the same value (while slowing down
all user code) why not notify application developers that they should
not blindly  trust unlimited input from external sources?

For instance, AOLserver does not hash header keys. But it also places
limits on the length of headers and the number of headers. If you use
either tcl arrays or tcl dicts, you instantly run into the problem of
multiple headers with the same field name and the same field name with
different cases (since HTTP headers are not case sensitive). Both of
these issues require interpretation prior to storage if you want to
use a tcl array or tcl dict.

It is nearly impossible to underestimate the problems with blindly
accepting user input to create Tcl variables or tcl array members.
Only total idiots would approve such code. You think this is harsh?
Consider the consequences. Somehow you have to explain to your boss
that the attacker outsmarted your hash function. Next thing you know
you can no longer use Tcl. Of course you were lying. The problem is
that you hoped the hash function would save you from your inability or
resistance to developing application level controls on user input.
Please give up such hopes. Please give up all hope of a generic
solution to accepting user data.

Never, every, blindly translate user input into Tcl data.

I know there is a huge bias against AOLserver, so I digress to Apache.
How does Apache handle headers?

>From http://thomas.eibner.dk/apache/table.html (Apache table API):

"Tables in Apache works almost like hashes in Perl. The main
difference between the two is that you can have two of the same keys
in the Apache table API. The table lookups are done in a
case-insensitive manner."

So for everyone who has some problem with AOLserver, maybe note that
headers in Apache are handled very similarly. More specifically they
are not handled as associative arrays or dictionaries.

Maybe it is possible that AOLserver and Apache both made the same
mistake. Of course it is necessary to identify the mistake. Why do
both APIs allow multiple case-insensitive keys?

The only reason I bring this up is that the only example used to
attack Tcl hash arrays is the mythical HTTP header issue. But since
HTTP headers should never be blindly stored in an array or dict, the
argument is moot. Nobody who has developed a commercial grade HTTP
server has used a simple associative array to store headers. For
anyone who has actually accessed HTTP headers, the reasons are all too
obvious.

All I can say is that the more I investigate the original Tcl hash,
the more my respect grows. It seems to perform better most of the
time.

tom jackson

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread