| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>
Mon, 15 Feb 2010 22:02:07 +0000
On 14/02/2010 22:56, Lars Hellström wrote:
>> For truly small tables, a Lisp-style alist would actually be cheaper.>> Not that they scale...>> That is exactly what is in the buckets, is it not?

Close, but not quite. Right now, we also keep the full hashcode so that
we can do a cheap pre-check on the equality before going to the full
equality test. Apart from that (and some housekeeping info), a linked
list of key-value pairs is exactly what we've got.

The net effect is that on any lookup (read or update) the hash function
is only invoked once, and it's not invoked at all on rebuild.

> Raises the question> of whether it might be a good idea to have hash tables start out with> exactly one bucket and skip computing hash keys until the first time> the table is rebuilt. (This adds a small constant time to all accesses,> and drops one larger constant and one key-size-proportional term for> accesses in small tables.)

But it complicates the code further and would require a different
default hash table size. Right now, we start with 4 cells (preallocated
in the hash table structure) so that we minimize the number of mallocs
for small tables.

> It's probably not possible as a change> under-the-hood of Tcl8's hash tables (e.g. hash entries cache hash> keys, if I read the code correctly), but might be of interest to those> who consider alternatives for e.g. dicts or Tcl9.

Further afield (9?) I want to do arrays with things other than hash
tables backing them up. Vectors (as in BLT) and databases (as in
metakit) are particularly interesting cases. (The main problem with this
is working out what to do about traces, but that's a story for another
time.) Once that mechanism is done, I'm sure we could try using balanced
trees or something like that for array storage. However, I'm not at all
convinced that the gain will be worth it for most applications. Let the
code prove itself on its own merit when that happens!

FWIW, the one thing that I wish had been more clearly enunciated in my
undergraduate data structures course (long time ago now!) was just how
effective hash tables really are. I suspect the lecturer just didn't
believe in them, and so focused on tree balancing and things like that.
But hashes are *so* good in practice and, as long as you have a good
enough hash function, are easy to implement correctly too.

> One very glaring way of looking at these flaws is that the old hash> function only distinguishes between 2551 different two-byte sequences,> despite there being 65536 of them; this is less than 4%.

True, but because we apply hashes to our (pseudo-)UTF-8 representation,
not all pairs are possible. (We use different hash algorithms for
standard pointers and structures, and always have.) I'm not sure what
the effect of this is, and I'm *definitely* not sure what effect this
has on the attack-prone-ness of the FNV hash.

> Conversely,> for *any* pair of neighbouring bytes of a key, there are on average 24> other pairs of bytes that could be substituted without changing the> value of the old hash function. Yet another way of looking at it> (though probably hard to formalise): the old hash function only keeps> about 3.3 bits of "randomness" for every 8 bits of key fed in to it!

I suspect that's not too bad for "nice" (i.e., normal) hash keys. It's
probably not far off the entropy level of both numbers and letters in
English words.

> That the old function multiplies by such a low number as 9 is why it's> not so good already at distinguishing between two-byte keys; the 2551> above arises as (255*9+255) - (0*9+0) + 1 -- maximal contribution from> two bytes minus minimal contribution from two bytes plus 1. Multiplying> by 1 instead (i.e., dropping that result<<3 term) would probably make> it even faster, but also an even poorer hash. ;-)

Everything's made more complex by the fact that we chop high bits off
the hash value when building the table indexes. That's known
(apparently) to be non-optimal for FNV. OTOH, we resize our hash tables
too, and then use more bits of the hash so the bits aren't really
ignored. There's not been that much study of the types of hash table
that Tcl uses and what the effects of the resizing algorithm are on
overall hash efficiency. (That may be because it's a sophisticated thing
to be doing - a lot of hash table implementations don't bother - and
complex to analyse too.)

Too many things wrong that can't be fixed without breaking compatibility
with lots of existing code.

Donal.

begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread