Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Donal K. Fellows <dona...@manchester.ac.uk>

Mon, 15 Feb 2010 22:02:07 +0000

On 14/02/2010 22:56, Lars Hellström wrote:
>> For truly small tables, a Lisp-style alist would actually be cheaper.>> Not that they scale...>> That is exactly what is in the buckets, is it not?

Close, but not quite. Right now, we also keep the full hashcode so that
we can do a cheap pre-check on the equality before going to the full
equality test. Apart from that (and some housekeeping info), a linked
list of key-value pairs is exactly what we've got.

The net effect is that on any lookup (read or update) the hash function
is only invoked once, and it's not invoked at all on rebuild.

> Raises the question> of whether it might be a good idea to have hash tables start out with> exactly one bucket and skip computing hash keys until the first time> the table is rebuilt. (This adds a small constant time to all accesses,> and drops one larger constant and one key-size-proportional term for> accesses in small tables.)

But it complicates the code further and would require a different
default hash table size. Right now, we start with 4 cells (preallocated
in the hash table structure) so that we minimize the number of mallocs
for small tables.

> It's probably not possible as a change> under-the-hood of Tcl8's hash tables (e.g. hash entries cache hash> keys, if I read the code correctly), but might be of interest to those> who consider alternatives for e.g. dicts or Tcl9.

Further afield (9?) I want to do arrays with things other than hash
tables backing them up. Vectors (as in BLT) and databases (as in
metakit) are particularly interesting cases. (The main problem with this
is working out what to do about traces, but that's a story for another
time.) Once that mechanism is done, I'm sure we could try using balanced
trees or something like that for array storage. However, I'm not at all
convinced that the gain will be worth it for most applications. Let the
code prove itself on its own merit when that happens!

FWIW, the one thing that I wish had been more clearly enunciated in my
undergraduate data structures course (long time ago now!) was just how
effective hash tables really are. I suspect the lecturer just didn't
believe in them, and so focused on tree balancing and things like that.
But hashes are *so* good in practice and, as long as you have a good
enough hash function, are easy to implement correctly too.

> One very glaring way of looking at these flaws is that the old hash> function only distinguishes between 2551 different two-byte sequences,> despite there being 65536 of them; this is less than 4%.

True, but because we apply hashes to our (pseudo-)UTF-8 representation,
not all pairs are possible. (We use different hash algorithms for
standard pointers and structures, and always have.) I'm not sure what
the effect of this is, and I'm *definitely* not sure what effect this
has on the attack-prone-ness of the FNV hash.

> Conversely,> for *any* pair of neighbouring bytes of a key, there are on average 24> other pairs of bytes that could be substituted without changing the> value of the old hash function. Yet another way of looking at it> (though probably hard to formalise): the old hash function only keeps> about 3.3 bits of "randomness" for every 8 bits of key fed in to it!

I suspect that's not too bad for "nice" (i.e., normal) hash keys. It's
probably not far off the entropy level of both numbers and letters in
English words.

> That the old function multiplies by such a low number as 9 is why it's> not so good already at distinguishing between two-byte keys; the 2551> above arises as (255*9+255) - (0*9+0) + 1 -- maximal contribution from> two bytes minus minimal contribution from two bytes plus 1. Multiplying> by 1 instead (i.e., dropping that result<<3 term) would probably make> it even faster, but also an even poorer hash. ;-)

Everything's made more complex by the fact that we chop high bits off
the hash value when building the table indexes. That's known
(apparently) to be non-optimal for FNV. OTOH, we resize our hash tables
too, and then use more bits of the hash so the bits aren't really
ignored. There's not been that much study of the types of hash table
that Tcl uses and what the effects of the resizing algorithm are on
overall hash efficiency. (That may be because it's a sophisticated thing
to be doing - a lot of hash table implementations don't bother - and
complex to analyse too.)

Too many things wrong that can't be fixed without breaking compatibility
with lots of existing code.

Donal.

begin:vcard
fn:Donal K. Fellows
n:Fellows;Donal
org:The University of Manchester;Research Computing Services
adr:;;Oxford Road;Manchester;;M13 9PL;United Kingdom
email;internet:dona...@manchester.ac.uk
version:2.1
end:vcard

------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev

_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
[TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 07, 2010 09:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:40 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 11:21 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 09, 2010 01:47 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 02:42 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 01:05 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 03:48 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 11:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 09, 2010 04:15 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 10, 2010 12:38 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellström	Feb 14, 2010 10:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 03:20 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 10:02 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 15, 2010 08:55 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellstr�m	Feb 15, 2010 11:53 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:01 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 09, 2010 11:35 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 11, 2010 12:08 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 12, 2010 04:57 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 14, 2010 04:58 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 14, 2010 11:54 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 16, 2010 03:25 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 04:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:04 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Kristoffer Lawson	Feb 17, 2010 11:18 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 06:30 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:44 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:49 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:36 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	John Ousterhout	Feb 15, 2010 04:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 03:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 04:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 18, 2010 09:31 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 06:41 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 18, 2010 09:23 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 18, 2010 12:23 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 01:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 17, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 09:30 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Joe English	Feb 08, 2010 07:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 09:56 pm

◄ Messages in this thread ►

Previous post: Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

Next post: Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

Subscribe to the tcl-core RSS feed

Accounts

List Archives

Feedback & Information

ActiveState

© 2019 ActiveState Software Inc. All rights reserved. ActiveState®, Komodo®, ActiveState Perl Dev Kit®, ActiveState Tcl Dev Kit®, ActivePerl®, ActivePython®, and ActiveTcl® are registered trademarks of ActiveState. All other marks are property of their respective owners.