Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Tom Jackson <tom....@gmail.com>

Thu, 18 Feb 2010 13:31:46 -0800

On Thu, Feb 18, 2010 at 1:23 AM, Donal K. Fellows
<dona...@manchester.ac.uk> wrote:
> On 17/02/2010 18:41, Tom Jackson wrote:>> The only way to justify using an array instead of a linear list {{key>> value} {key value}} is if you don't think it is important to maintain>> the data in the form it was sent. This is a valid argument. But note>> that in order to add a key, or lappend to an existing key, you have to>> search the array first. A large array will consume more resources>> during creation than a list containing the same data.>> I'm not criticising other parts of what they're doing. Just the process> of going from key to bucket chain to add to. Theirs is particularly lame> as it loads all keys that start with the same letter into the same> bucket (so trivial to attack it's unfunny) but it's not as lame as when> it was all just an alist. It was in older versions of Apache; adds might> have been quick, but lookups would have sucked...


I really wish I could figure out how anyone ever got the idea that
arrays are better than lists if the data is only used a few times.

Somehow you have to pay for the creation of the array. This makes
sense if you need continuous access to the same data over a long
period of time.

Another important point about Tcl arrays is that the history suggests
that they were designed as a method for mapping strings to opaque
structures (or pointers, whatever). The string-data to binary-data
mapping is one of the most important features of the Tcl language. So
you can't use a list to map strings to an opaque structure., because
lists are "values".

Somehow this important fact is being turned upside down. We are
focusing on the key and not the value.

Now, back to performance: You have to pay for the creation and
maintenance of the array. This is an up-front cost. If you have 10
minutes to visit Disneyland, your per-ride cost will be very high.

Enough talk. Here is a script which uses the FNV algorithm to create
an array, using the collide keys. After the script I have timings
(create the array with the keys, create a name-value list with the
keys). The result is that it takes so much time to create the array, I
have plenty of time left over to do very slow linear searches...and I
can do them case-insensitive. In addition I can recover the order of
the input list.  Why anyone thinks it is a good idea to spend time
destroying information is beyond me. But basically that is what any
list->array transform gives you.

# original collide proc
proc collide {level {accum ""} } {
    global a
    if {$level} {
        incr level -1
        collide $level "aj$accum"
        collide $level "ba$accum"
    } else {
        lappend a($accum) 1
    }
}

# new proc just generates keys
proc collideNames {level {accum ""} } {
    global names
    if {$level} {
        incr level -1
        collideNames $level "aj$accum"
        collideNames $level "ba$accum"
    } else {
        lappend names $accum
    }
}

# generate lots of keys
collideNames 15

set hostChoice [list host Host hosT hOst hoSt HoSt HOst hoST]
set valueIndex 0
set hostIndex 0

# create array using keys, note collide runs faster because it is a
compiled proc
# the loop here matches the following list creation loop.
puts "create host array: [time {

foreach name $names {
    lappend a($name) [incr valueIndex]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend a([lindex $hostChoice $hostIndex]) $valueIndex
    }
}
}]"

puts [array statistics a]

set valueIndex 0
set hostIndex 0

puts "create host list: [time {

foreach name $names {
    lappend headerList [list $name [incr valueIndex]]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend headerList [list [lindex $hostChoice $hostIndex] $valueIndex]
    }
}
}]"

puts "finding First Host: [time {set hosts [lsearch -inline -index 0
$headerList Host]}]"
puts "finding First Host nocase: [time {set hosts [lsearch -inline
-index 0 $headerList Host]}]"

puts "finding All Hosts: [time {set hosts [lsearch -inline -all -index
0 $headerList Host]}]"

puts "finding All Hosts nocase: [time {set hosts [lsearch -inline -all
-nocase -index 0 $headerList Host]}]"
puts $hosts

### End script

The only thing interesting here is the extra time required to create
the array. But lets not even get into the problems of searching for a
"host" header with no case:

% array names a -regexp (h|H)(o|O)(s|S)(t|T)
hosT hOst Host

% time {array names a -regexp (h|H)(o|O)(s|S)(t|T)}
120592 microseconds per iteration

Then you have to retrieve the values.

Here is what I get with a list representation:

create host array: 1025203 microseconds per iteration (wow! over one second)
32771 entries in table, 16384 buckets
number of buckets with 0 entries: 8342
number of buckets with 1 entries: 604
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 entries: 47
number of buckets with 11 entries: 15
number of buckets with 12 entries: 4
number of buckets with 14 entries: 1
number of buckets with 10000 or more entries: 0
average search distance for entry: 3.0

create host list: 381651 microseconds per iteration (2.5 times faster)

finding First Host: 1867 microseconds per iteration

finding First Host nocase: 1697 microseconds per iteration

finding All Hosts: 6618 microseconds per iteration

finding All Hosts nocase: 40670 microseconds per iteration

Host keys:
{Host 10007} {hosT 20014} {hOst 30021}

Somewhere the conventional wisdom about list vs. array access needs to
be discredited.

tom jackson

------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
[TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 07, 2010 09:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:40 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 11:21 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 08, 2010 06:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 09, 2010 01:47 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 02:42 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 01:05 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 09, 2010 03:48 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 09, 2010 11:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 09, 2010 04:15 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 10, 2010 12:38 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellström	Feb 14, 2010 10:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 03:20 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 10:02 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 15, 2010 08:55 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Lars Hellstr�m	Feb 15, 2010 11:53 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:01 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Eric Melski	Feb 09, 2010 11:35 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 11, 2010 12:08 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Alexandre Ferrieux	Feb 12, 2010 04:57 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 14, 2010 04:58 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 14, 2010 11:54 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 16, 2010 03:25 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 04:45 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:04 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Kristoffer Lawson	Feb 17, 2010 11:18 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 06:30 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:44 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 17, 2010 11:49 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 16, 2010 10:36 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	John Ousterhout	Feb 15, 2010 04:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 15, 2010 03:14 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Colin McCormack	Feb 16, 2010 01:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Larry McVoy	Feb 15, 2010 04:16 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 18, 2010 09:31 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 06:41 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 18, 2010 09:23 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Gustaf Neumann	Feb 18, 2010 12:23 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 17, 2010 01:36 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 17, 2010 11:56 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Tom Jackson	Feb 16, 2010 09:30 am
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Joe English	Feb 08, 2010 07:56 pm
Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm	Donal K. Fellows	Feb 08, 2010 09:56 pm

◄ Messages in this thread ►

Previous post: Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

Next post: [TCLCORE] SEH handling: why not use excpt.h on mingw?

Subscribe to the tcl-core RSS feed

Accounts

List Archives

Feedback & Information

ActiveState

© 2019 ActiveState Software Inc. All rights reserved. ActiveState®, Komodo®, ActiveState Perl Dev Kit®, ActiveState Tcl Dev Kit®, ActivePerl®, ActivePython®, and ActiveTcl® are registered trademarks of ActiveState. All other marks are property of their respective owners.