| Store | Cart

Re: [TCLCORE] Upgrade of Tcl's String Hash Algorithm

From: Tom Jackson <tom....@gmail.com>
Thu, 18 Feb 2010 13:31:46 -0800
On Thu, Feb 18, 2010 at 1:23 AM, Donal K. Fellows
<dona...@manchester.ac.uk> wrote:
> On 17/02/2010 18:41, Tom Jackson wrote:>> The only way to justify using an array instead of a linear list {{key>> value} {key value}} is if you don't think it is important to maintain>> the data in the form it was sent. This is a valid argument. But note>> that in order to add a key, or lappend to an existing key, you have to>> search the array first. A large array will consume more resources>> during creation than a list containing the same data.>> I'm not criticising other parts of what they're doing. Just the process> of going from key to bucket chain to add to. Theirs is particularly lame> as it loads all keys that start with the same letter into the same> bucket (so trivial to attack it's unfunny) but it's not as lame as when> it was all just an alist. It was in older versions of Apache; adds might> have been quick, but lookups would have sucked...


I really wish I could figure out how anyone ever got the idea that
arrays are better than lists if the data is only used a few times.

Somehow you have to pay for the creation of the array. This makes
sense if you need continuous access to the same data over a long
period of time.

Another important point about Tcl arrays is that the history suggests
that they were designed as a method for mapping strings to opaque
structures (or pointers, whatever). The string-data to binary-data
mapping is one of the most important features of the Tcl language. So
you can't use a list to map strings to an opaque structure., because
lists are "values".

Somehow this important fact is being turned upside down. We are
focusing on the key and not the value.

Now, back to performance: You have to pay for the creation and
maintenance of the array. This is an up-front cost. If you have 10
minutes to visit Disneyland, your per-ride cost will be very high.

Enough talk. Here is a script which uses the FNV algorithm to create
an array, using the collide keys. After the script I have timings
(create the array with the keys, create a name-value list with the
keys). The result is that it takes so much time to create the array, I
have plenty of time left over to do very slow linear searches...and I
can do them case-insensitive. In addition I can recover the order of
the input list.  Why anyone thinks it is a good idea to spend time
destroying information is beyond me. But basically that is what any
list->array transform gives you.

# original collide proc
proc collide {level {accum ""} } {
    global a
    if {$level} {
        incr level -1
        collide $level "aj$accum"
        collide $level "ba$accum"
    } else {
        lappend a($accum) 1
    }
}

# new proc just generates keys
proc collideNames {level {accum ""} } {
    global names
    if {$level} {
        incr level -1
        collideNames $level "aj$accum"
        collideNames $level "ba$accum"
    } else {
        lappend names $accum
    }
}

# generate lots of keys
collideNames 15

set hostChoice [list host Host hosT hOst hoSt HoSt HOst hoST]
set valueIndex 0
set hostIndex 0

# create array using keys, note collide runs faster because it is a
compiled proc
# the loop here matches the following list creation loop.
puts "create host array: [time {

foreach name $names {
    lappend a($name) [incr valueIndex]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend a([lindex $hostChoice $hostIndex]) $valueIndex
    }
}
}]"

puts [array statistics a]

set valueIndex 0
set hostIndex 0

puts "create host list: [time {

foreach name $names {
    lappend headerList [list $name [incr valueIndex]]
    if {!($valueIndex % 10007)} {
	incr hostIndex
	lappend headerList [list [lindex $hostChoice $hostIndex] $valueIndex]
    }
}
}]"

puts "finding First Host: [time {set hosts [lsearch -inline -index 0
$headerList Host]}]"
puts "finding First Host nocase: [time {set hosts [lsearch -inline
-index 0 $headerList Host]}]"

puts "finding All Hosts: [time {set hosts [lsearch -inline -all -index
0 $headerList Host]}]"

puts "finding All Hosts nocase: [time {set hosts [lsearch -inline -all
-nocase -index 0 $headerList Host]}]"
puts $hosts

### End script

The only thing interesting here is the extra time required to create
the array. But lets not even get into the problems of searching for a
"host" header with no case:

% array names a -regexp (h|H)(o|O)(s|S)(t|T)
hosT hOst Host

% time {array names a -regexp (h|H)(o|O)(s|S)(t|T)}
120592 microseconds per iteration

Then you have to retrieve the values.

Here is what I get with a list representation:

create host array: 1025203 microseconds per iteration (wow! over one second)
32771 entries in table, 16384 buckets
number of buckets with 0 entries: 8342
number of buckets with 1 entries: 604
number of buckets with 2 entries: 1144
number of buckets with 3 entries: 1657
number of buckets with 4 entries: 1616
number of buckets with 5 entries: 1271
number of buckets with 6 entries: 848
number of buckets with 7 entries: 487
number of buckets with 8 entries: 237
number of buckets with 9 entries: 111
number of buckets with 10 entries: 47
number of buckets with 11 entries: 15
number of buckets with 12 entries: 4
number of buckets with 14 entries: 1
number of buckets with 10000 or more entries: 0
average search distance for entry: 3.0

create host list: 381651 microseconds per iteration (2.5 times faster)

finding First Host: 1867 microseconds per iteration

finding First Host nocase: 1697 microseconds per iteration

finding All Hosts: 6618 microseconds per iteration

finding All Hosts nocase: 40670 microseconds per iteration

Host keys:
{Host 10007} {hosT 20014} {hOst 30021}

Somewhere the conventional wisdom about list vs. array access needs to
be discredited.

tom jackson

------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Tcl-Core mailing list
Tcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tcl-core

Recent Messages in this Thread
Donal K. Fellows Feb 07, 2010 09:36 am
Eric Melski Feb 08, 2010 06:40 am
Donal K. Fellows Feb 08, 2010 11:21 am
Eric Melski Feb 08, 2010 06:45 pm
Donal K. Fellows Feb 09, 2010 01:47 am
Larry McVoy Feb 09, 2010 02:42 am
Colin McCormack Feb 09, 2010 01:05 pm
Larry McVoy Feb 09, 2010 03:48 pm
Colin McCormack Feb 09, 2010 11:14 pm
Gustaf Neumann Feb 09, 2010 04:15 pm
Donal K. Fellows Feb 10, 2010 12:38 am
Lars Hellström Feb 14, 2010 10:56 pm
Larry McVoy Feb 15, 2010 03:20 am
Donal K. Fellows Feb 15, 2010 10:02 pm
Tom Jackson Feb 15, 2010 08:55 pm
Lars Hellström Feb 15, 2010 11:53 pm
Donal K. Fellows Feb 16, 2010 11:56 am
Colin McCormack Feb 16, 2010 01:01 pm
Eric Melski Feb 09, 2010 11:35 pm
Alexandre Ferrieux Feb 11, 2010 12:08 am
Alexandre Ferrieux Feb 12, 2010 04:57 pm
Gustaf Neumann Feb 14, 2010 04:58 pm
Donal K. Fellows Feb 14, 2010 11:54 pm
Gustaf Neumann Feb 16, 2010 03:25 pm
Donal K. Fellows Feb 16, 2010 04:45 pm
Gustaf Neumann Feb 17, 2010 11:04 am
Kristoffer Lawson Feb 17, 2010 11:18 am
Tom Jackson Feb 16, 2010 06:30 pm
Donal K. Fellows Feb 16, 2010 10:44 pm
Gustaf Neumann Feb 17, 2010 11:49 am
Donal K. Fellows Feb 16, 2010 10:36 pm
John Ousterhout Feb 15, 2010 04:56 am
Donal K. Fellows Feb 15, 2010 03:14 pm
Colin McCormack Feb 16, 2010 01:16 pm
Larry McVoy Feb 15, 2010 04:16 pm
Tom Jackson Feb 18, 2010 09:31 pm
Tom Jackson Feb 17, 2010 06:41 pm
Donal K. Fellows Feb 18, 2010 09:23 am
Gustaf Neumann Feb 18, 2010 12:23 pm
Tom Jackson Feb 17, 2010 01:36 am
Donal K. Fellows Feb 17, 2010 11:56 am
Tom Jackson Feb 16, 2010 09:30 am
Joe English Feb 08, 2010 07:56 pm
Donal K. Fellows Feb 08, 2010 09:56 pm
Messages in this thread