General purpose technique for counting comparisons in various searching and sorting applications.

Python, 74 lines

from random import shuffle, choice
from heapq import heapify, heappop, nsmallest

cmps = 0

class Int(int):
    def __lt__(self, other):
        global cmps
        cmps += 1
        return int.__lt__(self, other)
    def __le__(self, other):
        global cmps
        cmps += 1
        return int.__le__(self, other)

def count_cmps(f, data, k):
    'Count comparisons in a call to f(k, data)'
    global cmps
    data = data[:]
    shuffle(data)
    cmps = 0
    result = f(k, data)
    # Sanity check: data is a permutation of range(n), so the first
    # ten results must be 0 through 9 (assumes k >= 10).
    assert result[:10] == list(range(10))
    return cmps

# -------- variants of nsmallest -------

def heapifying_smallest(k, data):
    heapify(data)
    result = [heappop(data) for j in range(k)]
    data.extend(result)    # restore the popped values; leave the list intact
    return result

def select_nth(data, n):
    'Return the value that would be at index n if data were sorted'
    if len(data) == 1:
        return data[0]
    pivot = choice(data)
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem < pivot else rhs).append(elem)
    if len(lhs) >= n+1:
        return select_nth(lhs, n)
    else:
        return select_nth(rhs, n - len(lhs))

def selecting_smallest(k, data):
    pivot = select_nth(data, k)
    return sorted(elem for elem in data if elem <= pivot)[:k]

def partitioning_smallest(n, data):
    if len(data) <= 1:
        return data[:n]    # [:n] so that a request for 0 items yields []
    pivot = choice(data)
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem <= pivot else rhs).append(elem)
    if n < len(lhs):
        return partitioning_smallest(n, lhs)
    else:
        return sorted(lhs) + partitioning_smallest(n - len(lhs), rhs)


if __name__ == '__main__':
    # compare nsmallest implementations
    n, k = 100000, 100
    print('n: %d\tk: %d' % (n, k))
    data = list(map(Int, range(n)))
    for f in [nsmallest, heapifying_smallest,
             selecting_smallest, partitioning_smallest]:
        counts = sorted(count_cmps(f, data, k) for i in range(5))
        print(counts, f.__name__)

This benchmarking tool was created to show the relative performance of the standard library's heapq.nsmallest() and three alternative approaches to writing it.
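
Because the Int subclass intercepts __lt__ and __le__ on the values themselves, the same counter works for any comparison-based routine, not just the functions benchmarked here. For example, this illustrative snippet (reusing the Int class and cmps global from the recipe above) counts the comparisons made by list.sort():

data = list(map(Int, range(1000)))
shuffle(data)
cmps = 0
data.sort()                  # list.sort() compares via __lt__ only
print('comparisons made by list.sort():', cmps)

The four approaches compared are: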

  • (nsmallest) is the algorithm currently used in the standard library. It maintains a small heap containing the k smallest items seen so far. As it iterates through the input data, whenever a value smaller than the largest item on the heap is encountered, that value is substituted for the largest value on the heap.

This approach is space efficient, taking only the space for a k-length list. Only one pass is made over the input iterable. For most datasets, this algorithm is very efficient because only a handful of entries ever need to be inserted in the heap. Most elements need just one comparison, against the largest of the k smallest values seen so far. So, for a small k and a large n, the total number of comparisons is only a little higher than n. A minimal sketch of this strategy appears after this list.

  • (heapifying_smallest) reads the entire data input into a list, heapifies the list, and pops off the k smallest values.

The approach is not space efficient; it consumes space for an n-length list. It makes one pass over the input iterable and another O(n) pass to heapify that list. Popping off the k smallest elements takes O(k log n) time.

  • (selecting_smallest) reads the entire dataset into a list and uses an expected O(n) selection algorithm to find the kth smallest element. Once that element is found, a second pass is made through the list to collect all elements less than or equal to it. That result list is then sorted and truncated (in case the kth smallest element has duplicates). The selection algorithm is essentially the same as Quicksort, except that only one side of each partition is kept (i.e., Quickselect).

This approach takes the most space: n elements for pulling the whole list into memory, plus about half that size (on average) for the first partition.

  • (partitioning_smallest) is a variant of selecting_smallest that accumulates sorted partitions as it runs. This saves the second pass through the data, and it splits the single O(k lg k) sort into several smaller sorts that get appended together.
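
For reference, here is a minimal sketch of the heap-maintenance strategy that nsmallest uses. It illustrates the idea only; it is not the standard library's actual implementation, which adds several optimizations and supports key functions:

from heapq import heapify, heapreplace
from itertools import islice

def nsmallest_sketch(k, iterable):
    it = iter(iterable)
    # Negate the values so that the largest of the k smallest sits
    # at heap[0] (a max-heap simulated with heapq's min-heap).
    heap = [-elem for elem in islice(it, k)]
    heapify(heap)
    for elem in it:
        if -elem > heap[0]:           # elem < current kth smallest
            heapreplace(heap, -elem)  # displace the largest of the k
    return sorted(-v for v in heap)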

Here are the comparative results for n=100000 and k=100:

[105714, 105861, 105987, 106048, 106452] nsmallest
[166525, 166534, 166569, 166605, 166629] heapifying_smallest
[222351, 290743, 327718, 355590, 496865] selecting_smallest
[131171, 149186, 191046, 208020, 217504] partitioning_smallest

The partitioning results do much better if an effort is made to choose a pivot that is close to the intended cut-off point. For example:

pivot = sorted(choice(data) for i in range(9))[9*n//len(data)]
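
Wired into the partitioning function, the sampled pivot might look like this (a sketch; the min() clamp is an addition here, to keep the sample index in range when n approaches len(data)):

from random import choice

def sampled_partitioning_smallest(n, data):
    if len(data) <= 1:
        return data[:n]
    # Sort a sample of nine random elements and take the one whose
    # rank in the sample best matches the cut-off's rank in data.
    sample = sorted(choice(data) for i in range(9))
    pivot = sample[min(8, 9 * n // len(data))]
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem <= pivot else rhs).append(elem)
    if n < len(lhs):
        return sampled_partitioning_smallest(n, lhs)
    return sorted(lhs) + sampled_partitioning_smallest(n - len(lhs), rhs)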