Welcome, guest | Sign In | My Account | Store | Cart

Merge sorted iterables with stability and built-in DSO

Python, 65 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
import heapq
def mergesort(list_of_lists, key=None):
    """ Perform an N-way merge operation on sorted lists.

    @param list_of_lists: (really iterable of iterable) of sorted elements
    (either by naturally or by C{key})
    @param key: specify sort key function (like C{sort()}, C{sorted()})
    @param iterfun: function that returns an iterator.

    Yields tuples of the form C{(item, iterator)}, where the iterator is the
    built-in list iterator or something you pass in, if you pre-generate the
    iterators.

    This is a stable merge; complexity O(N lg N)

    Examples::

    print list(x[0] for x in mergesort([[1,2,3,4],
                                        [2,3.5,3.7,4.5,6,7],
                                        [2.6,3.6,6.6,9]]))
    [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9]

    # note stability
    print list(x[0] for x in mergesort([[1,2,3,4],
                                        [2,3.5,3.7,4.5,6,7],
                                        [2.6,3.6,6.6,9]], key=int))
    [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9]

    print list(x[0] for x in mergesort([[4,3,2,1],
                                        [7,6.5,4,3.7,3.3,1.9],
                                        [9,8.6,7.6,6.6,5.5,4.4,3.3]],
                                        key=lambda x: -x))
    [9, 8.6, 7.6, 7, 6.6, 6.5, 5.5, 4.4, 4, 4, 3.7, 3.3, 3.3, 3, 2, 1.9, 1]


    """

    heap = []
    for i, itr in enumerate(iter(pl) for pl in list_of_lists):
        try:
            item = itr.next()
            toadd = (key(item), i, item, itr) if key else (item, i, itr)
            heap.append(toadd)
        except StopIteration:
            pass
    heapq.heapify(heap)

    if key:
        while heap:
            _, idx, item, itr = heap[0]
            yield item, itr
            try:
                item = itr.next()
                heapq.heapreplace(heap, (key(item), idx, item, itr) )
            except StopIteration:
                heapq.heappop(heap)

    else:
        while heap:
            item, idx, itr = heap[0]
            yield item, itr
            try:
                heapq.heapreplace(heap, (itr.next(), idx, itr))
            except StopIteration:
                heapq.heappop(heap)

If it is acceptible to realize the whole list in memory, this recipe is likely faster for a small number of iterators: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

A simpler version without stability or key= can be found here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/141934

This algorithm is designed for use in large-scale merge operations--I use it for merging multi-gigabyte on-disk databases, where stability matters. This is also why the iterator object is yielded along with the current item: it is often a complex object that allows me to track from which iterator the current item stems.

1 comment

mbutscher 11 years, 8 months ago  # | flag

Output of second example is wrong, it should read:

# note stability
print list(x[0] for x in mergesort([[1,2,3,4],
                                    [2,3.5,3.7,4.5,6,7],
                                    [2.6,3.6,6.6,9]], key=int))
[1, 2, 2, 2.6, 3, 3.5, 3.7, 3.6, 4, 4.5, 6, 6.6, 7, 9]
Created by Mike Klaas on Tue, 1 May 2007 (PSF)
Python recipes (4591)
Mike Klaas's recipes (1)

Required Modules

Other Information and Tasks