Welcome, guest | Sign In | My Account | Store | Cart

Merge sorted iterables with stability and built-in DSO

Python, 65 lines
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65``` ```import heapq def mergesort(list_of_lists, key=None): """ Perform an N-way merge operation on sorted lists. @param list_of_lists: (really iterable of iterable) of sorted elements (either by naturally or by C{key}) @param key: specify sort key function (like C{sort()}, C{sorted()}) @param iterfun: function that returns an iterator. Yields tuples of the form C{(item, iterator)}, where the iterator is the built-in list iterator or something you pass in, if you pre-generate the iterators. This is a stable merge; complexity O(N lg N) Examples:: print list(x[0] for x in mergesort([[1,2,3,4], [2,3.5,3.7,4.5,6,7], [2.6,3.6,6.6,9]])) [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9] # note stability print list(x[0] for x in mergesort([[1,2,3,4], [2,3.5,3.7,4.5,6,7], [2.6,3.6,6.6,9]], key=int)) [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9] print list(x[0] for x in mergesort([[4,3,2,1], [7,6.5,4,3.7,3.3,1.9], [9,8.6,7.6,6.6,5.5,4.4,3.3]], key=lambda x: -x)) [9, 8.6, 7.6, 7, 6.6, 6.5, 5.5, 4.4, 4, 4, 3.7, 3.3, 3.3, 3, 2, 1.9, 1] """ heap = [] for i, itr in enumerate(iter(pl) for pl in list_of_lists): try: item = itr.next() toadd = (key(item), i, item, itr) if key else (item, i, itr) heap.append(toadd) except StopIteration: pass heapq.heapify(heap) if key: while heap: _, idx, item, itr = heap[0] yield item, itr try: item = itr.next() heapq.heapreplace(heap, (key(item), idx, item, itr) ) except StopIteration: heapq.heappop(heap) else: while heap: item, idx, itr = heap[0] yield item, itr try: heapq.heapreplace(heap, (itr.next(), idx, itr)) except StopIteration: heapq.heappop(heap) ```

If it is acceptible to realize the whole list in memory, this recipe is likely faster for a small number of iterators: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

A simpler version without stability or key= can be found here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/141934

This algorithm is designed for use in large-scale merge operations--I use it for merging multi-gigabyte on-disk databases, where stability matters. This is also why the iterator object is yielded along with the current item: it is often a complex object that allows me to track from which iterator the current item stems.

1 comment

mbutscher 13 years, 2 months ago

Output of second example is wrong, it should read:

``````# note stability
print list(x[0] for x in mergesort([[1,2,3,4],
[2,3.5,3.7,4.5,6,7],
[2.6,3.6,6.6,9]], key=int))
[1, 2, 2, 2.6, 3, 3.5, 3.7, 3.6, 4, 4.5, 6, 6.6, 7, 9]
``````
 Created by Mike Klaas on Tue, 1 May 2007 (PSF)