Merge sorted iterables with stability and built-in DSU (key= support)
import heapq

def mergesort(list_of_lists, key=None):
    """ Perform an N-way merge operation on sorted lists.

    @param list_of_lists: (really an iterable of iterables) of sorted elements
    (either naturally or by C{key})
    @param key: specify a sort key function (like C{sort()}, C{sorted()})

    Yields tuples of the form C{(item, iterator)}, where the iterator is the
    built-in list iterator or something you pass in, if you pre-generate the
    iterators.

    This is a stable merge; the complexity is O(n lg k) for n items drawn
    from k iterables.

    Examples::

        print list(x[0] for x in mergesort([[1,2,3,4],
                                            [2,3.5,3.7,4.5,6,7],
                                            [2.6,3.6,6.6,9]]))
        [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9]

        # note stability
        print list(x[0] for x in mergesort([[1,2,3,4],
                                            [2,3.5,3.7,4.5,6,7],
                                            [2.6,3.6,6.6,9]], key=int))
        [1, 2, 2, 2.6, 3, 3.5, 3.6, 3.7, 4, 4.5, 6, 6.6, 7, 9]

        print list(x[0] for x in mergesort([[4,3,2,1],
                                            [7,6.5,4,3.7,3.3,1.9],
                                            [9,8.6,7.6,6.6,5.5,4.4,3.3]],
                                           key=lambda x: -x))
        [9, 8.6, 7.6, 7, 6.6, 6.5, 5.5, 4.4, 4, 4, 3.7, 3.3, 3.3, 3, 2, 1.9, 1]
    """
    heap = []
    for i, itr in enumerate(iter(pl) for pl in list_of_lists):
        try:
            item = itr.next()
            # Decorate: with a key, order the heap by key(item); the source
            # index i breaks ties and keeps the merge stable.
            toadd = (key(item), i, item, itr) if key else (item, i, itr)
            heap.append(toadd)
        except StopIteration:
            pass
    heapq.heapify(heap)
    if key:
        while heap:
            _, idx, item, itr = heap[0]
            yield item, itr
            try:
                item = itr.next()
                heapq.heapreplace(heap, (key(item), idx, item, itr))
            except StopIteration:
                # This iterator is exhausted; drop its entry from the heap.
                heapq.heappop(heap)
    else:
        while heap:
            item, idx, itr = heap[0]
            yield item, itr
            try:
                heapq.heapreplace(heap, (itr.next(), idx, itr))
            except StopIteration:
                heapq.heappop(heap)
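For quick reference, a minimal usage sketch (the sample data is illustrative, not from the recipe): each yielded tuple carries the item plus the iterator it came from, so plain value merging just discards the second element.

runs = [[1, 4, 9], [2, 3, 10], [5, 6]]            # any sorted iterables
merged = [item for item, itr in mergesort(runs)]
print merged                                      # [1, 2, 3, 4, 5, 6, 9, 10]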
If it is acceptable to realize the whole list in memory, this recipe is likely faster for a small number of iterators: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269
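A minimal sketch of that in-memory approach (not the linked recipe's exact code; it simply concatenates and re-sorts, which is also stable):

import itertools

def merge_in_memory(list_of_lists, key=None):
    # Concatenate everything, then rely on the stable built-in sort.
    return sorted(itertools.chain(*list_of_lists), key=key)

The built-in sort detects the already-sorted runs in the concatenated data, which is part of why this tends to win when everything fits in memory.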
A simpler version without stability or key= can be found here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/141934
This algorithm is designed for use in large-scale merge operations; I use it for merging multi-gigabyte on-disk databases, where stability matters. This is also why the iterator object is yielded along with the current item: it is often a complex object that lets me track which iterator the current item came from.
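As an illustration of that last point, a hedged sketch of tracking provenance by pre-generating the iterators (load_sorted_records and the source names are hypothetical):

sources = {}
iterators = []
for name in ('customers_a.db', 'customers_b.db'):    # hypothetical sources
    it = iter(load_sorted_records(name))             # load_sorted_records is made up
    sources[it] = name
    iterators.append(it)

# mergesort() calls iter() on each input, which hands back the same
# iterator objects, so the yielded itr maps cleanly back to its source.
for item, itr in mergesort(iterators):
    print sources[itr], item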
The output of the second example is wrong; it should read:
[1, 2, 2, 2.6, 3, 3.5, 3.7, 3.6, 4, 4.5, 6, 6.6, 7, 9]
With key=int, 3.7 and 3.6 both compare as 3, and the tie is broken by the iterator index, so 3.7 from the second list is yielded before 3.6 from the third.