
This generator will sort only the requested part of a list. Useful for getting just the first few elements of a list and avoiding the overhead of sorting the whole thing.

Python, 8 lines

import heapq

def isorted(iterable):
    lst = list(iterable)
    heapq.heapify(lst)
    pop = heapq.heappop
    while lst:
        yield pop(lst)
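To illustrate the intended use case, the first few elements can be pulled out with itertools.islice without ever finishing the sort; this is a minimal Python 3 sketch of the recipe above (the sample data is made up for the demo):

    import heapq
    from itertools import islice

    def isorted(iterable):
        # Heapify a copy of the input, then lazily pop the smallest item.
        lst = list(iterable)
        heapq.heapify(lst)
        while lst:
            yield heapq.heappop(lst)

    # Take only the three smallest items; the rest of the heap is never
    # fully sorted.
    data = [9, 4, 7, 1, 8, 2, 6]
    print(list(islice(isorted(data), 3)))  # → [1, 2, 4]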

This second version is by Raymond Hettinger; it is both simpler and faster than my original one, which was based on a generator-based quicksort.

The name isorted follows the convention from the itertools package and mimics the new python 2.4 "sorted" builtin.

Hope this is useful to someone; I've needed something like this more than once.

George Yoshida 20 years ago

generator-based quicksort. I modified Matteo's original script so that isorted can take any iterable as an argument.

from itertools import ifilter, count

try:
    from itertools import tee
except ImportError, err:
    def tee(iterable):
        def gen(next, data={}, cnt=[0]):
            for i in count():
                if i == cnt[0]:
                    item = data[i] = next()
                    cnt[0] += 1
                else:
                    item = data.pop(i)
                yield item
        it = iter(iterable)
        return (gen(it.next), gen(it.next))

def isorted(iterable):
    iterable = iter(iterable)
    try:
        pivot = iterable.next()
    except StopIteration:
        return

    a, b = tee(iterable)
    for x in isorted(ifilter(lambda item: item < pivot, a)):
        yield x
    yield pivot
    for x in isorted(ifilter(lambda item: item >= pivot, b)):
        yield x
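For readers on a current interpreter, the same generator-based quicksort can be sketched in Python 3, where filter replaces ifilter, next(it) replaces it.next(), and yield from flattens the recursive loops (the sample input is made up for the demo):

    from itertools import tee

    def isorted(iterable):
        # Generator-based quicksort: yields items in sorted order without
        # building an intermediate list at each level.
        it = iter(iterable)
        try:
            pivot = next(it)
        except StopIteration:
            return
        a, b = tee(it)
        yield from isorted(x for x in a if x < pivot)
        yield pivot
        yield from isorted(x for x in b if x >= pivot)

    print(list(isorted(iter([3, 1, 4, 1, 5, 9, 2, 6]))))  # → [1, 1, 2, 3, 4, 5, 6, 9]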

Note : my generator-based quicksort does not construct a list.

I used itertools.tee, which will appear in Python 2.4.

Daniel Harding 20 years ago

stable version. This version guarantees stability by selecting the first element as the pivot (this results in poor performance if the list is already sorted, but is no worse than the original version). Also, unlike the original version, it never modifies the list that is passed into it.

MIN_SIZE = 128

def isorted(lst):
    if len(lst) <= MIN_SIZE:
        lst = lst[:]
        lst.sort()
        for x in lst:
            yield x
    else:
        pivot = lst[0]
        lst = lst[1:]
        lt = [x for x in lst if x < pivot]
        for x in isorted(lt):
            yield x
        yield pivot
        ge = [x for x in lst if x >= pivot]
        for x in isorted(ge):
            yield x
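The stability guarantee can be checked with items that compare equal but remain distinguishable; this is a hypothetical Python 3 adaptation of Daniel's version, with MIN_SIZE lowered from the recipe's 128 so the tiny demo actually hits the recursive branch:

    from functools import total_ordering

    MIN_SIZE = 2  # lowered from 128 here so the demo exercises recursion

    def isorted(lst):
        if len(lst) <= MIN_SIZE:
            for x in sorted(lst):  # sorted() is stable, preserving the guarantee
                yield x
        else:
            pivot, rest = lst[0], lst[1:]
            for x in isorted([x for x in rest if x < pivot]):
                yield x
            yield pivot
            for x in isorted([x for x in rest if x >= pivot]):
                yield x

    @total_ordering
    class Item:
        # Compares by key only, so equal-key items stay distinguishable by tag.
        def __init__(self, key, tag):
            self.key, self.tag = key, tag
        def __eq__(self, other):
            return self.key == other.key
        def __lt__(self, other):
            return self.key < other.key

    data = [Item(1, 'a'), Item(0, 'b'), Item(1, 'c'), Item(0, 'd'), Item(1, 'e')]
    # Equal keys keep their original relative order: b,d for key 0; a,c,e for key 1.
    print([x.tag for x in isorted(data)])  # → ['b', 'd', 'a', 'c', 'e']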

George and Daniel, thanks for your comments. I did some tests and saw that the problem with all these versions is performance: Python's builtin list.sort is so good that there are very few cases in which these solutions come out ahead. The real performance bonus would come from implementing this in C.

If someone who has "the right to speak" :-) thinks this is worth the itertools package, I'd be happy to study how to implement it.

Raymond Hettinger 20 years ago

Simplification and speedup with heapq. In terms of use cases, lazy sorting is valuable only when you don't plan on consuming all of the output (otherwise, you might as well use a regular sort). Choosing the n smallest items out of a list would be a good application.

Keeping this in mind, a heap is a natural choice of underlying data structures:

import heapq

def isorted(iterable):
    lst = list(iterable)
    heapq.heapify(lst)
    pop = heapq.heappop
    while lst:
        yield pop(lst)

isorted() is not an especially good candidate for itertools because it immediately manifests the whole input in memory, and because most of its use cases tend to be dominated by list.sort() and heapq.

Raymond, you're absolutely right. Still, I'd argue that a well-written lazy sort could be almost as efficient as list.sort. Not that it matters, since heapsort is just fine. I'm adopting your modification for version 2. :-)

Jim Baker 17 years, 8 months ago

For 2.4 and later, consider using heapq's nsmallest or nlargest functions. The timsort implementation used by sorted and sort is great, but laziness can still win out. In my experience, lazy sorting is significantly more efficient when only a fraction of the output is consumed, as in top-N type queries. Such queries are very popular for user interfaces! Consequently, I like the fact that 2.4 added this functionality as heapq.nlargest and heapq.nsmallest.

There should be no surprise that lazy sorting can be more efficient: we are comparing O(n + k log n) = O(n) -- if k is a constant, as it is with a top 50 -- against O(n log n).
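Jim's suggestion can be shown in a few lines; this is a minimal sketch using heapq.nsmallest and heapq.nlargest from the standard library, with made-up sample data, checked against a full sort:

    import heapq

    data = [5, 1, 9, 3, 7, 2, 8, 6, 4, 0]

    # nsmallest keeps a bounded heap of k items, doing roughly O(n log k)
    # work, versus O(n log n) for a full sort; both give the same answer.
    print(heapq.nsmallest(3, data))  # → [0, 1, 2]
    print(heapq.nlargest(3, data))   # → [9, 8, 7]

    assert heapq.nsmallest(3, data) == sorted(data)[:3]
    assert heapq.nlargest(3, data) == sorted(data, reverse=True)[:3]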

Of course, the crossover point will vary with the constant overhead in your code. As usual, try it both ways if the sorting is in a key piece of code, as it was in mine.

 Created by Matteo Dell'Amico on Tue, 27 Apr 2004 (PSF)