Essentially a sorted bag.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | from operator import itemgetter
from heapq import nlargest
def mostcommon(iterable, n=None):
"""Return a sorted list of the most common to least common elements and
their counts. If n is specified, return only the n most common elements.
"""
bag = {}
bag_get = bag.get
for elem in iterable:
bag[elem] = bag_get(elem, 0) + 1
if n is None:
return sorted(bag.iteritems(), key=itemgetter(1), reverse=True)
it = enumerate(bag.iteritems())
nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
return [(elem, cnt) for cnt, i, elem in nl]
>>> mostcommon((word for line in open('in.txt') for word in line.split()), n=5)
[('to', 80), ('for', 75), ('the', 61), ('in', 57), ('of', 54)]
|
Given an iterable input of length x with u unique elements, the performance of the counting step is O(x). The performance of the sorting step is O(u log u) when n is None and O(u log n) when n is specified.
The key= field of the sorted() function provides a fast way of specifying the item counts as a sort key. Since nlargest() does not have a key= field, the way to achieve equivalent results is to construct temporary tuple records with the item count as the primary key and an enumeration as the secondary key. The enumeration will break ties and ensure sort stability when some items have equal item counts. It also avoids potentially non-existant or expensive ordering comparisions of the items themselves.
The recipe uses several Py2.4 features including generator expressions, sorted(), operator.itemgetter(), and heapq.nlargest(). The result is succinct code that is close to optimal in terms of running time and memory utilization.
Alternative. def mostcommon(iterable, n=None):
____bag = defaultdict(int)
____for elem in iterable:
________bag[elem] += 1
____c=0
____for i in sorted(bag.iteritems(), key=itemgetter(1), reverse=True):
________yield i
________if n:
____________c+=1
____________if c==n:
________________break
Note that using nlargest as in the original recipe is cheaper than sorting the whole set and returning its first elements.