Welcome, guest | Sign In | My Account | Store | Cart

Essentially a sorted bag.

Python, 20 lines
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20``` ```from operator import itemgetter from heapq import nlargest def mostcommon(iterable, n=None): """Return a sorted list of the most common to least common elements and their counts. If n is specified, return only the n most common elements. """ bag = {} bag_get = bag.get for elem in iterable: bag[elem] = bag_get(elem, 0) + 1 if n is None: return sorted(bag.iteritems(), key=itemgetter(1), reverse=True) it = enumerate(bag.iteritems()) nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it)) return [(elem, cnt) for cnt, i, elem in nl] >>> mostcommon((word for line in open('in.txt') for word in line.split()), n=5) [('to', 80), ('for', 75), ('the', 61), ('in', 57), ('of', 54)] ```

Given an iterable input of length x with u unique elements, the performance of the counting step is O(x). The performance of the sorting step is O(u log u) when n is None and O(u log n) when n is specified.

The key= field of the sorted() function provides a fast way of specifying the item counts as a sort key. Since nlargest() does not have a key= field, the way to achieve equivalent results is to construct temporary tuple records with the item count as the primary key and an enumeration as the secondary key. The enumeration will break ties and ensure sort stability when some items have equal item counts. It also avoids potentially non-existant or expensive ordering comparisions of the items themselves.

The recipe uses several Py2.4 features including generator expressions, sorted(), operator.itemgetter(), and heapq.nlargest(). The result is succinct code that is close to optimal in terms of running time and memory utilization.

Louis RIVIERE 16 years, 8 months ago

Alternative. def mostcommon(iterable, n=None):

____bag = defaultdict(int)

____for elem in iterable:

________bag[elem] += 1

____c=0

____for i in sorted(bag.iteritems(), key=itemgetter(1), reverse=True):

________yield i

________if n:

____________c+=1

____________if c==n:

________________break

Gabriel Genellina 15 years, 4 months ago

Note that using nlargest as in the original recipe is cheaper than sorting the whole set and returning its first elements.

 Created by Raymond Hettinger on Thu, 25 Nov 2004 (PSF)

### Required Modules

• (none specified)