Welcome, guest | Sign In | My Account | Store | Cart

Essentially a sorted bag.

Python, 20 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
from operator import itemgetter
from heapq import nlargest

def mostcommon(iterable, n=None):
    """Return a sorted list of the most common to least common elements and
    their counts.  If n is specified, return only the n most common elements.

    """
    bag = {}
    bag_get = bag.get
    for elem in iterable:
        bag[elem] = bag_get(elem, 0) + 1
    if n is None:
        return sorted(bag.iteritems(), key=itemgetter(1), reverse=True)
    it = enumerate(bag.iteritems())
    nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
    return [(elem, cnt) for cnt, i, elem in nl]

>>> mostcommon((word for line in open('in.txt') for word in line.split()), n=5)
[('to', 80), ('for', 75), ('the', 61), ('in', 57), ('of', 54)]

Given an iterable input of length x with u unique elements, the performance of the counting step is O(x). The performance of the sorting step is O(u log u) when n is None and O(u log n) when n is specified.

The key= field of the sorted() function provides a fast way of specifying the item counts as a sort key. Since nlargest() does not have a key= field, the way to achieve equivalent results is to construct temporary tuple records with the item count as the primary key and an enumeration as the secondary key. The enumeration will break ties and ensure sort stability when some items have equal item counts. It also avoids potentially non-existant or expensive ordering comparisions of the items themselves.

The recipe uses several Py2.4 features including generator expressions, sorted(), operator.itemgetter(), and heapq.nlargest(). The result is succinct code that is close to optimal in terms of running time and memory utilization.

2 comments

Louis RIVIERE 16 years, 7 months ago  # | flag

Alternative. def mostcommon(iterable, n=None):

____bag = defaultdict(int)

____for elem in iterable:

________bag[elem] += 1

____c=0

____for i in sorted(bag.iteritems(), key=itemgetter(1), reverse=True):

________yield i

________if n:

____________c+=1

____________if c==n:

________________break

Gabriel Genellina 15 years, 3 months ago  # | flag

Note that using nlargest as in the original recipe is cheaper than sorting the whole set and returning its first elements.

Created by Raymond Hettinger on Thu, 25 Nov 2004 (PSF)
Python recipes (4591)
Raymond Hettinger's recipes (97)

Required Modules

  • (none specified)

Other Information and Tasks