ActiveState Code

Recipe 304162: Summary reports using itertools.groupby


Lists of data grouped by a key value are common - obvious examples are spreadsheets or other tabular arrangements of information. In many cases, the new itertools groupby function introduced in Python 2.4 can provide a means of easily generating summaries of such information.

Python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    """Summarise the supplied data.

       Produce a summary of the data, grouped by the given key (default: the
       first item), and giving totals of the given value (default: the second
       item).

       The key and value arguments should be functions which, given a data
       record, return the relevant value.
    """

    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

if __name__ == "__main__":
    # Example: given a set of sales data for city within region,
    # produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000),
             ('Scotland', 'Glasgow', 12500),
             ('Wales', 'Cardiff', 29700),
             ('Wales', 'Bangor', 12800),
             ('England', 'London', 90000),
             ('England', 'Manchester', 45600),
             ('England', 'Liverpool', 29700)]

    for region, total in summary(sales, key=itemgetter(0), value=itemgetter(2)):
        # Parenthesised form runs unchanged on Python 2 and Python 3.
        print("%10s: %d" % (region, total))

Discussion

In many situations, data is available in tabular form, where the information is naturally grouped by a subset of the data values. Examples include results from database queries or data from spreadsheets. Often, it is useful to be able to produce summaries of the detail data.

The new groupby function (part of the Python 2.4 itertools module) is designed for handling such grouped data. It takes as input an iterable, along with a function to extract the "key" value from a record. For each run of consecutive records that share a key, it yields that key along with a new iterator which runs through the records in the run.
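As a minimal sketch of that behaviour, the snippet below groups a few made-up (region, city) records. The `records` list and the `grouped` name are illustrative only and are not part of the recipe.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (region, city) pairs, already ordered by region.
records = [('Scotland', 'Edinburgh'),
           ('Scotland', 'Glasgow'),
           ('Wales', 'Cardiff')]

# Each step yields a key and an iterator over the consecutive records
# sharing that key. The inner list comprehension consumes the iterator
# straight away, because it is no longer valid once groupby advances.
grouped = [(key, [city for _, city in rows])
           for key, rows in groupby(records, key=itemgetter(0))]

print(grouped)
```

The group iterator shares its underlying data with groupby itself, so anything needed from a group has to be collected before moving on to the next key.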

A common use of the groupby function would be to generate summary totals for a data set. The summary function defined above shows one way of doing this. For a summary report, two extraction functions are required, one to extract the "key", which is passed to the groupby function, and one to extract the values to be summarised.

The groupby function does not sort its input; it only groups consecutive records whose keys are equal. With unsorted data, the same key can therefore appear in several separate groups. If this is not appropriate, the list.sort method (or the sorted builtin) can be used to pre-sort the data. The same key function as is supplied to groupby can also be used as the key argument to the sort.
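The effect of unsorted input can be illustrated with a small sketch. It repeats the recipe's summary function so that it runs on its own; the `unsorted_sales` data and the `by_region` and `amount` names are assumptions made up for this example.

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    # Same logic as the recipe: total the values within each group.
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

# Hypothetical data in which the regions are interleaved.
unsorted_sales = [('Wales', 'Cardiff', 29700),
                  ('Scotland', 'Edinburgh', 20000),
                  ('Wales', 'Bangor', 12800),
                  ('Scotland', 'Glasgow', 12500)]

by_region = itemgetter(0)
amount = itemgetter(2)

# Without sorting, every change of region starts a new group,
# so each region is reported twice with a partial total.
split_totals = list(summary(unsorted_sales, key=by_region, value=amount))

# Sorting with the same key function first gives one group per region.
sorted_totals = list(summary(sorted(unsorted_sales, key=by_region),
                             key=by_region, value=amount))

print(split_totals)
print(sorted_totals)
```

Because the sort is stable, records within a region keep their original relative order, which matters if the value function depends on position.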

This recipe provides a good illustration of how the new Python 2.4 features work well together - in addition to the groupby function, the operator.itemgetter convenience function is used to provide natural defaults for the summary function, and a generator expression is used as the argument to the sum() function. When sorted input is required, the new key argument to list.sort provides a convenient means to reuse an existing key function, and the sorted() builtin extends this to sequences other than lists.

Comments

  1. At 2:37 a.m. on 20 Sep 2004, Andy Elvey said:

    Very nice example! This is a great little snippet of code - well done! As a relative newcomer to Python, I would be very keen to see an example of this algorithm using the built-in "csv" module to read a file and summarise the data. That would be a very nice "next step" for this algorithm, making it even more applicable to real-world use (given that csv or similar formats are widely used).

  2. At 6:58 a.m. on 13 Oct 2004, Paul Moore (the author) said:

    Pretty simple: with a file sales.dat something like

    Scotland,Edinburgh,20000
    Scotland,Glasgow,12500
    Wales,Cardiff,29700
    Wales,Bangor,12800
    England,London,90000
    England,Manchester,45600
    England,Liverpool,29700
    

    all you do is change the definition of sales to

    sales = csv.reader(open("sales.dat"))
    

    One other change is required - because the csv module returns all values as strings, you need to convert the values to integers - the value argument has to change to

    value=lambda r: int(r[2])
    

    rather than using itemgetter(2).

    Add error handling and explicit closing of files to taste...
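One possible way to fill in the error handling and file closing mentioned above is sketched here. It is written for Python 3, and the `summarise_lines` and `summarise_file` names, the error message, and the choice to skip blank lines are all assumptions for illustration rather than part of the original comment.

```python
import csv
from itertools import groupby
from operator import itemgetter

def summarise_lines(lines):
    """Total the third CSV column for each run of equal first-column values."""
    def amount(row):
        # csv returns strings, so convert here and report bad records clearly.
        try:
            return int(row[2])
        except (IndexError, ValueError):
            raise ValueError("bad sales record: %r" % (row,))

    # csv.reader accepts any iterable of lines; blank lines come back
    # as empty rows, which are skipped (an assumption of this sketch).
    rows = (row for row in csv.reader(lines) if row)
    return [(region, sum(amount(row) for row in group))
            for region, group in groupby(rows, key=itemgetter(0))]

def summarise_file(path):
    # The with block closes the file even if a record fails to parse.
    # newline="" follows the Python 3 csv documentation's advice.
    with open(path, newline="") as handle:
        return summarise_lines(handle)
```

Calling `summarise_file("sales.dat")` on the sample file above should give one total per region, provided the rows for each region are adjacent in the file.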
