Lists of data grouped by a key value are common - obvious examples are spreadsheets or other tabular arrangements of information. In many cases, the new itertools groupby function introduced in Python 2.4 can provide a means of easily generating summaries of such information.

Python, 30 lines
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    """Summarise the supplied data.

       Produce a summary of the data, grouped by the given key (default: the
       first item), and giving totals of the given value (default: the second
       item).

       The key and value arguments should be functions which, given a data
       record, return the relevant value.
    """

    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

if __name__ == "__main__":
    # Example: given a set of sales data for city within region,
    # produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000),
             ('Scotland', 'Glasgow', 12500),
             ('Wales', 'Cardiff', 29700),
             ('Wales', 'Bangor', 12800),
             ('England', 'London', 90000),
             ('England', 'Manchester', 45600),
             ('England', 'Liverpool', 29700)]

    for region, total in summary(sales, key=itemgetter(0), value=itemgetter(2)):
        print "%10s: %d" % (region, total)

In many situations, data is available in tabular form, where the information is naturally grouped by a subset of the data values. Examples include results from database queries or data from spreadsheets. Often, it is useful to be able to produce summaries of the detail data.

The new groupby function (part of the Python 2.4 itertools module) is designed for handling such grouped data. It takes as input an iterable, along with a function to extract the "key" value from a record. For each run of consecutive records sharing the same key, it yields that key along with a new iterator which runs through the records in the run.
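
As a small illustration of the behaviour just described, the sketch below groups a few made-up (region, city) records. The records and the names records and grouped are assumptions chosen for the example, not part of the recipe.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (region, city), already ordered by region.
records = [('Scotland', 'Edinburgh'),
           ('Scotland', 'Glasgow'),
           ('Wales', 'Cardiff')]

# Each step of groupby yields a key plus an iterator over that key's records.
# The iterator is consumed here, before groupby advances to the next key.
grouped = [(region, [city for _, city in group])
           for region, group in groupby(records, key=itemgetter(0))]

# grouped == [('Scotland', ['Edinburgh', 'Glasgow']), ('Wales', ['Cardiff'])]
```

The per-key iterator shares its underlying data with groupby itself, so it should be used up (or copied into a list) before moving on to the next key.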

A common use of the groupby function would be to generate summary totals for a data set. The summary function defined above shows one way of doing this. For a summary report, two extraction functions are required, one to extract the "key", which is passed to the groupby function, and one to extract the values to be summarised.

It should be noted that the groupby function does not sort its input. This can mean that with unsorted data, multiple groups with the same key will appear. If this is not appropriate, the list.sort method (or the sorted builtin) can be used to pre-sort the data. The same key function as is supplied to groupby can also be used as a key argument to the sort.
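
The effect of unsorted input can be seen in a short sketch. The summary function below repeats the recipe's logic; the sales rows and the names unsorted_totals, sorted_totals and by_region are made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    # Same logic as the recipe's summary function.
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

# Hypothetical unsorted input: the two 'Wales' rows are not adjacent.
sales = [('Wales', 29700), ('England', 90000), ('Wales', 12800)]

# Without sorting, 'Wales' is reported twice, once per run of adjacent rows.
unsorted_totals = list(summary(sales))
# unsorted_totals == [('Wales', 29700), ('England', 90000), ('Wales', 12800)]

# Pre-sorting with the same key function merges the runs.
by_region = itemgetter(0)
sorted_totals = list(summary(sorted(sales, key=by_region), key=by_region))
# sorted_totals == [('England', 90000), ('Wales', 42500)]
```

Note that sorting changes the order of the report to the sort order of the keys, which may or may not be what is wanted.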

This recipe provides a good illustration of how the new Python 2.4 features work well together - in addition to the groupby function, the operator.itemgetter convenience function is used to provide natural defaults for the summary function, and a generator expression is used as the argument to the sum() function. When sorted input is required, the new key argument to list.sort provides a convenient means to reuse an existing key function, and the sorted() builtin extends this to sequences other than lists.

4 comments

Andy Elvey 19 years, 7 months ago

Very nice example! This is a great little snippet of code - well done! As a relative newcomer to Python, I would be very keen to see an example of this algorithm using the built-in "csv" module to read a file and summarise the data. That would be a very nice "next step" for this algorithm, making it even more applicable to real-world use (given that csv or similar formats are widely used).

Paul Moore (author) 19 years, 6 months ago

Pretty simple: with a file sales.dat something like

Scotland,Edinburgh,20000
Scotland,Glasgow,12500
Wales,Cardiff,29700
Wales,Bangor,12800
England,London,90000
England,Manchester,45600
England,Liverpool,29700

all you do is change the definition of sales to

sales = csv.reader(open("sales.dat"))

One other change is required - because the csv module returns all values as strings, you need to convert the values to integers - the value argument has to change to

value=lambda r: int(r[2])

rather than using itemgetter(2).

Add error handling and explicit closing of files to taste...
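
For readers who want the pieces assembled, here is one possible sketch in Python 3 syntax (the recipe itself targets Python 2.4, which has no with statement). The function name summarise_csv is hypothetical, and the sketch assumes rows of the form region,city,amount that are already grouped by region.

```python
import csv
from itertools import groupby
from operator import itemgetter

def summarise_csv(path):
    """Return a list of (region, total) pairs read from a CSV file.

    Assumes rows of the form region,city,amount, already grouped by region.
    """
    with open(path, newline='') as f:  # the file is closed when the block exits
        rows = csv.reader(f)
        # Build the list inside the with-block, while the file is still open.
        # csv returns strings, so the amount is converted with int().
        return [(region, sum(int(row[2]) for row in group))
                for region, group in groupby(rows, key=itemgetter(0))]
```

This sketch does no error handling of its own: a non-numeric amount raises ValueError from int(), and a blank line in the file raises IndexError from itemgetter(0).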

Raymond Hettinger 13 years, 1 month ago

To get the full speed benefit from itertools, replace the genexp with imap:

for k, group in groupby(data, key):
    yield (k, sum(imap(value, group)))

With that minor change, the inner-loop runs at C-speed (with no trips around Python's eval-loop).
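
A note for readers on Python 3, where itertools.imap no longer exists: the built-in map is lazy there and can take its place. A minimal sketch of the same change under that assumption:

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    for k, group in groupby(data, key):
        # Python 3's built-in map returns an iterator, like Python 2's imap.
        yield (k, sum(map(value, group)))
```

Calling list(summary([('a', 1), ('a', 2), ('b', 3)])) gives [('a', 3), ('b', 3)].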

Daniel Bara 12 years, 5 months ago

Great example. It helped me. Now I have an array with more than one column to total. I'm trying to modify the example to handle this, and did it as follows.

from itertools import groupby, imap 
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1), value2=itemgetter(2) ):
    """Summarise the supplied data.

       Produce a summary of the data, grouped by the given key (default: the
       first item), and giving totals of the given value (default: the second
       item).

       The key and value arguments should be functions which, given a data
       record, return the relevant value.
    """
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group), sum(value2(row) for row in group))

if __name__ == "__main__":
    # Example: given a set of sales data for city within region,
    # produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000, 5),
             ('Scotland', 'Glasgow', 12500, 3),
             ('Wales', 'Cardiff', 29700, 5),
             ('Wales', 'Bangor', 12800, 8),
             ('England', 'London', 90000,10),
             ('England', 'Manchester', 45600, 20),
             ('England', 'Liverpool', 29700, 23)]

    for region, total, total2 in summary(sales, key=itemgetter(0), value=itemgetter(2), value2=itemgetter(3)):
        print "%10s: %d %d" % (region, total, total2)

The third element of each yielded tuple is always 0. Any help?

Thanks.

Dan
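
On the question above: each group produced by groupby is a one-shot iterator, so the first sum() consumes it and the second sum() sees no rows and returns 0. One possible fix, sketched below, is to copy each group into a list before totalling. The local name rows and the shortened sales data are assumptions for this example.

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1), value2=itemgetter(2)):
    for k, group in groupby(data, key):
        rows = list(group)  # materialise the group so it can be traversed twice
        yield (k, sum(value(r) for r in rows), sum(value2(r) for r in rows))

# Shortened version of the sales data from the question.
sales = [('Scotland', 'Edinburgh', 20000, 5),
         ('Scotland', 'Glasgow', 12500, 3),
         ('Wales', 'Cardiff', 29700, 5),
         ('Wales', 'Bangor', 12800, 8)]

totals = list(summary(sales, value=itemgetter(2), value2=itemgetter(3)))
# totals == [('Scotland', 32500, 8), ('Wales', 42500, 13)]
```

Copying a group into a list costs memory proportional to the size of the largest group. If groups are very large, a single loop that accumulates both totals in one pass avoids the copy.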