Welcome, guest | Sign In | My Account | Store | Cart

Guido inspired SQL-like GROUPBY class that also encapsulates the logic in a Unix-like "sort | uniq".

Python, 30 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class groupby(dict):
    def __init__(self, seq, key=lambda x:x):
        for value in seq:
            k = key(value)
            self.setdefault(k, []).append(value)
    __iter__ = dict.iteritems

# -------------------------- Examples -----------------------------------

>>> letters = 'abracadabra'
>>> [g for k, g in groupby(letters)]                # grouped
[['a', 'a', 'a', 'a', 'a'], ['r', 'r'], ['b', 'b'], ['c'], ['d']]
>>> [k for k, g in groupby(letters)]                # uniq
['a', 'r', 'b', 'c', 'd']
>>> [(k, len(g)) for k, g in groupby(letters)]      # uniq -c
[('a', 5), ('r', 2), ('b', 2), ('c', 1), ('d', 1)]
>>> [k for k, g in groupby(letters) if len(g) > 1]  # uniq -d
['a', 'r', 'b']

>>> data = [('becky', 'girl', 5), ('john', 'boy', 10), ('sue', 'girl', 10)]
>>> for k, g in groupby(data, key=lambda r: r[1]):
...     print k
...     for record in g:
...         print "   ", record
...
boy
    ('john', 'boy', 10)
girl
    ('becky', 'girl', 5)
    ('sue', 'girl', 10)

Used for: 1. Grouping records in reports 2. Listing the unique keys in a database 3. Counting the number of keys in each group 4. Finding records with duplicate keys

Since the underlying implementation is a dictionary of lists: 1. The build time is O(n) 2. The input can be in any order 3. The keys must be hashable 4. The order of key output is arbitrary 5. The order of values for each group is stable (matches original record order)

To get sorted output, change the code for __iter__ to: <pre> def __iter__(self): keys = self.keys() keys.sort() for k in keys: yield k, self[k]

</pre>

9 comments

Wesley Dyk 17 years, 11 months ago  # | flag

Zope Friendly Version. I use a Zope installation that uses Python 2.1 and it doesn't support iterators. Also, it doesn't allow variable names that start with '_'. So I made a modification to use in Zope for use in creating web reports. Just create a new script or a function within a script with the parameters seq and key (just like __init__ in the recipe). Use this code inside the function or script:

groupdict = {}
for item in seq:
  k = key(item)
  groupdict.setdefault(k, []).append(item)
return groupdict.items()

Since the Zope version that I have uses Python 2.1, it can't use iteritems, so it has to return an actual list. This means that a copy of the list of key, value pairs is created. This could drop performance if you have a large sequence.

Maxim Krikun 17 years, 8 months ago  # | flag

redundant list creation. Note that in this line

self.setdefault(k, []).append(value)

an empty list is instantiated on each iteration.

I prefer to use for similar tasks a dict subclass with KeyError catched inside, like follows:

class idict(dict):
    '''self-initializing dictionnary'''
    def __init__(self,init):
        dict.__init__(self)
        self.init=init
    def __getitem__(self,key):
        try:
            return dict.__getitem__(self,key)
        except KeyError:
            if callable(self.init):
                self[key]=self.init()
            else:
                self[key]=self.init
            return dict.__getitem__(self,key)

Then the groupby could be defined as a function:

def groupby(seq, key=lambda x:x):
    d=idict(list)
    for value in seq:
        d[key(value)].append(value)
    return d.items()

This class also provides a simple way to count list entries:

def count_list_items(ll):
    d=idict(0)
    for l in ll:
        d[l]+=1
    return d.items()
Jonathan Wright 17 years, 1 month ago  # | flag

Can someone put the idict in its own recipe? Recently I have used the idict class a number of times. It seems useful enough to warrant its own recipe.

Thanks, Jonathan.

Raymond Hettinger (author) 16 years, 7 months ago  # | flag

idict() is pennywise and pound foolish. The cost of setdefault() instantiating an empty list is miniscule in comparison with the overhead of a __setitem__ call to idict().

runsun pan 15 years, 10 months ago  # | flag

An application of this nice recipe:

groupbyhead: Group a list of items according to the starting character(s) of items.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

Louis RIVIERE 14 years, 3 months ago  # | flag

Getting rid of setdefault ; using defaultdict instead. def group(data, key=None):

____d=defaultdict(list)

____for v in data:

________k=key(v) if key else v

________d[k].append(v)

____return d.items()

Tim J 14 years, 2 months ago  # | flag

defaultdict is New in version 2.5. Alas. I itch for it weekly.

s_h_a_i_o 10 years, 2 months ago  # | flag

The cost of setdefault() instantiating an empty list is miniscule in comparison with the overhead of a __setitem__ call to idict()

However, when calling d[key] many times with the same missing key, defaultdict will each time call its default, while idict fills the d[key], which is then no longer missing.

Is there a way to have defaultdict filling the key when it encounters a missing value ?

Charlie Clark 9 years, 1 month ago  # | flag

@ s_h_a_i_o Worth noting that groupby is now in the Standard Library's itertools module.

Created by Raymond Hettinger on Fri, 9 Jan 2004 (PSF)
Python recipes (4591)
Raymond Hettinger's recipes (97)
HongxuChen's Fav (39)

Required Modules

  • (none specified)

Other Information and Tasks