
We all know groupby() from the itertools standard module. It yields groups of consecutive elements of the input that share a key; elements with the same key that are not consecutive end up in more than one group.

So effectively, groupby() only reformats a flat list into bunches of elements from that list, without reordering anything. In practice this means it works perfectly for input that is sorted by key, but for unsorted input it may yield several groups for the same key, with groups for other keys in between. Usually, though, what you want is a grouping that reorders the input if necessary.
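To illustrate the problem on unsorted input (with a small made-up list):

from itertools import groupby

data = [1, 1, 2, 1, 1]
print([(k, list(g)) for k, g in groupby(data)])
# prints [(1, [1, 1]), (2, [2]), (1, [1, 1])] -- the key 1 appears in two groups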

I implemented a similarly lazy function (it yields generators) which also accepts unsorted input.

Python, 8 lines
def groupbyUnsorted(input, key=lambda x:x):
  yielded = set()
  # Precompute every element's key once, so the group generators below
  # can filter on it.
  keys = [ key(element) for element in input ]
  for i, wantedKey in enumerate(keys):
    if wantedKey not in yielded:
      # Lazily collect all elements from position i onward whose key
      # matches the one first seen at position i.
      yield (wantedKey,
          (input[j] for j in range(i, len(input)) if keys[j] == wantedKey))
    yielded.add(wantedKey)

This recipe was derived from a StackOverflow question in which the list of lists

xs = [
    [1,2,3,4],
    [5,6,7,8],
    [9,0,0,1],
    [2,3],
    [0],
    [5,8,3,2,5,1],
    [6,4],
    [1,6,9,9,2,9]
]

was supposed to be grouped by the length of the lists.

A solution using my function looks like this:

{ key: list(value) for (key, value) in groupbyUnsorted(xs, len) }
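Assuming the xs above, this should produce (key order may differ on older Python versions):

{4: [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0, 0, 1]],
 2: [[2, 3], [6, 4]],
 1: [[0]],
 6: [[5, 8, 3, 2, 5, 1], [1, 6, 9, 9, 2, 9]]}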

1 comment

Matteo Dell'Amico, 6 years, 11 months ago

The OP's version has O(mn) complexity, where m is the number of keys and n is the length of the list. A much faster version with optimal O(n) complexity is possible by using a dict of lists to store the indexes of the original sequence.

import collections

def groupby_unsorted(seq, key=lambda x: x):
    # Record, for each key, the indexes of the elements carrying it.
    indexes = collections.defaultdict(list)
    for i, elem in enumerate(seq):
        indexes[key(elem)].append(i)
    # One lazy generator per key, driven by the stored indexes.
    for k, idxs in indexes.items():
        yield k, (seq[i] for i in idxs)
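It should also work as a drop-in replacement in the recipe's dict comprehension, e.g.:

{key: list(value) for (key, value) in groupby_unsorted(xs, len)}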

Here's some quick benchmarking on my laptop:

import random

l = [random.randrange(1000) for _ in range(100000)]

def consume(groups):
    for k, group in groups:
        for item in group:
            pass

%timeit(consume(groupbyUnsorted(l))) # OP's version
1 loop, best of 3: 6.19 s per loop

%timeit(consume(groupby_unsorted(l))) # new version
10 loops, best of 3: 52 ms per loop

More than 100 times faster in this case :)