groupby() For Unsorted Input (Python recipe) by Alfe
ActiveState Code (http://code.activestate.com/recipes/580800/)

We all know groupby() from the itertools standard module. It yields groups of consecutive elements of the input that share the same key. If elements with the same key are not consecutive, it yields more than one group for that key.

So effectively, groupby() only splits a flat list into bunches of elements from that list without reordering anything. In practice this means that it works perfectly for input sorted by key, but for unsorted input it may yield several groups for the same key (with groups for other keys in between). What is typically needed, though, is a grouping that reorders where necessary.
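
To make the difference concrete, here is a small sketch using only itertools.groupby(). The word list and the first_letter helper are made up for illustration and are not part of the recipe.

```python
from itertools import groupby

# Illustrative data (an assumption, not from the recipe).
words = ["apple", "avocado", "banana", "apricot", "blueberry"]

def first_letter(word):
    return word[0]

# Unsorted input: the key "a" shows up in two separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]
# [('a', ['apple', 'avocado']), ('b', ['banana']),
#  ('a', ['apricot']), ('b', ['blueberry'])]

# Sorting by the same key first gives exactly one group per key.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), key=first_letter)
]
# [('a', ['apple', 'avocado', 'apricot']), ('b', ['banana', 'blueberry'])]
```

Sorting first costs O(n log n) and needs keys that can be compared, which is one reason a grouping function for unsorted input can be handy.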

I implemented a similarly lazy function (it yields generators) that also accepts ungrouped input.

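def groupbyUnsorted(input, key=lambda x: x):
  # input must be a sequence: it is indexed and measured with len().
  yielded = set()
  keys = [key(element) for element in input]

  def group(start, wanted):
    # A helper function binds start and wanted per group, so a group stays
    # correct even if it is consumed after the outer generator has advanced.
    return (input[j] for j in range(start, len(input)) if keys[j] == wanted)

  for i, wantedKey in enumerate(keys):
    if wantedKey not in yielded:
      yielded.add(wantedKey)
      yield wantedKey, group(i, wantedKey)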

This is derived from a Stack Overflow question in which the list of lists

xs = [
    [1,2,3,4],
    [5,6,7,8],
    [9,0,0,1],
    [2,3],
    [0],
    [5,8,3,2,5,1],
    [6,4],
    [1,6,9,9,2,9]
]

was supposed to be grouped by length of the lists.

A solution using my function looks like this:

{ key: list(value) for (key, value) in groupbyUnsorted(xs, len) }
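
For reference, here is a self-contained run of this example. The function is restated with a small helper so that each group keeps its own key, and the comparison with sorted() plus itertools.groupby() is my addition as a cross-check, not part of the original question.

```python
from itertools import groupby

def groupbyUnsorted(input, key=lambda x: x):
    # Restated from the recipe; input must be an indexable sequence.
    yielded = set()
    keys = [key(element) for element in input]

    def group(start, wanted):
        return (input[j] for j in range(start, len(input)) if keys[j] == wanted)

    for i, wantedKey in enumerate(keys):
        if wantedKey not in yielded:
            yielded.add(wantedKey)
            yield wantedKey, group(i, wantedKey)

xs = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 0, 0, 1],
    [2, 3],
    [0],
    [5, 8, 3, 2, 5, 1],
    [6, 4],
    [1, 6, 9, 9, 2, 9],
]

by_length = {key: list(value) for (key, value) in groupbyUnsorted(xs, len)}
# {4: [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0, 0, 1]],
#  2: [[2, 3], [6, 4]],
#  1: [[0]],
#  6: [[5, 8, 3, 2, 5, 1], [1, 6, 9, 9, 2, 9]]}

# Cross-check: sorting by key first and then using itertools.groupby()
# produces the same groups, only with the keys in sorted order.
via_sort = {k: list(g) for k, g in groupby(sorted(xs, key=len), key=len)}
```

The keys come out in order of first appearance (4, 2, 1, 6), whereas the sort-based variant yields them in ascending order.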

Tags: algorithm, datastructures, generators, grouping, lazy

1 comment

Matteo Dell'Amico 6 years, 11 months ago

The OP's version has O(mn) complexity, where m is the number of distinct keys and n is the length of the list. A much faster version with optimal O(n) complexity is possible by using a dict of lists that stores, for each key, the indexes of the matching elements in the original sequence.

import collections

def groupby_unsorted(seq, key=lambda x: x):
    indexes = collections.defaultdict(list)
    for i, elem in enumerate(seq):
        indexes[key(elem)].append(i)
    for k, idxs in indexes.items():
        yield k, (seq[i] for i in idxs)
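
As a quick illustration of how this version behaves, the sketch below applies it to the list of lists from the recipe. The variable names groups, key_order, last and first are mine. It assumes Python 3.7 or later, where dicts preserve insertion order, so keys come out in order of first appearance.

```python
import collections

def groupby_unsorted(seq, key=lambda x: x):
    # Same function as above, repeated so the snippet runs on its own.
    indexes = collections.defaultdict(list)
    for i, elem in enumerate(seq):
        indexes[key(elem)].append(i)
    for k, idxs in indexes.items():
        yield k, (seq[i] for i in idxs)

xs = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 0, 0, 1],
    [2, 3],
    [0],
    [5, 8, 3, 2, 5, 1],
    [6, 4],
    [1, 6, 9, 9, 2, 9],
]

# Collect the (still unconsumed) group generators by key.
groups = dict(groupby_unsorted(xs, len))
key_order = list(groups)   # [4, 2, 1, 6]

# Each group has its own index list, so groups can be consumed in any
# order, even after the outer generator is exhausted.
last = list(groups[6])     # [[5, 8, 3, 2, 5, 1], [1, 6, 9, 9, 2, 9]]
first = list(groups[4])    # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0, 0, 1]]
```

This differs from itertools.groupby(), whose groups share one underlying iterator and become unusable once the next group is requested. Note that seq still has to be an indexable sequence here, since the groups read seq[i].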

Here's some quick benchmarking on my laptop, using IPython's %timeit:

import random

l = [random.randrange(1000) for _ in range(100000)]

def consume(groups):
    for k, group in groups:
        for item in group:
            pass

%timeit(consume(groupbyUnsorted(l))) # OP's version
1 loop, best of 3: 6.19 s per loop

%timeit(consume(groupby_unsorted(l))) # new version
10 loops, best of 3: 52 ms per loop

More than 100 times faster in this case :)
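
Since %timeit is an IPython magic, here is a rough equivalent for a plain Python script using the standard timeit module. Both functions are restated so it runs standalone (groupbyUnsorted with a small helper that does not change its scanning cost). The input is deliberately smaller than in the measurement above, and the seed, sizes and variable names are my assumptions. Absolute timings will vary by machine.

```python
import collections
import random
import timeit

def groupbyUnsorted(input, key=lambda x: x):
    # OP's approach: one scan of the remaining input per distinct key.
    yielded = set()
    keys = [key(element) for element in input]

    def group(start, wanted):
        return (input[j] for j in range(start, len(input)) if keys[j] == wanted)

    for i, wantedKey in enumerate(keys):
        if wantedKey not in yielded:
            yielded.add(wantedKey)
            yield wantedKey, group(i, wantedKey)

def groupby_unsorted(seq, key=lambda x: x):
    # Dict-of-index-lists approach: a single pass over the input.
    indexes = collections.defaultdict(list)
    for i, elem in enumerate(seq):
        indexes[key(elem)].append(i)
    for k, idxs in indexes.items():
        yield k, (seq[i] for i in idxs)

def consume(groups):
    for k, group in groups:
        for item in group:
            pass

random.seed(0)  # fixed seed so repeated runs use the same data
data = [random.randrange(100) for _ in range(5000)]

runs = 3
slow = timeit.timeit(lambda: consume(groupbyUnsorted(data)), number=runs)
fast = timeit.timeit(lambda: consume(groupby_unsorted(data)), number=runs)

print(f"OP's version:       {slow / runs:.4f} s per run")
print(f"dict-based version: {fast / runs:.4f} s per run")
```

Raising the number of distinct keys (the argument to randrange) should widen the gap, since only the OP's version depends on m.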

Created by Alfe on Fri, 12 May 2017 (MIT)


Required Modules

(none specified)

Other Information and Tasks

Licensed under the MIT License
Viewed 63959 times
Revision 1
