ActiveState Code

Recipe 303279: Getting items in batches


You want to get the items from a sequence (or other iterable) a batch at a time, including a short batch at the end if need be.

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from itertools import islice, chain

def batch(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        yield chain([batchiter.next()], batchiter)


seq = xrange(19)
for batchiter in batch(seq, 3):
    print "Batch: ",
    for item in batchiter:
        print item,
    print

Batch:  0 1 2
Batch:  3 4 5
Batch:  6 7 8
Batch:  9 10 11
Batch:  12 13 14
Batch:  15 16 17
Batch:  18

Discussion

Since I wanted to be able to batch iterables that weren't materialized in memory or whose length was unknown, one of the goals with this recipe was that it should only require the source iterable to support iteration, rather than require it to support indexing or have a known length as in some other approaches to batching.

An earlier version of this recipe met that goal, but I was unhappy that it required memory for the list built for each batch. To keep memory requirements to a minimum what I needed to yield instead was a size-bounded iterator over the original iterable. That's exactly what is provided by the itertools module's islice function.

But knowing when we're done batching is the tricky part, as islice is happy to continue returning iterators (albeit zero-length ones) even when the source iterator is exhausted. The problem is that we can't tell in advance whether an iterator is of zero length. One approach would be to maintain an internal count for each iterator and terminate when the yielded iterator turned out after the fact to have a count of zero. But I'd prefer never to yield a zero-length iterator in the first place, as this makes things more complicated for the client. Python's iterators do not have something like a hasNext method, so the only way to know whether an iterator can produce any items is by actually trying to consume it, which is the approach used in the recipe. If the initial iteration fails then we know we are done. But if it succeeds, we yield the obtained value chained with the (partially consumed) islice object. Note that each batch should be entirely consumed before proceeding to the next one, otherwise you will get unexpected behaviour.

Comments

  1. 1. At 11:16 a.m. on 10 sep 2004, Hamish Lawson (the author) said:

    islice is just what I needed. Soon after submitting the recipe I began thinking of alternatives to building a list in memory for each batch. islice is just what I needed to do the basic batching, which I've incorporated in my revised version, but I keep the result of islice as an iterator rather than convert it to a list as I want to keep memory requirements to a minimum.

  2. 2. At 3:54 a.m. on 30 nov 2004, Peter Otten said:

    All batches must be exhausted immediately. You should add a warning that a batch has to be entirely consumed before you can proceed to the next one. Otherwise users may be puzzled by the following:

    >>> len([bi for bi in batch(range(19), 3)])
    19
    
    >>> list([bi for bi in batch(range(19), 3)][3])
    [3]
    

    Catching the StopIteration is not necessary when you are essentially reraising it:

    >>> def batch(iterable, size):
    ...     it = iter(iterable)
    ...     while True:
    ...             bi = islice(it, size)
    ...             yield chain([bi.next()], bi)
    ...
    >>> map(list, batch(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    
  3. 3. At 9:46 a.m. on 25 feb 2005, Raymond Hettinger said:

    Another approach. The groupby() function is somewhat versatile if some imagination goes into defining the key= function.

    from itertools import groupby
    
    def batch(iterable, size):
        def ticker(x, s=size, a=[-1]):
            r = a[0] = a[0] + 1
            return r // s
        for k, g in groupby(iterable, ticker):
             yield g
    
  4. 4. At 9:53 a.m. on 15 apr 2005, Hamish Lawson (the author) said:

    Removed handling of StopIteration. Duh! Yes, catching and reraising StopIteration is redundant. Thanks! I've removed it.

  5. 5. At 10:18 a.m. on 15 apr 2005, Hamish Lawson (the author) said:

    What to do about batches not being consumed? Your comment about unconsumed batches got me to thinking about what the right behaviour should be. Compensate for a batch iterator not getting exhausted by internally consuming the source iterator? Or have each batch iterator begin where the last one finished, so producing an infinite number of batch iterators if none of them get exhausted?

  6. 6. At 5:39 a.m. on 19 feb 2008, Wade Leftwich said:

    Using a generator class instead. I just posted my own solution to the same batching problem; should have searched a bit more thoroughly first. Here is a generator class that does the same thing. For those (like me) not thoroughly conversant with itertools, it might be easier to understand:

    class groupcount(object):
        """Accept a (possibly infinite) iterable and yield a succession
        of sub-iterators from it, each of which will yield N values.
    
        >>> gc = groupcount('abcdefghij', 3)
        >>> for subgroup in gc:
        ...     for item in subgroup:
        ...             print item,
        ...     print
        ...
        a b c
        d e f
        g h i
        j
        """
    
        def __init__(self, iterable, n=10):
            self.it = iter(iterable)
            self.n = n
    
        def __iter__(self):
            return self
    
        def next(self):
            self.ondeck = self.it.next()
            return self._groupby()
    
        def _groupby(self):
            yield self.ondeck
            for i in xrange(1, self.n):
                yield self.it.next()
    
  7. 7. At 9:11 p.m. on 5 may 2009, Daniel Lepage said:

    You can also generate the key func from itertools:

    from itertools import groupby, izip, count
    
    def batch(iterable, size):
        c = count()
        for k, g in groupby(iterable, lambda x:c.next()//size):
             yield g
    

Sign in to comment