Welcome, guest | Sign In | My Account | Store | Cart
4

You want to get the items from a sequence (or other iterable) a batch at a time, including a short batch at the end if need be.

Python, 23 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from itertools import islice, chain

def batch(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        yield chain([batchiter.next()], batchiter)


seq = xrange(19)
for batchiter in batch(seq, 3):
    print "Batch: ",
    for item in batchiter:
        print item,
    print

Batch:  0 1 2
Batch:  3 4 5
Batch:  6 7 8
Batch:  9 10 11
Batch:  12 13 14
Batch:  15 16 17
Batch:  18

Since I wanted to be able to batch iterables that weren't materialized in memory or whose length was unknown, one of the goals with this recipe was that it should only require the source iterable to support iteration, rather than require it to support indexing or have a known length as in some other approaches to batching.

An earlier version of this recipe met that goal, but I was unhappy that it required memory for the list built for each batch. To keep memory requirements to a minimum what I needed to yield instead was a size-bounded iterator over the original iterable. That's exactly what is provided by the itertools module's islice function.

But knowing when we're done batching is the tricky part, as islice is happy to continue returning iterators (albeit zero-length ones) even when the source iterator is exhausted. The problem is that we can't tell in advance whether an iterator is of zero length. One approach would be to maintain an internal count for each iterator and terminate when the yielded iterator turned out after the fact to have a count of zero. But I'd prefer never to yield a zero-length iterator in the first place, as this makes things more complicated for the client. Python's iterators do not have something like a hasNext method, so the only way to know whether an iterator can produce any items is by actually trying to consume it, which is the approach used in the recipe. If the initial iteration fails then we know we are done. But if it succeeds, we yield the obtained value chained with the (partially consumed) islice object. Note that each batch should be entirely consumed before proceeding to the next one, otherwise you will get unexpected behaviour.

9 comments

Hamish Lawson (author) 12 years, 9 months ago  # | flag

islice is just what I needed. Soon after submitting the recipe I began thinking of alternatives to building a list in memory for each batch. islice is just what I needed to do the basic batching, which I've incorporated in my revised version, but I keep the result of islice as an iterator rather than convert it to a list as I want to keep memory requirements to a minimum.

Peter Otten 12 years, 7 months ago  # | flag

All batches must be exhausted immediately. You should add a warning that a batch has to be entirely consumed before you can proceed to the next one. Otherwise users may be puzzled by the following:

>>> len([bi for bi in batch(range(19), 3)])
19

>>> list([bi for bi in batch(range(19), 3)][3])
[3]

Catching the StopIteration is not necessary when you are essentially reraising it:

>>> def batch(iterable, size):
...     it = iter(iterable)
...     while True:
...             bi = islice(it, size)
...             yield chain([bi.next()], bi)
...
>>> map(list, batch(xrange(10), 3))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
Raymond Hettinger 12 years, 4 months ago  # | flag

Another approach. The groupby() function is somewhat versatile if some imagination goes into defining the key= function.

from itertools import groupby

def batch(iterable, size):
    def ticker(x, s=size, a=[-1]):
        r = a[0] = a[0] + 1
        return r // s
    for k, g in groupby(iterable, ticker):
         yield g
Hamish Lawson (author) 12 years, 2 months ago  # | flag

Removed handling of StopIteration. Duh! Yes, catching and reraising StopIteration is redundant. Thanks! I've removed it.

Hamish Lawson (author) 12 years, 2 months ago  # | flag

What to do about batches not being consumed? Your comment about unconsumed batches got me to thinking about what the right behaviour should be. Compensate for a batch iterator not getting exhausted by internally consuming the source iterator? Or have each batch iterator begin where the last one finished, so producing an infinite number of batch iterators if none of them get exhausted?

Wade Leftwich 9 years, 4 months ago  # | flag

Using a generator class instead. I just posted my own solution to the same batching problem; should have searched a bit more thoroughly first. Here is a generator class that does the same thing. For those (like me) not thoroughly conversant with itertools, it might be easier to understand:

class groupcount(object):
    """Accept a (possibly infinite) iterable and yield a succession
    of sub-iterators from it, each of which will yield N values.

    >>> gc = groupcount('abcdefghij', 3)
    >>> for subgroup in gc:
    ...     for item in subgroup:
    ...             print item,
    ...     print
    ...
    a b c
    d e f
    g h i
    j
    """

    def __init__(self, iterable, n=10):
        self.it = iter(iterable)
        self.n = n

    def __iter__(self):
        return self

    def next(self):
        self.ondeck = self.it.next()
        return self._groupby()

    def _groupby(self):
        yield self.ondeck
        for i in xrange(1, self.n):
            yield self.it.next()
Daniel Lepage 8 years, 1 month ago  # | flag

You can also generate the key func from itertools:

from itertools import groupby, izip, count

def batch(iterable, size):
    c = count()
    for k, g in groupby(iterable, lambda x:c.next()//size):
         yield g
Andrew S Brown 1 year, 6 months ago  # | flag

My suggestion for returning batches of non-fixed size:

def batches(iterable, size):
    source = iter(iterable)
    while True:
        chunk = [val for _, val in zip(range(size), source) if val is not None]
        if not chunk:
            raise StopIteration
        yield chunk
Andrew S Brown 1 year, 6 months ago  # | flag

The following version is slightly simpler and users iterators better:

def batches(iterable, size):
    source = iter(iterable)
    while True:
        chunk = [val for _, val in izip(xrange(size), source)]
        if not chunk:
            raise StopIteration
        yield chunk

Add a comment

Sign in to comment

Created by Hamish Lawson on Fri, 3 Sep 2004 (PSF)
Python recipes (4587)
Hamish Lawson's recipes (1)

Required Modules

  • (none specified)

Other Information and Tasks