You want to get the items from a sequence (or other iterable) a batch at a time, including a short batch at the end if need be.
from itertools import islice, chain

def batch(iterable, size):
    """Yield iterators over successive size-bounded batches from iterable."""
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        # Consume the first item up front: when the source is exhausted
        # this raises StopIteration and ends the generator, so a
        # zero-length batch is never yielded.
        yield chain([batchiter.next()], batchiter)

seq = xrange(19)
for batchiter in batch(seq, 3):
    print "Batch: ",
    for item in batchiter:
        print item,
    print
Batch: 0 1 2
Batch: 3 4 5
Batch: 6 7 8
Batch: 9 10 11
Batch: 12 13 14
Batch: 15 16 17
Batch: 18
Since I wanted to be able to batch iterables that weren't materialized in memory or whose length was unknown, one goal of this recipe was that it should require only that the source iterable support iteration, rather than requiring indexing or a known length as some other approaches to batching do.
An earlier version of this recipe met that goal, but I was unhappy that it required memory for the list built for each batch. To keep memory requirements to a minimum what I needed to yield instead was a size-bounded iterator over the original iterable. That's exactly what is provided by the itertools module's islice function.
But knowing when we're done batching is the tricky part, as islice is happy to continue returning iterators (albeit zero-length ones) even when the source iterator is exhausted. The problem is that we can't tell in advance whether an iterator is of zero length. One approach would be to maintain an internal count for each iterator and terminate when a yielded iterator turned out, after the fact, to have a count of zero. But I'd prefer never to yield a zero-length iterator in the first place, as that makes things more complicated for the client.
Python's iterators have nothing like a hasNext method, so the only way to know whether an iterator can produce any items is to actually try consuming it, which is the approach used in the recipe. If the initial iteration fails then we know we are done. But if it succeeds, we yield the obtained value chained with the (partially consumed) islice object. Note that each batch should be entirely consumed before proceeding to the next one, otherwise you will get unexpected behaviour.
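To see why the eager first-item consumption is needed, here is a minimal interactive session (an illustration added here, not part of the original recipe) showing that islice never signals exhaustion of the source by itself:

    >>> from itertools import islice
    >>> sourceiter = iter([1, 2])
    >>> list(islice(sourceiter, 3))   # a short final batch...
    [1, 2]
    >>> list(islice(sourceiter, 3))   # ...then zero-length slices, forever
    []
    >>> list(islice(sourceiter, 3))
    []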
Soon after submitting the recipe I began thinking of alternatives to building a list in memory for each batch. islice turned out to be just what I needed for the basic batching, and I've incorporated it into my revised version; I keep the result of islice as an iterator rather than converting it to a list, since I want to keep memory requirements to a minimum.
All batches must be exhausted immediately. You should add a warning that a batch has to be entirely consumed before you can proceed to the next one. Otherwise users may be puzzled by the following:
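The puzzling snippet wasn't preserved here, but the surprise presumably looks something like this: if each batch is abandoned after its first item, only one source item is consumed per batch, so the generator produces far more batches than expected:

    >>> [b.next() for b in batch(range(6), 2)]   # expected 3 batches, got 6
    [0, 1, 2, 3, 4, 5]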
Catching the StopIteration is not necessary when you are essentially reraising it:
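The earlier version presumably did something along these lines (a reconstruction, since the original code isn't shown here); the try/except adds nothing, because an uncaught StopIteration ends the generator by itself:

    from itertools import islice, chain

    def batch(iterable, size):
        sourceiter = iter(iterable)
        while True:
            batchiter = islice(sourceiter, size)
            try:
                yield chain([batchiter.next()], batchiter)
            except StopIteration:
                raise   # redundant: the exception would propagate anyway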
Another approach. The groupby() function is somewhat versatile if some imagination goes into defining the key= function.
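For example, a stateful key function can number the items and group them by integer division (a sketch of the idea, not the commenter's original code):

    from itertools import groupby

    def batch(iterable, size):
        def ticker(item, counter=[-1]):
            # ignore the item; number the calls 0, 1, 2, ... so that
            # `size` consecutive items land in the same group
            counter[0] += 1
            return counter[0] // size
        for _, group in groupby(iterable, ticker):
            yield group

Note that the same caveat applies here: each group must be consumed before advancing to the next, since groupby's groups share the underlying iterator.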
Removed handling of StopIteration. Duh! Yes, catching and reraising StopIteration is redundant. Thanks! I've removed it.
What to do about batches not being consumed? Your comment about unconsumed batches got me to thinking about what the right behaviour should be. Compensate for a batch iterator not getting exhausted by internally consuming the source iterator? Or have each batch iterator begin where the last one finished, so producing an infinite number of batch iterators if none of them get exhausted?
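A sketch of the first option (an illustration of the design question, not code from the recipe): after yielding each batch, internally drain whatever the caller left unconsumed, so the next batch always starts size items further along the source.

    from itertools import islice, chain

    def batch(iterable, size):
        sourceiter = iter(iterable)
        while True:
            batchiter = islice(sourceiter, size)
            yield chain([batchiter.next()], batchiter)
            # consume any leftovers the caller didn't take
            for unused in batchiter:
                pass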
Using a generator class instead. I just posted my own solution to the same batching problem; should have searched a bit more thoroughly first. Here is a generator class that does the same thing. For those (like me) not thoroughly conversant with itertools, it might be easier to understand:
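That posting isn't reproduced here; a class with the same behaviour might look roughly like this (it builds each batch as a list, trading the memory the recipe tries to save for simplicity):

    from itertools import islice

    class batch(object):
        """Iterate over size-bounded lists of items from an iterable."""
        def __init__(self, iterable, size):
            self.sourceiter = iter(iterable)
            self.size = size
        def __iter__(self):
            return self
        def next(self):
            result = list(islice(self.sourceiter, self.size))
            if not result:
                raise StopIteration
            return result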
You can also generate the key func from itertools:
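Perhaps along these lines, with itertools.count supplying the numbering instead of a hand-rolled counter (my guess at what was meant, as the snippet wasn't preserved):

    from itertools import count, groupby

    def batch(iterable, size):
        c = count()
        # the key ignores the item and takes the next number from the count
        for _, group in groupby(iterable, lambda item: c.next() // size):
            yield group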
My suggestion for returning batches of non-fixed size:
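The suggested code isn't preserved; one way to support variable batch sizes is to take an iterable of sizes (a hypothetical variant — the `sizes` parameter is my invention, not the commenter's):

    from itertools import islice, chain

    def batch(iterable, sizes):
        # `sizes` yields the length of each successive batch
        sourceiter = iter(iterable)
        for size in sizes:
            batchiter = islice(sourceiter, size)
            yield chain([batchiter.next()], batchiter)

With this variant, batch(xrange(10), [2, 3, 5]) would yield batches of two, three, and five items.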
The following version is slightly simpler and uses iterators better:
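That version isn't shown here either; a plausible reconstruction lets the for loop handle termination instead of the while True/StopIteration dance:

    from itertools import islice, chain

    def batch(iterable, size):
        sourceiter = iter(iterable)
        for first in sourceiter:
            # the for loop ends cleanly when the source is exhausted,
            # so no explicit StopIteration handling is needed
            yield chain([first], islice(sourceiter, size - 1))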