This is a very simple wrapper around an iterator or iterable that lets it be consumed streamingly, without generating all (or any) of its elements up front, while still allowing the object to be iterated from the beginning as many times as wanted. In effect, it is a streamingly loaded list.
import itertools

class SavedIterable(object):
    """Wrap an iterable and cache it.

    The SavedIterable can be accessed streamingly, while still being
    incrementally cached. Later attempts to iterate it will access the
    whole of the sequence.

    When it has been cached to its full extent once, it reduces to a
    thin wrapper of a sequence iterator. The SavedIterable will pickle
    into a list.

    >>> s = SavedIterable(xrange(5))
    >>> iter(s).next()
    0
    >>> list(s)
    [0, 1, 2, 3, 4]
    >>> iter(s)   # doctest: +ELLIPSIS
    <listiterator object at 0x...>
    >>> import pickle
    >>> pickle.loads(pickle.dumps(s))
    [0, 1, 2, 3, 4]
    >>> u = SavedIterable(xrange(5))
    >>> one, two = iter(u), iter(u)
    >>> one.next(), two.next()
    (0, 0)
    >>> list(two)
    [1, 2, 3, 4]
    >>> list(one)
    [1, 2, 3, 4]
    >>> SavedIterable(range(3))
    [0, 1, 2]
    """

    def __new__(cls, iterable):
        if isinstance(iterable, list):
            return iterable          # a list needs no wrapping
        return object.__new__(cls)

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.data = []

    def __iter__(self):
        if self.iterator is None:
            return iter(self.data)   # fully cached: plain list iterator
        return self._incremental_caching_iter()

    def _incremental_caching_iter(self):
        indices = itertools.count()
        while True:
            idx = indices.next()
            # First try the cache; another live iterator may have
            # advanced the source past this position already.
            try:
                yield self.data[idx]
            except IndexError:
                pass
            else:
                continue
            if self.iterator is None:
                return
            try:
                x = self.iterator.next()
                self.data.append(x)
                yield x
            except StopIteration:
                self.iterator = None

    def __reduce__(self):
        # pickle into a list with __reduce__
        # (callable, args, state, listitems)
        return (list, (), None, iter(self))

if __name__ == '__main__':
    import doctest
    doctest.testmod()
The recipe is as simple as possible and no simpler. It addresses performance by returning a list iterator directly when the iterable is fully cached.
For my application, I pickle the SavedIterable to a list, which has the advantage that the pickles do not depend on this caching data structure at all. As another, completely optional addition, the __new__ classmethod returns the list unchanged if a list is passed in; this emphasises that the class is simply a way to "streamingly load" a sequence.
There seems to be a problem when one iterator exhausts the iterable while other iterators are still live:
[Also, just a stylistic note: __new__ is a classmethod, and its first argument is called 'cls' usually, not 'self']
Thank you for your comment Gabriel. I was aware of this problem but hadn't worried about it yet.
I think changing:
to:
is enough to fix it.
That is addressed, but there is still a problem:
so it's not as simple as I thought it was.
I have fixed the recipe to support concurrent iterators. We have to check both the cache and the iterator for that to work, so _incremental_caching_iter has changed in appearance. Less pretty, but the concept is still very simple.
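The fixed check-the-cache-first logic can be sketched as a standalone generator (the names cached_iter, cache and source here are illustrative, not from the recipe):

```python
import itertools

def cached_iter(cache, source):
    # Each iterator keeps its own position (idx) but shares the cache,
    # so items pulled from the source by one iterator are replayed to
    # the others.
    for idx in itertools.count():
        if idx < len(cache):
            yield cache[idx]          # another iterator already cached this
        else:
            try:
                cache.append(next(source))
            except StopIteration:
                return                # source exhausted, nothing left to yield
            yield cache[idx]

cache = []
src = iter(range(5))
one, two = cached_iter(cache, src), cached_iter(cache, src)
print(next(one), next(two))   # 0 0 -- two reads it from the cache
print(list(two))              # [1, 2, 3, 4] -- two exhausts the source
print(list(one))              # [1, 2, 3, 4] -- one is served from the cache
```

The key point is that an iterator only touches the source when its own position has run past the end of the shared cache; a live iterator therefore survives another iterator exhausting the source.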
Just a clarification: I didn't understand the problem at first (even though I said I did), so thank you Gabriel! This fixes a potential bug in my program.
Why not use itertools.tee() which does the caching for you?