This is a very simple wrapper around an iterator or iterable, such that it can be consumed streamingly, without generating all of its elements up front (or any at all), while the wrapping object can still be iterated from the beginning as many times as wanted. In effect, it is a streamingly loaded list.
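
For example, a minimal usage sketch (SavedIterable is the class defined in the recipe below):

>>> squares = SavedIterable(x*x for x in xrange(10))
>>> next(iter(squares))   # only the first element is computed so far
0
>>> sum(squares)          # the rest is generated and cached on the way
285
>>> sum(squares)          # replayed from the cache, not recomputed
285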

Python, 83 lines
import itertools

class SavedIterable (object):
    """Wrap an iterable and cache it.

    The SavedIterable can be accessed streamingly, while still being
    incrementally cached. Later attempts to iterate it will access the
    whole of the sequence.

    When it has been cached to its full extent once, it reduces to a
    thin wrapper of a sequence iterator. The SavedIterable will pickle
    into a list.

    >>> s = SavedIterable(xrange(5))
    >>> iter(s).next()
    0
    >>> list(s)
    [0, 1, 2, 3, 4]

    >>> iter(s)   # doctest: +ELLIPSIS
    <listiterator object at 0x...>

    >>> import pickle
    >>> pickle.loads(pickle.dumps(s))
    [0, 1, 2, 3, 4]

    >>> u = SavedIterable(xrange(5))
    >>> one, two = iter(u), iter(u)
    >>> one.next(), two.next()
    (0, 0)
    >>> list(two)
    [1, 2, 3, 4]
    >>> list(one)
    [1, 2, 3, 4]

    >>> SavedIterable(range(3))
    [0, 1, 2]
    """
    def __new__(cls, iterable):
        # A plain list already supports repeated iteration; pass it
        # through unchanged instead of wrapping it.
        if isinstance(iterable, list):
            return iterable
        return object.__new__(cls)
    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.data = []
    def __iter__(self):
        if self.iterator is None:
            # Fully cached: hand out a plain list iterator
            return iter(self.data)
        return self._incremental_caching_iter()
    def _incremental_caching_iter(self):
        # Serve items from the cache as long as they are there; on a
        # cache miss, pull the next item from the source iterator,
        # cache it and yield it. Consulting the cache on every step
        # keeps concurrent iterators consistent with each other.
        indices = itertools.count()
        while True:
            idx = indices.next()
            try:
                yield self.data[idx]
            except IndexError:
                pass
            else:
                continue
            # Cache miss: self.data[idx] does not exist yet
            if self.iterator is None:
                return
            try:
                x = self.iterator.next()
                self.data.append(x)
                yield x
            except StopIteration:
                self.iterator = None
    def __reduce__(self):
        # pickle into a list with __reduce__
        # (callable, args, state, listitems)
        return (list, (), None, iter(self))

if __name__ == '__main__':
    import doctest
    doctest.testmod()

The recipe is as simple as possible and no simpler. It addresses performance by returning a list iterator directly when the iterable is fully cached.
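
To see the fast path in action, a short interactive sketch (the same behavior the ELLIPSIS doctest above demonstrates):

>>> s = SavedIterable(xrange(3))
>>> list(s)        # the first full pass caches everything
[0, 1, 2]
>>> iter(s)        # doctest: +ELLIPSIS
<listiterator object at 0x...>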

For my application, I pickle the SavedIterable to a list, which has the advantage that the pickles do not depend on this caching data structure at all. As another, completely optional, addition, the __new__ classmethod returns the list unchanged if a list is passed in; this emphasises that the class is simply a way to "streamingly load" a sequence.
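
To make the pickling point concrete: since __reduce__ reconstructs the object as a plain list, the pickle can be loaded even in a process that has never imported SavedIterable. A minimal sketch:

>>> import pickle
>>> s = SavedIterable(xrange(3))
>>> data = pickle.dumps(s)      # the stream describes a plain list
>>> type(pickle.loads(data))
<type 'list'>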

6 comments

Gabriel Genellina 14 years, 5 months ago

There seems to be a problem when one iterator exhausts the iterable and there are still other live iterators:

>>> s = SavedIterable(xrange(5))
>>> i1 = iter(s)
>>> next(i1)
0
>>> next(i1)
1
>>> next(i1)
2
>>> for x in s: print x,
...
0 1 2 3 4
>>> next(i1) # expected: 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

[Also, just a stylistic note: __new__ is a classmethod, and its first argument is usually called 'cls', not 'self']

Ulrik Sverdrup (author) 14 years, 5 months ago

Thank you for your comment, Gabriel. I was aware of this problem but hadn't worried about it yet.

I think changing:

 for x in self.iterator:

to:

 for x in (self.iterator or ()):

is enough to fix it.
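
The guard works because (None or ()) evaluates to the empty tuple, so once the source iterator has been dropped the loop simply runs zero times. A tiny interactive sketch:

>>> exhausted = None
>>> list(exhausted or ())   # no TypeError, the loop just runs zero times
[]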

Ulrik Sverdrup (author) 14 years, 5 months ago

That is addressed, but there is still a problem:

s = SavedIterable(xrange(5))
i = iter(s)
i.next()
list(s)
list(i)  # gives [] but should be [1, 2, 3, 4]

so it's not as simple as I thought it was.

Ulrik Sverdrup (author) 14 years, 5 months ago

I have fixed the recipe for concurrent iterators. We have to check both the cache and the iterator on every step for that to work, so _incremental_caching_iter has grown in appearance. Less pretty, but the concept is still very simple.

Ulrik Sverdrup (author) 14 years, 5 months ago

Just a clarification: I didn't understand the problem at first (even though I said I did), so thank you Gabriel! This fixes a potential bug in my program.

Raymond Hettinger 13 years, 7 months ago

Why not use itertools.tee() which does the caching for you?
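
For comparison, a minimal sketch of the tee-based approach (Python 2, matching the recipe). Note one difference: itertools.tee requires its iterators to be created up front, and once all of them have consumed an item it is discarded, so there is no way to start over from the beginning later; SavedIterable hands out fresh from-the-start iterators at any time.

>>> import itertools
>>> one, two = itertools.tee(xrange(5))
>>> one.next(), one.next()   # tee buffers these items for 'two'
(0, 1)
>>> list(two)
[0, 1, 2, 3, 4]
>>> list(one)
[2, 3, 4]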