ActiveState Code

Recipe 497006: zip_exc(), a lazy zip() that ensures that all iterables have the same length


Using zip(names, values) may inadvertently eat some of your data when there are, e. g., fewer values than names. This is easy to fix with assert len(names) == len(values) if the arguments' length is known, but not if they are arbitrary iterables. With zip_exc() no such glitches go unnoticed as list(zip_exc(names, values)) throws a LengthMismatch exception if the number of names and values differ.

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
"""
>>> list(zip_exc([]))
[]

>>> list(zip_exc((), (), ()))
[]

>>> list(zip_exc("abc", range(3)))
[('a', 0), ('b', 1), ('c', 2)]

>>> try:
...     list(zip_exc("", range(3)))
... except LengthMismatch:
...     print "mismatch"
mismatch

>>> try:
...     list(zip_exc(range(3), ()))
... except LengthMismatch:
...     print "mismatch"
mismatch

>>> try:
...     list(zip_exc(range(3), range(2), range(4)))
... except LengthMismatch:
...     print "mismatch"
mismatch

>>> items = zip_exc(range(3), range(2), range(4))
>>> items.next()
(0, 0, 0)
>>> items.next()
(1, 1, 1)
>>> try: items.next()
... except LengthMismatch: print "mismatch"
mismatch
"""

from itertools import chain, izip

class LengthMismatch(Exception):
    pass

def _throw():
    raise LengthMismatch
    yield None # unreachable

def _check(rest):
    for i in rest:
        try:
            i.next()
        except LengthMismatch:
            pass
        else:
            raise LengthMismatch
    return
    yield None # unreachable

def zip_exc(*iterables):
    """Like itertools.izip(), but throws a LengthMismatch exception if
    the iterables' lengths differ.
    """
    rest = [chain(i, _throw()) for i in iterables[1:]]
    first = chain(iterables[0], _check(rest))
    return izip(*[first] + rest)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Discussion

My implementation looks a bit different than the straightforward approach used in http://mail.python.org/pipermail/python-3000/2006-March/000160.html, for example.

To keep the performance hit low, I've tried hard move as much of the work into code written in C (The chain() and izip() functions from the marvelous itertools module). I challenge you to come up with something faster in pure Python :-)

Comments

  1. 1. At 11:29 p.m. on 31 aug 2006, Steven Bethard said:

    Raymond Hettinger's solution. Raymond Hettinger as ususal provided a beautiful solution to this problem:

    def izip_exact(*args):
        sentinel = object()
        iters = [chain(it, repeat(sentinel)) for it in args]
        for result in izip(*iters):
            if sentinel in result:
                if all(value==sentinel for value in result):
                    return
                raise ValueError('sequences of different lengths')
            yield result
    

    I believe the plan is to add something like this to the itertools recipes section in the not-so-distant future.

  2. 2. At 11:35 p.m. on 1 sep 2006, Peter Otten (the author) said:

    Mine is faster for "long" sequences. zip_exc() doesn't do expensive tests for every tuple. So I'm claiming "conceptual beauty", a term which I've just made up :-)

    $ python2.5 -m timeit -s'from zips import zip_exc as zip; data = [range(1000)]3' 'for item in zip(data): pass'

    1000 loops, best of 3: 255 usec per loop

    $ python2.5 -m timeit -s'from zips import izip_exact as zip; data = [range(1000)]3' 'for item in zip(data): pass'

    1000 loops, best of 3: 1.08e+03 usec per loop

    On the other hand the setup for izip_exact() is a bit more lightweight. With three sequences, the break-even is at 20 items (i. e. for less than 20 3-tuples izip_exact() wins):

    $ python2.5 -m timeit -s'from zips import izip_exact as zip; data = [range(20)]3' 'for item in zip(data): pass'

    10000 loops, best of 3: 33.9 usec per loop

    $ python2.5 -m timeit -s'from zips import zip_exc as zip; data = [range(20)]3' 'for item in zip(data): pass'

    10000 loops, best of 3: 33.5 usec per loop

    I used the release candidate of 2.5 for the measurements because it provides all() as a built-in.

    PS: If you want to use Raymond's approach in 2.4 you can define

    def all(items):
        return True in (True for item in items if item)
    
  3. 3. At 11:20 p.m. on 3 sep 2006, Raymond Hettinger said:

    Alternate version without the O(n) sentinel search. If speed is the main concern, the O(n) sentinel search step can be replaced with a generator that tracks the number of calls to next():

    def izip_exact(*args):
        count = []
        def counter():
            while 1:
                count.append(None)
                yield None
        sentinel = counter()
        iters = [chain(it, sentinel) for it in args]
        for result in izip(*iters):
            if count:
                if len(count) == len(args):
                    return
                raise ValueError('sequences of different lengths')
            yield result
    
  4. 4. At 10:40 p.m. on 4 sep 2006, Raymond Hettinger said:

    FYI, Peter's version is still the fastest.

  5. 5. At 11:54 a.m. on 7 sep 2006, Raymond Hettinger said:

    Faster Py2.4 version of all().

    def all(seq):
        for elem in ifilterfalse(None, seq):
            return False
        return True
    

    Also, when using all() in the above recipe, use an "is" test instead of "==":

    if all(value is sentinel for value in result):
        return
    

Sign in to comment