Welcome, guest | Sign In | My Account | Store | Cart

Python's "in" operator is extremely handy, but O(N) when applied to an N-item sequence; if a sequence is subject to frequent "in" tests, an auxiliary dictionary at its side can boost performance A LOT if the values are hashable.

Python, 69 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# simple, naive, excellent... but perhaps slow!
def addUnique1(baseList, otherList):
    for item in otherList:
        if item not in baseList:
            baseList.append(item)

# may be faster if otherList is large:
def addUnique2(baseList, otherList):
    auxDict = {}
    for item in baseList:
        auxDict[item] = None
    for item in otherList:
        if not auxDict.has_key(item):
            baseList.append(item)
            auxDict[item] = None

# often better is to wrap the sequence, together
# with its auxiliary dictionary, in an object
# (using __contains__ to speed "in"-tests) --
# note the dictionary must be carefully maintained
# to stay "in sync" with the sequence!  Here's a
# version which does the syncing "just in time",
# when a membership test is actually required...:
import UserList
class FunkyList(UserList.UserList):
    def __init__(self, initlist=None):
        UserList.__init__(self, initlist)
        self._dict_ok = None
    def _sync_dict(self):
        if not self._dict_ok:
            self._dict = {}
            for item in self.data:
                self._dict[item] = None
            self._dict_ok = 1
    def __contains__(self, item):
        self._sync_dict()
        return self._dict.has_key(item)

    # the auxiliary, internal-use method that
    # resets the 'dictionary OK' flag then
    # delegates the actual operation
    def _delegate_modify(self, method, *args):
        self._dict_ok = None
        return method(self, *args)

    # patiently delegate each potentially membership-changing
    # method through the _delegate_modify one, so that _dict_ok 
    # gets reset whenever membership may have changed

    def __setitem__(self, *args)
        return self._delegate_modify(UserList.__setitem__, *args)
    def __delitem__(self, *args):
        return self._delegate_modify(UserList.__delitem__, *args)
    def __setslice__(self, *args):
        return self._delegate_modify(UserList.__setslice__, *args)
    def __delslice__(self, *args):
        return self._delegate_modify(UserList.__delslice__, *args)
    def __iadd__(self, *args):
        return self._delegate_modify(UserList.__iadd__, *args)
    def append(self, *args):
        return self._delegate_modify(UserList.append, *args)
    def insert(self, *args):
        return self._delegate_modify(UserList.insert, *args)
    def pop(self, *args):
        return self._delegate_modify(UserList.pop, *args)
    def remove(self, *args):
        return self._delegate_modify(UserList.remove, *args)
    def extend(self, *args):
        return self._delegate_modify(UserList.extend, *args)

A membership check ("in" operator) on a sequence of N items is O(N); if M such tests are performed, overall time is O(M*N). Preparing an auxiliary dictionary whose keys are the sequence's items is also [roughly] O(N), but then the M tests are just [roughly] O(M), so overall we have [roughly] O(N+M).

Even better overall performance may often be obtained by 'permanently' placing the auxiliary dictionary on the side of the sequence, encapsulating both into one object; however, in this case, the dictionary must be "maintained", as the sequence is modified, so that it stays "in sync" with the actual membership of the sequence. The above code, for example, extends UserList, and delegates to it every method -- this is done explicitly, for methods that may modify the membership, to enable resetting a flag that asserts "the auxiliary dict is in sync"; the auxdict gets rebuilt, if needed, when __contains__ is used (the "in" operator calls the __contains__ method when applied to an instance that has it).

Of course, performance characteristics depend on the actual pattern of membership tests versus membership modifications, and it may need some careful profiling to find the right approach for a given use. This recipe, however, caters for a very common pattern of use -- where sequence-modifying operations happen "in bunches", then, for some time, no sequence modification is performed, but several membership-tests may be done.

Rebuilding the dictionary when needed is also simpler than incrementally maintaining it at each sequence-modifying step (which requires careful analysis of what is being removed, and what inserted, particularly upon such operations as slice-assignment; if incremental maintenance of the auxdict is desired, it will likely be wise to have the values in the auxdict be a count of the number of occurrences of each key's value in the sequence -- a list of the indices where the value is present being another lively possibility, but taking more work to maintain), although, depending on usage patterns, the resulting software might happen to be substantially slower.

Of course, all of this is only needed if _the sequence itself_ is needed -- i.e., the order of items in the sequence is significant; otherwise, keeping just the dictionary is obviously simpler and more effective! (Again, the dictionary can map values to _counts_ if the only need is for the data structure to be a 'bag' rather than a 'set').

An important prerequisite for any of these membership-test optimizations is that the values in the sequence must be _hashable_ ones -- else, of course, they cannot be keys in a dictionary! For example, a list of tuples might be subjected to this treatment, but for a list of lists they would not be applicable.

Created by Alex Martelli on Mon, 26 Mar 2001 (PSF)
Python recipes (4591)
Alex Martelli's recipes (27)

Required Modules

  • (none specified)

Other Information and Tasks