List with fast membership test (__contains__) « Python recipes

Just a list subclass with a fast hash based __contains__ provided by a Set. Inheriting from list was far more work than I had anticipated AND the code below is still buggy.

Still, this provides O(1) indexed access and O(1) contains. If you use this expect the slicing code to be broken despite the attempts to get it right. Fixes graciously accepted!

      ## {{{ http://code.activestate.com/recipes/577185/ (r2)
class MembershipTestList(list):
    def __init__(self, *args, **kwargs):
        super(MembershipTestList, self).__init__(*args, **kwargs)
        self._members = set(*args, **kwargs)

    def __getslice__(self, i, j):
        return self.__getitem__(slice(i, j))
    def __setslice__(self, i, j, seq):
        return self.__setitem__(slice(i, j), seq)
    def __delslice__(self, i, j):
        return self.__delitem__(slice(i, j))
    
    def __getitem__(self, k):
        if isinstance(k, slice):
            indices = k.indices(len(self))
            return [self.__getitem__(i) for i in xrange(*indices)]
        return super(MembershipTestList, self).__getitem__(k)
                    
    def __setitem__(self, k, seq):
        if isinstance(k, slice):
            indices = k.indices(len(self))
            posgen = iter(xrange(*indices))
            for v in seq:
                try:
                    pos = posgen.next()
                except:
                    pos += 1
                    posgen = itertools.count(pos+1)
                self.__setitem__(posgen.next(), v)
        else:
            self._members.add(seq)
            super(MembershipTestList, self).__setitem__(k, seq)
        
    def __delitem__(self, k):
        if isinstance(k, slice):
            indices = k.indices(len(self))
            for i in xrange(*indices):
                self.__delitem__(i)
        else:
            v = super(MembershipTestList, self).__getitem__(k)
            self._members.remove(v)
            super(MembershipTestList, self).__delitem__(k)

    def append(self, value):
        self._members.add(value)
        return super(MembershipTestList, self).append(value)

    def insert(self, pos, value):
        self._members.add(value)
        return super(MembershipTestList, self).insert(pos, value)

    def remove(self, value):
        self._members.remove( value )
        return super(MembershipTestList, self).remove(value)
    
    def extend(self, iterable):
        for x in iterable:
            self.append(x)
    
    def pop(self, index=None):
        if index is None:
            index = len(self)
        v = self[index]
        self.__delitem__(index)
        return v
        
    def __contains__(self, v):
        return v in self._members
## end of http://code.activestate.com/recipes/577185/ }}}

      

well, this means you can only put hashable things into the MembershipTestList.

Tags: list, subclass

5 comments

Jeffrey Jose 14 years ago # | flag

Correct me if I'm wrong, but as I understand the whole idea of the recipe is to speed up "member-lookup" via __contains__ - which is evoked during "in" keyword.

But yet, your class does the vanilla "in" lookup in __contains__.

Bottomline - inorder to speed up __contains__, you store the members, but yet retrieve it via the vanilla __contains__.

Muhammad Alkarouri 14 years ago # | flag

About Jeffrey's comment:

I guess the question would be: is set.__contains__ quicker than list.__contains__ by a margin enough to be quicker than the function call involved?

The answer, as always, can be found by actually profiling:

>>> timeit.timeit('79 in k', 'k=list(range(1000))')
2: 2.49610886299795
>>> timeit.timeit('79 in k', 'k=set(range(1000))')
3: 0.093028862606843177
>>> timeit.timeit('379 in k', 'k=list(range(1000))')
4: 11.671246764602756
>>> timeit.timeit('379 in k', 'k=set(range(1000))')
5: 0.13121108967759199

So the set.__contains__ is quicker than the list.__contains__.

What about trying the recipe?

>>> timeit.timeit('79 in k', MTSCode + 'k=MembershipTestList(range(1000))')
10: 0.51234137299570648
>>> timeit.timeit('379 in k', MTSCode + 'k=MembershipTestList(range(1000))')
11: 0.542829046708448

(MTSCode contains the recipe code) It is somewhere between both. So the recipe owner's claims are correct.

I guess the moral of the story is: performance needs to be measured.

Gabriel Genellina 14 years ago # | flag

Quite a few problems:

>>> k = MembershipTestList(range(10))
>>> k[3] = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "list_con_set.py", line 34, in __setitem__
    self._members.add(v)
UnboundLocalError: local variable 'v' referenced before assignment


>>> k += range(10, 20)
>>> 15 in k
False    # should be True
>>> set(k) == k._members
False


>>> del k[:4]
>>> k
[1, 3, 5, 7, 8, 9]
# should be: [4, 5, 6, 7, 8, 9]


>>> k[1:3:2]=[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "list_con_set.py", line 32, in __setitem__
    self.__setitem__(posgen.next(), v)
StopIteration    

>>> k[1:5:2]=[]
# should fail with ValueError, both sides are not the same size

Also, remove() should raise ValueError instead of KeyError; and there is a missing import of itertools.

Correctly emulating the built-in list behavior may be much harder than you expected, and I'm unsure if it's worth the effort. Anyway, probably it's better to base your work on collections.MutableSequence instead; just implement the required abstract methods.

Paul Bohm 14 years ago # | flag

Hey Jeffrey,

I checked the python source before implementing this. __contains__ for python lists are O(n), whereas __contains__ for sets is O(1). OrderedSet and OrderedDict have the O(1) __contains__, but i've seen no implementations that preserve the O(1) indexed access (they are linked lists, not lists).

I'm working on code operating on large datasets in tight loops, which means both O(n) __contains__ and O(n) indexed access are not an option.

Gabriel: Great points! Implementing the full list interface is horrible. I'm not using slices or even the full API myself. This code was written in a time of need, and I didn't find anything else that was suitable. Despite the bugs I hope it's better than having to start from scratch.

I've fixed the UnboundLocalError.

david.gaarenstroom 14 years ago # | flag

You may have made __contains__ faster, at the cost of other operations... Your __getslice__ and __getitem__ functions however can be omitted, and the other functions can be further optimized.

Also, you didn't provide a speedup for "list.index()", which in my opinion is more interesting than __contains__, and that wouldn't work in your current approach. Also, the code does not handle multiple occurrences of the same value too well on deletion. E.g.:

mlist = MembershipTestList([3,2,1,3,4,1,2])
del mlist[2]
print 1 in k

But it's a useful observation. I'd rather speed this up in C, since it's so low level. It could be very useful for inclusion into Python. I might call it a "hashedlist", speeding up "index" as well and allowing multiple occurrences of values...

◄	Python recipes (4591)	►
◄	Paul Bohm's recipes (1)	►

List with fast membership test (contains) (Python recipe) by Paul Bohm
ActiveState Code (http://code.activestate.com/recipes/577185/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

List with fast membership test (__contains__) (Python recipe) by Paul Bohm ActiveState Code (http://code.activestate.com/recipes/577185/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

List with fast membership test (contains) (Python recipe) by Paul Bohm
ActiveState Code (http://code.activestate.com/recipes/577185/)