Just a list subclass with a fast hash based __contains__ provided by a Set. Inheriting from list was far more work than I had anticipated AND the code below is still buggy.
Still, this provides O(1) indexed access and O(1) contains. If you use this expect the slicing code to be broken despite the attempts to get it right. Fixes graciously accepted!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | ## {{{ http://code.activestate.com/recipes/577185/ (r2)
class MembershipTestList(list):
def __init__(self, *args, **kwargs):
super(MembershipTestList, self).__init__(*args, **kwargs)
self._members = set(*args, **kwargs)
def __getslice__(self, i, j):
return self.__getitem__(slice(i, j))
def __setslice__(self, i, j, seq):
return self.__setitem__(slice(i, j), seq)
def __delslice__(self, i, j):
return self.__delitem__(slice(i, j))
def __getitem__(self, k):
if isinstance(k, slice):
indices = k.indices(len(self))
return [self.__getitem__(i) for i in xrange(*indices)]
return super(MembershipTestList, self).__getitem__(k)
def __setitem__(self, k, seq):
if isinstance(k, slice):
indices = k.indices(len(self))
posgen = iter(xrange(*indices))
for v in seq:
try:
pos = posgen.next()
except:
pos += 1
posgen = itertools.count(pos+1)
self.__setitem__(posgen.next(), v)
else:
self._members.add(seq)
super(MembershipTestList, self).__setitem__(k, seq)
def __delitem__(self, k):
if isinstance(k, slice):
indices = k.indices(len(self))
for i in xrange(*indices):
self.__delitem__(i)
else:
v = super(MembershipTestList, self).__getitem__(k)
self._members.remove(v)
super(MembershipTestList, self).__delitem__(k)
def append(self, value):
self._members.add(value)
return super(MembershipTestList, self).append(value)
def insert(self, pos, value):
self._members.add(value)
return super(MembershipTestList, self).insert(pos, value)
def remove(self, value):
self._members.remove( value )
return super(MembershipTestList, self).remove(value)
def extend(self, iterable):
for x in iterable:
self.append(x)
def pop(self, index=None):
if index is None:
index = len(self)
v = self[index]
self.__delitem__(index)
return v
def __contains__(self, v):
return v in self._members
## end of http://code.activestate.com/recipes/577185/ }}}
|
well, this means you can only put hashable things into the MembershipTestList.
Correct me if I'm wrong, but as I understand the whole idea of the recipe is to speed up "member-lookup" via __contains__ - which is evoked during "in" keyword.
But yet, your class does the vanilla "in" lookup in __contains__.
Bottomline - inorder to speed up __contains__, you store the members, but yet retrieve it via the vanilla __contains__.
About Jeffrey's comment:
I guess the question would be: is set.__contains__ quicker than list.__contains__ by a margin enough to be quicker than the function call involved?
The answer, as always, can be found by actually profiling:
So the set.__contains__ is quicker than the list.__contains__.
What about trying the recipe?
(MTSCode contains the recipe code) It is somewhere between both. So the recipe owner's claims are correct.
I guess the moral of the story is: performance needs to be measured.
Quite a few problems:
Also, remove() should raise ValueError instead of KeyError; and there is a missing import of itertools.
Correctly emulating the built-in list behavior may be much harder than you expected, and I'm unsure if it's worth the effort. Anyway, probably it's better to base your work on collections.MutableSequence instead; just implement the required abstract methods.
Hey Jeffrey,
I checked the python source before implementing this. __contains__ for python lists are O(n), whereas __contains__ for sets is O(1). OrderedSet and OrderedDict have the O(1) __contains__, but i've seen no implementations that preserve the O(1) indexed access (they are linked lists, not lists).
I'm working on code operating on large datasets in tight loops, which means both O(n) __contains__ and O(n) indexed access are not an option.
Gabriel: Great points! Implementing the full list interface is horrible. I'm not using slices or even the full API myself. This code was written in a time of need, and I didn't find anything else that was suitable. Despite the bugs I hope it's better than having to start from scratch.
I've fixed the UnboundLocalError.
You may have made __contains__ faster, at the cost of other operations... Your __getslice__ and __getitem__ functions however can be omitted, and the other functions can be further optimized.
Also, you didn't provide a speedup for "list.index()", which in my opinion is more interesting than __contains__, and that wouldn't work in your current approach. Also, the code does not handle multiple occurrences of the same value too well on deletion. E.g.:
But it's a useful observation. I'd rather speed this up in C, since it's so low level. It could be very useful for inclusion into Python. I might call it a "hashedlist", speeding up "index" as well and allowing multiple occurrences of values...