The fastest way to remove duplicates from a sequence depends on some pretty subtle properties of the sequence elements, such as whether they're hashable, and whether they support full comparisons. The unique() function tries three methods, from fastest to slowest, letting runtime exceptions pick the best method available for the sequence at hand.
def unique(s):
    """Return a list of the elements in s, but without duplicates.

    For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],
    unique("abcabc") some permutation of ["a", "b", "c"], and
    unique(([1, 2], [2, 3], [1, 2])) some permutation of
    [[2, 3], [1, 2]].

    For best speed, all sequence elements should be hashable.  Then
    unique() will usually work in linear time.

    If not possible, the sequence elements should enjoy a total
    ordering, and if list(s).sort() doesn't raise TypeError it's
    assumed that they do enjoy a total ordering.  Then unique() will
    usually work in O(N*log2(N)) time.

    If that's not possible either, the sequence elements must support
    equality-testing.  Then unique() will usually work in quadratic
    time.
    """

    n = len(s)
    if n == 0:
        return []

    # Try using a dict first, as that's the fastest and will usually
    # work.  If it doesn't work, it will usually fail quickly, so it
    # usually doesn't cost much to *try* it.  It requires that all the
    # sequence elements be hashable, and support equality comparison.
    u = {}
    try:
        for x in s:
            u[x] = 1
    except TypeError:
        del u  # move on to the next method
    else:
        return u.keys()

    # We can't hash all the elements.  Second fastest is to sort,
    # which brings the equal elements together; then duplicates are
    # easy to weed out in a single pass.
    # NOTE:  Python's list.sort() was designed to be efficient in the
    # presence of many duplicate elements.  This isn't true of all
    # sort functions in all languages or libraries, so this approach
    # is more effective in Python than it may be elsewhere.
    try:
        t = list(s)
        t.sort()
    except TypeError:
        del t  # move on to the next method
    else:
        assert n > 0
        last = t[0]
        lasti = i = 1
        while i < n:
            if t[i] != last:
                t[lasti] = last = t[i]
                lasti += 1
            i += 1
        return t[:lasti]

    # Brute force is all that's left.
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u
The interesting issues are discussed in the code comments: this is a very pure example of how algorithm efficiency depends on the strength of the assumptions you can make! Of course you could split this into three distinct functions, and call the one best meeting your needs directly. In practice, however, the brute force method is so slow for large sequences that nothing measurable is lost by simply letting the function as written try the faster methods first.
and when you can't lose order... unique() systematically loses order, but if items are hashable it's not hard to keep order intact -- only eliminate "later" insertions of already-present items. In case items _aren't_ hashable, for a big enough N using their cPickle.dumps() strings as dictionary keys might be worth it... that's easily generalized to "uniqueness within equivalence classes" -- the parameter function idfun must return hashable objects which are == for, and only for, items that are duplicates.
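The code that went with this comment isn't shown above; a minimal sketch of the idea as described (the name uniquer and the identity default for idfun are assumptions) might be:

def uniquer(seq, idfun=None):
    # Keep the first occurrence of each item, preserving order.
    # idfun must map items to hashable markers that are == exactly
    # when the items count as duplicates (e.g. idfun=cPickle.dumps
    # for unhashable items).
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen:
            continue
        seen[marker] = 1
        result.append(item)
    return result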
To get the latest rather than the earliest appearance of an item (i.e., a member of an equivalence class as defined by idfun), the best solution is to take quite a different tack:
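The code isn't reproduced above either; the tack described -- let later entries overwrite earlier ones in a dictionary, then restore relative order by sorting on the recorded indices -- might be sketched as follows (the name uniquest is taken from the next comment):

def uniquest(seq, idfun=None):
    # Keep the LAST occurrence of each item within its equivalence class.
    if idfun is None:
        def idfun(x): return x
    seen = {}
    for index, item in enumerate(seq):
        marker = idfun(item)
        seen[marker] = index, item     # later occurrences overwrite earlier ones
    auxlist = sorted(seen.values())    # indices are distinct, so this sorts by index
    return [item for index, item in auxlist]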
uniquest generalizes quite easily to cases in which the choice among multiple items in the same equivalence class needs to depend on some arbitrary precedence function that considers both the actual items and their indices of occurrence, as long as said precedence can operate pairwise. You just need to replace the simplistic "seen[marker] = index, item" with a call to the precedence function when marker is already a key in dictionary 'seen'; the precedence function must return the (index, item) pair for the chosen occurrence among the two (the one already seen and the newly occurring item/index). Here's the code, giving up the default idfun since this is really useful only for a substantial equivalence function...:
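A sketch of fancy_unique consistent with that description (a reconstruction, not the original posting):

def fancy_unique(seq, idfun, prefer):
    # prefer(new_pair, old_pair) receives two (index, item) pairs, the
    # newer occurrence first (so its index is always the larger), and
    # must return whichever pair should be kept.
    seen = {}
    for index, item in enumerate(seq):
        marker = idfun(item)
        if marker in seen:
            seen[marker] = prefer((index, item), seen[marker])
        else:
            seen[marker] = index, item
    auxlist = sorted(seen.values())
    return [item for index, item in auxlist]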
For example, say we have a list of words: we need to extract from it a list respecting order where no two items begin with the same letter; out of several words in the original list that begin with the same letter, we want to keep the longest, and, for equal length, the one appearing later in the list. Phew, sounds complicated, right? Not with fancy_unique to help us...:
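The example code isn't shown above; with the fancy_unique sketch and a made-up word list, it might read:

def prefer(pair1, pair2):
    # pair1 is the newer occurrence (id1 > id2): keep the older pair2
    # only when its word is strictly longer, otherwise keep pair1.
    id1, word1 = pair1
    id2, word2 = pair2
    if len(word2) > len(word1):
        return pair2
    return pair1

words = ['banana', 'blueberry', 'apple', 'apricot', 'cherry', 'berry']
result = fancy_unique(words, idfun=lambda word: word[0], prefer=prefer)
# result == ['blueberry', 'apricot', 'cherry']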
Here, 'prefer' is slightly tricky/fragile as it "knows" fancy_unique always calls it with id1>id2, so the older id2,word2 pair need only be returned when word2 is longer than word1, otherwise id1,word1 must always be the result. It's only a bit wordier to express the full set of tests in 'prefer', though.
And the nice thing is, these are ALL O(N) [in the same sense as the best case of the unique in the original recipe], although when one is using cPickle.dumps and/or other heavy stuff the multiplicative constants may be a tad heavyweight:-).
Lightweight and fast ...
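The code for this comment is missing above; judging from the replies (the map() with set.__setitem__, and the later reference to "R. Hettinger's unique"), it was probably along these lines (Python 2 idiom; 'set' here is just a local dict, not the 2.4 built-in type):

def uniq(alist):            # fastest order-preserving variant
    set = {}
    return [set.setdefault(e, e) for e in alist if e not in set]

def uniq_unordered(alist):  # fastest variant, order not preserved
    set = {}
    map(set.__setitem__, alist, [])   # Python 2: pads the short list with None
    return set.keys()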
But these don't address unhashability. If you are prepared to lose some backwards compatibility by using list comprehensions, then these functions are tidier variants of Tim's first algorithm. However, they don't work if the supplied list contains unhashable items (such as another list). Providing fallback algorithms to address that is what makes Tim's code longer.
another approach. This always worked for me:
You could always wrap it in a function.
this is much faster for large sequences, using a dictionary. This uses a dictionary with repr, similar to my previous approach, but much faster for large sequences. repr handles the hashability issues:
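The code isn't shown above; a sketch of the described approach, keying a dictionary on repr() so that unhashable elements (lists, sets, ...) can still be weeded out, would be:

def unique_repr(s):
    # elements whose reprs are equal are treated as duplicates
    seen = {}
    result = []
    for x in s:
        key = repr(x)
        if key not in seen:
            seen[key] = 1
            result.append(x)
    return result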
Python 2.4. In Python 2.4, sets will be introduced as a built-in type. This probably means the 'dictionary method' can be replaced by a method using sets.
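For hashable elements the fastest branch then becomes a one-liner, e.g. (a sketch; order is not preserved):

def unique24(s):
    # Python 2.4: a built-in set replaces the dictionary trick; raises
    # TypeError if some element is unhashable, so the sort-based and
    # brute-force fallbacks above still have their place.
    return list(set(s))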
second method is O(n^2). In your second method (t[lasti] = last = t[i]), list indexing is O(n), so the whole algorithm works in O(n^2).
Use built-in functions. Why the map() with set.__setitem__? Create a set from the list, like this:
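For example (Python 2.4 and later):

alist = [1, 2, 3, 1, 2, 3]
unique_elements = set(alist)        # set([1, 2, 3])
unique_list = list(set(alist))      # if a list is wanted back; order not preserved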
In pre-set Python, I've used a similar approach for dictionaries to be used as sets, though I haven't timed it against a for loop. Convert the list into a list of tuples with the second value a "1" (or true). Make that list into a dictionary.
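A sketch of that pre-set idiom (the sample list is illustrative):

alist = ['a', 'b', 'a', 'c', 'b']
pairs = zip(alist, [1] * len(alist))   # [('a', 1), ('b', 1), ('a', 1), ...]
d = dict(pairs)                        # duplicate keys collapse into one entry
unique_list = d.keys()                 # a list in Python 2; order not preserved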
For the all-purpose unique routine, the for loop is probably better than the dict/map/lambda one-liner, because it will fail earlier on non-hashable elements. The one-liner builds a whole list before it tries to hash anything. Calling set() directly should not have that problem.
One other minor thing: getting the length of a sequence is so fast that I don't see the point of saving it at the beginning of the routine. Try this to test for empty sequences, and to get graceful behavior when passed None. Then get the length closer to the loop that uses it.
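A sketch of the suggested change to the top of the routine (the three methods themselves are elided):

def unique(s):
    # "if not s" tests for emptiness and also copes gracefully with
    # s being None, which len(s) would not
    if not s:
        return []
    # ... dict method as in the recipe above ...
    n = len(s)    # fetch the length only here, close to the loop that uses it
    # ... sort-based and brute-force methods as in the recipe above ...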
Indexing is O(1). Python's list data type is not a linked list. Under the hood it's an array of pointers, so indexing works in constant time. Which means the sorting version of unique() does have complexity O(n*log(n)), just like Tim Peters said.
Using an iterator... Sometimes you don't want to iterate through the whole sequence, and just want the first few unique elements. Here is an iterator that does just that, assuming objects are hashable (otherwise, approaches seen in this recipe and comments can be used).
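The iterator isn't shown above; a sketch of such a generator, together with itertools.islice to take just the first few unique elements, might be:

import itertools

def iter_unique(seq):
    # yield each distinct element the first time it is seen; only as much
    # of the sequence as the caller asks for is ever consumed
    seen = {}
    for item in seq:
        if item not in seen:
            seen[item] = 1
            yield item

first_three = list(itertools.islice(iter_unique([1, 1, 2, 1, 3, 4, 5]), 3))
# first_three == [1, 2, 3]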
Or you could use Chris Perkins' wonderful trick.
Yes, Python 2.4 sets are marvelous and fast.
A derived version. http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/438599
To get R. Hettinger's unique to work for unhashable objects, you can use an "idfun" there, too:
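The code for this comment isn't shown above; combining the setdefault one-liner with an idfun defaulting to repr might look like this (a sketch, not the original):

def uniq(alist, idfun=repr):
    # order-preserving; idfun must map items to hashable markers that
    # are == exactly for duplicates -- repr works even for lists and sets
    seen = {}
    return [seen.setdefault(idfun(e), e) for e in alist if idfun(e) not in seen]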
(Maybe repr is not the best choice.)
I am happy to have come across this recipe; it really helps me a lot. Thanks!
I found a use case where this algorithm fails: sets aren't hashable, and their comparison operators do subset testing rather than defining a total ordering, so the sort-based pass raises no TypeError yet doesn't bring the equal sets together:
aa = set('a')
bb = set('b')
cc = [aa, bb, aa]
unique(cc) --> [set(['a']), set(['b']), set(['a'])]