Welcome, guest | Sign In | My Account | Store | Cart
5

The fastest way to remove duplicates from a sequence depends on some pretty subtle properties of the sequence elements, such as whether they're hashable, and whether they support full comparisons. The unique() function tries three methods, from fastest to slowest, letting runtime exceptions pick the best method available for the sequence at hand.

Python, 67 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
def unique(s):
    """Return a list of the elements in s, but without duplicates.

    For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],
    unique("abcabc") some permutation of ["a", "b", "c"], and
    unique(([1, 2], [2, 3], [1, 2])) some permutation of
    [[2, 3], [1, 2]].

    For best speed, all sequence elements should be hashable.  Then
    unique() will usually work in linear time.

    If not possible, the sequence elements should enjoy a total
    ordering, and if list(s).sort() doesn't raise TypeError it's
    assumed that they do enjoy a total ordering.  Then unique() will
    usually work in O(N*log2(N)) time.

    If that's not possible either, the sequence elements must support
    equality-testing.  Then unique() will usually work in quadratic
    time.
    """

    n = len(s)
    if n == 0:
        return []

    # Try using a dict first, as that's the fastest and will usually
    # work.  If it doesn't work, it will usually fail quickly, so it
    # usually doesn't cost much to *try* it.  It requires that all the
    # sequence elements be hashable, and support equality comparison.
    u = {}
    try:
        for x in s:
            u[x] = 1
    except TypeError:
        del u  # move on to the next method
    else:
        return u.keys()

    # We can't hash all the elements.  Second fastest is to sort,
    # which brings the equal elements together; then duplicates are
    # easy to weed out in a single pass.
    # NOTE:  Python's list.sort() was designed to be efficient in the
    # presence of many duplicate elements.  This isn't true of all
    # sort functions in all languages or libraries, so this approach
    # is more effective in Python than it may be elsewhere.
    try:
        t = list(s)
        t.sort()
    except TypeError:
        del t  # move on to the next method
    else:
        assert n > 0
        last = t[0]
        lasti = i = 1
        while i < n:
            if t[i] != last:
                t[lasti] = last = t[i]
                lasti += 1
            i += 1
        return t[:lasti]

    # Brute force is all that's left.
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u

The interesting issues are discussed in the code comments: this is a very pure example of how algorithm efficiency depends on the strength of the assumptions you can make! Of course you could split this into three distinct functions, and call the one best meeting your needs directly. In practice, however, the brute force method is so slow for large sequences that nothing measurable is lost by simply letting the function as written try the faster methods first.

16 comments

Alex Martelli 14 years, 9 months ago  # | flag

and when you can't lose order... unique() systematically loses order, but if items are hashable it's not hard to keep order intact -- only eliminate "later" insertions of already-present items. In case items _aren't_ hashable, for a big enough N using their cPickle.dumps() might be worth it... that's easily generalized to "uniqueness within equivalence classes" -- parameter function idfun must return hashable objects which are == for and only for items that are duplicates.

def uniquer(seq, idfun=None):
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

To get the latest rather than earliest appearance of an item (==a member of an equivalence class as defined by idfun), the best solution is to take quite another equivalent tack:

def uniquest(seq, idfun=None):
    import sys
    if idfun is None:
        def idfun(x): return x
    seen = {}
    for item, index in zip(seq,xrange(sys.maxint)):
        marker = idfun(item)
        seen[marker] = index, item
    auxlist = seen.keys()
    auxlist.sort()
    return [item for index, item in auxlist]

uniquest generalizes quite easily to cases in which choice among multiple items in the same equivalence class needs to depend on some arbitrary precedence-function that considers both the actual items and their indices of occurrence, as long as said precedence can operate pairwise -- you just need to replace the simplistic "seen[marker] = index, item" with a call to the precedence function when marker is already a key in dictionary 'seen' -- the precedence function must return the (index,item) pair for the chosen occurrence among the two (the one already seen and the newly occurring item/index). Here's the code, giving up the default idfun since this is really useful only for a substantial equivalence-function...:

def fancy_unique(seq, idfun, precedence):
    seen = {}
    for item, index in zip(seq,xrange(sys.maxint)):
        marker = idfun(item)
        if seen.has_key(marker):
            index, item = precedence((index,item),seen[marker])
        seen[marker] = index, item
    auxlist = seen.keys()
    auxlist.sort()
    return [item for index, item in auxlist]

(comment continued...)

Alex Martelli 14 years, 9 months ago  # | flag

(...continued from previous comment)

For example, say we have a list of words: we need to extract from it a list respecting order where no two items begin with the same letter; out of several words in the original list that begin with the same letter, we want to keep the longest, and, for equal length, the one appearing later in the list. Phew, sounds complicated, right? Not with fancy_unique to help us...:

def complicated_stuff(wordlist):
    def first_letter(aword): return aword[0].lower()
    def prefer((id1,word1),(id2,word2)):
        if len(word2)>len(word1): return id2,word2
        return id1,word1
    return fancy_unique(wordlist, first_letter, prefer)

Here, 'prefer' is slightly tricky/fragile as it "knows" fancy_unique always calls it with id1>id2, so the older id2,word2 pair need only be returned when word2 is longer than word1, otherwise id1,word1 must always be the result. It's only a bit wordier to express the full set of tests in 'prefer', though.

And the nice thing is, these are ALL O(N) [in the same sense as the best case of the unique in the original recipe], although when one is using cPickle.dumps and/or other heavy stuff the multiplicative constants may be a tad heavyweight:-).

Raymond Hettinger 14 years, 4 months ago  # | flag

Lightweight and fast ...

def uniq(alist)    # Fastest order preserving
    set = {}
    return [set.setdefault(e,e) for e in alist if e not in set]

def uniq(alist)    # Fastest without order preserving
    set = {}
    map(set.__setitem__, alist, [])
    return set.keys()
Hamish Lawson 13 years, 9 months ago  # | flag

But these don't address unhashability. If you are prepared to lose some backwards compatibility by using list comprehensions, then these functions are tidier variants of Tim's first algorithm. However they don't work if the supplied list contains unhashable items (such as a another list). Providing fallback algorithms to address that is what makes Tim's code longer.

Clark Updike 13 years, 4 months ago  # | flag

another approach. This always worked for me:

>>> hasDupes=[['a','b'],['a','b'],['c','d']]
>>> noDupes=[]
>>> [noDupes.append(i) for i in hasDupes if not noDupes.count(i)]
[None, None]
>>> noDupes
[['a', 'b'], ['c', 'd']]
>>>

You could always wrap it in a function.

Clark Updike 13 years, 4 months ago  # | flag

this is much faster for large sequences, using a dictionary. This uses a dictionary with repr, similar to my previous approach, but much faster for large sequences. repr handles the hashability issues:

>>> import operator
>>> setitem = operator.setitem
>>> hasDupes=[['a','b'],['a','b'],['c','d']]*1000
>>> noDupes={}
>>> [operator.setitem(noDupes,repr(i),i) for i in hasDupes if not noDupes.has_key(repr(i))]
[None, None]
>>> noDupes = noDupes.values()
>>> noDupes
[['a', 'b'], ['c', 'd']]
Gerrit Holl 12 years, 8 months ago  # | flag

Python 2.4. In Python 2.4, sets will be introduced as a builtin-type. This probably means the 'dictionary-method' can be replaced by a method using sets.

Sridhar R 11 years, 9 months ago  # | flag

second method is O(n^2). In your second method,

(t[lasti] = last = t[i])

list indexing is O(n), so the whole algo works in O(n^2).

  • R. Sridhar
Walter Underwood 11 years, 4 months ago  # | flag

Use built-in functions. Why the map() with set._setitem_? Create a set from the list, like this:

return set(alist).keys()

In pre-set Python, I've used a similar approach for dictionaries to be used as sets, though I haven't timed it against a for loop. Convert the list into a list of tuples with the second value a "1" (or true). Make that list into a dictionary.

adict = dict(map(lambda a: (a,1), alist))

For the all-purpose unique routine, the for loop is probably better than the dict/map/lambda one-liner, because it will fail earlier on non-hashable elements. This approach makes a whole list before it tries to hash anything. Calling set() directly should not have that problem.

One other minor thing: getting the length of a sequence is so fast that I don't see the point of saving it at the beginning of the routine. Try this to test for empty sequences, and to get graceful behavior when passed None. Then get the length closer to the loop that uses it.

if not s: return []
Mikko Pekkarinen 11 years, 3 months ago  # | flag

Indexing is O(1). Python's list data type does not mean a linked list. Under the hood it's an array of pointers, so that indexing works in constant time. Which means that the sorting version of unique() does have complexity O(n*log(n)), just like Tim Peters said.

Matteo Dell'Amico 11 years, 3 months ago  # | flag

Using an iterator... Sometimes you don't want to iterate through all the sequence, and just want the first few unique elements. Here is an iterator that does just that, assuming objects are hashable (otherwise, approaches seen in this recipe and comments can be used).

def iunique(iterable):
    seen = set([])
    for i in iterable:
        if i not in seen:
            seen.add(i)
            yield i
Frank P Mora 11 years, 1 month ago  # | flag

Or you could use Chris Perkins' wonderful trick.

Chris Perkins' recipe  “The Secret Name of List Comprehensions”
'_[1]' is again of high value.


>>> hasDups= [['a','b'],['a','b'],['c','d']]
>>> [ u for u in hasDups if u not in locals()['_[1]'] ]
[['a','b'],['c','d']]

Also it is not necessary to append or setitem as the primary operation
of a list comprehension. If you want to construct a list it does exactly that.
If you want to name it for reuse then

>>> noDups=[ u for u in hadDups if u not in locals()['_[1]'] ]
Frank P Mora 11 years, 1 month ago  # | flag

Yes, Python 2.4 sets are marvelous and fast.

And they are as polymorphic as the Python list.

>>> list(set([ chr(i) for i in range(97,123)] * 1000))
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
anaxagramma 6 years, 5 months ago  # | flag

To get R. Hettinger's unique work for unhashable objects you can use some "idfun" there, too:

def unique(seq, idfun=repr):
    seen = {}
    return [seen.setdefault(idfun(e),e) for e in seq if idfun(e) not in seen]

(Maybe repr is not the best choice.)

Liang Junhua 3 years, 10 months ago  # | flag

I am happy to cross over this recipe, it really helps me a lot. Thanks!

Add a comment

Sign in to comment