Welcome, guest | Sign In | My Account | Store | Cart

Set that remembers original insertion order.

Python, 67 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
import collections

class OrderedSet(collections.MutableSet):

    def __init__(self, iterable=None):
        self.end = end = [] 
        end += [None, end, end]         # sentinel node for doubly linked list
        self.map = {}                   # key --> [key, prev, next]
        if iterable is not None:
            self |= iterable

    def __len__(self):
        return len(self.map)

    def __contains__(self, key):
        return key in self.map

    def add(self, key):
        if key not in self.map:
            end = self.end
            curr = end[1]
            curr[2] = end[1] = self.map[key] = [key, curr, end]

    def discard(self, key):
        if key in self.map:        
            key, prev, next = self.map.pop(key)
            prev[2] = next
            next[1] = prev

    def __iter__(self):
        end = self.end
        curr = end[2]
        while curr is not end:
            yield curr[0]
            curr = curr[2]

    def __reversed__(self):
        end = self.end
        curr = end[1]
        while curr is not end:
            yield curr[0]
            curr = curr[1]

    def pop(self, last=True):
        if not self:
            raise KeyError('set is empty')
        key = self.end[1][0] if last else self.end[2][0]
        self.discard(key)
        return key

    def __repr__(self):
        if not self:
            return '%s()' % (self.__class__.__name__,)
        return '%s(%r)' % (self.__class__.__name__, list(self))

    def __eq__(self, other):
        if isinstance(other, OrderedSet):
            return len(self) == len(other) and list(self) == list(other)
        return set(self) == set(other)

            
if __name__ == '__main__':
    s = OrderedSet('abracadaba')
    t = OrderedSet('simsalabim')
    print(s | t)
    print(s & t)
    print(s - t)

Runs on Py2.6 or later (and runs on 3.0 or later without any modifications).

Implementation based on a doubly linked link and an internal dictionary. This design gives OrderedSet the same big-Oh running times as regular sets including O(1) adds, removes, and lookups as well as O(n) iteration.

11 comments

Andrew Henshaw 15 years ago  # | flag

In "__init__", iterable is checked to be "not None" before calling "update", which seems redundant because "update" does the same check before doing anything.

Adam Olsen 14 years, 11 months ago  # | flag

Although this recipe is far superior in that it actually is a set, my older recipe doesn't require the cyclic GC to be collected. http://code.activestate.com/recipes/528878/ It'd be fairly easy to combine the two.

Raymond Hettinger (author) 14 years, 11 months ago  # | flag

Adam, this recipe doesn't use or rely on GC. Individual links get deleted immediately when discard() is called. Likewise, when the whole set is no longer needed, the __del__ method neatly deletes clears all of the links without waiting for GC.

Also, FWIW, there is an alternate version of this recipe using weakrefs. See recipe 576696. Neither version needs GC.

Nagappan Alagappan 14 years, 6 months ago  # | flag

Hello Raymond,

I'm planing to use this module in LDTPv2 - http://ldtp.freedesktop.org under LGPLv2 license. If you have any objections, please write me to nagappan (at) gmail (dot) com. I have just mentioned your name in the file. I could not find your email id to specify in orderedset.py (http://cgit.freedesktop.org/ldtp/ldtp2/tree/ldtpd/orderedset.py). You can email me that as well, if you are interested with my proposal :)

You have done a wonderful job.

Thanks Nagappan

Greg 13 years, 7 months ago  # | flag

The __eq__ operator for comparison with a non-OrderedSet returns

not self.isdisjoint(other)

but this returns False when comparing empty sets. I think the following should return True, not False:

OrderedSet([]) == set([])
Don Sawatzky 13 years, 5 months ago  # | flag

I used this OrderedSet in a chain of functions processing lists. I soon found it was bogging down the calculations. I studied the code and find it is overloaded with Python features. For my purpose, the following simpler function runs 4 times faster.

def OrderedSet(alist):

    """ Creates an ordered set from a list of tuples or other hashable items """

    mmap = {} # implements hashed lookup

    oset = [] # storage for set

    for item in alist:

            #Save unique items in input order

            if item not in mmap:

                    mmap[item] = 1

                    oset.append(item)

    return oset
Carlos Martín 13 years ago  # | flag

I've create a python package for this recipe and upload to Pypi as 'oset'. Code on http://gitorious.org/sleipnir/python-oset. It also works on python 2.5.

@Raymond: If you are interested in maintain this, it's all yours :)

Oren Zomer 12 years, 2 months ago  # | flag

Hello Raymond,

I created an MIT Licensed project that is using your OrderedSet recipe. I've added the required MIT License comment at the beginning: http://pypsifas.hg.sourceforge.net/hgweb/pypsifas/pypsifas/file/7bfaf2ed308c/psifas/datatypes/ordered_set.py

If you have any problem with that, please let me know.

Oren Zomer oren.zomer@gmail.com

Pascal Lamblin 11 years, 3 months ago  # | flag

I experienced memory problems (objects were not garbage-collected) when using this recipe, but the one based on weakref worked better.

The __del__ method is not guaranteed to be called when the set is not needed, it will only be called during the garbage collection phase, if the set is ever garbage-collected. Moreover, reference cycles can actually be made worse by the addition of a __del__ method: http://docs.python.org/2/library/gc.html#gc.garbage

Pascal Lamblin 11 years, 3 months ago  # | flag

For the record, removing the __del__ method actually got rid of the memory leak I was experiencing, at least on Python 2.7.

Grant Jenks 8 years, 6 months ago  # | flag

You may also want to check out the SortedContainers package. It's a pure-Python and fast implementation of SortedList, SortedDict, and SortedSet data types. "Sorted" isn't the same as "ordered" but it's sometimes what you actually want. There's also extensive performance benchmarks with comparisons.