Welcome, guest | Sign In | My Account | Store | Cart

An implementation of dictionaries preserving key insertion order. Despite using a doubly-linked list to keep track of the appropriate order, this list is actually embedded in the dictionary. As a consequence, there is little space penalty, and also every operation exhibits an efficient implementation (i.e., no need to perform lookups or deletions multiple times, as it happens with other versions of this data structure).

Python, 298 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
# ordereddict.py
# A dictionary that remembers insertion order
# Tested under Python 2.7 and 2.6.6 only
#
# Copyright (C) 2011 by Lucio Santi <lukius at gmail dot com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.


from _abcoll import *
try:
    from thread import get_ident as _get_ident
except ImportError:
    from dummy_thread import get_ident as _get_ident
    
from operator import eq as _eq
from itertools import imap as _imap

__author__ = 'Lucio Santi <lukius at gmail dot com>'
__version__ = '1.1'
__all__ = ['OrderedDict']



########################### Constants ###########################
FORWARD = 0
BACKWARDS = 1

KEY = 0
VALUE = 1
NEXT = 3
PREVIOUS = 2
#################################################################


class OrderedDict(dict, MutableMapping):
    'A dictionary that remembers insertion order.'
  
    # This implementation uses a doubly-linked list of nodes, each
    # node being a 4-tuple <key, value, previous node, next node>.
    # Despite this, the interesting thing about it is that the list
    # is actually embedded in the dictionary. As a consequence,
    # there is little space penalty, and also every operation
    # exhibits an efficient implementation (i.e., no need to perform 
    # lookups or deletions multiple times, as it happens with other 
    # versions of this data structure.).
    #
    # It is worth noticing that passing an OrderedDict as an argument
    # to the dict constructor won't behave as expected. This is due
    # to the fact that the internal dictionary keeps additional information
    # apart from a key's value. If needed, the instance method dict()
    # provides a dict copy of an OrderedDict.
  
    update = MutableMapping.update
    setdefault = MutableMapping.setdefault
    __ne__ = MutableMapping.__ne__

    ######################## Class methods #########################
    @classmethod
    def fromkeys(cls, iterable, value = None):
        '''od.fromkeys(S[, v]) -> New ordered dictionary with keys from S
        and values equal to v (which defaults to None).

        '''      
        d = cls()
        for key in iterable:
            d[key] = value
        return d
    ################################################################    


    ######################## Initialization ########################
    def __init__(self, *args, **kwds):
        """Initialize an ordered dictionary.  Signature is the same as for
        regular dictionaries, but keyword arguments are not recommended
        because their insertion order is arbitrary.
        """
        if len(args) > 1:
            raise TypeError('expected at most 1 arguments, got %d' % len(args))
        try:
            self.first_node
        except AttributeError:    
            self.first_node = None
            self.last_node = None
	self.update(*args, **kwds)
    ################################################################
     
     
    ################## Data access & manipulation ##################
    __marker = object()
    
    def __getitem__(self, key):
        'od.__getitem__(y) <==> od[y]'
        node = dict.__getitem__(self, key)
        return node[VALUE]
        
    def get(self, key, default = None):
        'od.get(k[,d]) -> od[k] if k in od, else d.  d defaults to None.'
        try:
            value = self.__getitem__(key)
        except KeyError:
            value = default
        return value          
        
    def __setitem__(self, key, value):
        'od.__setitem__(i, y) <==> od[i]=y'
        try:
            node = dict.__getitem__(self, key)
            node[VALUE] = value
        except KeyError:
            new_node =  [key, value, self.last_node, None]
            if( self.first_node is None ):
              self.first_node = new_node
            if( self.last_node is not None ):
              self.last_node[NEXT] = new_node
            self.last_node = new_node
            dict.__setitem__(self, key, new_node)

    def __delitem__(self, key):
        'od.__delitem__(y) <==> del od[y]'
        removed_node = dict.pop(self,key)
        self.__adjust_after_removing(removed_node)

    def pop(self, key, default = __marker):
        '''od.pop(k[,d]) -> v, remove specified key and return the corresponding
        value. If key is not found, d is returned if given, otherwise KeyError
        is raised.'''
        removed_node = dict.pop(self, key, default)
        if( removed_node is self.__marker ):
            raise KeyError, key
        if( removed_node is default ):
            return default
        self.__adjust_after_removing(removed_node)
        return removed_node[VALUE]
        
    def popitem(self, last = True):
        '''od.popitem() -> (k, v), remove and return some (key, value) pair as a
        2-tuple; but raise KeyError if od is empty.'''
        if not self:
            raise KeyError('dictionary is empty')
        key = next(reversed(self) if last else iter(self))
        value = self.pop(key)
        return key, value
        
    def clear(self):
        'od.clear() -> None.  Remove all items from od.'
        dict.clear(self)
        self.first_node = None
        self.last_node = None        
        
    def __adjust_after_removing(self, a_node):
        'Adjust a_node previous and next pointers after its removal.'
        previous = a_node[PREVIOUS]
        next = a_node[NEXT]
        
        if( next ):
            next[PREVIOUS] = previous
        else:
            self.last_node = previous        
        if( previous ):
            previous[NEXT] = next
        else:
            self.first_node = next
    ################################################################        
     
     
    #################### Iteration & keys/values ################### 
    def __walk(self, direction = FORWARD, action = lambda x: x, *arguments):
        'Iterate over action applied to each node, in the appropriate order.'
        if( direction == FORWARD ):
            next = NEXT
            first = self.first_node
        elif( direction == BACKWARDS ):
            next = PREVIOUS
            first = self.last_node
        current_node = first
        while( current_node ):
            yield action(current_node, *arguments)
            current_node = current_node[next]        
            
    def __walk_to_list(self, direction = FORWARD, action = lambda x: x, *arguments):
        '''Obtain a list of objects resulting from applying action to
        each node, in the appropriate order.'''
        return_list = list()
        item_generator = self.__walk(direction = direction, action = action, *arguments)
        for item in item_generator: return_list.append(item)
        return return_list
      
    def __iter__(self):
        'od.__iter__() <==> iter(od)'
        return self.__walk( action = lambda node: node[KEY] )
            
    def __reversed__(self):
        'od.__reversed__() <==> reversed(od)'
        return self.__walk( direction = BACKWARDS, action = lambda node: node[KEY] )   
               
    def keys(self):
        "od.keys() -> list of od's keys"
        return self.__walk_to_list( action = lambda node: node[KEY] )      

    def values(self):
        "od.values() -> list of od's values"
        return self.__walk_to_list( action = lambda node: node[VALUE] )
  
    def items(self):
        "od.items() -> list of od's (key, value) pairs, as 2-tuples"
        return self.__walk_to_list( action = lambda node: (node[KEY], node[VALUE]) )
        
    def iterkeys(self):
        'od.iterkeys() -> an iterator over the keys of od'
        return iter(self)
        
    def itervalues(self):
        'od.itervalues() -> an iterator over the values of od'
        return self.__walk( action = lambda node: node[VALUE] )

    def iteritems(self):
        'od.iteritems() -> an iterator over the (key, value) items of od'
        return self.__walk( action = lambda node: (node[KEY], node[VALUE]) )
    ################################################################    
        
        
    ############################# Copies ###########################
    def copy(self):
        'od.copy() -> a shallow copy of od'
        return self.__class__(self)    

    def dict(self):
        'od.dict() -> a dict copy of od'
        d = {}
        for item in self.iteritems(): d[item[KEY]] = item[VALUE]
        return d
    ################################################################        
    
    
    ########################## Miscellaneous #######################
    def __repr__(self, _repr_running = {}):
        'od.__repr__() <==> repr(od)'
        call_key = id(self), _get_ident()
        if call_key in _repr_running:
            return '...'
        _repr_running[call_key] = 1
        try:
            if not self:
                return '%s()' % (self.__class__.__name__,)
            return '%s(%r)' % (self.__class__.__name__, self.items())
        finally:
            del _repr_running[call_key]
            
    def __reduce__(self):
        'Return state information for pickling'
        items = self.items()
        tmp = self.first_node, self.last_node
        del self.first_node, self.last_node
        inst_dict = vars(self).copy()
        self.first_node, self.last_node = tmp
        if inst_dict:
            return (self.__class__, (items,), inst_dict)
        return self.__class__, (items,)

    def __eq__(self, other):
        '''od.__eq__(y) <==> od==y.  Comparison to another OD is order-sensitive
        while comparison to a regular mapping is order-insensitive.

        '''      
        if isinstance(other, OrderedDict):
            return len(self) == len(other) and \
                   all(_imap(_eq, self.iteritems(), other.iteritems()))
        return dict.__eq__(self.dict(), other)
        
    def viewkeys(self):
        "od.viewkeys() -> a set-like object providing a view on od's keys"
        return KeysView(self)

    def viewvalues(self):
        "od.viewvalues() -> an object providing a view on od's values"
        return ValuesView(self)

    def viewitems(self):
        "od.viewitems() -> a set-like object providing a view on od's items"
        return ItemsView(self)
    ################################################################

Time complexity for each method equals the corresponding for its dict counterpart. Besides, little overhead is introduced since the extra operations performed are essentially reference manipulations (in order to restore the structure invariants) or field projections (for dictionary nodes).

After some experimental analysis, it was observed that this implementation indeed outperforms the official (Python 2.7) one. For values of n ranging between 250 and 21000, this function was evaluated a hundred times for a collections.OrderedDict and for a ordereddict.OrderedDict:

def measure(n, d):
    for i in xrange(n): d['a'*i] = 'a'*i
    for k in d: d[k] = 0
    it = d.items()
    for i in xrange(n): del d['a'*i]

As n gets larger, the contrast becomes much more noticeable. In fact, for n = 21k, there was a 15 second difference.

5 comments

Stefan Behnel 12 years, 9 months ago  # | flag

Your implementations of .dict() and .__setitem__() seem broken. For the first, I think you should be using .itervalues() instead of .iteritems(). For the latter, I would expect that the insertion order gets updated even in the case that the key is already in the dict.

Also, it would be nice if you could provide more detailed timings of the different operations compared to the existing OrderedDict. Use the timeit module for that.

Lucio Santi (author) 12 years, 9 months ago  # | flag

Stephan,

I think both methods are working OK. The iteritems() methods walks through each (key, value) pair, and evaluating item[KEY] and item[VALUE] for each of the items involved projects effectively its key and its value (those constants are defined at the top, being KEY = 0 and VALUE = 1). As an example,

>>> d = ordereddict.OrderedDict()
>>> d['hello'] = 'world'
>>> d['helo'] = 'ehlo'
>>> d
OrderedDict([('hello', 'world'), ('helo', 'ehlo')])
>>> d.dict()
{'helo': 'ehlo', 'hello': 'world'}

On the other hand, the __setitem__() method is consistent with the behavior described here (http://www.python.org/dev/peps/pep-0372/, section Q&A, first question) which is, moreover, used by the other implementations available out there.

Regarding time measuring, I indeed used the timeit module. Here you can see some plots I made comparing the Python 2.7 implementation and mine: http://img268.imageshack.us/img268/8405/ordereddict1.png http://img560.imageshack.us/img560/7997/ordereddict2.png

For n between 10000 and 21000, the corresponding values obtained (in seconds) were: n Python2.7 This impl. 10000 37.3870368004 34.6250030994 11000 45.3485190868 41.0726940632 12000 53.2696518898 47.8958530426 13000 64.2209510803 56.2664649487 14000 70.5243411064 63.8596489429 15000 79.9213080406 72.1154520512 16000 89.9099729061 81.4596869946 17000 100.817600012 92.0595479012 18000 112.128964901 103.336715937 19000 124.144135952 111.130847216 20000 137.069084883 123.627164125 21000 150.834888935 135.424236059

If you need more details, feel free to ask.

Thanks for your feedback!

Lucio Santi (author) 12 years, 9 months ago  # | flag

Ouch... the tabs are broken. Sorry. If you want, I can email those values.

Raymond Hettinger 12 years, 8 months ago  # | flag

The reason the version shipped in 2.7 uses two dictionaries (one for key/value pairs and the other for key/node pairs) is that many tools in Python will access the inherited dict directly and bypass the method overrides.

The following code behaves badly for the above implementation but works fine with the version shipping with 2.7:

>>> d = santi.OrderedDict(red=1, blue=3)
>>> dict(d)
{'blue': ['blue', 3, None, ['red', 1, [...], None]], 'red': ['red', 1, ['blue', 3, None, [...]], None]}
>>> d = collections.OrderedDict(red=1, blue=3)
>>> dict(d)
{'blue': 3, 'red': 1}

In addition, I question the comparative timings on key operations. The version shipping with 2.7 inherits __getitem__() and get() from dict itself. Accordingly, those two critical operations are a fast as regular dictionaries. No pure python versions of __getitem__() and get() can compete with that.

Lucio Santi (author) 12 years, 8 months ago  # | flag

The following code behaves badly for the above implementation but works fine with the version shipping with 2.7:

True. In fact, I pointed this out on the comments at the top of the code. As a side note, I don't seem to fully understand why would it be strictly necessary to convert such a dictionary into a plain dict, being the former a dictionary too (i.e., same protocol applies for both).

The version shipping with 2.7 inherits __getitem__() and get() from dict itself. Accordingly, those two critical operations are a fast as regular dictionaries. No pure python versions of __getitem__() and get() can compete with that.

True again, of course. However, time penalty is actually low for those opearations (a __getitem__ is essentially a dict.__getitem__ followed by a list indexing, which is fast). The true benefit comes considering the rest: insertion, deletion and iteration. The plots above were constructed after measuring a mix of such operations, trying to mimic a standard use of the data structure.

Anyhow, beyond Python, I consider this to be a pretty efficient implementation. A C version could exploit it further, not only improving time usage but space consumption too. For example, it would be pointless to store the key inside the node, considering that each key must be stored already in its corresponding hash table entry. Additionally, it would be sloppy to include two pointers inside each node too: one is enough if an XOR list is implemented.

Thanks for your comments!