An implementation of dictionaries preserving key insertion order. Despite using a doubly-linked list to keep track of the appropriate order, this list is actually embedded in the dictionary. As a consequence, there is little space penalty, and also every operation exhibits an efficient implementation (i.e., no need to perform lookups or deletions multiple times, as it happens with other versions of this data structure).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 | # ordereddict.py
# A dictionary that remembers insertion order
# Tested under Python 2.7 and 2.6.6 only
#
# Copyright (C) 2011 by Lucio Santi <lukius at gmail dot com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from _abcoll import *
try:
from thread import get_ident as _get_ident
except ImportError:
from dummy_thread import get_ident as _get_ident
from operator import eq as _eq
from itertools import imap as _imap
__author__ = 'Lucio Santi <lukius at gmail dot com>'
__version__ = '1.1'
__all__ = ['OrderedDict']
########################### Constants ###########################
FORWARD = 0
BACKWARDS = 1
KEY = 0
VALUE = 1
NEXT = 3
PREVIOUS = 2
#################################################################
class OrderedDict(dict, MutableMapping):
'A dictionary that remembers insertion order.'
# This implementation uses a doubly-linked list of nodes, each
# node being a 4-tuple <key, value, previous node, next node>.
# Despite this, the interesting thing about it is that the list
# is actually embedded in the dictionary. As a consequence,
# there is little space penalty, and also every operation
# exhibits an efficient implementation (i.e., no need to perform
# lookups or deletions multiple times, as it happens with other
# versions of this data structure.).
#
# It is worth noticing that passing an OrderedDict as an argument
# to the dict constructor won't behave as expected. This is due
# to the fact that the internal dictionary keeps additional information
# apart from a key's value. If needed, the instance method dict()
# provides a dict copy of an OrderedDict.
update = MutableMapping.update
setdefault = MutableMapping.setdefault
__ne__ = MutableMapping.__ne__
######################## Class methods #########################
@classmethod
def fromkeys(cls, iterable, value = None):
'''od.fromkeys(S[, v]) -> New ordered dictionary with keys from S
and values equal to v (which defaults to None).
'''
d = cls()
for key in iterable:
d[key] = value
return d
################################################################
######################## Initialization ########################
def __init__(self, *args, **kwds):
"""Initialize an ordered dictionary. Signature is the same as for
regular dictionaries, but keyword arguments are not recommended
because their insertion order is arbitrary.
"""
if len(args) > 1:
raise TypeError('expected at most 1 arguments, got %d' % len(args))
try:
self.first_node
except AttributeError:
self.first_node = None
self.last_node = None
self.update(*args, **kwds)
################################################################
################## Data access & manipulation ##################
__marker = object()
def __getitem__(self, key):
'od.__getitem__(y) <==> od[y]'
node = dict.__getitem__(self, key)
return node[VALUE]
def get(self, key, default = None):
'od.get(k[,d]) -> od[k] if k in od, else d. d defaults to None.'
try:
value = self.__getitem__(key)
except KeyError:
value = default
return value
def __setitem__(self, key, value):
'od.__setitem__(i, y) <==> od[i]=y'
try:
node = dict.__getitem__(self, key)
node[VALUE] = value
except KeyError:
new_node = [key, value, self.last_node, None]
if( self.first_node is None ):
self.first_node = new_node
if( self.last_node is not None ):
self.last_node[NEXT] = new_node
self.last_node = new_node
dict.__setitem__(self, key, new_node)
def __delitem__(self, key):
'od.__delitem__(y) <==> del od[y]'
removed_node = dict.pop(self,key)
self.__adjust_after_removing(removed_node)
def pop(self, key, default = __marker):
'''od.pop(k[,d]) -> v, remove specified key and return the corresponding
value. If key is not found, d is returned if given, otherwise KeyError
is raised.'''
removed_node = dict.pop(self, key, default)
if( removed_node is self.__marker ):
raise KeyError, key
if( removed_node is default ):
return default
self.__adjust_after_removing(removed_node)
return removed_node[VALUE]
def popitem(self, last = True):
'''od.popitem() -> (k, v), remove and return some (key, value) pair as a
2-tuple; but raise KeyError if od is empty.'''
if not self:
raise KeyError('dictionary is empty')
key = next(reversed(self) if last else iter(self))
value = self.pop(key)
return key, value
def clear(self):
'od.clear() -> None. Remove all items from od.'
dict.clear(self)
self.first_node = None
self.last_node = None
def __adjust_after_removing(self, a_node):
'Adjust a_node previous and next pointers after its removal.'
previous = a_node[PREVIOUS]
next = a_node[NEXT]
if( next ):
next[PREVIOUS] = previous
else:
self.last_node = previous
if( previous ):
previous[NEXT] = next
else:
self.first_node = next
################################################################
#################### Iteration & keys/values ###################
def __walk(self, direction = FORWARD, action = lambda x: x, *arguments):
'Iterate over action applied to each node, in the appropriate order.'
if( direction == FORWARD ):
next = NEXT
first = self.first_node
elif( direction == BACKWARDS ):
next = PREVIOUS
first = self.last_node
current_node = first
while( current_node ):
yield action(current_node, *arguments)
current_node = current_node[next]
def __walk_to_list(self, direction = FORWARD, action = lambda x: x, *arguments):
'''Obtain a list of objects resulting from applying action to
each node, in the appropriate order.'''
return_list = list()
item_generator = self.__walk(direction = direction, action = action, *arguments)
for item in item_generator: return_list.append(item)
return return_list
def __iter__(self):
'od.__iter__() <==> iter(od)'
return self.__walk( action = lambda node: node[KEY] )
def __reversed__(self):
'od.__reversed__() <==> reversed(od)'
return self.__walk( direction = BACKWARDS, action = lambda node: node[KEY] )
def keys(self):
"od.keys() -> list of od's keys"
return self.__walk_to_list( action = lambda node: node[KEY] )
def values(self):
"od.values() -> list of od's values"
return self.__walk_to_list( action = lambda node: node[VALUE] )
def items(self):
"od.items() -> list of od's (key, value) pairs, as 2-tuples"
return self.__walk_to_list( action = lambda node: (node[KEY], node[VALUE]) )
def iterkeys(self):
'od.iterkeys() -> an iterator over the keys of od'
return iter(self)
def itervalues(self):
'od.itervalues() -> an iterator over the values of od'
return self.__walk( action = lambda node: node[VALUE] )
def iteritems(self):
'od.iteritems() -> an iterator over the (key, value) items of od'
return self.__walk( action = lambda node: (node[KEY], node[VALUE]) )
################################################################
############################# Copies ###########################
def copy(self):
'od.copy() -> a shallow copy of od'
return self.__class__(self)
def dict(self):
'od.dict() -> a dict copy of od'
d = {}
for item in self.iteritems(): d[item[KEY]] = item[VALUE]
return d
################################################################
########################## Miscellaneous #######################
def __repr__(self, _repr_running = {}):
'od.__repr__() <==> repr(od)'
call_key = id(self), _get_ident()
if call_key in _repr_running:
return '...'
_repr_running[call_key] = 1
try:
if not self:
return '%s()' % (self.__class__.__name__,)
return '%s(%r)' % (self.__class__.__name__, self.items())
finally:
del _repr_running[call_key]
def __reduce__(self):
'Return state information for pickling'
items = self.items()
tmp = self.first_node, self.last_node
del self.first_node, self.last_node
inst_dict = vars(self).copy()
self.first_node, self.last_node = tmp
if inst_dict:
return (self.__class__, (items,), inst_dict)
return self.__class__, (items,)
def __eq__(self, other):
'''od.__eq__(y) <==> od==y. Comparison to another OD is order-sensitive
while comparison to a regular mapping is order-insensitive.
'''
if isinstance(other, OrderedDict):
return len(self) == len(other) and \
all(_imap(_eq, self.iteritems(), other.iteritems()))
return dict.__eq__(self.dict(), other)
def viewkeys(self):
"od.viewkeys() -> a set-like object providing a view on od's keys"
return KeysView(self)
def viewvalues(self):
"od.viewvalues() -> an object providing a view on od's values"
return ValuesView(self)
def viewitems(self):
"od.viewitems() -> a set-like object providing a view on od's items"
return ItemsView(self)
################################################################
|
Time complexity for each method equals the corresponding for its dict counterpart. Besides, little overhead is introduced since the extra operations performed are essentially reference manipulations (in order to restore the structure invariants) or field projections (for dictionary nodes).
After some experimental analysis, it was observed that this implementation indeed outperforms the official (Python 2.7) one. For values of n ranging between 250 and 21000, this function was evaluated a hundred times for a collections.OrderedDict and for a ordereddict.OrderedDict:
def measure(n, d):
for i in xrange(n): d['a'*i] = 'a'*i
for k in d: d[k] = 0
it = d.items()
for i in xrange(n): del d['a'*i]
As n gets larger, the contrast becomes much more noticeable. In fact, for n = 21k, there was a 15 second difference.
Your implementations of .dict() and .__setitem__() seem broken. For the first, I think you should be using .itervalues() instead of .iteritems(). For the latter, I would expect that the insertion order gets updated even in the case that the key is already in the dict.
Also, it would be nice if you could provide more detailed timings of the different operations compared to the existing OrderedDict. Use the timeit module for that.
Stephan,
I think both methods are working OK. The iteritems() methods walks through each (key, value) pair, and evaluating item[KEY] and item[VALUE] for each of the items involved projects effectively its key and its value (those constants are defined at the top, being KEY = 0 and VALUE = 1). As an example,
On the other hand, the __setitem__() method is consistent with the behavior described here (http://www.python.org/dev/peps/pep-0372/, section Q&A, first question) which is, moreover, used by the other implementations available out there.
Regarding time measuring, I indeed used the timeit module. Here you can see some plots I made comparing the Python 2.7 implementation and mine: http://img268.imageshack.us/img268/8405/ordereddict1.png http://img560.imageshack.us/img560/7997/ordereddict2.png
For n between 10000 and 21000, the corresponding values obtained (in seconds) were: n Python2.7 This impl. 10000 37.3870368004 34.6250030994 11000 45.3485190868 41.0726940632 12000 53.2696518898 47.8958530426 13000 64.2209510803 56.2664649487 14000 70.5243411064 63.8596489429 15000 79.9213080406 72.1154520512 16000 89.9099729061 81.4596869946 17000 100.817600012 92.0595479012 18000 112.128964901 103.336715937 19000 124.144135952 111.130847216 20000 137.069084883 123.627164125 21000 150.834888935 135.424236059
If you need more details, feel free to ask.
Thanks for your feedback!
Ouch... the tabs are broken. Sorry. If you want, I can email those values.
The reason the version shipped in 2.7 uses two dictionaries (one for key/value pairs and the other for key/node pairs) is that many tools in Python will access the inherited dict directly and bypass the method overrides.
The following code behaves badly for the above implementation but works fine with the version shipping with 2.7:
In addition, I question the comparative timings on key operations. The version shipping with 2.7 inherits __getitem__() and get() from dict itself. Accordingly, those two critical operations are a fast as regular dictionaries. No pure python versions of __getitem__() and get() can compete with that.
True. In fact, I pointed this out on the comments at the top of the code. As a side note, I don't seem to fully understand why would it be strictly necessary to convert such a dictionary into a plain dict, being the former a dictionary too (i.e., same protocol applies for both).
True again, of course. However, time penalty is actually low for those opearations (a __getitem__ is essentially a dict.__getitem__ followed by a list indexing, which is fast). The true benefit comes considering the rest: insertion, deletion and iteration. The plots above were constructed after measuring a mix of such operations, trying to mimic a standard use of the data structure.
Anyhow, beyond Python, I consider this to be a pretty efficient implementation. A C version could exploit it further, not only improving time usage but space consumption too. For example, it would be pointless to store the key inside the node, considering that each key must be stored already in its corresponding hash table entry. Additionally, it would be sloppy to include two pointers inside each node too: one is enough if an XOR list is implemented.
Thanks for your comments!