dbdict: a dbm based on a dict subclass.
On open, loads full file into memory. On close, writes full dict to disk (atomically). Supported output file formats: csv, json, and pickle. Input file format automatically discovered.
Usable by the shelve module for fast access.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 | import pickle, json, csv, os, shutil
class PersistentDict(dict):
''' Persistent dictionary with an API compatible with shelve and anydbm.
The dict is kept in memory, so the dictionary operations run as fast as
a regular dictionary.
Write to disk is delayed until close or sync (similar to gdbm's fast mode).
Input file format is automatically discovered.
Output file format is selectable between pickle, json, and csv.
All three serialization formats are backed by fast C implementations.
'''
def __init__(self, filename, flag='c', mode=None, format='pickle', *args, **kwds):
self.flag = flag # r=readonly, c=create, or n=new
self.mode = mode # None or an octal triple like 0644
self.format = format # 'csv', 'json', or 'pickle'
self.filename = filename
if flag != 'n' and os.access(filename, os.R_OK):
fileobj = open(filename, 'rb' if format=='pickle' else 'r')
with fileobj:
self.load(fileobj)
dict.__init__(self, *args, **kwds)
def sync(self):
'Write dict to disk'
if self.flag == 'r':
return
filename = self.filename
tempname = filename + '.tmp'
fileobj = open(tempname, 'wb' if self.format=='pickle' else 'w')
try:
self.dump(fileobj)
except Exception:
os.remove(tempname)
raise
finally:
fileobj.close()
shutil.move(tempname, self.filename) # atomic commit
if self.mode is not None:
os.chmod(self.filename, self.mode)
def close(self):
self.sync()
def __enter__(self):
return self
def __exit__(self, *exc_info):
self.close()
def dump(self, fileobj):
if self.format == 'csv':
csv.writer(fileobj).writerows(self.items())
elif self.format == 'json':
json.dump(self, fileobj, separators=(',', ':'))
elif self.format == 'pickle':
pickle.dump(dict(self), fileobj, 2)
else:
raise NotImplementedError('Unknown format: ' + repr(self.format))
def load(self, fileobj):
# try formats from most restrictive to least restrictive
for loader in (pickle.load, json.load, csv.reader):
fileobj.seek(0)
try:
return self.update(loader(fileobj))
except Exception:
pass
raise ValueError('File not in a supported format')
if __name__ == '__main__':
import random
# Make and use a persistent dictionary
with PersistentDict('/tmp/demo.json', 'c', format='json') as d:
print(d, 'start')
d['abc'] = '123'
d['rand'] = random.randrange(10000)
print(d, 'updated')
# Show what the file looks like on disk
with open('/tmp/demo.json', 'rb') as f:
print(f.read())
|
Provides persistent dictionary support. Loads the full file into memory, leaves it there for full speed dict access, and then writes the full dict back on close (with an atomic commit).
Useful when lookup and mutation speed are more important than the time spent on the initial load and the final write-back.
Similar to the "F" mode in the gdbm module: "The F flag opens the database in fast mode. Writes to the database will not be synchronized".
Nice clean implementation! Very nice :)
Since Windows locks open files, if line 37 catches an exception on Windows it will not be able to delete the temporary file because it is still open.
I fixed this by removing the finally clause and closing the file in the try clause:
In addition to the wonderful modification by Vye I use a couple of more modifications to make it more useful. There is no need to open a file in text mode i.e. just 'r' any more at least for csv. I'm not sure about json. This helps solving lots of newline problems on windows at least.
So line 23 which is
should ideally be
And similarly line 34 which is
should become
This one change saved me a big problem of a blank line between every two data lines.
And if one wants to make this an OrderedDict then apart from the inheritance one needs to call the __init__ before self.load is called. So the first few lines of the code become.
Note the use of super to make portable code. Also the commented line at the end #OrderedDict was what I first tried and it worked when I moved it above the "if flag != 'n'..." line, but using super I felt was cooler!
Lastly I also added a deepcopy function. Because PersistentDict overrides the dict init, copy.deepcopy() doesn't work any more.
So here is my deepcopy function.
Hope this helps someone! And wonderful recipe by AS. Although I still havent figured out how for loader in ... works
Excellent! A note: On load():
Will return 'File not in a supported format' even if the file is perfectly fine but an unrelated error occurred (such as an
ImportError
due to circular imports on unpickling). That and the fact that pickle protocol 2 is being used instead of pickle protocol 0 makes debugging really hard on more complex code using pickle. So for debugging purposes I'd suggest removing theexcept Exception
and, if using pickle, change the protocol to 0.You can't remove Exception as Alex_ suggests and still have the multiple format support work.
The loop in on load() is designed so that if one format fails, it moves on and tries the next.
If you remove the multiple format support, and just go with pickle, then Alex_s suggestion is a good one.
Anyone know what "Usable by the shelve module for fast access" in the description means?
Thx.
"Useable by shelve" means that the PersistentDict can be passed in the the Shelf constructor as the underlying dictionary:
Quick correction, the *args in __init__ is redundant (with keyword args before it, its always empty)
the way to achieve what I think you wanted eg, where....
creates a new dictionary with the second arg as the initial data, can be done with...
with the call to dict.__init__ being moved so that the content of any existing file doesn't get overwriiten by <data>.
The signature allows a call like this:
This is parallel to what how regular dicts wor:
While it make be inconvenient if want to use data, flag, mode, or format with a keyword argument, it is parallel to the signature of other Python containers such as: