Welcome, guest | Sign In | My Account | Store | Cart

dbdict: a dbm based on a dict subclass.

On open, loads full file into memory. On close, writes full dict to disk (atomically). Supported output file formats: csv, json, and pickle. Input file format automatically discovered.

Usable by the shelve module for fast access.

Python, 89 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
import pickle, json, csv, os, shutil

class PersistentDict(dict):
    ''' Persistent dictionary with an API compatible with shelve and anydbm.

    The dict is kept in memory, so the dictionary operations run as fast as
    a regular dictionary.

    Write to disk is delayed until close or sync (similar to gdbm's fast mode).

    Input file format is automatically discovered.
    Output file format is selectable between pickle, json, and csv.
    All three serialization formats are backed by fast C implementations.

    '''

    def __init__(self, filename, flag='c', mode=None, format='pickle', *args, **kwds):
        self.flag = flag                    # r=readonly, c=create, or n=new
        self.mode = mode                    # None or an octal triple like 0644
        self.format = format                # 'csv', 'json', or 'pickle'
        self.filename = filename
        if flag != 'n' and os.access(filename, os.R_OK):
            fileobj = open(filename, 'rb' if format=='pickle' else 'r')
            with fileobj:
                self.load(fileobj)
        dict.__init__(self, *args, **kwds)

    def sync(self):
        'Write dict to disk'
        if self.flag == 'r':
            return
        filename = self.filename
        tempname = filename + '.tmp'
        fileobj = open(tempname, 'wb' if self.format=='pickle' else 'w')
        try:
            self.dump(fileobj)
        except Exception:
            os.remove(tempname)
            raise
        finally:
            fileobj.close()
        shutil.move(tempname, self.filename)    # atomic commit
        if self.mode is not None:
            os.chmod(self.filename, self.mode)

    def close(self):
        self.sync()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def dump(self, fileobj):
        if self.format == 'csv':
            csv.writer(fileobj).writerows(self.items())
        elif self.format == 'json':
            json.dump(self, fileobj, separators=(',', ':'))
        elif self.format == 'pickle':
            pickle.dump(dict(self), fileobj, 2)
        else:
            raise NotImplementedError('Unknown format: ' + repr(self.format))

    def load(self, fileobj):
        # try formats from most restrictive to least restrictive
        for loader in (pickle.load, json.load, csv.reader):
            fileobj.seek(0)
            try:
                return self.update(loader(fileobj))
            except Exception:
                pass
        raise ValueError('File not in a supported format')



if __name__ == '__main__':
    import random

    # Make and use a persistent dictionary
    with PersistentDict('/tmp/demo.json', 'c', format='json') as d:
        print(d, 'start')
        d['abc'] = '123'
        d['rand'] = random.randrange(10000)
        print(d, 'updated')

    # Show what the file looks like on disk
    with open('/tmp/demo.json', 'rb') as f:
        print(f.read())

Provides persistent dictionary support. Loads the full file into memory, leaves it there for full speed dict access, and then writes the full dict back on close (with an atomic commit).

Useful when lookup and mutation speed are more important than the time spent on the initial load and the final write-back.

Similar to the "F" mode in the gdbm module: "The F flag opens the database in fast mode. Writes to the database will not be synchronized".

9 comments

James Mills 15 years, 1 month ago  # | flag

Nice clean implementation! Very nice :)

Vye 11 years, 3 months ago  # | flag

Since Windows locks open files, if line 37 catches an exception on Windows it will not be able to delete the temporary file because it is still open.

I fixed this by removing the finally clause and closing the file in the try clause:

    try:
        with open(tempname, 'wb' if self.format=='pickle' else 'w') as fileobj:
            self.dump(fileobj)
    except Exception:
        os.remove(tempname)
        raise
Mavin Kai 11 years ago  # | flag

In addition to the wonderful modification by Vye I use a couple of more modifications to make it more useful. There is no need to open a file in text mode i.e. just 'r' any more at least for csv. I'm not sure about json. This helps solving lots of newline problems on windows at least.

So line 23 which is

fileobj = open(filename, 'rb' if format=='pickle' else 'r')

should ideally be

fileobj = open(filename, 'rb')

And similarly line 34 which is

fileobj = open(tempname, 'wb' if self.format=='pickle' else 'w')

should become

fileobj = open(tempname, 'wb')

This one change saved me a big problem of a blank line between every two data lines.

And if one wants to make this an OrderedDict then apart from the inheritance one needs to call the __init__ before self.load is called. So the first few lines of the code become.

import pickle, json, csv, os, shutil
from collections import OrderedDict
class PersistentDict(OrderedDict):

def __init__(self, filename, flag='c', mode=None, format='pickle', *args, **kwds):
    self.flag = flag                    # r=readonly, c=create, or n=new
    self.mode = mode                    # None or an octal triple like 0644
    self.format = format                # 'csv', 'json', or 'pickle'
    self.filename = filename
    super(PersistentDict, self).__init__()
    if flag != 'n' and os.access(filename, os.R_OK):
        fileobj = open(filename, 'rb')
        with fileobj:
            self.load(fileobj)
    #OrderedDict.__init__(self, *args, **kwds)

Note the use of super to make portable code. Also the commented line at the end #OrderedDict was what I first tried and it worked when I moved it above the "if flag != 'n'..." line, but using super I felt was cooler!

Lastly I also added a deepcopy function. Because PersistentDict overrides the dict init, copy.deepcopy() doesn't work any more.

So here is my deepcopy function.

def __deepcopy__(self, requesteddeepcopy):
    return copy.deepcopy(OrderedDict(self))

Hope this helps someone! And wonderful recipe by AS. Although I still havent figured out how for loader in ... works

alex_ 10 years, 8 months ago  # | flag

Excellent! A note: On load():

try:
    return self.update(loader(fileobj))
except Exception:
    pass
raise ValueError('File not in a supported format')

Will return 'File not in a supported format' even if the file is perfectly fine but an unrelated error occurred (such as an ImportError due to circular imports on unpickling). That and the fact that pickle protocol 2 is being used instead of pickle protocol 0 makes debugging really hard on more complex code using pickle. So for debugging purposes I'd suggest removing the except Exception and, if using pickle, change the protocol to 0.

activestate 10 years, 6 months ago  # | flag

You can't remove Exception as Alex_ suggests and still have the multiple format support work.

The loop in on load() is designed so that if one format fails, it moves on and tries the next.

If you remove the multiple format support, and just go with pickle, then Alex_s suggestion is a good one.

Martin Miller 9 years, 9 months ago  # | flag

Anyone know what "Usable by the shelve module for fast access" in the description means?

Thx.

Raymond Hettinger (author) 9 years, 9 months ago  # | flag

"Useable by shelve" means that the PersistentDict can be passed in the the Shelf constructor as the underlying dictionary:

import shelve

s = shelve.Shelf(PersistentDict('/tmp/demo.json', 'c', format='json'))
s['abc'] = 123
s['def'] = 456
print s.keys()
s.close()
Matthew Bull 9 years, 7 months ago  # | flag

Quick correction, the *args in __init__ is redundant (with keyword args before it, its always empty)

the way to achieve what I think you wanted eg, where....

p_dict=PersistentDict(somefile,{'foo':'bar'},flag='n')

creates a new dictionary with the second arg as the initial data, can be done with...

def __init__(self, filename, data={}, flag='c', mode=None, format='pickle', **kwds):
    self.flag = flag                    # r=readonly, c=create, or n=new
    self.mode = mode                    # None or an octal triple like 0644
    self.format = format                # 'csv', 'json', or 'pickle'
    self.filename = filename
    dict.__init__(self, data, **kwds)
    if flag != 'n' and os.access(filename, os.R_OK):
        fileobj = open(filename, 'rb')
        with fileobj:
            self.load(fileobj)

with the call to dict.__init__ being moved so that the content of any existing file doesn't get overwriiten by <data>.

Raymond Hettinger (author) 9 years, 7 months ago  # | flag

The signature allows a call like this:

 PersistentDict('/tmp/demo.json', 'c', None, 'json',
                        zip(somekeys, somevalues), x=1, y=2, z=3)

This is parallel to what how regular dicts wor:

dict(zip(somekeys, somevalues), x=1, y=2, z=3)

While it make be inconvenient if want to use data, flag, mode, or format with a keyword argument, it is parallel to the signature of other Python containers such as:

     array.array('c', somedata)
     collections.defaultdict(int, list_of_items, x=1, y=2, z=3)