
A function that parses a file-like object containing data in the record-jar format, as described by Eric S. Raymond (ESR) in "The Art of Unix Programming" (see http://www.faqs.org/docs/artu/ch05s02.html#id2906931).

Python, 60 lines
#!/usr/bin/env python

# recordjar.py - Parse a Record-Jar into a list of dictionaries.
# Copyright 2005 Lutz Horn <lutz.horn@gmx.de>
# Licensed under the same terms as Python.

def parse_jar(flo):
    """Parse a Record-Jar from a file-like object into a list of dictionaries.

    This function parses a file-like object as described in "The Art of Unix
    Programming" <http://www.faqs.org/docs/artu/ch05s02.html#id2906931>.

    The records are divided by lines containing '%%'. Each record consists of
    one or more lines, each containing a key, a colon, and a value. Whitespace
    around both key and value is ignored.

    >>> import StringIO
    >>> flo = StringIO.StringIO("a:b\\nc:d\\n%%\\nx:y\\n")
    >>> out = parse_jar(flo)
    >>> print out
    [{'a': 'b', 'c': 'd'}, {'x': 'y'}]

    If a record contains a key more than once, the value for this key is a list
    containing the values in their order of occurrence.

    >>> flo = StringIO.StringIO("a:b\\nc:d\\n%%\\nx:y\\nx:z\\n")
    >>> out = parse_jar(flo)
    >>> print out
    [{'a': 'b', 'c': 'd'}, {'x': ['y', 'z']}]

    Leading or trailing separator lines ('%%') and lines containing only
    whitespace are ignored.

    >>> flo = StringIO.StringIO("%%\\na:b\\nc:d\\n%%\\n\\nx:y\\nx:z\\n")
    >>> out = parse_jar(flo)
    >>> print out
    [{'a': 'b', 'c': 'd'}, {'x': ['y', 'z']}]
    """
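    records = []
    # Read everything at once; split() then finds the record boundaries.
    for record in flo.read().split("%%"):
        dict = {}
        # Skip empty lines and lines containing only whitespace.
        for line in [line for line in record.split("\n") if line.strip() != ""]:
            # Split at the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            key, value = key.strip(), value.strip()
            try:
                # The key already holds a list: add another value.
                dict[key].append(value)
            except AttributeError:
                # The key holds a single string: turn it into a list.
                dict[key] = [dict[key], value]
            except KeyError:
                # First occurrence of the key.
                dict[key] = value
        # Leading or trailing separators produce empty records; drop them.
        if len(dict) > 0:
            records.append(dict)
    return records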

def _test():
    import doctest, recordjar
    return doctest.testmod(recordjar)

if __name__ == "__main__":
    _test()

The record jar format is a very useful format for data with only one level of depth. To quote Eric S. Raymond:

"If you need a textual format that will support multiple records with a variable repertoire of explicit fieldnames, [it is] one of the least surprising and human-friendliest ways to do it". (http://www.faqs.org/docs/artu/ch05s02.html#id2906931)

If a file containing records is not too large and can be read into memory all at once, the approach using split() on strings makes it unnecessary to keep track of state while reading the file. It's all strings and lists.
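
For files that are too large to read at once, the state-tracking alternative might look like the following sketch. It assumes Python 3, and it treats only a line consisting of '%%' as a separator, which is slightly stricter than splitting the whole text on "%%". The name iter_jar is made up for illustration and is not part of the recipe.

```python
import io


def iter_jar(flo):
    """Yield one dictionary per record, reading the file line by line.

    Hypothetical streaming variant: the only state kept between lines is
    the record currently being built.
    """
    record = {}
    for line in flo:
        stripped = line.strip()
        if stripped == "%%":
            # Separator line: emit the finished record, if there is one.
            if record:
                yield record
            record = {}
        elif stripped:
            key, value = stripped.split(":", 1)
            key, value = key.strip(), value.strip()
            if key not in record:
                record[key] = value
            elif isinstance(record[key], list):
                record[key].append(value)
            else:
                record[key] = [record[key], value]
    # The last record is usually not followed by a separator line.
    if record:
        yield record


records = list(iter_jar(io.StringIO("%%\na:b\nc:d\n%%\n\nx:y\nx:z\n")))
print(records)
```

Because iter_jar is a generator, a caller can process one record at a time without holding the whole file in memory.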

A record that contains a field for a given key more than once can be handled in two different ways. First, any occurrence of the key after the first one can override the previous one. In this case, dict[key] would always be set to value. Second, all occurrences can be collected in a list if there is more than one. This is done by first trying to append value to an already present list in dict[key] and catching the two possible exceptions: AttributeError, if dict[key] is not a list, and KeyError, if dict[key] is not set.
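
The two strategies can be compared side by side in a small sketch. It assumes Python 3; the helper names add_override and add_collect are hypothetical and do not appear in the recipe, which inlines the second strategy.

```python
def add_override(record, key, value):
    """First strategy: a later occurrence replaces the earlier one."""
    record[key] = value


def add_collect(record, key, value):
    """Second strategy: repeated keys accumulate into a list."""
    try:
        # Works only if the key already holds a list.
        record[key].append(value)
    except AttributeError:
        # The key holds a single string, which has no append().
        record[key] = [record[key], value]
    except KeyError:
        # The key has not been seen yet.
        record[key] = value


fields = [("x", "y"), ("x", "z"), ("a", "b")]

overridden, collected = {}, {}
for key, value in fields:
    add_override(overridden, key, value)
    add_collect(collected, key, value)

print(overridden)  # {'x': 'z', 'a': 'b'}
print(collected)   # {'x': ['y', 'z'], 'a': 'b'}
```

With the second strategy a value is either a string or a list of strings, so code that consumes the records has to check which one it got.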

1 comment

Ian Bicking 18 years, 9 months ago

RFC822. The rfc822 module (http://python.org/doc/current/lib/module-rfc822.html) provides something like this. You might use that module like this:

>>> from cStringIO import StringIO
>>> import rfc822
>>> f = StringIO("a:1\nb:2\n\na:10\nb:20")
>>> records = []
>>> while 1:
...     msg = rfc822.Message(f)
...     if not msg: #EOF reached
...         break
...     records.append(dict(msg))
>>> records
[{'a': '1', 'b': '2'}, {'a': '10', 'b': '20'}]

Blank lines separate records. Records can be extended to multiple lines by using leading whitespace on secondary lines.
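
The rfc822 module was removed in Python 3. A rough Python 3 counterpart of the session above is sketched below. It assumes that the standard email package is an acceptable substitute and that the text is split on blank lines by hand before each chunk is parsed as a block of headers. The helper name parse_blank_separated is hypothetical.

```python
import email
import re


def parse_blank_separated(text):
    """Parse RFC 822 style records separated by blank lines."""
    records = []
    # Split on empty or whitespace-only lines, then parse each chunk
    # as a block of message headers.
    for chunk in re.split(r"\n[ \t]*\n", text.strip()):
        msg = email.message_from_string(chunk)
        if msg.keys():
            records.append(dict(msg.items()))
    return records


records = parse_blank_separated("a:1\nb:2\n\na:10\nb:20")
print(records)
```

Note that dict(msg.items()) keeps only one value per key; msg.get_all(key) returns every value for a repeated key.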