RFC 822-style parser

In The Art of Unix Programming, Eric S. Raymond describes a data file metaformat based on RFC 822. [http://www.faqs.org/docs/artu/ch05s02.html#id2902039] This is a simple parser for that format.

"""Parse files stored in the RFC 822 metaformat."""

from re import compile as Regex

def lines(string):
    """Get the logical lines of the string."""
    return merge_lines(string.splitlines())

def load(string):
    """Parse the given string."""
    return pairs(remove_comments(lines(string)))

def merge_lines(lines):
    """Merge every line that begins with whitespace with its predecessor.  May
    raise a ParseError."""
    new_lines = []
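    offset = 0
    # Skip leading blank lines.
    for offset, line in enumerate(lines):
        if len(line) > 0 and not line.isspace():
            break
    else:
        # Empty input, or nothing but blank lines: there is nothing to merge.
        return []
    lines = lines[offset:]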
    if starts_with_whitespace(lines[0]):
        raise ParseError("%d: '%s': Keys cannot be indented." %
                         (offset + 1, lines[0]))
    for line in lines:
        if starts_with_whitespace(line):
            new_lines[-1] += line
        else:
            new_lines.append(line)
    return [line.strip() for line in new_lines]

def pairify(string):
    """Convert a string of the form "key: value" to a tuple ("key",
    "value").  May raise a ParseError."""
    items = string.split(":", 1)
    try:
        return items[0].strip(), items[1].strip()
    except IndexError:
        raise ParseError("'%s': Keys must be terminated with a colon." %
                         string)

def pairs(lines):
    """Convert a list of lines into a dictionary.  May raise a ParseError."""
    return dict(pairify(line) for line in lines)

def remove_comments(lines):
    """Remove comment-only lines and strip unescaped end-of-line comments."""
    comment_line = Regex(r"^\s*#.*$")
    eol_comment = Regex(r"(?<!\\)#.*$")
    return [eol_comment.sub("", line) for line in lines
            if not comment_line.match(line)]

def starts_with_whitespace(line):
    return len(line) == 0 or line[0].isspace()

class ParseError(Exception): pass
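
The two patterns in remove_comments() are easier to follow on concrete input. The following is a minimal sketch, not part of the recipe: strip_comment() is a hypothetical helper introduced only for illustration, and the sample lines are made up. It shows that a comment-only line is dropped, an unescaped "#" starts an end-of-line comment, and a "#" preceded by a backslash is left alone.

```python
import re

# Same patterns as in the recipe's remove_comments().
comment_line = re.compile(r"^\s*#.*$")
eol_comment = re.compile(r"(?<!\\)#.*$")

def strip_comment(line):
    """Hypothetical helper: return None for a comment-only line,
    otherwise the line with any unescaped trailing comment removed."""
    if comment_line.match(line):
        return None
    return eol_comment.sub("", line)

whole = strip_comment("   # nothing but a comment")    # dropped entirely
trailing = strip_comment("Name: value # trailing note")  # comment cut off
escaped = strip_comment(r"Color: \#ff0000")             # escaped '#' survives
mixed = strip_comment(r"Tag: \#a # note")               # only the second '#' counts
```

Note that the backslash itself stays in the value: the patterns only decide where a comment starts and do not unescape "\#".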

      

Tags: data, formats

2 comments

Gabriel Genellina 14 years, 3 months ago

I don't have the book, but it seems more logical (to me at least) to remove comments before joining continuation lines. That is, load() should return:

pairs(merge_lines(remove_comments(string.splitlines())))

so this text:

text = """
Key1: some long value
  that wraps over # comment
  # this comment should be ignored...
  and finishes here
Key2: this entry contains # a comment
  and finishes here
  # but this comment is ignored
"""

is parsed as:

{'Key2': 'this entry contains   and finishes here', 
 'Key1': 'some long value  that wraps over   and finishes here'}

Karl Dickman (author) 14 years, 3 months ago

True. When I first wrote load(), I hadn't planned on allowing end-of-line comments, so remove_comments() was only going to remove lines matching /^\s*#.*$/. When I included end-of-line comments I did not make the necessary modifications.
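
For reference, the reordering discussed in this thread can be sketched as a self-contained module. This is one possible arrangement under stated assumptions, not the author's revised recipe: the constant names COMMENT_LINE and EOL_COMMENT and the rewritten merge_lines() and pairify() bodies are illustrative choices. The only behavioral change intended is that comments are removed before continuation lines are merged.

```python
import re

class ParseError(Exception):
    pass

# Illustrative names; the patterns are the ones used in the recipe.
COMMENT_LINE = re.compile(r"^\s*#.*$")
EOL_COMMENT = re.compile(r"(?<!\\)#.*$")

def remove_comments(lines):
    """Drop comment-only lines and strip unescaped end-of-line comments."""
    return [EOL_COMMENT.sub("", line) for line in lines
            if not COMMENT_LINE.match(line)]

def merge_lines(lines):
    """Join each indented (or empty) line onto the preceding line."""
    start = 0
    # Skip leading blank lines.
    while start < len(lines) and not lines[start].strip():
        start += 1
    merged = []
    # Line numbers refer to the comment-stripped list, not the original file.
    for number, line in enumerate(lines[start:], start + 1):
        if not line or line[0].isspace():
            if not merged:
                raise ParseError("%d: %r: Keys cannot be indented."
                                 % (number, line))
            merged[-1] += line
        else:
            merged.append(line)
    return [line.strip() for line in merged]

def pairify(line):
    """Split "key: value" at the first colon."""
    key, sep, value = line.partition(":")
    if not sep:
        raise ParseError("%r: Keys must be terminated with a colon." % line)
    return key.strip(), value.strip()

def load(string):
    """Parse the string, removing comments before merging lines."""
    return dict(pairify(line)
                for line in merge_lines(remove_comments(string.splitlines())))

text = """
Key1: some long value
  that wraps over # comment
  # this comment should be ignored...
  and finishes here
Key2: this entry contains # a comment
  and finishes here
  # but this comment is ignored
"""

result = load(text)
```

With the original order (merge first, then strip comments), the first "#" in a merged entry would swallow every continuation line after it, so Key1 would lose "and finishes here".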

RFC 822-style parser (Python recipe) by Karl Dickman
ActiveState Code (http://code.activestate.com/recipes/576996/)
