Welcome, guest | Sign In | My Account | Store | Cart

The (?P<...>...) notation in Python's regular expressions can be viewed as a classification of matched tokens. The names of these classes can be used to dispatch tokens to appropriate handlers:

Python, 52 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
import re


scan = re.compile(flags=re.VERBOSE, pattern=r"""
    (?P<number> \d+\.?\d* ) |
    (?P<operator> \+ | - | \* | / ) |
    (?P<eof> $ ) |
    (?P<error> \S )  # catch all (except whitespace)
""").finditer


def parse(source):
    parser = RPNParser()
    for each in scan(source):
        dispatch = getattr(parser, each.lastgroup)
        dispatch(each.group())
    return parser.stack[0]


class RPNParser(object):

    def __init__(self):
        self.stack = []

    def number(self, token):
        self.stack.append(token)

    def operator(self, token):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(Expression(left, token, right))

    def eof(self, token):
        if len(self.stack) != 1:
            raise Exception("source kaputt!")

    def error(self, token):
        raise Exception("unknown token %s!" % token)


class Expression(object):

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        return "(%s %s %s)" % self.args


if __name__ == "__main__":

    print parse("1.5 2 + 3.33 1.11 - *")

The example is a simple parser for arithmetic expressions in Reverse Polish Notation, hence the Name RPNParser. First, scan is defined as a tokenizer function and group names (?P<...>...) are defined for tokens matched by the subsequent regex patterns. The RPNParser class has correspondingly named methods which handle the respective tokens. The parse function uses getattr and re.MatchObject's lastgroup attribute to select an appropriate method to dispatch the token to. There an AST is built successively that represents the input string. Then the AST is returned.

More elaborate things are imaginable, e.g. I use the same approach with an ATN-ish state machine to parse source code in my toy Prolog interpreter. There, the handler methods primarily manage state transitions and delegate the actual creation of the Prolog-AST to a Builder object.

Created by Mick Krippendorf on Mon, 28 Sep 2009 (MIT)
Python recipes (4591)
Mick Krippendorf's recipes (2)

Required Modules

Other Information and Tasks