ActiveState Code

Recipe 576914: Automagically dispatch commands using regex token classes


The (?P<...>...) notation in Python's regular expressions can be viewed as a classification of matched tokens. The names of these classes can be used to dispatch tokens to appropriate handlers:

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
import re


scan = re.compile(flags=re.VERBOSE, pattern=r"""
    (?P<number> \d+\.?\d* ) |
    (?P<operator> \+ | - | \* | / ) |
    (?P<eof> $ ) |
    (?P<error> \S )  # catch all (except whitespace)
""").finditer


def parse(source):
    parser = RPNParser()
    for each in scan(source):
        dispatch = getattr(parser, each.lastgroup)
        dispatch(each.group())
    return parser.stack[0]


class RPNParser(object):

    def __init__(self):
        self.stack = []

    def number(self, token):
        self.stack.append(token)

    def operator(self, token):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(Expression(left, token, right))

    def eof(self, token):
        if len(self.stack) != 1:
            raise Exception("source kaputt!")

    def error(self, token):
        raise Exception("unknown token %s!" % token)


class Expression(object):

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        return "(%s %s %s)" % self.args


if __name__ == "__main__":

    print parse("1.5 2 + 3.33 1.11 - *")

Discussion

The example is a simple parser for arithmetic expressions in Reverse Polish Notation, hence the Name RPNParser. First, scan is defined as a tokenizer function and group names (?P<...>...) are defined for tokens matched by the subsequent regex patterns. The RPNParser class has correspondingly named methods which handle the respective tokens. The parse function uses getattr and re.MatchObject's lastgroup attribute to select an appropriate method to dispatch the token to. There an AST is built successively that represents the input string. Then the AST is returned.

More elaborate things are imaginable, e.g. I use the same approach with an ATN-ish state machine to parse source code in my toy Prolog interpreter. There, the handler methods primarily manage state transitions and delegate the actual creation of the Prolog-AST to a Builder object.

Sign in to comment