Choice of GPL or Python license
Latest release
version 0.1 on Jan 5th, 2011

Reflex: A lightweight lexical scanner library.

Reflex supports regular expressions, rule actions, multiple scanner states, tracking of line/column numbers, and customizable token classes.

Reflex is not a "scanner generator" in the sense of generating source code. Instead, it builds a scanner object dynamically from the set of input rules specified. The rules themselves are ordinary Python regular expressions, combined with rule actions, which are simply Python functions.

Example use:

    import reflex

    # Create a scanner. The "start" parameter specifies the name of the
    # starting state. Note: The state argument can be any hashable Python
    # type.
    scanner = reflex.scanner( "start" )

    # Add some rules.
    # The whitespace rule has no actions, so whitespace will be skipped.
    scanner.rule( r"\s+" )

    # Rules for identifiers and numbers.
    TOKEN_IDENT = 1
    TOKEN_NUMBER = 2
    scanner.rule( r"[a-zA-Z_][\w_]*", token=TOKEN_IDENT )
    scanner.rule( r"0x[\da-fA-F]+|\d+", token=TOKEN_NUMBER )

    # The "string" rule kicks us into the string state.
    TOKEN_STRING = 3
    scanner.rule( '"', tostate="string" )

    # Define the string state. "string_escape" and "string_text" are
    # action functions which handle the parsed characters and escape
    # sequences and append them to a buffer. Once a quotation mark
    # is encountered, we set the token type to be TOKEN_STRING
    # and return to the start state.
    scanner.state( "string" )
    scanner.rule( '"', tostate="start", token=TOKEN_STRING )
    scanner.rule( r"\\.", string_escape )
    scanner.rule( r'[^"\\]+', string_text )
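The way the "start" and "string" states hand control to each other can be sketched without Reflex at all. The following is only an illustration built on the standard re module: the STATES table, the scan function, and the way the string buffer is handled are hypothetical stand-ins, not Reflex's own implementation.

```python
import re

TOKEN_IDENT, TOKEN_STRING = 1, 3

# Hypothetical state table: each state maps to a list of
# (compiled pattern, token id or None, next state or None) entries.
STATES = {
    "start": [
        (re.compile(r"\s+"), None, None),                     # skip whitespace
        (re.compile(r"[a-zA-Z_][\w_]*"), TOKEN_IDENT, None),
        (re.compile(r'"'), None, "string"),                   # enter string state
    ],
    "string": [
        (re.compile(r'"'), TOKEN_STRING, "start"),            # closing quote
        (re.compile(r"\\."), None, None),                     # escape sequence
        (re.compile(r'[^"\\]+'), None, None),                 # ordinary characters
    ],
}

def scan(text):
    """Yield (token id, value) pairs, switching states as the rules dictate."""
    state, pos, buf = "start", 0, []
    while pos < len(text):
        # Try the rules of the current state in order; first match wins.
        for pattern, token, tostate in STATES[state]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
        pos = m.end()
        if state == "string" and token is None:
            # Inside a string: keep the escaped character or the plain text.
            chunk = m.group()
            buf.append(chunk[1:] if chunk.startswith("\\") else chunk)
        elif token == TOKEN_STRING:
            yield token, "".join(buf)
            buf = []
        elif token is not None:
            yield token, m.group()
        if tostate is not None:
            state = tostate

tokens = list(scan(r'name "a\"b" x'))
# tokens == [(1, 'name'), (3, 'a"b'), (1, 'x')]
```

Keeping a separate rule list per state means the quotation mark can open a string in one state and close it in the other without the two rules conflicting.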

Invoking the scanner: The scanner can be called as a function. It takes a reference to a stream (such as a file object) which iterates over input lines, plus a "context" argument which is reserved for application use. The result is an iterator which produces a series of tokens. The same scanner can be used to parse multiple input files by creating a new stream for each file.

    # Create a token iterator over the input stream.
    token_iter = scanner( istream, context )

Getting the tokens: Here is a simple example of looping through the input tokens. A real-world use would most likely branch on the type of the current token.

    # token.id is the token type (the same as the token= argument in the rule)
    # token.value is the actual characters that make up the token.
    # token.line is the line number on which the token was encountered.
    # token.pos is the column number of the first character of the token.
    for token in token_iter:
        print("%s %s %s %s" % (token.id, token.value, token.line, token.pos))
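To make the four token fields concrete, here is a small stand-alone sketch that produces tokens with the same field names from a stream of lines. It does not use Reflex: the Token type and the tokens function are hypothetical, and the choice of 1-based line numbers and 0-based columns is an assumption rather than something the description above specifies.

```python
import io
import re
from collections import namedtuple

# Hypothetical token type with the four fields described above.
Token = namedtuple("Token", "id value line pos")

WORD = re.compile(r"\S+")

def tokens(istream):
    """Yield one Token per whitespace-separated word in a stream of lines."""
    # Assumption: lines are counted from 1, columns from 0.
    for line_number, line in enumerate(istream, 1):
        for m in WORD.finditer(line):
            yield Token(id=1, value=m.group(), line=line_number, pos=m.start())

# Any object that iterates over lines works as the stream, for example
# an open file or an in-memory io.StringIO.
found = list(tokens(io.StringIO(u"alpha beta\n  gamma\n")))
# found[2] == Token(id=1, value='gamma', line=2, pos=2)
```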

Action functions are Python functions which take a single argument, which is the token stream instance.

    # Action function to handle string text.
    # Appends the value of the current token to the string data.
    string_data = ""

    def string_text( token_stream ):
        global string_data
        string_data += token_stream.token.value

The token_stream object has a number of interesting and usable attributes:

    states:  dictionary of scanner states
    state:   the current state
    stream:  the input line stream
    context: the context pointer that was passed to the scanner
    token:   the current token
    line:    the line number of the current parse position
    pos:     the column number of the current parse position
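Because the context passed to the scanner is available on the token stream, an action function can keep its buffer there instead of in a module-level variable. The sketch below exercises such an action with stand-in objects built from types.SimpleNamespace. The string_data attribute on the context and the fake_stream object are hypothetical, and treating the context as a plain attribute holder is an assumption about how an application might use it.

```python
from types import SimpleNamespace

def string_text(token_stream):
    """Append the current token's characters to a buffer kept on the context."""
    token_stream.context.string_data.append(token_stream.token.value)

# Hypothetical stand-ins, used only to exercise the action without Reflex.
context = SimpleNamespace(string_data=[])
fake_stream = SimpleNamespace(
    context=context,
    token=SimpleNamespace(value="hello"),
    line=1,
    pos=0,
)

string_text(fake_stream)
fake_stream.token = SimpleNamespace(value=" world")
string_text(fake_stream)

result = "".join(context.string_data)
# result == "hello world"
```

Keeping the buffer on the context means two scans running over different files do not share state, provided each is given its own context object.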

Note: Reflex currently has a limit of 99 rules for each state. (That is the maximum number of capturing groups allowed in a Python regular expression.)
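The tie between rules and capturing groups suggests that each rule occupies one group in a single combined pattern. A common way to do this with the re module is sketched below; whether Reflex builds its pattern exactly this way is an assumption, and RULES and which_rule are hypothetical names used only for illustration.

```python
import re

# Hypothetical rule list for one state: (pattern, token id).
RULES = [
    (r"\s+", None),
    (r"[a-zA-Z_][\w_]*", 1),
    (r"0x[\da-fA-F]+|\d+", 2),
]

# Wrap each rule in its own capturing group and join them with "|".
# The rule patterns here contain no capturing groups of their own;
# if they did, group numbers would no longer line up with rule indexes.
combined = re.compile("|".join("(%s)" % pattern for pattern, _ in RULES))

def which_rule(text, pos=0):
    """Return (rule index, matched text) for the rule that matches at pos."""
    m = combined.match(text, pos)
    if m is None:
        return None
    # lastindex is the number of the group that matched; groups are 1-based.
    return m.lastindex - 1, m.group()

first = which_rule("count 0x1F")       # (1, 'count')
second = which_rule("count 0x1F", 6)   # (2, '0x1F')
```

With this layout a single match call both finds the token and identifies the rule, at the cost of one capturing group per rule.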

Last updated Jan 5th, 2011

