This function uses regular expressions to break a structured text string into tokens. It can handle many kinds of code and data formats, recognising quoted strings as well as nested structures delimited by parentheses, brackets, and braces. If you need to tokenise a different syntax, you can supply a custom token pattern through the function arguments.
import re

def tokeniser( text, tokenpat=None, blockchar='()[]{}' ):
    'Lightweight text tokeniser for simple structured text'
    defpat = r'''
        (-?\d+\.?\d*)|      # find -nn.nnn or -nnn or nn numbers
        (\w+)|              # look for words (identifiers) next
        (".*?")|            # look for double-quoted strings
        ('.*?')|            # look for single quoted strings
        ([ \t]+)|           # gather white space (but not new lines)
        (\n)|               # check for a new line character
        (.)                 # capture any other text as single characters'''
    # pair each closing block character with its opener, eg ')' -> '('
    openchar, closechar = blockchar[0::2], blockchar[1::2]
    blockpair = dict( zip( closechar, openchar ) )
    stack = []
    block = []
    synpat = re.compile( tokenpat or defpat, re.M + re.S + re.X )
    for token in synpat.split( text ):
        if token:
            if token in openchar:
                # start a nested sub-list and descend into it
                block.append( [] )
                stack.append( block )
                block = block[-1]
            block.append( token )
            if token in closechar:
                # check the sub-list opened with the matching character,
                # then climb back out to the enclosing block
                assert block[0] == blockpair[ token ], 'Block end mismatch'
                assert stack, 'Block start mismatch'
                block = stack.pop()
    assert stack == [], 'Block not closed'
    return block

def showtokens( tokens, indent=0 ):
    for token in tokens:
        if type( token ) == list:
            showtokens( token, indent+1 )
        else:
            print '%sToken: %s' % (' '*indent, `token`)

if __name__ == '__main__':
    example = '''
for x in xseq[2:]:
\tprint fn( x*-5.5, "it\'s big", "", {'g':[0]} )
end
'''.strip()
    result = tokeniser( example )
    showtokens( result )
This is a simple, lightweight, and fast tokeniser and structure builder. The default pattern extracts numbers, identifiers, strings, white space, and punctuation. The output is a list of tokens with nested structures embedded as sub-lists.
The output for the example in the code above is:
Token: 'for'
Token: ' '
Token: 'x'
Token: ' '
Token: 'in'
Token: ' '
Token: 'xseq'
 Token: '['
 Token: '2'
 Token: ':'
 Token: ']'
Token: ':'
Token: '\n'
Token: '\t'
Token: 'print'
Token: ' '
Token: 'fn'
 Token: '('
 Token: ' '
 Token: 'x'
 Token: '*'
 Token: '-5.5'
 Token: ','
 Token: ' '
 Token: '"it\'s big"'
 Token: ','
 Token: ' '
 Token: '""'
 Token: ','
 Token: ' '
  Token: '{'
  Token: "'g'"
  Token: ':'
   Token: '['
   Token: '0'
   Token: ']'
  Token: '}'
 Token: ' '
 Token: ')'
Token: '\n'
Token: 'end'
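Because the sub-lists only wrap the original tokens, nothing is lost in the process: flattening the result reproduces the input text exactly. The helper below is my own addition rather than part of the recipe, and assumes it is appended to the script above so that result and example are in scope:

def flatten( tokens ):
    'Join a nested token list back into the original text'
    return ''.join( flatten( t ) if type( t ) == list else t for t in tokens )

assert flatten( result ) == example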
This function can be adapted for tokenising many different structured text formats (eg: JSON, XML, wiki markup, CSV files, configuration files, and custom small languages). It has also been used to extract times, dates, latitude/longitude, URLs, email addresses, and numbers from standard text files.
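For instance, extracting dates and numbers from plain text only needs a different pattern. The pattern and sample text below are a sketch of the idea rather than part of the recipe:

datepat = r'''
    (\d{4}-\d{2}-\d{2})|    # ISO-style dates
    (-?\d+\.?\d*)|          # numbers
    (\w+)|                  # words
    (\s+)|                  # white space (including new lines)
    (.)                     # anything else, one character at a time
    '''
tokens = tokeniser( 'Meeting on 2010-05-18 at 9.30', tokenpat=datepat )
# -> ['Meeting', ' ', 'on', ' ', '2010-05-18', ' ', 'at', ' ', '9.30']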
This technique has a number of limitations:
- No line numbers, error location, literal escapes, or input checks.
- For normal English, it will have trouble with single quote characters (eg: can't and 5'9").
- The ordering of regex terms and the use of non-greedy patterns are very important.
- The "blockchar" values must be found by the token pattern string.
For more complex tasks, you should consider PyParsing ( http://pyparsing.wikispaces.com/ ) or PLY ( http://www.dabeaz.com/ply/ ).
Update 18 May: A friend suggested I provide an example output and add comments to the regular expression to help those not familiar with Python. Hopefully this recipe is now a little more clear and useful.