
This function uses a regular expression to break a structured text string into a list of tokens. It handles many kinds of code and data formats, recognising quoted strings and building nested structures from parentheses, brackets, and braces. If you need to tokenise a different syntax, you can pass a custom token pattern as a function argument.

Python, 46 lines
import re

def tokeniser( text, tokenpat=None, blockchar='()[]{}' ):
	'Lightweight text tokeniser for simple structured text'
	defpat = r'''
		(-?\d+\.?\d*)|  # find -nn.nnn or -nnn or nn numbers
		(\w+)|          # look for words (identifiers) next
		(".*?")|        # look for double-quoted strings
		('.*?')|        # look for single quoted strings
		([ \t]+)|       # gather white space (but not new lines)
		(\n)|           # check for a new line character
		(.)             # capture any other text as single characters'''
	openchar, closechar = blockchar[0::2], blockchar[1::2]   # eg: '([{' and ')]}'
	blockpair = dict( zip( closechar, openchar ) )           # map each closing char to its opening partner
	stack = []
	block = []
	synpat = re.compile( tokenpat or defpat, re.M | re.S | re.X )
	for token in synpat.split( text ):   # every alternative is a group, so split() returns each captured token
		if token:                        # skip empty strings and the None entries for unmatched groups
			if token in openchar:        # opening character: start a new nested block
				block.append( [] )       # the sub-list that will hold the block's tokens
				stack.append( block )    # remember the parent block
				block = block[-1]        # and descend into the new sub-list
			block.append( token )
			if token in closechar:       # closing character: check the match and finish the block
				assert block[0] == blockpair[ token ], 'Block end mismatch'
				assert stack, 'Block start mismatch'
				block = stack.pop()      # pop back out to the enclosing block
	assert stack == [], 'Block not closed'
	return block

def showtokens( tokens, indent=0 ):
	for token in tokens:
		if isinstance( token, list ):          # nested block: recurse one indent level deeper
			showtokens( token, indent+1 )
		else:
			print '%sToken: %s' % ('    '*indent, repr( token ))

if __name__ == '__main__':
	example = '''
for x in xseq[2:]:
	print fn( x*-5.5, "it\'s big", "", {'g':[0]} )
end
	'''.strip()
	result = tokeniser( example )
	showtokens( result )

This is a simple, lightweight, and fast tokeniser and structure builder. The default pattern extracts numbers, identifiers, strings, white space, and punctuation. The output is a list of tokens with nested structures embedded as sub-lists.

The output for the example in the code above is:

Token: 'for'
Token: ' '
Token: 'x'
Token: ' '
Token: 'in'
Token: ' '
Token: 'xseq'
    Token: '['
    Token: '2'
    Token: ':'
    Token: ']'
Token: ':'
Token: '\n'
Token: '\t'
Token: 'print'
Token: ' '
Token: 'fn'
    Token: '('
    Token: ' '
    Token: 'x'
    Token: '*'
    Token: '-5.5'
    Token: ','
    Token: ' '
    Token: '"it\'s big"'
    Token: ','
    Token: ' '
    Token: '""'
    Token: ','
    Token: ' '
        Token: '{'
        Token: "'g'"
        Token: ':'
            Token: '['
            Token: '0'
            Token: ']'
        Token: '}'
    Token: ' '
    Token: ')'
Token: '\n'
Token: 'end'
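
Because the result is just nested Python lists, post-processing is straightforward. As a small sketch (the dropspace() helper below is hypothetical, not part of the recipe), the white space tokens can be stripped out while keeping the nesting:

def dropspace( tokens ):
	'Return a copy of a token list without the white space tokens (hypothetical helper)'
	out = []
	for token in tokens:
		if isinstance( token, list ):
			out.append( dropspace( token ) )   # recurse into nested blocks
		elif token.strip( ' \t\n' ):           # keep anything that is not pure white space
			out.append( token )
	return out

# dropspace( tokeniser( example ) ) gives:
# ['for', 'x', 'in', 'xseq', ['[', '2', ':', ']'], ':', 'print', 'fn',
#  ['(', 'x', '*', '-5.5', ',', '"it\'s big"', ',', '""', ',',
#   ['{', "'g'", ':', ['[', '0', ']'], '}'], ')'], 'end']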

This function can be adapted for tokenising many different structured text formats (eg: JSON, XML, wiki markup, CSV files, configuration files, and custom small languages). It has also been used to extract times, dates, latitude/longitude, URLs, email addresses, and numbers from standard text files.
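
As a rough sketch of that kind of adaptation (the linkpat pattern and the sample text below are assumptions, not part of the recipe), a custom tokenpat can pull URLs and email addresses out of ordinary prose; the text between the matches comes back as plain tokens that are easy to filter out:

# Hypothetical pattern: each alternative is a capture group, and split()
# also returns the unmatched text between the matches.
linkpat = r'''
	(https?://[^\s"'<>]+)|              # URLs
	([\w.+-]+@[\w-]+\.[\w.-]+)          # email addresses'''

prose = 'Mail bob@example.com or see https://example.com/docs for details.'
tokens = tokeniser( prose, tokenpat=linkpat, blockchar='' )   # blockchar='' turns off block building
links = [ t for t in tokens if t.startswith( 'http' ) or '@' in t ]
# links == ['bob@example.com', 'https://example.com/docs']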

This technique has a number of limitations:

  • No line numbers, error location, literal escapes, or input checks.
  • For ordinary English text it will mis-handle apostrophes, because they look like string quotes (eg: can't and 5'9").
  • The ordering of regex terms and the use of non-greedy patterns are very important.
  • The "blockchar" values must be found by the token pattern string.

For more complex tasks, you should consider PyParsing ( http://pyparsing.wikispaces.com/ ) or PLY ( http://www.dabeaz.com/ply/ ).

Update 18 May: A friend suggested I provide an example output and add comments to the regular expression to help those not familiar with Python. Hopefully this recipe is now a little clearer and more useful.