
One of the few problems with using Python to process XML is speed: once the XML becomes somewhat large (>1 MB), processing slows down dramatically as the size increases. One way to speed things up is to break the XML down by tag name and build a DOM for only the piece you need. This is especially handy if you are only interested in one part of the XML, or in the content between certain elements throughout the XML.

Here is a function that I came up with to handle this problem -- I call it "tinyDom". It uses the Sax reader from PyXML, although it could be easily changed for minidom, etc.

The input parameters are the XML as a string, the tag name that you want to build the DOM around, and an optional position to start at within the XML. It returns a DOM tree and the character position that it stopped at.

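import re
from xml.dom.ext.reader import Sax

def tinyDom(xmlStr, tagname, start=0):
    # Build regexes for the opening and the closing tag.
    # Note that this doesn't handle singleton tags, or elements
    # nested inside another element with the same tag name.
    name = re.escape(tagname)
    # "<tag" must be followed by whitespace or ">", so that
    # "item" does not also match "<items>"
    begRe = re.compile(r"<%s(?=[\s>])" % name)
    endRe = re.compile(r"</%s\s*>" % name)

    # Find the opening tag at or after `start`
    begTag = begRe.search(xmlStr, start)
    if not begTag:
        return None, start

    # Find the first closing tag that follows the opening tag
    endTag = endRe.search(xmlStr, begTag.end())
    if not endTag:
        return None, start

    # Both positions are absolute offsets into xmlStr
    beg = begTag.start()
    end = endTag.end()

    # Parse only the matched element and report where it ended
    return Sax.FromXml(xmlStr[beg:end]), end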

Note that if it can't find an opening or a closing tag, it returns "None" as the DOM. This is what happens when the element doesn't exist, and it is also how singleton tags are handled.
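For readers who don't have PyXML installed, here is a rough sketch of the same idea on top of xml.dom.minidom from the standard library. The name tiny_dom_minidom and the sample document are made up for illustration; this is not the original recipe, only one way the swap could look, and it has the same limitations with singleton and nested same-name tags.

```python
import re
from xml.dom import minidom

def tiny_dom_minidom(xml_str, tagname, start=0):
    """Hypothetical minidom variant of tinyDom (an assumption, not the original recipe)."""
    name = re.escape(tagname)
    # Opening tag: "<tag" followed by whitespace or ">"
    beg = re.compile(r"<%s(?=[\s>])" % name).search(xml_str, start)
    if beg is None:
        return None, start
    # First closing tag after the opening tag
    end = re.compile(r"</%s\s*>" % name).search(xml_str, beg.end())
    if end is None:
        return None, start
    # Parse only the matched element; the position is absolute within xml_str
    return minidom.parseString(xml_str[beg.start():end.end()]), end.end()

# Made-up sample document
doc = "<root><item id='1'>a</item><item id='2'>b</item><empty/></root>"

dom, pos = tiny_dom_minidom(doc, "item")
print(dom.documentElement.getAttribute("id"), pos)

# A singleton tag and a missing tag both come back as (None, start)
print(tiny_dom_minidom(doc, "empty"))
print(tiny_dom_minidom(doc, "missing"))
```

The returned minidom Document wraps just the one element, so dom.documentElement is the matched tag itself.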

This example assumes that you know the character position at which you want to start looking for the tag. The default is character 0 (the beginning of the XML), but you probably don't want that all the time.
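One common way to pick the start position is to feed the returned position back in, which walks through every matching element in turn. The sketch below assumes that usage; find_fragment and iter_doms are hypothetical helper names, the sample document is invented, and minidom stands in for the PyXML reader so that the example runs on a stock Python install.

```python
import re
from xml.dom import minidom

def find_fragment(xml_str, tagname, start=0):
    """Hypothetical helper: return (element_text, end_position), or (None, start)."""
    name = re.escape(tagname)
    beg = re.compile(r"<%s(?=[\s>])" % name).search(xml_str, start)
    # Only look for a closing tag if an opening tag was found
    end = beg and re.compile(r"</%s\s*>" % name).search(xml_str, beg.end())
    if not end:
        return None, start
    return xml_str[beg.start():end.end()], end.end()

def iter_doms(xml_str, tagname):
    """Yield one small DOM per matching element, resuming where the last one ended."""
    pos = 0
    while True:
        fragment, pos = find_fragment(xml_str, tagname, pos)
        if fragment is None:
            break
        yield minidom.parseString(fragment)

# Made-up sample document
doc = "<log><entry>first</entry><skip/><entry>second</entry></log>"
texts = [d.documentElement.firstChild.data for d in iter_doms(doc, "entry")]
print(texts)
```

Because each call starts at the previous end position, no part of the string is scanned twice and only one small DOM needs to be alive at a time.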