Breaking large XML documents into chunks to speed processing (Python recipe) by Mike Hostetler
ActiveState Code (http://code.activestate.com/recipes/84515/)

One of the few problems with using Python to process XML is speed: once the XML becomes somewhat large (>1 MB), processing slows down sharply as the size of the XML increases. One way to increase the processing speed is to break the XML into chunks by tag name and build a DOM only for each chunk. This is especially handy if you are only interested in one part of the XML, or in certain elements that recur throughout the XML.

Here is a function that I came up with to handle this problem -- I call it "tinyDom". It uses the Sax reader from PyXML, although it could easily be changed to use minidom, etc.

The input parameters are the XML as a string, the tag name that you want to build the DOM around, and an optional position to start at within the XML. It returns a DOM tree and the character position that it stopped at.

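import re
from xml.dom.ext.reader import Sax


def tinyDom(xmlStr, tagname, start=0):
    # Build regexes for the opening and the closing tag. The tag name is
    # escaped and must be followed by whitespace or ">", so "item" does
    # not match "<items>". Note that singleton and nested tags of the
    # same name are not handled.
    begStr = r"<%s[\s>]" % re.escape(tagname)
    endStr = r"</%s\s*>" % re.escape(tagname)

    # Find the opening tag at or after the start position.
    begTag = re.compile(begStr).search(xmlStr, start)
    if not begTag:
        return None, start

    # Find the first closing tag that follows the opening tag.
    endTag = re.compile(endStr).search(xmlStr, begTag.end())
    if not endTag:
        return None, start

    # Parse only the chunk and report where it ended in the full string.
    beg, end = begTag.start(), endTag.end()
    return Sax.FromXml(xmlStr[beg:end]), end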

Note that if it can't find an opening or a closing tag, it returns None as the DOM, together with the unchanged start position. This is what happens when the element doesn't exist, or when the only match is a singleton tag with no closing tag after it.
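
The recipe mentions that the reader could be swapped for minidom. The following is a minimal sketch of that swap, not the author's code: it assumes Python 3 and the standard library's xml.dom.minidom, and the name tiny_dom_minidom and the sample XML are made up for illustration. It also shows the None return for a tag that does not exist.

```python
import re
from xml.dom import minidom


def tiny_dom_minidom(xml_str, tagname, start=0):
    """Hypothetical minidom variant of tinyDom (an assumption, not the recipe)."""
    name = re.escape(tagname)
    # Opening tag: the tag name followed by whitespace or ">".
    beg = re.compile(r"<%s[\s>]" % name).search(xml_str, start)
    if beg is None:
        return None, start
    # First closing tag after the opening tag (nested tags are not handled).
    end = re.compile(r"</%s\s*>" % name).search(xml_str, beg.end())
    if end is None:
        return None, start
    # Parse only the chunk; return the DOM and the absolute end position.
    return minidom.parseString(xml_str[beg.start():end.end()]), end.end()


# Made-up sample document.
xml = "<log><entry id='1'>first</entry><entry id='2'>second</entry></log>"

dom, pos = tiny_dom_minidom(xml, "entry")
first_text = dom.documentElement.firstChild.data
first_id = dom.documentElement.getAttribute("id")

# A tag that is not present gives None and the unchanged start position.
missing, same_pos = tiny_dom_minidom(xml, "nosuchtag")
```

Only the text between the matched tags is handed to the parser, so the cost of building the DOM depends on the size of the chunk rather than on the size of the whole document.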

This example assumes that you know the character position at which you want to start looking for the tag. The default is character 0 (the beginning of the XML), but you probably don't want that all the time.
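
One way to get a useful start position is to feed the returned position back in on the next call. The sketch below wraps that idea in a generator. It is an illustration under assumptions rather than part of the recipe: iter_chunks is a hypothetical name, the sample XML is made up, and minidom stands in for the PyXML reader so that the block runs on Python 3.

```python
import re
from xml.dom import minidom


def iter_chunks(xml_str, tagname, start=0):
    """Yield (dom, end_position) for each <tagname>...</tagname> chunk.

    Hypothetical helper: same regex approach as tinyDom, with the
    patterns compiled once and the position carried between chunks.
    """
    name = re.escape(tagname)
    beg_re = re.compile(r"<%s[\s>]" % name)
    end_re = re.compile(r"</%s\s*>" % name)
    pos = start
    while True:
        beg = beg_re.search(xml_str, pos)
        if beg is None:
            return
        end = end_re.search(xml_str, beg.end())
        if end is None:
            return
        # Resume the next search where this chunk stopped.
        pos = end.end()
        yield minidom.parseString(xml_str[beg.start():pos]), pos


# Made-up sample document.
xml = "<log><entry>a</entry><note>skip</note><entry>b</entry></log>"

# Walk every <entry> chunk from the beginning of the document.
texts = [dom.documentElement.firstChild.data
         for dom, _ in iter_chunks(xml, "entry")]

# Starting past the first chunk skips it.
later = [dom.documentElement.firstChild.data
         for dom, _ in iter_chunks(xml, "entry", start=xml.index("<note>"))]
```

Each DOM can be discarded before the next one is built, so memory use stays close to the size of a single chunk.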

Tags: xml

Created by Mike Hostetler on Wed, 31 Oct 2001 (PSF)

Licensed under the PSF License