SAX is commonly used for large XML files because they don't fit into the memory that the friendlier DOM API requires.
When dealing with really large XML files, multiple passes over the file become costly.
The SAX handler in this recipe lets you process an XML file in multiple ways in a single pass by dispatching each parse event to the handlers you supply.
from xml.sax.handler import ContentHandler


def _curry(fn, *cargs, **ckwargs):
    """Return a callable that prepends cargs/ckwargs to the arguments of fn."""
    def call_fn(*fargs, **fkwargs):
        d = ckwargs.copy()
        d.update(fkwargs)
        return fn(*(cargs + fargs), **d)
    return call_fn


class MultiHandler(ContentHandler, object):
    """
    MultiHandler is a handler for the xml.sax parser.
    Its purpose is to dispatch calls to one or more other handlers.

    When dealing with really large XML files (say, Wikipedia's 100GB full-text
    dump) this is handy so that you can process the information in multiple
    (modular) ways without having to read the whole file off disk in separate
    passes.

    If an exception is thrown from a constituent handler call, MultiHandler
    will dump a diagnostic to the supplied errout (or stderr) and continue
    processing.

    Example usage:

        import sys
        from xml.sax import make_parser
        from xml.sax.handler import feature_namespaces
        from MultiHandler import MultiHandler

        mh = MultiHandler()
        mh.handlers.append(YourHandler())
        mh.handlers.append(YourOtherHandler())

        parser = make_parser()
        parser.setFeature(feature_namespaces, 0)
        parser.setContentHandler(mh)
        parser.parse(sys.stdin)
    """
    # ContentHandler is inherited only to make isinstance() happy;
    # we'll be overriding everything using new-style __getattribute__.

    def __init__(self, errout=None):
        self.handlers = []
        self.errout = errout
        if self.errout is None:
            import sys
            self.errout = sys.stderr

    def __getattribute__(self, name):
        # 'handlers' and 'errout' are real attributes; any other lookup
        # returns a callable that fans the call out to every handler.
        if name in ('handlers', 'errout'):
            return object.__getattribute__(self, name)

        def handlerCall(self, *args, **kwargs):
            for handler in self.handlers:
                try:
                    m = getattr(handler, name)
                    m(*args, **kwargs)
                except Exception:
                    self.errout.write('MultiHandler: error dispatching %s to handler %s\n'
                                      % (name, str(handler)))

        return _curry(handlerCall, self)
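To see the fan-out in isolation, here is a small self-contained sketch (not part of the original recipe): two trivial handlers, one counting start tags and one collecting character data, are driven by a single parse. The handler names and the tiny XML snippet are invented for illustration.

from xml.sax import parseString
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    """Counts every start tag it sees."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0
    def startElement(self, name, attrs):
        self.count += 1

class TextCollector(ContentHandler):
    """Accumulates all character data."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

counter = ElementCounter()
collector = TextCollector()

mh = MultiHandler()
mh.handlers.append(counter)
mh.handlers.append(collector)

# One pass over the document drives both handlers.
parseString(b'<doc><item>one</item><item>two</item></doc>', mh)

print(counter.count)              # 3 start tags: doc, item, item
print(''.join(collector.chunks))  # 'onetwo'

If one of the handlers raised an exception during the parse, the diagnostic would go to stderr and the other handler would still see every event.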
In working with the obscenely large Wikipedia English full-text XML dumps (100 GB and growing rapidly), I needed to handle the files in different ways, including:
* Creating a file per page (sketched below)
* Slicing the pages to keep only revisions within a certain period of time
* Keeping only a specific list of pages
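Here is a rough sketch (not from the original recipe) of that first handler: it buffers each page's raw character data and writes it to a file named after the page title when the page element closes. The element names 'page' and 'title' are my assumption about the dump's schema; adjust them to whatever the dump actually uses, and register the handler on the MultiHandler alongside the others.

import codecs
import os
from xml.sax.handler import ContentHandler

class PagePerFileHandler(ContentHandler):
    """Writes the character data of each page to its own file (sketch only)."""
    def __init__(self, outdir):
        ContentHandler.__init__(self)
        self.outdir = outdir
        self.in_title = False
        self.title = []
        self.buffer = []

    def startElement(self, name, attrs):
        if name == 'page':            # assumed element name for one article
            self.title = []
            self.buffer = []
        elif name == 'title':         # assumed element name for the page title
            self.in_title = True

    def characters(self, content):
        if self.in_title:
            self.title.append(content)
        self.buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            self.in_title = False
        elif name == 'page':
            # Derive a crude filesystem-safe name from the title and dump the page.
            safe = ''.join(c if c.isalnum() else '_' for c in ''.join(self.title))
            out = codecs.open(os.path.join(self.outdir, safe or 'untitled'), 'w', 'utf-8')
            out.write(''.join(self.buffer))
            out.close()

# Registered like any other constituent handler, e.g.:
#     mh.handlers.append(PagePerFileHandler('pages'))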
The initial run, which just sliced the pages, took 6 days to complete using a 3 GHz processor and a new 200 GB SATA drive running ReiserFS. Python was only using 30% of the CPU.
You may wish to use this when facing a similar situation: large XML files you'd like to handle in more than one way without making multiple passes.