
SAX is commonly used for large XML files because they don't fit into the memory that the friendlier DOM API requires.

When dealing with really large XML files, making multiple passes over the file becomes costly.

The SAX handler in this recipe lets you process an XML file in multiple ways in a single pass by dispatching events to the handlers you supply.

Python
from xml.sax.handler import ContentHandler

def _curry(fn, *cargs, **ckwargs):
    # Bind leading positional and keyword arguments to fn
    # (equivalent to functools.partial).
    def call_fn(*fargs, **fkwargs):
        d = ckwargs.copy()
        d.update(fkwargs)
        return fn(*(cargs + fargs), **d)
    return call_fn

class MultiHandler(ContentHandler, object):
    """
    MultiHandler is a handler for the xml.sax parser.
    Its purpose is to dispatch calls to one or more other handlers.
    When dealing with really large XML files (say, Wikipedia's 100GB full text dump),
    this is handy so that you can process the information in multiple (modular) ways
    without having to read the whole file off disk in separate passes.

    If an exception is thrown from a constituent handler call, MultiHandler will
       dump a diagnostic to the supplied errout (or stderr)
       and continue processing.

    Example usage:
       import sys
       from xml.sax import make_parser
       from xml.sax.handler import feature_namespaces, ContentHandler
       from MultiHandler import MultiHandler

       mh = MultiHandler()
       mh.handlers.append(YourHandler())
       mh.handlers.append(YourOtherHandler())
       parser = make_parser()
       parser.setFeature(feature_namespaces, 0)
       parser.setContentHandler(mh)
       parser.parse(sys.stdin)
    """
    # ContentHandler is inherited only to keep isinstance checks happy;
    # everything is overridden via the new-style __getattribute__ below.
    def __init__(self, errout=None):
        self.handlers = []
        self.errout = errout
        if self.errout is None:
            import sys
            self.errout = sys.stderr

    def __getattribute__(self, name):
        # 'handlers' and 'errout' are real attributes; every other lookup
        # becomes a call that is dispatched to each registered handler.
        if name == 'handlers' or name == 'errout':
            return object.__getattribute__(self, name)

        def handlerCall(self, *args, **kwargs):
            for handler in self.handlers:
                try:
                    m = getattr(handler, name)
                    m(*args, **kwargs)
                except:
                    self.errout.write('MultiHandler: error dispatching %s to handler %s\n' %
                                      (name, str(handler)))

        return _curry(handlerCall, self)
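
As a quick check of the dispatching behaviour, here is a minimal, self-contained sketch. The two toy handlers (TitleCounter, TitlePrinter) and the tiny in-memory document are illustrative assumptions, not part of the recipe; MultiHandler is assumed to be defined or imported as above.

    import io
    from xml.sax import make_parser
    from xml.sax.handler import feature_namespaces, ContentHandler

    class TitleCounter(ContentHandler):
        """Counts <title> elements seen during the parse."""
        def __init__(self):
            self.count = 0
        def startElement(self, name, attrs):
            if name == 'title':
                self.count += 1

    class TitlePrinter(ContentHandler):
        """Prints the character data found inside <title> elements."""
        def __init__(self):
            self.in_title = False
        def startElement(self, name, attrs):
            self.in_title = (name == 'title')
        def endElement(self, name):
            self.in_title = False
        def characters(self, content):
            if self.in_title:
                print(content)

    doc = io.BytesIO(b"<pages><page><title>Alpha</title></page>"
                     b"<page><title>Beta</title></page></pages>")

    counter = TitleCounter()
    mh = MultiHandler()
    mh.handlers.append(counter)
    mh.handlers.append(TitlePrinter())

    parser = make_parser()
    parser.setFeature(feature_namespaces, 0)
    parser.setContentHandler(mh)
    parser.parse(doc)        # single pass; both handlers see every event
    print(counter.count)     # prints 2

Every SAX event reaches both handlers, so the document is read exactly once no matter how many handlers are registered.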

In working with the obscenely large Wikipedia English full-text XML dumps (100 GB and growing rapidly), I needed to handle the files in several different ways, including:

* Creating a file per page (a sketch of this case follows the list)
* Slicing the pages to keep only revisions within a certain period of time
* Keeping only a specific list of pages
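
The file-per-page case can be approximated with one more handler registered alongside the others. This is only a sketch under assumptions: the PageWriter name, the <page> element, and writing plain character data (not the original markup) to numbered files are illustrative choices, not the recipe author's actual code for the dump.

    import os
    from xml.sax.handler import ContentHandler

    class PageWriter(ContentHandler):
        """One-file-per-page sketch: collects the character data inside each
        <page> element and writes it to <outdir>/page_<n>.txt."""
        def __init__(self, outdir='pages'):
            self.outdir = outdir
            self.count = 0
            self.buffer = None      # None outside a <page> element
            if not os.path.isdir(outdir):
                os.makedirs(outdir)
        def startElement(self, name, attrs):
            if name == 'page':
                self.buffer = []
        def characters(self, content):
            if self.buffer is not None:
                self.buffer.append(content)
        def endElement(self, name):
            if name == 'page':
                self.count += 1
                path = os.path.join(self.outdir, 'page_%d.txt' % self.count)
                with open(path, 'w') as f:
                    f.write(''.join(self.buffer))
                self.buffer = None

It would be registered like any other handler, e.g. mh.handlers.append(PageWriter()), and would run in the same single pass as the slicing and filtering handlers.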

The initial run, which just sliced the pages, took 6 days to complete on a 3 GHz processor with a new 200 GB SATA drive running ReiserFS, and Python was using only 30% of the CPU.

You may wish to use this when facing a similar situation: large XML files you'd like to handle in more than one way without making multiple passes.

Created by Jeremy Dunck on Thu, 5 Jan 2006 (PSF)
