A SAX parser can report contiguous text using multiple characters events. This is often unexpected and can cause obscure bugs or require complicated adjustments to SAX handlers. Inserting text_normalize_filter into the SAX handler chain assures all downstream handlers that each text node in the document Infoset is reported as a single SAX characters event.
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """
    SAX filter to ensure that contiguous text nodes are
    delivered merged into a single node
    """
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
        return

    def _complete_text_node(self):
        #Flush any buffered text as one characters event
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
        return

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)
        return

    def startElementNS(self, name, qname, attrs):
        self._complete_text_node()
        self._downstream.startElementNS(name, qname, attrs)
        return

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)
        return

    def endElementNS(self, name, qname):
        self._complete_text_node()
        self._downstream.endElementNS(name, qname)
        return

    def processingInstruction(self, target, body):
        self._complete_text_node()
        self._downstream.processingInstruction(target, body)
        return

    def comment(self, body):
        self._complete_text_node()
        self._downstream.comment(body)
        return

    def characters(self, text):
        self._accumulator.append(text)
        return

    def ignorableWhitespace(self, ws):
        #Buffer the whitespace that was actually passed in
        self._accumulator.append(ws)
        return


if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser()
    #XMLGenerator is a special SAX handler that merely writes
    #SAX events back into an XML document
    downstream_handler = XMLGenerator()
    #upstream, the parser, downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    #The SAX filter base is designed so that the filter takes
    #on much of the interface of the parser itself, including the
    #"parse" method
    filter_handler.parse(sys.argv[1])
Update: See updated versions of this recipe in the Python Cookbook, 2nd Edition:
http://www.oreilly.com/catalog/pythoncook2/
And as part of Amara XML Toolkit (class amara.saxtools.normalize_text_filter):
http://uche.ogbuji.net/uche.ogbuji.net/tech/4Suite/amara/
A SAX parser can report contiguous text using multiple characters events. In other words, given the following XML document:
abc
The text "abc" could technically be reported as three characters events: one for the "a" character, one for the "b" and a third for the "c". Such an extreme case is unlikely in real life, but not impossible.
The usual reason a parser reports a text node in pieces is buffering of the XML input source. Most low-level parsers read and parse the input a fixed number of characters at a time. If a text node straddles such a buffer boundary, many parsers will just wrap up the current text event and start a new one to send characters from the next buffer. If you don't account for this in your SAX handlers you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when your code will run in an environment where a different parser is selected. You'd need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
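The buffering effect can be observed with a small sketch that hands a document to the parser in pieces. The handler name ChunkRecorder, the doc element and the split point are illustrative assumptions, not part of the recipe. How many characters events arrive depends on the parser and its version, so the sketch promises no particular count, only that the pieces add up to the full text.

```python
from xml import sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Hypothetical handler that keeps every characters event separately."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


recorder = ChunkRecorder()
parser = sax.make_parser()
parser.setContentHandler(recorder)

# Feed the document in two pieces so the text "abc" straddles the boundary,
# imitating a text node that crosses a read-buffer boundary.
parser.feed(b"<doc>ab")
parser.feed(b"c</doc>")
parser.close()

# The number of chunks is parser-dependent; only the joined text is reliable.
print(len(recorder.chunks), "".join(recorder.chunks))
```

A handler that assumes one characters call per text node would see only part of the text whenever the count printed here is greater than one.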
text_normalize_filter ensures that all text events are reported to downstream SAX handlers in the manner most developers would expect. In the above example case, the filter would consolidate the three characters events into a single one for the entire text node "abc".
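The consolidation can be checked without relying on any particular parser by firing the events by hand. The sketch below uses a condensed stand-in for the recipe's filter (element and text events only); the names MergingFilter and Recorder and the doc element are assumptions made for illustration.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class Recorder(ContentHandler):
    """Hypothetical downstream handler that logs each event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


class MergingFilter(XMLFilterBase):
    """Condensed stand-in for text_normalize_filter."""

    def __init__(self, upstream, downstream):
        super().__init__(upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        # Flush buffered text as a single characters event.
        if self._accumulator:
            self._downstream.characters("".join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

    def characters(self, content):
        self._accumulator.append(content)


recorder = Recorder()
# No real parser is needed here, so the upstream argument is None.
merging_filter = MergingFilter(None, recorder)

# Simulate a parser that reports "abc" one character at a time.
merging_filter.startElement("doc", {})
for piece in "abc":
    merging_filter.characters(piece)
merging_filter.endElement("doc")

print(recorder.events)
# [('start', 'doc'), ('chars', 'abc'), ('end', 'doc')]
```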
For more on SAX filters in general, see my article "Tip: SAX filters for flexible processing":
http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflex.html
Warning: XMLGenerator as of Python 2.3 does not do anything with comments or PIs, so if you run the main code on XML with either feature, you'll have a gap in the output, along with other likely but minor deviations between input and output.
comment() won't be called. The comment method is on LexicalHandler, which isn't filtered by XMLFilterBase.
To get comments flowing into the filter, modify as follows (approximately; my code's at work at the moment...)
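The commenter's own code is not preserved here. What follows is only a sketch of one way to do it, assuming the upstream reader supports the SAX lexical-handler property (the expat reader bundled with Python does). The names CommentAwareFilter and Recorder are hypothetical, and the filter is condensed to element, text and comment events.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, property_lexical_handler
from xml.sax.saxutils import XMLFilterBase


class Recorder(ContentHandler):
    """Hypothetical downstream handler that also accepts comments."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))

    def comment(self, body):
        self.events.append(("comment", body))


class CommentAwareFilter(XMLFilterBase):
    """Text-merging filter that registers itself as the lexical handler."""

    def __init__(self, upstream, downstream):
        super().__init__(upstream)
        self._downstream = downstream
        self._accumulator = []
        # Assumption: the upstream reader recognizes this property.
        upstream.setProperty(property_lexical_handler, self)

    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters("".join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

    def characters(self, content):
        self._accumulator.append(content)

    # Lexical-handler events; only comment() does any work in this sketch.
    def comment(self, body):
        self._complete_text_node()
        forward = getattr(self._downstream, "comment", None)
        if forward is not None:
            forward(body)

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass


recorder = Recorder()
comment_filter = CommentAwareFilter(sax.make_parser(), recorder)
comment_filter.parse(io.BytesIO(b"<doc>a<!-- note -->b</doc>"))
print(recorder.events)
```

A comment splits the surrounding text into two text nodes, so the text before and after it is delivered as separate characters events even with the filter in place.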
Hard to chain. Taking both upstream and downstream parameters in the __init__ makes it hard to chain this with other filters.
I prefer to keep the XMLFilterBase __init__ signature:
and then, rather than calling handlers on the downstream handler directly, delegate to the default XMLFilterBase handlers. These call out to whatever handlers have been set on the filter (with setContentHandler, etc).
This lets you chain filters like so:
and also lets you use a filter in contexts which call the parse method on your behalf:
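The code from this comment is not preserved here either. A rough sketch of the approach it describes, with the standard XMLFilterBase constructor and delegation to the base-class handlers, might look like the following. ChainableTextNormalizer, UpperCaseFilter and Recorder are hypothetical names, and the filter is condensed to element and text events.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class ChainableTextNormalizer(XMLFilterBase):
    """Merges text and forwards through XMLFilterBase's own handlers."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._accumulator = []

    def _complete_text_node(self):
        if self._accumulator:
            # The base class passes the event to whatever content handler
            # was registered with setContentHandler.
            super().characters("".join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        super().endElement(name)

    def characters(self, content):
        self._accumulator.append(content)


class UpperCaseFilter(XMLFilterBase):
    """Hypothetical second filter, present only to show chaining."""

    def characters(self, content):
        super().characters(content.upper())


class Recorder(ContentHandler):
    """Hypothetical final handler that logs the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


recorder = Recorder()
# parser -> ChainableTextNormalizer -> UpperCaseFilter -> recorder
chain = UpperCaseFilter(ChainableTextNormalizer(sax.make_parser()))
chain.setContentHandler(recorder)
chain.parse(io.BytesIO(b"<doc>abc</doc>"))

print(recorder.events)
# [('start', 'doc'), ('chars', 'ABC'), ('end', 'doc')]
```

Each filter only knows its parent reader, so filters can be nested in any order and the final handler is attached once, on the outermost filter.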
Quite nice, but the filter fails on this little file, which contains a German umlaut:
Traceback
Follow-up to my comment above: I think I was wrong. Printing Unicode in the shell is not really a use case, so I will rate the recipe with a lot of stars :-)