A SAX parser can report contiguous text using multiple characters events. This is often unexpected and can cause obscure bugs or require complicated adjustments to SAX handlers. Inserting text_normalize_filter into the SAX handler chain guarantees that every downstream handler receives each text node in the document Infoset as a single SAX characters event.
```python
from xml.sax.saxutils import XMLFilterBase


class text_normalize_filter(XMLFilterBase):
    """
    SAX filter to ensure that contiguous text nodes are
    delivered merged into a single node
    """

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        # Flush any buffered text downstream as one characters event
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def startElementNS(self, name, qname, attrs):
        self._complete_text_node()
        self._downstream.startElementNS(name, qname, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

    def endElementNS(self, name, qname):
        self._complete_text_node()
        self._downstream.endElementNS(name, qname)

    def processingInstruction(self, target, body):
        self._complete_text_node()
        self._downstream.processingInstruction(target, body)

    def comment(self, body):
        self._complete_text_node()
        self._downstream.comment(body)

    def characters(self, text):
        # Buffer only; the text node may not be complete yet
        self._accumulator.append(text)

    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)


if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser()
    # XMLGenerator is a special SAX handler that merely writes
    # SAX events back into an XML document
    downstream_handler = XMLGenerator()
    # upstream, the parser, downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    # The SAX filter base is designed so that the filter takes
    # on much of the interface of the parser itself, including the
    # "parse" method
    filter_handler.parse(sys.argv[1])
```
Update: See updated versions of this recipe in the Python Cookbook, 2nd Edition:
And as part of Amara XML Toolkit (class amara.saxtools.normalize_text_filter):
A SAX parser can report contiguous text using multiple characters events. In other words, given the following XML document:
The text "abc" could technically be reported as three characters events: one for the "a" character, one for the "b" and a third for the "c". Such an extreme case is unlikely in real life, but not impossible.
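One way to observe this behavior is to feed a document to the parser in pieces. The following is only a sketch: the one-element document `<doc>abc</doc>` and the `ChunkRecorder` class are assumptions made for illustration, and the code relies on the incremental `feed()`/`close()` interface of the default `xml.sax` parser. How many events actually arrive depends on the parser and on how the input is chunked, so the count it prints should be taken as illustrative only.

```python
from xml import sax


class ChunkRecorder(sax.ContentHandler):
    """Hypothetical handler that records each characters event separately."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


recorder = ChunkRecorder()
parser = sax.make_parser()
parser.setContentHandler(recorder)

# Feed the assumed document <doc>abc</doc> in pieces so that the
# text node straddles several input buffers.
for piece in ("<doc>a", "b", "c</doc>"):
    parser.feed(piece)
parser.close()

# The parser may have delivered the text in more than one event;
# joining the pieces always recovers the full text node.
print(recorder.chunks)
print("".join(recorder.chunks))
```

Whatever the first line prints, a handler that treats each entry of `chunks` as a complete text node is making an assumption the SAX API does not guarantee.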
The usual reason a parser reports text nodes in pieces is buffering of the XML input source. Most low-level parsers read and parse the input one buffer at a time, each buffer holding a certain number of characters. If a text node straddles a buffer boundary, many parsers simply wrap up the current text event and start a new one for the characters in the next buffer. If you don't account for this in your SAX handlers, you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when your code will run in an environment where a different parser is selected. You would need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
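To give a sense of that extra logic, here is a minimal sketch of a handler that has to buffer text itself. The `TitleCollector` class and the `title` element are hypothetical, and the split text is simulated by calling the handler methods by hand rather than by running a real parser.

```python
from xml import sax


class TitleCollector(sax.ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Cannot assume this is the whole text node, so only buffer here
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            # Only now is the text known to be complete
            self.titles.append("".join(self._buffer))
            self._in_title = False


collector = TitleCollector()
# Simulate a parser that splits one text node into two events
collector.startElement("title", {})
collector.characters("Hello, ")
collector.characters("world")
collector.endElement("title")
print(collector.titles)  # ['Hello, world']
```

The flag and buffer here exist only to work around fragmented text. Every handler that cares about text content needs similar bookkeeping unless the merging is done once, upstream.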
text_normalize_filter ensures that all text events are reported to downstream SAX handlers in the manner most developers would expect. In the above example case, the filter would consolidate the three characters events into a single one for the entire text node "abc".
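The consolidation can be checked without a parser by driving a filter by hand. The sketch below uses `MergeTextSketch`, a condensed stand-in for text_normalize_filter that handles only the non-namespace element events, together with a hypothetical `EventLog` downstream handler. Both names and the `doc` element are assumptions made for this illustration.

```python
from xml.sax import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class MergeTextSketch(XMLFilterBase):
    """Condensed stand-in for text_normalize_filter (element events only)."""

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters("".join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

    def characters(self, text):
        self._accumulator.append(text)


class EventLog(ContentHandler):
    """Hypothetical downstream handler that logs the events it receives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("characters", content))


log = EventLog()
# No real parser is attached; the events are driven by hand instead
merger = MergeTextSketch(None, log)
merger.startElement("doc", {})
for piece in ("a", "b", "c"):
    merger.characters(piece)
merger.endElement("doc")
print(log.events)
# [('start', 'doc'), ('characters', 'abc'), ('end', 'doc')]
```

The three upstream characters events reach the downstream handler as one, because text is only flushed when the next non-text event arrives.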
For more on SAX filters in general, see my article "Tip: SAX filters for flexible processing":
Warning: XMLGenerator as of Python 2.3 does not do anything with comments or PIs, so if you run the main code on XML with either feature, you'll have a gap in the output, along with other likely but minor deviations between input and output.