A SAX parser can report contiguous text using multiple characters events. This is often unexpected and can cause obscure bugs or require complicated adjustments to SAX handlers. Inserting text_normalize_filter into the SAX handler chain guarantees that every downstream handler receives each text node in the document Infoset as a single SAX characters event.
Python, 73 lines
Update: See updated versions of this recipe in the Python Cookbook, 2nd Edition:
And as part of Amara XML Toolkit (class amara.saxtools.normalize_text_filter):
A SAX parser can report contiguous text using multiple characters events. In other words, given the following XML document:
The text "abc" could technically be reported as three characters events: one for the "a" character, one for the "b" and a third for the "c". Such an extreme case is unlikely in real life, but not impossible.
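To make the risk concrete, here is a minimal sketch of a handler that silently assumes one characters event per text node. The class name `NaiveTextHandler`, the helper `replay`, and the element name `doc` are made up for this illustration; the events are replayed by hand rather than produced by a real parser, so the result does not depend on any particular parser's buffering.

```python
from xml.sax import ContentHandler


class NaiveTextHandler(ContentHandler):
    """Hypothetical handler that assumes one characters event per text node."""

    def __init__(self):
        super().__init__()
        self.text = None

    def characters(self, content):
        # Overwrites instead of appending: only safe if text arrives in one piece.
        self.text = content


def replay(pieces):
    """Send the given text pieces to a fresh handler as separate characters events."""
    handler = NaiveTextHandler()
    handler.startDocument()
    handler.startElement("doc", {})
    for piece in pieces:
        handler.characters(piece)
    handler.endElement("doc")
    handler.endDocument()
    return handler.text


print(replay(["abc"]))          # one event for the whole text node
print(replay(["a", "b", "c"]))  # the same text node split into three events
```

When the text arrives in one piece the handler sees "abc", but when it is split into three events only the last piece, "c", survives.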
The usual reason a parser reports a text node in pieces is buffering of the XML input source. Most low-level parsers read and parse the input a fixed number of characters at a time. If a text node straddles such a buffer boundary, many parsers will just wrap up the current text event and start a new one to send the characters from the next buffer. If you don't account for this in your SAX handlers you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when your code will be run in an environment where a different parser is selected. You would need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
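The buffering effect can be explored with the incremental interface of the standard library parser. The sketch below feeds a small document in hand-picked chunks and records every characters event; `PieceRecorder`, `parse_in_chunks`, and the `doc` element are names invented for the illustration. How many events you actually see depends on the parser and its version, so the code prints the count rather than assuming one.

```python
import xml.sax
from xml.sax import ContentHandler


class PieceRecorder(ContentHandler):
    """Hypothetical handler that records each characters event separately."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def characters(self, content):
        self.pieces.append(content)


def parse_in_chunks(chunks):
    """Feed the chunks one at a time to simulate input buffer boundaries."""
    recorder = PieceRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(recorder)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return recorder.pieces


# The chunk boundaries fall in the middle of the text node "abc".
pieces = parse_in_chunks(["<doc>a", "b", "c</doc>"])
print(len(pieces), "characters event(s):", pieces)
```

Whatever the count turns out to be, joining the recorded pieces always gives back the full text, which is the only property a handler can safely rely on.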
text_normalize_filter ensures that all text events are reported to downstream SAX handlers in the manner most developers would expect. In the above example case, the filter would consolidate the three characters events into a single one for the entire text node "abc".
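The following is a rough sketch of how such a filter could be built on `xml.sax.saxutils.XMLFilterBase`. It is not the recipe's own listing: the names `NormalizingFilterSketch`, `Recorder`, `replay_split_events`, and `parse_with_filter`, as well as the `doc` element, are assumptions made for this illustration. The idea is to hold back characters events and flush them as one event whenever any other content event arrives.

```python
import io
import xml.sax
from xml.sax import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class NormalizingFilterSketch(XMLFilterBase):
    """Illustrative filter: forwards each text node as a single characters event."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._pending = []

    def _flush(self):
        # Send any buffered text downstream as one event, then clear the buffer.
        if self._pending:
            text = "".join(self._pending)
            self._pending = []
            super().characters(text)

    def characters(self, content):
        self._pending.append(content)  # hold back until the text node ends

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)

    def startElementNS(self, name, qname, attrs):
        self._flush()
        super().startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        self._flush()
        super().endElementNS(name, qname)

    def processingInstruction(self, target, data):
        self._flush()
        super().processingInstruction(target, data)

    def endDocument(self):
        self._flush()
        super().endDocument()


class Recorder(ContentHandler):
    """Hypothetical downstream handler that logs the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def replay_split_events():
    """Drive the filter by hand with the text node split into three events."""
    recorder = Recorder()
    filt = NormalizingFilterSketch()
    filt.setContentHandler(recorder)
    filt.startDocument()
    filt.startElement("doc", {})
    for piece in "abc":
        filt.characters(piece)
    filt.endElement("doc")
    filt.endDocument()
    return recorder.events


def parse_with_filter(xml_text):
    """Put the filter between a real parser and the recorder."""
    recorder = Recorder()
    filt = NormalizingFilterSketch(xml.sax.make_parser())
    filt.setContentHandler(recorder)
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return recorder.events


print(replay_split_events())
print(parse_with_filter("<doc>abc</doc>"))
```

In both cases the recorder sees a start event, one characters event carrying "abc", and an end event. The sketch only covers the ContentHandler events listed above; comments are reported through a separate lexical handler interface and are outside its scope.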
For more on SAX filters in general, see my article "Tip: SAX filters for flexible processing":
Warning: XMLGenerator as of Python 2.3 does not do anything with comments or PIs, so if you run the main code on XML with either feature, you'll have a gap in the output, along with other likely but minor deviations between input and output.