A SAX parser can report contiguous text using multiple characters events. This is often unexpected and can cause obscure bugs or require complicated adjustments to SAX handlers. Inserting text_normalize_filter into the SAX handler chain guarantees that every downstream handler receives each text node in the document Infoset as a single SAX characters event.

Python, 73 lines
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """
    SAX filter to ensure that contiguous text events (including
    ignorable white space) are delivered merged into a single node
    """
    
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
        return

    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
        return

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)
        return

    def startElementNS(self, name, qname, attrs):
        self._complete_text_node()
        self._downstream.startElementNS(name, qname, attrs)
        return

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)
        return

    def endElementNS(self, name, qname):
        self._complete_text_node()
        self._downstream.endElementNS(name, qname)
        return

    def processingInstruction(self, target, body):
        self._complete_text_node()
        self._downstream.processingInstruction(target, body)
        return

    def comment(self, body):
        self._complete_text_node()
        self._downstream.comment(body)
        return

    def characters(self, text):
        self._accumulator.append(text)
        return

    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)
        return


if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser()
    #XMLGenerator is a special SAX handler that merely writes
    #SAX events back into an XML document
    downstream_handler = XMLGenerator()
    #upstream, the parser, downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    #The SAX filter base is designed so that the filter takes
    #on much of the interface of the parser itself, including the
    #"parse" method
    filter_handler.parse(sys.argv[1])

Update: See updated versions of this recipe in the Python Cookbook, 2nd Edition:

http://www.oreilly.com/catalog/pythoncook2/

And as part of Amara XML Toolkit (class amara.saxtools.normalize_text_filter):

http://uche.ogbuji.net/uche.ogbuji.net/tech/4Suite/amara/

A SAX parser can report contiguous text using multiple characters events. In other words, given the following XML document:

abc

The text "abc" could technically be reported as three characters events: one for the "a" character, one for the "b" and a third for the "c". Such an extreme case is unlikely in real life, but not impossible.

The usual reason a parser reports a text node in pieces is buffering of the XML input source. Most low-level parsers read and parse a buffer of a certain number of characters at a time. If a text node straddles a buffer boundary, many parsers simply wrap up the current text event and start a new one for the characters from the next buffer. If you don't account for this in your SAX handlers, you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when your code will run in an environment where a different parser is selected. You would need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
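To see how a given parser actually chunks text, a small recording handler can help. The following is a minimal sketch, not part of the recipe: ChunkRecorder and the sample document are made up for illustration, and the number of events printed depends on the parser and its version, so no particular count should be relied on.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Hypothetical handler that records every characters event separately."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Store each event as delivered, without merging
        self.chunks.append(content)


recorder = ChunkRecorder()
# The entity reference and the newline are included because they are places
# where a parser may choose to break the text (an assumption, not a guarantee).
xml.sax.parseString(b"<doc>spam &amp; eggs\nand ham</doc>", recorder)

print(len(recorder.chunks), "characters event(s):", recorder.chunks)
print("joined:", repr("".join(recorder.chunks)))
```

However many events are reported, joining the chunks always yields the full text node, which is exactly the work the filter does on behalf of downstream handlers.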

text_normalize_filter ensures that all text events are reported to downstream SAX handlers in the manner most developers would expect. In the above example, the filter would consolidate the three characters events into a single one for the entire text node "abc".
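The consolidation can be checked without a parser by driving a filter by hand. This is a sketch under stated assumptions: MiniNormalizeFilter is a cut-down stand-in for text_normalize_filter (element and text handling only), EventRecorder is a hypothetical downstream handler, and the element name "doc" is invented because the sample document's markup is not shown above.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class MiniNormalizeFilter(XMLFilterBase):
    """Cut-down stand-in for text_normalize_filter (hypothetical name)."""

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        # Flush any buffered text downstream as one characters event
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

    def characters(self, text):
        # Buffer only; nothing is sent downstream yet
        self._accumulator.append(text)


class EventRecorder(ContentHandler):
    """Hypothetical downstream handler that logs the events it receives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('startElement', name))

    def endElement(self, name):
        self.events.append(('endElement', name))

    def characters(self, content):
        self.events.append(('characters', content))


recorder = EventRecorder()
# No upstream parser is needed because the events are fired manually
text_filter = MiniNormalizeFilter(None, recorder)

text_filter.startElement('doc', {})
for piece in ('a', 'b', 'c'):   # the extreme three-event case
    text_filter.characters(piece)
text_filter.endElement('doc')

print(recorder.events)
```

The recorder ends up with a single ('characters', 'abc') entry between the start and end of the element.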

For more on SAX filters in general, see my article "Tip: SAX filters for flexible processing":

http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflex.html

Warning: XMLGenerator as of Python 2.3 does not do anything with comments or PIs, so if you run the main code on XML with either feature, you'll have a gap in the output, along with other likely but minor deviations between input and output.

5 comments

James Kew 20 years, 3 months ago

comment() won't be called. The comment method is on LexicalHandler, which isn't filtered by XMLFilterBase.

To get comments flowing into the filter, modify as follows (approximately; my code's at work at the moment...)

from xml.sax.saxutils import XMLFilterBase
# LexicalHandler is in xml.sax.handler on Python 3.10+
# (older code imported it from xml.sax.saxlib)
from xml.sax.handler import LexicalHandler, property_lexical_handler

# Deriving from LexicalHandler gets you default no-op impls
class text_normalize_filter(XMLFilterBase, LexicalHandler):
    # __init__ as before
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    # ... remaining handler methods as before ...

    # Override comment() method in LexicalHandler
    def comment(self, body):
        # Impl as before
        self._complete_text_node()
        self._downstream.comment(body)

    # Override XMLFilterBase.parse to connect the LexicalHandler
    # Can only do this by setting the relevant property
    # May throw SAXNotSupportedException
    def parse(self, source):
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)
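
A self-contained sketch of the same idea follows; it is an illustration, not the commenter's actual code. CommentAwareFilter and Recorder are hypothetical names, element events are deliberately not forwarded to keep it short, and it assumes the default parser returned by make_parser() supports the lexical-handler property.

```python
import io
from xml import sax
from xml.sax.handler import property_lexical_handler
from xml.sax.saxutils import XMLFilterBase


class CommentAwareFilter(XMLFilterBase):
    """Hypothetical cut-down filter that also receives comment events."""

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()

    def endElement(self, name):
        self._complete_text_node()

    def characters(self, text):
        self._accumulator.append(text)

    # LexicalHandler interface, written out by hand so the sketch does not
    # depend on where a LexicalHandler base class is importable from
    def comment(self, body):
        self._complete_text_node()
        self._downstream.comment(body)

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass

    def parse(self, source):
        # Register this filter as the parser's lexical handler, then parse
        self._parent.setProperty(property_lexical_handler, self)
        XMLFilterBase.parse(self, source)


class Recorder:
    """Hypothetical downstream object that logs text and comment events."""

    def __init__(self):
        self.events = []

    def characters(self, text):
        self.events.append(('characters', text))

    def comment(self, body):
        self.events.append(('comment', body))


recorder = Recorder()
CommentAwareFilter(sax.make_parser(), recorder).parse(
    io.BytesIO(b"<doc>a<!-- note -->b</doc>"))
print(recorder.events)
```

With the comment event wired in, the text before the comment should be flushed ahead of it, so the text on either side of the comment arrives as two separate nodes rather than being merged across it.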

James Kew 20 years, 3 months ago

Hard to chain. Taking both upstream and downstream parameters in the __init__ makes it hard to chain this with other filters.

I prefer to keep the XMLFilterBase __init__ signature:

def __init__(self, parent):
    XMLFilterBase.__init__(self, parent)
    self._accumulator = []

and then, rather than calling handlers on the downstream handler directly, delegate to the default XMLFilterBase handlers. These call out to whatever handlers have been set on the filter (with setContentHandler, etc).

def startElement(self, name, attrs):
    self._complete_text_node()
    XMLFilterBase.startElement(self, name, attrs)

This lets you chain filters like so:

parser = sax.make_parser()
filtered_parser = text_normalize_filter(some_other_filter(parser))

and also lets you use a filter in contexts which call the parse method on your behalf:

doc = xml.dom.minidom.parse(input_file, parser=filtered_parser)
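
The chaining approach can be sketched end to end as follows. This is an illustration under assumptions, not the commenter's code: ChainableNormalizeFilter is a cut-down filter written in the delegating style described above, UppercaseTextFilter stands in for some_other_filter, and EventRecorder is a hypothetical handler used only to show what comes out of the chain.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class ChainableNormalizeFilter(XMLFilterBase):
    """Cut-down text-merging filter keeping the XMLFilterBase signature."""

    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []

    def _complete_text_node(self):
        # Delegate to XMLFilterBase, which forwards to the handler set
        # with setContentHandler
        if self._accumulator:
            XMLFilterBase.characters(self, ''.join(self._accumulator))
            self._accumulator = []

    def startElement(self, name, attrs):
        self._complete_text_node()
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        XMLFilterBase.endElement(self, name)

    def characters(self, content):
        self._accumulator.append(content)


class UppercaseTextFilter(XMLFilterBase):
    """Hypothetical second filter: upper-cases text as it passes through."""

    def characters(self, content):
        XMLFilterBase.characters(self, content.upper())


class EventRecorder(ContentHandler):
    """Hypothetical handler that logs element and text events."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('startElement', name))

    def endElement(self, name):
        self.events.append(('endElement', name))

    def characters(self, content):
        self.events.append(('characters', content))


parser = sax.make_parser()
# Events flow parser -> UppercaseTextFilter -> ChainableNormalizeFilter
filtered_parser = ChainableNormalizeFilter(UppercaseTextFilter(parser))

recorder = EventRecorder()
filtered_parser.setContentHandler(recorder)
filtered_parser.parse(io.BytesIO(b"<doc>spam &amp; eggs</doc>"))
print(recorder.events)
```

Because the outermost filter merges text last, the recorder sees one characters event, 'SPAM & EGGS', however the parser chose to split the text.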

Günter Jantzen 20 years ago

Quite nice, but ... the filter fails on this little file, which contains a German umlaut:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Hallo Welt = "schön"/>

Traceback

...
File "C:\Progs\python\lib\xml\sax\saxutils.py", line 83, in startElement
UnicodeError: ASCII encoding error: ordinal not in range(128)

Günter Jantzen 20 years ago

Sorry for posting my last comment twice. I think I was wrong: printing Unicode in the shell is not a use case. So I will rate the recipe with a lot of stars :-)