
This code shows how to use the little-known LexicalHandler interface, an extension to the standard SAX2 interfaces such as ContentHandler (we assume you already have some SAX2 know-how).

Python, 79 lines
# echoxml.py

import sys
from xml.sax import sax2exts, saxutils, handler
from xml.sax import SAXNotSupportedException, SAXNotRecognizedException

class EchoGenerator(saxutils.XMLGenerator):

    def __init__(self, out=None, encoding="iso-8859-1"):
        saxutils.XMLGenerator.__init__(self, out, encoding)
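        self._in_entity = 0  # set while inside an entity's replacement text
        self._in_cdata = 0   # set while inside a CDATA section

    def characters(self, content):
        if self._in_entity:
            return  # replacement text is dropped; startEntity wrote the reference
        elif self._in_cdata:
            self._out.write(content)  # CDATA content is written raw, unescaped
        else:
            saxutils.XMLGenerator.characters(self, content)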

    # -- LexicalHandler interface

    def comment(self, content):
        self._out.write('<!--%s-->' % content)

    def startDTD(self, name, public_id, system_id):
        self._out.write('<!DOCTYPE %s' % name)
        if public_id:
            self._out.write(' PUBLIC %s %s' % (
                saxutils.quoteattr(public_id),
                saxutils.quoteattr(system_id)))
        elif system_id:
            self._out.write(' SYSTEM %s' % saxutils.quoteattr(system_id))

    def endDTD(self):
        self._out.write('>\n')

    def startEntity(self, name):
        self._out.write('&%s;' % name)
        self._in_entity = 1

    def endEntity(self, name):
        self._in_entity = 0

    def startCDATA(self):
        self._out.write('<![CDATA[')
        self._in_cdata = 1

    def endCDATA(self):
        self._out.write(']]>')
        self._in_cdata = 0


def test(xmlfile):
    parser = sax2exts.make_parser([
        'pirxx',
        'xml.sax.drivers2.drv_xmlproc',
        'xml.sax.drivers2.drv_pyexpat',
    ])
    print >>sys.stderr, "*** Using", parser

    try:
        parser.setFeature(handler.feature_namespaces, 1)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass
    try:
        parser.setFeature(handler.feature_validation, 0)
    except (SAXNotRecognizedException, SAXNotSupportedException):
        pass

    saxhandler = EchoGenerator()
    parser.setContentHandler(saxhandler)
    parser.setProperty(handler.property_lexical_handler, saxhandler)
    parser.parse(xmlfile)


if __name__ == "__main__":
    test('books.xml')

In addition to the standard SAX2 events, a LexicalHandler receives events for things in an XML document that a SAX2 parser does not usually report: comments, DTD information, entities and CDATA sections. You can thus get at information otherwise hidden from you, which means a read/modify/write application can reproduce a document much closer to its original representation than is possible with plain SAX2. The code does just that: it parses a file and does its best to echo it unchanged to standard output.

You can pass a LexicalHandler instance to the parser by using the "http://xml.org/sax/properties/lexical-handler" property.
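As a minimal sketch of that registration step on a current interpreter, the following assumes Python 3 and the standard library's xml.sax (expat driver) rather than the PyXML drivers used in the recipe. The names LexicalRecorder and record_lexical_events are made up for illustration. It only records the lexical events in document order, it does not echo anything.

```python
import io
import xml.sax
from xml.sax import handler


class LexicalRecorder:
    """Hypothetical helper: collects lexical events as tuples, in order."""

    def __init__(self):
        self.events = []

    def comment(self, content):
        self.events.append(("comment", content))

    def startDTD(self, name, public_id, system_id):
        self.events.append(("startDTD", name, public_id, system_id))

    def endDTD(self):
        self.events.append(("endDTD",))

    def startEntity(self, name):
        self.events.append(("startEntity", name))

    def endEntity(self, name):
        self.events.append(("endEntity", name))

    def startCDATA(self):
        self.events.append(("startCDATA",))

    def endCDATA(self):
        self.events.append(("endCDATA",))


def record_lexical_events(xml_bytes):
    """Parse xml_bytes and return the lexical events the driver reported."""
    recorder = LexicalRecorder()
    parser = xml.sax.make_parser()
    # property_lexical_handler is the constant for
    # "http://xml.org/sax/properties/lexical-handler".
    parser.setProperty(handler.property_lexical_handler, recorder)
    parser.parse(io.BytesIO(xml_bytes))
    return recorder.events


if __name__ == "__main__":
    sample = b"<!DOCTYPE doc><doc><!-- hi --><![CDATA[a < b]]></doc>"
    for event in record_lexical_events(sample):
        print(event)
```

The recorder is a plain class on purpose: the parser only needs an object with the right method names, so nothing here depends on a LexicalHandler base class being available.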

Still, you lose some things, especially in the document leader (the part of the document before the root element). A possible improvement is thus to copy the document leader literally from the source file to the output. This can be done by using a SAX2 locator, which tells you, within the first startElement event, the location of the root element. (At startDocument time the parser has not yet read that far, so the locator still points at the start of the document.) Using that information, you can copy the document leader verbatim, and then append the document proper.
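The leader-copying idea could look roughly like this. It is a sketch under several assumptions, not a tested part of the recipe: it targets Python 3 with the standard library's expat driver, it reads the locator when the first startElement event arrives, it assumes the locator then points at the "<" of the root start tag with a 1-based line and 0-based column, and it assumes "\n" line endings. RootLocator, _RootFound and document_leader are hypothetical names.

```python
import io
import xml.sax
from xml.sax import handler


class _RootFound(Exception):
    """Raised to abandon the parse once the root element is located."""


class RootLocator(handler.ContentHandler):
    """Hypothetical helper: remembers where the root start tag begins."""

    def __init__(self):
        super().__init__()
        self._locator = None
        self.position = None  # (line, column) of the root start tag

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        # The first startElement event belongs to the root element.
        self.position = (self._locator.getLineNumber(),
                         self._locator.getColumnNumber())
        raise _RootFound()


def document_leader(text):
    """Return everything in text that precedes the root element."""
    finder = RootLocator()
    parser = xml.sax.make_parser()
    parser.setContentHandler(finder)
    try:
        parser.parse(io.StringIO(text))
    except _RootFound:
        pass
    if finder.position is None:
        return text  # no root element was seen
    line, column = finder.position
    lines = text.split("\n")
    # Assumption: lines are 1-based, columns 0-based, newlines are "\n".
    offset = sum(len(ln) + 1 for ln in lines[:line - 1]) + column
    return text[:offset]


if __name__ == "__main__":
    doc = '<?xml version="1.0"?>\n<!-- leader comment -->\n<doc>text</doc>'
    print(repr(document_leader(doc)))
```

An echo program could write the returned leader first and then suppress its own output for everything up to the root element. Documents with non-ASCII characters in the leader deserve extra care, since a driver may count columns in bytes rather than characters.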

My tests using Python 2.1, PyXML 0.7 (from CVS) and PIRXX 1.2 indicate that PIRXX (i.e. Xerces/C) reports all events, xmlproc leaves out the start/end entity events, and pyexpat omits those as well as the start/end DTD events.
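To repeat this kind of comparison with whatever driver you have at hand, a small probe along the following lines may help. It is only a sketch: it assumes Python 3 and a parser that accepts the lexical-handler property, and EventProbe, SAMPLE and missing_lexical_events are invented names. Results on a current interpreter need not match the ones reported above for the old drivers.

```python
import io
import xml.sax
from xml.sax import handler

LEXICAL_EVENTS = ("comment", "startDTD", "endDTD", "startEntity",
                  "endEntity", "startCDATA", "endCDATA")

# A tiny document that gives every lexical event a chance to fire.
SAMPLE = (b'<!DOCTYPE doc [<!ENTITY e "entity text">]>'
          b'<doc><!-- c -->&e;<![CDATA[x]]></doc>')


class EventProbe:
    """Hypothetical helper: notes the name of each callback that fires."""

    def __init__(self):
        self.seen = set()

    def __getattr__(self, name):
        # Only reached for attributes not found normally, that is, for
        # the callback names; anything else is a genuine error.
        if name not in LEXICAL_EVENTS:
            raise AttributeError(name)

        def callback(*args):
            self.seen.add(name)

        return callback


def missing_lexical_events(parser):
    """Return the sorted names of lexical events the parser never sent."""
    probe = EventProbe()
    parser.setProperty(handler.property_lexical_handler, probe)
    parser.parse(io.BytesIO(SAMPLE))
    return sorted(set(LEXICAL_EVENTS) - probe.seen)


if __name__ == "__main__":
    print("not reported:", missing_lexical_events(xml.sax.make_parser()))
```

A driver that does not support the property at all should raise SAXNotRecognizedException or SAXNotSupportedException from setProperty, so wrap the call accordingly if you loop over several drivers.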