People often ask how to extract the text from an XML document. This small program does it.
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    # SAX calls characters() for every run of character data between tags.
    def characters(self, ch):
        sys.stdout.write(ch)

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
# Stream through the file; only the text content reaches stdout.
parser.parse("test.xml")
Sometimes you want to strip the XML tags from a document, for example to re-key it or to spell-check it. This works with any well-formed XML document. It is quite efficient, because SAX streams through the input and hands over the text piece by piece rather than building a tree in memory. If the document isn't well-formed, you could try a solution based on the XML lexer described in another recipe, "XML lexing".
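If the text is needed as a string, say to hand it to a spell checker, rather than printed, the same idea can be sketched with a handler that buffers what it receives. This is only a minimal sketch for Python 3: the names TextCollector, extract_text and is_well_formed are hypothetical and not part of the recipe, and the well-formedness check simply relies on xml.sax raising SAXParseException when the parser hits a fatal error.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical variant of textHandler: buffers text instead of printing it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, ch):
        # SAX may deliver one text node in several pieces, so collect them all.
        self.chunks.append(ch)


def extract_text(xml_bytes):
    """Return the character data of a well-formed XML document given as bytes."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


def is_well_formed(xml_bytes):
    """Return False if the parser rejects the input, True otherwise."""
    try:
        # A bare ContentHandler ignores every event; we only care about errors.
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


if __name__ == "__main__":
    sample = b"<doc><p>Hello, <b>world</b>!</p></doc>"
    if is_well_formed(sample):
        print(extract_text(sample))
```

Checking with is_well_formed first is one way to decide whether to use the SAX approach or to fall back on a lexer-based solution for broken input.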