People often ask how to extract the text from an XML document. This small program does it.
Python, 12 lines
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
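Sometimes you want to strip the XML tags from a document, for example to re-key it or to run a spell checker over it. This works with any well-formed XML document whose text can be encoded as Latin-1; characters outside that range make the encode call raise an error. It is efficient because SAX streams the input instead of building a tree in memory. If the document isn't well-formed, you could try a solution based on the XML lexer described in another recipe called "XML lexing".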
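For spell checking it can be handier to get the text back as a string than to print it as it arrives. The following is a minimal Python 3 sketch of the same SAX approach under that assumption; the names `TextCollector` and `extract_text` and the sample document are made up for illustration and are not part of the original recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Accumulate character data instead of printing it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver one run of text in several pieces, so keep them all.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Return the concatenated character data of a well-formed XML document."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# Hypothetical sample input, only for demonstration.
sample = b"<doc><title>Hello</title><p>spell <b>check</b> me</p></doc>"
print(extract_text(sample))
```

Input that is not well-formed makes the parser raise `xml.sax.SAXParseException`, which matches the recipe's restriction to well-formed documents.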
Direct link to the author's "XML lexing" recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65125
Another way.

from sgmllib import SGMLParser

class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data

XMLJustText().feed("text 1text 2")
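The `sgmllib` module was removed in Python 3, so the snippet above only runs on Python 2. As a rough stand-in, and not the commenter's method, the same `handle_data` idea can be sketched with `html.parser.HTMLParser` from the standard library. The class name `JustText` and the tagged sample string are assumptions for illustration. Note that this parser treats its input as HTML and is lenient about markup that is not well-formed.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


parser = JustText()
# Hypothetical sample markup, only for demonstration.
parser.feed("<doc>text 1<item>text 2</item></doc>")
parser.close()  # flush any text still buffered by the parser
print("".join(parser.chunks))
```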
How can the output be written to a file?
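One possible answer, as a minimal Python 3 sketch: pass an open file object to the handler and write each piece of text to it instead of to standard output. The names `TextWriter` and `xml_text_to_file`, the UTF-8 output encoding, and the sample input are assumptions for illustration; the original recipe encodes to Latin-1 and reads "test.xml".

```python
import io
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler


class TextWriter(ContentHandler):
    """Write character data to a file-like object as it is parsed."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def characters(self, content):
        self.out.write(content)


def xml_text_to_file(source, out_path, encoding="utf-8"):
    """Parse `source` (a filename or file object) and save its text to `out_path`."""
    with open(out_path, "w", encoding=encoding) as out:
        xml.sax.parse(source, TextWriter(out))


# Demonstration with an in-memory document and a temporary output file.
sample = io.BytesIO(b"<doc>text 1<item>text 2</item></doc>")
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
xml_text_to_file(sample, out_path)

with open(out_path, encoding="utf-8") as f:
    print(f.read())
```

For a document on disk, a call such as `xml_text_to_file("test.xml", "test.txt")` would do the same thing with the recipe's input file.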