People often ask how to extract the text from an XML document. This small program does it.
    from xml.sax.handler import ContentHandler
    import xml.sax
    import sys

    class textHandler(ContentHandler):
        # The parser calls characters() for each run of text between tags.
        def characters(self, ch):
            # Python 2: ch is a unicode string, so encode it before writing.
            sys.stdout.write(ch.encode("Latin-1"))

    parser = xml.sax.make_parser()
    handler = textHandler()
    parser.setContentHandler(handler)
    parser.parse("test.xml")
Sometimes you want to strip the XML tags from a document, for example to re-key it or to spell-check it. This works with any well-formed XML document. It is efficient because SAX streams the input and never builds the whole document tree in memory. If the document isn't well-formed, you could try a solution based on the XML lexer described in another recipe, "XML lexing".
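For the spell-checking case it is often handier to get the text back as a string than to print it. The following is a minimal sketch of that idea, assuming Python 3; the names TextCollector and extract_text are made up for illustration and are not part of the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers character data instead of printing it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver a single text node in several pieces,
        # so collect the pieces and join them at the end.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Return the concatenated text of a well-formed XML document."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


print(extract_text(b"<doc><p>Hello, <b>world</b>!</p></doc>"))
```

If the input is not well-formed, parseString raises xml.sax.SAXParseException, which matches the restriction mentioned above.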
Tags: xml
Direct link to the author's "XML lexing" recipe: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65125
Another way.

    from sgmllib import SGMLParser

    class XMLJustText(SGMLParser):
        def handle_data(self, data):
            print data

    XMLJustText().feed("text 1text 2")
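sgmllib was removed in Python 3, so the snippet above only runs on Python 2. A rough Python 3 counterpart can be sketched with html.parser.HTMLParser, which has the same handle_data hook. Note that this swaps in an HTML parser rather than an XML one, on the assumption that its tolerance for sloppy markup is what you want; JustText and strip_tags are made-up names, and the sample markup in the usage line is invented for illustration.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Hypothetical Python 3 counterpart of the sgmllib-based XMLJustText."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text of `markup` with all tags removed."""
    parser = JustText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<a>text 1<b>text 2</b></a>"))
```

Unlike the SAX version, this does not complain about unclosed tags, so it will not tell you when the document is malformed.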
How do I write the output to a file?
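One way to do it, sketched here for Python 3, is to hand the handler an open file object and write to that instead of sys.stdout. TextToFile, xml_text_to_file, and the file name out.txt are made-up for illustration, and UTF-8 is an assumed output encoding; pick whatever encoding your downstream tool expects.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextToFile(ContentHandler):
    """Hypothetical handler that writes character data to an open text file."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def characters(self, content):
        # content is a str; the file object takes care of the encoding.
        self.out.write(content)


def xml_text_to_file(xml_path, out_path, encoding="utf-8"):
    """Parse the XML file at xml_path and write its text to out_path."""
    # The with-block closes the output file even if parsing fails.
    with open(out_path, "w", encoding=encoding) as out:
        xml.sax.parse(xml_path, TextToFile(out))
```

Usage would then be a single call such as xml_text_to_file("test.xml", "out.txt").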