
People often ask how to extract the text from an XML document. This small program does it.

Python, 12 lines
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch)  # ch is already str on Python 3; no manual encoding needed

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

Sometimes you want to strip the XML tags from a document, for example to re-key it or to spell-check it. This works with any well-formed XML document, and because SAX streams the input instead of building a tree in memory, it is quite efficient. If the document isn't well-formed, you could try a solution based on the XML lexer described in another recipe, "XML lexing".
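If you would rather have the text as a string than on stdout, and want to see what happens with input that isn't well-formed, the following is a minimal sketch for Python 3. The names CollectingHandler and extract_text are hypothetical, not part of the recipe, and returning None on a parse error is just one possible choice.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CollectingHandler(ContentHandler):
    """Hypothetical handler: gathers character data in a list."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, ch):
        # SAX may deliver one run of text in several pieces, so collect them all.
        self.chunks.append(ch)


def extract_text(xml_bytes):
    """Return the text of a well-formed document, or None if parsing fails."""
    handler = CollectingHandler()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException:
        # Raised by the default error handler when the input isn't well-formed.
        return None
    return "".join(handler.chunks)


print(extract_text(b"<doc><p>text 1</p><p>text 2</p></doc>"))
print(extract_text(b"<doc><p>unclosed</doc>"))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation on large documents.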

3 comments

Bill Bell 17 years, 8 months ago

Another way.

from sgmllib import SGMLParser

class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data

XMLJustText().feed("text 1text 2")
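The sgmllib module was removed in Python 3. A roughly similar sketch can be written with html.parser.HTMLParser, which also has a handle_data callback. Treat this as an approximation rather than a drop-in replacement: HTMLParser is an HTML parser, so it applies HTML rules (for example to script and style elements) instead of XML ones. The class name JustText and the sample markup passed to feed are made up for illustration.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Hypothetical Python 3 counterpart of the sgmllib-based class."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


parser = JustText()
parser.feed("<a>text 1<b>text 2</b></a>")  # sample markup, an assumption
parser.close()  # flush any buffered trailing text
print("".join(parser.chunks))
```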

Tester 10 years, 8 months ago

How do I write the output to a file?
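One possible approach, sketched for Python 3: give the handler a file object and write to that instead of sys.stdout. The names FileTextHandler and xml_text_to_file, the output path out.txt, and the UTF-8 default are assumptions for this example, and the in-memory document stands in for test.xml.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class FileTextHandler(ContentHandler):
    """Hypothetical variant of textHandler that writes to a given file object."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def characters(self, ch):
        self.out.write(ch)


def xml_text_to_file(source, out_path, encoding="utf-8"):
    # source may be a filename or a file-like object, as accepted by xml.sax.parse.
    with open(out_path, "w", encoding=encoding) as out:
        xml.sax.parse(source, FileTextHandler(out))


# An in-memory document stands in for test.xml here.
xml_text_to_file(io.BytesIO(b"<doc>text 1<b>text 2</b></doc>"), "out.txt")
```

To process a file on disk, pass its name instead, for example xml_text_to_file("test.xml", "out.txt").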