This small script will check whether one or more XML documents are well-formed.
    from glob import glob
    import sys
    from xml.sax import make_parser
    from xml.sax.handler import ContentHandler

    def parsefile(file):
        # A bare ContentHandler ignores every event, so the parser
        # only has to verify that the document is well-formed.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(file)

    for arg in sys.argv[1:]:
        # Expand wildcards here, for shells that do not do it themselves.
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))
This uses the SAX API with a "dummy" ContentHandler that does nothing. It parses the whole document and raises an exception at the first well-formedness error. The exception will be caught and printed like this:
    $ python wellformed.py test.xml
    test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
This means that the parser found a mismatched tag at line 1002, column 2.
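If you want the line and column as separate values instead of a formatted string, the exception itself carries them. The following is a minimal sketch for Python 3, not part of the original recipe; the helper name check_wellformed is hypothetical, and it relies on the standard SAXParseException accessors getLineNumber(), getColumnNumber() and getMessage().

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler

def check_wellformed(source):
    """Return None if source parses cleanly, else (line, column, message).

    source may be a filename or a file-like object opened in binary mode.
    (Hypothetical helper, for illustration only.)
    """
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    try:
        parser.parse(source)
    except SAXParseException as e:
        # The exception records where the parser stopped.
        return (e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    return None

# Example: <b> is never closed before </a>.
sample = io.BytesIO(b"<a><b></a>")
print(check_wellformed(sample))
```

Returning a tuple rather than printing makes it easy to feed the position to an editor or a report generator.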
The script does not check adherence to a DTD or schema; validation is a separate issue. Performance should be quite good, since the handler does no work and the cost is essentially that of the parse itself. You can find out more about SAX handlers in the recipe "Count tags".
Using "expat" directly to get the best performance.
You get better performance by using the "expat" parser directly. I changed the function parsefile() so that it now calls the "expat" parser instead of going through SAX. This matters when you check the well-formedness of a big XML file.
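The commenter's modified code is not reproduced here, so the following is only a sketch of what such a parsefile() might look like in Python 3, assuming the standard xml.parsers.expat module. With no handlers registered, expat simply scans the document and raises expat.ExpatError at the first well-formedness error.

```python
from xml.parsers import expat

def parsefile(file):
    # Assumed reconstruction, not the commenter's original code:
    # create a bare expat parser with no callbacks attached.
    parser = expat.ParserCreate()
    # ParseFile reads the document in chunks from a binary file
    # object and raises expat.ExpatError if it is not well-formed.
    with open(file, "rb") as f:
        parser.ParseFile(f)
```

Because ExpatError is an ordinary exception, the try/except loop in the main script keeps working unchanged; the error text then comes from expat rather than from the SAX layer, and the exception's lineno and offset attributes give the position.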
Checking for validity.
A modification of the script that checks the validity of an XML document with an internal DTD