This small script will check whether one or more XML documents are well-formed.
    from xml.sax.handler import ContentHandler
    from xml.sax import make_parser
    from glob import glob
    import sys

    def parsefile(file):
        # Parse with a do-nothing handler; the parser raises on any error.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(file)

    for arg in sys.argv[1:]:
        # Expand wildcards ourselves, for shells that do not.
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))
This uses the SAX API with a "dummy" ContentHandler that does nothing. It parses the whole document and throws an exception if there is an error. The exception will be caught and printed like this:
    $ python wellformed.py test.xml
    test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
This means the parser found a mismatched tag at line 1002, column 2 of test.xml.
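If the position is needed as numbers rather than as part of a message, one option is to catch xml.sax.SAXParseException, which provides getLineNumber(), getColumnNumber() and getMessage(). A minimal sketch follows; the helper name find_error and the sample documents are illustrative only and not part of the recipe:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def find_error(data):
    """Return None if data is well-formed, else (line, column, message).

    find_error is a hypothetical helper, not part of the original script.
    """
    try:
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as e:
        return (e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    return None


# Sample inputs made up for this sketch.
good = b"<root><item/></root>"
bad = b"<root>\n<item>\n</root>"  # <item> is never closed

print(find_error(good))
print(find_error(bad))
```

The first call returns None; the second returns a tuple whose first element is the line on which the parser gave up.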
The script does not check adherence to a DTD or schema; that is a separate issue. Performance should be quite good, since SAX streams the document instead of building a tree in memory. You can find out more about SAX handlers in the recipe "Count tags".
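To illustrate the point about DTDs, here is a small sketch with a made-up document that violates its own internal DTD. It assumes the default parser behind xml.sax is the non-validating expat parser, which is the case in a standard Python installation:

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The internal DTD declares <a> as EMPTY, yet the document gives it a child.
# That makes the document invalid, but it is still well-formed.
doc = b"""<?xml version="1.0"?>
<!DOCTYPE a [
  <!ELEMENT a EMPTY>
]>
<a><b/></a>
"""

try:
    xml.sax.parseString(doc, ContentHandler())
    result = "well-formed"
except xml.sax.SAXParseException as e:
    result = "NOT well-formed: %s" % e

print(result)
```

The parser accepts the document because only well-formedness is checked; the DTD violation goes unreported.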
Using "expat" directly to get the best performance. By using the "expat" parser directly, you get better performance. I changed the function parsefile() so that it now calls the "expat" parser instead. This matters when you check the well-formedness of a big XML file.
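A minimal sketch of such a parsefile(), assuming the standard-library xml.parsers.expat module. This is one possible version and not necessarily the code the comment refers to:

```python
import sys
from glob import glob
from xml.parsers import expat


def parsefile(filename):
    """Parse filename with a bare expat parser.

    No handlers are registered, so expat only checks well-formedness.
    Raises expat.ExpatError if the document is not well-formed.
    """
    parser = expat.ParserCreate()
    # Binary mode lets expat honour the encoding declared in the file.
    with open(filename, "rb") as f:
        parser.ParseFile(f)


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except expat.ExpatError as e:
                print("%s is NOT well-formed! %s" % (filename, e))
```

Catching expat.ExpatError instead of Exception keeps unrelated problems, such as a missing file, from being reported as a well-formedness error.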
Check for validity. A modification of the script that checks the validity of an XML document with an internal DTD