
This small script will check whether one or more XML documents are well-formed.

from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys

def parsefile(path):
    # A do-nothing ContentHandler: we only care whether parsing succeeds.
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    parser.parse(path)

for arg in sys.argv[1:]:
    # Expand wildcards ourselves in case the shell has not done so.
    for filename in glob(arg):
        try:
            parsefile(filename)
            print("%s is well-formed" % filename)
        except Exception as e:
            print("%s is NOT well-formed! %s" % (filename, e))

This uses the SAX API with a "dummy" ContentHandler that does nothing. The parser reads the whole document and raises an exception if it finds an error. The script catches the exception and prints it like this:

$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that the parser found a mismatched tag at line 1002, column 2 of test.xml.
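If you need the position as data rather than as a printed string, the exception object carries it. The following is a minimal sketch, assuming the parser raises xml.sax.SAXParseException for well-formedness errors; the function name check_wellformed and the use of in-memory bytes instead of a file name are illustrative choices, not part of the recipe.

```python
import io
from xml.sax import make_parser, SAXParseException
from xml.sax.handler import ContentHandler

def check_wellformed(data):
    """Return None if data (bytes) is well-formed, else (line, column, message)."""
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    try:
        # parse() accepts a file name, a URL, or a file-like object.
        parser.parse(io.BytesIO(data))
    except SAXParseException as e:
        return (e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    return None

if __name__ == "__main__":
    # The closing </a> does not match the open <b>, so this reports an error on line 3.
    print(check_wellformed(b"<a>\n<b>\n</a>"))
```

The returned tuple holds the same line, column, and message that appear in the printed error above.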

The script does not check adherence to a DTD or schema; that is a separate issue. Performance should be quite good, since SAX streams through the document without building a tree in memory. You can find out more about SAX handlers in the recipe "Count tags".
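To illustrate the difference between well-formed and valid, here is a small sketch. The document DOC and the helper is_wellformed are made up for this example. The internal DTD says that note must contain a to element, but the document contains from instead. Because the default SAX parser is non-validating, the check should still succeed.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

# Hypothetical sample: well-formed, but it violates its own internal DTD.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Alice</from></note>
"""

def is_wellformed(data):
    """Return True if data (bytes) parses without error, else False."""
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    try:
        parser.parse(io.BytesIO(data))
        return True
    except Exception:
        return False

if __name__ == "__main__":
    print(is_wellformed(DOC))                    # DTD violation is not detected
    print(is_wellformed(b"<note><to></note>"))   # mismatched tag is detected
```

Catching DTD violations requires a validating parser, which this recipe does not use.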

2 comments

Farhad Fouladi 21 years, 1 month ago

Using "expat" directly to get the best performance. Calling the "expat" parser directly gives better performance. I changed the function parsefile() so that it now calls the "expat" parser instead. This matters when you check the well-formedness of a big XML file.

import sys
import xml.parsers.expat
from glob import glob

def parsefile(path):
    parser = xml.parsers.expat.ParserCreate()
    # ParseFile reads bytes, so open the file in binary mode.
    with open(path, "rb") as f:
        parser.ParseFile(f)

for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print("%s is well-formed" % filename)
        except Exception as e:
            print("%s is NOT well-formed! %s" % (filename, e))
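
When calling expat directly, errors arrive as xml.parsers.expat.ExpatError rather than a SAX exception. A rough sketch of reading the error details, assuming a binary stream as input; the function name expat_check and the message format are illustrative only:

```python
import io
import xml.parsers.expat
from xml.parsers.expat import ExpatError, ErrorString

def expat_check(stream):
    """Return None if the binary stream is well-formed, else an error description."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.ParseFile(stream)
    except ExpatError as e:
        # code, lineno and offset are attributes of ExpatError.
        return "%s: line %d, column %d" % (ErrorString(e.code), e.lineno, e.offset)
    return None

if __name__ == "__main__":
    print(expat_check(io.BytesIO(b"<a/>")))
    print(expat_check(io.BytesIO(b"<a>\n<b></a>")))
```

Catching ExpatError instead of Exception keeps unrelated problems, such as a missing file, from being reported as well-formedness errors.
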
dmitry mozzherin 20 years, 4 months ago

Checking validity. A modification of the script that checks the validity of an XML document with an internal DTD.

#!/usr/bin/env python

# xmlproc comes from the third-party PyXML package, not the standard library.
from xml.parsers.xmlproc import xmlval
import sys

def parse_file(path):
    # XMLValidator parses the document and validates it against its DTD.
    parser = xmlval.XMLValidator()
    parser.parse_resource(path)

if len(sys.argv) != 2:
    print("Usage:\n\n    python %s filename\n" % sys.argv[0])
    sys.exit(1)

path = sys.argv[1]

try:
    parse_file(path)
    print("%s is well-formed and valid" % path)
except Exception as e:
    print("%s is not well-formed or not valid: %s" % (path, e))