ActiveState Code

Recipe 52256: Check xml well-formedness


This small script will check whether one or more XML documents are well-formed.

Python
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys

def parsefile(file):
    # The default ContentHandler ignores every event; we only care
    # whether the parser gets through the document without an error.
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    parser.parse(file)

for arg in sys.argv[1:]:
    # Expand wildcards ourselves, for shells that do not.
    for filename in glob(arg):
        try:
            parsefile(filename)
            print("%s is well-formed" % filename)
        except Exception as e:
            print("%s is NOT well-formed! %s" % (filename, e))

Discussion

This uses the SAX API with a "dummy" ContentHandler that does nothing. The parser reads the whole document and raises an exception if it finds an error. The script catches the exception and prints it like this:

$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that the parser found a mismatched tag at line 1002, column 2 of test.xml.
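
The position in that message can also be read from the exception itself. The following is a minimal sketch, assuming Python 3 and the standard xml.sax module; the helper name locate_error is hypothetical and not part of the recipe, and it parses an in-memory byte string rather than a file to keep the example self-contained.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def locate_error(xml_bytes):
    """Return (line, column, message) for the first error, or None if well-formed.

    locate_error is a hypothetical helper name used only for illustration.
    """
    try:
        # The do-nothing ContentHandler discards all events.
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as err:
        return err.getLineNumber(), err.getColumnNumber(), err.getMessage()
    return None

print(locate_error(b"<a><b></b></a>"))  # a well-formed document
print(locate_error(b"<a>\n<b></a>"))    # <b> is never closed
```

Catching SAXParseException rather than Exception keeps parse errors apart from unrelated failures such as a missing file.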

The script does not check adherence to a DTD or schema; validation is a separate issue. Performance should be good, because SAX streams through the document instead of building a tree in memory. You can find out more about SAX handlers in the recipe "Count tags".
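
To illustrate the difference between well-formed and valid, here is a small sketch, assuming Python 3 and the default expat-based SAX parser, which does not validate. The sample document and the helper name is_well_formed are made up for this example. The document breaks its own internal DTD but still parses without error.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The internal DTD declares <a> as EMPTY, yet <a> contains a child element.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE a [
  <!ELEMENT a EMPTY>
]>
<a><b/></a>
"""

def is_well_formed(xml_bytes):
    """Hypothetical helper: True if the bytes parse without a SAX parse error."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

print(is_well_formed(DOC))            # well-formed, although invalid against its DTD
print(is_well_formed(b"<a><b></a>"))  # not well-formed: <b> is never closed
```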

Comments

  1. At 4:59 a.m. on 2 Apr 2003, Farhad Fouladi said:

    Use "expat" directly for the best performance. Calling the "expat" parser directly is faster than going through SAX. I changed the function parsefile() so that it calls the "expat" parser instead. This matters when you check the well-formedness of a big XML file.

    import xml.parsers.expat, sys
    from glob import glob

    def parsefile(file):
        parser = xml.parsers.expat.ParserCreate()
        # Binary mode lets expat handle the document's encoding itself,
        # and the with block closes the file when parsing ends.
        with open(file, "rb") as f:
            parser.ParseFile(f)

    for arg in sys.argv[1:]:
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is %s" % (filename, e))
    
  2. At 2:45 p.m. on 4 Jan 2004, Dmitry Mozzherin said:

    Check for validity. A modification of the script that checks the validity of an XML document with an internal DTD:

    #!/usr/bin/env python
    
    from xml.parsers.xmlproc import xmlval
    import sys
    
    def parseFile(file):
        parser=xmlval.XMLValidator()
        parser.parse_resource(file)
    
    if len(sys.argv) != 2:
        print('''Usage:
    
    python %s filename
    
    ''' % sys.argv[0])
        sys.exit(1)  # usage error, so exit with a nonzero status
    
    file = sys.argv[1]
    
    try:
        parseFile(file)
        print("%s is well-formed and valid" % file)
    except Exception as e:
        print("%s is not well-formed or not valid %s!" % (file, e))
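
For the expat variant in the first comment, the error position and reason are available as attributes of the exception. This is a rough sketch, assuming Python 3 and the standard xml.parsers.expat module; check_bytes is a hypothetical name, and it parses an in-memory byte string so that the example needs no files.

```python
import xml.parsers.expat

def check_bytes(data):
    """Return None if data is well-formed, else a (line, column, reason) tuple.

    check_bytes is a hypothetical helper name used only for illustration.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        # ErrorString turns the numeric error code into a readable message.
        reason = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, reason
    return None

print(check_bytes(b"<a/>"))          # a well-formed document
print(check_bytes(b"<a>\n<b></a>"))  # <b> is never closed
```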
    
