Checking XML Well-Formedness

Credit: Paul Prescod

Problem

You need to check if an XML document is well-formed (not if it conforms to a DTD or schema), and you need to do this quickly.

Solution

SAX (presumably using a fast parser such as Expat underneath) is the fastest and simplest way to perform this task:

from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys

def parsefile(file):
    parser = make_parser(  )
    parser.setContentHandler(ContentHandler(  ))
    parser.parse(file)

for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print "%s is well-formed" % filename
        except Exception, e:
            print "%s is NOT well-formed! %s" % (filename, e)

Discussion

A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, it has a correct XML declaration and a single root element, all tags are properly nested, tag attributes are quoted, and so on.

This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document’s contents. But in this case, we only want to know if the document meets the most fundamental syntax constraints of XML; therefore, there is no processing that we need to do, and the do-nothing handler suffices.

The parsefile function parses the whole document and throws an exception if there is an error. The recipe’s main code catches any such exception and prints it out like this:

$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that character 2 on line 1,002 has a mismatched tag.

This recipe does not check adherence to a DTD or schema. That is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal irreducible core task.

See Also

Recipe 12.3, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab is a wrapper around the faster validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.

Get Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.