Credit: Paul Prescod
You need to check if an XML document is well-formed (not if it conforms to a DTD or schema), and you need to do this quickly.
SAX (presumably using a fast parser such as Expat underneath) is the fastest and simplest way to perform this task:
    from xml.sax.handler import ContentHandler
    from xml.sax import make_parser
    from glob import glob
    import sys

    def parsefile(file):
        # Parse with a do-nothing handler; a well-formedness error raises.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(file)

    for arg in sys.argv[1:]:
        # Expand wildcards ourselves, for shells that do not.
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))
A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, it has a single root element, all tags are properly nested, attribute values are quoted, any XML declaration it carries is correct, and so on.
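A few of these rules can be illustrated with a short sketch. The helper name is_well_formed and the sample strings are assumptions made for illustration; they are not part of the recipe. The sketch uses xml.sax.parseString from the standard library so that no files are needed.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def is_well_formed(data):
    """Return True if the bytes in `data` parse as well-formed XML.

    Hypothetical helper for illustration; the recipe itself works on files.
    """
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# Illustrative samples: only the first one obeys the basic syntax rules.
samples = [
    ("single root, nested tags", b"<a><b>text</b></a>"),
    ("improper nesting", b"<a><b>text</a></b>"),
    ("two root elements", b"<a/><b/>"),
    ("unquoted attribute", b"<a x=1/>"),
]

for label, data in samples:
    print("%-26s %s" % (label, is_well_formed(data)))
```

Only the first sample should be reported as well-formed; each of the others breaks one of the rules listed above.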
This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document’s contents. But in this case, we only want to know if the document meets the most fundamental syntax constraints of XML; therefore, there is no processing that we need to do, and the do-nothing handler suffices.
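For contrast, here is a minimal sketch of what a handler that does process content might look like, next to the do-nothing base class. The class name ElementCounter and the sample document are assumptions made for illustration only.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags as the parser reports them."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per element, including empty elements such as <item/>.
        self.count += 1

data = b"<doc><item/><item/></doc>"

# The base class ignores every event: parsing succeeds and nothing else happens.
xml.sax.parseString(data, ContentHandler())

# A subclass overrides only the callbacks it cares about.
counter = ElementCounter()
xml.sax.parseString(data, counter)
print(counter.count)  # 3: doc, item, item
```

Both calls perform the same syntax checking; the handler only decides what, if anything, is done with the events.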
The parsefile function parses the whole document and raises an exception if there is an error. The recipe’s main code catches any such exception and prints it out like this:
$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
This means that the parser found a mismatched tag at line 1,002, column 2 of test.xml.
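If the position is needed as data instead of as part of a message, the exception can be inspected directly. The following is a sketch, assuming the error is reported as an xml.sax.SAXParseException (which is what the standard Expat-based reader raises); the function name first_error and the sample document are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def first_error(data):
    """Return (line, column, message) for the first error, or None if there is none.

    Hypothetical helper for illustration.
    """
    try:
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as e:
        # These accessors supply the fields that the printed message is built from.
        return e.getLineNumber(), e.getColumnNumber(), e.getMessage()
    return None

# <item> is never closed, so the </doc> on line 3 does not match.
bad = b"<doc>\n  <item>\n  </doc>\n"
print(first_error(bad))

good = b"<doc><item/></doc>"
print(first_error(good))  # None
```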
This recipe does not check adherence to a DTD or schema. That is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal irreducible core task.
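The difference between the two checks can be seen with a document that is well-formed but not valid. This is a sketch, assuming the default SAX parser is the non-validating Expat driver that ships with standard Python; the sample document and the helper name is_well_formed are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def is_well_formed(data):
    """Hypothetical helper: True if `data` (bytes) parses without error."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# The internal DTD requires <doc> to contain exactly one <item>,
# but the document supplies none. It is invalid, yet syntactically fine.
invalid_but_well_formed = b"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (item)>
  <!ELEMENT item EMPTY>
]>
<doc/>
"""

# A non-validating parser reads the DTD but does not enforce it.
print(is_well_formed(invalid_but_well_formed))
```

Under the stated assumption this reports the document as well-formed; catching the missing item element would take a validating parser.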
See Recipe 12.3, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab is a wrapper around the faster validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.