Normalizing an XML Document

Credit: David Ascher, Paul Prescod

Problem

You want to compare two different XML documents using standard tools such as diff.

Solution

Normalize each XML document using the following recipe, then use a whitespace-insensitive diff tool:

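from xml.dom import minidom

# inputfname and outputfname are the paths of the source and normalized files
dom = minidom.parse(inputfname)
with open(outputfname, "w", encoding="utf-8") as out:
    dom.writexml(out)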

Discussion

Different editing tools munge XML differently. Some, like text editors, make no modification that is not explicitly done by the user. Others, such as XML-specific editors, sometimes change the order of attributes or automatically indent elements to facilitate the reading of raw XML. There are reasons for each approach, but unfortunately, the two approaches can lead to confusing differences—for example, if one author uses a plain editor while another uses a fancy XML editor, and a third person is in charge of merging the two sets of changes. In such cases, one should use an XML-difference engine. Typically, however, such tools are not easy to come by. Most are written in Java and don’t deal well with large XML documents (performing tree-diffs efficiently is a hard problem!).

Luckily, combinations of small steps can solve the problem nicely. First, normalize each XML document, then use a standard line-oriented diff tool to compare the normalized outputs. This recipe is a simple XML normalizer. All it does is parse the XML into a Document Object Model (DOM) and write it out. In the process, elements with no children are written in the more compact form (<foo/> rather than <foo></foo>), and, in Python versions before 3.8, attributes are sorted lexicographically. From Python 3.8 on, minidom preserves the attribute order found in the source, so differences in attribute order survive this step unless you sort the attributes yourself.
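As a minimal sketch of the normalization step, the function below works on strings rather than files and sorts attributes explicitly, since minidom in Python 3.8 and later no longer does so when writing. The name normalize_xml is hypothetical, not part of the recipe. Removing and re-adding attributes this way ignores any namespace metadata on the attribute nodes, which is assumed not to matter when the tree is only being serialized.

```python
from xml.dom import minidom


def normalize_xml(xml_text):
    """Return xml_text re-serialized by minidom with attributes sorted by name.

    Hypothetical helper for illustration; it only cares about the
    serialized text, not about the DOM tree it leaves behind.
    """
    dom = minidom.parseString(xml_text)
    # getElementsByTagName("*") visits every element, including the root.
    for element in dom.getElementsByTagName("*"):
        # Re-insert attributes in sorted order so they serialize that way.
        for name, value in sorted(element.attributes.items()):
            element.removeAttribute(name)
            element.setAttribute(name, value)
    return dom.toxml()


# Two spellings of the same content normalize to the same text.
first = normalize_xml('<doc><item b="2" a="1"></item></doc>')
second = normalize_xml('<doc><item a="1" b="2"/></doc>')
print(first == second)
```

Whitespace between elements is kept as text nodes, so indentation differences still reach the diff stage and have to be handled there.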

The second stage is easily done by using some options to the standard diff, such as the -w option, which ignores whitespace differences. Or you might want to use Python’s standard module difflib, whose ndiff function by default treats spaces and tabs as junk characters when matching within lines, and which has the advantage of being available on all platforms since Python 2.1.
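The comparison stage could be sketched with difflib roughly as follows. The helper names normalized_lines and diff_xml are hypothetical, and collapsing whitespace within each line is only a rough stand-in for diff -w, not an exact equivalent (it also collapses whitespace inside text content).

```python
import difflib
from xml.dom import minidom


def normalized_lines(xml_text):
    """Serialize xml_text through minidom and split the result into lines.

    Runs of whitespace inside each line are collapsed and blank lines are
    dropped, as a rough approximation of a whitespace-insensitive diff.
    """
    serialized = minidom.parseString(xml_text).toxml()
    collapsed = (" ".join(line.split()) for line in serialized.splitlines())
    return [line for line in collapsed if line]


def diff_xml(old_text, new_text):
    """Return unified-diff lines for two XML strings after normalization."""
    return list(difflib.unified_diff(
        normalized_lines(old_text),
        normalized_lines(new_text),
        fromfile="old", tofile="new", lineterm="",
    ))


old = '<doc>\n  <item id="1"></item>\n</doc>'
new = '<doc>\n  <item id="2"/>\n</doc>'
print("\n".join(diff_xml(old, new)))
```

An empty result means the two documents differ only in ways the normalization and whitespace collapsing smooth over.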

There’s a slight problem that shows up if you use this recipe unaltered. The standard way in which minidom escapes quotation marks on output results in every " inside an element appearing as &quot;. This won’t make a difference to smart XML editors, but it’s not a nice thing to do for people reading the output with vi or emacs. Luckily, fixing minidom from the outside isn’t hard:

def _write_data(writer, data):
    "Writes datachars to writer, leaving quotation marks unescaped."
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace(">", "&gt;")
    writer.write(data)

def my_writexml(self, writer, indent="", addindent="", newl=""):
    _write_data(writer, "%s%s%s" % (indent, self.data, newl))

minidom.Text.writexml = my_writexml

Here, we replace the writexml method of Text nodes with a version that calls a new _write_data function identical to the one in minidom, except that the escaping of quotation marks is skipped. Because the replacement is made on the Text class itself, it must be in place before the document is written out with writexml; doing it before the call to minidom.parse is the simplest way to be sure.
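If changing minidom globally is a concern, one possible variation is to swap the method in only for the duration of a single serialization. The sketch below is an assumption layered on the recipe, not part of it; toxml_plain_quotes and the underscore-prefixed helpers are hypothetical names.

```python
from xml.dom import minidom


def _write_text(writer, data):
    """Escape &, < and > but leave quotation marks alone."""
    data = data.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    writer.write(data)


def _text_writexml(self, writer, indent="", addindent="", newl=""):
    _write_text(writer, "%s%s%s" % (indent, self.data, newl))


def toxml_plain_quotes(xml_text):
    """Serialize xml_text with quotes in text nodes left unescaped.

    The patched method is installed only while serializing and the
    original is restored afterwards, even if parsing raises.
    """
    original = minidom.Text.writexml
    minidom.Text.writexml = _text_writexml
    try:
        return minidom.parseString(xml_text).toxml()
    finally:
        minidom.Text.writexml = original


print(toxml_plain_quotes('<p>say "hi" &amp; go</p>'))
```

Attribute values are written by a separate code path in minidom, so quotation marks inside them remain escaped, which is what well-formed output requires.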
