Credit: David Ascher, Paul Prescod
Normalize each XML document using the following recipe, then use a whitespace-insensitive diff tool:
    from xml.dom import minidom

    # input is the XML file (name or file object); outputfname is the output path
    dom = minidom.parse(input)
    with open(outputfname, "w") as out:
        dom.writexml(out)
Different editing tools munge XML differently. Some, like text editors, make no modification that is not explicitly done by the user. Others, such as XML-specific editors, sometimes change the order of attributes or automatically indent elements to facilitate the reading of raw XML. There are reasons for each approach, but unfortunately, the two approaches can lead to confusing differences—for example, if one author uses a plain editor while another uses a fancy XML editor, and a third person is in charge of merging the two sets of changes. In such cases, one should use an XML-difference engine. Typically, however, such tools are not easy to come by. Most are written in Java and don’t deal well with large XML documents (performing tree-diffs efficiently is a hard problem!).
Luckily, combinations of small steps can solve the problem nicely.
First, normalize each XML document, then use a standard line-oriented
diff tool to compare the normalized outputs.
This recipe is a simple XML normalizer. All it does is parse the XML
into a Document Object Model (DOM) and write it out. In the process,
elements with no children are written in the more compact form
(<foo/> rather than <foo></foo>), and attributes are sorted
lexicographically. (Since Python 3.8, minidom preserves the original
attribute order instead of sorting.)
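As a minimal sketch of the normalization step, the snippet below parses XML from a string instead of a file and serializes the root element back out. The helper name normalize_xml is hypothetical and not part of the recipe; parseString and toxml are standard minidom calls.

```python
from xml.dom import minidom


def normalize_xml(text):
    """Parse XML text and return minidom's serialization of the root element."""
    dom = minidom.parseString(text)
    # Serializing the root element (not the whole document) leaves out
    # the XML declaration, which keeps the output easy to compare.
    return dom.documentElement.toxml()


print(normalize_xml("<root><foo></foo></root>"))  # prints <root><foo/></root>
```

Both spellings of an empty element come out the same way, so a later line-oriented diff no longer reports them as a change.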
The second stage is easily done by using some options to the standard
diff, such as the -w option, which ignores whitespace differences. Or
you might want to use Python’s standard module difflib, whose ndiff
function by default treats spaces and tabs as junk when matching
characters within lines, and which has the advantage of being
available on all platforms since Python 2.1.
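One way the comparison stage might look with difflib is sketched below. The helper names squeeze and diff_ignoring_whitespace are hypothetical, and collapsing whitespace before comparing is an assumption about what "whitespace-insensitive" should mean here: it is closer to diff -b than to diff -w, since whitespace is collapsed and trimmed, not removed entirely.

```python
import difflib


def squeeze(line):
    """Collapse each run of whitespace to one space and trim both ends."""
    return " ".join(line.split())


def diff_ignoring_whitespace(old_text, new_text):
    """Return unified-diff lines for two texts, ignoring whitespace layout."""
    old_lines = [squeeze(line) for line in old_text.splitlines()]
    new_lines = [squeeze(line) for line in new_text.splitlines()]
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile="old", tofile="new",
                                     lineterm=""))


# Two documents that differ only in indentation produce no diff lines.
print(diff_ignoring_whitespace("<a>\n  <b/>\n</a>", "<a>\n\t<b/>\n</a>"))  # prints []
```

The diff shows the squeezed lines, not the original ones, which is usually acceptable when the goal is only to find out where two normalized documents really differ.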
There’s a slight problem that shows up if you use this recipe
unaltered. The standard way in which minidom escapes quotation marks
on output results in every " inside an element’s text appearing as
&quot;. This won’t make a difference to smart XML editors, but it’s
not a nice thing to do for people reading the output with vi or emacs.
Luckily, fixing minidom from the outside isn’t hard:
    def _write_data(writer, data):
        "Writes datachars to writer."
        # Escape markup characters but leave quotation marks alone.
        data = data.replace("&", "&amp;")
        data = data.replace("<", "&lt;")
        data = data.replace(">", "&gt;")
        writer.write(data)

    def my_writexml(self, writer, indent="", addindent="", newl=""):
        _write_data(writer, "%s%s%s" % (indent, self.data, newl))

    # Replace the method that Text nodes use to write themselves out.
    minidom.Text.writexml = my_writexml
Here, we substitute the writexml method for Text nodes with a version
that calls a new _write_data function identical to the one in minidom,
except that the escaping of quotation marks is skipped. Naturally, the
replacement must be installed before the document is written out;
doing so before the call to minidom.parse is the simplest way to make
sure it is effective.