Transforming an XML Document Using Python

Credit: David Ascher

Problem

You have an XML document that you want to tweak.

Solution

Suppose that you want to convert element attributes into child elements. A simple subclass of the XMLGenerator object gives you complete freedom in such XML-to-XML transformation tasks:

from xml.sax import saxutils, make_parser
import sys

class Tweak(saxutils.XMLGenerator):
    def startElement(self, name, attrs):
        saxutils.XMLGenerator.startElement(self, name, {})
        attributes = attrs.keys(  )
        attributes.sort(  )
        for attribute in attributes:
            self._out.write("<%s>%s</%s>" % (attribute,
                            attrs[attribute], attribute))

parser = make_parser(  )
dh = Tweak(sys.stdout)
parser.setContentHandler(dh)
parser.parse(sys.argv[1])

Discussion

This particular recipe defines a Tweak subclass of the XMLGenerator class provided by the xml.sax.saxutils module. The only purpose of the subclass is to perform special handling of element starts while relying on its base class to do everything else. SAX is a nice and simple (after all, that’s what the S stands for) API for processing XML documents. It defines various kinds of events that occur when an XML document is being processed, such as startElement and endElement.

The key to understanding this recipe is to understand that Python’s XML library provides a base class, XMLGenerator, which performs an identity transform. If you feed it an XML document, it will output an equivalent XML document. Using standard Python object-oriented techniques of subclassing and method override, you are free to specialize how the generated XML document differs from the source. The code above simply takes each element (attributes and their values are passed in as a dictionary on startElement calls), relies on the base class to output the proper XML for the element (but omitting the attributes), and then writes an element for each attribute.

Subclassing the XMLGenerator class is a nice place to start when you need to tweak some XML, especially if your tweaks don’t require you to change the existing parent-child relationships. For more complex jobs, you may want to explore some other ways of processing XML, such as minidom or pulldom. Or, if you’re really into that sort of thing, you could use XSLT (see Recipe 12.5).

See Also

Recipe 12.5 for various ways of driving XSLT from Python; Recipe 12.2, Recipe 12.3, and Recipe 12.4 for other uses of the SAX API.

Get Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.