Credit: Paul Prescod
You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.
You can subclass SAX’s ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:
from xml.sax.handler import ContentHandler
import xml.sax

class countHandler(ContentHandler):
    def __init__(self):
        # Maps each tag name to the number of times it has been seen
        self.tags = {}
    def startElement(self, name, attr):
        # The parser calls this once at the start of every element
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1

parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

# Emit the counts in alphabetical order of tag name
tags = handler.tags.keys()
tags.sort()
for tag in tags:
    print tag, handler.tags[tag]
When I start with a new XML content set, I like to get a sense of which elements are in it and how often they occur, and I use variants of this recipe to do so. Attributes can be collected just as easily, because startElement receives them in its attr argument. If you add a stack, you can also keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
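The attribute and stack variants mentioned above might look roughly like the following sketch. It is written for Python 3, unlike the recipe's own listing, and the names NestingHandler, pairs, attributes, and SAMPLE are invented here for illustration; they are not part of the original recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingHandler(ContentHandler):
    """Hypothetical variant: counts parent/child pairs and attribute names."""

    def __init__(self):
        super().__init__()
        self.stack = []       # names of the elements that are currently open
        self.pairs = {}       # (parent name, child name) -> count
        self.attributes = {}  # (element name, attribute name) -> count

    def startElement(self, name, attrs):
        # The innermost open element is the parent; the root has none.
        parent = self.stack[-1] if self.stack else None
        key = (parent, name)
        self.pairs[key] = self.pairs.get(key, 0) + 1
        # Count every attribute name seen on this element.
        for attr_name in attrs.getNames():
            akey = (name, attr_name)
            self.attributes[akey] = self.attributes.get(akey, 0) + 1
        self.stack.append(name)

    def endElement(self, name):
        # Leaving the element, so it is no longer a candidate parent.
        self.stack.pop()


# SAMPLE is a made-up document used only to exercise the handler.
SAMPLE = b'<doc><sec id="a"><p/><p class="x"/></sec><p/></doc>'

handler = NestingHandler()
xml.sax.parseString(SAMPLE, handler)

for key in sorted(handler.pairs, key=str):
    print(key, handler.pairs[key])
```

Keying the dictionary on a (parent, child) tuple keeps the counting logic identical to the flat recipe; only the key changes.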
This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom.
These would be overkill for this simple job, though. For any simple
processing, this is generally the case, particularly if the document
you are processing is very large. DOM approaches are generally
justified only when you need to perform complicated editing and
alteration on an XML document, when the document itself is
complicated by references that go back and forth inside it, or when
you need to correlate (e.g., compare) multiple documents with each
other.
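For comparison, here is a minimal sketch of the same count done with minidom, written for Python 3. It is not part of the original recipe, and SAMPLE is a made-up document. Note that the whole tree is built in memory before any counting starts, which is the cost the text warns about for large documents.

```python
from xml.dom import minidom

# SAMPLE is a made-up document used only for illustration.
SAMPLE = b"<doc><p>one</p><p>two</p><note/></doc>"

# minidom parses the entire document into a tree first.
dom = minidom.parseString(SAMPLE)

counts = {}
# "*" is the DOM wildcard that matches every element in the tree.
for element in dom.getElementsByTagName("*"):
    counts[element.tagName] = counts.get(element.tagName, 0) + 1

# Release the tree explicitly once the counts have been collected.
dom.unlink()

for tag in sorted(counts):
    print(tag, counts[tag])
```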
ContentHandler subclasses offer many other options, and the online Python documentation does a good job of explaining them. This recipe’s countHandler class overrides ContentHandler’s startElement method, which the parser calls at the start of each element, passing as arguments the element’s tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).
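The recipe's listing uses Python 2 syntax. A rough Python 3 rendering of the same flow, offered as a sketch and not as the author's code, could look like this. It parses an in-memory document with xml.sax.parseString so that it runs without a test.xml file; the class name CountHandler and the SAMPLE document are assumptions made for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CountHandler(ContentHandler):
    """Count how many times each element name starts."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag.
        if name not in self.tags:
            self.tags[name] = 0
        self.tags[name] += 1


# SAMPLE is a made-up document standing in for test.xml.
SAMPLE = b"<doc><p>one</p><p>two</p><note/></doc>"

handler = CountHandler()
xml.sax.parseString(SAMPLE, handler)

# sorted() yields the tag names in alphabetical order.
for tag in sorted(handler.tags):
    print(tag, handler.tags[tag])
```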
In the implementation of this recipe, an alternative to testing the tags dictionary with has_key might offer a slightly more concise way to code the startElement method:
    def startElement(self, name, attr):
        self.tags[name] = 1 + self.tags.get(name, 0)
This counting idiom for dictionaries is so frequent that it’s probably worth encapsulating in its own function despite its utter simplicity:
def count(adict, key, delta=1, default=0):
    adict[key] = delta + adict.get(key, default)
Using this, you could code the startElement method in the recipe as:
    def startElement(self, name, attr):
        count(self.tags, name)
See also Recipe 12.2, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API.