Counting Tags in a Document

Credit: Paul Prescod

Problem

You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.

Solution

You can subclass SAX’s ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:

from xml.sax.handler import ContentHandler
import xml.sax

class countHandler(ContentHandler):
    def __init__(self):
        # maps each tag name to the number of times it has been seen
        self.tags = {}

    def startElement(self, name, attr):
        # called by the parser at the start of every element
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1

parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

# emit the counts in alphabetical order of tag name
tags = handler.tags.keys()
tags.sort()
for tag in tags:
    print tag, handler.tags[tag]

Discussion

When I start with a new XML content set, I like to get a sense of which elements are in it and how often they occur. I use variants of this recipe. I can also collect attributes just as easily, because startElement receives each element’s attributes as its second argument. If you add a stack, you can keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
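The attribute and stack variants mentioned above might look something like the following sketch. It is written for Python 3 (where has_key and the print statement are no longer available) and parses an in-memory sample rather than test.xml. The class name NestingHandler, the SAMPLE document, and the attributes and nesting dictionaries are all hypothetical names introduced here for illustration, not part of the original recipe:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class NestingHandler(ContentHandler):
    """Hypothetical variant of countHandler: also counts attributes
    and (parent, child) element pairs."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}        # tag name -> occurrences
        self.attributes = {}  # (tag name, attribute name) -> occurrences
        self.nesting = {}     # (parent tag, child tag) -> occurrences
        self.stack = []       # names of the currently open elements

    def startElement(self, name, attrs):
        self.tags[name] = 1 + self.tags.get(name, 0)
        # attrs behaves like a read-only mapping of attribute names to values
        for attr_name in attrs.getNames():
            key = (name, attr_name)
            self.attributes[key] = 1 + self.attributes.get(key, 0)
        # the top of the stack, if any, is the enclosing element
        if self.stack:
            pair = (self.stack[-1], name)
            self.nesting[pair] = 1 + self.nesting.get(pair, 0)
        self.stack.append(name)

    def endElement(self, name):
        # the element just closed is always the one on top of the stack
        self.stack.pop()

# Made-up sample document, used only to exercise the handler
SAMPLE = b"<doc><para id='a'>one</para><para id='b'><em>two</em></para></doc>"

handler = NestingHandler()
xml.sax.parseString(SAMPLE, handler)

for pair in sorted(handler.nesting):
    print(pair, handler.nesting[pair])
```

Keying the extra dictionaries by tuples keeps the counting logic identical to the tag counter; only the key changes.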

This recipe also works well as a simple example of a SAX application, and you can use it as the basis for any other SAX application. Alternatives to SAX include pulldom and minidom, but these would be overkill for this simple job. The same is generally true of any simple processing, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is complicated by references that go back and forth inside it, or when you need to correlate (e.g., compare) multiple documents with each other.
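For comparison, here is a rough sketch of the same count done with minidom, written for Python 3. The function name count_tags_with_dom and the SAMPLE document are hypothetical. Note that minidom builds the whole document tree in memory before any counting starts, which is the cost the SAX version avoids:

```python
from xml.dom import minidom

# Made-up sample document, used only for illustration
SAMPLE = "<doc><para>one</para><para><em>two</em></para></doc>"

def count_tags_with_dom(xml_text):
    """Count element tag names by walking a fully built DOM tree."""
    document = minidom.parseString(xml_text)
    counts = {}
    # "*" is the DOM wildcard that matches every element in the document
    for element in document.getElementsByTagName("*"):
        counts[element.tagName] = 1 + counts.get(element.tagName, 0)
    document.unlink()  # release the tree once we are done with it
    return counts

dom_counts = count_tags_with_dom(SAMPLE)
for tag in sorted(dom_counts):
    print(tag, dom_counts[tag])
```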

ContentHandler subclasses offer many other options, and the online Python documentation does a good job of explaining them. This recipe’s countHandler class overrides ContentHandler’s startElement method, which the parser calls at the start of each element, passing as arguments the element’s tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).

In the implementation of this recipe, the dictionary’s get method offers a slightly more concise way to code the startElement method than testing the tags dictionary with has_key:

def startElement(self, name, attr):
    self.tags[name] = 1 + self.tags.get(name,0)

This counting idiom for dictionaries is so frequent that it’s probably worth encapsulating in its own function despite its utter simplicity:

def count(adict, key, delta=1, default=0):
    adict[key] = delta + adict.get(key, default)

Using this, you could code the startElement method in the recipe as:

    def startElement(self, name, attr): count(self.tags, name)
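As a small usage sketch (Python 3), the helper can be exercised outside the handler as well. The tag names and sizes below are made-up data. The last lines show collections.Counter, a standard library class added to Python after this recipe was written, as one possible substitute for the hand-rolled idiom; it is not something the recipe itself uses:

```python
from collections import Counter

def count(adict, key, delta=1, default=0):
    adict[key] = delta + adict.get(key, default)

# Plain occurrence counting, as in startElement
tags = {}
for name in ["doc", "para", "para", "note"]:
    count(tags, name)

# A delta other than 1 accumulates totals instead of occurrences
sizes = {}
count(sizes, "para", delta=120)
count(sizes, "para", delta=80)

# Counter does the same occurrence counting without a helper function
counter = Counter(["doc", "para", "para", "note"])

print(tags)
print(sizes)
print(dict(counter) == tags)
```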
