Credit: Mark Nenadov
The expat parser is normally used through the SAX interface, but sometimes you may want to use expat directly to extract the best possible performance.
Python is very explicit about the lower-level mechanisms that its higher-level modules and packages use. You’re normally better off accessing the higher levels, but sometimes, in the last few stages of an optimization quest, or just to gain a better understanding of what exactly is going on, you may want to access the lower levels directly from your code. For example, here is how you can use Expat directly, rather than through SAX:
    import xml.parsers.expat

    class MyXML:
        Parser = None  # replaced by a real parser object in __init__

        # Prepare for parsing
        def __init__(self, xml_filename):
            assert xml_filename != ""
            self.xml_filename = xml_filename
            self.Parser = xml.parsers.expat.ParserCreate()
            # Register this object's methods as the parser's callbacks
            self.Parser.CharacterDataHandler = self.handleCharData
            self.Parser.StartElementHandler = self.handleStartElement
            self.Parser.EndElementHandler = self.handleEndElement

        # Parse the XML file
        def parse(self):
            try:
                # Binary mode: expat works out the document's encoding itself
                xml_file = open(self.xml_filename, "rb")
            except IOError:
                print("ERROR: Can't open XML file %s" % self.xml_filename)
                raise
            else:
                try:
                    self.Parser.ParseFile(xml_file)
                finally:
                    xml_file.close()

        # to be overridden by implementation-specific methods
        def handleCharData(self, data): pass
        def handleStartElement(self, name, attrs): pass
        def handleEndElement(self, name): pass
This recipe presents a reusable way to use xml.parsers.expat directly to parse an XML file. SAX is more standardized and richer in functionality, but expat is also usable, and sometimes it can be even lighter than the already lightweight SAX approach. To reuse the MyXML class, all you need to do is define a new class, inheriting from MyXML. Inside your new class, override the inherited XML handler methods, and you’re ready to go.
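As a minimal sketch of that reuse, the following defines a subclass that counts how often each tag appears. The names TagCounter and count_tags are hypothetical and are not part of the recipe. A condensed copy of MyXML is repeated only so that the example runs on its own, and the temporary file is there only because MyXML expects a filename.

```python
import os
import tempfile
import xml.parsers.expat


class MyXML:
    """Condensed copy of the recipe's class, repeated so this sketch is self-contained."""

    def __init__(self, xml_filename):
        self.xml_filename = xml_filename
        self.Parser = xml.parsers.expat.ParserCreate()
        self.Parser.CharacterDataHandler = self.handleCharData
        self.Parser.StartElementHandler = self.handleStartElement
        self.Parser.EndElementHandler = self.handleEndElement

    def parse(self):
        # Binary mode lets expat work out the document's encoding itself.
        with open(self.xml_filename, "rb") as xml_file:
            self.Parser.ParseFile(xml_file)

    def handleCharData(self, data): pass
    def handleStartElement(self, name, attrs): pass
    def handleEndElement(self, name): pass


class TagCounter(MyXML):
    """Hypothetical subclass: counts how many times each tag name is opened."""

    def __init__(self, xml_filename):
        MyXML.__init__(self, xml_filename)
        self.counts = {}

    def handleStartElement(self, name, attrs):
        # Overrides the do-nothing handler inherited from MyXML.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(xml_bytes):
    """Write xml_bytes to a temporary file, parse it, and return the tag counts."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(xml_bytes)
        counter = TagCounter(path)
        counter.parse()
        return counter.counts
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(count_tags(b"<doc><item/><item/><note>hi</note></doc>"))
```

The override takes effect even though MyXML's initializer is the one that registers the handlers: self.handleStartElement is looked up on the instance, so it resolves to the subclass's method.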
Specifically, the MyXML class creates a parser object that does callbacks to the callables that are its attributes. The StartElementHandler callable is called at the start of each element, with the tag name and the attributes as arguments. EndElementHandler is called at the end of each element, with the tag name as the only argument. Finally, CharacterDataHandler is called for each text string the parser encounters, with the string as the only argument. The MyXML class uses the handleStartElement, handleEndElement, and handleCharData methods as such callbacks. Therefore, these are the methods you should override when you subclass MyXML to perform whatever application-specific processing you require.
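To see what arguments each callback receives, here is a small sketch that bypasses MyXML and attaches plain functions to a parser object, recording every call. The function trace_events and its tuple layout are illustrative assumptions, not part of the recipe. It parses from a string with Parse rather than from a file with ParseFile.

```python
import xml.parsers.expat


def trace_events(xml_text):
    """Return a list of tuples describing each callback expat makes for xml_text."""
    events = []

    def start(name, attrs):
        # attrs arrives as a dictionary mapping attribute names to values
        events.append(("start", name, dict(attrs)))

    def end(name):
        events.append(("end", name))

    def chars(data):
        events.append(("chars", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return events


if __name__ == "__main__":
    for event in trace_events('<a href="x">hi</a>'):
        print(event)
```

Note that expat may deliver one run of text as several CharacterDataHandler calls, so a handler that needs the complete text of an element should accumulate the pieces until the matching end-element callback arrives.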
See Recipe 12.2, Recipe 12.3, Recipe 12.4, and Recipe 12.6 for uses of the higher-level SAX API. Expat was the brainchild of James Clark; Expat 2.0 is a group project, with a home page at http://expat.sourceforge.net/.