Credit: Paul Prescod
Once again, subclassing SAX’s ContentHandler makes this extremely easy:
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    # Called by the parser for each piece of character data
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

# Create a parser, attach the handler, and parse the document
parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
Sometimes you want to get rid of XML tags—for example, to rekey a document or to spellcheck it. This recipe performs this task and will work with any well-formed XML document. It is quite efficient. If the document isn’t well-formed, you could try a solution based on the XML lexer (shallow parser) shown in Recipe 12.12.
In this recipe’s textHandler class, we override ContentHandler’s characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string.
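To make the callback behavior concrete, here is a minimal sketch written for Python 3 rather than in the recipe's original form. The names TextCollector and extract_text are hypothetical and chosen only for illustration. Instead of printing, the handler gathers each piece of text in a list, so you can see that comments and processing instructions never reach characters.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that stores text pieces instead of printing them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver a single run of text in several pieces,
        # so collect them and join once parsing is finished.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Return the character data of a well-formed XML document."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# The comment and the processing instruction are skipped by characters().
sample = b"<doc><!-- a comment --><?target data?><p>Hello, <b>world</b>!</p></doc>"
print(extract_text(sample))
```

Collecting the pieces and joining them at the end is one design choice among several; for very large documents, writing each piece out as it arrives, as the recipe does, avoids holding all the text in memory.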
We have to encode this Unicode text before we can emit it to standard output. In this recipe,
we’re using the Latin-1 (also known as ISO-8859-1)
encoding, which covers all Western-European alphabets and is
supported by many popular output devices (e.g., printers and
terminal-emulation windows). However, you should use whatever
encoding is most appropriate for the documents
you’re handling and is supported by the devices you
use. The configuration of your devices may depend on your operating
system’s concepts of locale and code page.
Unfortunately, these vary too much between operating systems for me
to go into further detail.
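The recipe's code writes an encoded byte string straight to sys.stdout, which suits the Python version it was written for. In Python 3, sys.stdout is a text stream, so encoded bytes need to go to a binary stream such as sys.stdout.buffer. The following is a sketch under that assumption, not the author's code: EncodingTextHandler and strip_tags are hypothetical names, and the choice of errors="replace" (which substitutes "?" for characters the target encoding cannot represent) is an assumption you may want to change.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class EncodingTextHandler(ContentHandler):
    """Hypothetical handler that writes text to a binary stream in a chosen encoding."""

    def __init__(self, out, encoding="latin-1", errors="replace"):
        super().__init__()
        self.out = out
        self.encoding = encoding
        self.errors = errors

    def characters(self, content):
        # Encode each piece of text; unencodable characters are handled
        # according to the `errors` policy ("replace" emits "?").
        self.out.write(content.encode(self.encoding, self.errors))


def strip_tags(source, out, encoding="latin-1"):
    """Parse `source` (a filename or file-like object) and write its text to `out`."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(EncodingTextHandler(out, encoding))
    parser.parse(source)


# Demo: the euro sign has no Latin-1 equivalent, so it is replaced.
demo = io.BytesIO("<p>caf\u00e9 \u20ac5</p>".encode("utf-8"))
out = io.BytesIO()
strip_tags(demo, out)
print(out.getvalue())
```

To send the bytes to the terminal instead of an in-memory buffer, pass sys.stdout.buffer as the out argument, and pick the encoding your output device actually expects.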
See also Recipe 12.2, Recipe 12.3, and Recipe 12.6 for other uses of the SAX API, and Recipe 12.12 for a very different approach to XML lexing that works on XML fragments.