Python & XML

By Christopher A. Jones, Fred L. Drake, Jr.
Price: $39.95 USD
£28.50 GBP



Table of Contents

Chapter 1: Python and XML
Python and XML are two very different animals, each with a rich history. Python is a full-scale programming language that has grown from scripting world roots in a very organic way, through the vision and guidance of Python's inventor, Guido van Rossum. Guido continues to take into account the needs of Python developers as Python matures. XML, on the other hand, though strongly impacted by the ideas of a small cadre of visionaries, has grown from standards-committee roots. It has seen both quiet adoption and wrenching battles over its future. Why bother putting the two technologies together?
Before the Python/XML combination, there seemed no easy or effective way to work with XML in a distributed environment. Developers were forced to rely on a variety of tools used in awkward combination with one another. We used shell scripting and Perl to process text and interact with the operating system, and then used Java XML APIs for processing XML and network programming. The shell provided an excellent means of file manipulation and interaction with the Unix system, and Perl was a good choice for simple text manipulation, providing access to the Unix APIs. Unfortunately, neither sported a sophisticated object model. Java, on the other hand, featured an object-oriented environment, a robust platform API for network programming, threads, and graphical user interface (GUI) application development. But with Java, we found an immediate lack of text manipulation power; scripting languages typically provided strong text processing. Python presented a perfect solution, as it combines the strengths of all of these various options.
Like most scripting languages, Python features excellent text and file manipulation capabilities. Yet, unlike most scripting languages, Python sports a powerful object-oriented environment with a robust platform API for network programming, threads, and graphical user interface development. It can be extended with components written in C and C++ with ease, allowing it to be connected to most existing libraries. To top it off, Python has been shown to be more portable than other popular interpreted languages, running comfortably on platforms ranging from massively parallel Connection Machines to personal digital assistants and other embedded systems. As a result, Python is an excellent choice for XML programming and distributed application development.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Key Advantages of XML
XML has a few key advantages that make it the data language of choice on the Internet. These advantages were designed into XML from the beginning, and, in fact, are what make it so appealing to Internet developers.
First, XML is both human- and machine-readable. This is not a subtle point. Have you ever tried to read a Microsoft Word document with a text editor? You can't if it was saved as a .doc file, because the information in a .doc document is in a binary (computer readable only) format, even though most Word documents primarily consist of text. A Word document cannot be shared with any other application besides Word—unless that application has been taught the intricacies of Word's binary format. In this case, the application must also be taught to expect changes in Word's format each time there is a new release from Microsoft.
This sounds annoying for the developer, but how bad is it, really? After all, Word is incredibly popular, so it must not be too hard to figure out. Let's look at the top of the Word file that contains this chapter:
Ï_ࡱ_á                > _ ÿ    _           _   B_       _  D_  _  
ÿÿÿ    ?_  @_  A_ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á 7         _  _¿      _     _  >_  _ 
bjbjU_U_                         __ 0¸_ 7|  7|  W_  _     C            
           ÿÿ_         ÿÿ_         ÿÿ_                 l     Ê_      
Ê_  Ê_      Ê_      Ê_      Ê_      Ê_  ¶           _      
This certainly looks familiar to anyone who has ever opened a Word file with a text editor. We don't see our recognizable text (the content we intended) so we must assume it is buried deep in the file. Determining what the true content is and where it is can be difficult, but it shouldn't be. It
Additional content appearing in this section has been removed.
The XML Specifications
In the trade press, we often see references to how XML "now supports" some particular industry-specific application. The article that follows is often confused, offering some small morsel of information about an industry consortium that has released a new specification for an XML-based language to support interoperability of data within the consortium's industry. As technical people, we usually note that it doesn't apply to the industries we're involved in, or else it does, but the specification is too early a draft to be useful. In fact, our managers will probably agree with us most of the time, or they'll be privy to some relevant information that causes them to disagree. If we step up the corporate ladder a couple more rungs, however, we often find an increase in the level of confusion over XML. Sometimes, this is accompanied by either a call to "adopt XML" (too often with a list of particular specifications that are not intended to be used together), or a reaction that XML is too immature to use at all.
So we need to think about just what we can work with that will meet the following criteria:
  • It must make technical sense for our application.
  • It should be sufficiently well-defined that implementation is possible.
  • It must be able to be explained and justified to (at least) our direct managers.
  • It won't freak out the upper management.
Ok, we're technical people, so we may have to ignore that last item; it certainly won't be covered in this book. In fact, most of this really can't be covered in technical material. There are many specifications in various stages of maturity, and most are specific to one industry or another. However, we can point out what the foundation specifications are, because those you will need regardless of your industry or other requirements.
The XML specification itself is a document created and maintained by the W3C. As of this writing, the current version is Extensible Markup Language (XML) 1.0 (Second Edition), and is available from the W3C web site at http://www.w3.org/TR/REC-xml. (The second edition differs from the first only in that some editorial corrections and clarifications have been made; the specification is stable.)
Additional content appearing in this section has been removed.
The Power of Python and XML
Now that we've introduced you to the world of XML, we'll look at what Python brings to the table. We'll review the Python features that apply to XML, and then we'll give some specific examples of Python with XML. As a very high-level language, Python includes many powerful data structures as part of the core language and libraries. The more recent versions of Python, from 2.0 onward, include excellent support for Unicode and an impressive range of encodings, as well as an excellent (and fast!) XML parser that provides character data from XML as Unicode strings. Python's standard library also contains implementations of the industry-standard DOM and SAX interfaces for working with XML data, and additional support for alternate parsers and interfaces is available.
Of course, this much could be said of other modern high-level languages as well. Java certainly includes an impressive library of highly usable data structures, and Perl offers equivalent data structures also. What makes Python preferable to those languages and their libraries? There are several features, of which we briefly discuss the most important:
  • Python source code is easy to read and maintain.
  • The interactive interpreter makes it simple to try out code fragments.
  • Python is incredibly portable, but does not restrict access to platform-specific capabilities.
  • The object-oriented features are powerful without being obscure.
There are many languages capable of doing what can be done with Python, but it is rare to find all of the "peripheral" qualities of Python in any single language. These qualities do not so much make Python more capable, but they make it much easier to apply, reducing programming hours. This allows more time to be spent finding better ways to solve real problems or just allows the programmer to move on to the next problem. Here we discuss these features in more detail.
Easy to read and maintain
As a programming language, Python exhibits a remarkable clarity of expression. Though some programmers accustomed to other languages view Python's use of significant whitespace with surprise, everyone seems to think it makes Python source code significantly more readable than languages that require more special characters to be introduced to mark structure in the source. Python's structures are not simpler than those of other languages, but the different syntax makes source code "feel" much cleaner in Python.
Additional content appearing in this section has been removed.
What Can We Do with It?
Now that we've looked at how we can use XML with Python, we need to look at how we can apply our knowledge of XML and Python to real applications. In the Internet age, this means widely distributed systems operating across the Internet.
There's a lot to working with the Internet beyond XML and the CGI programming done in many of the examples in the book. In case you're not already familiar with this topic, we include an introduction to the facilities in the Python standard library that help create clients and servers for the Internet in Chapter 8. We review how to retrieve data from remote servers, and how to submit form-based requests programmatically and read the result. We then learn to build custom web servers that respond to HTTP requests, allowing us to build servers that do exactly what we need them to.
With these skills under our belt, we proceed to look at the emerging world of "web services." Chapter 9 describes what we mean by web services and introduces the specifications coming out in that area. We look at two packages that allow us to use SOAP to call on web services and demonstrate how to create one in Python.
In Chapter 10, we pull together much of what we've learned with an extended example that demonstrates how it all works together. Using XML as a communications medium, we are able to build an application that uses a variety of technologies and operates in diverse environments.
Additional content appearing in this section has been removed.
Chapter 2: XML Fundamentals
XML is not new! XML, the Extensible Markup Language, began development in 1996 and became an official World Wide Web Consortium (W3C) standard in 1998. XML is derived from the Standard Generalized Markup Language (SGML), which has been around for a great while. SGML has long been used as a means of document management, and is the parent of HTML. XML, on the other hand, is an outgrowth of these earlier markup languages intended for information sharing on the Internet. While HTML is effective for communicating how a page should look inside a web browser, XML speaks more to how information should be structured or used between or among applications (including web browsers) running on the Internet.
Additional content appearing in this section has been removed.
XML Structure in a Nutshell
The basic structure of an XML document is simple. Most can be reduced to a few simple components. Consider the following:
<?xml version="1.0"?>
<PurchaseOrder>
  <account refnum="2390094"/>
  <item sku="33-993933" qty="4">
    <name>Potato Smasher</name>
    <description>Smash Potatoes like never before.</description>
  </item>
</PurchaseOrder>
In this example, the first line, starting with the <? characters, is the XML declaration. It states which version of XML is being used and can also include information about the character encoding of the document. The text starting with <PurchaseOrder> and ending with </PurchaseOrder> is an XML element. An element must have an opening and closing tag, or the opening tag must end with the characters /> if it is to be empty. The account element shown here is an example of an empty element that ends with a />. The item element opens, contains two other elements, and then closes. The sku="33-993933" expression is an attribute named sku with its value 33-993933 in quotes. An element can have as many attributes as needed. Both the name and description elements are followed by character data or text. Finally, the elements are closed and the document terminates.
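To tie these terms to code, here is a minimal sketch that reads the purchase order above with xml.etree.ElementTree from Python's standard library. The choice of ElementTree is an assumption made for brevity, not something this chapter prescribes, and the variable names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# The purchase order from the example above, held in a string.
PURCHASE_ORDER = """<?xml version="1.0"?>
<PurchaseOrder>
  <account refnum="2390094"/>
  <item sku="33-993933" qty="4">
    <name>Potato Smasher</name>
    <description>Smash Potatoes like never before.</description>
  </item>
</PurchaseOrder>"""

root = ET.fromstring(PURCHASE_ORDER)   # the PurchaseOrder element
account = root.find("account")         # an empty element: attributes only
item = root.find("item")               # an element with two child elements

print(root.tag)                        # element name
print(account.attrib)                  # attributes as a dictionary
print(len(account))                    # number of children of the empty element
print(item.get("sku"), item.get("qty"))
print(item.findtext("name"))           # character data inside <name>
```

Attribute values always arrive as strings; converting qty to an integer, for instance, is left to the application.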
In the remainder of this chapter, we walk through the relevant parts of the XML specification, highlighting the most important items for you to be aware of as you embark on coding with Python and XML.
Additional content appearing in this section has been removed.
Document Types and Schemas
When we talk about document types, we are speaking of something very similar to the notion of types in a programming language. Programming language types are used to describe structures that can be composed in particular ways, and document types do the same thing. The primitive components and the types of composition that are allowed differ, but they are conceptually aligned. A document type is commonly referred to as a schema. The difference between a document type and a database schema can be shallow in many applications, though the similarity is not always relevant. We often use schema to refer to a document type when it is not important how it was defined, because the phrase "document type" has historical associations with a particular schema language.
Schemas are valuable for several reasons, but two dominate: they require critical thinking about the applications and data to design, and they can be used to help specify how documents should be constructed and interpreted when exchanged across organizational boundaries. The latter can be especially critical in applications such as supply-chain integration, where the automated exchange of dynamically generated documents can incur contractual obligations; it becomes very important that everyone agree on what the documents mean, because misinterpretation can be very costly!
Document types are built on top of data types as well as on top of structuring rules, in which data types are very analogous to the primitive types provided by most programming languages. Different schema languages use different sets of data types, some being extensible and others allowing the use of arbitrary typing systems rather than providing their own. Some schema languages allow data types to be specified for any document content, and others limit the ability to apply data types to specific constructs.
All schema languages let the allowed ordering and nesting of elements be defined, and let attributes be associated with element types. Everything else is open to variation, so it helps to be aware of the general differences and select a schema language based on the requirements of the application, the availability of tools, and interoperability requirements.
Additional content appearing in this section has been removed.
Types of Conformance
As with any specification, the primary reason for the XML specification's existence is to hold documents against it and make sure they conform to the specification. If so, then the rules within the specification can be used in reading, transforming, or applying the document. However, we must remember that XML defines two things: syntax for document instances, and a way to define new languages using XML. It also tells us that we can use the former without the latter, so it must define what it means to conform to the specification in both cases.
If a document uses the XML syntax but does not depend on a specific markup language defined using the means provided by the XML recommendation, it needs to be well-formed in order to conform with XML. This is a form of conformance introduced by XML rather than inherited from SGML. On the other hand, a document that declares that it uses a specific markup language defined by a DTD is said to be valid if it is both well-formed and the elements and character data are arranged in a way that complies with the rules given by the specified Document Type Definition.
The XML specification defines a collection of text to be an XML document if it is well-formed according to the rules of the specification. The term well-formed is widely used in XML, and it refers to a document that is syntactically acceptable. For example:
<?xml version="1.0"?>
<book>
  <title>Python and XML</title>
</book>
The preceding document is well-formed. That is, beyond the XML declaration (described in more detail in Section 2.5.6, later in this chapter) pointing out that the document uses Version 1.0 of XML, both the book and title elements are opened and closed so that elements nest within each other in a strictly hierarchical way. You can't open a book and close a magazine.
Being well-formed is required but not sufficient to describe the concept of validity, which deals with the conformance of a document to a Document Type Definition. It's one thing to have the structure arranged such that it is syntactically acceptable, but quite another to ensure that the information contained within the document is organized in the appropriate fashion and contains all of the necessary elements to be of use in an application or transaction.
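The well-formedness half of this distinction is easy to check from Python. The sketch below assumes the standard library's xml.etree.ElementTree, whose parser is non-validating: it reports syntax errors but never compares a document against a DTD. The helper name is_well_formed is our own invention for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it.

    This checks syntax only; it says nothing about validity against a DTD.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Elements nest in a strictly hierarchical way.
GOOD = "<book><title>Python and XML</title></book>"
# Opens a book but closes a magazine.
BAD = "<book><title>Python and XML</title></magazine>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Checking validity would require a validating parser, which this sketch deliberately does not attempt.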
Additional content appearing in this section has been removed.
Physical Structures
XML text is stored in entities. Entities are identified in various ways, but most commonly by filename or URI. There is no constraint on this, however, and many systems do use alternate means for entity storage — for example, many live happily in large databases. Many XML documents involve more than one entity; perhaps the most common arrangement is that the document is in one entity and its type definition is in another. As documents get larger, increasing numbers of entities are often involved with each document. This may be more common with document-centric applications than with data-communication applications of XML.
Entities are typically given names in one or more global namespaces. XML requires that entities be given system identifiers, which are always URIs. The term has roots in the SGML community, where system identifiers were used to refer to storage locations using whatever syntax the tools in use happened to understand. An additional global namespace is shared with the SGML world; the identifiers in that space are called formal public identifiers (FPIs). Use of this namespace is very limited in the XML world, as it is not always easily mapped to URLs that can be used to retrieve arbitrary resources, although there are ways to do it. They do see some use, and extensible support for FPIs is available in the PyXML toolkit.
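To see where the two kinds of identifiers surface in code, the following sketch parses a document whose document type declaration carries both. It assumes xml.dom.minidom from the standard library, and both identifiers are hypothetical values made up for this illustration. As far as we know, minidom records the identifiers without attempting to retrieve the external entity they name.

```python
from xml.dom import minidom

# Both identifiers below are hypothetical, invented for this illustration.
DOC = """<!DOCTYPE book PUBLIC "-//Example//DTD Hypothetical Book//EN"
    "http://example.com/dtds/book.dtd">
<book/>"""

doc = minidom.parseString(DOC)
doctype = doc.doctype

print(doctype.name)      # name of the document element type
print(doctype.publicId)  # the formal public identifier
print(doctype.systemId)  # the system identifier, a URI
```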
Entities are used for several things in XML:
Document entities
Regardless of the application, all documents start somewhere. With XML, they are also guaranteed to end in the same entity. The entity containing the start of the document is called the document entity. The document entity is interesting because it is the only entity that may be completely anonymous. An application can provide the content of the entity directly to the XML parser, allowing it to operate without extracting the text from a disk file or another local or remote data source.
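An anonymous document entity is what a parser receives whenever an application hands it text directly. The sketch below is one way to do that, assuming xml.dom.minidom and the escape helper from xml.sax.saxutils, both in the standard library; the document content is built in memory and never touches a file or URL.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

# Build the document text in memory; the entity has no name or location.
title = "Python and XML"
text = "<book><title>%s</title></book>" % escape(title)

# The parser receives the content of the document entity directly.
doc = minidom.parseString(text)
title_node = doc.getElementsByTagName("title")[0]
parsed_title = title_node.firstChild.data

print(doc.documentElement.tagName)
print(parsed_title)
```

Calling escape is a precaution: it keeps the generated text well-formed if the title ever contains characters such as & or <.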
External entities
Additional content appearing in this section has been removed.
Constructing XML Documents
Documents are the heart of XML. Any amount of usable XML is presented as a document, often stored in a file. One of the very first things you must understand in order to use XML is how to create a well-formed document. In this section, we examine the syntactic components of a document, starting with the individual characters and looking at how they are viewed when building larger syntactic constructs. Then we look at the constructs defined for all documents by the XML recommendation.
The XML Specification defines a character as "an atomic unit of text as specified by ISO/IEC 10646." (Remember, ISO/IEC 10646 is more commonly referred to as Unicode.) Of course, this explanation is exactly what you should say at a party if someone asks. One of the goals of both standardization and XML is to make documents easily understandable by platforms around the globe. As such, simple things like ASCII characters can become quite complex.
Regardless, the specification states that legal characters are "tab, carriage return, line feed," together with the legal characters of the aforementioned Unicode specification. If you were to write an XML parser, the topic of characters and standardization would be of incredible importance to you. For the rest of us, it's usually enough to choose an XML parser that gets it right.
You can declare the character encoding used in an XML document using the optional XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
For an external entity that is not a document itself, a variation of the XML declaration is used; the XML specification calls it a text declaration, and its essential part is the encoding declaration:
<?xml encoding="UTF-8"?>
More information on the XML declaration is provided in "The Document Prolog" later in this chapter. For now, let's look at some of the most widely used character sets and encodings. (A character set that can be mapped into Unicode can be considered an encoding of Unicode, even if it does not directly support everything defined in Unicode.)
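The effect of the encoding declaration can be observed from Python. In this sketch, which assumes xml.etree.ElementTree and an ISO-8859-1 document invented for the purpose, the parser reads the declaration from the raw bytes and hands character data back as an ordinary Unicode string.

```python
import xml.etree.ElementTree as ET

# A small document that declares a non-UTF-8 encoding.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\u00e9</name>'

# The bytes as they would sit in a file saved in that encoding.
raw = text.encode("iso-8859-1")

# The parser uses the declaration to decode the bytes.
root = ET.fromstring(raw)

print(raw[-10:])   # the accented letter is the single byte 0xE9
print(root.text)   # character data as a Unicode string
```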

Section 2.5.1.1: The ASCII character set

The American Standard Code for Information Interchange (ASCII) is a 7-bit text format (meaning that it takes a sequence of seven 1's and 0's to form a character). ASCII is understood by virtually every computer in use. Unicode extends ASCII, so the first 128 characters of Unicode coincide with the 128 characters of ASCII.
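The relationship between ASCII and Unicode can be confirmed with a few lines of Python. This is only a sketch using built-in string and bytes operations; the sample text is arbitrary.

```python
# All 128 ASCII byte values, 0 through 127.
ascii_bytes = bytes(range(128))
decoded = ascii_bytes.decode("ascii")

# Each ASCII character has the same number as its Unicode code point.
same_code_points = all(ord(ch) == i for i, ch in enumerate(decoded))

# Text limited to ASCII produces identical bytes in ASCII and UTF-8.
text = "Python and XML"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# A byte with the eighth bit set falls outside the 7-bit range.
try:
    bytes([128]).decode("ascii")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(same_code_points, same_bytes, rejected)
```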
Additional content appearing in this section has been removed.
Document Type Definitions
As discussed earlier, Document Type Definitions, or DTDs, are the form of document types specified by the XML 1.0 recommendation. Though there are alternatives, DTDs remain one of the most common ways of specifying a document type. In this section, we discuss the syntax of the various declarations that can occur in the Document Type Declaration; these can all appear in both the internal and external subsets.
Entities are sources of data that are used to compose a larger construct. Most, called general entities, are used to construct documents, but some, known as parameter entities, are used to construct the document type itself. Both are defined using an entity declaration in the Document Type Definition. Each kind of entity is defined in a separate namespace; there can be a general entity named myEntity and a parameter entity of the same name, and the names do not clash.
Entities can be declared more than once — the first definition for a name takes precedence. This allows the internal subset to override a definition provided in the external subset; when used with parameter entities, this mechanism can be used to extend DTDs. Document type extension generally works best when the DTD being extended has been carefully designed with this in mind. The DocBook DTD for technical documentation is an excellent example of this.
General entities can take a variety of forms: they may be parsed entities, consisting of XML text, or unparsed, such as an image stored as a Portable Network Graphics (PNG) file. The text of a parsed entity may be included in the entity declaration, or it may reside in an external source. The body of an unparsed entity is always stored externally. Most entities used with XML are parsed entities; unparsed constructs, such as images, are typically referenced using an absolute or relative URL rather than by a named entity.
Parsed general entities are used to define substitution text for a (typically) shorter name. Recall that in XML, text includes not only character data, but markup as well, so the substitution can actually insert additional structure into the document as long as all structures are complete within the substitution. At production time, a parser resolves the entity into its substitution text, and evaluates the document based on how it looks after the entities have been resolved. A simple internal entity is as easy to create as a symbol and its replacement text:
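As a hedged illustration of entity substitution, the sketch below declares a hypothetical internal entity named pub in the internal subset and parses the document with xml.etree.ElementTree, which is our choice of parser rather than anything this section requires. The entity is deliberately declared twice to show that the first definition for a name takes precedence.

```python
import xml.etree.ElementTree as ET

# The entity name "pub" and its replacement text are hypothetical.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY pub "Example Press">
  <!ENTITY pub "ignored second definition">
]>
<note>Published by &pub;.</note>"""

# The parser replaces &pub; with its substitution text while parsing,
# so the application only sees the resolved character data.
root = ET.fromstring(DOC)

print(root.text)
```

Because a parser expands entities for you, documents from untrusted sources that define their own entities deserve caution.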
Additional content appearing in this section has been removed.
Canonical XML
The term canonicalization originally was "borrowed" loosely from its more ancient context to indicate that one structure of an instance document is the same as the master, or commonly accepted, structure of the document. Canonicalization is sometimes referred to as C14N for brevity; this is similar to the more common use of I18N for internationalization.
Canonical XML is an emerging W3C recommendation that lets you determine whether two physical representations of a document are "canonically" equivalent. In this section, we explore some of the technical features of Canonical XML so you can judge how it applies to your needs.
To begin the process of converting a document to canonical form, you, or rather your Canonical XML processor, must start with some form of XML that it can understand. Therefore, your first parameter to a canonical translator should be an XPath node set, or a serialized XML document. The second parameter is a Boolean value, which indicates whether comments should be analyzed.
In the case of a node set, it must have normalized line feeds, normalized attribute values, substituted CDATA sections with their character content, and resolved character and parsed entity references. In other words, each node must be fully cooked. No stranded entities and no superfluous whitespace are allowed. All whitespace within the root element must be preserved with the exception of line-delimiter normalization. The whole approach leads you to think that the document is being worked over—flattened, stretched, and pulled like pizza dough just prior to being cooked.
Although Canonical XML depends on XPath, it imposes a few rules on the XPath node sets that are sent into any Canonical XML processor.
  1. An element's namespace and attribute nodes must follow the element but precede any children.
  2. Namespace nodes must exist prior to attribute nodes.
  3. Namespace nodes for an element are sorted lexicographically by local name.
  4. Attribute nodes for an element are sorted lexicographically with the namespace URI as a primary key and the local name as a secondary key.
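The practical effect of these rules can be sketched with the canonicalize function in xml.etree.ElementTree. Two caveats apply: the function is available only in Python 3.8 and later, and it implements the later C14N 2.0 specification rather than the version described in this section, so treat it as an approximation. The two input strings are our own variations on the item element used earlier.

```python
import xml.etree.ElementTree as ET

# Two physical representations of the same element: different attribute
# order, quoting, spacing, and empty-element syntax.
first = '<?xml version="1.0"?><item sku="33-993933" qty="4"/>'
second = "<item qty='4'   sku='33-993933'></item>"

# With no output file given, canonicalize returns the canonical text.
canon_first = ET.canonicalize(first)
canon_second = ET.canonicalize(second)

print(canon_first)
print(canon_first == canon_second)
```

Comments are dropped by default; passing with_comments=True keeps them, which corresponds to the Boolean parameter mentioned above.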
Additional content appearing in this section has been removed.
Going Beyond the XML Specification
The standards developed at the W3C are intended to promote interoperability among distributed systems and among application developers around the world. As we progress in this book from XML tools and strategies in your local applications to distributed application development, several new XML terms and issues come to the forefront.
As discussed in Section 1.2.2 in Chapter 1, namespaces provide a means to combine elements from different knowledge domains or schemas. The Namespaces specification accomplishes this by allowing element and attribute names to be qualified with a URI; every URI corresponds to a unique namespace. Namespaces are used for several purposes in practice, but the most important is to allow a document to contain elements defined by different schemas (possibly originating from different organizations) without having naming conflicts.
A namespace is declared by associating a prefix with a URI through an attribute whose name begins with xmlns, followed by the reserved colon character and the prefix. Element names that belong to the namespace then carry the same prefix and a colon. For example:
<sumc:purchaseOrder refnum="389473984-38844"
    xmlns:sumc="http://www.superultramegacorp.com">
  <sumc:product name="Magical Widget" sku="398-4993833">
    <sumc:qty value="24">One Case Order</sumc:qty>
    <sumc:amount value="34.56">34.56</sumc:amount>
    <sumc:shipping value="overnight">Next-day</sumc:shipping>
  </sumc:product>
</sumc:purchaseOrder>
In this document, the namespace of SuperUltraMegaCorp is defined. The prefix sumc has been associated with it in the xmlns:sumc attribute. Elements prefixed with sumc: are within this namespace. This purchaseOrder now has a context that can set it apart from a similarly structured purchase order intended for a different business domain.
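From Python, the namespace shows up as part of each element's name. The sketch below uses a shortened copy of the purchase order with xml.etree.ElementTree, which is our choice of library; ElementTree writes a qualified name as the URI in braces followed by the local name. The prefix s used in the lookup is arbitrary and need not match the prefix in the document, since only the URI matters.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the purchase order shown above.
DOC = """<sumc:purchaseOrder refnum="389473984-38844"
    xmlns:sumc="http://www.superultramegacorp.com">
  <sumc:product name="Magical Widget" sku="398-4993833">
    <sumc:qty value="24">One Case Order</sumc:qty>
  </sumc:product>
</sumc:purchaseOrder>"""

# Map a prefix of our own choosing to the namespace URI.
NS = {"s": "http://www.superultramegacorp.com"}

root = ET.fromstring(DOC)
qty = root.find("s:product/s:qty", NS)

print(root.tag)            # URI in braces, then the local name
print(root.get("refnum"))  # unprefixed attributes carry no namespace
print(qty.text)
```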
XPath is discussed at length in Chapter 5. For now it is worth a mention, lest you start to develop your own method for querying XML without understanding what standards are offered. XPath offers a standardized method of querying XML for specific information, whether it's a single element or node, or a collection of elements. The standardization is of value not when you're writing the backend part of your application, but rather when you need to expose search capabilities either programmatically or via the web.
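For a first taste of path-based queries, the sketch below uses the find and findall methods of xml.etree.ElementTree. Be aware that ElementTree supports only a limited subset of XPath, so this shows the flavor of the syntax rather than the full language covered in Chapter 5. The second item in the sample document is hypothetical, added so the query has something to choose between.

```python
import xml.etree.ElementTree as ET

# The second item is hypothetical, added for the sake of the query.
ORDER = """<PurchaseOrder>
  <item sku="33-993933" qty="4"><name>Potato Smasher</name></item>
  <item sku="33-993934" qty="1"><name>Carrot Peeler</name></item>
</PurchaseOrder>"""

root = ET.fromstring(ORDER)

# A collection of elements: every name under every item.
names = [node.text for node in root.findall("./item/name")]

# A single element: the name of the item with a particular sku attribute.
match = root.find(".//item[@sku='33-993934']/name")

print(names)
print(match.text)
```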
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: The Simple API for XML
The Simple API for XML, otherwise known as SAX, is a popular interface for working with XML data. Let's start by looking at the background and history of SAX, after which we'll describe the major components of the interface. Once the overview is complete, we can look at several examples to help you see how to use it in your own applications.
The Birth of SAX
Before SAX, almost every XML parser offered its own interface, so applications were built to use specific parsers. The interfaces were low-level and generally similar in structure; the differences were mostly in the details. When new parsers were made available, applications had to be modified extensively to work with the different interface in order to take advantage of the new parser, even though the fundamental structure was essentially unchanged.
As is so often the case, the solution lay in introducing another layer of indirection. A group of XML developers using Java, led by David Megginson on the XML-DEV mailing list, defined a set of Java interfaces that allowed an application to work with any parser. The only requirement was that there be a driver for the new API for each parser. The driver was a class that used the parser-specific interface to make calls back to the application using the new, general interface. The application would create handler objects that implemented methods the driver would use to call back to the application. When Megginson released the specification, he also released a set of drivers for many of the more popular Java XML parsers. The initial specification supported the XML 1.0 recommendation, but not any of the more complex layers that have been built on top of it; the initiatives to create those were largely in their infancy at the time. The group of developers called the new API the "Simple API for XML," or SAX, because it was actually simpler than most of the parser-specific interfaces it was designed to abstract away.
The new API was widely received as a major step forward for application writers—it was easy to use, allowed the use of arbitrary parsers with an application, and was carefully defined before any other common APIs were available. Java programmers became extremely happy as the stress levels dropped in their professional lives. Developers in other languages adapted the specification in ways that allowed SAX to remain an identifiable API even as it was made to work with the native conventions used in those languages. Python programmers in the XML-SIG, led by Lars Marius Garshol, created an adaptation of the API and implemented drivers for several parsers. This implementation was accepted as part of the PyXML package.
Understanding SAX
The first job of using SAX is to design and implement a handler that works with your specific XML documents. When dealing with a large project or working with a vast catalogue of valid documents, it may make sense to implement a few comprehensive handlers to deal with multiple document types. However, for smaller projects, it may be more desirable to implement handlers for each specific document type that you encounter. As you start to build more complex applications, you will see that the things you're attempting to do with the XML as well as the XML documents themselves can drive the way you develop your document handlers. Often, the SAX methods that you implement extract data from the event stream, which you can then hand off to another application (such as a database). Or you might want to apply intelligent business logic to it. It's likely that the task will drive your development strategy.
In practice, SAX is a callback-based API in which you implement handler objects to process XML. You pass a reference to your SAX handler objects to a SAX-capable parser (or driver; we'll use "parser" to refer to either). When parsing begins, the parser calls the methods on your handler objects and allows you to process the XML, so that you can do something useful with it in your applications and distributed systems.
SAX is an excellent stream-based API. It allows for faster processing of documents, as well as handling of documents that are simply too large to load into memory. Additionally, the event-based API allows you to react to parsing events and errors in "real-time," as they occur, while parsing the document, rather than waiting for the entire document to load. This can be especially valuable when used in a graphical application that needs to remain responsive to the user. Another huge win for many applications is the lower memory consumption when compared to DOM-based code; by allowing the application control over any objects created during parsing, the application can minimize the needed storage overhead and discard objects as soon as they are no longer required.
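The streaming behavior described above can be sketched as follows, assuming Python 3 and the incremental feed and close methods of the parser returned by the standard library's xml.sax.make_parser. EventLog and feed_in_chunks are hypothetical names for this illustration. The document is handed to the parser in pieces, and the handler's event list is measured after each piece, so you can watch events arrive before the document is complete. Exactly when a given event is delivered depends on the underlying parser, so the per-chunk counts are shown as an observation rather than a guarantee.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLog(ContentHandler):
    """Hypothetical handler that records start and end element events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def feed_in_chunks(chunks):
    """Feed byte chunks to a SAX parser one at a time.

    Returns the final event list and the number of events seen
    after each chunk was fed.
    """
    parser = xml.sax.make_parser()
    handler = EventLog()
    parser.setContentHandler(handler)
    counts = []
    for chunk in chunks:
        parser.feed(chunk)
        counts.append(len(handler.events))
    parser.close()
    return handler.events, counts


if __name__ == "__main__":
    events, counts = feed_in_chunks([b"<a><b>", b"</b>", b"</a>"])
    print(events)
    print(counts)
```

Nothing in the handler holds on to the document itself; only the small event tuples are kept, which is the source of the memory savings compared to building a full tree.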
Reading an Article
In this example, we look at how we can extract and use information from an XML document using SAX. The particular documents our script works with are simple news articles, but we'll see how to work with elements, attributes, and textual content.
Some of the trade-offs of using SAX depend on what you're trying to accomplish, and how the XML is structured. SAX treats XML as a continuous stream, firing events to your handler as they happen. Example 3-1 shows article.xml.
Example 3-1. article.xml
<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive"
           length="3k"
           author="Joe Reporter"
           distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
           announced that NASA has completed building a
           Warp Drive and has parked a ship that uses
           the drive in his back yard.  This individual
           claims that although he hasn't been contacted by
           NASA concerning the parked space vessel, he assumes
           that he will be launching it later this week to
           mount an exhibition to the Andromeda Galaxy.
    </body>
</webArticle>
Example 3-1 contains markup that is structured in a few different ways, and can be interesting to parse via SAX. A document such as article.xml requires that we understand how the document is structured prior to writing a handler to parse it. Therefore, the handler is tightly coupled to the document's structure.
You can write the ArticleHandler class to a new file, handlers.py; we'll keep adding new handlers to this file throughout the chapter. Keep it simple at first, just to see how SAX works:
# - ArticleHandler (add to handlers.py file)
from xml.sax.handler import ContentHandler

class ArticleHandler(ContentHandler):
  """
  A handler to deal with articles in XML
  """
  def startElement(self, name, attrs):
    print "Start element:", name
Now we need to create a script to instantiate the parser, assign the handler, and do the actual work.
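One way such a script might look is sketched below. It assumes Python 3 and the standard library's xml.sax.parseString convenience function, so it is not the book's own listing. To keep it self-contained it carries its own slightly extended ArticleHandler that also collects the header title and the body text, and it parses an abbreviated copy of the article from a byte string instead of reading article.xml from disk. The read_article function and the attributes on the handler are assumptions made for the illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Abbreviated stand-in for article.xml.
ARTICLE = b"""<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive" length="3k"
            author="Joe Reporter" distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
          announced that NASA has completed building a Warp Drive.
    </body>
</webArticle>"""


class ArticleHandler(ContentHandler):
    """Sketch of a handler that pulls a few facts out of an article."""

    def __init__(self):
        super().__init__()
        self.started = []      # element names in document order
        self.title = None      # title attribute of <header>
        self.body = []         # text chunks found inside <body>
        self._in_body = False

    def startElement(self, name, attrs):
        self.started.append(name)
        if name == "header":
            self.title = attrs.get("title")
        elif name == "body":
            self._in_body = True

    def endElement(self, name):
        if name == "body":
            self._in_body = False

    def characters(self, content):
        # Text may arrive in several calls, so collect the pieces.
        if self._in_body:
            self.body.append(content)


def read_article(data):
    """Parse article bytes and return the populated handler."""
    handler = ArticleHandler()
    xml.sax.parseString(data, handler)
    return handler


if __name__ == "__main__":
    article = read_article(ARTICLE)
    print("Elements:", article.started)
    print("Title:", article.title)
    print("Body:", " ".join("".join(article.body).split()))
```

Note how the handler has to know that the title lives in an attribute of header while the story lives in the text of body; this is the tight coupling between handler and document structure mentioned earlier.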
Searching File Information
In this section, we create a file indexing script that can generate an XML document representing your entire filesystem or a specific portion of it. Indexing files with XML is a powerful way to keep track of information, or perform bulk operations on groups of particular files on a disk. You can create an XML-generating indexing routine easily in Python. The index.py program in Example 3-4 (which shows up a little later in the chapter) starts in any directory you specify and generates an element for each file or directory that exists beneath the starting point. Once we have the index of file information, we look at how to use SAX to search the information and filter the list of files by whatever criteria interest us at the time.
The main part of this routine works by just checking each file in a starting directory, and then recursing into any directories it finds beneath the starting directory. Recursion allows it to index an entire filesystem if you choose. On Unix, the program performs a lot of work, as it does content checking via a popen call to the file command for each file. (While this could be made more efficient by calling file less often and requiring it to operate on more than one file at a time, that isn't the topic of this book.) One of the key methods of this class is indexDirectoryFiles:
def indexDirectoryFiles(self, dir):
     """Index a directory structure and creates an XML output file."""


     # prepare output XML file
     self.__fd = open(self.outputFile, "w")
     self.__fd.write('<?xml version="1.0" encoding="' +
                     XML_ENC + '"?>\n')
     self.__fd.write("<IndexedFiles>\n")

     # do actual indexing
     self.__indexDir(dir)
     # close out XML file
     self.__fd.write("</IndexedFiles>\n")
     self.__fd.close()
An XML file is created with the name given in outputFile, and an XML declaration and root element are added. The indexDirectoryFiles method calls its internal __indexDir method; this is the real worker method. It is a recursive method that descends the file hierarchy, indexing files along the way.
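The general shape of such a recursive worker can be sketched as follows. This is not the book's __indexDir method, which is not shown here; it is a simplified stand-in that assumes Python 3, skips the content checking done with the file command, and invents the element and attribute names (directory, file, name, size) as well as the function names index_dir and index_tree.

```python
import os
from xml.sax.saxutils import quoteattr


def index_dir(path, out, depth=1):
    """Recursively write one <directory> element for path to out."""
    pad = "  " * depth
    # quoteattr escapes the value and adds the surrounding quotes.
    out.write("%s<directory name=%s>\n" % (pad, quoteattr(path)))
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            index_dir(full, out, depth + 1)
        elif os.path.isfile(full):
            out.write('%s  <file name=%s size="%d"/>\n'
                      % (pad, quoteattr(entry), os.path.getsize(full)))
    out.write("%s</directory>\n" % pad)


def index_tree(start, out):
    """Write a complete index document for the tree rooted at start."""
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write("<IndexedFiles>\n")
    index_dir(start, out)
    out.write("</IndexedFiles>\n")


if __name__ == "__main__":
    import sys
    index_tree(sys.argv[1] if len(sys.argv) > 1 else ".", sys.stdout)
```

Symbolic links to directories are deliberately skipped so that a link pointing back up the tree cannot cause endless recursion.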
Building an Image Index
If you've ever visited an image library on the Internet, you've probably enjoyed (even taken for granted) the way a collection of small thumbnail images acts as links for full-sized counterparts. Many artists, when presenting a portfolio online, adopt this effective approach to displaying their work. With the rise of digital cameras and scanners, more and more people are finding themselves pulling directories full of images onto the Web in a format that makes for easy browsing. In the next section, we build a Python script that takes a full directory of images and thumbnail images and creates a master HTML page with the thumbnails acting as links to the full-size image. The saxthumbs.py program expects you to have a pre-existing directory of images and thumbnails, and operates on the output of the index.py script we created earlier.
In order for the saxthumbs.py SAX handler to correctly process a thumbnail directory, the images need to follow a naming convention (easily changeable by editing the code). Currently, the saxthumbs.py handler expects to find file elements within the XML document that have a corresponding <imagename>.jpg file that is the entire image, and a t-<imagename>.jpg file that is a thumbnail-size image.
When using index.py to create a list of your image files, point it to a directory that has image files named accordingly:
$> ls -l *newimage*
-rw-rw-r--   1 shm00    shm00       98197 Jan 18 11:08 newimage.jpg
-rw-rw-r--   1 shm00    shm00        5272 Jan 18 11:42 t-newimage.jpg
In this manner, every file that ends in .jpg and has a corresponding t-<imagename>.jpg file (note the size differences) is assimilated into the thumbnail index.
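The naming convention can be expressed in a few lines. The following is only a sketch of the pairing rule, assuming Python 3; it is not the saxthumbs.py handler itself, and pair_thumbnails with its prefix and ext parameters is a hypothetical helper.

```python
def pair_thumbnails(filenames, prefix="t-", ext=".jpg"):
    """Return (image, thumbnail) pairs that follow the naming convention.

    An image qualifies only when a file named prefix + image
    also appears in filenames.
    """
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        # Skip the thumbnails themselves and anything that is not a JPEG.
        if name.endswith(ext) and not name.startswith(prefix):
            thumb = prefix + name
            if thumb in names:
                pairs.append((name, thumb))
    return pairs


if __name__ == "__main__":
    listing = ["newimage.jpg", "t-newimage.jpg", "notes.txt", "lonely.jpg"]
    print(pair_thumbnails(listing))
```

Changing the convention then means changing only the prefix or ext argument.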
There is an easy way to set up your image files on Unix systems, using the convert command. This command is part of the ImageMagick package, and is installed by default by most modern Linux distributions. For other Unix systems, the package is available at http://www.imagemagick.org/.
$> convert image.jpg -geometry 192x128 t-image.jpg
This will take image.jpg, no matter how large it is, and make a 192x128 size thumbnail in JPEG format. Of course, if the image is a Windows bitmap image (with the
Additional content appearing in this section has been removed.
Converting XML to HTML
The PyXML package contains XML parsers, including PyExpat, as well as support for SAX and DOM, and much more. While learning the ropes of the PyXML package, it would be nice to have a comprehensive list of all the classes and methods. Since this is a programming book, it seems appropriate to write a Python program to extract the information we need—and in XML, no less!
Let's generate an XML file that details each of the files in the PyXML package, the classes therein, and the methods of each class. This process allows us to generate quick, usable XML. Rather than a replacement for all the snazzy code-to-documentation generators out there, Example 3-8 shows a simple, quick way to generate XML that we can experiment with and use throughout the examples in this chapter. After all, when manipulating XML, it helps to have a few hundred thousand bytes of it sitting around to play with. (This program also demonstrates how simple it is to examine all the files in a directory tree using the os.path.walk function.)
Example 3-8. genxml.py
"""
genxml.py

Descends PyXML tree, indexing source files and creating
XML tags for use in navigating the source.
"""

import os
import sys

from xml.sax.saxutils import escape


def process(filename, fp):
  print "* Processing:", filename,

  # parse the file
  pyFile = open(filename)
  fp.write("<file name=\"" + filename + "\">\n")
  inClass = 0
  line = pyFile.readline(  )
  while line:
    line = line.strip(  )
    if line.startswith("class") and line[-1] == ":":
      if inClass:
        fp.write(" </class>\n")
      inClass = 1
      fp.write(" <class name='" + line[:-1]  + "'>\n")

    elif line.find("def") > 0 and line[:-1] == ":" and inClass:
      fp.write("  <method name='" + escape(line[:-1]) + "'/>\n")

    line = pyFile.readline(  )

  pyFile.close(  )
  if inClass:
    fp.write(" </class>\n")
    inClass = 0

  fp.write("</file>\n")

def finder(fp, dirname, names):
  """Add files in the directory dirname to a list."""
  for name in names:
    if name.endswith(".py"):
      path = os.path.join(dirname, name)
      if os.path.isfile(path):
        process(path, fp)

def main(  ):
  print "[genxml.py started]"

  xmlFd = open("pyxml.xml", "w")
  xmlFd.write("<?xml version=\"1.0\"?>\n")
  xmlFd.write("<pyxml>\n")

  os.path.walk(sys.argv[1], finder, xmlFd)

  xmlFd.write("</pyxml>")
  xmlFd.close(  )

  print "[genxml.py finished]"

if __name__ == "__main__":
  main(  )
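Readers on Python 3 should note that os.path.walk was removed in Python 3.0; os.walk is the replacement. The following sketch shows the equivalent traversal under that assumption. The find_python_files function is a hypothetical helper that only collects the paths, leaving the per-file processing to the caller.

```python
import os


def find_python_files(root):
    """Return the sorted paths of all .py files beneath root."""
    found = []
    # os.walk yields one (dirname, subdirectories, filenames) tuple
    # for every directory in the tree.
    for dirname, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                found.append(os.path.join(dirname, name))
    return sorted(found)


if __name__ == "__main__":
    import sys
    for path in find_python_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

Where os.path.walk called a visitor function such as finder once per directory, os.walk is a generator, so the loop body takes the place of the callback.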
Advanced Parser Factory Usage
PyXML features several parsers, and multiple ways to instantiate them, depending on whether you're using SAX, trying to create a DOM tree, or doing something completely different. To support portable code, PyXML provides a ParserFactory class that supplies a SAX-ready parser guaranteed to be available in your runtime environment. Additionally, you can explicitly create a parser (or SAX driver) by dipping into any specific package, such as PyExpat. We show an example of each approach, but normally you should rely on the parser factory to instantiate a parser.
The make_parser function (imported from xml.sax) returns a SAX driver for the first available parser in the list that you supply, or returns an available parser if no list is specified or if the list contains parsers that are not found or cannot be loaded. The make_parser function has its roots as part of the xml.sax.saxexts.ParserFactory class, but it is better to import the function from xml.sax (more on this in a bit). For example:
from xml.sax import make_parser
parser = make_parser(  )
At the time of this writing, if you have PyXML installed, a call to make_parser without an argument is sure to return either a PyExpat or xmlproc driver. If you dig into the source of the xml.sax module, you will see this list supplied to the ParserFactory class. If you instantiate a parser factory directly out of xml.sax.saxexts, you need to be sure to supply a list containing the name of at least one valid parser, or it won't be able to create a parser:
>>> from xml.sax.saxexts import ParserFactory
>>> p = ParserFactory(  )   
>>> parser = p.make_parser(  )
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.0/site-packages/_xmlplus/sax/saxexts.py", 
	  line 77, in make_parser
    raise SAXReaderNotAvailable("No parsers found", None)
xml.sax._exceptions.SAXReaderNotAvailable: No parsers found
If you supply a list of parsers or drivers, you get what you're after:
>>> from xml.sax.saxexts import ParserFactory
>>> p = ParserFactory(["xml.sax.drivers.drv_pyexpat"])
>>> parser = p.make_parser(  )
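The fallback behavior of xml.sax.make_parser can be observed with a short sketch. It assumes Python 3's standard library, where the default list contains the Expat-based driver and the xml.sax.saxexts module shown above is not part of the standard distribution. The module name hypothetical_missing_driver is deliberately fictitious, and pick_parser is a helper invented for the illustration.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader


def pick_parser(preferred=()):
    """Ask make_parser for the first usable driver.

    Names in preferred are tried first; drivers that cannot be
    imported are skipped and the built-in default list is used.
    """
    return xml.sax.make_parser(list(preferred))


if __name__ == "__main__":
    # The requested driver does not exist, so a default one is returned.
    parser = pick_parser(["hypothetical_missing_driver"])
    print(isinstance(parser, XMLReader))
    print(type(parser).__module__)
```

Because an unavailable name is skipped rather than treated as an error, portable code can list its preferred drivers without checking beforehand which ones are installed.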
Native Parser Interfaces
Now that we've looked at how SAX can be used and have seen just how regular the code is to set up the parser and the ContentHandler, you may be wondering how much of that ease comes from using SAX and how much is a matter of convenience functions in the Python libraries. While we won't delve deeply into the native interfaces of the individual parsers, this is a good question, and can lead to some interesting observations.
The key advantage to using SAX is that the callback methods have the same names and significance regardless of the actual parser you use. There are at least two nice results of this: changing parsers does not affect your application, and your code is more maintainable because someone new to the code is more likely to know the SAX interface than any particular parser-specific interface.
So just how do the native interfaces to the individual parsers differ from SAX, and why would we choose to use them instead? Let's take a quick look at the PyExpat parser to get a taste of the differences.
Of course, to use PyExpat, you need to have it installed. It is included as part of the Python installer for Windows, and is built automatically on Unix if you have the Expat library installed. If you did not install PyExpat as part of Python, it is installed as part of the PyXML package.
PyExpat resides in the xml.parsers.expat module. If we want to modify our last example to use PyExpat directly, we don't have a lot of work to do, but there are a few changes. Since the PyExpat handler methods closely match the SAX handlers, at least for the basic use we demonstrate here, we can use the same handler class we've already written. The imports won't need to change much:
#!/usr/bin/env python

import sys

from xml.parsers import expat
from handlers    import PyXMLConversionHandler
Once the parser is imported, it can be created and used:
parser = expat.ParserCreate(  )
Were we to do this at the interactive prompt, we could poke at the parser object to see what attributes it has:
>>> from xml.parsers import expat
>>> parser = expat.ParserCreate(  )
>>> dir(parser)
['CharacterDataHandler', 'CommentHandler', 'DefaultHandler',
 'DefaultHandlerExpand', 'EndCdataSectionHandler', 'EndElementHandler',
 'EndNamespaceDeclHandler', 'ErrorByteIndex', 'ErrorCode',
 'ErrorColumnNumber', 'ErrorLineNumber', 'ExternalEntityParserCreate',
 'ExternalEntityRefHandler', 'GetBase', 'NotStandaloneHandler',
 'NotationDeclHandler', 'Parse', 'ParseFile', 'ProcessingInstructionHandler',
 'SetBase', 'StartCdataSectionHandler', 'StartElementHandler',
 'StartNamespaceDeclHandler', 'UnparsedEntityDeclHandler',
 'ordered_attributes', 'returns_unicode', 'specified_attributes']
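The handler attributes in that listing are where callbacks get attached. A minimal sketch of driving the parser this way follows, assuming Python 3's xml.parsers.expat module; it does not use the PyXMLConversionHandler class mentioned above, and collect_events is a name invented for the illustration. Instead of passing a handler object, plain functions are assigned to the StartElementHandler, EndElementHandler, and CharacterDataHandler attributes.

```python
from xml.parsers import expat


def collect_events(data):
    """Parse data with the native Expat interface and return its events."""
    events = []
    parser = expat.ParserCreate()
    # Deliver each run of text in one call instead of several pieces.
    parser.buffer_text = True

    def on_start(name, attrs):
        # attrs arrives as an ordinary dictionary.
        events.append(("start", name, attrs))

    def on_end(name):
        events.append(("end", name))

    def on_text(text):
        if text.strip():
            events.append(("text", text.strip()))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    # The second argument marks this as the final chunk of input.
    parser.Parse(data, True)
    return events


if __name__ == "__main__":
    for event in collect_events(b'<a x="1"><b>hi</b></a>'):
        print(event)
```

The callbacks have the same shape as their SAX counterparts, which is why a SAX handler's methods can often be reused, but the attribute names and the Parse call are specific to Expat.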
Chapter 4: The Document Object Model
The Document Object Model (DOM) is an interface that exposes document structure programmatically to developers. Perhaps the most common application of the DOM is "Dynamic HTML" (DHTML), where an HTML document can be modified programmatically within the browser using an embedded scripting language. Typically, the scripting language is some flavor of ECMAScript (such as JavaScript or JScript), since most browsers support it, but others can be used as well. (For browsers on Windows, this can even be Python!) This allows you to change the background color of a table cell, or dynamically change font faces after the page is in the browser. The DOM defines the interface for vendors to offer compatible APIs.
The DOM is also extremely useful when exposed by a library such as the Python Standard Library or PyXML. It can allow you to use Python to manipulate an XML document already in memory. With the DOM interfaces, you can either change or extract portions of the document.
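Both uses, extracting and changing, can be shown in a brief sketch. It assumes Python 3 and the standard library's xml.dom.minidom implementation; the retitle function and the sample markup are made up for the illustration.

```python
from xml.dom import minidom


def retitle(xml_text, new_title):
    """Read the header's title, replace it, and return both results.

    Returns a tuple of (old title, serialized modified document).
    """
    doc = minidom.parseString(xml_text)
    header = doc.getElementsByTagName("header")[0]
    old_title = header.getAttribute("title")      # extract
    header.setAttribute("title", new_title)       # change
    result = doc.documentElement.toxml()
    doc.unlink()  # release the tree's internal references
    return old_title, result


if __name__ == "__main__":
    old, changed = retitle(
        '<webArticle><header title="NASA Builds Warp Drive"/></webArticle>',
        "Warp Drive Story Retracted")
    print(old)
    print(changed)
```

Unlike a SAX handler, this code can reach any part of the document in any order, at the cost of holding the whole tree in memory.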
The Document Object Model is defined in a series of recommendations from the W3C. The specifications clearly cover XML (or we would not be describing them in this book), but they cover other things as well. The initial version of the DOM actually came from the HTML world; browser vendors invented it in various flavors as part of the APIs available to client-side scripts embedded in web pages. Since the vendors each implemented different interfaces, there was a call from content creators to have a standardized interface so their pages would work in at least roughly equivalent ways on the different browsers. Since the W3C is the best available shared ground on which the vendors could build a common specification, the DOM specifications are developed there.
All standards organizations face issues regarding the longevity of their specifications, and the W3C is no exception, even though it is quite young compared to more traditional standards groups such as ANSI and ISO. Given its relative youth, the W3C has had to deal with these issues almost from the start due to the rapid pace of development and the way standards are applied on the Internet. It does follow a traditional model, however, rather than the less formal (though highly effective) model of the Internet Engineering Task Force (IETF).
The DOM Specifications