Search the Catalog
Perl & XML

Perl & XML

By Erik T. Ray, Jason McIntosh
April 2002
0-596-00205-X, Order Number: 205x
224 pages, $34.95 US $54.95 CA

Chapter 3
XML Basics: Reading and Writing

This chapter covers the two most important tasks in working with XML: reading it into memory and writing it out again. XML is a structured, predictable, and standard data storage format, and as such carries a price. Unlike the line-by-line, make-it-up-as-you-go style that typifies text hacking in Perl, XML expects you to learn the rules of its game--the structures and protocols outlined in Chapter 2--before you can play with it. Fortunately, much of the hard work is already done, in the form of module-based parsers and other tools that trailblazing Perl and XML hackers already created (some of which we touched on in Chapter 1).

Knowing how to use parsers is very important. They typically drive the rest of the processing for you, or at least get the data into a state where you can work with it. Any good programmer knows that getting the data ready is half the battle. We'll look deeply into the parsing process and detail the strategies used to drive processing.

Parsers come with a bewildering array of options that let you configure the output to your needs. Which character set should you use? Should you validate the document or merely check if it's well formed? Do you need to expand entity references, or should you keep them as references? How can you set handlers for events or tell the parser to build a tree for you? We'll explain these options fully so you can get the most out of parsing.

Finally, we'll show you how to spit XML back out, which can be surprisingly tricky if one isn't aware of XML's expectations regarding text encoding. Getting this step right is vital if you ever want to be able to use your data again without painful hand fixing.

XML Parsers

File I/O is an intrinsic part of any programming language, but it has always been done at a fairly low level: reading a character or a line at a time, running it through a regular expression filter, etc. Raw text is an unruly commodity, lacking any clear rules for how to separate discrete portions, other than basic, flat concepts such as newline-separated lines and tab-separated columns. Consequently, more data packaging schemes are available than even the chroniclers of Babel could have foreseen. It's from this cacophony that XML has risen, providing clear rules for how to create boundaries between data, assign hierarchy, and link resources in a predictable, unambiguous fashion. A program that relies on these rules can read any well-formed XML document, as if someone had jammed a babelfish into its ear.[1]

Where can you get this babelfish to put in your program's ear? An XML parser is a program or code library that translates XML data into either a stream of events or a data object, giving your program direct access to structured data. The XML can come from one or more files or filehandles, a character stream, or a static string. It could be peppered with entity references that may or may not need to be resolved. Some of the parts could come from outside your computer system, living in some far corner of the Internet. It could be encoded in a Latin character set, or perhaps in a Japanese set. Fortunately for you, the developer, none of these details have to be accounted for in your program because they are all taken care of by the parser, an abstract tunnel between the physical state of data and the crystallized representation seen by your subroutines.

An XML parser acts as a bridge between marked-up data (data packaged with embedded XML instructions) and some predigested form your program can work with. In Perl's case, we mean hashes, arrays, scalars, and objects made of references to these old friends. XML can be complex, residing in many files or streams, and can contain unresolved regions (entities) that may need to be patched up. Also, a parser usually tries to accept only good XML, rejecting it if it contains well-formedness errors. Its output has to reflect the structure (order, containment, associative data) while ignoring irrelevant details such as what files the data came from and what character set was used. That's a lot of work. To itemize these points, an XML parser:

In XML, data and markup are mixed together, so the parser first has to sift through a character stream and tell the two apart. Certain characters delimit the instructions from data, primarily angle brackets (< and >) for elements, comments, and processing instructions, and ampersand (&) and semicolon (;) for entity references. The parser also knows when to expect a certain instruction, or if a bad instruction has occurred; for example, an element that contains data must bracket the data in both a start and end tag. With this knowledge, the parser can quickly chop a character stream into discrete portions as encoded by the XML markup.

The next task is to fill in placeholders. Entity references may need to be resolved. Early in the process of reading XML, the processor will have encountered a list of placeholder definitions in the form of entity declarations, which associate a brief identifier with an entity. The identifier is some literal text defined in the document's DTD, and the entity itself can be defined right there or at the business end of a URL. These entities can themselves contain entity references, so the process of resolving an entity can take several iterations before the placeholders are filled in.

You may not always want entities to be resolved. If you're just spitting XML back out after some minor processing, then you may want to turn entity resolution off or substitute your own routine for handling entity references. For example, you may want to resolve external entity references (entities whose values are in locations external to the document, pointed to by URLs), but not resolve internal ones. Most parsers give you the ability to do this, but none will let you use entity references without declaring them.

That leads to the third task. If you allow the parser to resolve external entities, it will fetch all the documents, local or remote, that contain parts of the larger XML document. In doing so, all these entities get smushed into one unbroken document. Since your program usually doesn't need to know how the document is distributed physically, information about the physical origin of any piece of data goes away once it knits the whole document together.

While interpreting the markup, the parser may trip over a syntactic error. XML was designed to make it very easy to spot such errors. Everything from attributes to empty element tags have rigid rules for their construction so a parser doesn't have to think very hard about it. For example, the following piece of XML has an obvious error. The start tag for the <decree> element contains an attribute with a defective value assignment. The value "now" is missing a second quote character, and there's another error, somewhere in the end tag. Can you see it?

<decree effective="now>All motorbikes 
shall be painted red.</decree<

When such an error occurs, the parser has little choice but to shut down the operation. There's no point in trying to parse the rest of the document. The point of XML is to make things unambiguous. If the parser had to guess how the document should look,[2] it would open up the data to uncertainty and you'd lose that precious level of confidence in your program. Instead, the XML framers (wisely, we feel) opted to make XML parsers choke and die on bad XML documents. If the parser likes your XML, it is said to be well formed.

What do we mean by "grammatical errors"? You will encounter them only with so-called validating parsers. A document is considered to be valid if it passes a test defined in a DTD. XML-based languages and applications often have DTDs to set a minimal standard above well-formedness for how elements and data should be ordered. For example, the W3C has posted at least one DTD to describe XHTML (the XML-compliant flavor of HTML), listing all elements that can appear, where they can go, and what they can contain. It would be grammatically correct to put a <p> element inside a <body>, but putting <p> inside <head>, for example, would be incorrect. And don't even think about inserting an element <blooby> anywhere in the document, because it isn't declared anywhere in the DTD.[3] If even one error of this type is in a document, then the whole document is considered invalid. It may be well formed, but not valid against the particular DTD. Often, this level of checking is more of a burden than a help, but it's available if you need it.

Rounding out our list is the requirement that a parser ship the digested data to a program or end user. You can do this in many ways, and we devote much of the rest of the book in analyzing them. We can break up the forms into a few categories:

Event stream
First, a parser can generate an event stream: the parser converts a stream of markup characters into a new kind of stream that is more abstract, with data that is partially processed and easier to handle by your program.

Object Representation
Second, a parser can construct a data structure that reflects the information in the XML markup. This construction requires more resources from your system, but may be more convenient because it creates a persistent object that will wait around while you work on it.

Hybrid form
We might call the third group "hybrid" output. It includes parsers that try to be smart about processing, using some advance knowledge about the document to construct an object representing only a portion of your document.

Example (of What Not to Do): A Well-Formedness Checker

We've described XML parsers abstractly, but now it's time to get our hands dirty. We're going to write our own parser whose sole purpose is to check whether a document is well-formed XML or if it fails the basic test. This is about the simplest a parser can get; it doesn't drive any further processing, but just returns a "yes" or "no."

Our mission here is twofold. First, we hope to shave some of the mystique off of XML processing--at the end of the day, it's just pushing text around. However, we also want to emphasize that writing a proper parser in Perl (or any language) requires a lot of work, which would be better spent writing more interesting code that uses one of the many available XML-parsing Perl modules. To that end, we'll write only a fraction of a pure-Perl XML parser with a very specific goal in mind.

WARNING:   Feel free to play with this program, but please don't try to use this code in a production environment! It's not a real Perl and XML solution, but an illustration of the sorts of things that parsers do. Also, it's incomplete and will not always give correct results, as we'll show later. Don't worry; the rest of this book talks about real XML parsers and Perl tools you'll want to use.

The program is a loop in which regular expressions match XML markup objects and pluck them out of the text. The loop runs until nothing is left to remove, meaning the document is well formed, or until the regular expressions can't match anything in the remaining text, in which case it's not well-formed. A few other tests could abort the parsing, such as when an end tag is found that doesn't match the name of the currently open start tag. It won't be perfect, but it should give you a good idea of how a well-formedness parser might work.

Example 3-1 is a routine that parses a string of XML text, tests to see if it is well-formed, and returns a boolean value. We've added some pattern variables to make it easier to understand the regular expressions. For example, the string $ident contains regular expression code to match an XML identifier, which is used for elements, attributes, and processing instructions.

Example 3-1: A rudimentary XML parser

sub is_well_formed {
    my $text = shift;                     # XML text to check
 
    # match patterns
    my $ident = '[:_A-Za-z][:A-Za-z0-9\-\._]*';   # identifier
    my $optsp = '\s*';                            # optional space
    my $att1 = "$ident$optsp=$optsp\"[^\"]*\"";   # attribute
    my $att2 = "$ident$optsp=$optsp'[^']*'";      # attr. variant
    my $att = "($att1|$att2)";                    # any attribute
 
    my @elements = (  );                    # stack of open elems
 
    # loop through the string to pull out XML markup objects
    while( length($text) ) {
 
        # match an empty element
        if( $text =~ /^&($ident)(\s+$att)*\s*\/>/ ) {
            $text = $';
 
        # match an element start tag
        } elsif( $text =~ /^&($ident)(\s+$att)*\s*>/ ) {
            push( @elements, $1 );
            $text = $';
 
        # match an element end tag
        } elsif( $text =~ /^&\/($ident)\s*>/ ) {
            return unless( $1 eq pop( @elements ));
            $text = $';
 
        # match a comment
        } elsif( $text =~ /^&!--/ ) {
            $text = $';
            # bite off the rest of the comment
            if( $text =~ /-->/ ) {
                $text = $';
                return if( $` =~ /--/ );  # comments can't
                                            # contain '--'
            } else {
                return;
            }
 
        # match a CDATA section
        } elsif( $text =~ /^&!\[CDATA\[/ ) {
            $text = $';
            # bite off the rest of the comment
            if( $text =~ /\]\]>/ ) {
                $text = $';
            } else {
                return;
            }
 
        # match a processing instruction
        } elsif( $text =~ m|^&\?$ident\s*[^\?]+\?>| ) {
            $text = $';
 
        # match extra whitespace
        # (in case there is space outside the root element)
        } elsif( $text =~ m|^\s+| ) {
            $text = $';
 
        # match character data
        } elsif( $text =~ /(^[^&&>]+)/ ) {
            my $data = $1;
            # make sure the data is inside an element
            return if( $data =~ /\S/ and not( @elements ));
            $text = $';
            
        # match entity reference
        } elsif( $text =~ /^&$ident;+/ ) {
            $text = $';
         
        # something unexpected
        } else {
            return;
        }
    }
    return if( @elements );     # the stack should be empty
    return 1;
}

Perl's arrays are so useful partly due to their ability to masquerade as more abstract computer science data constructs.[4] Here, we use a data structure called a stack, which is really just an array that we access with push( ) and pop( ). Items in a stack are last-in, first-out (LIFO), meaning that the last thing put into it will be the first thing to be removed from it. This arrangement is convenient for remembering the names of currently open elements because at any time, the next element to be closed was the last element pushed onto the stack. Whenever we encounter a start tag, it will be pushed onto the stack, and it will be popped from the stack when we find an end tag. To be well-formed, every end tag must match the previous start tag, which is why we need the stack.

The stack represents all the elements along a branch of the XML tree, from the root down to the current element being processed. Elements are processed in the order in which they appear in a document; if you view the document as a tree, it looks like you're going from the root all the way down to the tip of a branch, then back up to another branch, and so on. This is called depth-first order, the canonical way all XML documents are processed.

There are a few places where we deviate from the simple looping scheme to do some extra testing. The code for matching a comment is several steps, since it ends with a three-character delimiter, and we also have to check for an illegal string of dashes "--" inside the comment. The character data matcher, which performs an extra check to see if the stack is empty, is also noteworthy; if the stack is empty, that's an error because nonwhitespace text is not allowed outside of the root element. Here is a short list of well-formedness errors that would cause the parser to return a false result:

Try the parser out on some test cases. Probably the simplest complete, well-formed XML document you will ever see is this:

<:-/> 

The next document should cause the parser to halt with an error. (Hint: look at the <message> end tag.)

<memo>
  <to>self</to>
  <message>Don't forget to mow the car and wash the
  lawn.<message>
</memo>

Many other kinds of syntax errors could appear in a document, and our program picks up most of them. However, it does miss a few. For example, there should be exactly one root element, but our program will accept more than one:

<root>I am the one, true root!</root>
<root>No, I am!</root>
<root>Uh oh...</root>

Other problems? The parser cannot handle a document type declaration. This structure is sometimes seen at the top of a document that specifies a DTD for validating parsers, and it may also declare some entities. With a specialized syntax of its own, we'd have to write another loop just for the document type declaration.

Our parser's most significant omission is the resolution of entity references. It can check basic entity reference syntax, but doesn't bother to expand the entity and insert it into the text. Why is that bad? Consider that an entity can contain more than just some character data. It can contain any amount of markup, too, from an element to a big, external file. Entities can also contain other entity references, so it might require many passes to resolve one entity reference completely. The parser doesn't even check to see if the entities are declared (it couldn't anyway, since it doesn't know how to read a document type declaration syntax). Clearly, there is a lot of room for errors to creep into a document through entities, right under the nose of our parser. To fix the problems just mentioned, follow these steps:

  1. Add a parsing loop to read in a document type declaration before any other parsing occurs. Any entity declarations would be parsed and stored, so we can resolve entity references later in the document.
  2. Parse the DTD, if the document type declaration mentions one, to read any entity declarations.
  3. In the main loop, resolve all entity references when we come across them. These entities have to be parsed, and there may be entity references within them, too. The process can be rather loopy, with loops inside loops, recursion, or other complex programming stunts.

What started out as a simple parser now has grown into a complex beast. That tells us two things: that the theory of parsing XML is easy to grasp; and that, in practice, it gets complicated very quickly. This exercise was useful because it showed issues involved in parsing XML, but we don't encourage you to write code like this. On the contrary, we expect you to take advantage of the exhaustive work already put into making ready-made parsers. Let's leave the dark ages and walk into the happy land of prepackaged parsers.

XML::Parser

Writing a parser requires a lot of work. You can't be sure if you've covered everything without a lot of testing. Unless you're a mutant who loves to write efficient, low-level parser code, your program will probably be slow and resource-intensive. The good news is that a wide variety of free, high quality, and easy-to-use XML parser packages (written by friendly mutants) already exist to help you. People have bashed Perl and XML together for years, and you have a barnful of conveniently pre-invented wheels at your disposal.

Where do Perl programmers go to find ready-made modules to use in their programs? They go to the Comprehensive Perl Archive Network (CPAN), a many-mirrored public resource full of free, open-source Perl code. If you aren't familiar with using CPAN, you must change your isolationist ways and learn to become a programmer of the world. You'll find a multitude of modules authored by folks who have walked the path of Perl and XML before you, and who've chosen to share the tools they've made with the rest of the world.

TIP:   Don't think of CPAN as a catalog of ready-made solutions for all specific XML problems. Rather, look at it as a toolbox or a source of building blocks you can assemble and configure to craft a solution. While some modules specialize in popular XML applications like RSS and SOAP, most are more general-purpose. Chances are, you won't find a module that specifically addresses your needs. You'll more likely take one of the general XML modules and adapt it somehow. We'll show that this process is painless and reveal several ways to configure general modules to your particular application.

XML parsers differ from one another in two major ways. First, they differ in their parsing style, which is how the parser works with XML. There are a few different strategies, such as building a data structure or creating an event stream. Another attribute of parsers, called standards-completeness, is a spectrum ranging from ad hoc on one extreme to an exhaustive, standards-based solution on the other. The balance on the latter axis is slowly moving from the eccentric, nonstandard side toward the other end as the Perl community agrees on how to implement major standards like SAX and DOM.

The XML::Parser module is the great-grandpappy of all Perl-based XML processors. It is a multifaceted parser, offering a handful of different parsing styles. On the standards axis, it's closer to ad hoc than standards-compliant; however, being the first efficient XML parser to appear on the Perl horizon, it has a dear place in our hearts and is still very useful. While XML::Parser uses a nonstandard API and has a reputation for getting a bit persnickety over some issues, it works. It parses documents with reasonable speed and flexibility, and as all Perl hackers know, people tend to glom onto the first usable solution that appears on the radar, no matter how ugly it is. Thus, nearly all of the first few years' worth of Perl and XML modules and programs based themselves on XML::Parser.

Since 2001 or so, however, other low-level parsing modules have emerged that base themselves on faster and more standards-compliant core libraries. We'll touch on these modules shortly. However, we'll start out with an examination of XML::Parser, giving a nod to its venerability and functionality.

In the early days of XML, a skilled programmer named James Clark wrote an XML parser library in C and called it Expat.[5] Fast, efficient, and very stable, it became the parser of choice among early adopters of XML. To bring XML into the Perl realm, Larry Wall wrote a low-level API for it and called the module XML::Parser::Expat. Then he built a layer on top of that, XML::Parser, to serve as a general-purpose parser for everybody. Now maintained by Clark Cooper, XML::Parser has served as the foundation of many XML modules.

The C underpinnings are the secret to XML::Parser's success. We've seen how to write a basic parser in Perl. If you apply our previous example to a large XML document, you'll wait a long time before it finishes. Others have written complete XML parsers in Perl that are portable to any system, but you'll find much better performance in a compiled C parser like Expat. Fortunately, as with every other Perl module based on C code (and there are actually lots of these modules because they're not too hard to make, thanks to Perl's standard XS library),[6] it's easy to forget you're driving Expat around when you use XML::Parser.

Example: Well-Formedness Checker Revisited

To show how XML::Parser might be used, let's return to the well-formedness checker problem. It's very easy to create this tool with XML::Parser, as shown in Example 3-2.

Example 3-2: Well-formedness checker using XML::Parser

use XML::Parser;
 
my $xmlfile = shift @ARGV;              # the file to parse
 
# initialize parser object and parse the string
my $parser = XML::Parser->new( ErrorContext => 2 );
eval { $parser->parsefile( $xmlfile ); };
 
# report any error that stopped parsing, or announce success
if( $@ ) {
    $@ =~ s/at \/.*?$//s;               # remove module line number
    print STDERR "\nERROR in '$xmlfile':\n$@\n";
} else {
    print STDERR "'$xmlfile' is well-formed\n";
}

Here's how this program works. First, we create a new XML::Parser object to do the parsing. Using an object rather than a static function call means that we can configure the parser once and then process multiple files without the overhead of repeatedly recreating the parser. The object retains your settings and keeps the Expat parser routine alive for as long as you want to parse files, and then cleans everything up when you're done.

Next, we call the parsefile( ) method inside an eval block because XML::Parser tends to be a little overzealous when dealing with parse errors. If we didn't use an eval block, our program would die before we had a chance to do any cleanup. We check the variable $@ for content in case there was an error. If there was, we remove the line number of the module at which the parse method "died" and then print out the message.

When initializing the parser object, we set an option ErrorContext => 2. XML::Parser has several options you can set to control parsing. This one is a directive sent straight to the Expat parser that remembers the context in which errors occur and saves two lines before the error. When we print out the error message, it tells us what line the error happened on and prints out the region of text with an arrow pointing to the offending mistake.

Here's an example of our checker choking on a syntactic faux pas (where we decided to name our program xwf as an XML well-formedness checker):

$ xwf ch01.xml 
 
ERROR in 'ch01.xml':
 
not well-formed (invalid token) at line 66, column 22, byte 2354:
 
<chapter id="dorothy-in-oz">
<title>Lions, Tigers & Bears</title>
=====================^

Notice how simple it is to set up the parser and get powerful results. What you don't see until you run the program yourself is that it's fast. When you type the command, you get a result in a split second.

You can configure the parser to work in different ways. You don't have to parse a file, for example. Use the method parse( ) to parse a text string instead. Or, you could give it the option NoExpand => 1 to override default entity expansion with your own entity resolver routine. You could use this option to prevent the parser from opening external entities, limiting the scope of its checking.

Although the well-formedness checker is a very useful tool that you certainly want in your XML toolbox if you work with XML files often, it only scratches the surface of what we can do with XML::Parser. We'll see in the next section that a parser's most important role is in shoveling packaged data into your program. How it does this depends on the particular style you select.

Parsing Styles

XML::Parser supports several different styles of parsing to suit various development strategies. The style doesn't change how the parser reads XML. Rather, it changes how it presents the results of parsing. If you need a persistent structure containing the document, you can have it. Or, if you'd prefer to have the parser call a set of routines you write, you can do it that way. You can set the style when you initialize the object by setting the value of style. Here's a quick summary of the available styles:

Debug
This style prints the document to STDOUT, formatted as an outline (deeper elements are indented more). parse( ) doesn't return anything special to your program.

Tree
This style creates a hierarchical, tree-shaped data structure that your program can use for processing. All elements and their data are crystallized in this form, which consists of nested hashes and arrays.

Object
Like tree, this method returns a reference to a hierarchical data structure representing the document. However, instead of using simple data aggregates like hashes and lists, it consists of objects that are specialized to contain XML markup objects.

Subs
This style lets you set up callback functions to handle individual elements. Create a package of routines named after the elements they should handle and tell the parser about this package by using the pkg option. Every time the parser finds a start tag for an element called <fooby>, it will look for the function fooby( ) in your package. When it finds the end tag for the element, it will try to call the function _fooby( ) in your package. The parser will pass critical information like references to content and attributes to the function, so you can do whatever processing you need to do with it.

Stream
Like Subs, you can define callbacks for handling particular XML components, but callbacks are more general than element names. You can write functions called handlers to be called for "events" like the start of an element (any element, not just a particular kind), a set of character data, or a processing instruction. You must register the handler package with either the Handlers option or the setHandlers( ) method.

custom
You can subclass the XML::Parser class with your own object. Doing so is useful for creating a parser-like API for a more specific application. For example, the XML::Parser::PerlSAX module uses this strategy to implement the SAX event processing standard.

Example 3-3 is a program that uses XML::Parser with Style set to Tree. In this mode, the parser reads the whole XML document while building a data structure. When finished, it hands our program a reference to the structure that we can play with.

Example 3-3: An XML tree builder

use XML::Parser;
 
# initialize parser and read the file
$parser = new XML::Parser( Style => 'Tree' );
my $tree = $parser->parsefile( shift @ARGV );
 
# serialize the structure
use Data::Dumper;
print Dumper( $tree );

In tree mode, the parsefile( ) method returns a reference to a data structure containing the document, encoded as lists and hashes. We use Data::Dumper, a handy module that serializes data structures, to view the result. Example 3-4 is the datafile.

Example 3-4: An XML datafile

<preferences>
  <font role="console">
    <fname>Courier</name>
    <size>9</size>
  </font>
  <font role="default">
    <fname>Times New Roman</name>
    <size>14</size>
  </font>
  <font role="titles">
    <fname>Helvetica</name>
    <size>10</size>
  </font>
</preferences>

With this datafile, the program produces the following output (condensed and indented to be easier to read):

$tree = [ 
          'preferences', [ 
            {}, 0, '\n', 
            'font', [ 
              { 'role' => 'console' }, 0, '\n',
              'size', [ {}, 0, '9' ], 0, '\n', 
              'fname', [ {}, 0, 'Courier' ], 0, '\n'
            ], 0, '\n',
            'font', [ 
              { 'role' => 'default' }, 0, '\n',
              'fname', [ {}, 0, 'Times New Roman' ], 0, '\n',
              'size', [ {}, 0, '14' ], 0, '\n'
            ], 0, '\n', 
            'font', [ 
               { 'role' => 'titles' }, 0, '\n',
               'size', [ {}, 0, '10' ], 0, '\n',
               'fname', [ {}, 0, 'Helvetica' ], 0, '\n',
            ], 0, '\n',
          ]
        ];

It's a lot easier to write code that dissects the above structure than to write a parser of your own. We know, because the parser returned a data structure instead of dying mid-parse, that the document was 100 percent well-formed XML. In Chapter 4, we will use the Stream mode of XML::Parser, and in Chapter 6, we'll talk more about trees and objects.

Stream-Based Versus Tree-Based Processing

Remember the Perl mantra, "There's more than one way to do it"? It is also true when working with XML. Depending on how you want to work and what kind of resources you have, many options are available. One developer may prefer a low-maintenance parsing job and is prepared to be loose and sloppy with memory to get it. Another will need to squeeze out faster and leaner performance at the expense of more complex code. XML processing tasks vary widely, so you should be free to choose the shortest path to a solution.

There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. We call this pipeline an event stream because each chunk of data sent to the program signals something new and interesting in the XML stream. For example, the beginning of a new element is a significant event. So is the discovery of a processing instruction in the markup. With each update, your program does something new--perhaps translating the data and sending it to another place, testing it for some specific content, or sticking it onto a growing heap of data.

With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. Instead of a pipeline, it's like a camera that takes a picture and transmits the replica to you. The model is usually in a much more convenient state than raw XML. For example, nested elements may be represented in native Perl structures like lists or hashes, as we saw in an earlier example. Even more useful are trees of blessed objects with methods that help navigate the structure from one place to another. The whole point to this strategy is that your program can pull out any data it needs, in any order.

Why would you prefer one over the other? Each has strong and weak points. Event streams are fast and often have a much slimmer memory footprint, but at the expense of greater code complexity and impermanent data. Tree building, on the other hand, lets the data stick around for as long as you need it, and your code is usually simple because you don't need special tricks to do things like backwards searching. However, trees wither when it comes to economical use of processor time and memory.

All of this is relative, of course. Small documents don't cause much hardship to a typical computer, especially since CPU cycles and megabytes are getting cheaper every day. Maybe the convenience of a persistent data structure will outweigh any drawbacks. On the other hand, when working with Godzilla-sized documents like books, or huge numbers of documents all at once, you'll definitely notice the crunch. Then the agility of event stream processors will start to look better. It's impossible to give you any hard-and-fast rules, so we'll leave the decision up to you.

An interesting thing to note about the stream-based and tree-based strategies is that one is the basis for the other. That's right, an event stream drives the process of building a tree data structure. Thus, most low-level parsers are event streams because you can always write a tree building layer on top. This is how XML::Parser and most other parsers work.

In a related, more recent, and very cool development, XML event streams can also turn any kind of document into some form of XML by writing stream-based parsers that generate XML events from whatever data structures lurk in that document type.

There's a lot more to say about event streams and tree builders--so much, in fact, that we've devoted two whole chapters to the topics. Chapter 4 takes a deep plunge into the theory behind event streams with lots of examples for making useful programs out of them. Chapter 6 takes you deeper into the forest with lots of tree-based examples. After that, Chapter 8 shows you unusual hybrids that provide the best of both worlds.

Putting Parsers to Work

Enough tinkering with the parser's internal details. We want to see what you can do with the stuff you get from parsers. We've already seen an example of a complete, parser-built tree structure in Example 3-3, so let's do something with the other type. We'll take an XML event stream and make it drive processing by plugging it into some code to handle the events. It may not be the most useful tool in the world, but it will serve well enough to show you how real-world XML processing programs are written.

XML::Parser (with Expat running underneath) is at the input end of our program. Expat subscribes to the event-based parsing school we described earlier. Rather than loading your whole XML document into memory and then turning around to see what it hath wrought, it stops every time it encounters a discrete chunk of data or markup, such as an angle-bracketed tag or a literal string inside an element. It then checks to see if our program wants to react to it in any way.

Your first responsibility is to give the parser an interface to the pertinent bits of code that handle events. Each type of event is handled by a different subroutine, or handler. We register our handlers with the parser by setting the Handlers option at initialization time. Example 3-5 shows the entire process.

Example 3-5: A stream-based XML processor

use XML::Parser;
 
# initialize the parser
my $parser = XML::Parser->new( Handlers => 
                                     {
                                      Start=>\&handle_start,
                                      End=>\&handle_end,
                                     });
$parser->parsefile( shift @ARGV );
 
my @element_stack;                # remember which elements are open
 
# process a start-of-element event: print message about element
#
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
 
    # ask the expat object about our position
    my $line = $expat->current_line;
 
    print "I see an $element element starting on line $line!\n";
 
    # remember this element and its starting position by pushing a
    # little hash onto the element stack
    push( @element_stack, { element=>$element, line=>$line });
 
    if( %attrs ) {
        print "It has these attributes:\n";
        while( my( $key, $value ) = each( %attrs )) {
            print "\t$key => $value\n";
        }
    }
}
 
# process an end-of-element event
#
sub handle_end {
    my( $expat, $element ) = @_;
 
    # We'll just pop from the element stack with blind faith that
    # we'll get the correct closing element, unlike what our
    # homebrewed well-formedness did, since XML::Parser will scream
    # bloody murder if any well-formedness errors creep in.
    my $element_record = pop( @element_stack );
    print "I see that $element element that started on line ",
          $$element_record{ line }, " is closing now.\n";
}

It's easy to see how this process works. We've written two handler subroutines called handle_start( ) and handle_end( ) and registered each with a particular event in the call to new( ). When we call parse( ), the parser knows it has handlers for a start-of-element event and an end-of-element event. Every time the parser trips over an element start tag, it calls the first handler and gives it information about that element (element name and attributes). Similarly, any end tag it encounters leads to a call of the other handler with similar element-specific information.

Note that the parser also gives each handler a reference called $expat. This is a reference to the XML::Parser::Expat object, a low-level interface to Expat. It has access to interesting information that might be useful to a program, such as line numbers and element depth. We've taken advantage of this fact, using the line number to dazzle users with our amazing powers of document analysis.

Want to see it run? Here's how the output looks after processing the customer database document from Example 1-1:

I see a spam-document element starting on line 1!
It has these attributes:
        version => 3.5
        timestamp => 2002-05-13 15:33:45
I see a customer element starting on line 3!
I see a first-name element starting on line 4!
I see that the first-name element that started on line 4 is closing now.
I see a surname element starting on line 5!
I see that the surname element that started on line 5 is closing now.
I see a address element starting on line 6!
I see a street element starting on line 7!
I see that the street element that started on line 7 is closing now.
I see a city element starting on line 8!
I see that the city element that started on line 8 is closing now.
I see a state element starting on line 9!
I see that the state element that started on line 9 is closing now.
I see a zip element starting on line 10!
I see that the zip element that started on line 10 is closing now.
I see that the address element that started on line 6 is closing now.
I see a email element starting on line 12!
I see that the email element that started on line 12 is closing now.
I see a age element starting on line 13!
I see that the age element that started on line 13 is closing now.
I see that the customer element that started on line 3 is closing now.
  [... snipping other customers for brevity's sake ...]
I see that the spam-document element that started on line 1 is closing now.

Here we used the element stack again. We didn't actually need to store the elements' names ourselves; one of the methods you can call on the XML::Parser::Expat object returns the current context list, a newest-to-oldest ordering of all elements our parser has probed into. However, a stack proved to be a useful way to store additional information like line numbers. It shows off the fact that you can let events build up structures of arbitrary complexity--the "memory" of the document's past.

There are many more event types than we handle here. We don't do anything with character data, comments, or processing instructions, for example. However, for the purpose of this example, we don't need to go into those event types. We'll have more exhaustive examples of event processing in the next chapter, anyway.

Before we close the topic of event processing, we want to mention one thing: the Simple API for XML processing, more commonly known as SAX. It's very similar to the event processing model we've seen so far, but the difference is that it's a W3C-supported standard. Being a W3C-supported standard means that it has a standardized, canonical set of events. How these events should be presented for processing is also standardized. The cool thing about it is that with a standard interface, you can hook up different program components like Legos and it will all work. If you don't like one parser, just plug in another (and sophisticated tools like the XML::SAX module family can even help you pick a parser based on the features you need). Get your XML data from a database, a file, or your mother's shopping list; it shouldn't matter where it comes from. SAX is very exciting for the Perl community because we've long been criticized for our lack of standards compliance and general barbarism. Now we can be criticized for only one of those things. You can expect a nice, thorough discussion on SAX (specifically, PerlSAX, our beloved language's mutation thereof) in Chapter 5.

XML::LibXML

XML::LibXML, like XML::Parser, is an interface to a library written in C. Called libxml2, it's part of the GNOME project.[7] Unlike XML::Parser, this new parser supports a major standard for XML tree processing known as the Document Object Model (DOM).

DOM is another much-ballyhooed XML standard. It does for tree processing what SAX does for event streams. If you have your heart set on climbing trees in your program and you think there's a likelihood that it might be reused or applied to different data sources, you're better off using something standard and interchangeable. Again, we're happy to delve into DOM in a future chapter and get you thinking in standards-complaint ways. That topic is coming up in Chapter 7.

Now we want to show you an example of another parser in action. We'd be remiss if we focused on just one kind of parser when so many are out there. Again, we'll show you a basic example, nothing fancy, just to show you how to invoke the parser and tame its power. Let's write another document analysis tool like we did in Example 3-5, this time printing a frequency distribution of elements in a document.

Example 3-6 shows the program. It's a vanilla parser run because we haven't set any options yet. Essentially, the parser parses the filehandle and returns a DOM object, which is nothing more than a tree structure of well-designed objects. Our program finds the document element, and then traverses the entire tree one element at a time, all the while updating the hash of frequency counters.

Example 3-6: A frequency distribution program

use XML::LibXML;
use IO::Handle;
 
# initialize the parser
my $parser = new XML::LibXML;
 
# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    my %dist;
    &proc_node( $doc->getDocumentElement, \%dist );
    foreach my $item ( sort keys %dist ) {
        print "$item: ", $dist{ $item }, "\n";
    }
    $fh->close;
}
 
# process an XML tree node: if it's an element, update the
# distribution list and process all its children
#
sub proc_node {
    my( $node, $dist ) = @_;
    return unless( $node->nodeType eq &XML_ELEMENT_NODE );
    $dist->{ $node->nodeName } ++;
    foreach my $child ( $node->getChildnodes ) {
        &proc_node( $child, $dist );
    }
}

Note that instead of using a simple path to a file, we use a filehandle object of the IO::Handle class. Perl filehandles, as you probably know, are magic and subtle beasties, capable of passing into your code characters from a wide variety of sources, including files on disk, open network sockets, keyboard input, databases, and just about everything else capable of outputting data. Once you define a filehandle's source, it gives you the same interface for reading from it as does every other filehandle. This dovetails nicely with our XML-based ideology, where we want code to be as flexible and reusable as possible. After all, XML doesn't care where it comes from, so why should we pigeonhole it with one source type?

The parser object returns a document object after parsing. This object has a method that returns a reference to the document element--the element at the very root of the whole tree. We take this reference and feed it to a recursive subroutine, proc_node( ), which happily munches on elements and scribbles into a hash variable every time it sees an element. Recursion is an efficient way to write programs that process XML because the structure of documents is somewhat fractal: the same rules for elements apply at any depth or position in the document, including the root element that represents the entire document (modulo its prologue). Note the "node type" check, which distinguishes between elements and other parts of a document (such as pieces of text or processing instructions).

For every element the routine looks at, it has to call the object's getChildnodes( ) method to continue processing on its children. This call is an essential difference between stream-based and tree-based methodologies. Instead of having an event stream take the steering wheel of our program and push data at it, thus calling subroutines and codeblocks in a (somewhat) unpredictable order, our program now has the responsibility of navigating through the document under its own power. Traditionally, we start at the root element and go downward, processing children in order from first to last. However, because we, not the parser, are in control now, we can scan through the document in any way we want. We could go backwards, we could scan just a part of the document, we could jump around, making multiple passes though the tree--the sky's the limit. Here's the result from processing a small chapter coded in DocBook XML:

$ xfreq < ch03.xml
chapter: 1
citetitle: 2
firstterm: 16
footnote: 6
foreignphrase: 2
function: 10
itemizedlist: 2
listitem: 21
literal: 29
note: 1
orderedlist: 1
para: 77
programlisting: 9
replaceable: 1
screen: 1
section: 6
sgmltag: 8
simplesect: 1
systemitem: 2
term: 6
title: 7
variablelist: 1
varlistentry: 6
xref: 2

The result shows only a few lines of code, but it sure does a lot of work. Again, thanks to the C library underneath, it's quite speedy.

XML::XPath

We've seen examples of parsers that dutifully deliver the entire document to you. Often, though, you don't need the whole thing. When you query a database, you're usually looking for only a single record. When you crack open a telephone book, you're not going to sit down and read the whole thing. There is obviously a need for some mechanism of extracting a specific piece of information from a vast document. Look no further than XPath.

XPath is a recommendation from the folks who brought you XML.[8] It's a grammar for writing expressions that pinpoint specific pieces of documents. Think of it as an addressing scheme. Although we'll save the nitty-gritty of XPath wrangling for Chapter 8, we can tantalize you by revealing that it works much like a mix of regular expressions with Unix-style file paths. Not surprisingly, this makes it an attractive feature to add to parsers.

Matt Sergeant's XML::XPath module is a solid implementation, built on the foundation of XML::Parser. Given an XPath expression, it returns a list of all document parts that match the description. It's an incredibly simple way to perform some powerful search and retrieval work.

For instance, suppose we have an address book encoded in XML in this basic form:

<contacts>
  <entry>
    <name>Bob Snob</name>
    <street>123 Platypus Lane</street>
    <city>Burgopolis</city>
    <state>FL</state>
    <zip>12345</zip>
  </entry>
 <!--More entries go here-->
</contacts>

Suppose you want to extract all the zip codes from the file and compile them into a list. Example 3-7 shows how you could do it with XPath.

Example 3-7: Zip code extractor

use XML::XPath;
 
my $file = 'customers.xml';
my $xp = XML::XPath->new(filename=>$file);
 
# An XML::XPath nodeset is an object which contains the result of
# smacking an XML document with an XPath expression; we'll do just
# this, and then query the nodeset to see what we get.
my $nodeset = $xp->find('//zip');
 
my @zipcodes;                   # Where we'll put our results
if (my @nodelist = $nodeset->get_nodelist) {
  # We found some zip elements! Each node is an object of the class
  # XML::XPath::Node::Element, so I'll use that class's 'string_value'
  # method to extract its pertinent text, and throw the result for all
  # the nodes into our array.
  @zipcodes = map($_->string_value, @nodelist);
 
  # Now sort and prepare for output
  @zipcodes = sort(@zipcodes);
  local $" = "\n";
  print "I found these zipcodes:\n@zipcodes\n";
} else {
  print "The file $file didn't have any 'zip' elements in it!\n";
}

Run the program on a document with three entries and we'll get something like this:

I found these zipcodes:
03642
12333
82649

This module also shows an example of tree-based parsing, by the way, as its parser loads the whole document into an object tree of its own design and then allows the user to selectively interact with parts of it via XPath expressions. This example is just a sample of what you can do with advanced tree processing modules. You'll see more of these modules in Chapter 8.

XML::LibXML's element objects support a findnodes( ) method that works much like XML::XPath's, using the invoking Element object as the current context and returning a list of objects that match the query. We'll play with this functionality later in Chapter 10.

Document Validation

Being well-formed is a minimal requirement for XML everywhere. However, XML processors have to accept a lot on blind faith. If we try to build a document to meet some specific XML application's specifications, it doesn't do us any good if a content generator slips in a strange element we've never seen before and the parser lets it go by with nary a whimper. Luckily, a higher level of quality control is available to us when we need to check for things like that. It's called document validation.

Validation is a sophisticated way of comparing a document instance against a template or grammar specification. It can restrict the number and type of elements a document can use and control where they go. It can even regulate the patterns of character data in any element or attribute. A validating parser tells you whether a document is valid or not, when given a DTD or schema to check against.

Remember that you don't need to validate every XML document that passes over your desk. DTDs and other validation schemes shine when working with specific XML-based markup languages (such as XHTML for web pages, MathML for equations, or CaveML for spelunking), which have strict rules about which elements and attributes go where (because having an automated way to draw attention to something fishy in the document structure becomes a feature).

However, validation usually isn't crucial when you use Perl and XML to perform a less specific task, such as tossing together XML documents on the fly based on some other, less sane data format, or when ripping apart and analyzing existing XML documents.

Basically, if you feel that validation is a needless step for the job at hand, you're probably right. However, if you knowingly generate or modify some flavor of XML that needs to stick to a defined standard, then taking the extra step or three necessary to perform document validation is probably wise. Your toolbox, naturally, gives you lots of ways to do this. Read on.

DTDs

Document type descriptions (DTDs) are documents written in a special markup language defined in the XML specification, though they themselves are not XML. Everything within these documents is a declaration starting with a <! delimiter and comes in four flavors: elements, attributes, entities, and notations.

Example 3-8 is a very simple DTD.

Example 3-8: A wee little DTD

<!ELEMENT memo (to, from, message)>
<!ATTLIST memo priority (urgent|normal|info) 'normal'>
<!ENTITY % text-only "(#PCDATA)*">
<!ELEMENT to %text-only;>
<!ELEMENT from %text-only;>
<!ELEMENT message (#PCDATA | emphasis)*>
<!ELEMENT emphasis %text-only;>
<!ENTITY myname "Bartholomus Chiggin McNugget">

This DTD declares five elements, an attribute for the <memo> element, a parameter entity to make other declarations cleaner, and an entity that can be used inside a document instance. Based on this information, a validating parser can reject or approve a document. The following document would pass muster:

<!DOCTYPE memo SYSTEM "/dtdstuff/memo.dtd">
<memo priority="info">
  <to>Sara Bellum</to>
  <from>&myname;</from>
  <message>Stop reading memos and get back to work!</message>
</memo>

If you removed the <to> element from the document, it would suddenly become invalid. A well-formedness checker wouldn't give a hoot about missing elements. Thus, you see the value of validation.

Because DTDs are so easy to parse, some general XML processors include the ability to validate the documents they parse against DTDs. XML::LibXML is one such parser. A very simple validating parser is shown in Example 3-9.

Example 3-9: A validating parser

use XML::LibXML;
use IO::Handle;
 
# initialize the parser
my $parser = new XML::LibXML;
 
# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    if( $doc and $doc->is_valid ) {
        print "Yup, it's valid.\n";
    } else {
        print "Yikes! Validity error.\n";
    }
    $fh->close;
}

This parser would be simple to add to any program that requires valid input documents. Unfortunately, it doesn't give any information about what specific problem makes it invalid (e.g., an element in an improper place), so you wouldn't want to use it as a general-purpose validity checking tool.[9] T. J. Mather's XML::Checker is a better module for reporting specific validation errors.

Schemas

DTDs have limitations; they aren't able to check what kind of character data is in an element and if it matches a particular pattern. What if you wanted a parser to tell you if a <date> element has the wrong format for a date, or if it contains a street address by mistake? For that, you need a solution such as XML Schema. XML Schema is a second generation of DTD and brings more power and flexibility to validation.

As noted in Chapter 2, XML Schema enjoys the dubious distinction among the XML-related W3C specification family for being the most controversial schema (at least among hackers). Many people like the concept of schemas, but many don't approve of the XML Schema implementation, which is seen as too cumbersome or constraining to be used effectively.

Alternatives to XML Schema include OASIS-Open's RelaxNG (http://www.oasis-open.org/committees/relaxng/) and Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). Like XML Schema, these specifications detail XML-based languages used to describe other XML-based languages and let a program that knows how to speak that schema use it to validate other XML documents. We find Schematron particularly interesting because it has had a Perl module attached to it for a while (in the form of Kip Hampton's XML::Schematron family).

Schematron is especially interesting to many Perl and XML hackers because it builds on existing popular XML technologies that already have venerable Perl implementations. Schematron defines a very simple language with which you list and group together assertions of what things should look like based on XPath expressions. Instead of a forward-looking grammar that must list and define everything that can possibly appear in the document, you can choose to validate a fraction of it. You can also choose to have elements and attributes validate based on conditions involving anything anywhere else in the document (wherever an XPath expression can reach). In practice, a Schematron document looks and feels like an XSLT stylesheet, and with good reason: it's intended to be fully implementable by way of XSLT. In fact, two of the XML::Schematron Perl modules work by first transforming the user-specified schema document into an XSLT sheet, which it then simply passes through an XSLT processor.

Schematron lacks any kind of built-in data typing, so you can't, for example, do a one-word check to insist that an attribute conforms to the W3C date format. You can, however, have your Perl program make a separate step using any method you'd like (perhaps through the XML::XPath module) to come through date attributes and run a good old Perl regular expression on them. Also note that no schema language will ever provide a way to query an element's content against a database, or perform any other action outside the realm of the document. This is where mixing Perl and schemas can come in very handy.

XML::Writer

Compared to all we've had to deal with in this chapter so far, writing XML will be a breeze. It's easier to write it because now the shoe's on the other foot: your program has a data structure over which it has had complete control and knows everything about, so it doesn't need to prepare for every contingency that it might encounter when processing input.

There's nothing particularly difficult about generating XML. You know about elements with start and end tags, their attributes, and so on. It's just tedious to write an XML output method that remembers to cross all the t's and dot all the i's. Does it put a space between every attribute? Does it close open elements? Does it put that slash at the end of empty elements? You don't want to have to think about these things when you're writing more important code. Others have written modules to take care of these serialization details for you.

David Megginson's XML::Writer is a fine example of an abstract XML generation interface. It comes with a handful of very simple methods for building any XML document. Just create a writer object and call its methods to crank out a stream of XML. Table 3-1 lists some of these methods.

Table 3-1: XML::Writer methods

Name

Function

end( )

Close the document and perform simple well-formedness checking (e.g., make sure that there is one root element and that every start tag has an associated end tag). If the option UNSAFE is set, however, most well-formedness checking is skipped.

xmlDecl([$endoding, $standalone])

Add an XML Declaration at the top of the document. The version is hard-wired as "1.0".

doctype($name, [$publicId, $systemId])

Add a document type declaration at the top of the document.

comment($text)

Write an XML comment.

pi($target [, $data])

Output a processing instruction.

startTag($name [, $aname1 => $value1, ...])

Create an element start tag. The first argument is the element name, which is followed by attribute name-value pairs.

emptyTag($name [, $aname1 => $value1, ...])

Set up an empty element tag. The arguments are the same as for the startTag( ) method.

endTag([$name])

Create an element end tag. Leave out the argument to have it close the currently open element automatically.

dataElement($name, $data [, $aname1 => $value1, ...])

Print an element that contains only character data. This element includes the start tag, the data, and the end tag.

characters($data)

Output a parcel of character data.

Using these routines, we can build a complete XML document. The program in Example 3-10, for example, creates a basic HTML file.

Example 3-10: HTML generator

use IO;
my $output = new IO::File(">output.xml");
 
use XML::Writer;
my $writer = new XML::Writer( OUTPUT => $output );
 
$writer->xmlDecl( 'UTF-8' );
$writer->doctype( 'html' );
$writer->comment( 'My happy little HTML page' );
$writer->pi( 'foo', 'bar' );
$writer->startTag( 'html' );
$writer->startTag( 'body' );
$writer->startTag( 'h1' );
$writer->startTag( 'font', 'color' => 'green' );
$writer->characters( "<Hello World!>" );
$writer->endTag(  );
$writer->endTag(  );
$writer->dataElement( "p", "Nice to see you." );
$writer->endTag(  );
$writer->endTag(  );
$writer->end(  );

This example outputs the following:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<!-- My happy little HTML page -->
<?foo bar?>
<html><body><h1><font color="green">&lt;Hello World!&gt;</font></h2><p>Nice to see you.</p></body></html>

Some nice conveniences are built into this module. For example, it automatically takes care of illegal characters like the ampersand (&) by turning them into the appropriate entity references. Quoting of entity values is automatic, too. At any time during the document-building process, you can check the context you're in with predicate methods like within_element('foo'), which tells you if an element named 'foo' is open.

By default, the module outputs a document with all the tags run together. You might prefer to insert whitespace in some places to make the XML more readable. If you set the option NEWLINES to true, then it will insert newline characters after element tags. If you set DATA_MODE, a similar effect will be achieved, and you can combine DATA_MODE with DATA_INDENT to automatically indent lines in proportion to depth in the document for a nicely formatted document.

The nice thing about XML is that it can be used to organize just about any kind of textual data. With XML::Writer, you can quickly turn a pile of information into a tightly regimented document. For example, you can turn a directory listing into a hierarchical database like the program in Example 3-11.

Example 3-11: Directory mapper

use XML::Writer;
my $wr = new XML::Writer( DATA_MODE => 'true', DATA_INDENT => 2 );
&as_xml( shift @ARGV );
$wr->end;
 
# recursively map directory information into XML
#
sub as_xml {
    my $path = shift;
    return unless( -e $path );
 
    # if this is a directory, create an element and
    # stuff it full of items
    if( -d $path ) {
        $wr->startTag( 'directory', name => $path );
 
        # Load the names of all things in this
        # directory into an array
        my @contents = (  );
        opendir( DIR, $path );
        while( my $item = readdir( DIR )) {
            next if( $item eq '.' or $item eq '..' );
            push( @contents, $item );
        }
        closedir( DIR );
 
        # recurse on items in the directory
        foreach my $item ( @contents ) {
            &as_xml( "$path/$item" );
        }
 
        $wr->endTag( 'directory' );
 
    # We'll lazily call anything that's not a directory a file.
    } else {
        $wr->emptyTag( 'file', name => $path );
    }
}

Here's how the example looks when run on a directory (note the use of DATA_MODE and DATA_INDENT to improve readability):

$ ~/bin/dir /home/eray/xtools/XML-DOM-1.25
 
<directory name="/home/eray/xtools/XML-DOM-1.25">
  <directory name="/home/eray/xtools/XML-DOM-1.25/t">
    <file name="/home/eray/xtools/XML-DOM-1.25/t/attr.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/minus.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/example.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/print.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/cdata.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/astress.t" />
    <file name="/home/eray/xtools/XML-DOM-1.25/t/modify.t" />
  </directory>
  <file name="/home/eray/xtools/XML-DOM-1.25/DOM.gif" />
  <directory name="/home/eray/xtools/XML-DOM-1.25/samples">
    <file
    name="/home/eray/xtools/XML-DOM-1.25/samples/REC-xml-19980210.xml"
    />
  </directory>
  <file name="/home/eray/xtools/XML-DOM-1.25/MANIFEST" />
  <file name="/home/eray/xtools/XML-DOM-1.25/Makefile.PL" />
  <file name="/home/eray/xtools/XML-DOM-1.25/Changes" />
  <file name="/home/eray/xtools/XML-DOM-1.25/CheckAncestors.pm" />
  <file name="/home/eray/xtools/XML-DOM-1.25/CmpDOM.pm" />

We've seen XML::Writer used step by step and in a recursive context. You could also use it conveniently inside an object tree structure, where each XML object type has its own "to-string" method making the appropriate calls to the writer object. XML::Writer is extremely flexible and useful.

Other Methods of Output

Remember that many parser modules have their own ways to turn their current content into simple, pretty strings of XML. XML::LibXML, for example, lets you call a toString( ) method on the document or any element object within it. Consequently, more specific processor classes that subclass from this module or otherwise make internal use of it often make the same method available in their own APIs and pass end user calls to it to the underlying parser object. Consult the documentation of your favorite processor to see if it supports this or a similar feature.

Finally, sometimes all you really need is Perl's print function. While it lives at a lower level than tools like XML::Writer, ignorant of XML-specific rules and regulations, it gives you a finer degree of control over the process of turning memory structures into text worthy of throwing at filehandles. If you're doing especially tricky work, falling back to print may be a relief, and indeed some of the stunts we pull in Chapter 10 use print. Just don't forget to escape those naughty < and & characters with their respective entity references, as shown in Table 2-1, or be generous with CDATA sections.

Character Sets and Encodings

No matter how you choose to manage your program's output, you must keep in mind the concept of character encoding--the protocol your output XML document uses to represent the various symbols of its language, be they an alphabet of letters or a catalog of ideographs and diacritical marks. Character encoding may represent the trickiest part of XML-slinging, perhaps especially so for programmers in Western Europe and the Americas, most of whom have not explored the universe of possible encodings beyond the 128 characters of ASCII.

While it's technically legal for an XML document's encoding declaration to contain the name of any text encoding scheme, the only ones that XML processors are, according to spec, required to understand are UTF-8 and UTF-16. UTF-8 and UTF-16 are two flavors of Unicode, a recent and powerful character encoding architecture that embraces every funny little squiggle a person might care to make.

In this section, we conspire with Perl and XML to nudge you gently into thinking about Unicode, if you're not pondering it already. While you can do everything described in this book by using the legacy encoding of your choice, you'll find, as time passes, that you're swimming against the current.

Unicode, Perl, and XML

Unicode has crept in as the digital age's way of uniting the thousands of different writing systems that have paid the salaries of monks and linguists for centuries. Of course, if you program in an environment where non-ASCII characters are found in abundance, you're probably already familiar with it. However, even then, much of your text processing work might be restricted to low-bit Latin alphanumerics, simply because that's been the character set of choice--of fiat, really--for the Internet. Unicode hopes to change this trend, Perl hopes to help, and sneaky little XML is already doing so.

As any Unicode-evangelizing document will tell you,[10] Unicode is great for internationalizing code. It lets programmers come up with localization solutions without the additional worry of juggling different character architectures.

However, Unicode's importance increases by an order of magnitude when you introduce the question of data representation. The languages that a given program's users (or programmers) might prefer is one thing, but as computing becomes more ubiquitous, it touches more people's lives in more ways every day, and some of these people speak Kurku. By understanding the basics of Unicode, you can see how it can help to transparently keep all the data you'll ever work with, no matter the script, in one architecture.

Unicode Encodings

We are careful to separate the words "architecture" and "encoding" because Unicode actually represents one of the former that contains several of the latter.

In Unicode, every discrete squiggle that's gained official recognition, from A to α, has its own code point--a unique positive integer that serves as its address in the whole map of Unicode. For example, the first letter of the Latin alphabet, capitalized, lives at the hexadecimal address 0x0041 (as it does in ASCII and friends), and the other two symbols, the lowercase Greek alpha and the smileyface, are found in 0x03B1 and 0x263A, respectively. A character can be constructed from any one of these code points, or by combining several of them. Many code points are dedicated to holding the various diacritical marks, such as accents and radicals, that many scripts use in conjunction with base alphabetical or ideographic glyphs.

These addresses, as well as those of the tens of thousands (and, in time, hundreds of thousands) of other glyphs on the map, remain true across Unicode's encodings. The only difference lies in the way these numbers are encoded in the ones and zeros that make up the document at its lowest level.

Unicode officially supports three types of encoding, all named UTF (short for Unicode Transformation Format), followed by a number representing the smallest bit-size any character might take. The encodings are UTF-8, UTF-16, and UTF-32. UTF-8 is the most flexible of all, and is therefore the one that Perl has adopted.

UTF-8

The UTF-8 encoding, arguably the most Perlish in its impish trickery, is also the most efficient since it's the only one that can pack characters into single bytes. For that reason, UTF-8 is the default encoding for XML documents: if XML documents specify no encoding in their declarations, then processors should assume that they use UTF-8.

Each character appearing within a document encoded with UTF-8 uses as many bytes as it has to in order to represent that character's code point, up to a maximum of six bytes. Thus, the character A, with the itty-bitty address of 0x41, gets one byte to represent it, while our friend lives way up the street in one of Unicode's blocks of miscellaneous doohickeys, with the address 0x263A. It takes three bytes for itself--two for the character's code point number and one that signals to text processors that there are, in fact, multiple bytes to this character. Several centuries from now, after Earth begrudgingly joins the Galactic Friendship Union and we find ourselves needing to encode the characters from countless off-planet civilizations, bytes four through six will come in quite handy.

UTF-16

The UTF-16 encoding uses a full two bytes to represent the character in question, even if its ordinal is small enough to fit into one (which is how UTF-8 would handle it). If, on the other hand, the character is rare enough to have a very high ordinal, then it gets an additional two bytes tacked onto it (called a surrogate pair), bringing that one character's total length to four bytes.

TIP:   Because Unicode 2.0 used a 16-bits-per-character style as its sole supported encoding, many people, and the programs they write, talk about the "Unicode encoding" when they really mean Unicode UTF-16. Even new applications' "Save As..." dialog boxes sometimes offer "Unicode" and "UTF-8" as separate choices, even though these labels don't make much sense in Unicode 3.2 terminology.

UTF-32

UTF-32 works a lot like UTF-16, but eliminates any question of variable character size by declaring that every invoked Unicode-mapped glyph shall occupy exactly four bytes. Because of its maximum maximosity, this encoding doesn't see much practical use, since all but the most unusual communication would have significantly more than half of its total mass made up of leading zeros, which doesn't work wonders for efficiency. However, if guaranteed character width is an inflexible issue, this encoding can handle all the million-plus glyph addresses that Unicode accommodates. Of the three major Unicode encodings, UTF-32 is the one that XML parsers aren't obliged to understand. Hence, you probably don't need to worry about it, either.

Other Encodings

The XML standard defines 21 names for character sets that parsers might use (beyond the two they're required to know, UTF-8 and UTF-16). These names range from ISO-8859-1 (ASCII plus 128 characters outside the Latin alphabet) to Shift_JIS, a Microsoftian encoding for Japanese ideographs. While they're not Unicode encodings per se, each character within them maps to one or more Unicode code points (and vice versa, allowing for round-tripping between common encodings by way of Unicode).

XML parsers in Perl all have their own ways of dealing with other encodings. Some may need an extra little nudge. XML::Parser, for example, is weak in its raw state because its underlying library, Expat, understands only a handful of non-Unicode encodings. Fortunately, you can give it a helping hand by installing Clark Cooper's XML::Encoding module, an XML::Parser subclass that can read and understand map files (themselves XML documents) that bind the character code points of other encodings to their Unicode addresses.

Core Perl support

As with XML, Perl's relationship with Unicode has heated up at a cautious but inevitable pace.[11] Generally, you should use Perl version 5.6 or greater to work with Unicode properly in your code. If you do have 5.6 or greater, consult its perlunicode manpage for details on how deep its support runs, as each release since then has gradually deepened its loving embrace with Unicode. If you have an even earlier Perl, whew, you really ought to consider upgrading it. You can eke by with some of the tools we'll mention later in this chapter, but hacking Perl and XML means hacking in Unicode, and you'll notice the lack of core support for it.

Currently, the most recent stable Perl release, 5.6.1, contains partial support for Unicode. Invoking the use utf8 pragma tells Perl to use UTF-8 encoding with most of its string-handling functions. Perl also allows code to exist in UTF-8, allowing identifiers built from characters living beyond ASCII's one-byte reach. This can prove very useful for hackers who primarily think in glyphs outside the Latin alphabet.

Perl 5.8's Unicode support will be much more complete, allowing UTF-8 and regular expressions to play nice. The 5.8 distribution also introduces the Encode module to Perl's standard library, which will allow any Perl programmer to shift text from legacy encodings to Unicode without fuss:

use Encode 'from_to';
from_to($data, "iso-8859-3", "utf-8"); # from legacy to
utf-8

Finally, Perl 6, being a redesign of the whole language that includes everything the Perl community learned over the last dozen years, will naturally have an even more intimate relationship with Unicode (and will give us an excuse to print a second edition of this book in a few years). Stay tuned to the usual information channels for continuing developments on this front as we see what happens.

Encoding Conversion

If you use a version of Perl older than 5.8, you'll need a little extra help when switching from one encoding to another. Fortunately, your toolbox contains some ratchety little devices to assist you.

iconv and Text::Iconv

iconv is a library and program available for Windows and Unix (inlcuding Mac OS X) that provides an easy interface for turning a document of type A into one of type B. On the Unix command line, you can use it like this:

$ iconv -f latin1 -t utf8 my_file.txt > my_unicode_file.txt

If you have iconv on your system, you can also grab the Text::Iconv Perl module from CPAN, which gives you a Perl API to this library. This allows you to quickly re-encode on-disk files or strings in memory.

Unicode::String

A more portable solution comes in the form of the Unicode::String module, which needs no underlying C library. The module's basic API is as blissfully simple as all basic APIs should be. Got a string? Feed it to the class's constructor method and get back an object holding that string, as well as a bevy of methods that let you squash and stretch it in useful and amusing ways. Example 3-12 tests the module.

Example 3-12: Unicode test

use Unicode::String;
 
my $string = "This sentence exists in ASCII and UTF-8, but not UTF-16. Darn!\n";
my $u = Unicode::String->new($string);
 
# $u now holds an object representing a stringful of 16-bit characters
 
# It uses overloading so Perl string operators do what you expect!
$u .= "\n\nOh, hey, it's Unicode all of a sudden. Hooray!!\n"
 
# print as UTF-16 (also known as UCS2)
print $u->ucs2;
 
# print as something more human-readable
print $u->utf8;

The module's many methods allow you to downgrade your strings, too--specifically, the utf7 method lets you pop the eighth bit off of UTF-8 characters, which is acceptable if you need to throw a bunch of ASCII characters at a receiver that would flip out if it saw chains of UTF-8 marching proudly its way instead of the austere and solitary encodings of old.

WARNING:   XML::Parser sometimes seems a little too eager to get you into Unicode. No matter what a document's declared encoding is, it silently transforms all characters with higher Unicode code points into UTF-8, and if you ask the parser for your data back, it delivers those characters back to you in that manner. This silent transformation can be an unpleasant surprise. If you use XML::Parser as the core of any processing software you write, be aware that you may need to use the convertion tools mentioned in this section to massage your data into a more suitable format.

Byte order marks

If, for some reason, you have an XML document from an unknown source and have no idea what its encoding might be, it may behoove you to check for the presence of a byte order mark (BOM) at the start of the document. Documents that use Unicode's UTF-16 and UTF-32 encodings are endian-dependent (while UTF-8 escapes this fate by nature of its peculiar protocol). Not knowing which end of a byte carries the significant bit will make reading these documents similar to reading them in a mirror, rendering their content into a garble that your programs will not appreciate.

Unicode defines a special code point, U+FEFF, as the byte order mark. According to the Unicode specification, documents using the UTF-16 or UTF-32 encodings have the option of dedicating their first two or four bytes to this character.[12] This way, if a program carefully inspecting the document scans the first two bits and sees that they're 0xFE and 0xFF, in that order, it knows it's big-endian UTF-16. On the other hand, if it sees 0xFF 0xFE, it knows that document is little-endian because there is no Unicode code point of U+FFFE. (UTF-32's big- and little-endian BOMs have more padding: 0x00 0x00 0xFE 0xFF and 0xFF 0xFE 0x00 0x00, respectively.)

The XML specification states that UTF-16- and UTF-32-encoded documents must use a BOM, but, referring to the Unicode specification, we see that documents created by the engines of sane and benevolent masters will arrive to you in network order. In other words, they arrive to you in a big-endian fashion, which was some time ago declared as the order to use when transmitting data between machines. Conversely, because you are sane and benevolent, you should always transmit documents in network order when you're not sure which order to use. However, if you ever find yourself in doubt that you've received a sane document, just close your eyes and hum this tune:

open XML_FILE, $filename or die "Can't read $filename: $!";
my $bom; # will hold possible byte order mark
 
# read the first two bytes
read XML_FILE, $bom, 2;
 
# Fetch their numeric values, via Perl's ord() function
my $ord1 = ord(substr($bom,0,1));
my $ord2 = ord(substr($bom,1,1));
 
if ($ord1 == 0xFE && $ord2 == 0xFF) {
  # It looks like a UTF-16 big-endian document!
  # ... act accordingly here ...
} elsif ($ord1 == 0xFF && $ord2 == 0xEF) {
  # Oh, someone was naughty and sent us a UTF-16 little-endian document.
  # Probably we'll want to effect a byteswap on the thing before working with it.
} else {
  # No byte order mark detected.
}

You might run this example as a last-ditch effort if your parser complains that it can't find any XML in the document. The first line might indeed be a valid <?xml ... > declaration, but your parser sees some gobbledygook instead.


1. Readers of Douglas Adams' book The Hitchhiker's Guide to the Galaxy will recall that a babelfish is a living, universal language-translation device, about the size of an anchovy, that fits, head-first, into a sentient being's aural canal.

2. Most HTML browsers try to ignore well-formedness errors in HTML documents, attempting to fix them and move on. While ignoring these errors may seem to be more convenient to the reader, it actually encourages sloppy documents and results in overall degradation of the quality of information on the Web. After all, would you fix parse errors if you didn't have to?

3. If you insist on authoring a <blooby>-enabled web page in XML, you can design your own extension by drafting a DTD that uses entity references to pull in the XHTML DTD, and then defines your own special elements on top of it. At this point it's not officially XHTML anymore, but a subclass thereof.

4. The O'Reilly book Mastering Algorithms with Perl by Jon Orwant, Jarkko Hietaniemi, and John Macdonald devotes a chapter to this topic.

5. James Clark is a big name in the XML community. He tirelessly promotes the standard with his free tools and involvement with the W3C. You can see his work at http://www.jclark.com/. Clark is also editor of the XSLT and XPath recommendation documents at http://www.w3.org/.

6. See man perlxs or Chapter 25 of O'Reilly's Programming Perl, Third Edition for more information.

7. For downloads and documentation, see http://www.libxml.org/.

8. Check out the specification at http://www.w3.org/TR/xpath/.

9. The authors prefer to use a command-line tool called nsgmls available from http://www.jclark.com/. Public web sites, such as http://www.stg.brown.edu/service/xmlvalid/, can also validate arbitrary documents. Note that, in these cases, the XML document must have a DOCTYPE declaration, whose system identifier (if it has one) must contain a resolvable URL and not a path on your local system.

10. These documents include Chapter 15 of O'Reilly's Programming Perl, Third Edition and the FAQ that the Unicode consortium hosts at http://unicode.org/unicode/faq/.

11. The romantic metaphor may start to break down for you here, but you probably understand by now that Perl's polyamorous proclivities help make it the language that it is.

12. UTF-8 has its own byte order mark, but its purpose is to identify the document at UTF-8, and thus has little use in the XML world. The UTF-8 encoding doesn't have to worry about any of this endianness business since all its characters are made of strung-together byte sequences that are always read from first to last instead of little boxes holding byte pairs whose order may be questionable.

Back to: Perl & XML


oreilly.com Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies | Privacy Policy

© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com