Learning XML
Learning XML, Second Edition

By Erik T. Ray
Price: $39.95 USD
£28.50 GBP



Table of Contents

Chapter 1: Introduction
Anywhere there is information, you'll find XML, or at least hear it scratching at the door. XML has grown into a huge topic, inspiring many technologies and branching into new areas. So priority number one is to get a broad view, and ask the big questions, so that you can find your way through the dense jungle of standards and concepts.
A few questions come to mind. What is XML? We will attack this from different angles. It's more than the next generation of HTML. It's a general-purpose information storage system. It's a markup language toolkit. It's an open standard. It's a collection of standards. It's a lot of things, as you'll see.
Where did XML come from? It's good to have a historical perspective. You'll see how XML evolved out of earlier efforts like SGML, HTML, and the earliest presentational markup.
What can I do with XML? A practical question, again with several answers: you can store and retrieve data, ensure document integrity, format documents, and support many cultural localizations. And what can't I do with XML? You need to know about the limitations, as it may not be a good fit with your problem.
How do I get started? Without any hesitation, I hope. I'll describe the tools you need to get going with XML and test the examples in this book. Between authoring, validating, checking well-formedness, transforming, formatting, and writing programs, you'll have a lot to play with.
So now let us dive into the big questions. By the end of this chapter, you should know enough to decide where to go from here. Future chapters will describe topics in more detail, such as core markup, quality control, style and presentation, programming interfaces, and internationalization.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is XML?
XML is a lot like the ubiquitous plastic containers of Tupperware®. There is really no better way to keep your food fresh than with those colorful, airtight little boxes. They come in different sizes and shapes so you can choose the one that fits best. They lock tight so you know nothing is leaking out and germs can't get in. You can tell items apart based on the container's color, or even scribble on it with magic marker. They're stackable and can be nested in larger containers (in case you want to take them with you on a picnic). Now, if you think of information as a precious commodity like food, then you can see the need for a containment system like Tupperware®.
XML contains, shapes, labels, structures, and protects information. It does this with symbols embedded in the text, called markup. Markup enhances the meaning of information in certain ways, identifying the parts and how they relate to each other. For example, when you read a newspaper, you can tell articles apart by their spacing and position on the page and the use of different fonts for titles and headings. Markup works in a similar way, except that instead of spaces and lines, it uses symbols.
Markup is important to electronic documents because they are processed by computer programs. If a document has no labels or boundaries, then a program will not know how to distinguish a piece of text from any other piece. Essentially, the program would have to work with the entire document as a unit, severely limiting the interesting things you can do with the content. A newspaper with no space between articles and only one text style would be a huge, uninteresting blob of text. You could probably figure out where one article ends and another starts, but it would be a lot of work. A computer program wouldn't be able to do even that, since it lacks all but the most rudimentary pattern-matching skills.
XML's markup divides a document into separate information containers called
Where Did XML Come From?
XML is the result of a long evolution of data packaging reaching back to the days of punched cards. It is useful to trace this path to see what mistakes and discoveries influenced the design decisions.
Early electronic formats were more concerned with describing how things should look (presentation) than with document structure and meaning. troff and TeX, two early formatting languages, did a fantastic job of formatting printed documents, but lacked any sense of structure. Consequently, documents were limited to being viewed on screen or printed as hard copies. You couldn't easily write programs to search for and siphon out information, cross-reference information electronically, or repurpose documents for different applications.
Generic coding, which uses descriptive tags rather than formatting codes, eventually solved this problem. The first organization to seriously explore this idea was the Graphic Communications Association (GCA). In the late 1960s, the GenCode project developed ways to encode different document types with generic tags and to assemble documents from multiple pieces.
The next major advance was Generalized Markup Language (GML), a project by IBM. GML's designers, Charles Goldfarb, Edward Mosher, and Raymond Lorie, intended it as a solution to the problem of encoding documents for use with multiple information subsystems. Documents coded in this markup language could be edited, formatted, and searched by different programs because of its content-based tags. IBM, a huge publisher of technical manuals, has made extensive use of GML, proving the viability of generic coding.
Inspired by the success of GML, the American National Standards Institute (ANSI) Committee on Information Processing assembled a team, with Goldfarb as project leader, to develop a standard text-description language based upon GML. The GCA GenCode committee contributed their expertise as well. Throughout the late 1970s and early 1980s, the team published working drafts and eventually created a candidate for an industry standard (GCA 101-1983) called the Standard Generalized Markup Language (SGML). This was quickly adopted by both the U.S. Department of Defense and the U.S. Internal Revenue Service.
What Can I Do with XML?
Let me tackle that question by sorting the kinds of problems for which you would use XML.
Just about every software application needs to store some data. There are look-up tables, work files, preference settings, and so on. XML makes it very easy to do this. Say, for example, you've created a calendar program and you need a way to store holidays. You could hardcode them, of course, but that's kind of a hassle since you'd have to recompile the program if you need to add to the list. So you decide to save this data in a separate file using XML. Example 1-4 shows how it might look.
Example 1-4. Calendar data file
<caldata>
  <holiday type="international">
    <name>New Year's Day</name>
    <date><month>January</month><day>1</day></date>
  </holiday>
  <holiday type="personal">
    <name>Erik's birthday</name>
    <date><month>April</month><day>23</day></date>
  </holiday>
  <holiday type="national">
    <name>Independence Day</name>
    <date><month>July</month><day>4</day></date>
  </holiday>
  <holiday type="religious">
    <name>Christmas</name>
    <date><month>December</month><day>25</day></date>
  </holiday>
</caldata>
Now all your program needs to do is read in the XML file and convert the markup into some convenient data structure using an XML parser. This software component reads and digests XML into a more usable form. There are lots of libraries that will do this, as well as standalone programs. Outputting XML is just as easy as reading it. Again, there are modules and libraries people have written that you can incorporate in any program.
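To make that concrete, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module as the parser. The function name load_holidays and the tuple layout are hypothetical choices for illustration, and the data is a shortened copy of Example 1-4.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the calendar data from Example 1-4.
CALDATA = """\
<caldata>
  <holiday type="international">
    <name>New Year's Day</name>
    <date><month>January</month><day>1</day></date>
  </holiday>
  <holiday type="national">
    <name>Independence Day</name>
    <date><month>July</month><day>4</day></date>
  </holiday>
</caldata>
"""


def load_holidays(xml_text):
    """Return a list of (name, type, month, day) tuples (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    holidays = []
    for holiday in root.findall("holiday"):
        holidays.append((
            holiday.findtext("name"),
            holiday.get("type"),
            holiday.findtext("date/month"),
            int(holiday.findtext("date/day")),
        ))
    return holidays


holidays = load_holidays(CALDATA)
for name, kind, month, day in holidays:
    print(f"{month} {day}: {name} ({kind})")
```

Adding a holiday now means editing the data file, not recompiling the program.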
XML is a very good choice for storing data in many cases. It's easy to parse and write, and it's open for users to edit themselves. Parsers have mechanisms to verify syntax and completeness, so you can protect your program from corrupted data. XML works best for small data files or for data that is not meant to be searched randomly. A novel is a good example of a document that is not randomly accessed (unless you are one of those people who peek at the ending of a novel before finishing), whereas a telephone directory
How Do I Get Started?
By now you are chomping at the bit, eager to gallop into XML coding of your own. Let's take a look at how to set up your own XML authoring and processing environment.
The most important item in your XML toolbox is the XML editor. This program lets you read and compose XML, and often comes with services to prevent mistakes and clarify the view of your document. There is a wide spectrum of quality and expense in editors, which makes choosing one that's right for you a little tricky. In this section, I'll take you on a tour of different kinds.
Even the lowliest plain-text editor is sufficient to work with XML. You can use TextEdit on the Mac, Notepad or WordPad on Windows, or vi on Unix. The only limitation is whether it supports the character set used by the document. In most cases, it will be UTF-8. Some of these text editors support an XML "mode" which can highlight markup and assist in inserting tags. Some popular free editors include vim, elvis, and, my personal favorite, emacs.
emacs is a powerful text editor with macros and scripted functions. Lennart Staflin has written an XML plug-in for it called psgml, available at http://www.lysator.liu.se/~lenst/. It adds menus and commands for inserting tags and showing information about a DTD. It even comes with an XML parser that can detect structural mistakes while you're editing a document. Using psgml and a feature called "font-lock," you can set up xemacs, an X Window version of emacs, to highlight markup in color. Figure 1-5 is a snapshot of xemacs with an XML document open.
Figure 1-5: Highlighted markup in xemacs with psgml
Morphon Technologies' XMLEditor is a fine example of a graphical user interface. As you can see in Figure 1-6, the window sports several panes. On the left is an outline view of the book, in which you can quickly zoom in on a particular element, open it, collapse it, and move it around. On the right is a view of the text without markup. And below these panes is an attribute editing pane. The layout is easy to customize and easy to use. Note the formatting in the text view, achieved by applying a CSS stylesheet to the document. Morphon's editor sells for $150 and you can download a 30-day demo at
Chapter 2: Markup and Core Concepts
There's a Far Side cartoon by Gary Larson about an unusual chicken ranch. Instead of strutting around, pecking at seed, the chickens are all lying on the ground or draped over fences as if they were made of rubber. You see, it was a boneless chicken ranch.
Just as skeletons give us vertebrates shape and structure, markup gives shape and structure to text. Take out the markup and you have a mess of character data without any form. It would be very difficult to write a computer program that did anything useful with that content. Software relies on markup to label and delineate pieces of data, the way suitcases make it easy for you to carry clothes with you on a trip.
This chapter focuses on the details of XML markup. Here I will describe the fundamental building blocks of all XML-derived languages: elements, attributes, entities, processing instructions, and more. And I'll show you how they all fit together to make a well-formed XML document. Mastering these concepts is essential to understanding every other topic in the book, so read this chapter carefully.
All of the markup rules for XML are laid out in the W3C's technical recommendation for XML version 1.0 (http://www.w3.org/TR/2000/REC-xml-20001006). This is the second edition of the original which first appeared in 1998. You may also find Tim Bray's annotated, interactive version useful. Go and check it out at http://www.xml.com/axml/testaxml.htm.
Tags
If XML markup is a structural skeleton for a document, then tags are the bones. They mark the boundaries of elements, allow insertion of comments and special instructions, and declare settings for the parsing environment. A parser, the front line of any program that processes XML, relies on tags to help it break down documents into discrete XML objects. There are a handful of different XML object types, listed in Table 2-1.
Table 2-1: Types of tags in XML
Object | Purpose | Example
empty element | Represent information at a specific point in the document. | <xref linkend="abc"/>
container element | Group together elements and character data. | <p>This is a paragraph.</p>
declaration | Add a new parameter, entity, or grammar definition to the parsing environment. | <!ENTITY author "Erik Ray">
processing instruction | Feed a special instruction to a particular type of software. | <?print-formatter force-linebreak?>
comment | Insert an annotation that will be ignored by the XML processor. | <!-- here's where I left off -->
CDATA section
Documents
An XML document is a special construct designed to archive data in a way that is most convenient for parsers. It has nothing to do with our traditional concept of documents, like the Magna Carta or Time magazine, although those texts could be stored as XML documents. It simply is a way of describing a piece of XML as being whole and intact for parsing.
It's important to think of the document as a logical entity rather than a physical one. In other words, don't assume that a document will be contained within a single file on a computer. Quite often, a document may be spread out across many files, and some of these may live on different systems. All that is required is that the XML parser reading the document has the ability to assemble the pieces into a coherent whole. Later, we will talk about mechanisms used in XML for linking discrete physical entities into a complete logical unit.
As Figure 2-2 shows, an XML document has two parts. First is the document prolog, a special section containing metadata. The second is an element called the document element, also called the root element for reasons you will understand when we talk about trees. The root element contains all the other elements and content in the document.
Figure 2-2: Parts of an XML document
The prolog is optional. If you leave it out, the parser will fall back on its default settings. For example, it automatically selects the character encoding UTF-8 (or UTF-16, if detected) unless something else is specified. The root element is required, because a document without data is just not a document.
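A quick way to see both rules in action is to hand a parser each case. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name has_root and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

# No prolog at all: the parser falls back on its defaults and still succeeds.
root = ET.fromstring("<note>Remember the milk</note>")
print(root.tag, root.text)


def has_root(xml_text):
    """Return True if the text parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A prolog with no root element is not a document, so parsing fails.
print(has_root('<?xml version="1.0"?>'))
```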
The Document Prolog
Being a flexible markup language toolkit, XML lets you use different character encodings, define your own grammars, and store parts of the document in many places. An XML parser needs to know about these particulars before it can start its work. You communicate these options to the parser through a construct called the document prolog.
The document prolog (if you use one) comes at the top of the document, before the root element. There are two parts (both optional): an XML declaration and a document type declaration. The first sets parameters for basic XML parsing while the second is for more advanced settings. The XML declaration, if used, has to be the first line in the document. Example 2-1 shows a document containing a full prolog.
Example 2-1. A document with a full prolog
<?xml version="1.0" standalone="no"?>              The XML declaration
<!DOCTYPE                                          Beginning of the DOCTYPE declaration
  reminder                                         Root element name
  SYSTEM "/home/eray/reminder.dtd"                 DTD identifier              
  [                                                Internal subset start delimiter
    <!ENTITY smile "<graphic file='smile.eps'/>">  Entity declaration
  ]>                                               Internal subset end delimiter
<reminder>                                         Start of document element
  &smile;                                          Reference to the entity declared above
  <msg>Smile! It can always get worse.</msg>
</reminder>                                        End of document element
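If you want to see what a parser takes away from a prolog, the following sketch may help. It assumes Python's standard xml.dom.minidom module and uses a simplified version of Example 2-1 with no internal subset. The parser records the DTD identifier but does not fetch the file, so the path does not need to exist.

```python
from xml.dom import minidom

# Simplified version of Example 2-1: XML declaration plus a DOCTYPE
# declaration that names the root element and an external DTD.
DOC = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE reminder SYSTEM "/home/eray/reminder.dtd">
<reminder>
  <msg>Smile! It can always get worse.</msg>
</reminder>
"""

doc = minidom.parseString(DOC)

# What the prolog told the parser.
print(doc.doctype.name)       # root element name from the DOCTYPE declaration
print(doc.doctype.systemId)   # DTD identifier

# The document element that follows the prolog.
print(doc.documentElement.tagName)
```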
            
The XML declaration is a small collection of details that prepare an XML processor for working with a document. It is optional, but when used it must always appear in the first line. Figure 2-3 shows the form it takes. It starts with the delimiter <?xml (1), contains a number of parameters (2), and ends with the delimiter
Elements
Elements are the building blocks of XML, dividing a document into a hierarchy of regions, each serving a specific purpose. Some elements are containers, holding text or elements. Others are empty, marking a place for some special processing such as importing a media object. In this section, I'll describe the rules for how to construct elements.
Figure 2-9 shows the syntax for a container element. It begins with a start tag consisting of an angle bracket (1) followed by a name (2). The start tag may contain some attributes (3) separated by whitespace, and it ends with a closing angle bracket (4). After the start tag is the element's content and then an end tag. The end tag consists of an opening angle bracket and a slash (5), the element's name again (2), and a closing bracket (4). The name in the end tag must match the one in the start tag exactly.
Figure 2-9: Container element syntax
An empty element is very similar, as seen in Figure 2-10. It starts with an angle bracket delimiter (1), and contains a name (2) and a number of attributes (3). It is closed with a slash and a closing angle bracket (4). It has no content, so there is no need for an end tag.
Figure 2-10: Empty element syntax
An attribute defines a property of the element. It associates a name with a value, which is a string of character data. The syntax, shown in Figure 2-11, is a name (1), followed by an equals sign (2), and a string (4) inside quotes (3). Two kinds of quotes are allowed: double (") and single ('). Quote characters around an attribute value must match.
Figure 2-11: Form of an attribute
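The element and attribute rules above are easy to try out. This is a rough sketch assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

# A container element: start tag with two attributes, content, end tag.
# Double and single quotes are both fine as long as each pair matches.
para = ET.fromstring("<p id=\"intro\" class='lead'>This is a paragraph.</p>")
print(para.tag, para.attrib, para.text)

# An empty element carries attributes but has no content and no end tag.
xref = ET.fromstring('<xref linkend="abc"/>')
print(xref.get("linkend"), xref.text)


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>text</para>"))        # end tag name does not match
print(is_well_formed("<p id=\"intro'>x</p>"))  # quote characters do not match
```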
Entities
Entities are placeholders in XML. You declare an entity in the document prolog or in a DTD, and you can refer to it many times in the document. Different types of entities have different uses. You can substitute characters that are difficult or impossible to type with character entities. You can pull in content that lives outside of your document with external entities. And rather than type the same thing over and over again, such as boilerplate text, you can instead define your own general entities.
Figure 2-17 shows the different kinds of entities and their roles. In the family tree of entity types, the two major branches are parameter entities and general entities. Parameter entities are used only in DTDs, so I'll talk about them later, in Chapter 4. This section will focus on the other type, general entities.
Figure 2-17: Entity types
An entity consists of a name and a value. When an XML parser begins to process a document, it first reads a series of declarations, some of which define entities by associating a name with a value. The value is anything from a single character to a file of XML markup. As the parser scans the XML document, it encounters entity references, which are special markers derived from entity names. For each entity reference, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup, then resumes parsing just before that point, so the new text is parsed too. Any entity references inside the replacement text are also replaced; this process repeats as many times as necessary.
Recall from Section 2.3.2 earlier in this chapter that an entity reference consists of an ampersand (&), the entity name, and a semicolon (;). The following is an example of a document that declares three general entities and references them in the text:
<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd"
[
  <!ENTITY client "Mr. Rufus Xavier Sasperilla">
  <!ENTITY agent "Ms. Sally Tashuns">
  <!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>We have an exciting opportunity for you! A set of 
ocean-front cliff dwellings in Pi&#241;ata, Mexico, have been
renovated as time-share vacation homes. They're going fast! To 
reserve a place for your holiday, call &agent; at &phone;. 
Hurry, &client;. Time is running out!</body>
</message>
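To watch the replacement happen, you can feed a cut-down version of this document to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which expands general entities declared in the internal subset. I have dropped the external DTD identifier and shortened the text so the example stands alone.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the message document: two general entities declared
# in the internal subset, one of which contains markup.
MESSAGE = """<?xml version="1.0"?>
<!DOCTYPE message [
  <!ENTITY client "Mr. Rufus Xavier Sasperilla">
  <!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>Visit Pi&#241;ata, or call &phone;.</body>
</message>
"""

root = ET.fromstring(MESSAGE)

# The entity reference has been replaced by its text.
print(root.findtext("opening"))

# The numbered character reference became a single character.
print(root.find("body").text)

# Markup inside the replacement text was parsed too, producing an element.
print(root.findtext("body/number"))
```

By the time your program sees the tree, the references are gone and only their replacement text and markup remain.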
Miscellaneous Markup
Rounding out the list of markup objects are comments, processing instructions, and CDATA sections. They all have one thing in common: they shield content from the parser in some fashion. Comments keep text from ever getting to the parser. CDATA sections turn off the tag resolution, and processing instructions target specific processors.
Comments are notes in the document that are not interpreted by the XML processor. If you're working with other people on the same files, these messages can be invaluable. They can be used to identify the purpose of files and sections to help navigate a cluttered document, or simply to communicate with each other.
Figure 2-21 shows the form of a comment. It starts with the delimiter <!-- (1) and ends with the delimiter --> (3). Between these delimiters goes the comment text (2) which can be just about any kind of text you want, including spaces, newlines, and markup. The only string not allowed inside a comment is two or more dashes in succession, since the parser would interpret that string as the end of the comment.
Figure 2-21: Comment syntax
Comments can go anywhere in your document except before the XML declaration and inside tags. The XML processor removes them completely before parsing begins. So this piece of XML:
<p>The quick brown fox jumped<!-- test -->over the lazy dog. 
The quick brown <!-- test --> fox jumped over the lazy dog. The<!--

test

-->quick brown fox 
jumped over the lazy dog.</p>
will look like this to the parser:
<p>The quick brown fox jumpedover the lazy dog. 
The quick brown  fox jumped over the lazy dog. Thequick brown fox 
jumped over the lazy dog.</p>
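You can check this behavior with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which by default leaves comments out of the tree it builds; the helper is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>The quick brown fox jumped<!-- test -->over the lazy dog.</p>"
)

# Collect all character data in the element; the comment leaves no trace.
text = "".join(p.itertext())
print(text)


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Two dashes in succession inside a comment are rejected.
print(is_well_formed("<p><!-- a -- b --></p>"))
```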
Since comments can contain markup, they can be used to "turn off" parts of a document. This is valuable when you want to remove a section temporarily, keeping it in the file for later use. In this example, a region of code is commented out:
Chapter 3: Modeling Information
Designing a markup language is a task similar to designing a building. First, you have to ask some questions: Who am I building it for? How will it be constructed? How will it be used? Do I give it many small rooms or a few large ones? Will the rooms be generic and interchangeable or specialized? Is there a role for the building, like storage, office space, or factory work? It takes a lot of planning to do it right.
When designing a markup language, there are many questions to answer: What constitutes a document? How detailed do you need it to be? How will it be generated? Is it flexible enough to handle every expected situation? Is it generic enough to support different formatting options and modes? Your decisions will help answer the most basic question which is, how can you represent a piece of information as XML? This problem is part of the important topic of data modeling.
In this chapter, we look at the ways in which different kinds of data are modelled using XML. First, I'll show you the most basic kinds of documents, simple collections of preferences for software applications. The next category covers narrative documents with characteristics such as text flows, block and inline elements, and titled sections. Lastly, under the broad umbrella of "complex" data, I'll talk about the myriad specialized markup languages for everything from vector graphics to remote procedure calls.
Simple Data Storage
XML can be used like an extremely basic database. Since the early days of computer operating systems, data has been stored in files as tables, like the venerable /etc/passwd file:
nobody:*:-2:-2:Unprivileged User:/nohome:/noshell
root:*:0:0:System Administrator:/var/root:/bin/tcsh
daemon:*:1:1:System Services:/var/root:/noshell
smmsp:*:25:25:Sendmail User:/private/etc/mail:/noshell
Data like this isn't too hard to parse, but it has problems, too. Certain characters aren't allowed. Each record lives on a separate line, so data can't span lines. A syntax error is easy to create and may be difficult to locate. XML's explicit markup gives it natural immunity to these types of problems.
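For comparison, here is one way the same records might look once converted to XML. This is only a sketch assuming Python's standard xml.etree.ElementTree module. The element names (users, user, password, and so on) are hypothetical labels for the seven colon-separated fields, not an established format.

```python
import xml.etree.ElementTree as ET

PASSWD = """\
root:*:0:0:System Administrator:/var/root:/bin/tcsh
daemon:*:1:1:System Services:/var/root:/noshell
"""

# Hypothetical element names for the fields that follow the user name.
FIELDS = ("password", "uid", "gid", "description", "home", "shell")


def passwd_to_xml(text):
    """Build a <users> element with one <user> child per record."""
    users = ET.Element("users")
    for line in text.splitlines():
        name, *values = line.split(":")
        user = ET.SubElement(users, "user", name=name)
        for field, value in zip(FIELDS, values):
            ET.SubElement(user, field).text = value
    return users


users = passwd_to_xml(PASSWD)
print(ET.tostring(users, encoding="unicode"))
```

Each value now sits inside its own labeled element, so a colon or a line break in a description would no longer corrupt the record.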
If you are writing a program that reads or saves data to a file, there are good reasons to go with XML. Parsers have been written to parse it already, so all you need to do is link to a library and use one of several easy interfaces: SAX, DOM, or XPath. Syntax errors are easy to catch, and that too is automated by the parser. Technologies like DTDs and Schema even check the structure and contents of elements for you, to ensure completeness and ordering.
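As a small illustration of the parser catching a syntax error for you, consider the sketch below. It assumes Python's standard xml.etree.ElementTree module; the document and the helper name check are hypothetical.

```python
import xml.etree.ElementTree as ET

# The end tag on the third line is misspelled.
BROKEN = """<users>
  <user name="root">
    <shell>/bin/tcsh</shel>
  </user>
</users>
"""


def check(xml_text):
    """Return None if the text is well-formed, else the (line, column) of the error."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return err.position


print(check(BROKEN))
```

The parser reports where the problem is, which is far more than you get from a mangled colon-delimited line.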
A dictionary is a simple one-to-one mapping of properties to values. A property has a name, or key, which is a unique identifier. A dictionary is kind of like a table with two columns. It's a simple but very effective way to serialize data.
In the Macintosh OS X operating system, Apple selected XML as its format for preference files (called property lists). For the Chess program, the property list is in a file called com.apple.Chess.plist, shown here:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist SYSTEM "file://localhost/System/Library/DTDs/PropertyList.dtd">
<plist version="0.9">
  <dict>
    <!--    KEY                       VALUE    -->
    <key>BothSides</key>            <false/>
    <key>Level</key>                <integer>1</integer>
    <key>PlayerHasWhite</key>       <true/>
    <key>SpeechRecognition</key>    <false/>
  </dict>
</plist>
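Reading such a file takes very little code. The sketch below assumes Python's standard plistlib module, which understands the XML property list format, and embeds the Chess preferences as a byte string so it runs anywhere.

```python
import plistlib

# The Chess property list from above, embedded as bytes.
PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist SYSTEM "file://localhost/System/Library/DTDs/PropertyList.dtd">
<plist version="0.9">
  <dict>
    <key>BothSides</key>            <false/>
    <key>Level</key>                <integer>1</integer>
    <key>PlayerHasWhite</key>       <true/>
    <key>SpeechRecognition</key>    <false/>
  </dict>
</plist>
"""

# Each <key> is paired with the value element that follows it.
prefs = plistlib.loads(PLIST)
print(prefs["Level"], prefs["PlayerHasWhite"])
```

The dictionary in the file becomes a dictionary in the program, with the value elements mapped to native booleans and integers.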
Narrative Documents
Now let's look at an important category of XML. A narrative document contains text meant to be read by people rather than machines. Web pages, books, journals, articles, and essays are all narrative documents. These documents have some common traits. First, order of elements is inviolate. Try reading a book backward and you'll agree it's much less interesting that way (and it gives away the ending). The text runs in a single path called a flow, which the reader follows from beginning to end.
Another key feature of narrative documents is specialized element groups, including sections, blocks, and inlines. Sections are what you would imagine: elements that break up the document into parts like chapters, subsections, and so on. Blocks are rectangular regions such as titles and paragraphs. Inlines are strings inside those blocks specially marked for formatting. Figure 3-2 shows how a typical formatted document would render these elements.
Figure 3-2: Flows, blocks, inlines
A narrative document contains at least one flow, a stream of text to be read continuously from start to finish. If there are multiple flows, one will be dominant, branching occasionally into short tangential flows like sidebars, notes, tips, warnings, footnotes, and so on. The main flow is typically formatted as a column, while other flows are often in boxes interrupting the main flow, or moved to the side or the very end, with some kind of link (e.g., a footnote symbol).
Markup for flows varies. Some XML applications like XHTML do not support more than one flow. Others, like DocBook, have rich support for flows, encapsulating them as elements inside the main flow. The best representations allow flows to be moved around, floated within the confines of the formatted page.
The main flow is broken up into sections
Complex Data
XML really shines when data is complex. It turns the most abstract concepts into concrete databases ready for processing by software. Multimedia formats like Scalable Vector Graphics (SVG) and Synchronized Multimedia Integration Language (SMIL) map pictures and movies into XML markup. Complex ideas in the scientific realm are just as readily coded as XML, as proven by MathML (equations), the Chemical Markup Language (chemical formulae), and the Molecular Dynamics Language (molecule interactions).
The reason XML is so good at modelling complex data is that the same building blocks for narrative documents—elements and attributes—can apply to any composition of objects and properties. Just as a book breaks down into chapters, sections, blocks, and inlines, many abstract ideas can be deconstructed into discrete and hierarchical components. Vector graphics, for example, are composed of a finite set of shapes with associated properties. You can represent each shape as an element and use attributes to hammer down the details.
SVG is a good example of how to represent objects as elements. Take a gander at the simple SVG document in Example 3-7. Here we have three different shapes represented by as many elements: a common rectangle, an ordinary circle, and an exciting polygon. Attributes in each element customize the shape, setting color and spatial dimensions.
Example 3-7. An SVG document
<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg">
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>
Vector graphics are scalable, meaning you can stretch the image vertically or horizontally without any loss of sharpness. The image processor just recalculates the coordinates for you, leaving you to concentrate on higher concepts like composition, color, and grouping.
SVG adds other benefits too. Being an XML application, it can be tested for well-formedness, can be edited in any generic XML editor, and is easy to write software for. DTDs and schemas are available to check for missing information, and they provide an easy way to distinguish between versions.
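To make the "easy to write software for" claim concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree module. It checks a document like Example 3-7 for well-formedness and lists each shape with its fill color. The function names is_well_formed and list_shapes are made up for this illustration, and the copy of the document used here declares the SVG namespace explicitly.

```python
import xml.etree.ElementTree as ET

# A copy of the three-shape document, with the SVG namespace declared.
SVG_DOC = """<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg">
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>"""


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}")[-1]


def list_shapes(text):
    """Return (element name, fill color) for every child except <desc>."""
    root = ET.fromstring(text)
    return [
        (local_name(child.tag), child.get("fill"))
        for child in root
        if local_name(child.tag) != "desc"
    ]


if __name__ == "__main__":
    print(is_well_formed(SVG_DOC))
    for name, fill in list_shapes(SVG_DOC):
        print(name, fill)
```

Because each shape is an element and each property is an attribute, the program needs no knowledge of graphics at all; it just walks the tree.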
Documents Describing Documents
Many XML documents contain metadata, information about themselves that helps search engines categorize them. But not everyone takes advantage of the possibilities of metadata. And, unless you're using an exhaustive program that spiders through an entire document collection, it's difficult to summarize the set and choose a particular article from it. Making matters worse, some documents, such as sound and graphics files, cannot describe themselves at all. To address these problems, a class of documents evolved that specializes in describing other documents.
To fully describe different kinds of documents, these markup languages have some interesting features in common. They record when documents were last updated, using standard time formats. They label the content type, be it text, image, sound, or something else. They may contain text descriptions for a user to peruse. For international documents, they may track the language encodings. Also interesting is the way documents are uniquely identified: using a physical address or some nonphysical identifier.
RSS, short for Rich Site Summary (or Really Simple Syndication, depending on whom you talk to), was created by Netscape Corp. to describe content on web sites. They wanted to make a portal that was customizable, allowing readers to subscribe to particular subject areas or channels. Each time readers returned to the site, they would see updates on their favorite topics, saving them the trouble of hunting around for this news on their own. Thus was born the service known as content aggregation.
Since the time when there were a few big content aggregators like Netscape and Userland, the landscape has shifted to include hundreds of smaller, more granular services. Instead of subscribing to channels that mix together lots of different sources, you can subscribe to individual sites for an even higher level of customization. Everything from the BBC to a swarm of one-person weblogs is at your disposal. Publishing has never been easier.
Chapter 4: Quality Control with Schemas
Up until now, we have been talking about the things all XML documents have in common. Well-formedness rules are universal, ensuring perfect compatibility with generic tools and APIs. This syntax homogeneity is a big selling point for XML, but equally important is the need for ways to distinguish XML-based languages from each other. A document usually attempts to conform to a language of some sort, and we need methods to test its level of conformance.
Schemas, the topic of this chapter, are the shepherds of markup languages. They keep documents from straying outside of the herd and causing trouble. For instance, an administrator of a web site can use a schema to determine which web pages are legal XHTML, and which are only pretending to be. A schema can also be used to publish a specification for a language in a succinct and unambiguous way.
Basic Concepts
In the general sense of the word, a schema is a generic representation of a class of things. For example, a schema for restaurant menus could be the phrase "a list of dishes available at a particular eating establishment." A schema may resemble the thing it describes, the way a "smiley face" represents an actual human face. The information contained in a schema allows you to identify when something is or is not a representative instance of the concept.
In the XML context, a schema is a pass-or-fail test for documents. A document that passes the test is said to conform to it, or be valid. Testing a document with a schema is called validation. A schema ensures that a document fulfills a minimum set of requirements, finding flaws that could result in anomalous processing. It also may serve as a way to formalize an application, being a publishable object that describes a language in unambiguous rules.
An XML schema is like a program that tells a processor how to read a document. It's very similar to a later topic we'll discuss called transformations. The processor reads the rules and declarations in the schema and uses this information to build a specific type of parser, called a validating parser. The validating parser takes an XML instance as input and produces a validation report as output. At a minimum, this report is a return code: true if the document is valid, false otherwise. Optionally, the parser can create a Post-Schema-Validation Infoset (PSVI), which includes information about data types and structure that may be used for further processing.
Validation happens on at least four levels:
Structure
The use and placement of markup elements and attributes.
Data typing
Patterns of character data (e.g., numbers, dates, text).
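To give a feel for what a validation report looks like, here is a minimal sketch in Python covering just the two levels listed above. It is not a real schema processor: the rules are hard-coded for a made-up order language, and the names ALLOWED_CHILDREN, INTEGER, and validate are assumptions invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules for a made-up <order> language (not from any real schema).
ALLOWED_CHILDREN = {"order": {"item"}, "item": set()}
INTEGER = re.compile(r"[0-9]+")


def validate(text):
    """Return (is_valid, messages) for an XML instance given as a string."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return False, [f"not well-formed: {err}"]

    messages = []
    if root.tag != "order":
        messages.append(f"structure: root is <{root.tag}>, expected <order>")

    # Structure: every child element must be allowed inside its parent.
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                messages.append(
                    f"structure: <{child.tag}> not allowed in <{parent.tag}>"
                )

    # Data typing: the qty attribute must be a run of digits.
    for item in root.iter("item"):
        qty = item.get("qty", "")
        if not INTEGER.fullmatch(qty):
            messages.append(f"data typing: qty={qty!r} is not an integer")

    # The return code is true only when no check produced a message.
    return not messages, messages


if __name__ == "__main__":
    print(validate('<order><item qty="3">Widget</item></order>'))
    print(validate('<order><item qty="three">Widget</item></order>'))
```

The first element of the result plays the role of the return code; the list of messages is the fuller report. A real validating parser builds these checks from the schema rather than from hand-written code.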
DTDs
The original XML document model is the Document Type Definition (DTD). DTDs actually predate XML; they are a reduced hand-me-down from SGML with the core syntax almost completely intact. The following describes how a DTD defines a document type.
  • A DTD declares a set of allowed elements. You cannot use any element names other than those in this set. Think of this as the "vocabulary" of the language.
  • A DTD defines a content model for each element. The content model is a pattern that tells what elements or data can go inside an element, in what order, in what number, and whether they are required or optional. Think of this as the "grammar" of the language.
  • A DTD declares a set of allowed attributes for each element. Each attribute declaration defines the name, datatype, default values (if any), and behavior (e.g., if it is required or optional) of the attribute.
  • A DTD provides a variety of mechanisms to make managing the model easier, for example, the use of parameter entities and the ability to import pieces of the model from an external file.
According to the XML Recommendation, all external parsed entities (including DTDs) should begin with a text declaration. It looks like an XML declaration except that it explicitly excludes the standalone property. If you need to specify a character set other than the default UTF-8 (see Chapter 9 for more about character sets), or to change the XML version number from the default 1.0, this is where you would do it.
If you specify a character set in the DTD, it won't automatically carry over into XML documents that use the DTD. XML documents have to specify their own encodings in their document prologs.
After the text declaration, the resemblance to normal document prologs ends. External parsed entities, including DTDs, must not contain a document type declaration.
W3C XML Schema
DTDs are chiefly directed toward describing how elements are arranged in a document. They say very little about the content in the document, other than whether an element can contain character data. Although attributes can be declared to be of different types (e.g., ID, IDREF, enumerated), there is no way to constrain the type of data in an element.
Returning to the example in Section 4.2.3, we can see how this limitation can be a serious problem. Suppose that a census taker submitted the document in Example 4-5.
Example 4-5. A bad CensusML document
<census-record taker="9170">
  <date><month>?</month><day>110</day><year>03</year></date>
  <address>
    <city>Munchkinland</city>
    <street></street>
    <county></county>
    <country>Here, silly</country>
    <postalcode></postalcode>
  </address>
  <person employed="fulltime" pid="?">
    <name>
      <last>Burgle</last>
      <first>Brad</first>
    </name>
    <age>2131234</age>
    <gender>yes</gender>
  </person>
</census-record>
There are a lot of things wrong with this document. The date is in the wrong format. Several important fields were left empty. The stated age is an impossibly large number. The gender, which ought to be "male" or "female," contains something else. The personal identification number has a bad value. And yet, to our infinite dismay, the DTD would pick up none of these problems.
It isn't hard to write a program that would check the data types, but that's a low-level operation, prone to bugs and requiring technical ability. It's also getting away from the point of DTDs, which is to create a kind of metadocument, a formal description of a markup language. Programming languages aren't portable and don't work well as a way of conveying syntactic and semantic details. So we have to conclude that DTDs don't go far enough in describing a markup language.
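For the sake of argument, here is roughly what such a hand-written checker might look like, sketched in Python with the standard library. Every limit in it is an assumption made up for this illustration (an age ceiling of 130, a purely numeric pid, the list of required address fields), which is exactly the problem: these rules live in code where nobody else can read them.

```python
import re
import xml.etree.ElementTree as ET

# The bad document from Example 4-5.
BAD_RECORD = """<census-record taker="9170">
  <date><month>?</month><day>110</day><year>03</year></date>
  <address>
    <city>Munchkinland</city>
    <street></street>
    <county></county>
    <country>Here, silly</country>
    <postalcode></postalcode>
  </address>
  <person employed="fulltime" pid="?">
    <name>
      <last>Burgle</last>
      <first>Brad</first>
    </name>
    <age>2131234</age>
    <gender>yes</gender>
  </person>
</census-record>"""

DIGITS = re.compile(r"[0-9]+")


def number_in_range(value, low, high):
    """True if value is a run of ASCII digits between low and high inclusive."""
    value = (value or "").strip()
    return bool(DIGITS.fullmatch(value)) and low <= int(value) <= high


def check_census_record(text):
    """Return a list of data problems found in a census-record document."""
    root = ET.fromstring(text)
    problems = []
    if not number_in_range(root.findtext("date/month"), 1, 12):
        problems.append("date/month is not a number from 1 to 12")
    if not number_in_range(root.findtext("date/day"), 1, 31):
        problems.append("date/day is not a number from 1 to 31")
    # Assumed list of address fields that must not be left empty.
    for field in ("street", "county", "postalcode"):
        if not (root.findtext("address/" + field) or "").strip():
            problems.append(f"address/{field} is empty")
    for person in root.iter("person"):
        if not number_in_range(person.findtext("age"), 0, 130):
            problems.append("age is not a number from 0 to 130")
        if (person.findtext("gender") or "").strip() not in ("male", "female"):
            problems.append('gender is not "male" or "female"')
        if not DIGITS.fullmatch(person.get("pid", "")):
            problems.append("pid is not numeric")
    return problems


if __name__ == "__main__":
    for problem in check_census_record(BAD_RECORD):
        print(problem)
```

It works, but notice how much of the language's definition is now buried in if statements. A schema language that handles data types lets you state the same constraints declaratively.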
To make matters worse, what the DTD will reject as bad markup is often trivial. For example, the contents of date and
RELAX NG
RELAX NG is a powerful schema validation language that builds on earlier work including RELAX and TREX. Like W3C Schema, it uses XML syntax and supports namespaces and data typing. It goes further by integrating attributes into content models, which greatly simplifies the structure of the schema. It offers superior handling of unordered content and supports context-sensitive content models.
In general, it just seems easier to write schemas in RELAX NG than in W3C Schema. The syntax is very clear, with elements like zeroOrMore for specifying optional repeating content. Declarations can contain other declarations, leading to a more natural representation of a document's structure.
Consider the simple schema in Example 4-7, which models a document type for logging work activity. It's easy to read this schema and understand the structure of a typical document.
Example 4-7. A simple RELAX NG schema
<element name="worklog"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:ann="http://relaxng.org/ns/compatibility/annotations/1.0">
  <ann:documentation>A document for logging work activity, broken down
         into days, and further into tasks.</ann:documentation>
  <zeroOrMore>
    <element name="day">
      <attribute name="date">
        <text/>
      </attribute>
      <zeroOrMore>
        <element name="task">
          <element name="description">
            <text/>
          </element>
          <element name="time-start">
            <text/>
          </element>
          <element name="time-end">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </zeroOrMore>
</element>
The same thing would look like this as a DTD:
<!ELEMENT worklog (day*)>
<!ELEMENT day (task*)>
<!ELEMENT task (description, time-start, time-end)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT time-start (#PCDATA)>
<!ELEMENT time-end (#PCDATA)>
<!ATTLIST day date CDATA #REQUIRED>
Although the DTD is more compact, it relies on a special syntax that is decidedly not XML-ish. RELAX NG accomplishes the same thing with more readability.
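Both versions express the same content models, and a content model is essentially a regular expression over child element names. The following Python sketch makes that idea concrete for the worklog structure. It is only an illustration under that assumption, not how a DTD or RELAX NG processor is actually implemented; the names CONTENT_MODELS, REQUIRED_ATTRIBUTES, and validate_worklog are invented here, and the sketch ignores character data and undeclared attributes.

```python
import re
import xml.etree.ElementTree as ET

# One regular expression per element, matched against the sequence of
# child names, each followed by a space. "(day )*" mirrors (day*).
CONTENT_MODELS = {
    "worklog": r"(day )*",
    "day": r"(task )*",
    "task": r"description time-start time-end ",
    "description": r"",  # text only, no child elements
    "time-start": r"",
    "time-end": r"",
}
REQUIRED_ATTRIBUTES = {"day": {"date"}}


def validate_worklog(text):
    """Return a list of error messages; an empty list means the checks passed."""
    root = ET.fromstring(text)
    errors = []
    if root.tag != "worklog":
        errors.append(f"root element is <{root.tag}>, expected <worklog>")
    for elem in root.iter():
        model = CONTENT_MODELS.get(elem.tag)
        if model is None:
            errors.append(f"undeclared element <{elem.tag}>")
            continue
        children = "".join(child.tag + " " for child in elem)
        if not re.fullmatch(model, children):
            found = children.strip() or "(no elements)"
            errors.append(f"bad content in <{elem.tag}>: {found}")
        missing = REQUIRED_ATTRIBUTES.get(elem.tag, set()) - set(elem.attrib)
        for name in sorted(missing):
            errors.append(f"missing required attribute {name} on <{elem.tag}>")
    return errors


if __name__ == "__main__":
    doc = (
        '<worklog><day date="2003-01-06"><task>'
        "<description>Write chapter</description>"
        "<time-start>09:00</time-start><time-end>10:30</time-end>"
        "</task></day></worklog>"
    )
    print(validate_worklog(doc))
```

Swapping the order of time-start and time-end, or dropping the date attribute, produces an error message, just as a validating parser would complain.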
Schematron
Schematron takes a different approach from the schema languages we've seen so far. Rather than being prescriptive, as in "this element has the following content model," it relies on a series of Boolean tests. Depending on the result of a test, the schema will output some predetermined message.
The tests are based on XPath, which is a very granular and exhaustive set of node examination tools. Relying on XPath is clever, taking much of the complexity out of the schema language. XPath, which is used in places such as XSLT and some implementations of DOM, can scratch an itch that more blunt tools like DTDs can't reach. As the creator of Schematron, Rick Jelliffe, says, it's like "a feather duster for the furthest corners of a room where the vacuum cleaner (DTD) cannot reach."
The basic structure of a Schematron schema is this:
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern>
    <rule context="XPath Expression">
      <assert test="XPath Expression">
        message
      </assert>
      <report test="XPath Expression">
        message
      </report>
      ...more tests...
    </rule>
    ...more rules...
  </pattern>
  ...more patterns...
</schema>
A pattern in Schematron does not carry the same meaning as patterns in RELAX NG. Here, it's just a logical grouping of rules. If your schema is testing books, one pattern may hold rules for chapters while another groups rules for appendixes. So think of this as more of a higher-level, conceptual testing pattern, rather than as a specific node-matching pattern.
The context for each test is determined by a rule. Its context attribute contains an XSLT pattern that matches nodes. Each node found becomes the context node, on which all tests inside the rule are applied.
The children of a rule, report and assert, each apply a test to the context node. The test is another XPath expression, stored in a test attribute. report's contents will be output if its XPath expression evaluates to "true."
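The rule, assert, and report mechanics can be imitated in a few lines of Python. This is a loose sketch of the idea, not a Schematron implementation: it uses the small path subset that the standard library's xml.etree.ElementTree understands instead of full XPath, and the rules, element names, and the function run_schema are all made up for illustration. As in Schematron, an assert speaks up when its test is false and a report speaks up when its test is true.

```python
import xml.etree.ElementTree as ET

# Each rule is (context path, list of (kind, test path, message)).
# Paths use ElementTree's limited path syntax, not full XPath.
RULES = [
    (".//task", [
        ("assert", "description", "A task must have a description."),
        ("report", "overtime", "This task includes overtime."),
    ]),
]


def run_schema(text, rules=RULES):
    """Apply every test to every context node and collect the messages."""
    root = ET.fromstring(text)
    output = []
    for context, tests in rules:
        # Each node matched by the context path becomes the context node.
        for node in root.findall(context):
            for kind, test, message in tests:
                result = node.find(test) is not None
                if kind == "assert" and not result:
                    output.append(message)
                elif kind == "report" and result:
                    output.append(message)
    return output


if __name__ == "__main__":
    doc = (
        '<worklog><day date="2003-01-06">'
        "<task><description>Write chapter</description><overtime/></task>"
        "<task/>"
        "</day></worklog>"
    )
    for message in run_schema(doc):
        print(message)
```

Here a test counts as true when the path finds at least one node, which is how XPath treats a node-set in a Boolean context. Real XPath tests can also compare values, count nodes, and call functions, which is where Schematron gets its reach.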
Schemas Compared
Each of the schema languages we've looked at has compelling features and significant flaws. Some of the important points are listed in Table 4-2.
Table 4-2: A comparison of schemas
Feature                    DTD   W3C Schema   RELAX NG   Schematron
XML syntax                 No    Yes          Yes        Yes
Namespace compatible       No    Yes          Yes        Yes
Declares entities          Yes   No           No         No
Tests datatypes            No    Yes          Yes        Yes
Default attribute values   Yes   Yes          No         No
Notations                  Yes
Chapter 5: Presentation Part I: CSS