Learning XML: (Guide to) Creating Self-Describing Data

Erik T. Ray
January 2001
0-596-00046-4, Order Number: 0464
368 pages, $34.95

Chapter 2:
Markup and Core Concepts

Contents:

The Anatomy of a Document
Elements: The Building Blocks of XML
Attributes: More Muscle for Elements
Namespaces: Expanding Your Vocabulary
Entities: Placeholders for Content
Miscellaneous Markup
Well-Formed Documents
Getting the Most out of Markup
XML Application: DocBook

This is probably the most important chapter in the book, as it describes the fundamental building blocks of all XML-derived languages: elements, attributes, entities, and processing instructions. It explains what a document is, and what it means to say it is well-formed or valid. Mastering these concepts is a prerequisite to understanding the many technologies, applications, and software related to XML.

How do we know so much about the syntactical details of XML? It's all described in a technical document maintained by the W3C, the XML recommendation (http://www.w3.org/TR/2000/REC-xml-20001006). It's not light reading, and most users of XML won't need it, but you may be curious to know where this is coming from. For those interested in the standards process and what all the jargon means, take a look at Tim Bray's interactive, annotated version of the recommendation at http://www.xml.com/axml/testaxml.htm.

The Anatomy of a Document

Example 2-1 shows a bite-sized XML example. Let's take a look.

Example 2.1. A Small XML Document

<?xml version="1.0"?>
<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Don't forget to recharge K-9 
    <emphasis>twice a day</emphasis>. 
    Also, I think we should have his 
    bearings checked out. See you soon 
    (or late). I have a date with 
    some <villain>Daleks</villain>...
  </message>
  <from>The Doctor</from>
</time-o-gram>

It's a goofy example, but perfectly acceptable XML. XML lets you name the parts anything you want, unlike HTML, which limits you to predefined tag names. XML doesn't care how you're going to use the document, how it will appear when formatted, or even what the names of the elements mean. All that matters is that you follow the basic rules for markup described in this chapter. This is not to say that matters of organization aren't important, however. You should choose element names that make sense in the context of the document, instead of random things like signs of the zodiac. This is more for your benefit and the benefit of the people using your XML application than anything else.

This example, like all XML, consists of content interspersed with markup symbols. The angle brackets (<>) and the names they enclose are called tags. Tags demarcate and label the parts of the document, and add other information that helps define the structure. The text between the tags is the content of the document, raw information that may be the body of a message, a title, or a field of data. The markup and the content complement each other, creating an information entity with partitioned, labeled data in a handy package.

Although XML is designed to be relatively readable by humans, it isn't intended to create a finished document. In other words, you can't open up just any XML-tagged document in a browser and expect it to be formatted nicely.[1] XML is really meant as a way to hold content so that, when combined with other resources such as a stylesheet, the document becomes a finished product with style and polish.

[1]Some browsers, such as Internet Explorer 5.0, do attempt to handle XML in an intelligent way, often by displaying it as a hierarchical outline that can be understood by humans. However, while it looks a lot better than munged-together text, it is still not what you would expect in a finished document. For example, a table should look like a table, a paragraph should be a block of text, and so on. XML on its own cannot convey that information to a browser.

We'll look at how to combine a stylesheet with an XML document to generate formatted output in Chapter 4, "Presentation: Creating the End Product". For now, let's just imagine what it might look like with a simple stylesheet applied. For example, it could be rendered as shown in Example 2-2.

Example 2.2. The Memorandum, Formatted with a Stylesheet

TIME-O-GRAM
Priority: important
To: Sarah
Subject: Reminder
Don't forget to recharge K-9 twice a day. 
Also, I think we should have his bearings checked out. 
See you soon (or late).  I have a date with some Daleks...
From: The Doctor

The rendering of this example is purely speculative at this point. If we used some other stylesheet, we could format the same memo a different way. It could change the order of elements, say by displaying the From: line above the message body. Or it could compress the message body to a width of 20 characters. Or it could go even further by using different fonts, creating a border around the message, causing parts to blink on and off--whatever you want. The beauty of XML is that it doesn't put any restrictions on how you present the document.
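To make the idea a little more concrete before we get to real stylesheets, here is a minimal sketch in Python that produces a layout like Example 2-2. It is not a stylesheet language; the render_memo function and the layout it hard-codes are our own assumptions, and it relies only on the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

MEMO = """<?xml version="1.0"?>
<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Don't forget to recharge K-9
    <emphasis>twice a day</emphasis>.
    Also, I think we should have his
    bearings checked out. See you soon
    (or late). I have a date with
    some <villain>Daleks</villain>...
  </message>
  <from>The Doctor</from>
</time-o-gram>"""


def render_memo(xml_text):
    """Return a plain-text rendering of a time-o-gram (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    # Gather all text inside <message>, including text held by child
    # elements, then collapse runs of whitespace into single spaces.
    body = " ".join("".join(root.find("message").itertext()).split())
    return "\n".join([
        root.tag.upper(),
        "Priority: " + root.get("pri"),
        "To: " + root.findtext("to"),
        "Subject: " + root.findtext("subject"),
        body,
        "From: " + root.findtext("from"),
    ])


print(render_memo(MEMO))
```

Moving the From: line above the message body would mean reordering the list inside render_memo; the XML document itself would not change.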

Let's look closely at the markup to discern its structure. As Figure 2-1 demonstrates, the markup tags divide the memo into regions, represented in the diagram as boxes containing other boxes. The first box contains a special declarative prolog that provides administrative information about the document. (We'll come back to that in a moment.) The other boxes are called elements. They act as containers and labels of text. The largest element, labeled <time-o-gram>, surrounds all the other elements and acts as a package that holds together all the subparts. Inside it are specialized elements that represent the distinct functional parts of the document. Looking at this diagram, we can say that the major parts of a <time-o-gram> are the destination (<to>), the sender (<from>), a message teaser (<subject>), and the message body (<message>). The last is the most complex, mixing elements and text together in its content. So we can see from this example that even a simple XML document can harbor several levels of structure.

Figure 2.1. Elements in the memo document

A Tree View

Elements divide the document into its constituent parts. They can contain text, other elements, or both. Figure 2-2 breaks out the hierarchy of elements in our memo. This diagram, called a tree because of its branching shape, is a useful representation for discussing the relationships between document parts. The black rectangles represent the seven elements. The top element (<time-o-gram>) is called the root element. You'll often hear it called the document element, because it encloses all the other elements and thus defines the boundary of the document. The rectangles at the end of the element chains are called leaves, and represent the actual content of the document. Every object in the picture with arrows leading to or from it is a node.

Figure 2.2. Tree diagram of the memo

There's one piece of Figure 2-2 that we haven't yet mentioned: the box on the left labeled pri. It was inside the <time-o-gram> tag, but here we see it branching off the element. This is a special kind of content called an attribute that provides additional information about an element. Like an element, an attribute has a label (pri) and some content (important). You can think of it as a name/value pair contained in the <time-o-gram> element tag. Attributes are used mainly for modifying an element's behavior rather than holding data; later processing might print "High Priority" in large letters at the top of the document, for example.

Now let's stretch the tree metaphor further and think about the diagram as a sort of family tree, where every node is a parent or a child (or both) of other nodes. Note, though, that unlike a family tree, an XML element has only one parent. With this perspective, we can see that the root element (a grizzled old <time-o-gram>) is the ancestor of all the other elements. Its children are the four elements directly beneath it. They, in turn, have children, and so on until we reach the childless leaf nodes, which contain the text of the document and any empty elements. Elements that share the same parent are said to be siblings.
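The family-tree vocabulary maps directly onto what a parser hands back. The following is a small sketch using Python's standard xml.etree.ElementTree module on a shortened copy of the memo (the shortening and the variable names are ours). ElementTree does not record parent links, so the sketch builds a child-to-parent map itself.

```python
import xml.etree.ElementTree as ET

MEMO = (
    '<time-o-gram pri="important">'
    "<to>Sarah</to>"
    "<subject>Reminder</subject>"
    "<message>Don't forget to recharge K-9 "
    "<emphasis>twice a day</emphasis>. I have a date with "
    "some <villain>Daleks</villain>...</message>"
    "<from>The Doctor</from>"
    "</time-o-gram>"
)

root = ET.fromstring(MEMO)

# The root (document) element and its attribute, a name/value pair.
print(root.tag, root.attrib)

# The four children of the root are siblings of one another.
children = [child.tag for child in root]
print(children)

# Build a child -> parent map; every element except the root has one parent.
parent_of = {child: parent for parent in root.iter() for child in parent}
villain = root.find("message/villain")
print(parent_of[villain].tag)

# Every element in the tree, root included, in document order.
all_elements = [el.tag for el in root.iter()]
print(len(all_elements))
```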

Every node in the tree can be thought of as the root of a smaller subtree. Subtrees have all the properties of a regular tree, and the top of each subtree is the ancestor of all the descendant nodes below it. We will see in Chapter 6, "Transformation: Repurposing Documents", that an XML document can be processed easily by breaking it down into smaller subtrees and reassembling the result later. Figure 2-3 shows some examples of subtrees in our <time-o-gram> example.

Figure 2.3. Some subtrees

And that's the 10-minute overview of XML. The power of XML is its simplicity. In the rest of this chapter, we'll talk about the details of the markup.

The Document Prolog

Somehow, we need to tip off the world that our document is marked up in XML. If we leave it to a computer program to guess, we're asking for trouble. A lot of markup languages look similar, and when you add different versions to the mix, it becomes difficult to tell them apart. This is especially true for documents on the World Wide Web, where there are literally hundreds of different file formats in use.

The top of an XML document is graced with special information called the document prolog. At its simplest, the prolog merely says that this is an XML document and declares the version of XML being used:

<?xml version="1.0"?>

But the prolog can hold additional information that nails down such details as the document type definition being used, declarations of special pieces of text, the text encoding, and instructions to XML processors.

Let's look at a breakdown of the prolog, and then we'll examine each part in more detail. Figure 2-4 shows an XML document. At the top is an XML declaration (1). After this is a document type declaration (2) that links to a document type definition (3) in a separate file. This is followed by a set of declarations (4). These four parts together comprise the prolog (6), although not every prolog will have all four parts. Finally, the root element (5) contains the rest of the document. This ordering cannot be changed: if there is an XML declaration, it must be on the first line; if there is a document type declaration, it must precede the root element.

Figure 2.4. A Document with a prolog and a root element

Let's take a closer look at our <time-o-gram> document's prolog, shown here in Example 2-3. Note that because we're examining the prolog in more detail, the numbers in Example 2-3 aren't the same as those in Figure 2-4.

Example 2.3. A Document Prolog

<?xml version="1.0" encoding="utf-8"?>                         (1)
<!DOCTYPE time-o-gram                                          (2)
    PUBLIC "-//LordsOfTime//DTD TimeOGram 1.8//EN"             (3)
    "http://www.lordsoftime.org/DTDs/timeogram.dtd"            (4)
[                                                              (5)
    <!ENTITY sj "Sarah Jane">                                  (6)
    <!ENTITY me "Doctor Who">
]>                                                             (7)

(1) The XML declaration describes some of the most general properties of the document, telling the XML processor that it needs an XML parser to interpret this document.

(2) The document type declaration describes the root element type, in this case <time-o-gram>, and (on lines 3 and 4) designates a document type definition (DTD) to control markup structure.

(3) The identity code, called a public identifier, specifies the DTD to use.

(4) A system identifier specifies the location of the DTD. In this example, the system identifier is a URL.

(5) This is the beginning of the internal subset, which provides a place for special declarations.

(6) Inside this internal subset are two entity declarations.

(7) The end of both the internal subset (]) and the document type declaration (>) complete the prolog.

Each of these terms is described in more detail later in this chapter.
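In the meantime, here is a rough sketch of what the internal subset buys you, written in Python with the standard xml.etree.ElementTree module. It assumes a cut-down prolog with no public or system identifier, since this non-validating parser does not fetch an external DTD, and the <to> and <from> content is invented to give the two entities somewhere to appear.

```python
import xml.etree.ElementTree as ET

# A prolog with an XML declaration and a document type declaration whose
# internal subset declares two entities. The external DTD identifiers from
# Example 2-3 are deliberately left out of this sketch.
DOC = """<?xml version="1.0"?>
<!DOCTYPE time-o-gram [
    <!ENTITY sj "Sarah Jane">
    <!ENTITY me "Doctor Who">
]>
<time-o-gram>
  <to>&sj;</to>
  <from>&me;</from>
</time-o-gram>"""

root = ET.fromstring(DOC)

# The parser replaces each entity reference with its declared text.
recipient = root.findtext("to")
sender = root.findtext("from")
print(recipient, "/", sender)
```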

The XML declaration

The XML declaration is an announcement to the XML processor that this document is marked up in XML. Its form is shown in Figure 2-5. The declaration begins with the five-character delimiter <?xml (1), followed by some number of property definitions (2), each of which has a property name (3) and value in quotes (4). The declaration ends with the two-character closing delimiter ?> (5).

Figure 2.5. XML declaration syntax

There are three properties that you can set:

version

Sets the version number. Currently there is only one XML version, so the value is always 1.0. However, as new versions are approved, this property will tell the XML processor which version to use. You should always define this property in your prolog.

encoding

Defines the character encoding used in the document, such as US-ASCII or iso-8859-1. If you know you're using an encoding other than the default UTF-8 (for example, one designed for Japanese Katakana or Cyrillic text), you should declare this property. Otherwise, it's okay to leave it out. Character encodings are explained in Chapter 7, "Internationalization".

standalone

Tells the XML processor whether there are any other files to load. For example, you would set this to no if there are external entities (see "Entities: Placeholders for Content" later in this chapter) or a DTD to load in addition to the document's main file. If you know that the file can stand on its own, setting standalone="yes" can improve downloading performance. This parameter is explained in more detail in Chapter 5, "Document Models: A Higher Level of Control".

Some examples of well-formed XML declarations are:

<?xml version="1.0"?>
<?xml version='1.0' encoding='US-ASCII' standalone='yes'?>
<?xml version = '1.0' encoding= 'iso-8859-1' standalone ="no"?>

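The XML declaration itself is optional, as are the encoding and standalone properties, but whenever the declaration is present it must include the version number, and the properties must appear in the order version, encoding, standalone. Stating the version also protects you in case something changes drastically in a future revision of the XML specification. The parameter names must be lowercase, and all values must be quoted with either double or single quotes.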
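If you'd like to see a parser's opinion of these declarations, the following sketch feeds each one, followed by a made-up empty <memo/> root element, to Python's standard xml.etree.ElementTree parser. The is_well_formed helper is our own invention, and a different parser may word its complaints differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data):
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


declarations = [
    b'<?xml version="1.0"?>',
    b"<?xml version='1.0' encoding='US-ASCII' standalone='yes'?>",
    b"<?xml version = '1.0' encoding= 'iso-8859-1' standalone =\"no\"?>",
]

# Each declaration needs a root element after it to form a document.
results = [is_well_formed(decl + b"<memo/>") for decl in declarations]
print(results)

# A declaration that is not at the very start of the document is rejected.
late = b'\n<?xml version="1.0"?><memo/>'
print(is_well_formed(late))
```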

The document type declaration

The second part of the prolog is the document type declaration.[2] This is where you can specify various parameters such as entity declarations, the DTD to use for validating the document, and the name of the root element. By referring to a DTD, you are requesting that the parser compare the document instance to a document model, a process called validity checking. Checking the validity of your document is optional, but it is useful if you need to ensure that the document follows predictable patterns and includes required data. See Chapter 5, "Document Models: A Higher Level of Control" for detailed information on DTDs and validity checking.

[2]Be careful not to confuse this term with the document type definition, DTD. A DTD is a collection of parameters that describe a document type, and can be used by many instances of that document type.

The syntax for a document type declaration is shown in Figure 2-6. The declaration starts with the literal string <!DOCTYPE (1) followed by the root element (2), which is the first XML element to appear in the document and the one that contains the rest of the document. If you are using a DTD with the document, you need to include the URI of the DTD (3) next, so the XML processor can find it. After that comes the internal subset (5), which is bound on either side by square brackets (4 and 6). The declaration ends with a closing >.

Figure 2.6. Document type declaration syntax

The internal subset provides a place to put various declarations for use in your document, as we saw in Figure 2-4. These declarations might include entity definitions, and parts of DTDs. The internal subset is the only place where you can put these declarations within the document itself.

The internal subset is used to augment or redefine the declarations found in the external subset. The external subset is the collection of declarations existing outside the document, like in a DTD. The URI you provide in the document type declaration points to a file containing these external declarations. Internal and external subsets are optional. Chapter 5, "Document Models: A Higher Level of Control" explains internal and external subsets.

Elements: The Building Blocks of XML

Elements are parts of a document. You can separate a document into parts so they can be rendered differently, or used by a search engine. Elements can be containers, with a mixture of text and other elements. This element contains only text:

<flooby>This is text contained inside an element</flooby>

and this element contains both text and elements:

<outer>this is text<inner>more
text</inner>still more text</outer>

Some elements are empty, and contribute information by their position and attributes. There is an empty element inside this example:

<outer>an element can be empty: <nuttin/></outer>

Figure 2-7 shows the syntax for a container element. It begins with a start tag (1) consisting of an angle bracket (<) followed by a name (2). The start tag may contain some attributes (3) separated by whitespace, and it ends with a closing angle bracket (>). An attribute defines a property of the element and consists of a name (4) joined by an equals sign (=) to a value in quotes (5). An element can have any number of attributes, but no two attributes can have the same name. Following the start tag is the element's content (6), which in turn is followed by an end tag (7). The end tag consists of an opening angle bracket, a slash, the element's name, and a closing bracket. The end tag has no attributes, and the element name must match the start tag's name exactly.

Figure 2.7. Container element syntax

As shown in Figure 2-8, an empty element (one with no content) consists of a single tag (1) that begins with an opening angle bracket (<) followed by the element name (2). This is followed by some number of attributes (3), each of which consists of a name (4) and a value in quotes (5), and the element ends with a slash (/) and a closing angle bracket.

Figure 2.8. Empty element syntax

An element name must start with a letter or an underscore, and can contain any number of letters, numbers, hyphens, periods, and underscores.[3] Element names can include accented Roman characters; letters from alphabets such as Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, and Devanagari; and ideograms from Chinese, Japanese, and Korean. The colon symbol is used in namespaces, as explained in "Namespaces: Expanding Your Vocabulary," so avoid using it in element names that don't use a namespace. Space, tab, newline, equals sign, and any quote characters are separators for element names, attribute names, and attribute values, so they are not allowed either. Some valid element names are: <Bob>, <chapter.title>, <THX-1138>, or even <_>. XML names are case-sensitive, so <Para>, <para>, and <pArA> are three different elements.

[3]Practically speaking, you should avoid using extremely long element names, in case an XML processor cannot handle names above a certain length. There is no specific number, but probably anything over 40 characters is unnecessarily long.

There can be no space between the opening angle bracket and the element name, but adding extra space anywhere else in the element tag is okay. This allows you to break an element across lines to make it more readable. For example:

<boat
  type="trireme"
><crewmember   class="rower">Dronicus Laborius</crewmember    >

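There are two rules about the positioning of start and end tags:

The end tag must come after the start tag.

An element's start and end tags must both reside in the same parent element.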

To understand the second rule, think of elements as boxes. A box can sit inside or outside another box, but it can't protrude through the box without making a hole in the side. Thus, the following example of overlapping elements doesn't work:

<a>Don't <b>do</a> this!</b>

These untangled elements are okay:

<a>No problem</a><b>here</b>
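A parser enforces the box rule for you. In this sketch, which uses Python's standard xml.etree.ElementTree module and a helper of our own called is_well_formed, the untangled elements are wrapped in a hypothetical <doc> element, because a complete document needs a single root.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# <b> starts inside <a> but ends outside it, so the elements overlap.
overlapping = "<a>Don't <b>do</a> this!</b>"

# Side-by-side elements nest cleanly inside a single root element.
untangled = "<doc><a>No problem</a><b>here</b></doc>"

# Without the wrapper there are two top-level elements, which is also an error.
two_roots = "<a>No problem</a><b>here</b>"

print(is_well_formed(overlapping))
print(is_well_formed(untangled))
print(is_well_formed(two_roots))
```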

Anything in the content that is not an element is text, or character data. The text can include any character in the character set that was specified in the prolog. However, some characters must be represented in a special way so as not to confuse the parser. For example, the left angle bracket (<) is reserved for element tags. Including it directly in content causes an ambiguous situation: is it the start of an XML tag or is it just data? Here's an example:

<foo>x < y</foo>    yikes!

To resolve this conflict, you need to use a special code in place of the offending character. For the left angle bracket, the code is &lt;. (The equivalent code for the right angle bracket is &gt;.) So we can rewrite the above example like this:

<foo>x &lt; y</foo>

Such a substitution is known as an entity reference. We'll describe entities and entity references in "Entities: Placeholders for Content."
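When markup is generated by a program, the substitution is usually left to a library. The sketch below uses the escape function from Python's standard xml.sax.saxutils module, then parses the result with xml.etree.ElementTree to confirm that the original text comes back unchanged. The variable names are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "x < y"

# escape() replaces &, < and > with the corresponding entity references.
escaped = escape(raw)
markup = "<foo>" + escaped + "</foo>"
print(markup)

# Parsing turns the entity reference back into the character it stands for.
parsed = ET.fromstring(markup)
print(parsed.text)

# The unescaped version is rejected by the parser.
try:
    ET.fromstring("<foo>" + raw + "</foo>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```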

In XML, all characters are preserved as a matter of course, including the white-space characters space, tab, and newline; compare this to programming languages such as Perl and C, where whitespace characters are essentially ignored. In markup languages such as HTML, multiple sequential spaces are collapsed by the browser into a single space, and lines can be broken anywhere to suit the formatter. XML, on the other hand, keeps all space characters by default.
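Here is a quick way to see that behavior, sketched with Python's standard xml.etree.ElementTree module. The <line> element is invented for the occasion, and the last step imitates the HTML-style collapsing that an application could choose to apply afterward.

```python
import xml.etree.ElementTree as ET

# The content holds leading spaces, a run of spaces, a newline, and a tab.
root = ET.fromstring("<line>  two   spaces\n\tand a tab </line>")

# The parser reports the character data exactly as written.
preserved = root.text
print(repr(preserved))

# Collapsing whitespace is a decision made later by the application.
collapsed = " ".join(preserved.split())
print(repr(collapsed))
```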

XML Is Not HTML

If you've had some experience writing HTML documents, you should pay close attention to XML's rules for elements. Shortcuts you can get away with in HTML are not allowed in XML. Some important changes you should take note of include:

Unlike many HTML elements, XML elements are based strictly on function, and not on format. You should not assume any kind of formatting or presentational style based on markup alone. Instead, XML leaves presentation for stylesheets, which are separate documents that map the elements to styles.

Attributes: More Muscle for Elements

Sometimes you need to convey more information about an element than its name and content can express. The use of attributes lets you describe details about the element more clearly. An attribute can be used to give the element a unique label so it can be easily located, or it can describe a property about the element, such as the location of a file at the end of a link. It can be used to describe some aspect of the element's behavior or to create a subtype. For example, in our <time-o-gram> earlier in the chapter, we used the attribute pri to identify it as having a high priority. As shown in Figure 2-9, an attribute consists of a property name (1), an equals sign (2), and a value in quotes (3).

Figure 2.9. Attribute syntax

An element can have any number of attributes, as long as each has a unique name. Here is an element with three attributes:

<kiosk music="bagpipes" color="red" id="page-81527">

Attributes are separated by spaces. They must always follow the element name, but they can be in any order. The values must be in single (') or double (") quotes. If the value contains quotes, use the opposite kind of quote to contain it. Here is an example:

<choice test='msg="hi"'/>

If you prefer, you can replace the quote with the entity &apos; for a single quote or &quot; for a double quote:

<choice test='msg=&quot;hi&quot;'/>
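Both spellings produce the same attribute value once parsed, as this short sketch with Python's standard library suggests. The quoteattr function from xml.sax.saxutils picks a suitable quoting style when you are writing attributes from a program; the variable names are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Double quotes inside a single-quoted attribute value.
a = ET.fromstring("""<choice test='msg="hi"'/>""")

# The same value written with entity references.
b = ET.fromstring("""<choice test='msg=&quot;hi&quot;'/>""")

print(a.get("test"))
print(a.get("test") == b.get("test"))

# quoteattr() wraps a value in whichever kind of quote avoids a clash.
quoted = quoteattr('msg="hi"')
print(quoted)
```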

An element can contain only one occurrence of each attribute. So the following is not allowed:

<!-- Wrong -->
<team person="sue" person="joe" person="jane">

Here are some possible alternatives. Use one attribute to hold all the values:

<team persons="sue joe jane">

Use multiple attributes:

<team person1="sue" person2="joe" person3="jane">

Use elements:

<team>
  <person>sue</person>
  <person>joe</person>
  <person>jane</person>
</team>
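To compare these options from a program's point of view, here is a sketch using Python's standard xml.etree.ElementTree module. The repeated attribute is rejected outright, while the single-attribute and child-element forms both yield the same list of names with a little work. The variable names are our own.

```python
import xml.etree.ElementTree as ET

# Repeating an attribute name makes the document ill-formed.
try:
    ET.fromstring('<team person="sue" person="joe" person="jane"/>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False
print(duplicate_ok)

# One attribute holding all the values: the application splits it.
packed = ET.fromstring('<team persons="sue joe jane"/>')
from_attribute = packed.get("persons").split()

# Child elements: the structure is already explicit in the markup.
nested = ET.fromstring(
    "<team>"
    "<person>sue</person>"
    "<person>joe</person>"
    "<person>jane</person>"
    "</team>"
)
from_elements = [person.text for person in nested.findall("person")]

print(from_attribute)
print(from_elements)
```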

Attribute values can be constrained to certain types if you use a DTD. One type is ID, which tells XML that the value is a unique identifier code for the element. No two elements in a document can have the same ID. Another type, IDREF, is a reference to an ID. Let's demonstrate how these might be used. First, there is an element somewhere in the document with an ID-type attribute:

<part id="bolt-1573">...</part>

Elsewhere, there is an element that refers to it:

<part id="nut-44456">
  <description>This nut is compatible with <partref
  idref="bolt-1573"/>.</description>...

If you use a DTD with your document, you can actually assign the ID and IDREF types to particular attributes and your XML parser will enforce the syntax of the value, as well as warn you if the IDREF points to a nonexistent element or if the ID doesn't have a unique value. We talk more about these attributes in Chapter 3, "Connecting Resources with Links".
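A non-validating parser won't perform these checks, but they are easy to approximate by hand. The sketch below is only an illustration: it assumes, without any DTD, that attributes named id and idref play the ID and IDREF roles, and the <catalog> wrapper, the washer part, and the check_ids function are all invented for the example.

```python
import xml.etree.ElementTree as ET

CATALOG = """<catalog>
  <part id="bolt-1573"><description>A bolt.</description></part>
  <part id="nut-44456">
    <description>This nut is compatible with <partref
    idref="bolt-1573"/>.</description>
  </part>
  <part id="washer-7">
    <description>Fits <partref idref="bolt-9999"/>.</description>
  </part>
</catalog>"""


def check_ids(root, id_attr="id", ref_attr="idref"):
    """Return (duplicate_ids, dangling_refs) found under root."""
    seen = set()
    duplicates = []
    for el in root.iter():
        value = el.get(id_attr)
        if value is None:
            continue
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    # A reference is dangling if no element carries the ID it names.
    dangling = [
        el.get(ref_attr)
        for el in root.iter()
        if el.get(ref_attr) is not None and el.get(ref_attr) not in seen
    ]
    return duplicates, dangling


duplicates, dangling = check_ids(ET.fromstring(CATALOG))
print(duplicates, dangling)
```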

Another way a DTD can restrict attributes is by creating an allowed set of values. You may want to use an attribute called day that can have one of seven values: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", or "Sunday". The DTD can then tell an XML parser to reject any value not on that list, e.g., day="Halloween" is invalid. For a more detailed explanation of attribute types, see Chapter 5, "Document Models: A Higher Level of Control".

Reserved Attribute Names

Some attribute names have been set aside for special purposes by the XML working group. These attributes are reserved for XML's use and begin with the prefix xml:. The names xml:lang and xml:space are defined for XML Version 1.0. Two other names, xml:link and xml:attribute, are defined by XLink, another standard that complements XML and defines how elements can link to one another. These special attribute names are described here:

xml:lang

Classifies an element by the language of its content. For example, xml:lang="en" describes an element as having English content. This is useful for creating conditional text, which is content selected by an XML processor based on criteria such as what language the user wants to view a document in. We'll return to this topic in Chapter 7, "Internationalization".

xml:space

Specifies whether whitespace should be preserved in an element's content. If set to "preserve", any XML processor displaying the document should honor all newlines, spaces, and tabs in the element's content. If it is set to "default", then the processor can do whatever it wants with whitespace (i.e., it sets its own default). If the xml:space attribute is omitted, the processor preserves whitespace by default. Thus, if you want to compress whitespace in an element, set the attribute xml:space="default" and make sure you are using an XML processor whose default is to remove extra whitespace.

xml:link

Signals to an XLink processor that an element is a link element. For information on how to use this attribute, see Chapter 3, "Connecting Resources with Links".

xml:attribute

In addition to xml:link, XLink relies on a number of attribute names. But to prevent conflict with other potential uses of those attributes, XLink defines the xml:attribute attribute, which allows you to "remap" those special attributes. That is, you can say, "When XLink is looking for an attribute called title, I want you to use the attribute called linkname instead." This attribute is also discussed in more detail in Chapter 3, "Connecting Resources with Links".
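Returning to xml:lang and xml:space, the sketch below shows how a program might pick out conditional text by language. It uses Python's standard xml.etree.ElementTree module, which reports attributes carrying the xml: prefix under the reserved namespace name http://www.w3.org/XML/1998/namespace. The <greetings> document and the select_language function are invented for the example.

```python
import xml.etree.ElementTree as ET

# The namespace name that the reserved xml: prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

DOC = (
    "<greetings>"
    '<greeting xml:lang="en">Hello</greeting>'
    '<greeting xml:lang="fr">Bonjour</greeting>'
    '<code xml:space="preserve">  a\n  b</code>'
    "</greetings>"
)

root = ET.fromstring(DOC)


def select_language(root, lang):
    """Return the text of every element whose xml:lang equals lang."""
    key = "{%s}lang" % XML_NS
    return [el.text for el in root.iter() if el.get(key) == lang]


french = select_language(root, "fr")
print(french)

# xml:space is only a request; honoring it is up to the application.
code = root.find("code")
space_mode = code.get("{%s}space" % XML_NS)
print(space_mode, repr(code.text))
```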

Namespaces: Expanding Your Vocabulary

What happens when you want to include elements or attributes from different document types? For example, you might want to put an equation encoded in the MathML language inside an XML document. You can't combine multiple DTDs for a single document, unfortunately, but no one says you have to use a DTD in XML. If you can survive without a DTD (and most browsers will tolerate documents without them), you can use a feature of XML called namespaces.

A namespace is a group of element and attribute names. You can declare that an element exists within a particular namespace and that it should be validated against that namespace's DTD. By appending a namespace prefix to an element or attribute name, you tell the parser which namespace it comes from.

Imagine, for example, that the English language is divided into namespaces corresponding to conceptual topics. We'll take two of these, say hardware and food. The topic hardware contains words such as hammer and bolt, while food has words like fruit and meat. Both namespaces contain the word nut, which has a different meaning in each context even though it's spelled the same in both. It really is two different words with the same name, but how can we express that fact without causing a namespace clash?

This same problem can occur in XML, where two XML objects in different namespaces can have the same name, resulting in ambiguity about where they came from. The solution is to have each element or attribute specify which namespace it comes from by including the namespace as a prefix.

The syntax for this qualified element name is shown in Figure 2-10. A namespace prefix (1) is joined by a colon (2) to the local name of the element or attribute (3).

Figure 2.10. Qualified name syntax

Figure 2-11 illustrates how an element, <nut>, must be treated to use the versions from both the hardware and food namespaces.

Figure 2.11. Qualifying an element's namespace with prefixes

Namespaces aren't useful only for preventing name clashes. More generally, they help the XML processor sort out different groups of elements for different treatments. Returning to the MathML example, the elements from MathML's namespace must be treated differently from regular XML elements. The browser needs to know when to enter "math equation mode" and when to be in "regular XML mode." Namespaces are crucial for the browser to switch modes.

In another example, the transformation language XSLT (see Chapter 6, "Transformation: Repurposing Documents") relies on namespaces to distinguish between XML objects that are data, and those that are instructions for processing the data. The instructional elements and attributes have an xsl: namespace prefix. Anything without a namespace prefix is treated as data in the transformation process.

A namespace must be declared in the document before you can use it. The declaration is in the form of an attribute inside an element. That element and any of its descendants can then use names from the namespace. Figure 2-12 shows the syntax for a namespace declaration. It starts with the keyword xmlns (1) to alert the XML parser that this attribute is a namespace declaration. This is followed by a colon, then a namespace prefix (2), an equals sign, and finally a URL in quotes (3).

Figure 2.12. Namespace declaration syntax

For example:

<part-catalog
    xmlns:bob="http://www.bobco.com/">

If the namespace prefix bob isn't to your liking, you can use any name you want, as long as it observes the element-naming rules. As a result, b, bobs-company, or wiggledy.piggledy are all acceptable names. Be careful not to use prefixes like xml, xsl, or other names reserved by XML and related languages.

The value of the xmlns: attribute is a URL, usually belonging to the organization that maintains the namespace. The XML processor isn't required to do anything with the URL, however. There doesn't even have to be a document at the location it points to. Specifying the URL is a formality to provide additional information about the namespace, such as who owns it and what version you're using.

Any element in the document can contain a namespace declaration. Most often, the root element will contain the declarations used in the document, but that's not a requirement. You may find it useful to limit the scope of a namespace to a region inside the document by declaring the namespace in a deeper element. In that case, the namespace applies only to that element and its descendants.

Here's an example of a document combining two namespaces, myns and eq:

<?xml version="1.0"?>
<myns:journal xmlns:myns="http://www.psycholabs.org/mynamespace/">
  <myns:experiment>
    <myns:date>March 4, 2001</myns:date>
    <myns:subject>Effects of Caffeine on Psychokinetic
     Ability</myns:subject>
    <myns:abstract>The experiment consists of a subject, a can of 
     caffeinated soda, and a goldfish tank. The ability to make a 
     goldfish turn in a circle through the power of a human's mental 
     control is given by the well-known equation:
     
     <eq:formula xmlns:eq="http://www.mathstuff.org/">
       <eq:variable>P</eq:variable> = 
       <eq:variable>m</eq:variable>
       <eq:variable>M</eq:variable> /
       <eq:variable>d</eq:variable>
     </eq:formula>
     
     where P is the probability it will turn in a given time interval, 
     m is the mental acuity of the fish, M is the mental acuity of 
     the subject, and d is the distance between 
     fish and subject.</myns:abstract>
     ...
  </myns:experiment>
</myns:journal>

We can declare one of the namespaces to be the default by omitting the colon (:) and the name from the xmlns attribute. Elements in the default namespace don't need the namespace prefix, resulting in clearer markup. (The default applies only to elements; an attribute without a prefix is never placed in the default namespace.)

<?xml version="1.0"?>
<journal xmlns="http://www.psycholabs.org/mynamespace/">
  <experiment>
    <date>March 4, 2001</date>
    <subject>Effects of Caffeine on Psychokinetic Ability</subject>
    <abstract>The experiment consists of a subject, a can of 
     caffeinated soda, and a goldfish tank. The ability to make a 
     goldfish turn in a circle through the power of a human's mental 
     control is given by the well-known equation:
     
     <eq:formula xmlns:eq="http://www.mathstuff.org/">
       <eq:variable>P</eq:variable> = 
       <eq:variable>m</eq:variable>
       <eq:variable>M</eq:variable> /
       <eq:variable>d</eq:variable>
     </eq:formula>
     
     where P is the probability it will turn in a given time interval, 
     m is the mental acuity of the fish, M is the mental acuity 
     of the subject, and d is the distance between 
     fish and subject.</abstract>
     ...
  </experiment>
</journal>

WARNING

Namespaces can be a headache if used in conjunction with a DTD. It would be nice if the parser ignored any elements or attributes from another namespace, so your document would validate under a DTD that had no knowledge of the namespace. Unfortunately, that is not the case. To use a namespace with a DTD, you have to rewrite the DTD so it knows about the elements in that namespace.

Another problem with namespaces is that they don't import a DTD or any other kind of information about the elements and attributes you're using. So you can actually make up your own elements, add the namespace prefix, and the parser will be none the wiser. This makes namespaces less useful for those who want to constrain their documents to conform to a DTD.

For these and other reasons, namespaces are a point of contention among XML planners. It's not clear what will happen in the future, but something needs to be done to bridge the gap between structure enforcement and namespaces.
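To make the first point concrete, here is a rough sketch of the declarations a DTD might need before the eq: elements from the journal example could validate. The content models are assumptions made for illustration. The point is that a DTD sees eq:formula as an ordinary name that happens to contain a colon, so the prefix has to be written into the declarations.

```xml
<!-- Hypothetical DTD fragment; the content models are illustrative only -->
<!ELEMENT eq:formula (#PCDATA | eq:variable)*>
<!ATTLIST eq:formula
    xmlns:eq CDATA #FIXED "http://www.mathstuff.org/">
<!ELEMENT eq:variable (#PCDATA)>
```

The content model of the enclosing abstract element would also have to allow eq:formula. A document that bound the same URL to a different prefix would no longer validate against these declarations, even though it means the same thing to a namespace-aware processor.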

Entities: Placeholders for Content

With the basic parts of XML markup defined, there is one more component we need to look at. An entity is a placeholder for content, which you declare once and can use many times almost anywhere in the document. It doesn't add anything semantically to the markup. Rather, it's a convenience to make XML easier to write, maintain, and read.

Entities can be used for different reasons, but they always eliminate an inconvenience. They do everything from standing in for impossible-to-type characters to marking the place where a file should be imported. You can define entities of your own to stand in for recurring text such as a company name or legal boilerplate. Entities can hold a single character, a string of text, or even a chunk of XML markup. Without entities, XML would be much less useful.

You could, for example, define an entity w3url to represent the W3C's URL. Whenever you enter the entity in a document, it will be replaced with the text http://www.w3.org/.

Figure 2-13 shows the different kinds of entities and their roles. The two major entity types are parameter entities and general entities. Parameter entities are used only in DTDs, so we'll describe them in Chapter 5, "Document Models: A Higher Level of Control". In this section, we'll focus on the other type, general entities. General entities are placeholders for any content that occurs at the level of or inside the root element of an XML document.

figure

Figure 2.13. Taxonomy of entities

An entity consists of a name and a value. When an XML parser begins to process a document, it first reads a series of declarations, some of which define entities by associating a name with a value. The value is anything from a single character to a file of XML markup. As the parser scans the XML document, it encounters entity references, which are special markers derived from entity names. For each entity reference, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup, then resumes parsing just before that point, so the new text is parsed too. Any entity references inside the replacement text are also replaced; this process repeats as many times as necessary.

Figure 2-14 shows that there are two kinds of syntax for entity references. The first, consisting of an ampersand (&), the entity name, and a semicolon (;), is for general entities. The second, distinguished by a percent sign (%) instead of the ampersand, is for parameter entities.

figure

Figure 2.14. Syntax for entity references
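
The two reference forms can be seen side by side in a small sketch. The entity names pub and docmodel and the element doc are hypothetical. Note that inside the internal subset, a parameter entity reference may appear only between declarations, which is why docmodel holds a whole declaration.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- General entity: referenced in content as &pub; -->
  <!ENTITY pub "Example Press">
  <!-- Parameter entity: referenced only in the DTD, as %docmodel; -->
  <!ENTITY % docmodel "<!ELEMENT doc (#PCDATA)>">
  %docmodel;
]>
<doc>Published by &pub;.</doc>
```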

The following is an example of a document that declares three general entities and references them in the text:

<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd"
[
  <!ENTITY client "Mr. Rufus Xavier Sasperilla">
  <!ENTITY agent "Ms. Sally Tashuns">
  <!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>We have an exciting opportunity for you! A set of 
ocean-front cliff dwellings in Pi&#241;ata, Mexico have been
renovated as time-share vacation homes. They're going fast! To 
reserve a place for your holiday, call &agent; at &phone;. 
Hurry, &client;. Time is running out!</body>
</message>

The entities &client;, &agent;, and &phone; are declared in the internal subset of this document and referenced in the <message> element. A fourth entity, &#241;, is a numbered character entity that represents the character ñ. This entity is referenced but not declared; no declaration is necessary because numbered character entities are implicitly defined in XML as references to characters in Unicode, whatever encoding the document itself uses. (For more information about character sets, see Chapter 7, "Internationalization".) The XML parser simply replaces the entity with the correct character.

The previous example looks like this with all the entities resolved:

<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd">
<message>
<opening>Dear Mr. Rufus Xavier Sasperilla</opening>
<body>We have an exciting opportunity for you! A set of 
ocean-front cliff dwellings in Piñata, Mexico have been
renovated as time-share vacation homes. They're going fast! To 
reserve a place for your holiday, call Ms. Sally Tashuns at
<number>617-555-1299</number>.
Hurry, Mr. Rufus Xavier Sasperilla. Time is running out!</body>
</message>

All entities (besides predefined ones) must be declared before they are used in a document. Two acceptable places to declare them are in the internal subset, which is ideal for local entities, and in an external DTD, which is more suitable for entities shared between documents. If the parser runs across an entity reference that hasn't been declared, either implicitly (a predefined entity) or explicitly, it can't insert replacement text in the document because it doesn't know what to replace the entity with. This error prevents the document from being well-formed.

Character Entities

Entities that contain a single character are called, naturally, character entities. These fall into several groups:

Predefined character entities

Some characters cannot be used in the text of an XML document because they conflict with the special markup delimiters. For example, angle brackets (<>) are used to delimit element tags. The XML specification provides the following predefined character entities, so you can express these characters safely:

Name    Value
amp     &
apos    '
gt      >
lt      <
quot    "
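
As a quick illustration, the following sketch uses all five predefined entities. The element and attribute names are made up for the example.

```xml
<example>
  <!-- After parsing, the text reads: if (a < b && b > 0) -->
  <code>if (a &lt; b &amp;&amp; b &gt; 0)</code>
  <!-- &quot; lets a double quote appear inside a double-quoted value -->
  <remark text="He said &quot;Run!&quot;"/>
  <owner>Bob&apos;s Bolt Bazaar</owner>
</example>
```

Strictly speaking, only &lt; and &amp; are required in element text; the others are there for cases where the literal character would clash with a delimiter.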

Numbered character entities

XML supports Unicode, a huge character set with tens of thousands of different symbols, letters, and ideograms. You should be able to use any Unicode character in your document. The problem is how to enter a nonstandard character from a keyboard with fewer than 100 keys, or how to represent one in a text-only editor display. One solution is to use a numbered character entity, an entity whose name is of the form #n, where n is a number that represents the character's position in the Unicode character set.

The number in the name of the entity can be expressed in decimal or hexadecimal format. For example, a lowercase c with a cedilla (ç) has the Unicode number 231. It can be represented in decimal as &#231; or in hexadecimal as &#xe7;. Note that the hexadecimal version is distinguished with an x as the prefix to the number. Any character that is legal in an XML document can be represented this way, up to the top of the Unicode range at 1,114,111 (hexadecimal 10FFFF); a few values, including zero and most control characters, are not allowed. We'll discuss character sets and encodings in more detail in Chapter 7, "Internationalization".
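
In the sketch below, all three word elements contain the same five characters once the document is parsed. The element names are hypothetical, and the third line assumes the file is saved in an encoding, such as UTF-8, that can hold the character directly.

```xml
<words>
  <word>gar&#231;on</word>  <!-- decimal reference -->
  <word>gar&#xe7;on</word>  <!-- hexadecimal reference -->
  <word>garçon</word>       <!-- character typed directly -->
</words>
```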

Named character entities

The problem with numbered character entities is that they're hard to remember: you need to consult a table every time you want to use a special character. An easier way to remember them is to use mnemonic entity names. These named character entities use easy-to-remember names for references like &THORN;, which stands for the Icelandic capital thorn character (Þ).

Unlike the predefined and numeric character entities, you do have to declare named character entities. In fact, they are technically no different from other general entities. Nevertheless, it's useful to make the distinction, because large groups of such entities have been declared in DTD modules that you can use in your document. An example is ISO-8879, a standardized set of named character entities including Latin, Greek, Nordic, and Cyrillic scripts, math symbols, and various other useful characters found in European documents.
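
If you need only one or two named characters, you can declare them yourself instead of importing a whole set. Here is a minimal sketch, assuming a made-up saga element; the capital thorn is Unicode character 222.

```xml
<?xml version="1.0"?>
<!DOCTYPE saga [
  <!-- Map the mnemonic name to the numbered character entity -->
  <!ENTITY THORN "&#222;">
]>
<saga>&THORN;orsteinn rode north.</saga>
```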

Mixed-Content Entities

Entity values aren't limited to a single character, of course. The more general mixed-content entities have values of unlimited length and can include markup as well as text. These entities fall into two categories: internal and external. For internal entities, the replacement text is defined in the entity declaration; for external entities, it is located in another file.

Internal entities

Internal mixed-content entities are most often used to stand in for oft-repeated phrases, names, and boilerplate text. Not only is an entity reference easier to type than a long piece of text, but it also improves accuracy and maintainability, since you only have to change an entity once for the effect to appear everywhere. The following example proves this point:

<?xml version="1.0"?>
<!DOCTYPE press-release SYSTEM "http://www.dtdland.org/dtds/reports.dtd" 
[
  <!ENTITY bobco "Bob's Bolt Bazaar, Inc.">
]>
<press-release>
<title>&bobco; Earnings Report for Q3</title>
<par>The earnings report for &bobco; in fiscal
quarter Q3 is generally good. Sales of &bobco; bolts increased 35%
over this time a year ago.</par>
<par>&bobco; has been supplying high-quality bolts to contractors
for over a century, and &bobco; is recognized as a leader in the
construction-grade metal fastener industry.</par>
</press-release>

The entity &bobco; appears in the document five times. If you want to change something about the company name, you only have to enter the change in one place. For example, to make the name appear inside a <companyname> element, simply edit the entity declaration:

<!ENTITY bobco 
  "<companyname>Bob's Bolt Bazaar, Inc.</companyname>">

When you include markup in entity declarations, be sure not to use the predefined character entities (e.g., &lt; and &gt;). The parser knows to read the markup as an entity value because the value is quoted inside the entity declaration. Exceptions to this are the quote-character entity &quot; and the single-quote character entity &apos;. If they would conflict with the entity declaration's value delimiters, then use the predefined entities, e.g., if your value is in double quotes and you want it to contain a double quote.

Entities can contain entity references, as long as the entities being referenced have been declared previously. Be careful not to include references to the entity being declared, or you'll create a circular pattern that may get the parser stuck in a loop. Some parsers will catch the circular reference, but it is an error.
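
The following sketch shows one entity built on another, with a circular pair left commented out as a warning. The bobco entity comes from the example above; signoff, chicken, egg, and the memo element are invented for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY bobco "Bob's Bolt Bazaar, Inc.">
  <!-- signoff refers to bobco, which is already declared above -->
  <!ENTITY signoff "Sincerely, the staff of &bobco;">
  <!-- Do NOT do this: each entity refers to the other
  <!ENTITY chicken "&egg;">
  <!ENTITY egg "&chicken;">
  -->
]>
<memo>&signoff;</memo>
```

After parsing, the memo contains the text "Sincerely, the staff of Bob's Bolt Bazaar, Inc."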

External entities

Sometimes you may need to create an entity for such a large amount of mixed content that it is impractical to fit it all inside the entity declaration. In this case, you should use an external entity, an entity whose replacement text exists in another file. External entities are useful for importing content that is shared by many documents, or that changes too frequently to be stored inside the document. They also make it possible to split a large, monolithic document into smaller pieces that can be edited in tandem and that take up less space in network transfers. Figure 2-15 illustrates how fragments of XML and text can be imported into a document.

figure

Figure 2.15. Using external entities to import XML and text

External entities effectively break a document into multiple physical parts. However, all that matters to the XML processor is that the parts assemble into a perfect whole. That is, all the parts in their different locations must still conform to the well-formedness rules. The XML parser stitches up all the pieces into one logical document; with the correct markup, the physical divisions should be irrelevant to the meaning of the document.

External entities are a linking mechanism. They connect parts of a document that may exist on other systems, far across the Internet. The difference from traditional XML links (XLinks) is that for external entities, the XML processor must insert the replacement text at the time of parsing. See Chapter 3, "Connecting Resources with Links", for other kinds of links.

External entities must always be declared, so the parser knows where to find the replacement text. In the following example, a document declares the three external entities &part1;, &part2;, and &part3; to hold its content:
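
A document of that kind might look like the following sketch. The file names p1.xml, p2.xml, and p3.xml and the root element longdoc are placeholders chosen for illustration; only the entity names come from the text above.

```xml
<?xml version="1.0"?>
<!DOCTYPE longdoc [
  <!-- Each SYSTEM string says where the replacement text lives -->
  <!ENTITY part1 SYSTEM "p1.xml">
  <!ENTITY part2 SYSTEM "p2.xml">
  <!ENTITY part3 SYSTEM "p3.xml">
]>
<longdoc>
  &part1;
  &part2;
  &part3;
</longdoc>
```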

The syntax just shown for declaring an external entity uses the keyword SYSTEM followed by a quoted string containing a filename. This string is called a system identifier and is used to identify a resource by location. The quoted string is actually a URL, so you can include files from anywhere on the Internet. For example:

<!ENTITY catalog SYSTEM "http://www.bobsbolts.com/catalog.xml">

The system identifier suffers from the same drawback as all URLs: if the referenced item is moved, the link breaks. To avoid that problem, you can use a public identifier in the entity declaration. In theory, a public identifier will endure any location shuffling and still fetch the correct resource. For example:

<!ENTITY faraway PUBLIC "-//BOB//FILE Catalog//EN"
    "http://www.bobsbolts.com/catalog.xml">

Of course, for this to work, the XML processor has to know how to use public identifiers, and it must be able to find a catalog that maps them to actual locations. In addition, there's no guarantee that the catalog is up to date. A lot can go wrong. Perhaps for this reason, the public identifier must be accompanied by a system identifier (here, "http://www.bobsbolts.com/catalog.xml"). If the XML processor for some reason can't handle the public identifier, it falls back on the system identifier. Most web browsers in use today can't deal with public identifiers, so perhaps the backup is a good idea.
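
One widely used way to supply such a catalog is the OASIS XML Catalogs format, which this chapter does not cover. As a rough sketch, an entry mapping the public identifier above to a local copy could look like this. The local file path is hypothetical, and whether a given processor consults the catalog at all depends on how it is configured.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Resolve the public identifier to a local file instead of the URL -->
  <public publicId="-//BOB//FILE Catalog//EN"
          uri="file:///usr/local/share/xml/catalog.xml"/>
</catalog>
```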

Unparsed Entities

The last kind of entity discussed in this chapter is the unparsed entity. This kind of entity holds content that should not be parsed because it contains something other than text and would likely confuse the parser. Unparsed entities are used to import graphics, sound files, and other non-character data.

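The declaration for an unparsed entity looks similar to that of an external entity, with some additional information at the end. Because its content is not parsed, the entity cannot be referenced in text with the &name; syntax. Instead, its name is given as the value of an attribute declared with the type ENTITY (here the file attribute of a graphic element, names chosen for illustration, as is the notation's system identifier). For example:

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- declare the notation that is named after NDATA -->
  <!NOTATION GIF SYSTEM "image/gif">
  <!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
  <!-- an ENTITY-type attribute holds the name of an unparsed entity -->
  <!ATTLIST graphic file ENTITY #REQUIRED>
]>
<doc>
  <para>Here's a picture of me:</para>
  <graphic file="mypic"/>
</doc>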

This declaration differs from an external entity declaration in that there is an NDATA keyword following the system path information. This keyword tells the parser that the entity's content is in a special format, or notation, other than the usual parsed mixed content. The NDATA keyword is followed by a notation identifier that specifies the data format. In this case, the entity is a graphic file encoded in the GIF format, so the word GIF is appropriate.

The notation identifier must be declared in a separate notation declaration, which is a complex affair discussed in Chapter 5, "Document Models: A Higher Level of Control". GIF and other notations are not built into XML, and an XML processor may not know what to do with them. At the very least, the parser will not blindly load the entity's content and attempt to parse it, which offers some protection from errors.

Miscellaneous Markup

Elements, attributes, namespaces, and entities are the most important markup objects, but they are not the end of the story. Other markup objects including comments, processing instructions, and CDATA sections shield content from the parser in various ways, allowing you to include specialized information.

Comments

Comments are notes in the document that are not interpreted by the parser. If you're working with other people on the same files, these messages can be invaluable. They can be used to identify the purpose of files and sections to help navigate a cluttered document, or simply to communicate with each other. So, in XML there is a special kind of markup called a comment. The syntax for comments is shown in Figure 2-17.

figure

Figure 2.17. Syntax for comments

A comment starts with four characters: an open angle bracket, an exclamation point, and two dashes (1). It ends with two dashes and a closing angle bracket (3). In between these delimiters goes the content to be ignored (2). The comment can contain almost any kind of text you want, including spaces, newlines, and markup. However, since two dashes in a row (--) are used to tell the parser when a comment begins and ends, they can't be placed anywhere inside the comment. This means that instead of using dashes to create an easily visible line, you should use another symbol like an equals sign (=) or an underscore (_):

Good:  <!--========================================================-->

Good:  <!--________________________________________________________-->

Good:  <!-- - - - - - - - - - - -  - - - - - - - - - - - - - - - - -->

Bad:   <!------------------------------------------------------------>

Bad:   <!--                 -- Don't do this! --                   -->

Comments can go anywhere in your document except before the XML declaration and inside tags. Wherever they appear, an XML parser ignores them completely. So this piece of XML:

<p>The quick brown fox jumped<!-- test -->over the lazy dog. 
The quick brown <!-- test --> fox jumped over the lazy dog. The<!--

test

-->quick brown fox 
jumped over the lazy dog.</p>

becomes this, after the parser has removed the comments:

<p>The quick brown fox jumpedover the lazy dog. 
The quick brown  fox jumped over the lazy dog. Thequick brown fox 
jumped over the lazy dog.</p>

Since comments can contain markup, they can be used to "turn off" parts of a document. This is valuable when you want to remove a section temporarily, keeping it in the file for later use. In this example, a region of code is commented out:

<p>Our store is located at:</p>
<!--
<address>59 Sunspot Avenue</address>
-->
<address>210 Blather Street</address>

When using this technique, be careful not to comment out any comments, i.e., don't put comments inside comments. Since they contain double dashes in their delimiters, the parser will complain when it gets to the inner comment.

CDATA Sections

If you use markup characters frequently in your text, you may find it tedious to use the predefined entities &lt;, &gt;, and &amp;. They take extra typing and are generally hard to read in the markup. There's another way to type lots of forbidden characters, however: the CDATA section.

CDATA is short for "character data," which just means "not markup." Essentially, you're telling the parser that this section of the document contains no markup and should be treated as regular text. The only thing that cannot go inside a CDATA section is the ending delimiter (]]>). Entity references are not recognized inside a CDATA section either, so to include that sequence you have to end the section first and write it in ordinary text with a predefined entity, as ]]&gt;.

The CDATA section syntax is shown in Figure 2-18. A CDATA section begins with the nine-character delimiter <![CDATA[ (1), and it ends with the delimiter ]]> (3). The content of the section (2) may contain markup characters (<, >, and &), but the XML processor treats them as ordinary text rather than as markup.

figure

Figure 2.18. CDATA section syntax

Here's an example of a CDATA section in action:

<para>Then you can say <![CDATA[if (&x < &y)]]> and be done 
with it.</para>

CDATA sections are most convenient when used over large areas, say the size of a small computer program. If you use them a lot for small pieces of text, your document will become hard to read, so you'd be better off using entity references.
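
When the text you want to protect happens to contain the ending delimiter itself, there are two common workarounds, sketched below with made-up content. The first writes the sequence in ordinary text with an entity. The second splits it across two adjacent CDATA sections so that the three characters never appear together inside one section.

```xml
<doc>
  <!-- In ordinary text, escape the final angle bracket -->
  <para>A CDATA section ends with ]]&gt;.</para>
  <!-- The first section ends after "]]"; the second begins with ">" -->
  <para><![CDATA[x = a[b[0]]]]><![CDATA[> 1;]]></para>
</doc>
```

After parsing, the second paragraph contains the text x = a[b[0]]> 1; exactly as intended.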

Processing Instructions

Presentational information should be kept out of a document whenever possible. Still, there may be times when you don't have any other option, for example, if you need to store page numbers in the document to facilitate generation of an index. This information applies only to a specific XML processor and may be irrelevant or misleading to others. The prescription for this kind of information is a processing instruction. It is a container for data that is targeted toward a specific XML processor.

Processing instructions (PIs) contain two pieces of information: a target keyword and some data. The parser passes processing instructions up to the next level of processing. If the processing instruction handler recognizes the target keyword, it may choose to use the data; otherwise, the data is discarded. How the data will help processing is up to the developer.

Figure 2-19 shows the PI syntax. A PI starts with a two-character delimiter (1) consisting of an open angle bracket and a question mark (<?), followed by a target (2), an optional string of characters that is the data portion of the PI (3), and a closing delimiter (4), consisting of a question mark and closing angle bracket (?>).

figure

Figure 2.19. Processing instruction syntax

"Funny," you say, "PIs look a lot like the XML declaration." You're right: the XML declaration can be thought of as a processing instruction for all XML processors[4] that broadcasts general information about the document.

[4]This syntactic trick allows XML documents to be processed easily by older SGML systems; they simply treat the XML declaration as another processing instruction, ignored except by XML processors.

The target is a keyword that an XML processor uses to determine whether the data is meant for it or not. The keyword doesn't have to mean anything in particular; it need not be, for example, the name of the software that will use it. More than one program can use a PI, and a single program can accept multiple PIs. It's sort of like posting a message on a wall saying, "The party has moved to the green house," and people interested in the party will follow the instructions, while those uninterested won't.

The PI can contain any data except the combination ?>, which would be interpreted as the closing delimiter. Here are some examples of valid PIs:

<?flubber pg=9 recto?>
<?thingie?>
<?xyz stop: the presses?>

If there is no data string, the target keyword itself can function as the data. A forced line break is a good example. Imagine that there is a long section heading that extends off the page. Rather than relying on an automatic formatter to break the title just anywhere, we want to force it to break in a specific place.

Here is what a forced line break would look like:

<title>The Confabulation of Branklefitzers <?lb?>in a Portlebunky 
Frammins <?lb?>Without Denaculization of <?lb?>Crunky Grabblefooties
</title>
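
One processing instruction you are likely to meet in practice is xml-stylesheet, which is defined in a separate W3C recommendation, not in XML itself. It follows the pattern described above: a target, then data that interested processors know how to read. In this sketch, the stylesheet file name and the memo element are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Target: xml-stylesheet; data: the type and href pseudo-attributes -->
<?xml-stylesheet type="text/css" href="memo.css"?>
<memo>Remember to recharge K-9.</memo>
```

A processor that doesn't recognize the target simply skips the instruction, and the document is still well-formed.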

Well-Formed Documents

XML gives you considerable power to choose your own element types and invent your own grammars to create custom-made markup languages. But this flexibility can be dangerous for XML parsers if they don't have some minimal rules to protect them. A parser dedicated to a single markup language such as an HTML browser can accept some sloppiness in markup, because the set of tags is small and there isn't much complexity in a web page. Since XML processors have to be prepared for any kind of markup language, a set of ground rules is necessary.

These rules are very simple syntax constraints. All tags must use the proper delimiters; an end tag must follow a start tag; elements can't overlap; and so on. Documents that satisfy these rules are said to be well-formed. Some of these rules are listed here.

The first rule is that an element containing text or elements must have start and end tags.

Good:

<list>
  <listitem>soupcan</listitem>
  <listitem>alligator</listitem>
  <listitem>tree</listitem>
</list>

Bad:

<list>
  <listitem>soupcan
  <listitem>alligator
  <listitem>tree
</list>

An empty element's tag must have a slash (/) before the end bracket.

Good:  <graphic filename="icon.png"/>

Bad:   <graphic filename="icon.png">

All attribute values must be in quotes.

Good:  <figure filename="icon.png"/>

Bad:   <figure filename=icon.png/>

Elements may not overlap.

Good:

<a>A good <b>nesting</b> 
example.</a>

Bad:

<a>This is <b>a poor</a> 
  nesting scheme.</b>

Isolated markup characters may not appear in parsed content. These include <, ]]>, and &.

Good:  <equation>5 &lt; 2</equation>

Bad:   <equation>5 < 2</equation>

A final rule stipulates that element names may start only with letters and underscores, and may contain only letters, numbers, hyphens, periods, and underscores. Colons are allowed for namespaces.

Good:

<example-one>
<_example2>
<Example.Three>

Bad:

<bad*characters>
<illegal space>
<99number-start>

Why All the Rules?

Web developers who cut their teeth on HTML will notice that XML's syntax rules are much more strict than HTML's. Why all the hassle about well-formed documents? Can't we make parsers smart enough to figure it out on their own? Let's look at the case for requiring end tags in every container element. In HTML, end tags can sometimes be omitted, leaving it up to the browser to decide where an element ends:

<body>
  <p>This is a paragraph.
  <p>This is also a paragraph.
</body>

This is acceptable in HTML because there is no ambiguity about the <p> element. HTML doesn't allow a <p> to reside inside another <p>, so it's clear that the two are siblings. All HTML parsers have built-in knowledge of HTML, referred to as a grammar. In XML, where the grammar is not set in stone, ambiguity can result:

<blurbo>This is one element.
<blurbo>This is another element.

Is the second <blurbo> a sibling or a child of the first? You can't tell because you don't know anything about that element's content model. XML doesn't require you to use a grammar-defining DTD, so the parser can't know the answer either. Because XML parsers have to work in the absence of grammar, we have to cut them some slack and follow the well-formedness rules.
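
With end tags supplied, the two readings become distinct and unambiguous. The sketch below shows both; the readings wrapper element is added only so that the fragment has a single root.

```xml
<readings>
  <!-- Reading 1: the two elements are siblings -->
  <blurbo>This is one element.</blurbo>
  <blurbo>This is another element.</blurbo>

  <!-- Reading 2: the second element is a child of the first -->
  <blurbo>This is one element.
    <blurbo>This is another element.</blurbo>
  </blurbo>
</readings>
```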

Getting the Most out of Markup

These days, more and more software vendors are claiming that their products are "XML-compliant." This sounds impressive, but is it really something to be excited about? Certainly, well-formed XML guarantees some minimum standards for data quality; however, that isn't the whole story. XML is not itself a language, but a set of rules for designing markup languages. Therefore, until you see what kind of language the vendors have created for their products, you should greet such claims with cautious optimism.

The truth is, many XML-derived markup languages are atrocious. Often, developers don't put much thought into the structure of the document data, and their markup ends up looking like the same disorganized native data files with different tags. A good markup language has a thoughtful design, makes good use of containers and attributes, names objects clearly, and has a logical hierarchical structure.

Here's a case in point. A well-known desktop publishing program can output its data as XML. However, it has a serious problem that limits its usefulness: the hierarchical structure is very flat. There are no sections or divisions to contain paragraphs and smaller sections; all paragraphs are on the same level, and section heads are just glorified paragraphs. Compare that to an XML language such as DocBook (see "XML Application: DocBook" later in this chapter), which uses nested elements to represent relationships: that is, to make it clear that regions of text are inside particular sections. This information is important for setting up styles in stylesheets or doing transformations.
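
The difference is easier to see side by side. In this sketch, all element names and text are invented for illustration and are not taken from the publishing program or from DocBook.

```xml
<comparison>
  <!-- Flat: headings are just paragraphs with a different name -->
  <flat>
    <head1>Installation</head1>
    <para>Unpack the archive.</para>
    <head2>Requirements</head2>
    <para>You need 10 MB of disk space.</para>
  </flat>

  <!-- Nested: each section contains its title, text, and subsections -->
  <nested>
    <section>
      <title>Installation</title>
      <para>Unpack the archive.</para>
      <section>
        <title>Requirements</title>
        <para>You need 10 MB of disk space.</para>
      </section>
    </section>
  </nested>
</comparison>
```

In the nested version, a stylesheet can tell that the second paragraph belongs to a subsection simply by looking at its ancestors. In the flat version, that relationship has to be guessed from the order of the elements.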

Another markup language is used for encoding marketing information for electronic books. Its design flaw is an unnecessarily obscure and unhelpful element-naming scheme. Elements used to hold information such as the ISBN or the document title are named <A5>, <B2>, or <C1>. These names have nothing to do with the purpose of the elements, whereas element names like <isbn> and <title> would have been easily understood.

Elements are the first consideration for a good markup language. They can supply a lot of information in different ways:

Type

The name inside the start and end tags of an element distinguishes it from other types and gives XML programs a handle for processing. These names should be representations of the element's purpose in the document and should be readable by humans as well as machines. Choose names that are as descriptive and recognizable as possible, like <model> or <programlisting>. Follow the convention of all-lowercase letters and avoid alternating cases (e.g., <OrderedList>), as people will forget when to use which case. Resist the urge to use generic element types that could hold almost anything. And anyone who chooses nonsensical names like <XjKnpl> or <J-9> should be taken outside and pelted with donuts.

Content

An element's content can include characters, elements, or a mixture of both. Elements inside mixed content modify the character data (for example, labeling a word for emphasis), and are called inline elements. Other elements are used to divide a document into parts, and are often called components or blocks. In character data, whitespace is usually significant, unlike in HTML and other markup languages.

Position

The position of an element inside another element is important. The order of elements is always preserved, so a sequence of items such as a numbered list can be expressed. Elements, often those without content, can be used to mark a place in text; for example, to insert a graphic or footnote. Two elements can mark a range of text when it would be inconvenient to span that range with a single element.

Hierarchy

The element's ancestors can contribute information as well. For example, a <title> is formatted differently when it is inside a <chapter>, <section>, or <table>, with different typefaces and sizes. Stylesheets can use the information about ancestor elements to decide how to process an element.

Namespace

Elements can be categorized by their source or purpose using namespaces. In XSLT, for example, the xsl namespace elements are used to control the transformation process, while other elements are merely data for producing the result tree. Some web browsers can handle documents with multiple namespaces, such as Amaya's support for MathML equations within HTML pages. In both cases, the namespace helps the XML processor decide how to process the elements.
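The Hierarchy point above can be made concrete with a small sketch. The following Python, using the standard xml.etree.ElementTree module, looks up the parent of each <title> and picks a font size from it. The TITLE_SIZES table, the sizes in it, and the sample document are assumptions made up for illustration; a real stylesheet language would express the same idea in its own syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical style table: the size of a <title> depends on its parent element.
TITLE_SIZES = {"chapter": 24, "section": 18, "table": 12}

SAMPLE = """
<chapter>
  <title>Mastering the Controls</title>
  <section>
    <title>Overview</title>
    <table><title>Control Descriptions</title></table>
  </section>
</chapter>
"""

def title_sizes(xml_text):
    """Return (parent tag, title text, size) for every <title>, in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for title in root.iter("title"):
        parent = parent_of.get(title)
        parent_tag = parent.tag if parent is not None else None
        # Fall back to an arbitrary default size for unlisted parents.
        results.append((parent_tag, title.text, TITLE_SIZES.get(parent_tag, 10)))
    return results

for parent_tag, text, size in title_sizes(SAMPLE):
    print(f"{text!r} inside <{parent_tag}>: {size}pt")
```

The same element type gets three different treatments here, and nothing in the <title> markup itself had to change.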

The second consideration for a good markup language is the use of attributes. Use them sparingly, because they tend to clutter up markup--but do use them when you need them. An attribute conveys specific information about an element that helps specify its role in the document. It should not be used to hold content. Sometimes, it's hard to decide between an attribute and a child element. Here are some rough guidelines.

Use an element when:

Use an attribute when:

Processing instructions should be used as little as possible. They generally hold noncontent information that doesn't pertain to any one element and is used by a particular XML processor. For example, PIs can be used to remember where to break a page for a printed copy, but would be useless for a web version of the document. It's not a good idea for a markup language to rely too heavily on PIs.
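The page-break case can be sketched in a few lines of Python with the standard xml.dom.minidom module, which keeps processing instructions in the tree. The PI target "print", its data "page-break", and the render function are hypothetical names chosen for this illustration, not part of any standard.

```python
from xml.dom import minidom

# Sample document with a hypothetical PI aimed at a print formatter.
SAMPLE = (
    "<doc><para>First page.</para>"
    "<?print page-break?>"
    "<para>Second page.</para></doc>"
)

def render(xml_text, medium):
    """Flatten the document to a list of lines for the given medium."""
    doc = minidom.parseString(xml_text)
    lines = []
    for node in doc.documentElement.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            # Only the processor named by the PI target acts on it;
            # every other processor silently skips the instruction.
            if node.target == medium and node.data == "page-break":
                lines.append("--- page break ---")
        elif node.nodeType == node.ELEMENT_NODE:
            lines.append(node.firstChild.data)
    return lines

print(render(SAMPLE, "print"))
print(render(SAMPLE, "web"))
```

The web rendering loses nothing by ignoring the instruction, which is the point: a PI carries a hint for one processor, so a document that depends on PIs for its meaning is tied to that processor.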

Doubtless you will run across good and bad examples of XML markup, but you don't have to make the same mistakes yourself. Strive to put as much thought as possible into your design.

XML Application: DocBook

An XML application is a markup language derived from XML rules, not to be confused with XML software applications, called XML processors in this book. An XML application is often a standard in its own right, with a publicly available DTD. One such application is DocBook, a markup language for technical documentation.

DocBook is a large markup language consisting of several hundred elements. It was developed by a consortium of companies and organizations to handle a wide variety of technical documentation tasks. DocBook is flexible enough to encode everything from one-page manuals to multiple-volume sets of books. Today, DocBook enjoys a large base of users, including open source developers and publishers. Details about the DocBook standard can be found in Appendix B, "A Taxonomy of Standards".

Example 2-4 is an instance of a DocBook document, in this case a product instruction manual. (Actually, it uses a DTD called "Barebones DocBook," a similar but much smaller version of DocBook described in Chapter 5, "Document Models: A Higher Level of Control".) Throughout this example are numbered markers corresponding to comments appearing at the end.

Example 2.4. A DocBook Document

<?xml version="1.0" encoding="utf-8"?>  (1)
<!DOCTYPE book SYSTEM "/xmlstuff/dtds/barebonesdb.dtd"  (2)
[
  <!ENTITY companyname "Cybertronix">
  <!ENTITY productname "Sonic Screwdriver 9000">
]>
<book>  (3)
  <title>&productname; User Manual</title>  (4)
  <author>Indigo Riceway</author>
  <preface id="preface">
    <title>Preface</title>
    <sect1 id="about">
      <title>Availability</title>
      <!-- Note to author: maybe put a picture here? -->
      <para>  (5)
        The information in this manual is available in the
        following forms:
      </para>
      <itemizedlist>  (6)
        <listitem><para>
          Instant telepathic injection
        </para></listitem>
        <listitem><para>
          Lumino-goggle display
        </para></listitem>
        <listitem><para>
          Ink on compressed, dead, arboreal matter
        </para></listitem>
        <listitem><para>
          Cuneiform etched in clay tablets
        </para></listitem>
      </itemizedlist>
      <para>
        The &productname; is sold in galactic pamphlet boutiques or
        wherever &companyname; equipment can be purchased. For more
        information, or to order a copy by hyperspacial courier,
        please visit our universe-wide Web page at
        <systemitem role="url">http://www.cybertronix.com/sonic_screwdrivers.html</systemitem>.  (7)
      </para>
    </sect1>
    <sect1 id="disclaimer">
      <title>Notice</title>
      <para>
        While <emphasis>every</emphasis>  (8)
        effort has been taken to ensure the accuracy and usefulness
        of this guide, we cannot be held responsible for the
        occasional inaccuracy or typographical error.
      </para>
    </sect1>
  </preface>
  <chapter id="intro">  (9)
    <title>Introduction</title>
    <para>
      Congratulations on your purchase of one of the most valuable
      tools in the universe! The &companyname; &productname; is
      equipment no hyperspace traveller should be without. Some of
      the myriad tasks you can achieve with this device are:
    </para>
    <itemizedlist>
      <listitem><para>
        Pick locks in seconds. Never be locked out of your tardis
        again. Good for all makes and models including Yale, Dalek,
        and Xngfzz.
      </para></listitem>
      <listitem><para>
        Spot-weld metal, alloys, plastic, skin lesions, and
        virtually any other material.
      </para></listitem>
      <listitem><para>
        Rid your dwelling of vermin. Banish insects, rodents, and
        computer viruses from your time machine or spaceship.
      </para></listitem>
      <listitem><para>
        Slice and process foodstuffs from tomatoes to brine-worms.
        Unlike a knife, there is no blade to go dull.
      </para></listitem>
    </itemizedlist>
    <para>
      Here is what satisfied customers are saying about their
      &companyname; &productname;:
    </para>
    <comment>  (10)
      Should we name the people who spoke these quotes? --Ed.
    </comment>
    <blockquote>
      <para>
        <quote>It helped me escape from the prison planet
        Garboplactor VI. I wouldn't be alive today if it weren't for
        my Cybertronix 9000.</quote>
      </para>
    </blockquote>
    <blockquote>
      <para>
        <quote>As a bartender, I have to mix martinis
        <emphasis>just right</emphasis>. Some of my customers get
        pretty cranky if I slip up. Luckily, my new sonic
        screwdriver from Cybertronix is so accurate, it gets the
        mixture right every time. No more looking down the barrel of
        a kill-o-zap gun for this bartender!</quote>
      </para>
    </blockquote>
  </chapter>
  <chapter id="controls">
    <title>Mastering the Controls</title>
    <sect1>
      <title>Overview</title>
      <para>
        <xref linkend="controls-diagram"/> is a diagram of the parts
        of your &productname;.
      </para>
      <figure id="controls-diagram">  (11)
        <title>Exploded Parts Diagram</title>
        <graphic fileref="parts.gif"/>
      </figure>
      <para>
        <xref linkend="controls-table"/>  (12)
        lists the function of the parts labeled in the diagram.
      </para>
      <table id="controls-table">  (13)
        <title>Control Descriptions</title>
        <tgroup cols="2">
          <thead>
            <row>
              <entry>Control</entry>
              <entry>Purpose</entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry>Decoy Power Switch</entry>
              <entry><para>
                Looks just like an on-off toggle button, but only
                turns on a small flashlight when pressed. Very handy
                when your &productname; is misplaced and discovered
                by primitive aliens who might otherwise accidentally
                injure themselves.
              </para></entry>
            </row>
            <row>
              <entry><emphasis>Real</emphasis> Power Switch</entry>
              <entry><para>
                An invisible fingerprint-scanning
                capacitance-sensitive on/off switch.
              </para></entry>
            </row>
            ...
            <row>
              <entry>The <quote>Z</quote> Twiddle Switch</entry>
              <entry><para>
                We're not entirely sure what this does. Our lab
                testers have had various results from teleportation
                to spontaneous liquification.
                <emphasis role="bold">Use at your own risk!</emphasis>
              </para></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      <note>
        <para>
          A note to arthropods: Stop forcing your inflexible
          appendages to adopt un-ergonomic positions. Our new
          claw-friendly control template is available.
        </para>
      </note>
      <sect2 id="power-sect">
        <title>Power Switch</title>
        <sect3 id="decoy-power-sect">
          <title>Why a decoy?</title>
          <comment>
            Talk about the Earth's Tunguska Blast of 1908 here.
          </comment>
        </sect3>
      </sect2>
    </sect1>
    <sect1>
      <title>The View Screen</title>
      <para>
        The view screen displays error messages and warnings, such
        as a <errorcode>LOW-BATT</errorcode>  (14)
        (low battery) message.<footnote>  (15)
          <para>
            The advanced model now uses a direct psychic link to the
            user's visual cortex, but it should appear approximately
            the same as the more primitive liquid crystal display.
          </para>
        </footnote>
        When your &productname; starts up, it should show a status
        display like this:
      </para>  (16)
      <screen>STATUS DISPLAY
  BATT: 1.782E8 V
  TEMP: 284 K
  FREQ: 9.32E3 Hz
  WARRANTY: ACTIVE</screen>
    </sect1>
    <sect1>
      <title>The Battery</title>
      <para>
        Your &productname; is capable of generating tremendous
        amounts of energy. For that reason, any old battery won't
        do. The power source is a tiny nuclear reactor containing a
        piece of ultra-condensed plutonium that provides up to 10
        megawatts of power to your device. With a half-life of over
        20 years, it will be a long time before a replacement is
        necessary.
      </para>
    </sect1>
  </chapter>
</book>
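If you want to see the two entities from Example 2-4 at work, here is a minimal sketch in Python using the standard xml.etree.ElementTree module, whose parser expands general entities declared in the internal subset. The document below is a cut-down copy of the example. It leaves out the SYSTEM identifier because this parser does not fetch external DTDs, and the replacement name "Sonic Screwdriver 9001" and the title_of helper are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Cut-down version of Example 2-4: internal subset only, no external DTD.
DOC = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ENTITY companyname "Cybertronix">
  <!ENTITY productname "Sonic Screwdriver 9000">
]>
<book>
  <title>&productname; User Manual</title>
  <para>The &companyname; &productname; is equipment no hyperspace
  traveller should be without.</para>
</book>
"""

def title_of(doc_text):
    """Parse the document and return the <title> text with entities expanded."""
    return ET.fromstring(doc_text).findtext("title")

print(title_of(DOC))

# Renaming the product means editing one declaration, not every mention.
# "Sonic Screwdriver 9001" is a made-up name for this illustration.
renamed = DOC.replace("Sonic Screwdriver 9000", "Sonic Screwdriver 9001")
print(title_of(renamed))
```

The only place the literal product name appears in the source is the entity declaration, so the one-line change reaches every &productname; reference.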

Following are notes about Example 2-4:

1. The XML declaration states that this file contains an XML document conforming to Version 1.0 of the XML specification, and that the UTF-8 character encoding is used (see Chapter 7, "Internationalization," for more about character sets). The standalone property is not mentioned, so the default value of "no" will be used.

2. This document type declaration does three things. First, it tells us that <book> will be the root element. Second, it associates a DTD with the document, specifying the location /xmlstuff/dtds/barebonesdb.dtd. Third, it declares two general entities in the document's internal subset of declarations. These entities will be used throughout the document wherever the company name or product name appears. If in the future the product's name is changed or the company is bought out, the author needs only to update the values in the entity declarations.

3. The <book> element is the document root, the element that contains all the content. It begins a hierarchy that includes a <preface> and <chapter>, followed by some sections labeled <sect1>, then <sect2>, and so on, down to the level of paragraphs and lists. Only two <chapter>s are shown in the example, but in a real document they would be followed by additional chapters, each with its own sections and paragraphs, etc.

4. Notice that all the major components (preface, chapter, sections) start with a <title> element. This is an example of how an element can be used in different contexts. In a formatted copy of this document, the titles in different levels will be rendered differently, some large and others small. A stylesheet will use the hierarchical information (i.e., what is the ancestor of this <title>) to determine how to format it.

5. A <para> is an example of a block element, which means that it starts on a new line and contains a mixture of character data and elements that are bound in a rectangular region.

6. This element begins a bulleted list of items. If this were a numbered list (for instance, <orderedlist> instead of <itemizedlist>), we would not have to insert the numbers as content. The XML formatter would do that for us, simultaneously preserving the order of <listitem>s and automatically generating numbers according to the stylesheet's settings. This is another example of an element (<listitem>) that is treated differently based on which element it appears in.

7. This <systemitem> element is an example of an inline element that modifies text within the flow. In this case, it labels its contents as a URL to a resource on the Internet. The XML processor can use this information both to apply style (make it appear different from surrounding text) and in certain media, for example, a computer display, to turn it into a link that the user can click to view the resource.

8. Here's another inline element, this time encoding its contents as text requiring emphasis, perhaps turning it bold or italic.

9. The <chapter> element has an ID attribute because we may want to add a cross-reference to it somewhere in the text. A cross-reference is an empty element like this:

<xref linkend="idref"/>

where idref is the value of the referenced element's ID. In this case, it would be <xref linkend="intro"/>. When the document is formatted, this cross-reference element is replaced with text such as "Chapter 1, 'Introduction'".

10. This block element contains a comment meant as a note to someone on the editorial team. It will be formatted so it stands out, perhaps appearing in a lighter shade. When the book goes to press, a different stylesheet will be used that prevents these <comment> elements from being printed.

11. This <figure> element contains a graphic and its caption. The <graphic> element is a link (see Chapter 3, "Connecting Resources with Links") to a graphic file, which the XML processor will have to import for displaying.

12. Here's an example of a cross-reference in action. It references a <table> element (the linkend attribute and the <table>'s ID attribute are the same). This is an ID-IDREF link, which is described in Chapter 3, "Connecting Resources with Links". The formatter will replace the <xref> element with text such as "Table 2-1". Now, if you read the sentence again and substitute that text for the cross-reference element, it makes sense, right? One reason to use a cross-reference element like this instead of just writing "Table 2-1" is that if the table is moved to another chapter, the formatter will update the text automatically.

13. This is how a table with eight rows and two columns would be marked up in DocBook. The first row, appearing in a <thead>, is the head of the table.

14. The <errorcode> element is an inline tag, but in this case does not denote special formatting (although we can choose to format it differently if we want to). Instead, it labels a specific kind of item: an error code used in a computer program. DocBook is full of special computer terms: for example, <filename>, <function>, and <guimenuitem>, which are used as inline elements.

We want to mark up these items in detail because there is a strong possibility someone might want to search the book for a particular kind of item. You can always plug a keyword into a search engine and it will fetch the matches for you, but if you can constrain the search to the content of <errorcode> elements, you are much more likely to receive only a relevant match, rather than a homonym in the wrong context. For example, the keyword string occurs in many programming languages, and can be anything from part of a method name to a data type. Searching an entire book on Java for it would return literally hundreds of matches, so to narrow your search you could specify that the term is contained within a certain element like <type>.

15. Here, we've inserted a footnote. The <footnote> element acts as both a container of text and a marker, labeling a specific point for special processing. When the document is formatted, that point becomes the location of a footnote symbol such as an asterisk (*). The contents of the footnote are moved somewhere else, probably to the bottom of the page.

16. A <screen> is defined to preserve all whitespace (spaces, tabs, newlines), since computer programs often contain extra space to make them more readable. XML preserves whitespace in any element unless told not to. DocBook tells XML processors to disregard extra space in all but a few elements, so when the document is formatted, paragraphs lose extra spaces and justify correctly, while screens and program listings retain their extra spaces.
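Notes 14 and 16 describe two things a processor can do once the markup is this specific: search inside one element type only, and treat whitespace differently per element. Here is a rough sketch of both in Python with the standard xml.etree.ElementTree module. The sample fragment, the extra spacing inside it, the PRESERVE set, and the two helper functions are assumptions made for illustration; a real DocBook toolchain takes its whitespace rules from the DTD and stylesheet rather than from a hardcoded set.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the irregular spacing is deliberate.
SAMPLE = """<sect1>
  <para>The   screen shows a
     <errorcode>LOW-BATT</errorcode> message when the
     battery is LOW.</para>
  <screen>STATUS DISPLAY
  BATT:     1.782E8 V
  TEMP:     284 K</screen>
</sect1>"""

# Assumption: only these element types keep their whitespace verbatim.
PRESERVE = {"screen", "programlisting"}

def find_in(root, tag, keyword):
    """Return the text of every <tag> element whose content mentions keyword."""
    return [
        "".join(elem.itertext())
        for elem in root.iter(tag)
        if keyword in "".join(elem.itertext())
    ]

def formatted_text(elem):
    """Collapse runs of whitespace unless the element type is in PRESERVE."""
    text = "".join(elem.itertext())
    return text if elem.tag in PRESERVE else " ".join(text.split())

root = ET.fromstring(SAMPLE)

# A plain keyword search hits both the error code and the ordinary prose.
print("".join(root.itertext()).count("LOW"))
# Constraining the search to <errorcode> keeps only the relevant match.
print(find_in(root, "errorcode", "LOW"))

print(formatted_text(root.find("para")))
print(formatted_text(root.find("screen")))
```

The paragraph comes out as one tidy line ready for justification, while the screen keeps its line breaks and column alignment.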

That's a quick snapshot of DocBook in action. For more information about this popular XML application, check out the description in Appendix B, "A Taxonomy of Standards".

© 2001, O'Reilly & Associates, Inc.