XML Pocket Reference
XML Pocket Reference Extensible Markup Language

By Robert Eckstein

Chapter 1: XML Pocket Reference
The Extensible Markup Language (XML) is a document processing standard proposed by the World Wide Web Consortium (W3C), the same group responsible for overseeing the HTML standard. Although the exact specifications have not been completed yet, many expect XML and its sibling technologies to replace HTML as the markup language of choice for dynamically generated content, including nonstatic web pages. Already several browser and word processor companies are integrating XML support into their products.
XML is actually a simplified form of Standard Generalized Markup Language (SGML), an international documentation standard that has existed since the 1980s. However, SGML is extremely bulky, especially for the Web. Much of the credit for XML’s creation can be attributed to Jon Bosak of Sun Microsystems, Inc., who started the W3C working group responsible for scaling down SGML to a form more suitable for the Internet.
Put succinctly, XML is a meta-language that allows you to create and format your own document markups. With HTML, existing markup is static: <HEAD> and <BODY>, for example, are tightly integrated into the HTML standard and cannot be changed or extended. XML, on the other hand, allows you to create your own markup tags and configure each to your liking: for example, <HeadingA>, <Sidebar>, <Quote>, or <ReallyWildFont>. Each of these elements can be defined through your own document type definitions and stylesheets and applied to one or more XML documents. Thus, it is important to realize that there are no “correct” tags for an XML document, except those you define yourself.
While many XML applications currently support Cascading Style Sheets (CSS), a more extensible stylesheet specification exists called the Extensible Stylesheet Language (XSL). By using XSL, you ensure that your XML documents are formatted the same no matter which application or platform they appear on.
Note: As you read this, the XSL specification is still in flux. There have been several rumors that XSL and the formatting objects portion of the specification will change dramatically. (There is even one rumor about XSL becoming its own language in the future.) However, even if XSL changes dramatically in the future, the material presented here should give you a firm foundation and enough expertise to make any leap of knowledge much easier.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
XML Terminology
Before we move further, we need to standardize some terminology. An XML document consists of one or more elements. An element is marked with the following form:
<Body>
This is text formatted according to the Body element
</Body>
This element consists of two tags: an opening tag, which places the name of the element between a less-than sign (<) and a greater-than sign (>), and a closing tag, which is identical except for the forward slash (/) that appears before the element name. As in HTML, the text contained between the opening and closing tags is considered part of the element and is formatted according to the element’s rules.
Elements can have attributes applied, such as the following:
<Price currency="Euro">25.43</Price>
Here, the attribute is specified inside of the opening tag and is called “currency.” It is given a value of “Euro,” which is expressed inside quotation marks. Attributes are often used to further refine or modify the default behavior of an element.
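As a rough illustration of how an application might see this element, the following Python sketch parses it with the standard library's xml.etree.ElementTree module. The choice of Python and of this parser is an assumption made for illustration; the text does not prescribe any particular tool.

```python
import xml.etree.ElementTree as ET

# Parse the single element from the example above.
price = ET.fromstring('<Price currency="Euro">25.43</Price>')

print(price.tag)              # the element name
print(price.attrib)           # all attributes, as a dictionary
print(price.get("currency"))  # the value of one attribute
print(float(price.text))      # the element's text, converted to a number
```

The attribute value reaches the application without its quotation marks; the quotes are part of the markup, not part of the value.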
In addition to the standard elements, XML also supports empty elements. An empty element has no text appearing between the opening and closing tag. Hence, both tags can (optionally) be merged together, with a forward slash appearing before the closing marker. For example, these elements are identical:
<Picture src="blueball.gif"></Picture>
<Picture src="blueball.gif"/>
Empty elements are often used to add nontextual content to a document, or to provide additional information to the application that is parsing the XML. Note that while the closing slash may not be used in single-tag HTML elements, it is mandatory for single-tag XML empty elements.
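To check that the two spellings of an empty element really are equivalent to a parser, one can compare what each produces. This is a minimal sketch using Python's xml.etree.ElementTree; the describe helper is a hypothetical name introduced only for this comparison.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<Picture src="blueball.gif"></Picture>')
short_form = ET.fromstring('<Picture src="blueball.gif"/>')

def describe(elem):
    """Summarize an element: name, attributes, text, and number of children."""
    return (elem.tag, dict(elem.attrib), elem.text or "", len(elem))

# Both forms should yield the same summary.
print(describe(long_form) == describe(short_form))
```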
Whereas HTML browsers often ignore simple errors in documents, XML applications are not nearly as forgiving. For the HTML reader, there are a few bad habits from which we should first dissuade you:
Attribute values must be in quotation marks.
XML Reference
Now that you have had a quick taste of working with XML, here is an overview of the more common rules and constructs of the XML language.
These are the rules for a well-formed XML document:
  • The document must either use a DTD or contain an XML declaration with the standalone attribute set to “no”. For example:
    <?xml version="1.0" standalone="no"?>
  • All element attribute values must be in quotation marks.
  • An element must have both an opening and closing tag, unless it is an empty element.
  • If a tag is a standalone empty element, it must contain a closing slash (/) before the end of the tag.
  • All opening and closing element tags must nest correctly.
  • Isolated markup characters are not allowed in text: < or & must use entity references instead. In addition, the sequence ]]> must be expressed as ]]&gt; when used as regular text. (Entity references are discussed in further detail later.)
  • Well-formed XML documents without a corresponding DTD must have all attributes of type CDATA by default.
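Several of the syntax rules above can be tried out with any non-validating parser. The sketch below uses Python's xml.etree.ElementTree and treats a parse error as "not well-formed". The is_well_formed helper and the sample strings are assumptions made for illustration, and the check covers only the syntax rules (quoting, closing tags, nesting, isolated markup characters), not the rules about DTDs.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = {
    "quoted attribute": '<Price currency="Euro">25.43</Price>',
    "unquoted attribute": '<Price currency=Euro>25.43</Price>',
    "missing closing tag": '<Body>text',
    "empty element without slash": '<Doc><Picture src="blueball.gif"></Doc>',
    "bad nesting": '<a><b></a></b>',
    "bare ampersand": '<Body>Q & A</Body>',
    "escaped ampersand": '<Body>Q &amp; A</Body>',
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the "quoted attribute" and "escaped ampersand" samples should be accepted; each of the others breaks one of the rules listed above.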
The following XML instructions are legal.
<?xml ... ?>
<?xml version="number" [encoding="encoding"] [standalone="yes|no"] ?>
Description
Although they are not required to, XML documents typically begin with an XML declaration. An XML declaration must start with the characters <?xml
Document Type Definitions
A DTD specifies how elements inside an XML document should relate to each other. It also provides grammar rules for the document and each of its elements. A document that adheres to the specifications outlined by its DTD is considered to be valid. (Don’t confuse this with a well-formed document, which adheres to the XML syntax rules outlined earlier.)
You must declare each of the elements that appear inside your XML document within your DTD. You can do so with the <!ELEMENT> declaration, which uses this format:
<!ELEMENT elementname rule>
This declares an XML element and an associated rule, which relates the element logically in the XML document. The element name should not include <> characters. An element name must start with a letter or an underscore. After that, it can have any number of letters, numbers, hyphens, periods, or underscores in its name. Element names may not start with the string xml, in any variation of upper- or lowercase. You can use a colon in element names only if you are using namespaces; otherwise, it is forbidden.
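The naming rules just described can be sketched as a small checker. This is a simplification under stated assumptions: it accepts only ASCII letters and digits, whereas XML itself also permits many non-ASCII letters, and the function name is_legal_element_name and its namespaces flag are hypothetical names chosen for this example.

```python
import re

# First character: letter or underscore. Then letters, digits,
# hyphens, periods, or underscores. (ASCII only: a simplification.)
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def is_legal_element_name(name, namespaces=False):
    """Apply the naming rules from the text to a candidate element name."""
    # Names may not start with "xml" in any mix of upper- and lowercase.
    if name.lower().startswith("xml"):
        return False
    # A colon is tolerated only when namespaces are in use, as prefix:local.
    parts = name.split(":") if namespaces else [name]
    if len(parts) > 2:
        return False
    return all(_NAME.fullmatch(part) is not None for part in parts)

for candidate in ["HeadingA", "_note", "2ndSection", "XmlData", "OReilly:Book"]:
    print(candidate, is_legal_element_name(candidate))
```

Calling is_legal_element_name("OReilly:Book", namespaces=True) accepts the prefixed name that the plain call rejects.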

Section 1.3.1.1: ANY and PCDATA

The simplest element declaration states that between the opening and closing tags of the element, anything can appear:
<!ELEMENT library ANY>
Using the ANY keyword allows you to include both other tags and general character data within the element. However, you may want to restrict an element so that only general characters can appear inside it. This type of data is better known as parsed character data, or PCDATA for short. You can specify that an element can contain only PCDATA with the following declaration:
<!ELEMENT title (#PCDATA)>
Remember, this declaration means that any character data that is not an element can appear between the element tags. Therefore, it’s legal to write the following in your XML document:
<title></title>
<title>XML Pocket Reference</title>
<title>Java Network Programming</title>
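Python's standard library parsers do not validate against a DTD, so the sketch below only approximates the (#PCDATA) constraint by checking that an element has no child elements. The is_pcdata_only helper and the <em> element in the last sample are assumptions introduced for illustration; a validating parser would be needed to enforce the DTD itself.

```python
import xml.etree.ElementTree as ET

def is_pcdata_only(elem):
    """True when elem holds only (possibly empty) text and no child elements."""
    return len(elem) == 0

samples = [
    "<title></title>",
    "<title>XML Pocket Reference</title>",
    "<title>Java <em>Network</em> Programming</title>",  # contains a child element
]

for text in samples:
    print(text, is_pcdata_only(ET.fromstring(text)))
```

The first two samples pass the check, including the empty title; the third does not, because it contains an element rather than character data alone.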
The Extensible Stylesheet Language
The Extensible Stylesheet Language (XSL) is one of the most intricate parts of the XML specification. It’s also a bit of a moving target right now: as we write this, the XSL specification is moving in a completely new direction, possibly becoming its own language in the future. Much of the information in the following pages will be out-of-date very soon, as it is based on the current XSL specification in early 1999. For the very latest information on XSL, visit the home page for the W3C XSL working group at http://www.w3.org/Style/XSL/. This section will still provide you with a firm understanding of how XSL is meant to be used.
As we mentioned, XSL works by applying element-formatting rules that you define to each XML document it encounters. In reality, XSL simply transforms each XML document from one series of element types to another. For example, XSL can be used to apply HTML formatting to an XML document, which would transform it from:
<?xml version="1.0"?>
<OReilly:Book title="XML Comments">
 <OReilly:Chapter title="Working with XML">
   <OReilly:Image src="http://www.oreilly.com/1.gif"/>
   <OReilly:HeadA>Starting XML</OReilly:HeadA>
   <OReilly:Body>If you haven't used XML, then ...
     </OReilly:Body>
 </OReilly:Chapter>
</OReilly:Book>
to the following HTML:
<HTML>
  <HEAD>
  <TITLE>XML Comments</TITLE>
  </HEAD>
  <BODY>
    <H1>Working with XML</H1>
    <img src="http://www.oreilly.com/1.gif"/>
    <H2>Starting XML</H2>
    <P>If you haven't used XML, then ...</P>
  </BODY>
</HTML>
If you look carefully, you can see a predefined hierarchy that remains from the source content to the resulting content. To venture a guess, the <OReilly:Book> element probably maps to the <HTML>, <HEAD>, <TITLE>, and <BODY> elements in HTML. The <OReilly:Chapter> element maps to the HTML <H1> element, the <OReilly:Image> element maps to the <img> element, and so on.
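The guessed mapping can be made concrete with a short hand-written transformation. Note that this is plain Python using xml.etree.ElementTree, not XSL; it only mimics the element-to-element mapping described above. The namespace URI bound to the OReilly prefix is hypothetical (the parser needs the prefix to be declared), and book_to_html and q are names invented for this sketch.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/oreilly"  # hypothetical namespace URI for the prefix

SOURCE = f"""<?xml version="1.0"?>
<OReilly:Book xmlns:OReilly="{NS}" title="XML Comments">
 <OReilly:Chapter title="Working with XML">
   <OReilly:Image src="http://www.oreilly.com/1.gif"/>
   <OReilly:HeadA>Starting XML</OReilly:HeadA>
   <OReilly:Body>If you haven't used XML, then ...</OReilly:Body>
 </OReilly:Chapter>
</OReilly:Book>"""

def q(name):
    """Build the qualified tag name that ElementTree uses for namespaced elements."""
    return f"{{{NS}}}{name}"

def book_to_html(book):
    """Map Book to HTML/HEAD/TITLE/BODY, Chapter to H1, and so on."""
    html = ET.Element("HTML")
    head = ET.SubElement(html, "HEAD")
    ET.SubElement(head, "TITLE").text = book.get("title")
    body = ET.SubElement(html, "BODY")
    for chapter in book.findall(q("Chapter")):
        ET.SubElement(body, "H1").text = chapter.get("title")
        for child in chapter:
            if child.tag == q("Image"):
                ET.SubElement(body, "img", src=child.get("src"))
            elif child.tag == q("HeadA"):
                ET.SubElement(body, "H2").text = child.text
            elif child.tag == q("Body"):
                ET.SubElement(body, "P").text = (child.text or "").strip()
    return html

html = book_to_html(ET.fromstring(SOURCE))
print(ET.tostring(html, encoding="unicode"))
```

Each branch of the loop corresponds to one of the guessed mappings; an XSL stylesheet would express the same correspondences as declarative rules rather than as code.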
This demonstrates an essential aspect of XML: each document contains a hierarchy of elements that can be organized in a tree-like fashion. (If the document uses a DTD, that hierarchy is well-defined.) In the previous XML example, the
XLink and XPointer
The final pieces of XML we cover are XLink and XPointer. These two creations fall under the Extensible Linking Language (XLL), a separate portion of the XML standard dedicated to working with XML links. Before we delve into this, however, we should warn you that the standard described here is subject to change at any time.
It’s important to remember that an XML link is only an assertion of a relationship between pieces of documents; how the link is actually presented to a user depends on a number of factors, including the application processing the XML document.
In order to create a link, we must first have a labeling scheme for XML elements. We do this by assigning an identifier to specific elements we want to reference using an ID attribute:
<paragraph id="attack">Suddenly the skies were filled 
with aircraft.</paragraph>
You can think of IDs in XML documents as street addresses: they provide a unique identifier for an element within a document. However, just as there might be an identical address in a different city, an element in a different document might have the same ID. Consequently, you can tie together an ID with the document’s URI, as shown below:
http://www.oreilly.com/documents/story.xml#attack
The combination of a document’s URI and an element’s ID should uniquely identify that element throughout the universe. Remember that an ID attribute does not need to be named “id,” as we showed in the first example. You can name it anything you want, as long as you define it as an XML ID in the document’s DTD. (However, using “id” is preferred in the event that the XML processor does not read the DTD.)
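As a sketch of how an application might resolve such an address, the Python code below splits the URI into its document part and its ID part, then searches the document for the matching element. Everything beyond the URI and the paragraph element is an assumption: the <story> wrapper element, the in-memory DOCUMENTS table that stands in for fetching the file, and the resolve helper. It also assumes the ID attribute is literally named "id", since this parser does not read a DTD.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

# Stand-in for retrieving documents over the network (hypothetical content).
DOCUMENTS = {
    "http://www.oreilly.com/documents/story.xml":
        '<story><paragraph id="attack">Suddenly the skies were filled '
        'with aircraft.</paragraph></story>',
}

def resolve(uri):
    """Return the element addressed by document-URI#id, or None if absent."""
    document_uri, element_id = urldefrag(uri)
    root = ET.fromstring(DOCUMENTS[document_uri])
    for elem in root.iter():
        if elem.get("id") == element_id:
            return elem
    return None

target = resolve("http://www.oreilly.com/documents/story.xml#attack")
print(target.tag, target.text)
```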
Should you give an ID to every element in your documents? No: the odds are that most elements will never be referenced. It’s best to place IDs on items that a reader would want to refer to later, such as chapter and section divisions, as well as important items, such as term definitions.