Read an XML Document

Before you can do much with an XML document, you need to understand its basic parts. This hack explores the most common structures found in XML.

This hack lays the basic groundwork for XML: what it looks like and how it’s put together. Example 1-1 shows a simple document (start.xml) that contains some of the most common XML structures: an XML declaration, a comment, elements, attributes, an empty element, and a character reference. start.xml is well-formed, meaning that it conforms to the syntax rules in the XML specification. XML documents must be well-formed.

Example 1-1. start.xml

1. <?xml version="1.0" encoding="UTF-8"?>
2. 
3. <!-- a time instant -->
4. <time timezone="PST">
5. <hour>11</hour>
6. <minute>59</minute>
7. <second>59</second>
8. <meridiem>p.m.</meridiem>
9. <atomic signal="true" symbol="&#x25D1;"/>
10. </time>
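One quick way to see whether a document is well-formed is to hand it to a parser and watch for errors. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of this hack). The helper name is_well_formed is made up for illustration, and the line numbers from the listing are left out because they are not part of the document itself.

```python
import xml.etree.ElementTree as ET

# start.xml from Example 1-1, without the listing's line numbers.
START_XML = """<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>
"""


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Dropping the end tag of the root element breaks well-formedness.
BROKEN_XML = START_XML.replace("</time>", "")

if __name__ == "__main__":
    print(is_well_formed(START_XML))
    print(is_well_formed(BROKEN_XML))
```

A parser that accepts the document only tells you it is well-formed; it says nothing about whether the document is valid against a DTD.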

The XML Declaration

The first line of the example contains an XML declaration, which is recommended by the XML spec but is not mandatory. If present, it must appear at the very start of the document, with nothing before it (not even whitespace or a comment). It is a human- and machine-readable flag that states a few facts about the content of the document.

Tip

An XML declaration is not a processing instruction, although it looks like one. Processing instructions are discussed in [Hack #3].

In general, an XML declaration provides three pieces of information about the document that contains it: the XML version information; the character encoding in use; and whether the document stands alone or relies on information from an external source.
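To see these three pieces of information as a parser reports them, you can hook into the XML declaration event. This is only a sketch, assuming Python's standard xml.parsers.expat module; the function name read_declaration is hypothetical. expat reports the standalone value as 1 for yes, 0 for no, and -1 when the standalone declaration is omitted.

```python
from xml.parsers import expat


def read_declaration(xml_bytes):
    """Return (version, encoding, standalone) from the XML declaration.

    Returns None when the document has no XML declaration.
    """
    found = []

    def on_declaration(version, encoding, standalone):
        # standalone is 1 (yes), 0 (no), or -1 (not specified).
        found.append((version, encoding, standalone))

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found[0] if found else None


full = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><time/>'
)
no_standalone = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8"?><time/>'
)
no_declaration = read_declaration(b"<time/>")
```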

Version information

If you use an XML declaration, it must include version information (as in version="1.0"). Currently, XML Version 1.0 is in the broadest use, but Version 1.1 is also now available (http://www.w3.org/TR/xml11/), so 1.1 is also a possible value for version. The main differences between Versions 1.0 and 1.1 are that 1.1 supports a later version of Unicode (4.0 instead of 2.0), has a more liberal policy for characters used in names, adds a couple of line-end characters (NEL, U+0085, and the Unicode line separator, U+2028), and allows character references for control characters that were forbidden in 1.0 (for details see http://www.w3.org/TR/xml11/#sec-xml11).

The encoding declaration

An optional encoding declaration allows you to explicitly state the character encoding used in the document. Character encoding refers to the way characters are represented internally, usually by one or more 8-bit bytes or octets. If no encoding declaration exists in a document’s XML declaration, that XML document is required to use either UTF-8 or UTF-16 encoding. A UTF-16 document must begin with a special character called a Byte Order Mark or BOM (the zero-width, no-break space U+FEFF; see http://www.unicode.org/charts/PDF/UFE70.pdf). As values for encoding, you should use names registered with the Internet Assigned Numbers Authority or IANA (http://www.iana.org/assignments/character-sets). In addition to UTF-8 and UTF-16, possible choices include US-ASCII, ISO-8859-1, ISO-2022-JP, and Shift_JIS (http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding). If you use an uncommon encoding, make sure that your XML processor supports it, or you’ll get an error. You’ll find more in the discussion on character encoding [Hack #27]; also see http://www.w3.org/TR/REC-xml#charencoding.
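The effect of the encoding declaration is easy to observe with bytes that are not valid UTF-8. The sketch below assumes Python's standard xml.etree.ElementTree module; the name element and its content are invented for illustration. The same ISO-8859-1 bytes parse when the encoding is declared and fail when it is not, because the parser then falls back to UTF-8.

```python
import xml.etree.ElementTree as ET

# "José" stored as ISO-8859-1 bytes: the é is the single byte 0xE9.
LATIN1_BODY = b"<name>Jos\xe9</name>"

# With an encoding declaration, the parser knows how to decode the bytes.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + LATIN1_BODY
name = ET.fromstring(declared).text

# Without one, the parser assumes UTF-8, and a lone 0xE9 byte
# is not a valid UTF-8 sequence, so parsing fails.
try:
    ET.fromstring(LATIN1_BODY)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```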

The standalone declaration

An optional standalone declaration (not shown in Example 1-1) can tell an XML processor whether an XML document depends on external markup declarations; i.e., whether it relies on declarations in an external Document Type Definition (DTD). A DTD defines the content of valid XML documents. This declaration can have a value of yes or no.

Don’t worry too much about standalone declarations. If you don’t use external markup declarations, the standalone declaration has no meaning, whether its value is yes or no (standalone="yes" or standalone="no"). On the other hand, if you use external markup declarations but no standalone document declaration, the value no is assumed. Given this logic, there isn’t much real need for standalone declarations, other than as a visual cue, unless your processor can convert an XML document from one that does not stand alone to one that does, which may be more efficient in a networked environment. (See http://www.w3.org/TR/REC-xml#sec-rmd.)

Comments

Comments can contain human-readable information that can help you understand the purpose of a document or the markup in it. A comment appears on line 3 of Example 1-1. XML comments are generally ignored by XML processors, but a processor may keep track of them if this is desired (http://www.w3.org/TR/REC-xml.html#sec-comments). They begin with a <!-- and end with -->, but can’t contain the character sequence --. You can place comments anywhere in an XML document except inside other markup, such as inside tag brackets.
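Whether comments survive parsing depends on the processor. As a rough illustration, assuming Python's standard library (not something this hack relies on): the default ElementTree parser drops comments, while minidom keeps them as nodes. The last part of the sketch shows that a double hyphen inside a comment is rejected.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = (
    "<!-- a time instant -->"
    '<time timezone="PST"><!-- inner --><hour>11</hour></time>'
)

# ElementTree's default parser discards comments.
root = ET.fromstring(DOC)
et_children = [child.tag for child in root]

# minidom keeps them as Comment nodes in the tree.
dom = minidom.parseString(DOC)
first = dom.firstChild
comment_kept = first.nodeType == first.COMMENT_NODE
comment_text = first.data

# "--" inside a comment is a well-formedness error.
try:
    ET.fromstring("<time><!-- 11 -- 59 --></time>")
    double_hyphen_ok = True
except ET.ParseError:
    double_hyphen_ok = False
```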

Elements

A legal or compliant XML document must have at least one element. An element is written either as a single tag, called an empty-element tag, or as two tags, a start tag and an end tag, with content in between.

The first or top element in an XML document—such as the time element on line 4—is called the document element or root element. A document element is required in any XML document. The content of the time element consists of five child elements: hour, minute, second, meridiem, and atomic.

Element content includes text (officially called parsed character data), other child elements, or a mix of text and elements. For example, 11 is the text content of the hour element. Elements can contain a few other things, but these are the most common as far as content goes.

The atomic element on line 9 in Example 1-1 is an example of an empty element. Empty elements don’t have any content; i.e., they consist of a single tag (<atomic signal="true" symbol="&#x25D1;"/>). The other elements all have start tags and end tags; for example, <hour> is a start tag and </hour> is an end tag.

XML documents are structured documents, and that structure comes essentially from the parent-child relationship between elements. In Example 1-1, hour, minute, second, meridiem, and atomic are the children of time, and time is the parent of hour, minute, second, meridiem, and atomic. Nesting can go much deeper than a single parent-child level: a child element can have children of its own, and so on. An element that contains another at any depth is called an ancestor element, and an element nested inside another at any depth is called a descendant element.
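The parent-child structure described here maps directly onto the tree a parser builds. Here is a small sketch, assuming Python's standard xml.etree.ElementTree module. It uses the time element from Example 1-1, plus a hypothetical date wrapper (not in the original example) to show descendants more than one level down.

```python
import xml.etree.ElementTree as ET

# The time element from Example 1-1 (declaration and comment omitted).
TIME_XML = """<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>"""

root = ET.fromstring(TIME_XML)

# The document (root) element and its direct children.
root_name = root.tag
child_names = [child.tag for child in root]

# Text content of an element.
hour_text = root.find("hour").text

# An empty element has no text and no child elements.
atomic = root.find("atomic")
atomic_is_empty = atomic.text is None and len(atomic) == 0

# Hypothetical deeper nesting: date is an ancestor of hour,
# and hour is a descendant of date.
NESTED_XML = "<date><time><hour>11</hour></time></date>"
# iter() yields the element itself, then all of its descendants.
nested_names = [el.tag for el in ET.fromstring(NESTED_XML).iter()]
```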

Mixed content

The document start.xml in Example 1-1 doesn’t show mixed content. The document mixed.xml in Example 1-2 shows what mixed content looks like.

Example 1-2. mixed.xml

<?xml version="1.0" encoding="UTF-8"?>
    
<!-- a time instant -->
<time timezone="PST">The time is: <hour>11</hour>:<minute>59</minute>: 
<second>59</second> <meridiem>p.m.</meridiem></time>

The time element has both text (e.g., “The time is:”) and child element content (e.g., hour, minute, and second).
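Different APIs expose mixed content in different ways. As one illustration, assuming Python's standard xml.etree.ElementTree module, the text before the first child lives in the parent's text attribute, and text that follows a child lives in that child's tail attribute. The document below is the time element from Example 1-2 written without the line break, so the whitespace is easier to follow.

```python
import xml.etree.ElementTree as ET

MIXED_XML = (
    '<time timezone="PST">The time is: <hour>11</hour>:<minute>59</minute>:'
    "<second>59</second> <meridiem>p.m.</meridiem></time>"
)

root = ET.fromstring(MIXED_XML)

# Text before the first child element belongs to the parent.
leading_text = root.text

# Text after a child's end tag is stored as that child's tail.
hour_tail = root.find("hour").tail
second_tail = root.find("second").tail
meridiem_tail = root.find("meridiem").tail  # nothing follows it

# itertext() walks all text in document order, ignoring the markup.
flattened = "".join(root.itertext())
```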

Attributes

XML elements may also contain attributes that modify elements in some way. In start.xml, the elements time and atomic both contain attributes. For example, on line 4 of Example 1-1, the start tag of the time element contains a timezone attribute. Attributes may occur only in start tags and empty element tags, but never in end tags (see http://www.w3.org/TR/REC-xml#sec-starttags). An attribute specification consists of an attribute name paired with an attribute value. For example, in timezone="PST", timezone is the attribute name and PST is the value, separated by an equals sign (=). Attribute values must be enclosed in matching pairs of single (') or double (") quotes.
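These attribute rules can be checked against a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name parses is made up for illustration. It reads the attributes from Example 1-1 and then tries a few spellings that break the rules.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


time_el = ET.fromstring(
    '<time timezone="PST"><atomic signal="true" symbol="&#x25D1;"/></time>'
)

# Attributes are exposed as name/value pairs.
timezone = time_el.get("timezone")
atomic_attrs = dict(time_el.find("atomic").attrib)

# Single and double quotes are interchangeable as long as they match.
single_quoted = ET.fromstring("<time timezone='PST'/>").get("timezone")

# Unquoted values, mismatched quotes, and attributes in end tags
# are all well-formedness errors.
unquoted_ok = parses("<time timezone=PST/>")
mismatched_ok = parses("<time timezone=\"PST'/>")
end_tag_attr_ok = parses('<time></time timezone="PST">')
```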

Whether to use elements or attributes, and when and where they should be used to represent data, is the subject of long debate [Hack #40]. To illustrate, some prefer that the data in the document start.xml be represented as:

<time hour="11" minute="59" second="59"/>

After considering the problem for several years, my conclusion is that it seems to be more of a matter of taste than anything else. The short answer is: do what works for you.

Character references

The attribute symbol on line 9 of Example 1-1 contains something called a character reference [Hack #4]. Character references allow access to characters that are not normally available through the keyboard. A character reference begins with an ampersand (&) and ends with a semicolon (;). In the character reference &#x25D1;, the hexadecimal number 25D1 preceded by #x refers to the Unicode character “circle with right half black” (http://www.unicode.org/charts/PDF/U25A0.pdf), which looks like this when it is rendered: ◑
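A parser replaces a character reference with the character it names, so the application never sees the &#x25D1; spelling. Here is a minimal check, assuming Python's standard xml.etree.ElementTree and unicodedata modules. It also shows the decimal form of the same reference (25D1 in hexadecimal is 9681 in decimal).

```python
import unicodedata
import xml.etree.ElementTree as ET

atomic = ET.fromstring('<atomic signal="true" symbol="&#x25D1;"/>')

# The parser hands back the character itself, not the reference.
symbol = atomic.get("symbol")
code_point = ord(symbol)
unicode_name = unicodedata.name(symbol)

# The same character written as a decimal character reference.
decimal_symbol = ET.fromstring('<atomic symbol="&#9681;"/>').get("symbol")
```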

CDATA Sections

One structure not shown in the example (see [Hack #43]) is something called a CDATA section. CDATA sections in XML (http://www.w3.org/TR/REC-xml/#sec-cdata-sect) allow you to hide characters like < and & from an XML processor. This is because these characters have special meaning: a < begins an element tag and & begins a character reference or entity reference. A CDATA section begins with the characters <![CDATA[ and ends with ]]>. For example, the company element in the following fragment contains a CDATA section:

<company><![CDATA[ Fitzgerald & Daughters ]]></company>

When the fragment is parsed, the & character in the CDATA section is hidden from the processor so that it isn’t interpreted as markup the way the start of an entity reference or character reference would be.
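The company fragment can be fed to a parser to confirm this behavior. The sketch assumes Python's standard xml.etree.ElementTree module. It compares the CDATA form with the equivalent &amp; escape and with a bare ampersand, which is not well-formed.

```python
import xml.etree.ElementTree as ET

# The fragment from the text: the & is protected by the CDATA section.
cdata_text = ET.fromstring(
    "<company><![CDATA[ Fitzgerald & Daughters ]]></company>"
).text

# The same content written with the predefined &amp; entity reference.
escaped_text = ET.fromstring(
    "<company> Fitzgerald &amp; Daughters </company>"
).text

# A bare & outside a CDATA section is treated as the start of a
# reference, so this fragment is rejected.
try:
    ET.fromstring("<company> Fitzgerald & Daughters </company>")
    bare_ampersand_ok = True
except ET.ParseError:
    bare_ampersand_ok = False
```

Both accepted forms give the application the same text; the CDATA section only changes how the document is written, not what it contains.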

You now should understand the basic components of an XML document.

See Also

  • Learning XML by Erik Ray (O’Reilly)

  • XML: A Primer by Simon St.Laurent (Hungry Minds, Inc.)

  • XML 1.1 Bible by Elliotte Rusty Harold (Hungry Minds, Inc.)
