Before you can do much with an XML document, you need to understand its basic parts. This hack explores the most common structures found in XML.
This hack lays the basic groundwork for XML: what it looks like and how it’s put together. Example 1-1 shows a simple document (start.xml) that contains some of the most common XML structures: an XML declaration, a comment, elements, attributes, an empty element, and a character reference. start.xml is well-formed, meaning that it conforms to the syntax rules in the XML specification. XML documents must be well-formed.
Example 1-1. start.xml
1. <?xml version="1.0" encoding="UTF-8"?>
2.
3. <!-- a time instant -->
4. <time timezone="PST">
5.  <hour>11</hour>
6.  <minute>59</minute>
7.  <second>59</second>
8.  <meridiem>p.m.</meridiem>
9.  <atomic signal="true" symbol="&#x25D1;"/>
10. </time>
The first line of the example contains an XML declaration, which is recommended by the XML spec but is not mandatory. If present, it must be the very first thing in the document, with nothing (not even whitespace or a comment) before it. It is a human- and machine-readable flag that states a few facts about the content of the document.
Tip
An XML declaration is not a processing instruction, although it looks like one. Processing instructions are discussed in [Hack #3] .
In general, an XML declaration provides three pieces of information about the document that contains it: the XML version information; the character encoding in use; and whether the document stands alone or relies on information from an external source.
If you use an XML declaration, it must include version information (as in version="1.0"). Currently, XML Version 1.0 is in the broadest use, but Version 1.1 is also now available (http://www.w3.org/TR/xml11/), so 1.1 is also a possible value for version. The main differences between Versions 1.0 and 1.1 are that 1.1 supports a later version of Unicode (4.0 instead of 2.0), has a more liberal policy for characters used in names, recognizes two additional line-end characters (NEL, U+0085, and the Unicode line separator, U+2028), and allows character references for control characters that were forbidden in 1.0 (for details see http://www.w3.org/TR/xml11/#sec-xml11).
An optional encoding declaration allows you to explicitly state the character encoding used in the document. Character encoding refers to the way characters are represented internally, usually by one or more 8-bit bytes or octets. If no encoding declaration exists in a document’s XML declaration, that XML document is required to use either UTF-8 or UTF-16 encoding. A UTF-16 document must begin with a special character called a Byte Order Mark or BOM (the zero-width no-break space U+FEFF; see http://www.unicode.org/charts/PDF/UFE70.pdf).
As values for encoding, you should use names registered at the Internet Assigned Numbers Authority or IANA (http://www.iana.org/assignments/character-sets). In addition to UTF-8 and UTF-16, possible choices include US-ASCII, ISO-8859-1, ISO-2022-JP, and Shift_JIS (http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding). If you use an encoding that is uncommon, make sure that your XML processor supports the encoding or you’ll get an error. You’ll find more in the discussion on character encoding [Hack #27]; also see http://www.w3.org/TR/REC-xml#charencoding.
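The link between the encoding declaration and the actual bytes can be checked with a short experiment. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the one-element name document is a hypothetical example, not part of this hack's files.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; \u00e9 is the character "e with acute".
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Jos\u00e9</name>'

# The bytes match the declared encoding, so the parser decodes them correctly.
root = ET.fromstring(text.encode("iso-8859-1"))
decoded = root.text

# Same ISO-8859-1 bytes, but the declaration now claims UTF-8.
# The byte 0xE9 followed by "<" is not a valid UTF-8 sequence.
mismatched = text.replace("ISO-8859-1", "UTF-8").encode("iso-8859-1")
try:
    ET.fromstring(mismatched)
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True
```

When the declaration and the bytes agree, the text comes back intact; when they disagree, this parser reports an error rather than guessing.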
An optional standalone declaration (not shown in Example 1-1) can tell an XML processor whether an XML document depends on external markup declarations; i.e., whether it relies on declarations in an external Document Type Definition (DTD). A DTD defines the content of valid XML documents. This declaration can have a value of yes or no.
Don’t worry too much about standalone declarations. If you don’t use external markup declarations, the standalone declaration has no meaning, whether its value is yes or no (standalone="yes" or standalone="no"). On the other hand, if you use external markup declarations but no standalone document declaration, the value no is assumed. Given this logic, there isn’t much real need for standalone declarations, other than as a visual cue, unless your processor can convert an XML document from one that does not stand alone to one that does, which may be more efficient in a networked environment. (See http://www.w3.org/TR/REC-xml#sec-rmd.)
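To see the three pieces of information an XML declaration carries, you can ask a parser to report them. This is a minimal sketch that assumes Python's standard xml.dom.minidom module, whose document object exposes version, encoding, and standalone attributes; the two tiny documents are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical documents: one with a standalone declaration, one without.
with_standalone = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><time/>'
without_standalone = b'<?xml version="1.0" encoding="UTF-8"?><time/>'

doc = minidom.parseString(with_standalone)
# Version, encoding, and standalone as stated in the XML declaration.
declared = (doc.version, doc.encoding, doc.standalone)

doc2 = minidom.parseString(without_standalone)
# minidom leaves this attribute unset (None) when the declaration omits it.
undeclared_standalone = doc2.standalone
```

Note that minidom only reports what the declaration says; it does not act on the standalone value.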
Comments can contain human-readable information that can help you understand the purpose of a document or the markup in it. A comment appears on line 3 of Example 1-1. XML comments are generally ignored by XML processors, but a processor may keep track of them if this is desired (http://www.w3.org/TR/REC-xml.html#sec-comments). They begin with a <!-- and end with -->, but can’t contain the character sequence --. You can place comments anywhere in an XML document except inside other markup, such as inside tag brackets.
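Both comment rules are easy to observe. The sketch below assumes Python's xml.etree.ElementTree, which by default drops comments while building its tree; other processors may keep them.

```python
import xml.etree.ElementTree as ET

# A comment between the start tag and the first child element.
root = ET.fromstring("<time><!-- a time instant --><hour>11</hour></time>")
# With the default parser settings the comment does not appear in the tree.
child_tags = [child.tag for child in root]

# A double hyphen inside a comment is not allowed.
try:
    ET.fromstring("<time><!-- 11 -- 59 --></time>")
    double_hyphen_rejected = False
except ET.ParseError:
    double_hyphen_rejected = True
```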
A well-formed XML document must have at least one element. An element consists of either a single tag (an empty-element tag) or two tags, a start tag and an end tag, with content in between.
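One way to test these rules is to hand small fragments to a parser and see which ones it accepts. The helper name is_well_formed below is an assumption made for this sketch, built on Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

results = {
    "one empty element": is_well_formed("<time/>"),
    "start and end tag": is_well_formed("<hour>11</hour>"),
    "no element at all": is_well_formed("<!-- only a comment -->"),
    "mismatched tags": is_well_formed("<hour>11</minute>"),
}
```

The first two fragments parse; the last two are rejected, one because it has no element and the other because its end tag does not match its start tag.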
The first or top element in an XML document, such as the time element on line 4, is called the document element or root element. A document element is required in any XML document. The content of the time element consists of five child elements: hour, minute, second, meridiem, and atomic.
Element content includes text (officially called parsed character data), other child elements, or a mix of text and elements. For example, 11 is the text content of the hour element. Elements can contain a few other things, but these are the most common as far as content goes.
The atomic element on line 9 in Example 1-1 is an example of an empty element. Empty elements don’t have any content; i.e., they consist of a single tag (<atomic signal="true" symbol="&#x25D1;"/>). The other elements all have start tags and end tags; for example, <hour> is a start tag and </hour> is an end tag.
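The structure just described can be confirmed by parsing the document. This is a rough sketch with Python's xml.etree.ElementTree; the document is embedded as a string (without the listing's line numbers) so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The content of start.xml from Example 1-1, without the listing's line numbers.
START_XML = """<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true" symbol="&#x25D1;"/>
</time>
"""

root = ET.fromstring(START_XML)

root_tag = root.tag                            # the document (root) element
child_tags = [child.tag for child in root]     # its five child elements
hour_text = root.find("hour").text             # text content of hour

atomic = root.find("atomic")
# An empty element has no text and no child elements.
atomic_is_empty = atomic.text is None and len(atomic) == 0
```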
XML documents are structured documents, and that structure comes essentially from the parent-child relationship between elements. In Example 1-1, hour, minute, second, meridiem, and atomic are the children of time, and time is the parent of each of them. Nesting can go deeper than a single parent-child level: an element that contains another element at any depth is its ancestor, and an element contained at any depth is a descendant.
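Ancestors and descendants are easier to see in a document nested more than one level deep. The sketch below uses a hypothetical document in which time sits inside an assumed clock element. Python's ElementTree does not record an element's parent, so the sketch builds its own child-to-parent map.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: time is nested inside an assumed clock element.
root = ET.fromstring(
    "<clock><time><hour>11</hour><minute>59</minute></time></clock>"
)

# Descendants of the root: every element below it, at any depth, in document order.
descendants = [el.tag for el in root.iter() if el is not root]

# Map each child element to its parent so we can walk upward.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(element):
    """Return the tags of all ancestors of element, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain

hour_ancestors = ancestors(root.find("time/hour"))
```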
The document start.xml in Example 1-1 doesn’t show mixed content. The document mixed.xml in Example 1-2 shows what mixed content looks like.
Example 1-2. mixed.xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">The time is: <hour>11</hour>:<minute>59</minute>:
<second>59</second> <meridiem>p.m.</meridiem></time>
The time element has both text (e.g., “The time is:”) and child element content (e.g., hour, minute, and second).
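How a processor exposes mixed content depends on its API. As one illustration, Python's xml.etree.ElementTree stores text before the first child in the parent's text attribute and text following a child in that child's tail attribute. The sketch puts the time element on a single line so the whitespace is predictable.

```python
import xml.etree.ElementTree as ET

# The time element from mixed.xml, written on one line.
MIXED = (
    '<time timezone="PST">The time is: <hour>11</hour>:'
    '<minute>59</minute>: <second>59</second> <meridiem>p.m.</meridiem></time>'
)

root = ET.fromstring(MIXED)

leading_text = root.text                  # text before the first child element
after_hour = root.find("hour").tail       # text between </hour> and <minute>
# All character data in document order, with the markup removed.
all_text = "".join(root.itertext())
```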
XML elements may also contain attributes that modify elements in some way. In start.xml, the elements time and atomic both contain attributes. For example, on line 4 of Example 1-1, the start tag of the time element contains a timezone attribute. Attributes may occur only in start tags and empty element tags, but never in end tags (see http://www.w3.org/TR/REC-xml#sec-starttags).
An attribute specification consists of an attribute name paired with an attribute value. For example, in timezone="PST", timezone is the attribute name and PST is the value, separated by an equals sign (=). Attribute values must be enclosed in matching pairs of single (') or double (") quotes.
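The attribute rules above can be tried out directly. This is a small sketch with Python's xml.etree.ElementTree; the helper name parses is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Single and double quotes are equivalent as long as they match.
double_quoted = dict(ET.fromstring('<time timezone="PST"/>').attrib)
single_quoted = dict(ET.fromstring("<time timezone='PST'/>").attrib)

unquoted_ok = parses("<time timezone=PST/>")            # value without quotes
mismatched_ok = parses("<time timezone=\"PST'/>")       # opening " closed by '
end_tag_attr_ok = parses('<time></time timezone="PST">')  # attribute in end tag
```

The two quoted forms yield the same name-value pair, while the three malformed variants are all rejected.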
Whether to use elements or attributes, and when and where they should be used to represent data, is the subject of long debate [Hack #40]. To illustrate, some prefer that the data in the document start.xml be represented as:
<time hour="11" minute="59" second="59"/>
After considering the problem for several years, my conclusion is that it seems to be more of a matter of taste than anything else. The short answer is: do what works for you.
The attribute symbol on line 9 of Example 1-1 contains something called a character reference [Hack #4]. Character references allow access to characters that are not normally available through the keyboard. A character reference begins with an ampersand (&) and ends with a semicolon (;). In the character reference &#x25D1;, the hexadecimal number 25D1 preceded by #x refers to the Unicode character “circle with right half black” (http://www.unicode.org/charts/PDF/U25A0.pdf), which looks like this when it is rendered:
◑
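A parser replaces a character reference with the character it names, so an application never sees the &#x...; form. The sketch below, which assumes Python's xml.etree.ElementTree and unicodedata modules, checks that for the atomic element and also shows the equivalent decimal reference.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hexadecimal character reference, as in Example 1-1.
hex_ref = ET.fromstring('<atomic signal="true" symbol="&#x25D1;"/>')
symbol = hex_ref.get("symbol")

# The same character written as a decimal reference (0x25D1 == 9681).
dec_ref = ET.fromstring('<atomic signal="true" symbol="&#9681;"/>')
same_character = dec_ref.get("symbol") == symbol

# Look up the official Unicode name of the resulting character.
symbol_name = unicodedata.name(symbol)
```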
One structure not shown in the example is something called a CDATA section (see [Hack #43]). CDATA sections in XML (http://www.w3.org/TR/REC-xml/#sec-cdata-sect) allow you to hide characters like < and & from an XML processor. This is because these characters have special meaning: a < begins an element tag and & begins a character reference or entity reference. A CDATA section begins with the characters <![CDATA[ and ends with ]]>. For example, the company element in the following fragment contains a CDATA section:
<company><![CDATA[ Fitzgerald & Daughters ]]></company>
When the fragment is parsed, the & character in the CDATA section is hidden from the processor, so it isn’t interpreted as markup the way the start of an entity reference or character reference would be.
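The effect of the CDATA section can be compared with the alternatives. In this sketch, again assuming Python's xml.etree.ElementTree, the same text is supplied three ways: inside a CDATA section, with the ampersand escaped as &amp;, and with a bare ampersand.

```python
import xml.etree.ElementTree as ET

# The ampersand inside a CDATA section is taken literally.
cdata = ET.fromstring("<company><![CDATA[ Fitzgerald & Daughters ]]></company>")
cdata_text = cdata.text

# Escaping the ampersand gives the application the same text.
escaped = ET.fromstring("<company> Fitzgerald &amp; Daughters </company>")
escaped_text = escaped.text

# A bare ampersand outside a CDATA section is an error.
try:
    ET.fromstring("<company> Fitzgerald & Daughters </company>")
    bare_ampersand_ok = True
except ET.ParseError:
    bare_ampersand_ok = False
```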
You now should understand the basic components of an XML document.