The Eight-Minute XML Tutorial - Automating System Administration with Perl

by David N. Blank-Edelman

One of the most impressive features of XML (eXtensible Markup Language) is how little you need to know to get started. This appendix gives you some of the key pieces of information you’ll need. The references at the end of Chapter 6, Working with Configuration Files, point you to many excellent resources that you can turn to for more information.

This excerpt is from Automating System Administration with Perl, Second Edition. Thoroughly updated and expanded in its second edition to cover the latest operating systems, technologies, and Perl modules, Automating System Administration with Perl will help you perform your job with less effort. The second edition not only offers you the right tools for your job, but also suggests the best way to approach particular problems and securely automate pressing tasks.

XML Is a Markup Language

Thanks to the ubiquity of XML’s older and stodgier cousin, HTML, almost everyone is familiar with the notion of a markup language. Like HTML, XML consists of plain text interspersed with little bits of special descriptive or instructive text. HTML has a rigid definition for which bits of markup text, called tags, are allowed, while XML allows you to make up your own.

Consequently, XML provides a range of expression far beyond that of HTML. One example of this range of expression is found in Chapter 6, Working with Configuration Files, but here’s another simple example that you should find easy to read even if you don’t have any prior XML experience:

<hosts>
  <machine>
    <name> quiddish </name>
    <department> Software Sorcery </department>
    <room> 314WVH </room>
    <owner> Horry Patter </owner>
    <ipaddress> 192.168.1.13 </ipaddress>
  </machine>
  <machine>
    <name> dibby </name>
    <department> Hardware Hackery </department>
    <room> 310WVH </room>
    <owner> Harminone Grenger </owner>
    <ipaddress> 192.168.1.15 </ipaddress>
  </machine>
</hosts>
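
To show how readable this structure is to a program as well as to a person, here is a minimal sketch that loads the example into a list of dictionaries. It uses Python's standard xml.etree.ElementTree module purely for illustration (the book itself works in Perl), and the function name read_machines is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <hosts> example from the text, embedded as a string.
HOSTS_XML = """\
<hosts>
  <machine>
    <name> quiddish </name>
    <department> Software Sorcery </department>
    <room> 314WVH </room>
    <owner> Horry Patter </owner>
    <ipaddress> 192.168.1.13 </ipaddress>
  </machine>
  <machine>
    <name> dibby </name>
    <department> Hardware Hackery </department>
    <room> 310WVH </room>
    <owner> Harminone Grenger </owner>
    <ipaddress> 192.168.1.15 </ipaddress>
  </machine>
</hosts>
"""


def read_machines(xml_text):
    """Return one dict per <machine>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    machines = []
    for machine in root.findall("machine"):
        # Strip the padding spaces that surround each value in the example.
        record = {child.tag: (child.text or "").strip() for child in machine}
        machines.append(record)
    return machines


machines = read_machines(HOSTS_XML)
for record in machines:
    print(record["name"], record["ipaddress"])
```

Because the tag names are made up for this data, the dictionary keys (name, room, owner, and so on) come straight from the document rather than from any fixed vocabulary.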

XML Is Picky

Despite XML’s flexibility, it is pickier in places than HTML. There are syntax and grammar rules that your data must follow. These rules are set down rather tersely in the XML specification found at http://www.w3.org/TR/REC-xml/. Rather than poring through the official spec, I recommend you seek out one of the annotated versions, such as Tim Bray’s version (available at http://www.xml.com) or Robert DuCharme’s book XML: The Annotated Specification (Prentice Hall). The former is online and free; the latter has many good examples of actual XML code.

Here are three of the XML rules that tend to trip up people who know HTML:

  • If you begin something, you must end it. In the preceding example, we started a machine listing with <machine> and finished it with </machine>. Leaving off the ending tag would not have been acceptable XML.

  • In HTML, tags like <img src="picture.jpg"> are legally allowed to stand by themselves. Not so in XML. This would have to be written as either:

    <img src="picture.jpg"> </img>

    or:

    <img src="picture.jpg" />

    The extra slash at the end of this last tag lets the XML parser know that this single tag serves as both a start and an end tag. A pair of start and end tags and the data they contain are together called an element.

  • Start tags and end tags must mirror one another exactly. Changing the case is not allowed, because XML is case-sensitive. If your start tag is <MaChINe>, your end tag must be </MaChINe> and cannot be </MACHine> or any other case combination. HTML is much more forgiving in this regard.
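
A quick way to see these rules enforced is to hand a parser some deliberately broken snippets. The sketch below assumes Python's standard xml.etree.ElementTree module (not a tool from the book); the helper name is_well_formed and the wrapping <p> element are illustrative choices.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    # Rule 1: whatever you begin, you must end.
    "missing end tag": "<machine><name>quiddish</name>",
    # Rule 2: a lone tag needs either an end tag or a trailing slash.
    "unclosed lone tag": '<p><img src="picture.jpg"></p>',
    "self-closed tag": '<p><img src="picture.jpg" /></p>',
    # Rule 3: start and end tags must match, including case.
    "mismatched case": "<MaChINe></MACHine>",
    "matching case": "<MaChINe></MaChINe>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the self-closed tag and the matching-case pair are accepted; the other three are rejected with a parse error.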

These are three of the general rules in the XML specification. But sometimes you want to define your own additional rules for an XML parser to enforce (where by “enforce” I mean “complain vociferously” or “stop parsing” while reading the XML data if a violation is encountered). If we use our previous machine database XML snippet as an example, one additional rule we might want to enforce is “all <machine> entries must contain a <name> and an <ipaddress> element.” You may also wish to restrict the contents of an element to a set of specific values, like YES or NO.

How these rules get defined is less straightforward than the other material we’ll cover, because there are several complementary and competitive definition “languages” afloat at the moment.

The current XML specification uses a Document Type Definition (DTD), the SGML standby. Here’s an example piece of XML code from the XML specification that has its definition code at the beginning of the document itself:

<?xml version="1.0" encoding="UTF-8" ?>

<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>

The first line of this example specifies the version of XML in use and the character encoding (UTF-8, an encoding of Unicode) for the document. The next three lines define the types of data in this document. This is followed by the actual document content (the <greeting> element) in the final line of the example.

If we wanted to define how the <hosts> XML code at the beginning of this appendix should be validated, we could place something like this at the beginning of the file:

<?xml version="1.0" encoding="UTF-8" ?>

<!DOCTYPE hosts [
  <!ELEMENT hosts      (machine)*>
  <!ELEMENT machine    (name,department,room,owner,ipaddress)>
  <!ELEMENT name       (#PCDATA)>
  <!ELEMENT department (#PCDATA)>
  <!ELEMENT room       (#PCDATA)>
  <!ELEMENT owner      (#PCDATA)>
  <!ELEMENT ipaddress  (#PCDATA)>
]>

This definition requires that a hosts element contain only machine elements (zero or more of them, because of the trailing *) and that each machine element consist of name, department, room, owner, and ipaddress elements (in this specific order). Each of those elements is described as being #PCDATA (see the section called “Leftovers” for details).
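
To make the content model concrete, the sketch below re-implements the element-ordering part of this DTD by hand. It is an illustration only: Python's standard xml.etree.ElementTree parser does not validate against a DTD, so the function check_hosts (a hypothetical name) simply walks the tree and compares child names. It does not check the #PCDATA declarations, and a real validating parser would do all of this for you.

```python
import xml.etree.ElementTree as ET

# Mirrors <!ELEMENT machine (name,department,room,owner,ipaddress)>
REQUIRED_CHILDREN = ["name", "department", "room", "owner", "ipaddress"]


def check_hosts(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    if root.tag != "hosts":
        return [f"root element is <{root.tag}>, expected <hosts>"]

    problems = []
    # Mirrors <!ELEMENT hosts (machine)*>: zero or more machine children.
    for index, child in enumerate(root, start=1):
        if child.tag != "machine":
            problems.append(f"child {index} is <{child.tag}>, expected <machine>")
            continue
        found = [grandchild.tag for grandchild in child]
        if found != REQUIRED_CHILDREN:
            problems.append(
                f"machine {index} contains {found}, expected {REQUIRED_CHILDREN}"
            )
    return problems


complete = (
    "<hosts><machine><name>quiddish</name><department>Software Sorcery</department>"
    "<room>314WVH</room><owner>Horry Patter</owner>"
    "<ipaddress>192.168.1.13</ipaddress></machine></hosts>"
)
incomplete = (
    "<hosts><machine><name>dibby</name>"
    "<ipaddress>192.168.1.15</ipaddress></machine></hosts>"
)

print(check_hosts(complete))
print(check_hosts(incomplete))
```

The first call reports no problems; the second reports one, because the machine entry is missing department, room, and owner.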

The World Wide Web Consortium (W3C) has also created a specification for data descriptions called schemas for DTD-like purposes. Schemas are themselves written in XML code. Here’s an example of schema code that uses the 1.0 XML Schema recommendation syntax found at http://www.w3.org/XML/Schema (version 1.1 of this recommendation was still in process while this book was being written):

<?xml version='1.0' ?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="MachineType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="department" type="xsd:string"/>
      <xsd:element name="room" type="xsd:string"/>
      <xsd:element name="owner" type="xsd:string"/>
      <xsd:element name="ipaddress" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="ListOfMachines">
    <xsd:sequence>
      <xsd:element name="machine" type="MachineType"
                   minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="hosts" type="ListOfMachines" />
</xsd:schema>

Both the DTD and schema mechanisms can get complicated quickly, so we’re going to leave further discussion of them to the books that are dedicated to XML/SGML.

Two Key XML Terms

You can’t go very far in XML without learning two important terms. First, XML data is said to be well-formed if it follows all of the XML syntax and grammar rules (matching tags, etc.). Often a simple check for well-formed data can help you spot typos in XML files. That’s an advantage when the data you are dealing with holds configuration information, as in the machine database excerpted in the last section.

Second, XML data is said to be valid if it conforms to the rules we’ve set down in one of the data definition mechanisms mentioned earlier. For instance, if your data file conforms to its DTD, it is valid XML data.

Valid data by definition is well-formed, but the converse does not have to be true. It is possible to have perfectly wonderful XML data that does not have an associated DTD or schema. If it parses properly, it is well-formed, but not valid.
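
As an illustration of using a well-formedness check to spot typos, the sketch below reports where the parser gave up. It assumes Python's standard xml.etree.ElementTree module, whose ParseError carries a position attribute holding the line and column; the function name first_syntax_error and the misspelled sample are made up for this example. Note that it checks only well-formedness, not validity.

```python
import xml.etree.ElementTree as ET


def first_syntax_error(xml_text):
    """Return None if xml_text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# The end tag on line 4 is misspelled ("ipadress"), a typical config typo.
config_with_typo = """\
<hosts>
  <machine>
    <name>quiddish</name>
    <ipaddress>192.168.1.13</ipadress>
  </machine>
</hosts>
"""

error = first_syntax_error(config_with_typo)
if error is not None:
    line, column, message = error
    print(f"not well-formed at line {line}: {message}")
```

A document that passes this check is well-formed; whether it is also valid depends on a DTD or schema that this parser never consults.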

Leftovers

Here are three terms that appear throughout the XML literature and may stymie the XML beginner:

Attribute

The descriptions of an element that are part of the initial start tag. To reuse a previous example, in the element <img src="picture.jpg" />, src="picture.jpg" is an attribute. There is some controversy in the XML world about when to use the contents of an element and when to use attributes. The best set of guidelines on this particular issue is found at http://www.oasis-open.org/cover/elementsAndAttrs.html.

CDATA

The term CDATA (Character Data) is used in two contexts. Most of the time it refers to everything in an XML document that is not markup (tags, etc.). The second context involves CDATA sections. A CDATA section is declared to indicate that an XML parser should leave that section of data alone even if it contains text that could be construed as markup. CDATA sections look a little strange. Here’s the example from the XML spec:

<![CDATA[<greeting>Hello, world!</greeting>]]>

In this case the <greeting></greeting> tags get treated like just plain characters and not as markup that needs to be parsed.

PCDATA

Tim Bray’s annotation of the XML specification (mentioned earlier) gives the following definition:

The string PCDATA itself stands for “Parsed Character Data.” It is another inheritance from SGML; in this usage, “parsed” means that the XML processor will read this text looking for markup signaled by < and & characters.

You can think of this as data composed of CDATA and potentially some markup. Most XML data falls into this classification.
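
The three terms above can be seen side by side in a short sketch. It assumes Python's standard xml.etree.ElementTree module, and the wrapping <doc> element is added only so that each snippet is a complete document.

```python
import xml.etree.ElementTree as ET

# Attribute: src="picture.jpg" lives inside the start tag.
img = ET.fromstring('<img src="picture.jpg" />')
src_value = img.get("src")

# CDATA section: the parser leaves the contents alone, so the
# <greeting> tags arrive as plain characters, not as a child element.
cdata_doc = ET.fromstring(
    "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
)
cdata_text = cdata_doc.text
cdata_children = len(cdata_doc)

# PCDATA: the same characters without the CDATA wrapper are parsed,
# so <greeting> becomes a real child element.
parsed_doc = ET.fromstring("<doc><greeting>Hello, world!</greeting></doc>")
parsed_children = len(parsed_doc)
parsed_text = parsed_doc[0].text

print(src_value)
print(cdata_text, cdata_children)
print(parsed_text, parsed_children)
```

The CDATA version yields one string containing the literal tags and no child elements, while the parsed version yields a greeting child whose text is just Hello, world!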

Here are two final tips about things that experienced XML users say may trip up people new to XML:

  • Pay attention to the characters that, as in HTML, may need to be represented as entity references rather than typed literally in your XML data. These are <, >, &, ' (single quote), and " (double quote). They are represented using the same convention as in HTML: &lt;, &gt;, &amp;, &apos;, and &quot;. Strictly speaking, only < and & must always be escaped in element content; the quote characters matter only inside attribute values delimited by the same kind of quote. Lots of new users get stymied because they leave a bare ampersand somewhere in their data and it doesn’t parse.

  • If you are going to place non-UTF-8 data into your documents, be sure to specify an encoding. Encodings are specified in the XML declaration:

    <?xml version="1.0" encoding="iso-8859-1" ?>

    A common mistake is to either omit this declaration or declare the document as UTF-8 when it has other kinds of characters in it.
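
Both tips can be tried out in a few lines. The sketch below assumes Python's standard library (xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for entity escaping); the helper name parses and the sample strings, including the name Zoë, are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(data):
    """Return True if data (str or bytes) is accepted by the parser."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Tip 1: a bare ampersand or angle bracket breaks parsing until escaped.
raw_owner = "Patter & Grenger <admins>"
bare_accepted = parses(f"<owner>{raw_owner}</owner>")
escaped_owner = escape(raw_owner)  # replaces &, < and > with entity references
round_trip = ET.fromstring(f"<owner>{escaped_owner}</owner>").text

# Tip 2: bytes that are not UTF-8 need a matching encoding declaration.
name = "Zo\u00eb"  # contains one non-ASCII character
declared = (
    f'<?xml version="1.0" encoding="iso-8859-1" ?><owner>{name}</owner>'
).encode("iso-8859-1")
mislabeled = (
    f'<?xml version="1.0" encoding="UTF-8" ?><owner>{name}</owner>'
).encode("iso-8859-1")

declared_text = ET.fromstring(declared).text
mislabeled_accepted = parses(mislabeled)

print(bare_accepted, escaped_owner, round_trip)
print(declared_text == name, mislabeled_accepted)
```

The unescaped string and the mislabeled bytes are both rejected, while the escaped string and the correctly declared ISO-8859-1 document come back with their original text intact.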

XML has a bit of a learning curve, but this small tutorial should help you get started. Once you have the basics down, you can begin to look at some of the more complex specifications that surround XML, including XSLT (for transforming XML to something else, such as HTML), XPath (a way of referring to a specific part of an XML document; see the next appendix), and SOAP/XML-RPC (used to communicate with remote services using messages written in XML).

References for More Information

See the end of Chapter 6, Working with Configuration Files, for more references on XML-related topics.
