The Eight-Minute XML Tutorial: Appendix C - Perl for System Administration

by David N. Blank-Edelman

This excerpt is from Perl for System Administration.

Perl for System Administration is aimed at all levels of administrators on the Unix, Windows NT, or MacOS platforms. Assuming only a little familiarity with Perl, it explores the pockets of administration where Perl can be most useful, including filesystem management, user administration, directory services, database administration, log files, and security and network monitoring. Perl for System Administration is for anyone who needs to use Perl for administrative tasks and needs to hit the ground running.


One of the most impressive features of XML (eXtensible Markup Language) is how little you need to know to get started. This appendix gives you some of the key pieces of information. For more information, see one of the many books being released on the topic or the references at the end of Chapter 3.

XML Is a Markup Language

Thanks to the ubiquity of XML’s older and stodgier cousin, HTML, almost everyone is familiar with the notion of a markup language. Like HTML, XML consists of plain text interspersed with little bits of special descriptive or instructive text. HTML has a rigid definition for which bits of markup text, called tags, are allowed, while XML allows you to make up your own.

XML provides a range of expression far beyond that of HTML. We see an example of this expression in Chapter 3, but here’s another simple example that should be easy to read even without any prior XML experience:

<machine>
  <name> quidditch </name>
  <department> Software Sorcery </department>
  <room> 129A </room>
  <owner> Harry Potter </owner>
  <ipaddress> 192.168.1.13 </ipaddress>
</machine>
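As a minimal sketch of how a program might read this record, the following uses Python's standard xml.etree.ElementTree module. The choice of Python and the helper name read_machine are assumptions made for illustration; nothing in this appendix prescribes them. The spaces around each value in the example are part of the data, so the sketch strips them.

```python
import xml.etree.ElementTree as ET

MACHINE_XML = """
<machine>
  <name> quidditch </name>
  <department> Software Sorcery </department>
  <room> 129A </room>
  <owner> Harry Potter </owner>
  <ipaddress> 192.168.1.13 </ipaddress>
</machine>
"""

def read_machine(xml_text):
    """Return a dict mapping each child element's tag to its stripped text.

    read_machine is a hypothetical helper written for this example.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

if __name__ == "__main__":
    for field, value in read_machine(MACHINE_XML).items():
        print(f"{field}: {value}")
```

Because the tags carry the meaning, the code never has to guess which line holds the owner and which holds the room.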

XML Is Picky

Despite XML’s flexibility, it is pickier in places than HTML. There are syntax and grammar rules that your data must follow. These rules are set down rather tersely in the XML specification found at http://www.w3.org/TR/1998/REC-xml-19980210. Rather than poring through the official spec, I recommend you seek out one of the annotated versions, like Tim Bray’s version at http://www.xml.com, or Robert Ducharme’s book XML: The Annotated Specification (Prentice Hall). The former is online and free; the latter has many good examples of actual XML code.

Here are two of the XML rules that tend to trip up people who know HTML:

  1. If you begin something, you must end it. In the above example we started a machine listing with <machine> and finished it with </machine>. Leaving off the ending tag would not have been acceptable XML.

    In HTML, tags like <img src="picture.jpg" > are legally allowed to stand by themselves. Not so in XML; this would have to be written either as:

<img src="picture.jpg" > </img>

or:

<img src="picture.jpg" />

The extra slash at the end of this last tag lets the XML parser know that this single tag serves as both its own start and end tag. A piece of data together with its surrounding start and end tags is called an element.

  2. Start tags and end tags must mirror each other exactly. Mixing case is not allowed. If your start tag is <MaChINe>, your end tag must be </MaChINe>, and cannot be </MACHine> or any other case combination. HTML is much more forgiving in this regard.
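Both rules can be checked mechanically by handing the text to an XML parser. The sketch below is one way to do that with Python's standard xml.etree.ElementTree module; the function name is_well_formed and the sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError.

    is_well_formed is a hypothetical helper written for this example.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Each sample exercises one of the two rules described above.
SAMPLES = {
    "start tag without end tag": '<p><img src="picture.jpg" ></p>',
    "explicit end tag": '<p><img src="picture.jpg" > </img></p>',
    "empty-element tag": '<p><img src="picture.jpg" /></p>',
    "matching case": "<MaChINe>quidditch</MaChINe>",
    "mismatched case": "<MaChINe>quidditch</MACHine>",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {is_well_formed(text)}")
```

The first and last samples are rejected; the other three are accepted.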

These are two of the general rules in the XML specification. But sometimes you want to define your own rules for an XML parser to enforce. By “enforce” we mean “complain vociferously” or “stop parsing” while reading the XML data. If we use our previous machine database XML snippet as an example, one additional rule we might want to enforce is “all <machine> entries must contain a <name> and an <ipaddress> element.” You may also wish to restrict the contents of an element to a set of specific values like “YES” or “NO.”
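To make the idea of such rules concrete, here is a hand-written check in Python. It is only a stand-in for what a definition language lets a parser enforce on your behalf. The function name check_machine and the <backup> element are hypothetical, introduced purely to illustrate the “YES”/“NO” restriction.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("name", "ipaddress")   # the rule quoted in the text
YES_NO_ELEMENTS = ("backup",)      # hypothetical element, for illustration only

def check_machine(machine):
    """Return a list of rule violations for one <machine> element."""
    problems = []
    for tag in REQUIRED:
        if machine.find(tag) is None:
            problems.append(f"missing <{tag}>")
    for tag in YES_NO_ELEMENTS:
        node = machine.find(tag)
        if node is not None and (node.text or "").strip() not in ("YES", "NO"):
            problems.append(f"<{tag}> must be YES or NO")
    return problems

if __name__ == "__main__":
    entry = ET.fromstring(
        "<machine><name>quidditch</name><backup>maybe</backup></machine>"
    )
    for problem in check_machine(entry):
        print(problem)
```

Writing checks like this by hand does not scale, which is why a declarative way to state the rules is attractive.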

How these rules get defined is less straightforward than the other material we’ll cover because there are several complementary and competitive proposals for a definition “language” afloat at the moment. XML will eventually be self-defining (i.e., the document itself or something linked into the document describes its structure).

The current XML specification uses a DTD (Document Type Definition), the SGML standby. Here’s an example piece of XML code from the XML specification that has its definition code at the beginning of the document itself:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>

The first line of this example specifies the version of XML in use and the character encoding (UTF-8, an encoding of Unicode) for the document. The next three lines define the types of data in this document. This is followed by the actual document content (the <greeting> element) in the final line of the example.
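Not every parser acts on these definitions. As a hedged illustration, Python's standard xml.etree.ElementTree module (built on the non-validating expat parser) reads the document above, DOCTYPE included, but does not enforce the element declaration. The variable names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# The example document, as bytes so the encoding declaration is honored.
GREETING = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>"""

root = ET.fromstring(GREETING)
print(root.tag, repr(root.text))

# The definition allows only text inside <greeting>, yet a non-validating
# parser accepts a nested element without complaint.
VIOLATES_DTD = GREETING.replace(b"Hello, world!", b"<shout>Hello</shout>")
print(ET.fromstring(VIOLATES_DTD)[0].tag)
```

If you need the definitions enforced, check that your parser is a validating one.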

If we wanted to define how the <machine> XML code at the beginning of this appendix should be validated, we could place something like this at the beginning of the file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE machine [
  <!ELEMENT machine (name,department,room,owner,ipaddress)>
  <!ELEMENT name       (#PCDATA)>
  <!ELEMENT department (#PCDATA)>
  <!ELEMENT room       (#PCDATA)>
  <!ELEMENT owner      (#PCDATA)>
  <!ELEMENT ipaddress  (#PCDATA)>
]>

This definition requires that a machine element consist of name, department, room, owner, and ipaddress elements (in this specific order). Each of those elements is described as being PCDATA (see the Leftovers section at the end of this appendix).
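The ordering requirement can be approximated in a few lines of Python. This is a rough sketch, not DTD validation: it hard-codes the content model rather than reading it from the DTD, and the names MACHINE_MODEL and matches_machine_model are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mirrors <!ELEMENT machine (name,department,room,owner,ipaddress)>
MACHINE_MODEL = ["name", "department", "room", "owner", "ipaddress"]

def matches_machine_model(xml_text):
    """True if the root is <machine> with exactly the modeled children, in
    order, and each child holds only text (no nested elements)."""
    root = ET.fromstring(xml_text)
    if root.tag != "machine":
        return False
    children = list(root)
    if [child.tag for child in children] != MACHINE_MODEL:
        return False
    return all(len(child) == 0 for child in children)

if __name__ == "__main__":
    in_order = "<machine>" + "".join(f"<{t}>x</{t}>" for t in MACHINE_MODEL) + "</machine>"
    reordered = "<machine>" + "".join(f"<{t}>x</{t}>" for t in reversed(MACHINE_MODEL)) + "</machine>"
    print(matches_machine_model(in_order))
    print(matches_machine_model(reordered))
```

A validating parser performs this kind of comparison for every declared element, driven by the DTD itself.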

Another popular set of proposals, none of them yet a specification, recommends using data descriptions called schemas for DTD-like purposes. Schemas are themselves written in XML code. Here’s an example of schema code that uses the Microsoft implementation of the XML-data proposal found at http://www.w3.org/TR/1998/NOTE-XML-data/:

<?xml version='1.0' ?>
<schema id='MachineSchema' 
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">

<!-- define our element types (they are all just strings/PCDATA) -->
    <elementType id="name">
        <string/>
    </elementType>
    <elementType id="department">
        <string/>
    </elementType>
    <elementType id="room">
      <string/>
    </elementType>
    <elementType id="owner">
        <string/>
    </elementType>
    <elementType id="ipaddress">
        <string/>
    </elementType>

    <!-- now define our actual machine element -->
    <elementType id="machine" content="CLOSED">
       <element type="#name"       occurs="REQUIRED"/>
       <element type="#department" occurs="REQUIRED"/>
       <element type="#room"       occurs="REQUIRED"/>
       <element type="#owner"      occurs="REQUIRED"/>
       <element type="#ipaddress"  occurs="REQUIRED"/>
    </elementType>
</schema>

XML schema technology is (as of this writing) still very much in the discussion phase in the standards process. XML-data, which we used in the above example, is just one of the proposals in front of the Working Group studying this issue. Because the technology moves fast, I recommend paying careful attention to the most current standards (found at http://www.w3.org) and your software’s level of compliance with them.

Both the mature DTD and fledgling schema mechanisms can get complicated quickly, so we’re going to leave further discussion of them to the books that are dedicated to XML/SGML.

Two Key XML Terms

You can’t go very far in XML without learning these two important terms. XML data is said to be well-formed if it follows all of the XML syntax and grammar rules (matching tags, etc.). Often a simple check for well-formed data can help spot typos in XML files. That’s already an advantage when the data you are dealing with holds configuration information like the machine database excerpted above.

XML data is said to be valid if it conforms to the rules we’ve set down in one of the data definition mechanisms mentioned earlier. For instance, if your data file conforms to its DTD, it is valid XML data.

Valid data by definition is well-formed, but the converse does not have to be true. It is possible to have perfectly wonderful XML data that does not have an associated DTD or schema. If it parses properly, it is well-formed, but not valid.

Leftovers

Here are three terms that appear throughout the XML literature and may stymie the XML beginner:

Attribute

A description of an element that is part of its start tag. To reuse a previous example, in <img src="picture.jpg" />, src="picture.jpg" is an attribute of this element. There is some controversy in the XML world about when to use the contents of an element and when to use attributes. The best set of guidelines on this particular issue can be found at http://www.oasis-open.org/cover/elementsAndAttrs.html.

CDATA

The term CDATA (Character Data) is used in two contexts. Most of the time it refers to everything in an XML document that is not markup (tags, etc.). The second context involves CDATA sections. A CDATA section is declared to indicate that an XML parser should leave that section of data alone even if it contains text that could be construed as markup.

PCDATA

Tim Bray’s annotation of the XML specification (mentioned earlier) gives the following definition:

The string PCDATA itself stands for “Parsed Character Data.” It is another inheritance from SGML; in this usage, “parsed” means that the XML processor will read this text looking for markup signaled by < and & characters.

You can think of this as data composed of CDATA and potentially some markup. Most XML data falls into this classification.
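The three terms above can be seen side by side in a short Python sketch using the standard xml.etree.ElementTree module. The sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute: src is stored on the element, separate from its content.
img = ET.fromstring('<img src="picture.jpg" />')
print(img.attrib)   # the element's attributes as a dictionary
print(img.text)     # None, since the element is empty

# PCDATA: the parser scans the text for markup, so the references
# &amp;, &lt; and &gt; are decoded into the characters they stand for.
parsed = ET.fromstring("<owner>Potter &amp; Weasley &lt;admins&gt;</owner>")
print(parsed.text)

# CDATA section: the parser leaves the enclosed text alone, so a literal
# & or < needs no escaping.
raw = ET.fromstring("<owner><![CDATA[Potter & Weasley <admins>]]></owner>")
print(raw.text)
```

Both <owner> elements end up holding the same text; the CDATA section only changes how that text has to be written in the file.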

XML has a bit of a learning curve. This small tutorial should help you get started.
