Use Character and Entity References

Not all characters are available on the keyboard! This hack shows you how to represent such characters in an XML document by using decimal and hexadecimal character references, and how to represent entities by using entity references.

In XML, character and entity references are formed by surrounding a numerical value or a name with & and ;—for example, © is a decimal character reference and © is an entity reference. This hack shows you how to use both.

Character References

According to the third and latest edition of the XML 1.0 specification (http://www.w3.org/TR/REC-xml/), XML processors must accept over 1,000,000 hexadecimal characters (http://www.w3.org/TR/REC-xml/#charsets). It’s possible that you won’t be able to find all those characters on your keyboard! Don’t worry. You can use character references instead.

Tip

You can look up the semantics of individual Unicode characters at http://www.unicode.org/charts/.

You can reference characters using either decimal or hexadecimal numbers. Which one you use is a matter of style. The document Namen.xml uses both (Example 1-5); it contains some German names enclosed in German language tags.

Example 1-5. Namen.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet href="Namen.css" type="text/css"?>

   

<Namen xml:lang="de">

<Name>

 <Vorname>Marie</Vorname>

 <Nachname>M&#252;ller</Nachname>

 <Geschlecht>&#9792;</Geschlecht>

</Name>

<Name>

 <Vorname>Klaus</Vorname>

 <Nachname>M&#xfc;ller</Nachname>

 <Geschlecht>&#x2642;</Geschlecht>

</Name>

</Namen>

On lines 7 and 8 are the decimal character references &#252; and &#9792;, respectively. The first one refers to the letter u with an umlaut (ü) and the second one is a female sign. Lines 12 and 13 use the hexadecimal character references &#xfc; (ü) and &#x2642; (male sign), respectively. You can see how these character references are rendered in Opera in Figure 1-6.

Namen.xml in Opera, styled by Namen.css

Figure 1-6. Namen.xml in Opera, styled by Namen.css

The xml:lang attribute

Incidentally, the xml:lang attribute on line 4 is a special language identification attribute in XML 1.0 (http://www.w3.org/TR/REC-xml/#sec-lang-tag). Its value de is a language identifier as defined by RFC 3066 (http://www.ietf.org/rfc/rfc3066.txt) and ISO 639 (search http://www.iso.ch). Other examples of language identifiers are en (English), fr (French), and es (Spanish).

Entity References

XML has five predefined entities, listed in Table 1-1. These predefined entities can be used where the equivalent literal character is forbidden. For example, an attribute value cannot contain a less-than sign (<), because it looks too much like the beginning of a tag to an XML parser. No problem: you can use &lt; instead. Likewise, you cannot use an ampersand in parsed character data, the text content of an element. Why? Again, it looks like the beginning of a character or entity reference to an XML parser. Again, no problem: you can use &amp; instead.

Table 1-1. XML predefined entities

Entity reference

Description

&lt;

Less-than sign or open angle bracket (<)

&gt;

Greater-than sign or close angle bracket (>)

&amp;

Ampersand (&)

&apos;

Apostrophe or single quote (')

&quot;

Quote or double quote (“)

The following document, copy.xml in Example 1-6, uses a predefined entity and also declares and references a new entity.

Example 1-6. copy.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet href="copy.css" type="text/css"?>

<!DOCTYPE time [<!ENTITY copy "&#169;">]>



<!-- a time instant -->

<time timezone="PST">

 <hour>11</hour>

 <minute>59</minute>

 <second>59</second>

 <meridiem>p.m.</meridiem>

 <atomic signal="true"/>

 <copyright>&copy; O'Reilly &amp; Associates</copyright>

</time>

The entity copy is declared in the document type declaration on line 3. The keyword is ENTITY; it is followed by the entity name copy; and this is followed by the value or content of the entity in quotes, "&#169;“. (This entity comes standard in HTML and XHTML.) Line 12 of this document references the entity declared on line 3 (&copy;) and also references the XML 1.0 predefined entity for an ampersand (&amp;). Open this document in Firefox (it is styled by the CSS stylesheet copy.css) and it will appear like Figure 1-7.

copy.xml in Firefox

Figure 1-7. copy.xml in Firefox

Character references provide a convenient means to access a very large number of characters. Entities [Hack #25] are also a convenient means to store information and access it elsewhere, even multiple times if necessary.

Get XML Hacks now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.