O'Reilly Hacks
XML Hacks
By Michael Fitzgerald
July 2004

HACK #4
Use Character and Entity References
Not all characters are available on the keyboard! This hack shows you how to represent such characters in an XML document by using decimal and hexadecimal character references, and how to represent entities by using entity references.

In XML, a character reference is formed by placing a numerical value between &# and ;, and an entity reference is formed by placing a name between & and ;. For example, &#169; is a decimal character reference and &copy; is an entity reference. This hack shows you how to use both.

Character References

According to the third edition of the XML 1.0 specification (http://www.w3.org/TR/REC-xml/), XML processors must accept over 1,000,000 characters, which the specification lists as ranges of hexadecimal code points (http://www.w3.org/TR/REC-xml/#charsets). It's possible that you won't be able to find all those characters on your keyboard! Don't worry. You can use character references instead.
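The "over 1,000,000" figure can be checked by adding up the ranges in the specification's Char production. The following is a minimal sketch in Python; the names XML_CHAR_RANGES, count_xml_chars, and is_xml_char are made up for this illustration and are not part of any library.

```python
# Code point ranges from the Char production of the XML 1.0 specification:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def count_xml_chars(ranges=XML_CHAR_RANGES):
    """Return how many code points the ranges cover (bounds are inclusive)."""
    return sum(high - low + 1 for low, high in ranges)


def is_xml_char(code_point):
    """Return True if the code point is allowed in an XML 1.0 document."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


total = count_xml_chars()
print(total)                # 1112033
print(is_xml_char(0xFC))    # True: u with an umlaut
print(is_xml_char(0xD800))  # False: surrogates are excluded
```

Any code point for which is_xml_char returns True can be written as a character reference, whether or not your keyboard can produce it.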

TIP

You can look up the semantics of individual Unicode characters at http://www.unicode.org/charts/.

You can reference characters using either decimal or hexadecimal numbers. Which one you use is a matter of style. The document Namen.xml uses both; it contains some German names enclosed in elements that are tagged as German.

On lines 7 and 8 are the decimal character references &#252; and &#9792;, respectively. The first one refers to the letter u with an umlaut (ü) and the second one is a female sign (♀). Lines 12 and 13 use the hexadecimal character references &#xFC; (ü again) and &#x2642; (male sign, ♂), respectively. You can see how these character references are rendered in Opera in Figure 1.

Figure 1. Namen.xml in Opera, styled by Namen.css
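To confirm that the decimal and hexadecimal forms name the same characters, you can hand a small document to an XML parser. The following is a minimal sketch that assumes Python's standard xml.etree.ElementTree module. The document in it is a hypothetical stand-in, not the Namen.xml listing; the element names and the name Jürgen are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document: the same characters written once as
# decimal and once as hexadecimal character references.
doc = """<names xml:lang="de">
 <name>J&#252;rgen</name>
 <name>J&#xFC;rgen</name>
 <sign>&#9792;</sign>
 <sign>&#x2642;</sign>
</names>"""

root = ET.fromstring(doc)

# The parser resolves every character reference before handing back text.
names = [element.text for element in root.iter("name")]
signs = [element.text for element in root.iter("sign")]

print(names)  # ['Jürgen', 'Jürgen']
print(signs)  # ['♀', '♂']

# Decimal 252 and hexadecimal FC are the same code point.
print(int("FC", 16) == 252)  # True
```

Because the parser resolves the references, an application reading the document never sees which notation the author chose.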

The xml:lang attribute

Incidentally, the xml:lang attribute on line 4 is a special language identification attribute in XML 1.0 (http://www.w3.org/TR/REC-xml/#sec-lang-tag). Its value de is a language identifier as defined by RFC 3066 (http://www.ietf.org/rfc/rfc3066.txt) and ISO 639 (search http://www.iso.ch). Other examples of language identifiers are en (English), fr (French), and es (Spanish).
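If you process such a document in code, the xml:lang value is available like any other attribute, but under the reserved XML namespace. The following is a small sketch that assumes Python's standard xml.etree.ElementTree module; the document and the name Maria are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is permanently bound to this namespace name.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical document with a language identifier on the root element.
doc = '<names xml:lang="de"><name>Maria</name></names>'
root = ET.fromstring(doc)

# ElementTree stores namespaced attributes as {namespace}localname.
lang = root.get("{%s}lang" % XML_NS)
print(lang)  # de
```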

Entity References

XML has five predefined entities, listed in Table 1. These predefined entities can be used where the equivalent literal character is forbidden. For example, an attribute value cannot contain a less-than sign (<), because it looks too much like the beginning of a tag to an XML parser. No problem: you can use &lt; instead. Likewise, you cannot use an ampersand in parsed character data, the text content of an element. Why? Again, it looks like the beginning of a character or entity reference to an XML parser. Again, no problem: you can use &amp; instead.

Table 1. XML predefined entities

Entity reference   Description
&lt;               Less-than sign or open angle bracket (<)
&gt;               Greater-than sign or close angle bracket (>)
&amp;              Ampersand (&)
&apos;             Apostrophe or single quote (')
&quot;             Quote or double quote (")
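When a program generates XML, you rarely type the predefined entities by hand; a library inserts them. The following is a minimal sketch that assumes the escape, quoteattr, and unescape helpers in Python's standard xml.sax.saxutils module. The sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

# escape() replaces &, < and > in element content.
text = escape("Fish & Chips < 5")
print(text)  # Fish &amp; Chips &lt; 5

# quoteattr() escapes a value and wraps it in quotes for use as an attribute.
attr = quoteattr("a < b")
print(attr)  # "a &lt; b"

# The quote entities are opt-in: pass them as extra replacements.
quoted = escape("'single' and \"double\"", {"'": "&apos;", '"': "&quot;"})
print(quoted)  # &apos;single&apos; and &quot;double&quot;

# A parser turns the references back into the literal characters.
element = ET.fromstring("<p note=%s>%s</p>" % (attr, text))
print(element.text)         # Fish & Chips < 5
print(element.get("note"))  # a < b

# unescape() reverses escape() for &lt;, &gt; and &amp;.
print(unescape("&lt;tag&gt; &amp; more"))  # <tag> & more
```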

The document copy.xml uses a predefined entity and also declares and references a new entity.

The entity copy is declared in the document type declaration on line 3. The keyword is ENTITY; it is followed by the entity name copy; and this is followed by the value or content of the entity in quotes, "&#169;". (This entity comes standard in HTML and XHTML.) Line 12 of this document references the entity declared on line 3 (&copy;) and also references the XML 1.0 predefined entity for an ampersand (&amp;). Open this document in Firefox (it is styled by the CSS stylesheet copy.css) and it will appear as shown in Figure 2.

Figure 2. copy.xml in Firefox
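The same declare-then-reference pattern can be tried without a browser. The following is a minimal sketch that assumes Python's standard xml.etree.ElementTree module, which expands entities declared in the internal subset of the document type declaration. The document is a hypothetical stand-in for copy.xml; the notice element and its text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for copy.xml: the internal subset declares the
# entity copy, and the content references it along with the predefined &amp;.
doc = """<!DOCTYPE notice [
 <!ENTITY copy "&#169;">
]>
<notice>&copy; 2004 Smith &amp; Jones. &copy; again.</notice>"""

root = ET.fromstring(doc)

# Both the declared entity and the predefined entity are expanded.
print(root.text)  # © 2004 Smith & Jones. © again.

# Referencing an entity that was never declared is a well-formedness error.
try:
    ET.fromstring("<notice>&trade; 2004</notice>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)  # False
```

The second half shows why the declaration matters: only the five predefined entities can be referenced without one.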

Character references provide a convenient means to access a very large number of characters. Entities are also a convenient means to store information and access it elsewhere, even multiple times if necessary.



© 2007 O'Reilly Media, Inc.