Most of the power of SAX is exposed through event callbacks. In previous chapters you’ve seen some of the most widely used event callbacks as well as how to ensure that all the callbacks are generated and reported to application code.
This chapter presents the rest of the standard SAX event-handling interfaces (including the extension handlers), then talks about some of the common ways that event consumers use those interfaces. These interfaces are primarily implemented by application code that consumes events and needs to solve particular problems. You might also write custom event producers, which call these interfaces directly rather than expecting some type of XMLReader to issue them.
In Section 2.3, in Chapter 2, we looked at the most important APIs used to handle XML document content. Some other APIs were deferred to this section because they aren’t used as widely. Depending on what problems you’re solving, you may rely heavily on some of these additional methods.
Five ContentHandler callbacks were discussed in Chapter 2: Section 2.3.4 explained how characters and element boundaries were reported, and Section 2.6.4 explained how namespace-prefix scopes were reported. But the interface has six other methods. Here’s what they do and when you’ll want to use them:
-
void setDocumentLocator (Locator l)
This is normally the first callback from a parser; the single parameter is a Locator, discussed later. Strictly speaking, SAX parsers are not required to provide a locator or to make this callback; however, you’d want to avoid parsers that don’t provide this information. Your implementation of this callback will normally just save the locator; it can’t do much more since it’s the only SAX event callback that can’t throw a SAXException:
class MyHandler implements ContentHandler ... {
    private Locator locator;
    ...
    public void setDocumentLocator (Locator l) {
        locator = l;
    }
    ...
}
Use this object as discussed later in this chapter, in Section 4.1.2. It is the standard way to report the base URI of the XML text currently being parsed; that information is essential for resolving relative URIs. It’s also essential for diagnostics that tell you where application code detects errors in large quantities of XML text.
-
void startDocument (), void endDocument ()
These two callbacks bracket processing for a document, and they are normally used to manage application state associated with the document being parsed. If you’re parsing a document, these methods will always be called once each, even when parsing is cut short by a thrown exception. No other methods have such guarantees.
startDocument() is always called before any data is reported from the parser, and is normally used to initialize application data structures. It will usually be the second callback from the parser; parsers that provide a Locator will report that first. You can’t rely on a setDocumentLocator() call before startDocument(); structure your initialization code to do the real work in the callback guaranteed to be available.
endDocument() is always called to report that no more document data will be provided. The normal application response is to clean up all state associated with the current parse. The parser closes any input data streams you gave it using an InputSource (discussed later), so the application doesn’t need to do that. Cleanup would include forgetting any saved Locator since that object is no longer usable when the parse is complete. Also, you’d likely close other files or sockets that were opened while processing this document:
class MyHandler implements ContentHandler ... {
    ...
    public void startDocument () throws SAXException {
        // initialize data structures for ALL handlers here
        ...
    }
    public void endDocument () throws SAXException {
        // free those same data structures
        locator = null;
        elementStack = null;
        ...
    }
    ...
}
These two calls are widely used in robust SAX code because they provide such good hooks to control memory usage and manage associated file descriptors. However, some SAX2 parsers have a bug that reduces the robustness offered by SAX; they won’t correctly call endDocument() when parsing is aborted by throwing exceptions.
-
void processingInstruction (target, data)
Processing Instructions (PIs) are used in XML for data that doesn’t obey the rules of a DTD. They can be placed anywhere in a document, including within the DTD, except inside other markup constructs like tags. Unlike comments, PIs are designed for applications to use. They’re part of the document structure that programmatic logic must understand; they can follow rules, just not ones found in a DTD or schema. This method has two parameters:
-
String target
XML applications use this parameter to determine how to handle the PI. You can rely on the fact that it’ll never be the string xml (in any combination of upper- and lowercase characters) because XML and text declarations are not processing instructions.
Some documents follow the convention that the target of a PI names a notation (perhaps the fully qualified URI found in its system identifier) and the meaning is associated with the notation rather than the name. That’s a fine practice to follow, but it isn’t essential. Most code just compares target names as strings, rather than use data reported with DTDHandler.notationDecl() to figure out what a target name should mean.
-
String data
This parameter is data associated with the PI, and it may be the null string if no data was provided after the target name. Some applications use the syntax of an attribute here; others don’t bother.
Processing instructions are natural to use in template systems and other document-oriented applications.[19]
Processing instructions are normally safe to ignore when your processing doesn’t recognize them (passing them on to any subsequent processing stage), or to store. If your processing does recognize them, it normally acts on them immediately. For example, an <?xml-stylesheet ...?> PI might select a particular XSLT stylesheet to use for generating a servlet’s output. The processing instruction event is used later, in Example 6-9.
-
void ignorableWhitespace(buf,offset,len)
This is an optional callback, made by most parsers (including all that are validating) to report whitespace that separates elements in element content models, like those of the form (title,para*,sect1*) but not (#PCDATA|para|comment)*, ANY, or EMPTY. Whitespace before or after the document’s root element is not treated as ignorable and is completely discarded. Providing this information is a requirement of the XML specification, since this kind of whitespace is defined to be markup rather than document content. If the parser doesn’t see such a content model declaration for any reason, it can’t use this callback; it’ll use characters() instead, and applications will need to figure out if the whitespace is part of markup or part of content.
The parameters are exactly the same as those of the characters() callback, except that you know the characters in the specified range will all be spaces, tabs, or newlines. (Keep that in mind if you’re directly producing ignorable whitespace to feed some event consumer. Using CRLF- or CR-style line ends here is a bug, though you might not see immediate consequences.) Like characters(), this method can be called several times in a row, to complete processing a single stretch of characters.
There are two popular ways to handle this callback. My favorite is to drop all the characters; they’re only in the source document to make the elements lay out nicely, so they won’t ever mean anything. There’s rarely a reason to even look at the data, much less save it. The other option is to delegate handling and just call the characters() callback with the whitespace.
-
void skippedEntity (String name)
The parameter is a String that identifies an internal or external parsed entity. General entity names are presented as found in their declarations (dudley). Parameter entity names begin with a percent sign (%nell). The external DTD subset is special; it’s an unnamed parameter entity and is reported with the name [dtd]. You might not be able to tell if the skipped entity was an internal or external entity, even using DeclHandler events.
You probably don’t ever want to see this call, since it means that part of your document has been hidden. XML 1.0 processors are required to report this case; SAX 1.0 didn’t, and most other parser-level APIs (such as DOM Level 2) still don’t. This is a call that only nonvalidating parsers may issue, and even then only if they are not parsing all the external entities referred to in documents; that is, where one or both of the external entities feature flags is set to false, to disable reading external general or parameter entities. No widely used Java parsers clear those flags by default, so this is a rare call in Java. However, some C parsers, such as Expat (used in Mozilla), won’t normally parse external entities, so the notion isn’t exotic in all languages.
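The callbacks described in this list can be seen together in a small handler. What follows is only a sketch: the class name CallbackLogger, its log field, and the run() helper are made up for illustration, and it assumes the default JAXP parser bundled with the JDK. Extending DefaultHandler supplies do-nothing versions of the callbacks that are not of interest here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the callbacks described above.
class CallbackLogger extends DefaultHandler {
    private Locator locator;                    // usable only during a parse
    private int ignorableChars;
    final List<String> log = new ArrayList<>();

    @Override public void setDocumentLocator(Locator l) {
        locator = l;                            // just save it
    }

    @Override public void startDocument() {
        log.clear();                            // per-document state starts here
        ignorableChars = 0;
        log.add("start");
    }

    @Override public void processingInstruction(String target, String data) {
        log.add("pi " + target + " [" + data + "]");
    }

    @Override public void ignorableWhitespace(char[] buf, int offset, int len) {
        ignorableChars += len;                  // count it, then drop it
    }

    @Override public void skippedEntity(String name) {
        log.add("skipped " + name);             // part of the document was hidden
    }

    @Override public void endDocument() {
        log.add("end, ignorable chars: " + ignorableChars);
        locator = null;                         // stale once the parse is over
    }

    // Parses in-memory text with the JDK's default JAXP parser.
    static List<String> run(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CallbackLogger handler = new CallbackLogger();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<!DOCTYPE doc [<!ELEMENT doc (item*)> <!ELEMENT item EMPTY>]>\n"
            + "<doc>\n  <item/>\n  <?render mode=\"draft\"?>\n</doc>";
        for (String line : run(xml)) {
            System.out.println(line);
        }
    }
}
```

Whether the whitespace around item shows up through ignorableWhitespace() or through characters() depends on the parser having seen the content model, so the count printed at the end may differ between parsers.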
The Locator interface is useful but sometimes overlooked.
It gives information that is essential for providing
location-sensitive diagnostics and is often given to
SAXParseException constructors.
That same information is also needed to resolve relative URIs
in document content or attribute values (such as
xml:base
).
Parsers provide one object implementing this interface, which can be
used inside event callbacks to find what entity triggered
the event and approximately where.
Use that locator only during such callbacks.
The interface has only a few methods.
-
String getSystemId ()
This is the most important method in this interface. It returns the base URI (system ID) for the entity being parsed; this is always an absolute URI. (However, versions of Xerces that are current at this writing have a bug here. They sometimes return nonabsolute URIs.) Use this method to identify the document or external entity in diagnostics or to resolve relative URIs (perhaps in conjunction with xml:base attributes).
If the parser doesn’t know this value, null is returned. This normally indicates that the parser was not given such a URI inside of an InputSource encapsulating document text. That’s bad practice except when it’s unavoidable, such as parsing in-memory data or input to the POST method in a servlet.
-
int getLineNumber (), int getColumnNumber ()
These two functions approximate the current position of a parser within an entity. The position reflected is where the relevant event’s data ended. It is only an approximation for diagnostics, but most parsers do try to be accurate about the line number.
These numbers count up from 1 as appropriate for user-oriented diagnostics. Not all implementations will provide these values; the value -1 is returned to indicate that no value was provided.
-
String getPublicId ()
A public identifier may be provided with this method. Otherwise null is returned. This may be useful for diagnostics in some cases.
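As a rough illustration of the methods just listed, the following sketch formats a position for a diagnostic while tolerating a missing locator, a null system ID, and the -1 values. The class name WhereAmI, the describe() helper, and the example.com URI are placeholders, and the default JAXP parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reports where each element was seen.
class WhereAmI extends DefaultHandler {
    private Locator locator;

    @Override public void setDocumentLocator(Locator l) { locator = l; }

    @Override public void startElement(String uri, String lname,
                                       String qname, Attributes atts) {
        System.out.println(qname + " reported near " + describe(locator));
    }

    @Override public void endDocument() { locator = null; }

    // Builds "systemId:line:column", leaving out whatever is unknown.
    static String describe(Locator l) {
        if (l == null) {
            return "(no locator)";
        }
        String id = l.getSystemId();
        int line = l.getLineNumber();
        int column = l.getColumnNumber();
        return (id == null ? "(unknown entity)" : id)
            + (line == -1 ? "" : ":" + line)
            + (column == -1 ? "" : ":" + column);
    }

    public static void main(String[] args) throws Exception {
        InputSource in =
            new InputSource(new StringReader("<doc>\n  <item/>\n</doc>"));
        // In-memory text has no URI of its own, so one is supplied here.
        in.setSystemId("http://www.example.com/doc.xml");
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new WhereAmI());
        reader.parse(in);
    }
}
```

Keeping the formatting in one static helper means the same text can be reused in log messages and in exception messages.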
One common use for a locator is to report an error detected while an application processes document content. The SAXParseException class has two constructors that take locator parameters. (The descriptive string is always first, the locator is second, and an optional “root cause” exception is third.) Once you create such an exception, it can be thrown directly, which always terminates a parse. Or you can pass it to an ErrorHandler to centralize error-handling policy in your application:
// "locator" was saved when setDocumentLocator() was called earlier
// or was initialized to null; this is safe in both cases
try {
    ...
    engine.setWarpFactor (11);
    ...
} catch (DriveException e) {
    SAXParseException spe = new SAXParseException (
        "The warp engine's gonna blow!", locator, e);
    // report the wrapped exception, which carries the location
    errHandler.error (spe);
    // we'll get here whenever such problems are ignored
}
To resolve relative URIs in document content (for example, one found in an <xhtml:a href="..."/> reference in a link checker), you’d use code like this, ignoring xml:base complications:
public void startElement (String uri, String lname, String qname, Attributes atts)
throws SAXException
{
    if (xhtmlURI.equals (uri)) {
        if ("a".equals (lname)) {
            String href = atts.getValue ("href");
            if (href != null) {
                // ASSUMES: locator is nonnull
                try {
                    // resolve against the entity's base URI (java.net.URI)
                    URI base = new URI (locator.getSystemId ());
                    System.out.println ("Found href to: " + base.resolve (href));
                } catch (URISyntaxException e) {
                    throw new SAXParseException ("bad base URI", locator, e);
                }
            }
            // else presumably <xhtml:a name="..."/>
        }
    }
    ...
}
Some of the XMLReader
implementations cannot possibly call
ContentHandler.setDocumentLocator()
with a Locator.
When parsing in-memory data structures, such as a DOM document,
a locator will normally be meaningless.
When parsing in-memory buffers like a String (with
a StringReader), there won’t
usually be a URI in the locator.
If your application supports the layered
xml:base
convention (which lets documents
“lie” about their true locations for purposes of resolving
relative URIs), it will need to track those
attributes itself, as part of a context stack mechanism.
(An example of such a stack is shown later, in
Example 5-1.)
Such attributes can sometimes help make up for SAX event
sources that can’t provide locator information, such as
DOM-to-SAX producers.
But they can confuse things too: in the following
example, xml:base
would apply to the
top element and its direct children, but nothing
within the external entity reference.
(Let’s assume, for the sake of discussion, that no other element
has an xml:base attribute.)
<top xml:base="http://www.example.com/moved/doc2.xml">
    <xhtml:a href="abc.xml"/>
    <xhtml:div>
        &external;
    </xhtml:div>
    <xhtml:a href="xyz.xml"/>
</top>
When character content of an element is reported, characters from different external entities will get different callbacks, so the locator can be used to tell those different entities apart from each other.
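The context-stack idea for xml:base can be sketched roughly as follows. This is not the stack from Example 5-1; the class BaseTracker, its Frame entries, and the attr() helper are invented here, and the sketch assumes namespace processing is enabled and that system IDs are valid URIs. Each open element gets a stack entry recording which entity it came from, so an element reported from a different entity starts over from that entity’s own base URI instead of inheriting from its parent.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Objects;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical handler that tracks xml:base while collecting link targets.
class BaseTracker extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // One entry per open element: its entity and its effective base URI.
    private static final class Frame {
        final String entity;
        final URI base;
        Frame(String entity, URI base) { this.entity = entity; this.base = base; }
    }

    private final Deque<Frame> stack = new ArrayDeque<>();
    private Locator locator;
    final List<URI> links = new ArrayList<>();

    @Override public void setDocumentLocator(Locator l) { locator = l; }

    @Override public void startDocument() { stack.clear(); links.clear(); }

    @Override public void startElement(String uri, String lname,
                                       String qname, Attributes atts) {
        String entity = (locator == null) ? null : locator.getSystemId();
        Frame parent = stack.peek();
        URI base;
        if (parent != null && Objects.equals(parent.entity, entity)) {
            base = parent.base;               // same entity: inherit from parent
        } else {
            base = (entity == null) ? null : URI.create(entity);  // new entity
        }
        String declared = atts.getValue(XML_NS, "base");
        if (declared != null) {               // xml:base overrides for this subtree
            base = (base == null) ? URI.create(declared) : base.resolve(declared);
        }
        stack.push(new Frame(entity, base));

        String href = atts.getValue("", "href");
        if (XHTML_NS.equals(uri) && "a".equals(lname) && href != null) {
            links.add(base == null ? URI.create(href) : base.resolve(href));
        }
    }

    @Override public void endElement(String uri, String lname, String qname) {
        stack.pop();
    }

    @Override public void endDocument() { locator = null; }

    // Convenience for feeding events by hand: a single-attribute list.
    static Attributes attr(String uri, String lname, String qname, String value) {
        AttributesImpl a = new AttributesImpl();
        a.addAttribute(uri, lname, qname, "CDATA", value);
        return a;
    }

    public static void main(String[] args) {
        LocatorImpl where = new LocatorImpl();
        where.setSystemId("http://www.example.com/doc.xml");
        BaseTracker t = new BaseTracker();
        t.setDocumentLocator(where);
        t.startDocument();
        t.startElement("", "top", "top",
            attr(XML_NS, "base", "xml:base", "http://www.example.com/moved/doc2.xml"));
        t.startElement(XHTML_NS, "a", "xhtml:a", attr("", "href", "href", "abc.xml"));
        t.endElement(XHTML_NS, "a", "xhtml:a");
        t.endElement("", "top", "top");
        System.out.println(t.links);  // [http://www.example.com/moved/abc.xml]
        t.endDocument();
    }
}
```

The main() method feeds events by hand, which is also how a DOM-to-SAX producer without a real locator would drive the handler; with no locator at all, every element simply inherits from its parent.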
One of the goals of XML was to bring Unicode into widespread use so that the Web could really become worldwide in terms of people, not just technology. This brings several concerns into text management. You may not need to worry about these if you’re working only in ASCII or with just one character encoding. While you’re just starting out with Java and XML you should certainly avoid worrying about these details. Some other users of SAX2 will need to understand these issues. Since they surface primarily with ContentHandler event callbacks, we briefly summarize them here.
If your application works with MathML, or in various
languages whose character sets gained support in Unicode 3.1
through the so-called Astral Planes, you will need to know
that what Java calls a char
is not really
the same thing as a Unicode character or an XML character.
If you aren’t using such languages, you’ll probably be able
to ignore this issue for a while. Still, you might want to
read about Unicode 3.1 to learn more about this and minimize
trouble later.
By the time you read this, the W3C may even have
completed its “Blueberry” XML update, intended to allow
the use of some such characters within XML names.
Characters whose Unicode code point
is above the value U+FFFF
(the maximum
16-bit code point) are mapped to two
Java char
values, called a
surrogate pair.
The char
values are in a range reserved
for surrogate characters, with a
high surrogate always
immediately followed by a low surrogate.
(This is called a big-endian sequence.)
Surrogate pairs can show up in several places in XML,
and hence in SAX2:
in character content, processing instructions, attribute
values (including defaults in the DTD), and comments.
At this time, Java does not have APIs to explicitly
support characters using surrogate pairs, although character
arrays and java.lang.String will hold
them as if the char
values weren’t part of
the same character.
The java.lang.Character class doesn’t
recognize surrogate pairs.
The best precaution seems to be to prefer APIs that talk in
terms of slices of character arrays (or
Strings), rather than
in terms of individual Java char
values.
This approach also handles other situations where more
than one char
value is needed per character.
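Java releases from J2SE 5.0 onward added code-point-aware methods to java.lang.Character and java.lang.String, and those methods fit the slice-oriented style recommended here. The sketch below assumes such a release; the class name CodePoints and its two helpers are illustrative only.

```java
// Hypothetical helpers for working with characters rather than char values.
class CodePoints {
    // Counts Unicode characters in a slice such as one passed to characters().
    static int count(char[] buf, int offset, int len) {
        return Character.codePointCount(buf, offset, len);
    }

    // Visits each character as one int code point, never half of a pair.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.append(String.format("U+%04X ", cp));
            i += Character.charCount(cp);   // 2 for a surrogate pair, else 1
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D54F, a mathematical double-struck letter, lies above U+FFFF.
        String text = "a" + new String(Character.toChars(0x1D54F)) + "b";
        char[] buf = text.toCharArray();
        System.out.println(text.length());              // 4 char values
        System.out.println(count(buf, 0, buf.length));  // 3 characters
        System.out.println(describe(text));             // U+0061 U+1D54F U+0062
    }
}
```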
Depending on the character encodings you’re using
and the applications you’re implementing, you may also need
to pay attention to the W3C Character Model
(http://www.w3.org/TR/charmod/
at this writing) and Unicode Normalization Form C.
Briefly, these aim to eliminate undesirable representations
of characters and to handle some other cases where Unicode
characters aren’t the same as XML characters or a Java
char
, such as composite characters.
For example, many accented characters are represented by
composing two or more Unicode characters.
Systems work better when they only need to handle one way to
represent such characters, and Form C addresses that problem.
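Normalization Form C can be tried out with java.text.Normalizer, which has been part of the JDK since Java 6. This is only a minimal sketch under that assumption; the class name NfcDemo and the toNfc() helper are made up. Since a parser may deliver one run of text through several characters() calls, a cautious application would presumably normalize the accumulated text rather than each slice on its own.

```java
import java.text.Normalizer;

// Hypothetical demonstration of composing an accented character.
class NfcDemo {
    // Returns the text in Unicode Normalization Form C.
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two Unicode characters
        String decomposed = "e\u0301";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());          // 2
        System.out.println(composed.length());            // 1
        System.out.println(composed.equals("\u00e9"));    // true
        System.out.println(
            Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));  // false
    }
}
```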
[19] For example, the syntax of PHP, the web page scripting tool, looks like a processing instruction, <?php ...?>. For various reasons, PHP is not actually an XML document syntax.