When a user tries to save a document as XML, Word presents several options. The “Save As” dialog, shown again in Figure 4-21, includes two checkboxes representing XML save options: the “Apply transform” checkbox and the “Save data only” checkbox. These options correspond to the final two (optional) processes in our processing model diagram (Figure 4-7).
Rather than relying solely on the user to make the right choice, you can specify default save settings for a particular document, obviating the need for user intervention. You can set these through the Word UI (in the Tools → Templates and Add-Ins... → XML Schema → XML Options dialog), or by declaring them in the underlying WordprocessingML representation. In our primary XML editing scenario, the onload XSLT transformation that Word applies when opening the document determines the default XML save settings for that document.
In our press release template, the onload
stylesheet turns “Save data only”
off and “Apply custom
transform” on. It does this by
generating declarations for these settings inside the
w:docPr
element. Below is the relevant excerpt
from the stylesheet:
<w:docPr>
  <!-- ... -->
  <w:removeWordSchemaOnSave w:val="off"/>
  <w:useXSLTWhenSaving/>
  <w:saveThroughXSLT w:xslt="\\intra\pr\harvestPressRelease.xsl"/>
</w:docPr>
The w:removeWordSchemaOnSave
element corresponds
to the “Save data only” option.
Here, it is explicitly turned off. The
w:useXSLTWhenSaving
element turns the
“Apply custom transform” option on.
Finally, the w:saveThroughXSLT
element specifies
the file name of the particular XSLT stylesheet to apply when the
w:useXSLTWhenSaving
option is turned on.
When the “Save data only” option is
turned on (via the w:removeWordSchemaOnSave
element),
Word strips all WordprocessingML markup from
the document when the user saves it, leaving only custom XML elements
and attributes. This is the same process that Word uses to prepare an
embedded XML document for schema validation. In both cases, the
“Ignore mixed
content” document option parameterizes the behavior
of the process, optionally causing it to subsequently strip out
remaining mixed content text after it has stripped out the
WordprocessingML markup.
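As a minimal sketch, the following w:docPr fragment shows how a document might declare both of the options just described. It assumes that w:removeWordSchemaOnSave and w:ignoreMixedContent are switched on the same way w:useXSLTWhenSaving is in the earlier excerpt, that is, by being present without w:val="off". The order of the child elements is illustrative only and may need adjusting to satisfy the WordprocessingML schema.

```xml
<w:docPr xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <!-- ... -->
  <!-- "Save data only" on: strip WordprocessingML markup when saving -->
  <w:removeWordSchemaOnSave/>
  <!-- "Ignore mixed content" on: also strip mixed content text -->
  <w:ignoreMixedContent/>
</w:docPr>
```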
Unlike Word’s default onload
rendering process for arbitrary XML documents (which the
XML2WORD.XSL
stylesheet implements), its default
onsave
process (“Save data
only”) is not implemented in an XSLT stylesheet that
you can view—at least not one that’s included
in the files installed with Office. However, since it is important to
understand exactly what this process does, we’ve
included in Example 4-9 an XSLT stylesheet that
approximates its behavior. This stylesheet is designed to produce the
exact same result as the “Save data
only” process, when selected as the transform to
apply when saving a document.[3]
Example 4-9. An approximation of the “Save data only” process, saveDataOnly.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
    xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:v="urn:schemas-microsoft-com:office:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
    xmlns:st="urn:schemas-microsoft-com:office:smarttags">

  <!-- UTF-8 encoding and standalone declaration -->
  <xsl:output encoding="UTF-8" standalone="no"/>

  <!-- ***************************************************************** -->
  <!-- Global Variables -->
  <!-- ***************************************************************** -->

  <!-- True if w:ignoreMixedContent is present and @w:val isn't "off" -->
  <xsl:variable name="ignoreMixedContent"
      select="/w:wordDocument/w:docPr/w:ignoreMixedContent[not(@w:val='off')]"/>

  <!-- Result of first pass (before optionally stripping mixed content text) -->
  <xsl:variable name="first-pass-result">
    <xsl:apply-templates select="/*"/>
  </xsl:variable>

  <!-- ***************************************************************** -->
  <!-- Template rules in default mode -->
  <!-- ***************************************************************** -->

  <!-- Start here -->
  <xsl:template match="/">
    <!-- line break after XML declaration -->
    <xsl:text>&#xA;</xsl:text>
    <!-- Re-create any PIs preserved inside o:CustomDocumentProperties -->
    <xsl:call-template name="create-pis">
      <xsl:with-param name="escaped-pis" select="string(
          /w:wordDocument/o:CustomDocumentProperties/o:processingInstructions)"/>
    </xsl:call-template>
    <!-- Apply a second pass to strip mixed content text
         only if $ignoreMixedContent is true -->
    <xsl:choose>
      <xsl:when test="$ignoreMixedContent">
        <xsl:apply-templates select="msxsl:node-set($first-pass-result)/node( )"
                             mode="strip-mixed-content"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="$first-pass-result"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Replicate all elements by default
       (filtering out unnecessary namespace nodes) -->
  <xsl:template match="*">
    <xsl:element name="{local-name( )}" namespace="{namespace-uri( )}">
      <xsl:apply-templates select="@*|node( )"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy attributes by default -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>

  <!-- Preserve text inside w:t elements (other than headers, footers, etc.) -->
  <xsl:template match="w:t[not(ancestor::w:sectPr)]/text( )">
    <xsl:copy/>
  </xsl:template>

  <!-- Strip out all other text (field instructions, doc properties, etc.) -->
  <xsl:template match="text( )"/>

  <!-- Process children of, but do not copy, elements in Word's namespaces -->
  <xsl:template match="w:*|sl:*|aml:*|wx:*|w10:*|v:*|o:*|dt:*|st:*">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Strip out all attributes in Word's namespaces -->
  <xsl:template match="@w:*|@sl:*|@aml:*|@wx:*|@w10:*|@v:*|@o:*|@dt:*|@st:*"/>

  <!-- ***************************************************************** -->
  <!-- Template rules in "strip-mixed-content" mode -->
  <!-- ***************************************************************** -->

  <!-- Copy elements, attributes, PIs, and text straight through -->
  <xsl:template match="@*|node( )" mode="strip-mixed-content">
    <xsl:copy>
      <xsl:apply-templates select="@*|node( )" mode="strip-mixed-content"/>
    </xsl:copy>
  </xsl:template>

  <!-- But strip out mixed content text -->
  <xsl:template match="text( )[preceding-sibling::* or following-sibling::*]"
                mode="strip-mixed-content"/>

  <!-- ***************************************************************** -->
  <!-- Named templates -->
  <!-- ***************************************************************** -->

  <!-- For re-creating PIs stored as text in o:CustomDocumentProperties;
       (See XML2WORD.XSL) -->
  <xsl:template name="create-pis">
    <xsl:param name="escaped-pis"/>
    <xsl:if test="$escaped-pis">
      <!-- '&#160;' is the non-breaking space that separates
           the PI target from the PI text -->
      <xsl:processing-instruction name="{substring-before(
          substring-after($escaped-pis,'&lt;?'), '&#160;')}">
        <xsl:value-of select="substring-before(
            substring-after($escaped-pis,'&#160;'), '?>')"/>
      </xsl:processing-instruction>
      <xsl:text>&#xA;</xsl:text>
      <xsl:call-template name="create-pis">
        <xsl:with-param name="escaped-pis"
                        select="substring-after($escaped-pis,'?>')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
The default-mode template rules in Example 4-9 (those matching elements, attributes, and text nodes, including the rules for Word's namespaces) define the essence of what the “Save data only” process does. They strip out elements and attributes in any of the Word-specific namespaces but preserve all elements and attributes in other namespaces. The rest of the stylesheet is concerned with implementing two other features of the “Save data only” process: stripping mixed content and preserving processing instructions.
Like Word’s built-in “Save data only” process, the stylesheet in Example 4-9 alters its behavior according to whether the “Ignore mixed content” document option is turned on or off.
First, the stylesheet defines a global variable named
$ignoreMixedContent
that is true as long as the
w:ignoreMixedContent
element is present and is not
turned off.
<!-- True if w:ignoreMixedContent is present and @w:val isn't "off" -->
<xsl:variable name="ignoreMixedContent"
    select="/w:wordDocument/w:docPr/w:ignoreMixedContent[not(@w:val='off')]"/>
Then, after stripping out the Word-specific markup, the stylesheet
further processes the document if and only if
$ignoreMixedContent
is true. This is implemented
as a second pass (with the help of the msxsl:node-set( )
extension function):
<!-- Apply a second pass to strip mixed content text
     only if $ignoreMixedContent is true -->
<xsl:choose>
  <xsl:when test="$ignoreMixedContent">
    <xsl:apply-templates select="msxsl:node-set($first-pass-result)/node( )"
                         mode="strip-mixed-content"/>
  </xsl:when>
  <xsl:otherwise>
    <xsl:copy-of select="$first-pass-result"/>
  </xsl:otherwise>
</xsl:choose>
Finally, the template rules in the
strip-mixed-content
mode effect an identity
transformation with one exception. The operative template rule strips
out all mixed content text in the document, i.e., all text nodes that
have any element siblings, by doing nothing:
<xsl:template match="text( )[preceding-sibling::* or following-sibling::*]"
              mode="strip-mixed-content"/>
Thus, the saveDataOnly.xsl
stylesheet behaves
like the “Save data only” process,
stripping out mixed content text only if the “Ignore
mixed content” document option is turned on.
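To see the effect of that rule in isolation, here is a small standalone sketch that applies the same two rules in the default mode rather than in the strip-mixed-content mode. The memo, to, body, and em element names are made up for illustration and are not part of the press release schema.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical input (element names are made up for illustration):

       <memo>
         <to>Editors</to>
         stray note
         <body>Launch is <em>Friday</em>.</body>
       </memo>

       Result: <memo><to>Editors</to><body><em>Friday</em></body></memo>
  -->

  <xsl:output omit-xml-declaration="yes"/>

  <!-- Identity rule: copy everything straight through -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Exception: drop any text node that has an element sibling -->
  <xsl:template match="text()[preceding-sibling::* or following-sibling::*]"/>

</xsl:stylesheet>
```

Note that the text in the body element is lost along with the stray note, because it sits next to the em element. Text that is the only content of its parent, such as "Editors" and "Friday", survives.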
When opening an arbitrary XML document that has one or more processing instructions (PIs) outside the root element, Word’s default onload stylesheet (XML2WORD.XSL) preserves those PIs by escaping the PI markup as text and storing the resulting string in a custom document property named o:processingInstructions (in the o:CustomDocumentProperties element). Then, when the user saves the document, the “Save data only” process converts the escaped PI markup back to literal processing instructions in the final XML document saved by Word.
The saveDataOnly.xsl
stylesheet in Example 4-9 exhibits the same behavior. First, it calls a
named template, passing it the string value of the
o:processingInstructions
element:
<!-- Re-create any PIs preserved inside o:CustomDocumentProperties -->
<xsl:call-template name="create-pis">
  <xsl:with-param name="escaped-pis" select="string(
      /w:wordDocument/o:CustomDocumentProperties/o:processingInstructions)"/>
</xsl:call-template>
Then, the template named create-pis
does the
actual work of converting the value of the
$escaped-pis
parameter to real processing
instructions in the result document. It recursively parses the
escaped PI markup until no processing instructions are left:
<!-- For re-creating PIs stored as text in o:CustomDocumentProperties;
     (See XML2WORD.XSL) -->
<xsl:template name="create-pis">
  <xsl:param name="escaped-pis"/>
  <xsl:if test="$escaped-pis">
    <!-- '&#160;' is the non-breaking space that separates
         the PI target from the PI text -->
    <xsl:processing-instruction name="{substring-before(
        substring-after($escaped-pis,'&lt;?'), '&#160;')}">
      <xsl:value-of select="substring-before(
          substring-after($escaped-pis,'&#160;'), '?>')"/>
    </xsl:processing-instruction>
    <xsl:text>&#xA;</xsl:text>
    <xsl:call-template name="create-pis">
      <xsl:with-param name="escaped-pis"
                      select="substring-after($escaped-pis,'?>')"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>
This PI re-creation process only works when the
onload
stylesheet
preserves the PIs in exactly the way that the “Save
data only” process expects. If you want your own
custom onload
stylesheets to preserve PIs, take
a look at the XML2WORD.XSL
file to see exactly
how it’s done. Basically, it converts a single PI to
a string with these components:
'<?' <PITarget> <nbsp> <PIText> '?>'
Each subsequent escaped PI is concatenated to the end of the last
one. And the final value is stored in the
o:processingInstructions
element.
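For illustration, a source document with two PIs might end up stored roughly as follows. This is a sketch based on the format just described: the stylesheet file name pr.xsl is hypothetical, and any additional attributes that Word places on the custom property element are omitted here.

```xml
<o:CustomDocumentProperties xmlns:o="urn:schemas-microsoft-com:office:office">
  <!-- Two escaped PIs, concatenated; &#160; separates each target from its text -->
  <o:processingInstructions>&lt;?xml-stylesheet&#160;type="text/xsl" href="pr.xsl"?&gt;&lt;?mso-application&#160;progid="Word.Document"?&gt;</o:processingInstructions>
</o:CustomDocumentProperties>
```

Only the first non-breaking space in each escaped PI acts as the separator. The ordinary space between the type and href pseudo-attributes is part of the PI text and is carried through unchanged.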
In our press release template, the onload
stylesheet preserves PIs from the source document in the same way
that the XML2WORD.XSL
stylesheet does. However,
rather than using the “Save data
only” process to re-create the PIs, the press
release template declares its own custom onsave
stylesheet, which re-creates them in the same way that the
“Save data only” process would
have. Of course, when you have control over both the
onload
and onsave
stylesheets, you can choose whatever mechanism you’d
like for preserving PIs. The press release template could have used a
different approach, but the approach used by
XML2WORD.XSL
and the “Save data
only” process works perfectly fine. Rather than
reinventing the wheel, the press release template takes the same
approach.
One favorable consequence of preserving processing instructions from
the source document is that the mso-application
PI
is preserved in XML documents that Word edits, retaining the
file’s association with the Word application. This
means that users don’t have to do anything special
to open the file in Word; they just double-click it like any other
Word document. Conversely, the mso-application
PI
is only present in the saved document when it was already present in
the XML document that Word opened. Word does not automatically output
the mso-application
PI whenever it saves a custom
XML document. On the contrary, it is quite possible to open, edit,
and save XML documents in Word without leaving any evidence that Word
was ever used to edit the file. The point is that you as the
developer do have control over what processing instructions appear in
the result.
Tip
To force the presence of the mso-application (or any other) processing instruction in your result document (regardless of whether it was present in the source document), you can simply use the xsl:processing-instruction element in your onsave stylesheet. Or, if you are using “Save data only” with no onsave stylesheet, you can use your onload stylesheet to hard-code the PI into the list of escaped PIs in the o:processingInstructions custom document property. In this case, the “Save data only” process will regenerate the PI just as if it had been preserved from the source document.
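The first approach might look something like the following sketch of an onsave stylesheet. It assumes that saveDataOnly.xsl from Example 4-9 sits alongside it and that the source document did not already carry an mso-application PI; otherwise the imported create-pis template would emit a second copy.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Reuse the "Save data only" approximation from Example 4-9 -->
  <xsl:import href="saveDataOnly.xsl"/>

  <xsl:template match="/">
    <!-- Always associate the saved file with Word -->
    <xsl:processing-instruction name="mso-application">progid="Word.Document"</xsl:processing-instruction>
    <!-- Then let the imported root rule do the rest of the work -->
    <xsl:apply-imports/>
  </xsl:template>

</xsl:stylesheet>
```

Using xsl:apply-imports keeps the override small: the imported rule for the root node still re-creates any preserved PIs and runs the optional mixed-content pass.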
The “Apply
Custom Transform” document option allows you to save
an XML document through an onsave
XSLT
stylesheet. As reflected in our original processing model diagram in
Figure 4-7, what document the
onsave
stylesheet is applied to depends on
whether the “Save data only” option
is turned on. If “Save data only”
is turned off, then the onsave
stylesheet is
applied directly to the WordprocessingML document. If
“Save data only” is turned on, then
the onsave
stylesheet is applied to the result
of stripping the Word-specific markup from the merged XML and
WordprocessingML view.
Our press release template uses an onsave stylesheet called harvestPressRelease.xsl. Since the “Save data only” option is turned off, this stylesheet is applied to the entire WordprocessingML document when the user saves it. The purpose of harvestPressRelease.xsl is to behave just like the “Save data only” process, with some notable exceptions: it converts w:p elements in the body of the press release to para elements in the result, and it converts a run with the “Lead-in Emphasis” style to a leadIn element in the result.
The harvestPressRelease.xsl
stylesheet behaves
just like the “Save data only”
process in the sense that it strips out all Word-specific markup from
the result, and, except for the para
element, it
leaves all custom tags intact. It turns out that the
saveDataOnly.xsl
stylesheet introduced in the
last section possesses more than academic interest. It not only can
be used to understand the precise behavior of the
“Save data only” process, i.e., as
a learning aid, but it can also be used directly by custom
onsave
stylesheets that want to slightly alter
its behavior. Our press release template’s
onsave
stylesheet does just that—it
imports the saveDataOnly.xsl
stylesheet,
selectively modifying its behavior. Example 4-10
shows harvestPressRelease.xsl
in its entirety.
Example 4-10. The onsave stylesheet for the press release template, harvestPressRelease.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:pr="http://xmlportfolio.com/pressRelease"
    xmlns="http://xmlportfolio.com/pressRelease"
    exclude-result-prefixes="w pr">

  <xsl:import href="saveDataOnly.xsl"/>

  <!-- Skip by the single surrogate paragraph -->
  <xsl:template match="pr:para">
    <!-- Apply templates to all non-empty Word paragraphs -->
    <xsl:apply-templates select="w:p[normalize-space(.)]"/>
  </xsl:template>

  <!-- Convert w:p elements inside PR body to para elements -->
  <xsl:template match="pr:para/w:p">
    <para>
      <!-- This element contains mixed content; explicitly preserve space -->
      <xsl:attribute name="xml:space">preserve</xsl:attribute>
      <xsl:apply-templates/>
    </para>
  </xsl:template>

  <!-- Convert "Lead-in Emphasis" runs to leadIn elements -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val =
                           /w:wordDocument/w:styles/w:style
                             [w:name/@w:val='Lead-in Emphasis']/@w:styleId]">
    <!-- Only process this run if the immediately preceding run
         does not have the same style -->
    <xsl:if test="not(preceding-sibling::w:r[1]
                        [w:rPr/w:rStyle/@w:val =
                         current( )/w:rPr/w:rStyle/@w:val])">
      <leadIn>
        <xsl:call-template name="merge-adjacent-style-runs"/>
      </leadIn>
    </xsl:if>
  </xsl:template>

  <!-- Merge adjacent runs that have the same style -->
  <xsl:template name="merge-adjacent-style-runs" match="w:r" mode="merge-runs">
    <xsl:apply-templates/>
    <!-- Recursively apply to the immediately following run
         only if it has the same style -->
    <xsl:apply-templates select="following-sibling::w:r[1]
                                   [w:rPr/w:rStyle/@w:val =
                                    current( )/w:rPr/w:rStyle/@w:val]"
                         mode="merge-runs"/>
  </xsl:template>

  <!-- Override mixed-content-stripping for text inside pr:para elements -->
  <xsl:template match="pr:para/text( )" mode="strip-mixed-content">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
As you can see, this stylesheet imports the
saveDataOnly.xsl
stylesheet we looked at
earlier:
<xsl:import href="saveDataOnly.xsl"/>
Now, let’s briefly walk through each template rule in the stylesheet. The first custom rule that will get triggered is also the first one listed in the document. It matches the single pr:para element (where pr maps to the press release namespace) that contains the body text of the press release. Rather than creating a shallow copy of the element, as saveDataOnly.xsl would have done by default, it instructs processing to skip by the element altogether and to process its non-empty paragraph (w:p) children instead:
<!-- Skip by the single surrogate paragraph -->
<xsl:template match="pr:para">
  <!-- Apply templates to all non-empty Word paragraphs -->
  <xsl:apply-templates select="w:p[normalize-space(.)]"/>
</xsl:template>
The next template rule matches the paragraph (w:p) children of pr:para. Each w:p element is effectively replaced by a para element (in the press release namespace). The xml:space="preserve" attribute is programmatically added to the result so that Word (and other potential processes) won’t strip out what they deem to be insignificant whitespace from the document when loading it again. Since the para element contains mixed content, all child text nodes, including whitespace-only text nodes, should be considered significant:
<!-- Convert w:p elements inside PR body to para elements -->
<xsl:template match="pr:para/w:p">
  <para>
    <!-- This element contains mixed content; explicitly preserve space -->
    <xsl:attribute name="xml:space">preserve</xsl:attribute>
    <xsl:apply-templates/>
  </para>
</xsl:template>
The next template rule gets triggered by runs that have the “Lead-in Emphasis” character style. The purpose of this template rule is to convert such runs into leadIn elements. However, its job is complicated by the fact that Word has a tendency to output adjacent runs that have the same style. Rather than creating a separate leadIn element for each of these, this template rule, with help from the recursive template named merge-adjacent-style-runs, merges adjacent runs in the same style so that only one leadIn element is created per contiguous sequence:
<!-- Convert "Lead-in Emphasis" runs to leadIn elements -->
<xsl:template match="w:r[w:rPr/w:rStyle/@w:val =
                         /w:wordDocument/w:styles/w:style
                           [w:name/@w:val='Lead-in Emphasis']/@w:styleId]">
  <!-- Only process this run if the immediately preceding run
       does not have the same style -->
  <xsl:if test="not(preceding-sibling::w:r[1]
                      [w:rPr/w:rStyle/@w:val =
                       current( )/w:rPr/w:rStyle/@w:val])">
    <leadIn>
      <xsl:call-template name="merge-adjacent-style-runs"/>
    </leadIn>
  </xsl:if>
</xsl:template>
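To make the merging concrete, consider the following hypothetical fragment, taken to be part of a larger WordprocessingML document whose w:styles section defines the “Lead-in Emphasis” style. The style ID LeadInEmphasis and the text are assumptions made for this sketch.

```xml
<!-- Hypothetical input: two adjacent runs share the character style whose
     w:styleId is assumed here to be "LeadInEmphasis" -->
<pr:para xmlns:pr="http://xmlportfolio.com/pressRelease"
         xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:p>
    <w:r>
      <w:rPr><w:rStyle w:val="LeadInEmphasis"/></w:rPr>
      <w:t>SEATTLE, </w:t>
    </w:r>
    <w:r>
      <w:rPr><w:rStyle w:val="LeadInEmphasis"/></w:rPr>
      <w:t>May 1</w:t>
    </w:r>
    <w:r>
      <w:t>: The company announced a new product.</w:t>
    </w:r>
  </w:p>
</pr:para>

<!-- Result expected from harvestPressRelease.xsl for this paragraph
     (namespace declaration omitted):
     <para xml:space="preserve"><leadIn>SEATTLE, May 1</leadIn>: The company announced a new product.</para>
-->
```

The first styled run opens the leadIn element and pulls in the text of the second through the merge-runs mode. When the second run is later visited in the default mode, the xsl:if test fails and it contributes nothing further.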
Finally, harvestPressRelease.xsl must override one other aspect of saveDataOnly.xsl’s behavior. Rather than strip out all mixed content text (which saveDataOnly.xsl does when “Ignore mixed content” is turned on, as it is in the press release template), it must preserve the mixed content text found inside the newly created pr:para elements. It does this by overriding the default template rule for text nodes in the strip-mixed-content mode, explicitly copying text nodes that are children of pr:para elements:
<!-- Override mixed-content-stripping for text inside pr:para elements -->
<xsl:template match="pr:para/text( )" mode="strip-mixed-content">
  <xsl:copy/>
</xsl:template>
Thus, the harvestPressRelease.xsl stylesheet behaves very similarly to Word’s “Save data only” process. In fact, for most of the elements in a press release document, it behaves identically, thanks to the saveDataOnly.xsl stylesheet that it imports. However, by incrementally overriding the default behavior of saveDataOnly.xsl, it enables limited but effective support for repeating paragraphs and mixed content.
Between the “Save data only” and “Apply custom transform” options, there are four possible combinations. When does it make sense to choose one combination over another? Table 4-1 lists some possible use cases for each combination.
Table 4-1. XML save settings and corresponding use cases
| “Save data only” | “Apply custom transform” | Example use cases |
|---|---|---|
| off | off | |
| on | off | Saving custom markup only (most common configuration for Smart Documents) |
| off | on | Converting Word paragraphs to custom elements; converting styled text to custom elements |
| on | on | Converting elements back to attributes; re-ordering or otherwise re-structuring the document |
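As an illustration of the last row, here is a minimal sketch of an onsave stylesheet meant to run with “Save data only” turned on, so that it sees only the custom markup. The product, id, and name element names are hypothetical; the sketch assumes the editable document carries the identifier as a child element that should become an attribute again on save.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical input:  <product><id>42</id><name>Widget</name></product>
       Result:              <product id="42"><name>Widget</name></product>
  -->

  <!-- Identity rule: copy the data-only document through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Fold the hypothetical id child element back into an id attribute -->
  <xsl:template match="product[id]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="id">
        <xsl:value-of select="normalize-space(id)"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()[not(self::id)]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the stylesheet never needs to inspect WordprocessingML markup, it can stay this simple. That is the practical benefit of letting “Save data only” run first.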
When you are using an
onsave
XSLT stylesheet and you need to decide
whether or not to turn “Save data
only” on, ask yourself these questions: Is all the
information I need to create my final, saved XML document present in
the XML elements and attributes that are embedded in the Word
document being edited? Or do I need to query some aspect of the
WordprocessingML markup, because the embedded XML tags do not tell
the whole story? The onsave
stylesheet for our
press release template, since it converts Word paragraphs to custom
paragraphs, for example, indeed does need to have access to the
WordprocessingML markup. Therefore, the press release template takes
the third approach shown in this table; it turns
“Save data only”
off and “Apply custom
transform”
on.
[3] For this stylesheet to
work as intended, the “Apply
transform” checkbox must be checked, the
saveDataOnly.xsl
file must be selected as the
transform to apply, and the “Save data
only” checkbox must be
unchecked. The reason it must be unchecked is
that the saveDataOnly.xsl
stylesheet is designed
to be applied to the document instead of the
“Save data only” process, rather
than in addition to it.