XML Save Options

When a user tries to save a document as XML, Word presents several options. The “Save As” dialog, shown again in Figure 4-21, includes two checkboxes representing XML save options: the “Apply transform” checkbox and the “Save data only” checkbox. These options correspond to the final two (optional) processes in our processing model diagram (Figure 4-7).

Figure 4-21. XML save options in the “Save As” dialog

Rather than relying solely on the user to make the right choice, you can specify default save settings for a particular document, obviating the need for user intervention. You can set these through the Word UI (in the Tools → Templates and Add-Ins... → XML Schema → XML Options dialog), or by declaring them in the underlying WordprocessingML representation. In our primary XML editing scenario, the onload XSLT transformation that Word applies when opening the document determines the document’s default XML save settings.

In our press release template, the onload stylesheet turns “Save data only” off and “Apply custom transform” on. It does this by generating declarations for these settings inside the w:docPr element. Below is the relevant excerpt from the stylesheet:

  <w:docPr>
    <!-- ... -->
    <w:removeWordSchemaOnSave w:val="off"/>
    <w:useXSLTWhenSaving/>
    <w:saveThroughXSLT w:xslt="\\intra\pr\harvestPressRelease.xsl"/>
  </w:docPr>

The w:removeWordSchemaOnSave element corresponds to the “Save data only” option. Here, it is explicitly turned off. The w:useXSLTWhenSaving element turns the “Apply custom transform” option on. Finally, the w:saveThroughXSLT element specifies the file name of the particular XSLT stylesheet to apply when the w:useXSLTWhenSaving option is turned on.
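
For contrast, a document that relies on the built-in “Save data only” process with no custom transform would carry a different set of declarations. The following is only a minimal sketch. It assumes that an empty w:removeWordSchemaOnSave element (with no w:val attribute) means “on”, in the same way that the empty w:useXSLTWhenSaving element does in the excerpt above. The w:ignoreMixedContent line is optional and is included only to show where that related setting is declared.

```xml
<w:docPr>
  <!-- ... -->
  <!-- Assumption: omitting w:val is equivalent to w:val="on" -->
  <w:removeWordSchemaOnSave/>
  <!-- Optional: also strip mixed content text when saving -->
  <w:ignoreMixedContent/>
  <!-- No w:useXSLTWhenSaving or w:saveThroughXSLT here,
       so no onsave stylesheet is applied -->
</w:docPr>
```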

The “Save Data Only” Document Option

When the “Save data only” option is turned on (via the w:removeWordSchemaOnSave element), Word strips all WordprocessingML markup from the document when the user saves it, leaving only custom XML elements and attributes. This is the same process that Word uses to prepare an embedded XML document for schema validation. In both cases, the “Ignore mixed content” document option parameterizes the behavior of the process, optionally causing it to subsequently strip out remaining mixed content text after it has stripped out the WordprocessingML markup.

Unlike Word’s default onload rendering process for arbitrary XML documents (which the XML2WORD.XSL stylesheet implements), its default onsave process (“Save data only”) is not implemented in an XSLT stylesheet that you can view—at least not one that’s included in the files installed with Office. However, since it is important to understand exactly what this process does, we’ve included in Example 4-9 an XSLT stylesheet that approximates its behavior. This stylesheet is designed to produce the exact same result as the “Save data only” process, when selected as the transform to apply when saving a document.[3]

Example 4-9. An approximation of the “Save data only” process, saveDataOnly.xsl

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:w10="urn:schemas-microsoft-com:office:word"
  xmlns:v="urn:schemas-microsoft-com:office:vml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
  xmlns:st="urn:schemas-microsoft-com:office:smarttags">
   
  <!-- UTF-8 encoding and standalone declaration -->
  <xsl:output encoding="UTF-8" standalone="no"/>
   
   
  <!-- ***************************************************************** -->
  <!--                        Global Variables                           -->
  <!-- ***************************************************************** -->
   
  <!-- True if w:ignoreMixedContent is present and @w:val isn't "off" -->
  <xsl:variable name="ignoreMixedContent"
                select="/w:wordDocument/w:docPr/w:ignoreMixedContent
                        [not(@w:val='off')]"/>
   
  <!-- Result of first pass (before optionally stripping mixed content text) -->
  <xsl:variable name="first-pass-result">
    <xsl:apply-templates select="/*"/>
  </xsl:variable>
   
   
  <!-- ***************************************************************** -->
  <!--                Template rules in default mode                     -->
  <!-- ***************************************************************** -->
   
  <!-- Start here -->
  <xsl:template match="/">
   
    <!-- line break after XML declaration -->
    <xsl:text>&#xA;</xsl:text>
   
    <!-- Re-create any PIs preserved inside o:CustomDocumentProperties --> 
    <xsl:call-template name="create-pis">
      <xsl:with-param name="escaped-pis" select="string(
         /w:wordDocument/o:CustomDocumentProperties/o:processingInstructions)"/>
    </xsl:call-template>
   
    <!-- Apply a second pass to strip mixed content text only if
         $ignoreMixedContent is true -->
    <xsl:choose>
      <xsl:when test="$ignoreMixedContent">
        <xsl:apply-templates select="msxsl:node-set($first-pass-result)/node( )"
                             mode="strip-mixed-content"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="$first-pass-result"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
   
  <!-- Replicate all elements by default
       (filtering out unnecessary namespace nodes) -->
  <xsl:template match="*">
    <xsl:element name="{local-name( )}" namespace="{namespace-uri( )}">
      <xsl:apply-templates select="@*|node( )"/>
    </xsl:element>
  </xsl:template>
   
  <!-- Copy attributes by default -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>
   
  <!-- Preserve text inside w:t elements (other than headers, footers, etc.) -->
  <xsl:template match="w:t[not(ancestor::w:sectPr)]/text( )">
    <xsl:copy/>
  </xsl:template>
   
  <!-- Strip out all other text (field instructions, doc properties, etc.) -->
  <xsl:template match="text( )"/>
   
  <!-- Process children of, but do not copy, elements in Word's namespaces -->
  <xsl:template match="w:*|sl:*|aml:*|wx:*|w10:*|v:*|o:*|dt:*|st:*">
    <xsl:apply-templates/>
  </xsl:template>
   
  <!-- Strip out all attributes in Word's namespaces -->
  <xsl:template match="@w:*|@sl:*|@aml:*|@wx:*|@w10:*|@v:*|@o:*|@dt:*|@st:*"/>
   
   
   
   
  <!-- ***************************************************************** -->
  <!--          Template rules in "strip-mixed-content" mode             -->
  <!-- ***************************************************************** -->
   
  <!-- Copy elements, attributes, PIs, and text straight through -->
  <xsl:template match="@*|node( )" mode="strip-mixed-content">
    <xsl:copy>
      <xsl:apply-templates select="@*|node( )" mode="strip-mixed-content"/>
    </xsl:copy>
  </xsl:template> 
   
  <!-- But strip out mixed content text -->
  <xsl:template match="text( )[preceding-sibling::* or following-sibling::*]"
                mode="strip-mixed-content"/>
   
  <!-- ***************************************************************** -->
  <!--                        Named templates                            -->
  <!-- ***************************************************************** -->
   
  <!-- For re-creating PIs stored as text in o:CustomDocumentProperties;
       (See XML2WORD.XSL) -->
  <xsl:template name="create-pis">
    <xsl:param name="escaped-pis"/>
    <xsl:if test="$escaped-pis">
      <xsl:processing-instruction
           name="{substring-before(
                    substring-after($escaped-pis,'&lt;?'),
                    '&#160;')}">
        <xsl:value-of select="substring-before(
                                substring-after($escaped-pis,'&#160;'),
                                '?>')"/>
      </xsl:processing-instruction>
      <xsl:text>&#xA;</xsl:text>
      <xsl:call-template name="create-pis">
        <xsl:with-param name="escaped-pis"
                        select="substring-after($escaped-pis,'?>')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
   
</xsl:stylesheet>

The template rules in the default mode of Example 4-9, other than the root rule that starts the process, define the essence of what the “Save data only” process does. They strip out elements and attributes in any of the Word-specific namespaces but preserve all elements and attributes in other namespaces, keeping only the text found inside w:t elements. The rest of the stylesheet is concerned with implementing two other features of the “Save data only” process: stripping mixed content and preserving processing instructions.
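
A small before-and-after pair may make the effect of these rules easier to see. This is an illustrative sketch rather than actual Word output: the pr:pressRelease, pr:headline, and pr:date element names are hypothetical, and the WordprocessingML is heavily simplified.

```xml
<!-- Input: simplified WordprocessingML with embedded custom elements -->
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:pr="http://xmlportfolio.com/pressRelease">
  <w:body>
    <pr:pressRelease>
      <w:p>
        <pr:headline>
          <w:r><w:t>Widgets Ship Early</w:t></w:r>
        </pr:headline>
      </w:p>
      <w:p>
        <pr:date>
          <w:r><w:rPr><w:b/></w:rPr><w:t>2004-05-01</w:t></w:r>
        </pr:date>
      </w:p>
    </pr:pressRelease>
  </w:body>
</w:wordDocument>
```

```xml
<!-- Result of applying saveDataOnly.xsl (indented here for readability;
     the stylesheet itself emits no whitespace between elements) -->
<pressRelease xmlns="http://xmlportfolio.com/pressRelease">
  <headline>Widgets Ship Early</headline>
  <date>2004-05-01</date>
</pressRelease>
```

Every w: element disappears, but the text inside each w:t survives because the rule for w:t text has a higher priority than the rule that strips all other text. The custom elements lose their pr prefix because the stylesheet rebuilds each one from its local name and namespace URI.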

Stripping mixed content

Like Word’s built-in “Save data only” process, the stylesheet in Example 4-9 alters its behavior according to whether the “Ignore mixed content” document option is turned on or off.

First, the stylesheet defines a global variable named $ignoreMixedContent that is true as long as the w:ignoreMixedContent element is present and is not turned off.

  <!-- True if w:ignoreMixedContent is present and @w:val isn't "off" -->
  <xsl:variable name="ignoreMixedContent"
                select="/w:wordDocument/w:docPr/w:ignoreMixedContent
                        [not(@w:val='off')]"/>

Then, after stripping out the Word-specific markup, the stylesheet further processes the document if and only if $ignoreMixedContent is true. This is implemented as a second pass (with the help of the msxsl:node-set( ) extension function):

    <!-- Apply a second pass to strip mixed content text only if
         $ignoreMixedContent is true -->
    <xsl:choose>
      <xsl:when test="$ignoreMixedContent">
        <xsl:apply-templates select="msxsl:node-set($first-pass-result)/node( )"
                             mode="strip-mixed-content"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="$first-pass-result"/>
      </xsl:otherwise>
    </xsl:choose>

Finally, the template rules in the strip-mixed-content mode effect an identity transformation with one exception. The operative template rule strips out all mixed content text in the document, i.e., all text nodes that have any element siblings, by doing nothing:

  <xsl:template match="text( )[preceding-sibling::* or following-sibling::*]"
                mode="strip-mixed-content"/>

Thus, the saveDataOnly.xsl stylesheet behaves like the “Save data only” process, stripping out mixed content text only if the “Ignore mixed content” document option is turned on.
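
As a concrete illustration of the second pass, consider a hypothetical first-pass result (the memo, note, ref, and author element names are invented for this sketch, and indentation is added for readability):

```xml
<!-- First-pass result: Word markup already removed -->
<memo>
  <note>See <ref>Figure 1</ref> for details.</note>
  <author>J. Smith</author>
</memo>
```

With “Ignore mixed content” turned on, the strip-mixed-content rules would produce the following:

```xml
<!-- After the second pass -->
<memo>
  <note><ref>Figure 1</ref></note>
  <author>J. Smith</author>
</memo>
```

The text nodes “See ” and “ for details.” are removed because each has an element sibling (ref). The text inside ref and author is kept, since those text nodes have no element siblings.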

Preserving processing instructions

When opening an arbitrary XML document that has one or more processing instructions (PIs) outside the root element, Word’s default onload stylesheet (XML2WORD.XSL) preserves those PIs by escaping the PI markup as text and storing the resulting string in a custom document property named o:processingInstructions (in the o:CustomDocumentProperties element). Then, when the user saves the document, the “Save data only” process converts the escaped PI markup back to literal processing instructions in the final XML document saved by Word.

The saveDataOnly.xsl stylesheet in Example 4-9 exhibits the same behavior. First, it calls a named template, passing it the string value of the o:processingInstructions element:

    <!-- Re-create any PIs preserved inside o:CustomDocumentProperties --> 
    <xsl:call-template name="create-pis">
      <xsl:with-param name="escaped-pis" select="string(
         /w:wordDocument/o:CustomDocumentProperties/o:processingInstructions)"/>
    </xsl:call-template>

Then, the template named create-pis does the actual work of converting the value of the $escaped-pis parameter to real processing instructions in the result document. It recursively parses the escaped PI markup until no processing instructions are left:

  <!-- For re-creating PIs stored as text in o:CustomDocumentProperties;
       (See XML2WORD.XSL) -->
  <xsl:template name="create-pis">
    <xsl:param name="escaped-pis"/>
    <xsl:if test="$escaped-pis">
      <xsl:processing-instruction
           name="{substring-before(
                    substring-after($escaped-pis,'&lt;?'),
                    '&#160;')}">
        <xsl:value-of select="substring-before(
                                substring-after($escaped-pis,'&#160;'),
                                '?>')"/>
      </xsl:processing-instruction>
      <xsl:text>&#xA;</xsl:text>
      <xsl:call-template name="create-pis">
        <xsl:with-param name="escaped-pis"
                        select="substring-after($escaped-pis,'?>')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

This PI re-creation process only works when the onload stylesheet preserves the PIs in exactly the way that the “Save data only” process expects. If you want your own custom onload stylesheets to preserve PIs, take a look at the XML2WORD.XSL file to see exactly how it’s done. Basically, it converts a single PI to a string with these components:

'<?' <PITarget> <nbsp> <PIText> '?>'

Each subsequent escaped PI is concatenated to the end of the last one. And the final value is stored in the o:processingInstructions element.
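
As a sketch of what this looks like in practice, suppose the source document began with two PIs, mso-application and xml-stylesheet (the pr.xsl file name is hypothetical). Under the format just described, the stored property would look something like the following. Any additional attributes that Word writes on custom document properties are omitted here.

```xml
<o:CustomDocumentProperties>
  <!-- Two escaped PIs, concatenated; &#160; separates target from text -->
  <o:processingInstructions>&lt;?mso-application&#160;progid="Word.Document"?&gt;&lt;?xml-stylesheet&#160;type="text/xsl" href="pr.xsl"?&gt;</o:processingInstructions>
</o:CustomDocumentProperties>

<!-- The create-pis template turns that string back into:

<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="pr.xsl"?>
-->
```

On each call, create-pis takes the text between the first '<?' and the first non-breaking space as the PI target, takes the text between that non-breaking space and the first '?>' as the PI content, and then recurses on whatever follows that '?>'.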

In our press release template, the onload stylesheet preserves PIs from the source document in the same way that the XML2WORD.XSL stylesheet does. However, rather than using the “Save data only” process to re-create the PIs, the press release template declares its own custom onsave stylesheet, which re-creates them in the same way that the “Save data only” process would have. Of course, when you have control over both the onload and onsave stylesheets, you can choose whatever mechanism you’d like for preserving PIs. The press release template could have used a different approach, but the one used by XML2WORD.XSL and the “Save data only” process works well, so the template adopts it rather than reinventing the wheel.

One favorable consequence of preserving processing instructions from the source document is that the mso-application PI is preserved in XML documents that Word edits, retaining the file’s association with the Word application. This means that users don’t have to do anything special to open the file in Word; they just double-click it like any other Word document. Conversely, the mso-application PI is only present in the saved document when it was already present in the XML document that Word opened. Word does not automatically output the mso-application PI whenever it saves a custom XML document. On the contrary, it is quite possible to open, edit, and save XML documents in Word without leaving any evidence that Word was ever used to edit the file. The point is that you as the developer do have control over what processing instructions appear in the result.

Tip

To force the presence of the mso-application (or any other) processing instruction in your result document (regardless of whether it was present in the source document), you can simply use the xsl:processing-instruction element in your onsave stylesheet. Or, if you are using “Save data only” with no onsave stylesheet, you can use your onload stylesheet to effectively hard-code the PI to the list of escaped PIs in the o:processingInstructions custom document property. In this case, the “Save data only” process will regenerate the PI just as if it was preserved from the source document.
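
The first technique in this tip might be sketched as follows. This is a hedged example, not the press release template’s actual code: it assumes an onsave stylesheet that imports saveDataOnly.xsl and overrides only the root rule.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:import href="saveDataOnly.xsl"/>

  <xsl:template match="/">
    <!-- line break after XML declaration -->
    <xsl:text>&#xA;</xsl:text>

    <!-- Always emit the mso-application PI -->
    <xsl:processing-instruction name="mso-application"
      >progid="Word.Document"</xsl:processing-instruction>

    <!-- Hand off to the imported root rule, which emits its own
         line break, any preserved PIs, and the document content -->
    <xsl:apply-imports/>
  </xsl:template>

</xsl:stylesheet>
```

One caveat with this sketch: if the source document already contained an mso-application PI and the onload stylesheet preserved it, the imported rule will re-create it as well, and the PI will appear twice. A more careful version would filter it out of the escaped string first.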

The “Apply Custom Transform” Document Option

The “Apply Custom Transform” document option allows you to save an XML document through an onsave XSLT stylesheet. As reflected in our original processing model diagram in Figure 4-7, what document the onsave stylesheet is applied to depends on whether the “Save data only” option is turned on. If “Save data only” is turned off, then the onsave stylesheet is applied directly to the WordprocessingML document. If “Save data only” is turned on, then the onsave stylesheet is applied to the result of stripping the Word-specific markup from the merged XML and WordprocessingML view.

Our press release template uses an onsave stylesheet called harvestPressRelease.xsl. Since the “Save data only” option is turned off, this stylesheet is applied to the entire WordprocessingML document when the user saves it. The purpose of harvestPressRelease.xsl is to behave just like the “Save data only” process, with some notable exceptions: it converts w:p elements in the body of the press release to para elements in the result, and it converts a run with the “Lead-in Emphasis” style to a leadIn element in the result.

The harvestPressRelease.xsl stylesheet behaves just like the “Save data only” process in the sense that it strips out all Word-specific markup from the result, and, except for the para element, it leaves all custom tags intact. It turns out that the saveDataOnly.xsl stylesheet introduced in the last section possesses more than academic interest. It not only can be used to understand the precise behavior of the “Save data only” process, i.e., as a learning aid, but it can also be used directly by custom onsave stylesheets that want to slightly alter its behavior. Our press release template’s onsave stylesheet does just that—it imports the saveDataOnly.xsl stylesheet, selectively modifying its behavior. Example 4-10 shows harvestPressRelease.xsl in its entirety.

Example 4-10. The onsave stylesheet for the press release template, harvestPressRelease.xsl

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:pr="http://xmlportfolio.com/pressRelease"
  xmlns="http://xmlportfolio.com/pressRelease"
  exclude-result-prefixes="w pr">
   
  <xsl:import href="saveDataOnly.xsl"/>
   
  <!-- Skip by the single surrogate paragraph -->
  <xsl:template match="pr:para">
    <!-- Apply templates to all non-empty Word paragraphs -->
    <xsl:apply-templates select="w:p[normalize-space(.)]"/>
  </xsl:template>
   
  <!-- Convert w:p elements inside PR body to para elements -->
  <xsl:template match="pr:para/w:p">
    <para>
      <!-- This element contains mixed content; explicitly preserve space -->
      <xsl:attribute name="xml:space">preserve</xsl:attribute>
      <xsl:apply-templates/>
    </para>
  </xsl:template>
   
  <!-- Convert "Lead-in Emphasis" runs to leadIn elements -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val =
                           /w:wordDocument/w:styles/w:style
                             [w:name/@w:val='Lead-in Emphasis']/@w:styleId]">
   
    <!-- Only process this run if the immediately preceding
         run does not have the same style -->
    <xsl:if test="not(preceding-sibling::w:r[1]
                      [w:rPr/w:rStyle/@w:val = current( )/w:rPr/w:rStyle/@w:val]
                     )">
      <leadIn>
        <xsl:call-template name="merge-adjacent-style-runs"/>
      </leadIn>
    </xsl:if>
  </xsl:template>
   
  <!-- Merge adjacent runs that have the same style -->
  <xsl:template name="merge-adjacent-style-runs" match="w:r" mode="merge-runs">
    <xsl:apply-templates/>
   
    <!-- Recursively apply to the immediately following run
         only if it has the same style -->
    <xsl:apply-templates
         select="following-sibling::w:r[1]
                 [w:rPr/w:rStyle/@w:val = current( )/w:rPr/w:rStyle/@w:val]"
         mode="merge-runs"/>
  </xsl:template>
   
  <!-- Override mixed-content-stripping for text inside pr:para elements -->
  <xsl:template match="pr:para/text( )" mode="strip-mixed-content">
    <xsl:copy/>
  </xsl:template>
   
</xsl:stylesheet>

As you can see, this stylesheet imports the saveDataOnly.xsl stylesheet we looked at earlier:

  <xsl:import href="saveDataOnly.xsl"/>

Now, let’s briefly walk through each template rule in the stylesheet. The first custom rule that will get triggered is also the first one listed in the document. It matches the single pr:para element (where pr maps to the press release namespace) that contains the body text of the press release. Rather than creating a shallow copy of the element, as saveDataOnly.xsl would have done by default, it instructs processing to skip by the element altogether and to process its non-empty paragraph (w:p) children instead:

  <!-- Skip by the single surrogate paragraph -->
  <xsl:template match="pr:para">
    <!-- Apply templates to all non-empty Word paragraphs -->
    <xsl:apply-templates select="w:p[normalize-space(.)]"/>
  </xsl:template>

The next template rule matches the paragraph (w:p) children of pr:para. Each w:p element is effectively replaced by a para element (in the press release namespace). The xml:space="preserve" attribute is programmatically added to the result so that Word (and other potential processes) won’t strip out what it deems to be insignificant whitespace from the document when it loads it again. Since the para element contains mixed content, all child text nodes, including whitespace-only text nodes, should be considered significant:

  <!-- Convert w:p elements inside PR body to para elements -->
  <xsl:template match="pr:para/w:p">
    <para>
      <!-- This element contains mixed content; explicitly preserve space -->
      <xsl:attribute name="xml:space">preserve</xsl:attribute>
      <xsl:apply-templates/>
    </para>
  </xsl:template>

The next template rule gets triggered by runs that have the “Lead-in Emphasis” character style. The purpose of this template rule is to convert such runs into leadIn elements. However, its job is complicated by the fact that Word has a tendency to output adjacent runs that have the same style. Rather than creating a separate leadIn element for each of these, this template rule, with help from the recursive template named merge-adjacent-style-runs, merges adjacent runs in the same style so that only one leadIn element is created per contiguous sequence:

  <!-- Convert "Lead-in Emphasis" runs to leadIn elements -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val =
                           /w:wordDocument/w:styles/w:style
                             [w:name/@w:val='Lead-in Emphasis']/@w:styleId]">
   
    <!-- Only process this run if the immediately preceding
         run does not have the same style -->
    <xsl:if test="not(preceding-sibling::w:r[1]
                      [w:rPr/w:rStyle/@w:val = current( )/w:rPr/w:rStyle/@w:val]
                     )">
      <leadIn>
        <xsl:call-template name="merge-adjacent-style-runs"/>
      </leadIn>
    </xsl:if>
  </xsl:template>
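
To see the merging at work, consider the following illustrative fragment. The styleId value "LeadInEmphasis" is hypothetical; it is assumed to be the w:styleId that the document’s w:styles section associates with the “Lead-in Emphasis” style name.

```xml
<!-- Input: two adjacent runs in the same character style,
     followed by an unstyled run -->
<pr:para>
  <w:p>
    <w:r>
      <w:rPr><w:rStyle w:val="LeadInEmphasis"/></w:rPr>
      <w:t>SAN </w:t>
    </w:r>
    <w:r>
      <w:rPr><w:rStyle w:val="LeadInEmphasis"/></w:rPr>
      <w:t>FRANCISCO</w:t>
    </w:r>
    <w:r>
      <w:t>, May 1: Widgets shipped today.</w:t>
    </w:r>
  </w:p>
</pr:para>
```

```xml
<!-- Expected result -->
<para xmlns="http://xmlportfolio.com/pressRelease"
      xml:space="preserve"><leadIn>SAN FRANCISCO</leadIn>, May 1: Widgets shipped today.</para>
```

The first styled run has no preceding run in the same style, so it opens the leadIn element and pulls in the second run through the merge-runs mode. When the second run is later visited in the default mode, the xsl:if test fails and nothing is output for it.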

Finally, harvestPressRelease.xsl must override one other aspect of saveDataOnly.xsl’s behavior. Rather than strip out all mixed content text (which saveDataOnly.xsl does when “Ignore mixed content” is turned on, as it is in the press release template), it must preserve the mixed content text found inside the newly created pr:para elements. It does this by overriding the default template rule for text nodes in the strip-mixed-content mode, explicitly copying text nodes that are children of pr:para elements:

  <!-- Override mixed-content-stripping for text inside pr:para elements -->
  <xsl:template match="pr:para/text( )" mode="strip-mixed-content">
    <xsl:copy/>
  </xsl:template>

Thus, the harvestPressRelease.xsl stylesheet behaves very similarly to Word’s “Save data only” process. In fact, for most of the elements in a press release document, it behaves identically, thanks to the saveDataOnly.xsl stylesheet that it imports. However, by incrementally overriding the default behavior of saveDataOnly.xsl, it enables limited but effective support for repeating paragraphs and mixed content.

When to Use These Options

Between the “Save data only” and “Apply custom transform” options, there are four possible combinations. When does it make sense to choose one combination over another? Table 4-1 lists some possible use cases for each combination.

Table 4-1. XML save settings and corresponding use cases

“Save data only”   “Apply custom transform”   Example use cases
off                off                        Saving the document as WordprocessingML
on                 off                        Saving custom markup only (most common configuration for Smart Documents)
off                on                         Converting Word paragraphs to custom elements; converting styled text to custom elements
on                 on                         Converting elements back to attributes; re-ordering or otherwise re-structuring the document

When you are using an onsave XSLT stylesheet and need to decide whether to turn “Save data only” on, ask yourself two questions. Is all the information I need to create my final, saved XML document present in the XML elements and attributes that are embedded in the Word document being edited? Or do I need to query some aspect of the WordprocessingML markup, because the embedded XML tags do not tell the whole story? The onsave stylesheet for our press release template converts Word paragraphs to custom para elements, so it does need access to the WordprocessingML markup. Therefore, the press release template takes the third approach shown in Table 4-1: it turns “Save data only” off and “Apply custom transform” on.
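
By contrast, the fourth combination suits a stylesheet that never needs to see Word markup. The following is a minimal sketch of such an onsave stylesheet, one that converts an element back to an attribute. The release and date element names are hypothetical and, for brevity, are not in a namespace. Because “Save data only” is assumed to be on, the stylesheet receives only the custom XML as its input.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Turn the date child element back into a date attribute -->
  <xsl:template match="release">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="date">
        <xsl:value-of select="date"/>
      </xsl:attribute>
      <!-- Process all remaining children except the date element -->
      <xsl:apply-templates select="node()[not(self::date)]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Given `<release><date>2004-05-01</date><title>News</title></release>`, this produces `<release date="2004-05-01"><title>News</title></release>`. None of the template rules mention the w: namespace, which is the sign that “Save data only” can safely be left on.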



[3] For this stylesheet to work as intended, the “Apply transform” checkbox must be checked, the saveDataOnly.xsl file must be selected as the transform to apply, and the “Save data only” checkbox must be unchecked. The reason it must be unchecked is that the saveDataOnly.xsl stylesheet is designed to be applied to the document instead of the “Save data only” process, rather than in addition to it.
