 Published on O'Reilly (http://oreilly.com/)


Stalking the Non-Conforming Microsoft Format: Part II

by Andy Oram
01/01/2000

[Editor's note: The following article is a follow-up to Stalking the Non-Conforming Microsoft Format, also by Andy Oram.]

My previous posting drew both cries of pained recognition and suggestions for interesting new avenues to explore. The new avenues led to this follow-up article. But since different versions of Word behave very differently, I will never be able to travel all the twisted routes down which HTML conversion might lead me.

One reader told me that Word 2000 does a better job of conversion than the Word 97 that I used for my article. So I found a computer with Word 2000 and the result was...well, I wouldn't say "better"; let's call it "different."

On the positive side, Word 2000 preserved footnotes during conversion to HTML. It understood that a "Heading 1" paragraph should become an H1 element. But in all other conversions Word 2000 is a step backwards. List items, for instance, are no longer <LI> elements but <p class=MsoListBullet> paragraphs. Presentation has almost completely won out over structure.
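For readers who want to recover list structure from this kind of output, here is a minimal sketch in Python. It is not something Word or this article provides. It assumes each bullet is a plain <p class=MsoListBullet>...</p> paragraph with no nested paragraphs; real Word 2000 output may carry extra attributes or inline markup that this does not clean up. The function name restore_lists is hypothetical.

```python
import re

# One bullet paragraph. The class name is taken from the article; the exact
# attribute layout is an assumption about what the converter emits.
ITEM = r"""<p\s+class=["']?MsoListBullet["']?[^>]*>((?:(?!</p>).)*)</p>"""
FLAGS = re.IGNORECASE | re.DOTALL

ITEM_RE = re.compile(ITEM, FLAGS)
# A run of one or more consecutive bullet paragraphs, with optional
# whitespace after each one.
RUN_RE = re.compile("(?:" + ITEM + r"\s*)+", FLAGS)


def restore_lists(html):
    """Rewrite each run of MsoListBullet paragraphs as a <ul> of <li> items."""

    def to_ul(match):
        items = ITEM_RE.findall(match.group(0))
        lines = ["<ul>"]
        lines += ["<li>%s</li>" % item.strip() for item in items]
        lines.append("</ul>")
        return "\n".join(lines) + "\n"

    return RUN_RE.sub(to_ul, html)
```

Paragraphs with any other class, such as MsoNormal, pass through unchanged, so the sketch can be run over a whole file.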

Comparing the HTML output of Word 97 and Word 2000 is a breathtaking experience, a front-row seat for viewing software bloat. For instance, a table comes out of Word 97 almost as simple as I would code it myself:

<TABLE BORDER CELLSPACING=1 CELLPADDING=7 WIDTH=590>
<TR><TD WIDTH="33%" VALIGN="TOP">
<P>
etc.

The corresponding Word 2000 output is:

<table border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse;
border:none;mso-border-alt:solid windowtext .5pt;mso-padding-alt:0in 5.4pt 0in 5.4pt'>
<tr>
<td width=197 valign=top style='width:2.05in;border:solid windowtext .5pt; padding:0in 5.4pt 0in 5.4pt'>
<p class=MsoNormal>
etc.

As you can guess, the reliance on styles in Word 2000 creates correspondingly large headers. In a one-page document, the header is 75% to 80% of the HTML file.
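A figure like this is easy to check on your own files. The following is a rough sketch in Python, assuming that "header" means everything from the start of the file through the closing </head> tag and that size is counted in characters. The function name head_share is hypothetical.

```python
import re


def head_share(html):
    """Return the fraction of the file that lies at or before </head>.

    Returns 0.0 when the document has no closing </head> tag.
    """
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        return 0.0
    return match.end() / len(html)
```

Reading a saved file into a string and passing it to head_share gives a number between 0 and 1; multiply by 100 for a percentage.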

Another helpful reader, Sandford Smith, made the sensible suggestion that I use a Web Page template to create an HTML document. To be fussy, that's outside the scope of this article, because I want to convert a regular Word document rather than start a new one, and I don't think I'd use Word as a WYSIWYG Web page editor. But I gave templates a try anyway.

Word 97 offers just a couple of Web Page templates. When you choose one, your paragraph and text markers reflect some popular HTML tags like blockquote. This doesn't help you to convert a file. (I tried both inserting a DOC file and applying the Web Page template to an existing file. In both cases I came out with the same tags as I got originally from the "Save as HTML..." option.) But if you write a new document using the right tags, the resulting HTML is clean.

In Word 2000 it's a whole different outcome. Choosing "New" from the File menu offers a wealth of possible formats, including no fewer than eight different types of Web pages and a Web Page Wizard, which I tried out and quickly dismissed with a wave of my hand. Instead, I started with the simplest Web Page format and whimsically chose a theme of "Citrus punch," expecting to be transported to the Alhambra. Instead, it led me up the garden path. I found essentially the same paragraph types as the normal Word template, and all the same crud in the HTML output.

I'm also perturbed to see that Microsoft has turned HTML into a programming language with inventions like <![if...]>...<![endif]>. I can't imagine what Microsoft tool uses them, but they sometimes turn up in the craziest situations. For instance, footnote numbers are enclosed in these conditional tags, but the text of the footnotes is not. There are already debates in the Web community over whether to manipulate text through traditional programming languages (Perl, Java) or to ask the browser to do the heavy lifting through new descriptive languages like XSL. If somebody wants to invent a new paradigm like these grand <![if...]>...<![endif]> flourishes, it must be done with considerably greater care than I saw in the Word 2000 HTML output.
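If the conditional markers get in your way, they can be stripped after the fact. Here is a minimal sketch in Python that removes only the <![if...]> and <![endif]> markers in the form quoted above and keeps whatever text they enclose, such as a footnote number. The function name and the condition string in the usage note are illustrative assumptions, not taken from real Word output.

```python
import re

# Matches a bare <![if ...]> opener or a bare <![endif]> closer.
# Markers wrapped in HTML comment syntax are deliberately left alone.
COND_TAG = re.compile(r"<!\[(?:if[^\]]*|endif)\]>", re.IGNORECASE)


def strip_conditional_tags(html):
    """Delete the conditional markers but keep the content between them."""
    return COND_TAG.sub("", html)
```

For example, given a hypothetical fragment such as <![if !supportFootnotes]>[1]<![endif]>, the function would leave just [1].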

Copyright © 2009 O'Reilly Media, Inc.