Stalking the Non-Conforming Microsoft Format
by Andy Oram, 01/01/2000
Just as social scientists take out an odd toolbox of sleuthing tactics to decipher the moves of Russian political leaders (and of the Soviet ones who preceded them), we all find times when we're trying to second-guess the behavior of a Microsoft product. My own foray into MS Kremlinology this week started when I converted a document from Word to HTML using the enticing "Save as HTML..." option (which seems to be on the File menu for PR purposes, because it behaves the same as choosing "Save as..." and then picking HTML as the format).
Why do some things in Microsoft conversions go well, while others violate standards or produce utter bizarreness? Is each problem a design flaw or a deliberate attack of "embrace and extend"? One always has to guess. Interestingly, many of the problems I found seem to spring from poor design in Word.
The casual hacker's dismissal of Microsoft as a bunch of bumblers is quite unfair. The company routinely comes up with excellent initiatives; an example that received a lot of praise recently is ClearType fonts for electronic text display (although they look to me like just a careful application of anti-aliasing). Although I was using Word on this particular project because the journal I wrote for demanded it, there are many times I like Word for its own strengths. And somebody at Microsoft is good at designing standards (SOAP, DOM, CIFS, etc.). But we might have a smoother transition to the Information Age if this person monitored the people writing their software.
Even though the Microsoft "Save as HTML..." output looks decent in a browser (Netscape Navigator as well as Microsoft IE--and even good old lynx), I would be ashamed to put it up on my site. In retrospect, I don't think fixing the Microsoft HTML output took any less time than converting the 19-page document to plain text and tagging everything myself.
Read on if you want details.
----
Many shops like Word for its modern approach to document structure. It lets you tag your paragraphs (and smaller items) so that something you mark as "Heading 1" in your Word document can easily be converted into a <H1> tag in HTML. Right? So what did the Word converter change my "Heading 1" to? See here:
<B><FONT FACE="Arial" SIZE=5><P>
Some way to preserve meta-information! And it converted both "Heading 2" and "Heading 3" tags to the same output. I cannot reconstruct what is an <H2> and what is an <H3> without checking the original Word document.
To find a clue to this particular cluelessness, I converted the Word document to RTF and looked at the headings. And I think I discovered the problem: the RTF (which I imagine reflects the binary Word storage format) includes duplicate information regarding a Heading 1, a Heading 2, etc. At the beginning of the RTF file, "heading 1" is mapped to a bunch of display attributes (bold font, etc.), "heading 2" to another set, and so forth. When I check where my heading appears in the RTF document, it appears with the string of display attributes along with a pointer to the information-rich tag. And the HTML converter picked up the display attributes instead of the tag. Things you wish your mother had told you about RTF.
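To make the point concrete, here is a minimal sketch in Python of what picking up the tag instead of the display attributes could look like. It assumes a simplified, hypothetical view of a paragraph as a style name plus text; the function names are illustrative and have nothing to do with how Word's converter is actually written.

```python
import html
import re


def heading_tag(style_name):
    """Return 'H1'..'H6' for a style named 'heading N', else None."""
    match = re.fullmatch(r"heading ([1-6])", style_name.strip().lower())
    return f"H{match.group(1)}" if match else None


def paragraph_to_html(style_name, text):
    """Emit a structural tag chosen from the style name alone.

    The display attributes (bold, Arial, size) are deliberately ignored;
    any paragraph that is not a recognized heading falls back to <P>.
    """
    tag = heading_tag(style_name) or "P"
    return f"<{tag}>{html.escape(text)}</{tag}>"
```

With this approach, "Heading 2" and "Heading 3" stay distinguishable in the output because the style name, not its appearance, decides the tag.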
I was not overly impressed, therefore, when I saw that the conversion did represent lists correctly (<OL> or <UL>) and added end tags (like </P> and </LI>) so that the HTML will supposedly conform to XHTML and XML. Any good impressions I had were dispelled when I found list items where I had italicized a word at the start of the item. The <I> tag came before the list tag, even though the </I> closing tag came inside the list:
<I><LI>Encryption</I> and
This looks stupid to anybody viewing the source, and will break under XHTML. I'm not sure why the HTML came out this way, but I don't believe a programmer could code the conversion tool that way. The only hint I have, once again from the RTF, is that the \i tag setting italic in RTF is not directly tied to the word "Encryption" but is mixed into the bunch of attributes that mark the paragraph as a list item. So the conversion tool probably doesn't know that the italic is meant to be for one word.
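For anyone cleaning up such output by hand, a small repair pass is possible. The following Python sketch moves an inline opening tag that landed just before <LI> to just after it. The set of inline tags (I, B, U) is an assumption, and a regular expression like this handles only the simple pattern shown above, not arbitrary misnesting.

```python
import re

# An inline opening tag (assumed set: I, B, U) emitted immediately before <LI>.
MISNESTED = re.compile(r"<(I|B|U)>\s*<LI>", re.IGNORECASE)


def fix_misnested_list_items(html_text):
    """Rewrite '<I><LI>' as '<LI><I>' so the inline tag opens inside the item."""
    return MISNESTED.sub(lambda m: f"<LI><{m.group(1)}>", html_text)
```

Applied to the line quoted above, this yields `<LI><I>Encryption</I> and`, which nests properly.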
Now for special non-ASCII characters. I'm smart, you see, because I use smart quotes. Also smart apostrophes and em-dashes, all of which appear in Word as special octal codes. (You sometimes see these codes, like \222 for an apostrophe, in email from people whose mailer doesn't behave well.) Some of these characters were converted to HTML entities by the converter (nice job). Strangely, smart quotes were converted into straight quotes. In a completely inconsistent manner, so far as I could tell, some characters were left in the raw \222 form. But, hey, no problem, it's fine, guys. Because the converter also included a META tag in the HTML with a CONTENT line containing "charset=windows-1252". So the browsers all know what to do with the special characters, right?
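A consistent treatment of these characters is not hard to sketch. The Python below decodes bytes as windows-1252 (Python's `cp1252` codec) and replaces every non-ASCII character with a named HTML entity where the standard library knows one, or a numeric reference otherwise. The function name is illustrative; the choice to emit entities rather than rely on the charset declaration is just one reasonable option.

```python
from html.entities import codepoint2name


def cp1252_to_entities(raw):
    """Decode windows-1252 bytes and express non-ASCII characters as entities."""
    # Bytes that windows-1252 leaves undefined become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                        # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # e.g. &rsquo;
        else:
            out.append(f"&#{code};")              # numeric fallback
    return "".join(out)
```

The byte written as \222 in octal is 0x92, which windows-1252 defines as the right single quotation mark, so `cp1252_to_entities(b"it\222s")` produces `it&rsquo;s`.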
Fellow editors tell me I'm lucky to be converting from Word rather than some other Microsoft Office tool, such as Excel. The "Internet Assistant Wizard" in Excel inserts random extra spacing (ignored by browsers), cute filler such as bolded non-breaking spaces, and secret messages such as "vnd.ms-excel.numberformat:_($* #,##0.00_)[semicolon]_($* (#,##0.00)[semicolon]_($* [dquote]-[dquote]??_) [semicolon]_(@_)" in each TD element of the table, readable only by heaven knows what Microsoft utility.
One big, final lapse: the converter didn't know what to do with footnotes. Footnotes undoubtedly pose a problem for converters, because there's no concept of footnotes in HTML. But is the solution to simply leave them all out, and to leave out all the markers too? So that I have to reinsert every footnote by hand? It took me about half an hour to restore 67 footnotes (using GNU Emacs macros on a Unix system, I should add). Since the RTF shows the footnote embedded with the main text, the converter could certainly have done something intelligent with them if it had been designed that way.
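One "something intelligent" would be to turn each embedded footnote into a numbered link and collect the notes in a list at the end. The Python sketch below assumes a heavily simplified input in which each footnote appears inline as `{\footnote text}` with no nested braces; real RTF footnote groups carry more formatting than this, and the anchor names (`fn1`, `ref1`) are made up for illustration.

```python
import html
import re

# Simplified assumption: a footnote is a brace group with no nested braces.
FOOTNOTE = re.compile(r"\{\\footnote\s+([^{}]*)\}")


def footnotes_to_html(text):
    """Replace inline footnotes with numbered links and append the notes as a list."""
    notes = []

    def marker(match):
        notes.append(match.group(1).strip())
        n = len(notes)
        return f'<SUP><A HREF="#fn{n}" NAME="ref{n}">{n}</A></SUP>'

    body = FOOTNOTE.sub(marker, text)
    if not notes:
        return body
    items = "".join(
        f'<LI><A NAME="fn{n}"></A>{html.escape(note)}</LI>'
        for n, note in enumerate(notes, start=1)
    )
    return f"{body}\n<OL>{items}</OL>"
```

Even this crude approach keeps every marker and every note, which is more than the converter managed.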
But don't worry about footnotes, Mr. Gates, because nobody doing scholarly research is going to check a Web site. That kind of serious application will wait for a new Internet generation. Instead, we'll all spend our time on MS Kremlinology.