O'Reilly Hacks



Google Prefers HTML over PDF
To most of the online world, if your information isn't indexed by a major search engine, it might as well not exist. One experiment suggests that information formatted as HTML has a better chance of getting indexed than the same information formatted as PDF.

Contributed by:
Sid Steward
[01/17/05]

Earlier, I performed Google queries on my online PDFs and discovered they were only partially indexed. I wanted to know: would Google treat HTML the same way?

Burst PDF vs. Burst HTML

I took BE.pdf and created two parallel editions: the burst PDF edition, where every document page is a single PDF file, and the burst HTML edition, where every document page is a single HTML file. I linked to every page of these two editions from http://accesspdf.com/pdf_html_test, which shuffles the two editions together. After I left this material online for a while, Google finally indexed it.

First, let's search the PDF pages for "beatles", as we did before. Running the Google query:

   site:accesspdf.com filetype:pdf inurl:pdf_pages "beatles"

yields 20 PDF pages, which hold a total of 32 occurrences of "beatles." This falls short of our baseline of 40 occurrences, established above. One of the pages overlooked by Google is page 44 ("John Lennon of the Beatles"). A direct search for page 44:

   site:accesspdf.com filetype:pdf inurl:pdf_pages inurl:pg_044

indicates that Google did not index it.
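The two query shapes used so far (a keyword search across one edition, and a lookup of a single page) can be assembled programmatically. The sketch below is only an illustration: the helper names `build_query` and `page_query` are hypothetical, the zero-padded three-digit page number is an assumption inferred from `pg_044`, and the code only builds query strings without contacting Google.

```python
def build_query(site, filetype, inurl_terms, phrase=None):
    """Assemble a Google query string from the operators used in this hack."""
    parts = [f"site:{site}", f"filetype:{filetype}"]
    parts.extend(f"inurl:{term}" for term in inurl_terms)
    if phrase is not None:
        # Quote the search term, as in the queries shown in the article.
        parts.append(f'"{phrase}"')
    return " ".join(parts)


def page_query(site, filetype, folder, page_number):
    """Build a query that targets one burst page by its file name."""
    # Assumption: page files use a zero-padded, three-digit number (pg_044).
    return build_query(site, filetype, [folder, f"pg_{page_number:03d}"])


keyword_query = build_query("accesspdf.com", "pdf", ["pdf_pages"], "beatles")
page_44_query = page_query("accesspdf.com", "pdf", "pdf_pages", 44)

print(keyword_query)
print(page_44_query)
```

Swapping `"pdf"` and `"pdf_pages"` for `"html"` and `"html_pages"` produces the matching queries for the HTML edition.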

Now let's run the same search on the HTML pages:

   site:accesspdf.com filetype:html inurl:html_pages "beatles"

This turns up 25 pages with a total of 38 occurrences of "beatles." But our baseline is 25 pages with 40 occurrences! The difference is that our 40 baseline hits include "beatlesesque" (page 119) and "beatless" (page 192). Google does not lump these in with "beatles," as our baseline search did. Setting those two aside leaves 38 occurrences, which is exactly what Google found.
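The gap between 40 and 38 is the difference between substring matching and whole-word matching. The following sketch assumes the baseline was a plain, case-insensitive substring search; the sample text and the function names are invented for illustration and are not taken from BE.pdf.

```python
import re


def count_substring(text, term):
    """Count every case-insensitive occurrence of term, even inside longer words."""
    return text.lower().count(term.lower())


def count_whole_word(text, term):
    """Count occurrences of term only where it stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample text: two real mentions plus two look-alike words.
sample = (
    "The Beatles toured. A beatlesesque chorus. "
    "A beatless track. John Lennon of the Beatles."
)

substring_hits = count_substring(sample, "beatles")
whole_word_hits = count_whole_word(sample, "beatles")

print(substring_hits, whole_word_hits)
```

In this sample the substring count is higher by two, the same kind of discrepancy seen between the baseline and Google's results.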

HTML Wins!

Given two nearly identical documents, one in PDF and one in HTML, Google seems to have completely indexed the HTML document while indexing only about 80% of the PDF. How about that?
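The "about 80%" figure can be checked from the numbers reported above. This is a minimal sketch using only those counts; the `coverage` helper is a hypothetical name, and treating the two look-alike words as excluded from the HTML baseline is an interpretation of the article rather than something it computes explicitly.

```python
def coverage(found, expected):
    """Return the percentage of expected items that were found."""
    return 100.0 * found / expected


# PDF edition: 20 of 25 pages, 32 of the 40 baseline occurrences.
pdf_page_coverage = coverage(20, 25)
pdf_hit_coverage = coverage(32, 40)

# HTML edition: 25 of 25 pages, and 38 hits against a whole-word
# baseline of 40 - 2 (dropping "beatlesesque" and "beatless").
html_page_coverage = coverage(25, 25)
html_hit_coverage = coverage(38, 40 - 2)

print(pdf_page_coverage, pdf_hit_coverage)
print(html_page_coverage, html_hit_coverage)
```

Measured against the same whole-word baseline of 38, the PDF figure would be `coverage(32, 38)`, roughly 84%, which is still well short of the HTML edition.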

See also:

Original Article and Comments at AccessPDF.com



© 2007 O'Reilly Media, Inc.