O'Reilly Hacks



Internet Search Engines: PDF vs. HTML
So much information is stored in PDF, yet a little experimentation suggests that PDF gets short shrift from Google. So let's experiment a little more in this two-part PDF vs. HTML Google shootout.

Contributed by:
Sid Steward
[01/06/05]

Full PDF vs. Burst PDF

I've been wondering just how well Google indexes PDF content. So I experimented by performing some Google searches on the PDF pages hosted at http://www.pdfhacks.com/eno/. That site hosts a single, 216-page PDF (BE.pdf). It also hosts this same PDF as a sequence of single-page PDFs linked together using HTML frames.

If you search the full PDF using its pdfportal page, you get about as good a search as possible. Searching BE.pdf for "beatles" using pdfportal yields 40 hits. That's my baseline.

Now, let's see what Google turns up. I'll search the full PDF, BE.pdf, using these search terms:

   site:pdfhacks.com filetype:pdf inurl:BE.pdf beatles
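
If you want to repeat these searches with different terms, the query above can be assembled programmatically. This is only a minimal sketch: the helper names `google_query` and `search_url` are my own, and the `https://www.google.com/search?q=...` URL form is an assumption about how you would submit the query, not something taken from this experiment.

```python
from urllib.parse import urlencode


def google_query(terms, site=None, filetype=None, inurl=None):
    """Join search operators and free-text terms into one query string."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    parts.append(terms)
    return " ".join(parts)


def search_url(query):
    """Build a search URL (assumed form); the q parameter carries the query."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = google_query("beatles", site="pdfhacks.com", filetype="pdf", inurl="BE.pdf")
print(query)  # site:pdfhacks.com filetype:pdf inurl:BE.pdf beatles
print(search_url(query))
```

Passing a quoted phrase such as `'"different shades of meaning"'` as `terms` keeps the quotes in the query, so the search stays an exact-phrase search.
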

Google has no trouble locating BE.pdf. The problem is that the whole document counts as only one hit: BE.pdf. (Actually, it gave me three hits, each the same BE.pdf but at a different URL.) To see where "beatles" appears in this 216-page PDF, I click the "View as HTML" link provided by Google, which also highlights my search term. It turns out that Google's HTML cache of this PDF stops at page 62, so it catches only the first 11 occurrences of "beatles."

I wonder: did Google index any of this full PDF after page 62? A search for "beatles provided the most" (page 50) yields my BE.pdf. A search for "manner of the beatles" (page 160) does not. "showing the beatles that" (page 87) also fails. "john lennon of the beatles" (page 50) works. "breakup of the beatles" (page 169) fails. So it seems that Google indexed less than 1/3 of BE.pdf. Maybe BE.pdf just isn't important enough to bother indexing the whole thing.
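
The reasoning behind that estimate can be checked with a few lines of arithmetic. The sketch below simply records the five probe phrases and their outcomes from the paragraph above, then tests whether they agree with the page-62 cutoff seen in the HTML cache. The variable names are mine, and treating the cutoff as a single page boundary is an assumption.

```python
TOTAL_PAGES = 216      # length of BE.pdf
CACHE_LAST_PAGE = 62   # last page present in Google's HTML cache

# (phrase, page in BE.pdf, did the search return BE.pdf?)
probes = [
    ("beatles provided the most", 50, True),
    ("manner of the beatles", 160, False),
    ("showing the beatles that", 87, False),
    ("john lennon of the beatles", 50, True),
    ("breakup of the beatles", 169, False),
]

# Deepest page that was found, and shallowest page that was missed.
last_found = max(page for _, page, found in probes if found)
first_missed = min(page for _, page, found in probes if not found)

# The probes agree with the cache if the cutoff falls between those two pages.
consistent = last_found <= CACHE_LAST_PAGE < first_missed
cache_fraction = CACHE_LAST_PAGE / TOTAL_PAGES

print(last_found, first_missed, consistent)  # 50 87 True
print(f"{cache_fraction:.1%}")               # 28.7%
```

The probes only bracket the cutoff between pages 50 and 87. It is the cache view that suggests page 62, which works out to a little under one third of the document.
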

Now let's try searching for "beatles" within the directory of single PDF pages:

   site:pdfhacks.com filetype:pdf inurl:skinned_php "beatles"

That search yields six PDF pages in Google (44, 101, 119, 125, 207, and 216). Two of those pages show "beatles" twice, so that makes eight hits total. Let's take the phrase "different shades of meaning" from page 119 and perform a broader search:

   site:pdfhacks.com filetype:pdf "different shades of meaning"

As expected, this catches our single page 119, but it does not catch the full PDF, BE.pdf.
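
To put the two approaches side by side, here is a rough tally of how many of the 40 baseline occurrences each one surfaced. It is a sketch using only the counts reported above, and the names are mine. Note that the 11 occurrences in the full PDF are visible only through the cached HTML view of a single result, while the burst pages come back as six separate results.

```python
BASELINE_HITS = 40   # occurrences of "beatles" found by pdfportal

full_pdf_hits = 11   # occurrences visible in Google's cache of BE.pdf

burst_pages = 6      # single-page PDFs returned by Google
pages_with_two = 2   # of those, pages where "beatles" appears twice
burst_hits = burst_pages + pages_with_two


def recall(found, baseline=BASELINE_HITS):
    """Fraction of the baseline occurrences that a search surfaced."""
    return found / baseline


print(f"full PDF:  {full_pdf_hits}/{BASELINE_HITS} = {recall(full_pdf_hits):.1%}")
print(f"burst PDF: {burst_hits}/{BASELINE_HITS} = {recall(burst_hits):.1%}")
```
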

PDF vs. HTML?

For my next experiment, I took BE.pdf and created two parallel editions: the burst PDF edition, where every document page is a single PDF file, and the burst HTML edition, where every document page is a single HTML file. I linked to every page of these two editions from http://accesspdf.com/pdf_html_test, which shuffles the two editions together.
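
One way to build such a link page is sketched below. This is not the script behind the test page: the file naming scheme (`pdf/pg_0001.pdf`, `html/pg_0001.html`), the fixed random seed, and the reading of "shuffles" as a random ordering are all assumptions made for illustration.

```python
import random
from html import escape


def build_index(page_count, seed=0):
    """Return an HTML page linking every PDF and HTML page in shuffled order."""
    links = []
    for n in range(1, page_count + 1):
        # Hypothetical file names; the real test site may use different ones.
        links.append(f"pdf/pg_{n:04d}.pdf")
        links.append(f"html/pg_{n:04d}.html")

    # A seeded generator keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(links)

    items = "\n".join(
        f'<li><a href="{escape(href)}">{escape(href)}</a></li>' for href in links
    )
    return f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>\n"


page = build_index(216)
print(page.count("<li>"))  # 432
```

Mixing the two editions on one page gives a crawler the same path to every PDF page and every HTML page, so neither edition is favored by link position.
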

I just posted this material online, so we'll need to wait until Google indexes it. Then I'll perform some similar side-by-side tests. Stay tuned!

See also: Original Article and Comments at AccessPDF.com


© 2007 O'Reilly Media, Inc.