Earlier, I performed Google queries on my online PDFs and discovered they were only partially indexed. I wanted to know: would Google treat HTML the same way?
Burst PDF vs. Burst HTML
I took BE.pdf and created two parallel editions: the burst PDF edition, where every document page is a single PDF file, and the burst HTML edition, where every document page is a single HTML file. I linked to every page of these two editions from http://accesspdf.com/pdf_html_test, which shuffles the two editions together. After I left this material online for a while, Google finally indexed it.
First, let's search the PDF pages for "beatles", as we did before. Running the Google query:
site:accesspdf.com filetype:pdf inurl:pdf_pages "beatles"
yields 20 PDF pages, which hold a total of 32 occurrences of "beatles." This falls short of our baseline of 40 occurrences, established above. One of the pages overlooked by Google is page 44 ("John Lennon of the Beatles"). A direct search for page 44:
site:accesspdf.com filetype:pdf inurl:pdf_pages inurl:pg_044
indicates that Google did not index it.
Now let's run the same search on the HTML pages:
site:accesspdf.com filetype:html inurl:html_pages "beatles"
This turns up 25 pages with a total of 38 occurrences of "beatles." But our baseline is 25 pages with 40 occurrences! The difference is that our 40 baseline hits include "beatlesesque" (page 119) and "beatless" (page 192). Google does not lump these in with "beatles," as our baseline search did, which accounts for the two missing occurrences.
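The gap between the two counts can be illustrated with a small Python sketch. It assumes the baseline search behaved like a case-insensitive substring match and that Google behaves like a whole-word match; both are simplifications for illustration, not a description of how either tool actually works. The function names and the sample text are hypothetical.

```python
import re


def count_substring(text: str, term: str) -> int:
    """Count case-insensitive hits anywhere, including inside longer words."""
    return text.lower().count(term.lower())


def count_whole_word(text: str, term: str) -> int:
    """Count case-insensitive hits only where the term stands alone as a word."""
    pattern = rf"\b{re.escape(term)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical sample text, not taken from BE.pdf.
sample = "John Lennon of the Beatles. A beatlesesque melody. A beatless track."

substring_hits = count_substring(sample, "beatles")    # counts all three words
whole_word_hits = count_whole_word(sample, "beatles")  # counts only "Beatles"
print(substring_hits, whole_word_hits)
```

In this sample the substring count exceeds the whole-word count by two, the same kind of discrepancy as 40 versus 38 above.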
Given two nearly identical documents, one in PDF and one in HTML, Google seems to have completely indexed the HTML document while indexing only about 80% of the PDF. How about that?
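For readers who want to check the 80% figure, here is the arithmetic as a short Python sketch. The counts come from the searches above. The adjusted HTML baseline of 38 is an assumption: it treats "beatlesesque" and "beatless" as one baseline hit each and removes them. The `coverage` helper is a hypothetical name.

```python
def coverage(found: int, baseline: int) -> float:
    """Return the fraction of baseline items that the search turned up."""
    return found / baseline


BASELINE_PAGES = 25
BASELINE_OCCURRENCES = 40
# Assumption: two baseline hits are other words ("beatlesesque", "beatless").
WHOLE_WORD_BASELINE = BASELINE_OCCURRENCES - 2

pdf_page_coverage = coverage(20, BASELINE_PAGES)
pdf_occurrence_coverage = coverage(32, BASELINE_OCCURRENCES)
html_page_coverage = coverage(25, BASELINE_PAGES)
html_occurrence_coverage = coverage(38, WHOLE_WORD_BASELINE)

print(f"PDF:  {pdf_page_coverage:.0%} of pages, {pdf_occurrence_coverage:.0%} of occurrences")
print(f"HTML: {html_page_coverage:.0%} of pages, {html_occurrence_coverage:.0%} of occurrences")
```

Measured against the unadjusted baseline of 40, the PDF occurrence coverage is the same 80% as the page coverage.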
Original Article and Comments at AccessPDF.com