Chapter 9. Programming Google
Hacks 92-100
When search engines first appeared on the scene, they were more open to being spidered, scraped, and aggregated. Sites like Excite and AltaVista didn’t worry too much about the odd surfer using Perl to grab a slice of a page or meta-search engines including their results in aggregated search results. Sure, egregious data suckers might get shut out, but the search engines weren’t worried about sharing their information on a smaller scale.
Google never took that stance. Instead, they have regularly prohibited meta-search engines from using their content without a license, and they try their best to block unidentified web agents like Perl’s LWP::Simple module or even wget on the command line. Google has even been known to block IP address ranges for running automated queries.
Google had every right to do this; after all, it was their search technology, database, and computer power. Unfortunately, however, these policies meant that casual researchers and Google nuts, like you and me, didn’t have the ability to play with their rich dataset in any automated way.
Google changed all that with the release of the Google Web API (http://api.google.com/) in the spring of 2002. The Google Web API doesn’t allow you to do every kind of search possible (for example, it doesn’t support the phonebook: syntax), but it does make available Google’s eight-billion-page web database so that developers can create their own interfaces and use Google search results to their ...