
How do these hacks stand up? Comment on a hack from the book by choosing the associated "Discuss" link below. You can also view the code from any of the hacks by clicking on the "Listing" or "Code" links. A number of hacks have been selected to be featured online in their entirety; you may view those hacks by clicking on the hack titles that are linked.
You can also download all the scripts and other files for this book here.
Walking Softly
HACK #1
A Crash Course in Spidering and Scraping
A few of the whys and wherefores of spidering
and scraping
[Discuss (0) | Link to this hack]
HACK #2
Best Practices for You and Your Spider
Some rules for the road as
you're writing your own well-behaved
spider
[Discuss (0) | Link to this hack]
HACK #3
Anatomy of an HTML Page
Getting the knack of scraping is more than just
code; it takes knowing HTML and other kinds of web page files
[Discuss (0) | Link to this hack]
HACK #4
Registering Your Spider
If you have a spider you're
programming or planning on using even a minimal amount, you need to
make sure it can be easily identified. The most low-key of spiders
can be the subject of lots of attention
[Discuss (0) | Link to this hack]
HACK #6
Keeping Your Spider Out of Sticky Situations
You see tasty data here, there, and everywhere.
Before you dive in, check the site's acceptable use
policies
[Discuss (0) | Link to this hack]
HACK #7
Finding the Patterns of Identifiers
If you find that the online database or
resource you want uses unique identification numbers, you can stretch
what it does by combining it with other sites and identification
values
[Discuss (0) | Link to this hack]
Assembling a Toolbox
HACK #8
Installing Perl Modules
A fair number of our hacks require modules not
included with the standard Perl distribution. Here,
we'll show you how to install these modules on
Windows, Mac OS X, and Unix-based systems
[Discuss (2) | Link to this hack]
HACK #10
More Involved Requests with LWP::UserAgent
Knowing how to download web pages is great, but
it doesn't help us when we want to submit forms,
fake browser settings, or get more information about our request.
Here, we'll jump into the more useful
LWP::UserAgent
[Discuss (0) | Link to this hack]
HACK #11
Adding HTTP Headers to Your Request
Add more functionality to your programs, or
mimic common browsers, to circumvent server-side filtering of unknown
user agents
[Discuss (0) | Link to this hack]
HACK #12
Posting Form Data with LWP
Automate form submission, whether username and
password authentication, supplying your Zip Code for location-based
services, or simply filling out a number of customizable fields for
search engines
[Discuss (1) | Link to this hack]
HACK #13
Authentication, Cookies, and Proxies
Access restricted resources programmatically by
supplying proper authentication tokens, cookies, or proxy server
information
[Discuss (0) | Link to this hack]
HACK #14
Handling Relative and Absolute URLs
Glean the full URL of any relative reference,
such as "sample/index.html" or
"../../images/flowers.gif", by
using the helper functions of URI
[Discuss (0) | Link to this hack]
HACK #15
Secured Access and Browser Attributes
If you're planning on
accessing secured resources, such as your online banking, intranet,
or the like, you'll need to send and receive data
over a secured LWP connection
[Discuss (0) | Link to this hack]
HACK #16
Respecting Your Scrapee's Bandwidth
Be a better Net citizen by reducing load on
remote sites, either by ensuring you're downloading
only changed content, or by supporting compression
[Discuss (0) | Link to this hack]
HACK #17
Respecting robots.txt
The robots.txt file is a bastion of fair play,
allowing a site to restrict what visiting scrapers are allowed to see
and do or, indeed, keep them out entirely. Play fair by respecting
their requests
[Discuss (0) | Link to this hack]
HACK #19
Scraping with HTML::TreeBuilder
One of many popular HTML parsers available in
Perl, HTML::TreeBuilder approaches the art of HTML parsing as a
parent/child relationship
[Discuss (0) | Link to this hack]
HACK #20
Parsing with HTML::TokeParser
HTML::TokeParser allows you to follow a path
through HTML code, storing the contents of tags as you move nearer
your desire
The Code
[Discuss (0) | Link to this hack]
HACK #21
WWW::Mechanize 101
While LWP::UserAgent and the rest of the LWP
suite provide powerful tools for accessing and downloading web
content, WWW::Mechanize can automate many of the
tasks you'd normally have to code
The Code
[Discuss (2) | Link to this hack]
HACK #22
Scraping with WWW::Mechanize
Never miss another Buffy the Vampire Slayer
episode again with this easy-to-learn introduction to WWW::Mechanize
and HTML::TokeParser
The Code
[Discuss (0) | Link to this hack]
HACK #23
In Praise of Regular Expressions
You don't always need to use a
module like HTML::TokeParser or HTML::TreeBuilder in order to parse
HTML. Sometimes, a few simple regular expressions can save you the
effort
The Code
[Discuss (0) | Link to this hack]
HACK #24
Painless RSS with Template::Extract
Wouldn't it be nice if you
could simply visualize what data on a page looks like, explain it in
template form to Perl, and not bother with the need for parsers,
regular expressions, and other programmatic logic?
That's exactly what Template::Extract helps you
do
[Discuss (0) | Link to this hack]
HACK #25
A Quick Introduction to XPath
Sure, you've got your
traditional HTML parsers of the tree and token variety, and
you've got regular expressions that can be as
innocent or convoluted as you wish. But if neither is a perfect fit
for your scraping needs, consider XPath
The Code
[Discuss (0) | Link to this hack]
HACK #26
Downloading with curl and wget
There are a number of command-line utilities to
download files over HTTP and FTP. We'll talk about
two of the more popular choices: curl and wget
[Discuss (0) | Link to this hack]
HACK #27
More Advanced wget Techniques
wget has a huge number of features that can
make downloading data from the web easier than sitting down and
rolling your own Perl script. Here, we'll cover some
of the more useful configuration options
[Discuss (0) | Link to this hack]
HACK #28
Using Pipes to Chain Commands
Chaining commands into a one-liner can make for
powerful functionality
[Discuss (0) | Link to this hack]
HACK #29
Running Multiple Utilities at Once
You've got scrapers, spiders,
and robots aplenty, all to run daily according to a particular
schedule. Should you set up a half-dozen cron jobs, or combine them
into one script?
[Discuss (0) | Link to this hack]
HACK #30
Utilizing the Web Scraping Proxy
With the use of a Perl proxy,
you'll be able to browse web sites and have the LWP
code written out automatically for you. Although not perfect, it can
certainly be a time saver
The Code
[Discuss (0) | Link to this hack]
HACK #31
Being Warned When Things Go Wrong
When you're writing any script
that operates on data you don't control, from either
a database, a text file, or a resource on the Internet,
it's always a good idea to add a healthy dose of
error checking
[Discuss (0) | Link to this hack]
HACK #32
Being Adaptive to Site Redesigns
It's a typical story: you work
all night long to create the perfect script to solve all your woes,
and when you wake in the morning ready to run it
"for real," you find the site
you're scraping has changed its URLs or
HTML
[Discuss (0) | Link to this hack]
Collecting Media Files
HACK #33
Detective Case Study: Newgrounds
Learn how to gumshoe your way through a
site's workflow, regardless of whether there are
pop-up windows, JavaScripts, frames, or other bits of obscuring
technology
The Code
[Discuss (0) | Link to this hack]
HACK #35
Downloading Movies from the Library of Congress
Often, downloading from the Web is accomplished
more easily with a little exploration and a command-line utility or
favorite browser than with even the most accomplished programming
[Discuss (0) | Link to this hack]
HACK #36
Downloading Images from Webshots
Search a large collection of
community-contributed images, based on keywords of your choice, and
then download the visual findings
The Code
[Discuss (0) | Link to this hack]
HACK #37
Downloading Comics with dailystrips
Love comics but hate visiting multiple sites
for your daily dose? Automate your stripping with some easy-to-use
open source Perl software
[Discuss (0) | Link to this hack]
HACK #38
Archiving Your Favorite Webcams
Got a number of scenic or strategically placed
webcams you watch daily? Or would like to ensure that your coworkers
are actually doing the work you've assigned them?
Keep on top of your pictorial problems with Python
The Code
[Discuss (1) | Link to this hack]
HACK #39
News Wallpaper for Your Site
Grab today's news images for
your web site or as an RSS feed, suitable for viewing in your
favorite syndicated news application
The Code
[Discuss (0) | Link to this hack]
HACK #40
Saving Only POP3 Email Attachments
Get oodles of attachments from mailing lists
and friends? Learn how to save them to your hard drive automatically
with a little Perl voodoo
The Code
[Discuss (1) | Link to this hack]
HACK #42
Downloading from Usenet with nget
Even though common wisdom states that porn
peddlers and spam pushers have overrun Usenet, there are still a
number of groups resolutely producing good content for good folks. In
this hack, we'll show how to download files from
news groups of your choice
[Discuss (1) | Link to this hack]
Gleaning Data from Databases
HACK #43
Archiving Yahoo! Groups Messages with yahoo2mbox
Looking to keep a local archive of your
favorite mailing list? With yahoo2mbox,
you can import the final results into your favorite
mailer
[Discuss (0) | Link to this hack]
HACK #44
Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Yahoo! Groups makes it easy to run an email
discussion group at no cost. Sadly, there's no
simple way to download all the messages—until now
The Code
[Discuss (2) | Link to this hack]
HACK #46
Spidering the Yahoo! Catalog
Writing a spider to spider an existing
spider's site may seem convoluted, but it can prove
useful when you're looking for location-based
services. This hack walks through creating a framework for full-site
spidering, including additional filters to lessen your
load
The Code
[Discuss (0) | Link to this hack]
HACK #48
Scattersearch with Yahoo! and Google
Sometimes, illuminating results can be found
when scraping from one site and feeding the results into the API of
another. With scattersearching,
you can narrow down the most popular related results, as suggested by
Yahoo! and Google
The Code
[Discuss (0) | Link to this hack]
HACK #50
Weblog-Free Google Results
With so many weblogs being indexed by Google, you
might worry about too much emphasis on the hot topic of the moment.
In this hack, we'll show you how to remove the
weblog factor from your Google results
The Code
[Discuss (0) | Link to this hack]
HACK #51
Spidering, Google, and Multiple Domains
When you want to search a site, you tend to go
straight to the site itself and use its native capabilities. But what
if you could use Google to search across many similar sites, scraping
the pages of most relevance?
The Code
[Discuss (0) | Link to this hack]
HACK #52
Scraping Amazon.com Product Reviews
While Amazon.com has made some reviews
available through their Web Services API, most are available only at
the Amazon.com web site, requiring a little screen scraping to grab
them
The Code
[Discuss (0) | Link to this hack]
HACK #53
Receive an Email Alert for Newly Added Amazon.com Reviews
This hack keeps an eye on Amazon.com and
notifies you, via email, when a new product review is posted to items
you're tracking
The Code
[Discuss (0) | Link to this hack]
HACK #54
Scraping Amazon.com Customer Advice
Screen scraping can give you access to
Amazon.com community features not yet implemented through
Amazon.com's public Web Services API. In this hack,
we'll implement a script to scrape customer buying
advice
The Code
[Discuss (0) | Link to this hack]
HACK #55
Publishing Amazon.com Associates Statistics
Share some insider knowledge, such as the most
popular item sold, with your site's audience by
republishing your Amazon.com
Associates sales statistics
The Code
[Discuss (3) | Link to this hack]
HACK #57
Related Amazon.com Products with Alexa
Given any URL, Alexa will return traffic data,
user ratings, and even related Amazon.com products. This hack creates
a cloud of related product data for any given URL
The Code
[Discuss (0) | Link to this hack]
HACK #58
Scraping Alexa's Competitive Data with Java
Alexa tracks the browsing habits of its
millions of users daily. This hack allows you to aggregate the
traffic statistics of multiple web properties into one RSS file, with
subscriptions available daily
The Code
[Discuss (2) | Link to this hack]
HACK #59
Finding Album Information with FreeDB and Amazon.com
By combining identifying information from one
database with related information from another, you can create
powerful applications with little effort
[Discuss (0) | Link to this hack]
HACK #60
Expanding Your Musical Tastes
Looking for new music to complement your stale
collection? With this script, you'll be able to pass
some names of your favorite artists, and get Audioscrobbler
recommendations
The Code
[Discuss (0) | Link to this hack]
HACK #61
Saving Daily Horoscopes to Your iPod
You've got a zillion songs on
your new iPod, and you're traveling around town
oblivious to the sounds of the city. Worried about getting hit by a
car, finding that special someone, or knowing when to ask for that
raise? Take your horoscope along with you by running this hack
daily
The Code
[Discuss (0) | Link to this hack]
HACK #62
Graphing Data with RRDTOOL
Graphing data over time, either by itself or in
comparison with another dataset, is the Holy Grail of analytical
research. With the use of RRDTOOL, you'll be able to
store and display time-series data
The Code
[Discuss (0) | Link to this hack]
HACK #63
Stocking Up on Financial Quotes
Keeping track of multiple stocks can be a
cumbersome task, but using the Finance::Quote Perl module can greatly
simplify it. And, while we're at it,
we'll generate pretty graphs with
RRDTOOL
The Code
[Discuss (1) | Link to this hack]
HACK #64
Super Author Searching
By combining multiple sites into one powerful
script, you can get aggregated data results that are more complete
than just one site could give
[Discuss (0) | Link to this hack]
HACK #65
Mapping O'Reilly Best Sellers to Library Popularity
If you're using Google to look
for books in university libraries, you'll get better
results using a Library of Congress Number than a plain old ISBN
The Code
[Discuss (0) | Link to this hack]
HACK #66
Using All Consuming to Get Book Lists
You can retrieve a list of the most-mentioned
books in the weblog community, as well as personal book lists and
recommendations, through either of All Consuming's
two web service APIs
[Discuss (0) | Link to this hack]
HACK #70
Using the Link Cosmos of Technorati
Similar to other indexing sites like Blogdex,
the Link Cosmos at Technorati keeps track of an immense number of
blogs, correlating popular links and topics
for all to see. With the recently released API, developers can now
integrate the results into their own scripts
[Discuss (0) | Link to this hack]
HACK #71
Finding Related RSS Feeds
If you're a regular reader of
weblogs, you know that most syndicate their content in a format
called RSS. By querying aggregated RSS databases, you can find
related sites you may be interested in reading
[Discuss (0) | Link to this hack]
HACK #72
Automatically Finding Blogs of Interest
An easy way to find interesting new sites is to
peruse an existing site's blogroll: a listing of
blogs they read regularly. Let's create a spider to
automate this by looking for keywords in the content of outbound
links
The Code
[Discuss (0) | Link to this hack]
HACK #73
Scraping TV Listings
Freeing yourself from flipping through a weekly
publication by visiting the TV Guide Online web site might sound like
a good idea, but being forced to load heavy pages, showing only hours
at a time and channels you don't care for,
isn't exactly the utopia for which you were
hoping
The Code
[Discuss (0) | Link to this hack]
HACK #74
What's Your Visitor's Weather Like?
You have a web site, as most people do, and
you're interested in getting a general idea of what
your visitors' weather is like.
Want to know if you get more comments when it's
raining or sunny? With the groundwork laid in this hack, that and
other nonsense will be readily available
The Code
[Discuss (0) | Link to this hack]
HACK #75
Trendspotting with Geotargeting
Compare the relative popularity of a trend or
fashion in different locations, using only Google and Directi search
results
The Code
[Discuss (0) | Link to this hack]
HACK #77
Geographic Distance and Back Again
When you're traveling from one
place to another, it's usually handy to know exactly
how many miles you're going to be on the road. One
of the best ways to get the most accurate result is to use latitude
and longitude
The Code
[Discuss (2) | Link to this hack]
HACK #80
Reformatting Bugtraq Reports
Since Bugtraq is such an important part of a
security administrator's watch list,
it'll only be a matter of time before
you'll want to integrate it more closely with your
daily habits
The Code
[Discuss (0) | Link to this hack]
HACK #81
Keeping Tabs on the Web via Email
If you find yourself checking your email more
than cruising the Web, you might appreciate a little Perl work to
bring the Web to your mailbox
[Discuss (0) | Link to this hack]
HACK #82
Publish IE's Favorites to Your Web Site
You're surfing at a
friend's house and think, "What is
that URL? I have a link to it in my favorites. I wish I were
home." How about making your favorites available no
matter where you go?
The Code
[Discuss (1) | Link to this hack]
HACK #83
Spidering GameStop.com Game Prices
Looking to get notification when
"Army Men: Quest for Some Semblance of
Quality" goes on sale at $5.99? With this hack,
you'll be able to keep an eye on your most desired
(or derisive) video game titles
The Code
[Discuss (0) | Link to this hack]
HACK #84
Bargain Hunting with PHP
If you're always on the
lookout for the best deals, coupons, and contests, a little bit of
PHP-scraping code can help you stay up-to-date
The Code
[Discuss (1) | Link to this hack]
HACK #85
Aggregating Multiple Search Engine Results
Even though Google may solve all your searching
needs on a daily basis, there may come a time when you need a
"super search"—something that
queries multiple search engines or databases at once
The Code
[Discuss (0) | Link to this hack]
HACK #86
Robot Karaoke
Who says people get to have all the fun? With
this hack, you can let your computer do a little singing, by scraping
the LyricsFreak.com web site and sending the results to a
text-to-speech translator
The Code
[Discuss (1) | Link to this hack]
HACK #87
Searching the Better Business Bureau
Is that new company offering to build your
house, deliver your groceries, and walk your dog legit and free of
complaint? Find out with an automated query of the Better Business
Bureau's web site
The Code
[Discuss (0) | Link to this hack]
HACK #88
Searching for Health Inspections
How healthy are the restaurants in your
neighborhood? And when you find a good one, how do you get there? By
combining databases with maps!
The Code
[Discuss (0) | Link to this hack]
Maintaining Your Collections
HACK #91
Scheduling Tasks Without cron
If you want to run any of the hacks in this
book on a regular basis, your best option is to use cron, a powerful
Linux-based scheduler. But what if you're on a
different OS or don't have access for some other
reason?
[Discuss (0) | Link to this hack]
HACK #92
Mirroring Web Sites with wget and rsync
Is there a site you check frequently, or do you
want a backup of your own site? Various mirroring tools are available
that can ensure you're creating duplicate and
complete backups on another machine
[Discuss (0) | Link to this hack]
Giving Back to the World
HACK #94
Using XML::RSS to Repurpose Data
By using the popular syndication format known
as RSS, you can use your newly scraped data in dozens of different
aggregators, toolkits, and more
The Code
[Discuss (0) | Link to this hack]
HACK #96
Making Your Resources Scrapable with Regular Expressions
A few tricks can make your web page data easier
to parse, without needing complicated HTML libraries or convoluted
logic. The benefits extend to more than just visitors; your own HTML
will be more understandable too
[Discuss (0) | Link to this hack]
HACK #97
Making Your Resources Scrapable with a REST Interface
Consider offering alternative versions of site
documents for a variety of human and machine visitors, based on how
they present themselves
[Discuss (0) | Link to this hack]
HACK #98
Making Your Resources Scrapable with XML-RPC
If you want to make your
site's information accessible to lots of aspiring
spider builders, don't worry about regular
expressions. Just add a little XML-RPC
[Discuss (0) | Link to this hack]
HACK #99
Creating an IM Interface
Add some Perl code here, an AOL Instant
Messenger account there, and one of your favorite scraping scripts,
and you have yourself an automated instant-messaging bot
The Code
[Discuss (0) | Link to this hack]
HACK #100
Going Beyond the Book
As much as we would have liked to deliver a
1,500-page tome, sooner or later you're going to
have to think outside the confines of this book
[Discuss (0) | Link to this hack]
© 2007 O'Reilly Media, Inc.
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.