O'Reilly Hacks
Spidering Hacks
By Morbus Iff, Tara Calishain
October 2003

How do these hacks stand up? Comment on a hack from the book by choosing the associated "Discuss" link below. You can also view the code from any of the hacks by clicking on the "Listing" or "Code" links. A number of hacks have been selected to be featured online in their entirety; you may view those hacks by clicking on the hack titles that are linked.

You can also download all the scripts and other files for this book here.


Walking Softly

HACK
#1

A Crash Course in Spidering and Scraping
A few of the whys and wherefores of spidering and scraping
[Discuss (0) | Link to this hack]

HACK
#2

Best Practices for You and Your Spider
Some rules for the road as you're writing your own well-behaved spider
[Discuss (0) | Link to this hack]

HACK
#3

Anatomy of an HTML Page
Getting the knack of scraping is more than just code; it takes knowing HTML and other kinds of web page files
[Discuss (0) | Link to this hack]

HACK
#4

Registering Your Spider
If you have a spider you're programming or planning on using even a minimal amount, you need to make sure it can be easily identified. The most low-key of spiders can be the subject of lots of attention
[Discuss (0) | Link to this hack]

HACK
#5

Preempting Discovery
Rather than await discovery, introduce yourself!
[Discuss (0) | Link to this hack]

HACK
#6

Keeping Your Spider Out of Sticky Situations
You see tasty data here, there, and everywhere. Before you dive in, check the site's acceptable use policies
[Discuss (0) | Link to this hack]

HACK
#7

Finding the Patterns of Identifiers
If you find that the online database or resource you want uses unique identification numbers, you can stretch what it does by combining it with other sites and identification values
[Discuss (0) | Link to this hack]

Assembling a Toolbox

HACK
#8

Installing Perl Modules
A fair number of our hacks require modules not included with the standard Perl distribution. Here, we'll show you how to install these modules on Windows, Mac OS X, and Unix-based systems
[Discuss (2) | Link to this hack]

HACK
#9

Simply Fetching with LWP::Simple
Suck web content easily using the aptly named LWP::Simple
[Discuss (0) | Link to this hack]

HACK
#10

More Involved Requests with LWP::UserAgent
Knowing how to download web pages is great, but it doesn't help us when we want to submit forms, fake browser settings, or get more information about our request. Here, we'll jump into the more useful LWP::UserAgent
[Discuss (0) | Link to this hack]

HACK
#11

Adding HTTP Headers to Your Request
Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents
[Discuss (0) | Link to this hack]

HACK
#12

Posting Form Data with LWP
Automate form submission, whether username and password authentication, supplying your Zip Code for location-based services, or simply filling out a number of customizable fields for search engines
[Discuss (1) | Link to this hack]

HACK
#13

Authentication, Cookies, and Proxies
Access restricted resources programmatically by supplying proper authentication tokens, cookies, or proxy server information
[Discuss (0) | Link to this hack]

HACK
#14

Handling Relative and Absolute URLs
Glean the full URL of any relative reference, such as "sample/index.html" or "../../images/flowers.gif", by using the helper functions of URI
[Discuss (0) | Link to this hack]
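
The hack itself relies on Perl's URI module. As a rough illustration of the same idea, the sketch below uses Python's standard urllib.parse.urljoin instead. The base URL and the helper name absolutize are assumptions made up for this example; they do not come from the book.

```python
from urllib.parse import urljoin


def absolutize(base_url, references):
    """Resolve each relative reference against the page it was found on."""
    return [urljoin(base_url, ref) for ref in references]


if __name__ == "__main__":
    # Hypothetical page on which the relative links were found.
    base = "http://www.example.com/docs/guide/page.html"
    for url in absolutize(base, ["sample/index.html", "../../images/flowers.gif"]):
        print(url)
```

A reference that is already absolute passes through unchanged, so the same helper can be applied to every link scraped from a page.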

HACK
#15

Secured Access and Browser Attributes
If you're planning on accessing secured resources, such as your online banking, intranet, or the like, you'll need to send and receive data over a secured LWP connection
[Discuss (0) | Link to this hack]

HACK
#16

Respecting Your Scrapee's Bandwidth
Be a better Net citizen by reducing load on remote sites, either by ensuring you're downloading only changed content, or by supporting compression
[Discuss (0) | Link to this hack]

HACK
#17

Respecting robots.txt
The robots.txt file is a bastion of fair play, allowing a site to restrict what visiting scrapers are allowed to see and do or, indeed, keep them out entirely. Play fair by respecting their requests
[Discuss (0) | Link to this hack]
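
A minimal sketch of the check a well-behaved spider might make before each request, written in Python with the standard urllib.robotparser module rather than the Perl used in the book. The rules, the user-agent string "ExampleSpider/1.0", and the URLs are all hypothetical. The rules are parsed from a string so the example runs without network access; a real spider would fetch the site's own robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the target site instead of hard-coding it.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, agent="ExampleSpider/1.0"):
    """True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/data.html"):
        print(url, "allowed" if allowed(rules, url) else "disallowed")
```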

HACK
#18

Adding Progress Bars to Your Scripts
Give a visual indication that a download is progressing smoothly
The Code
[Discuss (3) | Link to this hack]

HACK
#19

Scraping with HTML::TreeBuilder
One of many popular HTML parsers available in Perl, HTML::TreeBuilder approaches the art of HTML parsing as a parent/child relationship
[Discuss (0) | Link to this hack]

HACK
#20

Parsing with HTML::TokeParser
HTML::TokeParser allows you to follow a path through HTML code, storing the contents of tags as you move nearer your desire
The Code
[Discuss (0) | Link to this hack]

HACK
#21

WWW::Mechanize 101
While LWP::UserAgent and the rest of the LWP suite provide powerful tools for accessing and downloading web content, WWW::Mechanize can automate many of the tasks you'd normally have to code
The Code
[Discuss (2) | Link to this hack]

HACK
#22

Scraping with WWW::Mechanize
Never miss another Buffy the Vampire Slayer episode again with this easy-to-learn introduction to WWW::Mechanize and HTML::TokeParser
The Code
[Discuss (0) | Link to this hack]

HACK
#23

In Praise of Regular Expressions
You don't always need to use a module like HTML::TokeParser or HTML::TreeBuilder in order to parse HTML. Sometimes, a few simple regular expressions can save you the effort
The Code
[Discuss (0) | Link to this hack]

HACK
#24

Painless RSS with Template::Extract
Wouldn't it be nice if you could simply visualize what data on a page looks like, explain it in template form to Perl, and not bother with the need for parsers, regular expressions, and other programmatic logic? That's exactly what Template::Extract helps you do
[Discuss (0) | Link to this hack]

HACK
#25

A Quick Introduction to XPath
Sure, you've got your traditional HTML parsers of the tree and token variety, and you've got regular expressions that can be as innocent or convoluted as you wish. But if neither is a perfect fit for your scraping needs, consider XPath
The Code
[Discuss (0) | Link to this hack]

HACK
#26

Downloading with curl and wget
There are a number of command-line utilities to download files over HTTP and FTP. We'll talk about two of the more popular choices: curl and wget
[Discuss (0) | Link to this hack]

HACK
#27

More Advanced wget Techniques
wget has a huge number of features that can make downloading data from the web easier than sitting down and rolling your own Perl script. Here, we'll cover some of the more useful configuration options
[Discuss (0) | Link to this hack]

HACK
#28

Using Pipes to Chain Commands
Chaining commands into a one-liner can make for powerful functionality
[Discuss (0) | Link to this hack]

HACK
#29

Running Multiple Utilities at Once
You've got scrapers, spiders, and robots aplenty, all to run daily according to a particular schedule. Should you set up a half-dozen cron jobs, or combine them into one script?
[Discuss (0) | Link to this hack]

HACK
#30

Utilizing the Web Scraping Proxy
With the use of a Perl proxy, you'll be able to browse web sites and have the LWP code written out automatically for you. Although not perfect, it can certainly be a time saver
The Code
[Discuss (0) | Link to this hack]

HACK
#31

Being Warned When Things Go Wrong
When you're writing any script that operates on data you don't control, from either a database, a text file, or a resource on the Internet, it's always a good idea to add a healthy dose of error checking
[Discuss (0) | Link to this hack]

HACK
#32

Being Adaptive to Site Redesigns
It's a typical story: you work all night long to create the perfect script to solve all your woes, and when you wake in the morning ready to run it "for real," you find the site you're scraping has changed its URLs or HTML
[Discuss (0) | Link to this hack]

Collecting Media Files

HACK
#33

Detective Case Study: Newgrounds
Learn how to gumshoe your way through a site's workflow, regardless of whether there are pop-up windows, JavaScripts, frames, or other bits of obscuring technology
The Code
[Discuss (0) | Link to this hack]

HACK
#34

Detective Case Study: iFilm
Sometimes, the detective work is more complicated than the solution
The Code
[Discuss (0) | Link to this hack]

HACK
#35

Downloading Movies from the Library of Congress
Often, downloading from the Web is accomplished more easily with a little exploration and a command-line utility or favorite browser than with even the most accomplished programming
[Discuss (0) | Link to this hack]

HACK
#36

Downloading Images from Webshots
Search a large collection of community-contributed images, based on keywords of your choice, and then download the visual findings
The Code
[Discuss (0) | Link to this hack]

HACK
#37

Downloading Comics with dailystrips
Love comics but hate visiting multiple sites for your daily dose? Automate your stripping with some easy-to-use open source Perl software
[Discuss (0) | Link to this hack]

HACK
#38

Archiving Your Favorite Webcams
Got a number of scenic or strategically placed webcams you watch daily? Or would like to ensure that your coworkers are actually doing the work you've assigned them? Keep on top of your pictorial problems with Python
The Code
[Discuss (1) | Link to this hack]

HACK
#39

News Wallpaper for Your Site
Grab today's news images for your web site or as an RSS feed, suitable for viewing in your favorite syndicated news application
The Code
[Discuss (0) | Link to this hack]

HACK
#40

Saving Only POP3 Email Attachments
Get oodles of attachments from mailing lists and friends? Learn how to save them to your hard drive automatically with a little Perl voodoo
The Code
[Discuss (1) | Link to this hack]

HACK
#41

Downloading MP3s from a Playlist
Automatically save the MP3 files that make up an M3U playlist
The Code
[Discuss (4) | Link to this hack]

HACK
#42

Downloading from Usenet with nget
Even though common wisdom states that porn peddlers and spam pushers have overrun Usenet, there are still a number of groups resolutely producing good content for good folks. In this hack, we'll show how to download files from news groups of your choice
[Discuss (1) | Link to this hack]

Gleaning Data from Databases

HACK
#43

Archiving Yahoo! Groups Messages with yahoo2mbox
Looking to keep a local archive of your favorite mailing list? With yahoo2mbox, you can import the final results into your favorite mailer
[Discuss (0) | Link to this hack]

HACK
#44

Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Yahoo! Groups makes it easy to run an email discussion group at no cost. Sadly, there's no simple way to download all the messages—until now
The Code
[Discuss (2) | Link to this hack]

HACK
#45

Gleaning Buzz from Yahoo!
Stay hip with the latest Yahoo! Buzz search results
The Code
[Discuss (1) | Link to this hack]

HACK
#46

Spidering the Yahoo! Catalog
Writing a spider to spider an existing spider's site may seem convoluted, but it can prove useful when you're looking for location-based services. This hack walks through creating a framework for full-site spidering, including additional filters to lessen your load
The Code
[Discuss (0) | Link to this hack]

HACK
#47

Tracking Additions to Yahoo!
Keep track of the number of sites added to your favorite Yahoo! categories
The Code
[Discuss (0) | Link to this hack]

HACK
#48

Scattersearch with Yahoo! and Google
Sometimes, illuminating results can be found when scraping from one site and feeding the results into the API of another. With scattersearching, you can narrow down the most popular related results, as suggested by Yahoo! and Google
The Code
[Discuss (0) | Link to this hack]

HACK
#49

Yahoo! Directory Mindshare in Google
How does link popularity compare in Yahoo!'s searchable subject index versus Google's full-text index? Find out by calculating mindshare!
The Code
[Discuss (3) | Link to this hack]

HACK
#50

Weblog-Free Google Results
With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we'll show you how to remove the weblog factor from your Google results
The Code
[Discuss (0) | Link to this hack]

HACK
#51

Spidering, Google, and Multiple Domains
When you want to search a site, you tend to go straight to the site itself and use its native capabilities. But what if you could use Google to search across many similar sites, scraping the pages of most relevance?
The Code
[Discuss (0) | Link to this hack]

HACK
#52

Scraping Amazon.com Product Reviews
While Amazon.com has made some reviews available through their Web Services API, most are available only at the Amazon.com web site, requiring a little screen scraping to grab them
The Code
[Discuss (0) | Link to this hack]

HACK
#53

Receive an Email Alert for Newly Added Amazon.com Reviews
This hack keeps an eye on Amazon.com and notifies you, via email, when a new product review is posted to items you're tracking
The Code
[Discuss (0) | Link to this hack]

HACK
#54

Scraping Amazon.com Customer Advice
Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com's public Web Services API. In this hack, we'll implement a script to scrape customer buying advice
The Code
[Discuss (0) | Link to this hack]

HACK
#55

Publishing Amazon.com Associates Statistics
Share some insider knowledge, such as the most popular item sold, with your site's audience by republishing your Amazon.com Associates sales statistics
The Code
[Discuss (3) | Link to this hack]

HACK
#56

Sorting Amazon.com Recommendations by Rating
Find the highest-rated items among your Amazon.com product recommendations
The Code
[Discuss (0) | Link to this hack]

HACK
#57

Related Amazon.com Products with Alexa
Given any URL, Alexa will return traffic data, user ratings, and even related Amazon.com products. This hack creates a cloud of related product data for any given URL
The Code
[Discuss (0) | Link to this hack]

HACK
#58

Scraping Alexa's Competitive Data with Java
Alexa tracks the browsing habits of its millions of users daily. This hack allows you to aggregate the traffic statistics of multiple web properties into one RSS file, with subscriptions available daily
The Code
[Discuss (2) | Link to this hack]

HACK
#59

Finding Album Information with FreeDB and Amazon.com
By combining identifying information from one database with related information from another, you can create powerful applications with little effort
[Discuss (0) | Link to this hack]

HACK
#60

Expanding Your Musical Tastes
Looking for new music to complement your stale collection? With this script, you'll be able to pass some names of your favorite artists, and get Audioscrobbler recommendations
The Code
[Discuss (0) | Link to this hack]

HACK
#61

Saving Daily Horoscopes to Your iPod
You've got a zillion songs on your new iPod, and you're traveling around town oblivious to the sounds of the city. Worried about getting hit by a car, finding that special someone, or knowing when to ask for that raise? Take your horoscope along with you by running this hack daily
The Code
[Discuss (0) | Link to this hack]

HACK
#62

Graphing Data with RRDTOOL
Graphing data over time, either by itself or in comparison with another dataset, is the Holy Grail of analytical research. With the use of RRDTOOL, you'll be able to store and display time-series data
The Code
[Discuss (0) | Link to this hack]

HACK
#63

Stocking Up on Financial Quotes
Keeping track of multiple stocks can be a cumbersome task, but using the Finance::Quote Perl module can greatly simplify it. And, while we're at it, we'll generate pretty graphs with RRDTOOL
The Code
[Discuss (1) | Link to this hack]

HACK
#64

Super Author Searching
By combining multiple sites into one powerful script, you can get aggregated data results that are more complete than just one site could give
[Discuss (0) | Link to this hack]

HACK
#65

Mapping O'Reilly Best Sellers to Library Popularity
If you're using Google to look for books in university libraries, you'll get better results using a Library of Congress Number than a plain old ISBN
The Code
[Discuss (0) | Link to this hack]

HACK
#66

Using All Consuming to Get Book Lists
You can retrieve a list of the most-mentioned books in the weblog community, as well as personal book lists and recommendations, through either of All Consuming's two web service APIs
[Discuss (0) | Link to this hack]

HACK
#67

Tracking Packages with FedEx
When you absolutely, positively have to know where your package is right now!
The Code
[Discuss (0) | Link to this hack]

HACK
#68

Checking Blogs for New Comments
Tend to respond directly to weblog posts with a comment or three? Ever wonder about the reactions to your comments? This hack automates the process of keeping up with the conversation you started
The Code
[Discuss (0) | Link to this hack]

HACK
#69

Aggregating RSS and Posting Changes
With the proliferation of individual and group weblogs, it's typical for one person to post in multiple places. Thanks to RSS syndication, you can easily aggregate all your disparate posts into one weblog
The Code
[Discuss (1) | Link to this hack]

HACK
#70

Using the Link Cosmos of Technorati
Similar to other indexing sites like Blogdex, the Link Cosmos at Technorati keeps track of an immense number of blogs, correlating popular links and topics for all to see. With the recently released API, developers can now integrate the results into their own scripts
[Discuss (0) | Link to this hack]

HACK
#71

Finding Related RSS Feeds
If you're a regular reader of weblogs, you know that most syndicate their content in a format called RSS. By querying aggregated RSS databases, you can find related sites you may be interested in reading
[Discuss (0) | Link to this hack]

HACK
#72

Automatically Finding Blogs of Interest
An easy way to find interesting new sites is to peruse an existing site's blogroll: a listing of blogs they read regularly. Let's create a spider to automate this by looking for keywords in the content of outbound links
The Code
[Discuss (0) | Link to this hack]

HACK
#73

Scraping TV Listings
Freeing yourself from flipping through a weekly publication by visiting the TV Guide Online web site might sound like a good idea, but being forced to load heavy pages, showing only hours at a time and channels you don't care for, isn't exactly the utopia for which you were hoping
The Code
[Discuss (0) | Link to this hack]

HACK
#74

What's Your Visitor's Weather Like?
You have a web site, as most people do, and you're interested in getting a general idea of what your visitors' weather is like. Want to know if you get more comments when it's raining or sunny? With the groundwork laid in this hack, that and other nonsense will be readily available
The Code
[Discuss (0) | Link to this hack]

HACK
#75

Trendspotting with Geotargeting
Compare the relative popularity of a trend or fashion in different locations, using only Google and Directi search results
The Code
[Discuss (0) | Link to this hack]

HACK
#76

Getting the Best Travel Route by Train
A web scraper can help you find faster train connections in Europe
The Code
[Discuss (0) | Link to this hack]

HACK
#77

Geographic Distance and Back Again
When you're traveling from one place to another, it's usually handy to know exactly how many miles you're going to be on the road. One of the best ways to get the most accurate result is to use latitude and longitude
The Code
[Discuss (2) | Link to this hack]
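
The blurb does not say which formula the hack uses to turn latitude and longitude into miles. One common choice is the haversine great-circle formula, sketched here in Python as an assumption rather than as the book's method. The Earth radius of 3958.8 miles is an approximate mean value, and the function name is made up for this example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius; an assumption


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # One degree of longitude along the equator.
    print(round(great_circle_miles(0.0, 0.0, 0.0, 1.0), 2))
```

The result is a straight-line distance over the surface of a sphere, so it will be shorter than the miles actually driven on roads.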

HACK
#78

Super Word Lookup
Working on a paper, book, or thesis and need a nerdy definition of one word, and alternatives to another?
The Code
[Discuss (0) | Link to this hack]

HACK
#79

Word Associations with Lexical Freenet
There will come a time when you want a little more than simple word definitions, synonyms, or etymologies. Lexical Freenet takes you beyond these simple results, providing associative data, or "paths," from your word to others
The Code
[Discuss (0) | Link to this hack]

HACK
#80

Reformatting Bugtraq Reports
Since Bugtraq is such an important part of a security administrator's watch list, it'll only be a matter of time before you'll want to integrate it more closely with your daily habits
The Code
[Discuss (0) | Link to this hack]

HACK
#81

Keeping Tabs on the Web via Email
If you find yourself checking your email more than cruising the Web, you might appreciate a little Perl work to bring the Web to your mailbox
[Discuss (0) | Link to this hack]

HACK
#82

Publish IE's Favorites to Your Web Site
You're surfing at a friend's house and think, "What is that URL? I have a link to it in my favorites. I wish I were home." How about making your favorites available no matter where you go?
The Code
[Discuss (1) | Link to this hack]

HACK
#83

Spidering GameStop.com Game Prices
Looking to get notification when "Army Men: Quest for Some Semblance of Quality" goes on sale at $5.99? With this hack, you'll be able to keep an eye on your most desired (or derisive) video game titles
The Code
[Discuss (0) | Link to this hack]

HACK
#84

Bargain Hunting with PHP
If you're always on the lookout for the best deals, coupons, and contests, a little bit of PHP-scraping code can help you stay up-to-date
The Code
[Discuss (1) | Link to this hack]

HACK
#85

Aggregating Multiple Search Engine Results
Even though Google may solve all your searching needs on a daily basis, there may come a time when you need a "super search"—something that queries multiple search engines or databases at once
The Code
[Discuss (0) | Link to this hack]

HACK
#86

Robot Karaoke
Who says people get to have all the fun? With this hack, you can let your computer do a little singing, by scraping the LyricsFreak.com web site and sending the results to a text-to-speech translator
The Code
[Discuss (1) | Link to this hack]

HACK
#87

Searching the Better Business Bureau
Is that new company offering to build your house, deliver your groceries, and walk your dog legit and free of complaint? Find out with an automated query of the Better Business Bureau's web site
The Code
[Discuss (0) | Link to this hack]

HACK
#88

Searching for Health Inspections
How healthy are the restaurants in your neighborhood? And when you find a good one, how do you get there? By combining databases with maps!
The Code
[Discuss (0) | Link to this hack]

HACK
#89

Filtering for the Naughties
Use search engines to construct your own parental control ratings for sites
The Code
[Discuss (0) | Link to this hack]

Maintaining Your Collections

HACK
#90

Using cron to Automate Tasks
Run scripts on a repetitive basis with the cron utility
[Discuss (0) | Link to this hack]

HACK
#91

Scheduling Tasks Without cron
If you want to run any of the hacks in this book on a regular basis, your best option is to use cron, a powerful Unix scheduler. But what if you're on a different OS or don't have access to cron for some other reason?
[Discuss (0) | Link to this hack]

HACK
#92

Mirroring Web Sites with wget and rsync
Is there a site you check frequently, or do you want a backup of your own site? Various mirroring tools are available that can ensure you're creating duplicate and complete backups on another machine
[Discuss (0) | Link to this hack]

HACK
#93

Accumulating Search Results Over Time
Graphing search results over time can lead to interesting discoveries
The Code
[Discuss (0) | Link to this hack]

Giving Back to the World

HACK
#94

Using XML::RSS to Repurpose Data
By using the popular syndication format known as RSS, you can use your newly scraped data in dozens of different aggregators, toolkits, and more
The Code
[Discuss (0) | Link to this hack]

HACK
#95

Placing RSS Headlines on Your Site
Place other sites' syndicated headlines on your own pages, refreshed periodically
The Code
[Discuss (1) | Link to this hack]

HACK
#96

Making Your Resources Scrapable with Regular Expressions
A few tricks can make your web page data easier to parse, without needing complicated HTML libraries or convoluted logic. The benefits extend to more than just visitors; your own HTML will be more understandable too
[Discuss (0) | Link to this hack]

HACK
#97

Making Your Resources Scrapable with a REST Interface
Consider offering alternative versions of site documents for a variety of human and machine visitors, based on how they present themselves
[Discuss (0) | Link to this hack]

HACK
#98

Making Your Resources Scrapable with XML-RPC
If you want to make your site's information accessible to lots of aspiring spider builders, don't worry about regular expressions. Just add a little XML-RPC
[Discuss (0) | Link to this hack]

HACK
#99

Creating an IM Interface
Add some Perl code here, an AOL Instant Messenger account there, and one of your favorite scraping scripts, and you have yourself an automated instant-messaging bot
The Code
[Discuss (0) | Link to this hack]

HACK
#100

Going Beyond the Book
As much as we would have liked to deliver a 1,500-page tome, sooner or later you're going to have to think outside the confines of this book
[Discuss (0) | Link to this hack]



© 2007 O'Reilly Media, Inc.

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.