Spidering Hacks
100 Industrial-Strength Tips & Tools
By Kevin Hemenway, Tara Calishain
October 2003
Pages: 424
Table of Contents
Chapter 1: Walking Softly (Hacks #1-7)
A Crash Course in Spidering and Scraping
Best Practices for You and Your Spider
Anatomy of an HTML Page
Registering Your Spider
Preempting Discovery
Keeping Your Spider Out of Sticky Situations
Finding the Patterns of Identifiers
Chapter 2: Assembling a Toolbox (Hacks #8-32)
Perl Modules
Resources You May Find Helpful
Installing Perl Modules
Simply Fetching with LWP::Simple
More Involved Requests with LWP::UserAgent
Adding HTTP Headers to Your Request
Posting Form Data with LWP
Authentication, Cookies, and Proxies
Handling Relative and Absolute URLs
Secured Access and Browser Attributes
Respecting Your Scrapee's Bandwidth
Respecting robots.txt
Adding Progress Bars to Your Scripts
Scraping with HTML::TreeBuilder
Parsing with HTML::TokeParser
WWW::Mechanize 101
Scraping with WWW::Mechanize
In Praise of Regular Expressions
Painless RSS with Template::Extract
A Quick Introduction to XPath
Downloading with curl and wget
More Advanced wget Techniques
Using Pipes to Chain Commands
Running Multiple Utilities at Once
Utilizing the Web Scraping Proxy
Being Warned When Things Go Wrong
Being Adaptive to Site Redesigns
Chapter 3: Collecting Media Files (Hacks #33-42)
Detective Case Study: Newgrounds
Detective Case Study: iFilm
Downloading Movies from the Library of Congress
Downloading Images from Webshots
Downloading Comics with dailystrips
Archiving Your Favorite Webcams
News Wallpaper for Your Site
Saving Only POP3 Email Attachments
Downloading MP3s from a Playlist
Downloading from Usenet with nget
Chapter 4: Gleaning Data from Databases (Hacks #43-89)
Archiving Yahoo! Groups Messages with yahoo2mbox
Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Gleaning Buzz from Yahoo!
Spidering the Yahoo! Catalog
Tracking Additions to Yahoo!
Scattersearch with Yahoo! and Google
Yahoo! Directory Mindshare in Google
Weblog-Free Google Results
Spidering, Google, and Multiple Domains
Scraping Amazon.com Product Reviews
Receive an Email Alert for Newly Added Amazon.com Reviews
Scraping Amazon.com Customer Advice
Publishing Amazon.com Associates Statistics
Sorting Amazon.com Recommendations by Rating
Related Amazon.com Products with Alexa
Scraping Alexa's Competitive Data with Java
Finding Album Information with FreeDB and Amazon.com
Expanding Your Musical Tastes
Saving Daily Horoscopes to Your iPod
Graphing Data with RRDTOOL
Stocking Up on Financial Quotes
Super Author Searching
Mapping O'Reilly Best Sellers to Library Popularity
Using All Consuming to Get Book Lists
Tracking Packages with FedEx
Checking Blogs for New Comments
Aggregating RSS and Posting Changes
Using the Link Cosmos of Technorati
Finding Related RSS Feeds
Automatically Finding Blogs of Interest
Scraping TV Listings
What's Your Visitor's Weather Like?
Trendspotting with Geotargeting
Getting the Best Travel Route by Train
Geographic Distance and Back Again
Super Word Lookup
Word Associations with Lexical Freenet
Reformatting Bugtraq Reports
Keeping Tabs on the Web via Email
Publish IE's Favorites to Your Web Site
Spidering GameStop.com Game Prices
Bargain Hunting with PHP
Aggregating Multiple Search Engine Results
Robot Karaoke
Searching the Better Business Bureau
Searching for Health Inspections
Filtering for the Naughties
Chapter 5: Maintaining Your Collections (Hacks #90-93)
Using cron to Automate Tasks
Scheduling Tasks Without cron
Mirroring Web Sites with wget and rsync
Accumulating Search Results Over Time
Chapter 6: Giving Back to the World (Hacks #94-100)
Using XML::RSS to Repurpose Data
Placing RSS Headlines on Your Site
Making Your Resources Scrapable with Regular Expressions
Making Your Resources Scrapable with a REST Interface
Making Your Resources Scrapable with XML-RPC
Creating an IM Interface
Going Beyond the Book
Colophon