- Chapter 1 Walking Softly
  - Hacks #1-7
  - A Crash Course in Spidering and Scraping
  - Best Practices for You and Your Spider
  - Anatomy of an HTML Page
  - Registering Your Spider
  - Preempting Discovery
  - Keeping Your Spider Out of Sticky Situations
  - Finding the Patterns of Identifiers
- Chapter 2 Assembling a Toolbox
  - Hacks #8-32
  - Perl Modules
  - Resources You May Find Helpful
  - Installing Perl Modules
  - Simply Fetching with LWP::Simple
  - More Involved Requests with LWP::UserAgent
  - Adding HTTP Headers to Your Request
  - Posting Form Data with LWP
  - Authentication, Cookies, and Proxies
  - Handling Relative and Absolute URLs
  - Secured Access and Browser Attributes
  - Respecting Your Scrapee's Bandwidth
  - Respecting robots.txt
  - Adding Progress Bars to Your Scripts
  - Scraping with HTML::TreeBuilder
  - Parsing with HTML::TokeParser
  - WWW::Mechanize 101
  - Scraping with WWW::Mechanize
  - In Praise of Regular Expressions
  - Painless RSS with Template::Extract
  - A Quick Introduction to XPath
  - Downloading with curl and wget
  - More Advanced wget Techniques
  - Using Pipes to Chain Commands
  - Running Multiple Utilities at Once
  - Utilizing the Web Scraping Proxy
  - Being Warned When Things Go Wrong
  - Being Adaptive to Site Redesigns
- Chapter 3 Collecting Media Files
  - Hacks #33-42
  - Detective Case Study: Newgrounds
  - Detective Case Study: iFilm
  - Downloading Movies from the Library of Congress
  - Downloading Images from Webshots
  - Downloading Comics with dailystrips
  - Archiving Your Favorite Webcams
  - News Wallpaper for Your Site
  - Saving Only POP3 Email Attachments
  - Downloading MP3s from a Playlist
  - Downloading from Usenet with nget
- Chapter 4 Gleaning Data from Databases
  - Hacks #43-89
  - Archiving Yahoo! Groups Messages with yahoo2mbox
  - Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
  - Gleaning Buzz from Yahoo!
  - Spidering the Yahoo! Catalog
  - Tracking Additions to Yahoo!
  - Scattersearch with Yahoo! and Google
  - Yahoo! Directory Mindshare in Google
  - Weblog-Free Google Results
  - Spidering, Google, and Multiple Domains
  - Scraping Amazon.com Product Reviews
  - Receive an Email Alert for Newly Added Amazon.com Reviews
  - Scraping Amazon.com Customer Advice
  - Publishing Amazon.com Associates Statistics
  - Sorting Amazon.com Recommendations by Rating
  - Related Amazon.com Products with Alexa
  - Scraping Alexa's Competitive Data with Java
  - Finding Album Information with FreeDB and Amazon.com
  - Expanding Your Musical Tastes
  - Saving Daily Horoscopes to Your iPod
  - Graphing Data with RRDTOOL
  - Stocking Up on Financial Quotes
  - Super Author Searching
  - Mapping O'Reilly Best Sellers to Library Popularity
  - Using All Consuming to Get Book Lists
  - Tracking Packages with FedEx
  - Checking Blogs for New Comments
  - Aggregating RSS and Posting Changes
  - Using the Link Cosmos of Technorati
  - Finding Related RSS Feeds
  - Automatically Finding Blogs of Interest
  - Scraping TV Listings
  - What's Your Visitor's Weather Like?
  - Trendspotting with Geotargeting
  - Getting the Best Travel Route by Train
  - Geographic Distance and Back Again
  - Super Word Lookup
  - Word Associations with Lexical Freenet
  - Reformatting Bugtraq Reports
  - Keeping Tabs on the Web via Email
  - Publish IE's Favorites to Your Web Site
  - Spidering GameStop.com Game Prices
  - Bargain Hunting with PHP
  - Aggregating Multiple Search Engine Results
  - Robot Karaoke
  - Searching the Better Business Bureau
  - Searching for Health Inspections
  - Filtering for the Naughties
- Chapter 5 Maintaining Your Collections
  - Hacks #90-93
  - Using cron to Automate Tasks
  - Scheduling Tasks Without cron
  - Mirroring Web Sites with wget and rsync
  - Accumulating Search Results Over Time
- Chapter 6 Giving Back to the World
  - Hacks #94-100
  - Using XML::RSS to Repurpose Data
  - Placing RSS Headlines on Your Site
  - Making Your Resources Scrapable with Regular Expressions
  - Making Your Resources Scrapable with a REST Interface
  - Making Your Resources Scrapable with XML-RPC
  - Creating an IM Interface
  - Going Beyond the Book
- Colophon

- Title: Spidering Hacks
- By: Kevin Hemenway, Tara Calishain
- Publisher: O'Reilly Media
- Formats: Ebook, Safari Books Online
- Print Release: October 2003
- Ebook Release: June 2009
- Pages: 432
- Print ISBN: 978-0-596-00577-1 | ISBN 10: 0-596-00577-6
- Ebook ISBN: 978-0-596-10428-3 | ISBN 10: 0-596-10428-6

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.
The tool on the cover of Spidering Hacks is a flex scraper. Flex scrapers are sometimes referred to as putty knives or push scrapers. These rugged tools are commonly used for light-duty construction or home projects, such as wallpapering, painting, or woodworking. Flex scrapers are usually three inches wide, with steel blades ground thinner than a typical putty knife to give maximum flexibility. Thus, they are the perfect choice for applying lighter compounds over broader areas and at a faster rate than putty knives. High-end flex scrapers have ergonomic handles designed to fit the hand and reduce fatigue. Just as a well-designed flex scraper gives improved blade control, so too does a well-designed spidering or scraping hack give greater control and flexibility when gathering information from the Web and automating and speeding complex tasks.
Genevieve d'Entremont was the production editor for Spidering Hacks. Brian Sawyer was the copyeditor. Matt Hutchinson proofread the book. Derek Di Matteo, Marlowe Shaeffer, and Claire Cloutier provided quality control. Julie Hawks wrote the index.
Emma Colby designed the cover of this book, based on a series design by Edie Freedman. The cover image is an original photograph by Emma Colby. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's Helvetica Neue and ITC Garamond fonts.
David Futato designed the interior layout. This book was converted from Microsoft Word to FrameMaker 5.5.6 by Andrew Savikas. The text font is Linotype Birka; the heading font is Adobe Helvetica Neue Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. This colophon was written by Derek Di Matteo.
