Buying Options
Spidering Hacks
Print $24.95
Print+Ebook $27.45
Ebook (PDF) $19.99
Safari Books Online
Print £18.99
Description
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.
Table of Contents
  1. Chapter 1 Walking Softly

    1. Hacks #1-7

    2. A Crash Course in Spidering and Scraping

    3. Best Practices for You and Your Spider

    4. Anatomy of an HTML Page

    5. Registering Your Spider

    6. Preempting Discovery

    7. Keeping Your Spider Out of Sticky Situations

    8. Finding the Patterns of Identifiers

  2. Chapter 2 Assembling a Toolbox

    1. Hacks #8-32

    2. Perl Modules

    3. Resources You May Find Helpful

    4. Installing Perl Modules

    5. Simply Fetching with LWP::Simple

    6. More Involved Requests with LWP::UserAgent

    7. Adding HTTP Headers to Your Request

    8. Posting Form Data with LWP

    9. Authentication, Cookies, and Proxies

    10. Handling Relative and Absolute URLs

    11. Secured Access and Browser Attributes

    12. Respecting Your Scrapee's Bandwidth

    13. Respecting robots.txt

    14. Adding Progress Bars to Your Scripts

    15. Scraping with HTML::TreeBuilder

    16. Parsing with HTML::TokeParser

    17. WWW::Mechanize 101

    18. Scraping with WWW::Mechanize

    19. In Praise of Regular Expressions

    20. Painless RSS with Template::Extract

    21. A Quick Introduction to XPath

    22. Downloading with curl and wget

    23. More Advanced wget Techniques

    24. Using Pipes to Chain Commands

    25. Running Multiple Utilities at Once

    26. Utilizing the Web Scraping Proxy

    27. Being Warned When Things Go Wrong

    28. Being Adaptive to Site Redesigns

  3. Chapter 3 Collecting Media Files

    1. Hacks #33-42

    2. Detective Case Study: Newgrounds

    3. Detective Case Study: iFilm

    4. Downloading Movies from the Library of Congress

    5. Downloading Images from Webshots

    6. Downloading Comics with dailystrips

    7. Archiving Your Favorite Webcams

    8. News Wallpaper for Your Site

    9. Saving Only POP3 Email Attachments

    10. Downloading MP3s from a Playlist

    11. Downloading from Usenet with nget

  4. Chapter 4 Gleaning Data from Databases

    1. Hacks #43-89

    2. Archiving Yahoo! Groups Messages with yahoo2mbox

    3. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

    4. Gleaning Buzz from Yahoo!

    5. Spidering the Yahoo! Catalog

    6. Tracking Additions to Yahoo!

    7. Scattersearch with Yahoo! and Google

    8. Yahoo! Directory Mindshare in Google

    9. Weblog-Free Google Results

    10. Spidering, Google, and Multiple Domains

    11. Scraping Amazon.com Product Reviews

    12. Receive an Email Alert for Newly Added Amazon.com Reviews

    13. Scraping Amazon.com Customer Advice

    14. Publishing Amazon.com Associates Statistics

    15. Sorting Amazon.com Recommendations by Rating

    16. Related Amazon.com Products with Alexa

    17. Scraping Alexa's Competitive Data with Java

    18. Finding Album Information with FreeDB and Amazon.com

    19. Expanding Your Musical Tastes

    20. Saving Daily Horoscopes to Your iPod

    21. Graphing Data with RRDTOOL

    22. Stocking Up on Financial Quotes

    23. Super Author Searching

    24. Mapping O'Reilly Best Sellers to Library Popularity

    25. Using All Consuming to Get Book Lists

    26. Tracking Packages with FedEx

    27. Checking Blogs for New Comments

    28. Aggregating RSS and Posting Changes

    29. Using the Link Cosmos of Technorati

    30. Finding Related RSS Feeds

    31. Automatically Finding Blogs of Interest

    32. Scraping TV Listings

    33. What's Your Visitor's Weather Like?

    34. Trendspotting with Geotargeting

    35. Getting the Best Travel Route by Train

    36. Geographic Distance and Back Again

    37. Super Word Lookup

    38. Word Associations with Lexical Freenet

    39. Reformatting Bugtraq Reports

    40. Keeping Tabs on the Web via Email

    41. Publish IE's Favorites to Your Web Site

    42. Spidering GameStop.com Game Prices

    43. Bargain Hunting with PHP

    44. Aggregating Multiple Search Engine Results

    45. Robot Karaoke

    46. Searching the Better Business Bureau

    47. Searching for Health Inspections

    48. Filtering for the Naughties

  5. Chapter 5 Maintaining Your Collections

    1. Hacks #90-93

    2. Using cron to Automate Tasks

    3. Scheduling Tasks Without cron

    4. Mirroring Web Sites with wget and rsync

    5. Accumulating Search Results Over Time

  6. Chapter 6 Giving Back to the World

    1. Hacks #94-100

    2. Using XML::RSS to Repurpose Data

    3. Placing RSS Headlines on Your Site

    4. Making Your Resources Scrapable with Regular Expressions

    5. Making Your Resources Scrapable with a REST Interface

    6. Making Your Resources Scrapable with XML-RPC

    7. Creating an IM Interface

    8. Going Beyond the Book

  1. Colophon

Product Details
Title: Spidering Hacks
By: Kevin Hemenway, Tara Calishain
Publisher: O'Reilly Media
Formats: Print, Ebook, Safari Books Online
Print Release: October 2003
Ebook Release: June 2009
Pages: 432
Print ISBN: 978-0-596-00577-1 | ISBN 10: 0-596-00577-6
Ebook ISBN: 978-0-596-10428-3 | ISBN 10: 0-596-10428-6
About the Authors
  1. Kevin Hemenway

    Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.

  2. Tara Calishain

    Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

The tool on the cover of Spidering Hacks is a flex scraper. Flex scrapers are sometimes referred to as putty knives or push scrapers. These rugged tools are commonly used for light-duty construction or home projects, such as wallpapering, painting, or woodworking. Flex scrapers are usually three inches wide, with steel blades ground thinner than a typical putty knife to give maximum flexibility. Thus, they are the perfect choice for applying lighter compounds over broader areas and at a faster rate than putty knives. High-end flex scrapers have ergonomic handles designed to fit the hand and reduce fatigue. Just as a well-designed flex scraper gives improved blade control, so too does a well-designed spidering or scraping hack give greater control and flexibility when gathering information from the Web and automating and speeding complex tasks.

Genevieve d'Entremont was the production editor for Spidering Hacks. Brian Sawyer was the copyeditor. Matt Hutchinson proofread the book. Derek Di Matteo, Marlowe Shaeffer, and Claire Cloutier provided quality control. Julie Hawks wrote the index.

Emma Colby designed the cover of this book, based on a series design by Edie Freedman. The cover image is an original photograph by Emma Colby. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's Helvetica Neue and ITC Garamond fonts.

David Futato designed the interior layout. This book was converted from Microsoft Word to FrameMaker 5.5.6 by Andrew Savikas. The text font is Linotype Birka; the heading font is Adobe Helvetica Neue Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. This colophon was written by Derek Di Matteo.
