Chapter 3. Starting to Crawl
So far, the examples in the book have covered single static pages, with somewhat artificial canned examples. In this chapter, we’ll start looking at some real-world problems, with scrapers traversing multiple pages and even multiple sites.
Web crawlers are called such because they crawl across the Web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.
Beware, however: just because you can crawl the Web doesn’t mean that you always should. The scrapers used in previous examples work great in situations where all the data you need is on a single page. With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine if there’s a way to make the target server’s load easier.
Traversing a Single Domain
Even if you haven’t heard of “Six Degrees of Wikipedia,” you’ve almost certainly heard of its namesake, “Six Degrees of Kevin Bacon.” In both games, the goal is to link two unlikely subjects (in the first case, Wikipedia articles that link to each other, in the second case, actors appearing in the same film) by a chain containing no more than six total (including the two original subjects).
For example, Eric Idle appeared in Dudley Do-Right with Brendan Fraser, who appeared in The Air I Breathe with Kevin Bacon.1 In this case, the chain from Eric Idle to Kevin Bacon is only three subjects ...
Get Web Scraping with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.