Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders
You've begun developing your web scraping skills, learning in Chapter 11 how to decide what, how, and where to scrape. In this chapter, we'll take a look at more advanced scrapers, such as browser-based scrapers and spiders, to gather content.
We'll also learn about debugging common problems with advanced web scraping and cover some of the ethical questions that arise when scraping the Web. To begin, we'll investigate browser-based web scraping: using a browser directly with Python to scrape content from the Web.
Browser-Based Parsing
Sometimes a site uses a lot of JavaScript or other post-page-load code to populate its pages with content. In these cases, it's almost impossible to use a normal web scraper to analyze the site, because the scraper sees only the HTML the server returns and never runs the code that fills it in. What you'll end up with is a very empty-looking page. You'll have the same problem if you want to interact with pages (for example, if you need to click on a button or enter some search text). In either situation, you'll want to figure out how to screen read the page. Screen readers work by using a browser, opening the page, and reading and interacting with the page after it loads in the browser.
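To make the "empty-looking page" problem concrete, here is a minimal sketch using only Python's standard library. The HTML string, the `results` id, and the helper names are all hypothetical, invented for illustration. The point is that a parser working on the raw HTML never runs the script, so the container the script would fill stays empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and a script
# fills it in only after the page loads in a browser.
RAW_HTML = """
<html>
  <body>
    <h1>Search results</h1>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML = "<p>First result</p>";
    </script>
  </body>
</html>
"""


class ResultsTextParser(HTMLParser):
    """Collect the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 means we are outside the target div
        self.chunks = []  # pieces of text seen inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the target: track nested divs so we know
            # which closing tag ends the target.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "results":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_results_text(html):
    """Return the text a non-browser scraper would find in the results div."""
    parser = ResultsTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    # The script was never executed, so the container has no text.
    print(repr(static_results_text(RAW_HTML)))
```

A browser would run the script and show "First result" inside the container, which is why reading the page after it loads in a browser recovers content that the static parse misses.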
Tip
Screen readers are great for tasks performed by walking through a series of actions to get information. For this very reason, screen reader scripts are also an easy way to automate routine web tasks.
The most commonly used screen reading library in Python is Selenium. Selenium is a Java ...
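As a rough sketch of what walking through a series of actions can look like with Selenium, the function below opens a page, types into a search field, submits it, waits for results to appear, and reads their text. It assumes the Selenium 4 style Python API (`By` locators and `WebDriverWait`) and a locally installed Firefox; older Selenium releases used different locator method names. The function name, its parameters, and any URL, field name, or selector you pass in are hypothetical and depend on the site you are working with.

```python
def search_and_read(url, field_name, query, result_selector, timeout=10):
    """Open a page, run a search, and return the text of each result.

    All arguments are placeholders chosen for this sketch; the right
    field name and CSS selector depend on the page being scraped.
    """
    # Imports live inside the function so the sketch can be defined
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumes Firefox is available locally
    try:
        driver.get(url)

        # Interact with the page: type the query and submit the form.
        box = driver.find_element(By.NAME, field_name)
        box.send_keys(query)
        box.submit()

        # Wait until the post-load code has added at least one result.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, result_selector))
        )

        elements = driver.find_elements(By.CSS_SELECTOR, result_selector)
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if a step above fails.
        driver.quit()
```

A call might look like `search_and_read("https://example.com/search", "q", "python", "div.result")`, where every argument is a made-up value to replace with details from the real page.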