Chapter 3. Scraping Websites and Extracting Data

You will often visit a website and find its content interesting. If the site has only a few pages, you can read everything yourself. But as soon as there is a considerable amount of content, reading it all manually is no longer feasible.

To use the powerful text analytics blueprints described in this book, you have to acquire the content first. Most websites won’t have a “download all content” button, so we have to find a clever way to download (“scrape”) the pages.
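As a first impression of what "scraping" means in practice, here is a minimal sketch that downloads a single page with the requests library and stores it locally. The URL and the User-Agent string are placeholders for illustration, not the setup used later in the chapter:

```python
import requests

# Placeholder URL; substitute the page you actually want to download.
url = "https://www.reuters.com/article/example-article-id"

# Identify ourselves with a User-Agent header; many sites reject empty ones.
response = requests.get(url, headers={"User-Agent": "text-analytics-example"})
response.raise_for_status()  # fail loudly on 4xx/5xx responses

# Save the raw HTML so later extraction steps don't have to re-download it.
with open("article.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```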

Usually we are mainly interested in the content of each individual web page and much less in navigation elements, headers, footers, and so on. As soon as we have the data available locally, we can use powerful extraction techniques to dissect the pages into elements such as the title and the content, plus some metadata (publication date, author, and so on).
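To illustrate what such an extraction step can look like, here is a minimal sketch using Beautiful Soup. The selectors are assumptions for illustration only; real pages use site-specific tags and attributes that you have to inspect in the browser first:

```python
from bs4 import BeautifulSoup

# Parse the locally saved page from the download step.
with open("article.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Title: many article pages put the headline in the first <h1>.
h1 = soup.find("h1")
title = h1.get_text(strip=True) if h1 else None

# Content: naively join all paragraph texts; on a real site you would
# restrict this to the article body to exclude navigation and footers.
content = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

# Metadata: the publication date is often exposed in a <meta> tag, but the
# attribute name varies from site to site.
meta = soup.find("meta", attrs={"property": "article:published_time"})
published = meta["content"] if meta else None

print(title, published, content[:200], sep="\n---\n")
```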

What You’ll Learn and What We’ll Build

In this chapter, we will show you how to acquire HTML data from websites and use powerful tools to extract the content from these HTML files. We will show this with content from one specific data source, the Reuters news archive.

As a first step, we will download individual HTML files and extract data from each one using different methods.

Normally, you will not be interested in single pages, so we will build a blueprint solution: we will download and analyze a news archive page, which contains links to all of the articles. After completing this step, we will know the URLs of the individual articles and can download each of them in turn; a sketch of this link-extraction step follows below.
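The core of that blueprint, extracting the article URLs from an archive page, might look like the following sketch. The archive URL and the "/article/" filter are assumptions about the site's URL scheme, to be adjusted after inspecting the actual pages:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder archive URL; the real archive layout may differ.
archive_url = "https://www.reuters.com/news/archive/technologyNews"

html = requests.get(archive_url,
                    headers={"User-Agent": "text-analytics-example"}).text
soup = BeautifulSoup(html, "html.parser")

# Collect absolute links that look like article pages; the "/article/"
# substring is an assumed URL pattern, not a documented one.
article_urls = {urljoin(archive_url, a["href"])
                for a in soup.find_all("a", href=True)
                if "/article/" in a["href"]}

print(f"Found {len(article_urls)} article URLs")
```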
