Chapter 5. Getting Data Off the Web with Python

A fundamental part of the data visualizer’s skill set is getting the right dataset in as clean a form as possible. Sometimes you will be given a nice, clean dataset to analyze, but often you will be tasked with finding the data, cleaning the data supplied, or both.

And more often than not these days, getting data involves getting it off the web. There are various ways you can do this, and Python provides some great libraries that make sucking up the data easy.

The main ways to get data off the web are:

  • Get a raw data file in a recognized data format (e.g., JSON or CSV) over HTTP.

  • Use a dedicated API to get the data.

  • Scrape the data by getting web pages via HTTP and parsing them locally for the required data.
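As a rough sketch of the first route, here is what happens once a raw file has arrived: its text can be turned into Python records with the standard library alone. The two payload strings and the helper names below are made up for illustration and stand in for the body of a downloaded file.

```python
import csv
import io
import json

# Made-up sample payloads standing in for the text of a downloaded file.
JSON_TEXT = '[{"name": "Ada", "year": 1815}, {"name": "Alan", "year": 1912}]'
CSV_TEXT = "name,year\nAda,1815\nAlan,1912\n"


def parse_json_records(text):
    """Decode a JSON array of objects into a list of dicts."""
    return json.loads(text)


def parse_csv_records(text):
    """Read CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


json_rows = parse_json_records(JSON_TEXT)
csv_rows = parse_csv_records(CSV_TEXT)

print(json_rows[0]["year"])  # an int, because JSON carries number types
print(csv_rows[0]["year"])   # a str, because CSV fields are untyped text
```

The difference in types is one reason CSV data usually needs a conversion step before analysis.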

This chapter will deal with these ways in turn, but first let’s get acquainted with the best Python HTTP library out there: Requests.

Getting Web Data with the Requests Library

As we saw in Chapter 4, the files that are used by web browsers to construct web pages are communicated via the Hypertext Transfer Protocol (HTTP), first developed by Tim Berners-Lee. Getting web content in order to parse it for data involves making HTTP requests.
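To make the idea of an HTTP request concrete, here is a minimal sketch using Requests. The URL, the year query parameter, and the helper names build_get and fetch_text are assumptions chosen for illustration, not anything defined by this chapter. The first helper only prepares a request so it can be inspected without touching the network; the second sends one.

```python
import requests

# Hypothetical URL, used only for illustration.
DATA_URL = "https://example.com/data.json"


def build_get(url, params=None):
    """Prepare (but do not send) a GET request so it can be inspected."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_text(url, params=None, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Inspect the request that would be sent; nothing goes over the wire here.
prepared = build_get(DATA_URL, params={"year": "2020"})
print(prepared.method, prepared.url)
```

Calling fetch_text(DATA_URL) would perform the actual request. Passing a timeout is a defensive choice, since without one a call can wait indefinitely on an unresponsive server.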

Negotiating HTTP requests is a vital part of any general-purpose language, but getting web pages with Python used to be a rather irksome affair. The venerable urllib2 library was hardly user-friendly, with a very clunky API. Requests, courtesy of Kenneth Reitz, changed that, ...
