Chapter 5. (Re)Organizing the Web’s Data
The first, and sometimes hardest, part of any data analysis is acquiring the data from which you hope to extract information. Whether you want to look at your personal spending habits, calculate your next trade in fantasy baseball, or compare a politician’s investment returns to your own, the data you need is usually there on the web with some sense of order to it, but it’s probably not in a form that’s very useful for analysis. If this is the case, you’ll need to either gather the data manually or write a script to collect it for you.
The granddaddy of all data formats is the data table, with a column
for each attribute and a row for each observation. You’ve seen this if
you’ve ever used Microsoft Excel, relational databases, or R’s data.frame
object.
Table 5-1. An example data table
Date | Blog | Posts |
---|---|---|
2012-01-01 | adamlaiacano | 2 |
2012-01-01 | david | 4 |
2012-01-01 | dallas | 6 |
2012-01-02 | adamlaiacano | 0 |
2012-01-02 | david | 4 |
2012-01-02 | dallas | 6 |
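To make the column-per-attribute, row-per-observation layout concrete, here is a minimal Python sketch of Table 5-1 using only the standard library. The names `rows` and `total_posts_by_blog` are illustrative choices, not something defined in this chapter, and a list of dictionaries is just one of several reasonable ways to hold such a table.

```python
from collections import defaultdict

# Rows copied from Table 5-1: one dict per observation, one key per attribute.
rows = [
    {"date": "2012-01-01", "blog": "adamlaiacano", "posts": 2},
    {"date": "2012-01-01", "blog": "david", "posts": 4},
    {"date": "2012-01-01", "blog": "dallas", "posts": 6},
    {"date": "2012-01-02", "blog": "adamlaiacano", "posts": 0},
    {"date": "2012-01-02", "blog": "david", "posts": 4},
    {"date": "2012-01-02", "blog": "dallas", "posts": 6},
]


def total_posts_by_blog(table):
    """Sum the posts column for each distinct value of the blog column."""
    totals = defaultdict(int)
    for row in table:
        totals[row["blog"]] += row["posts"]
    return dict(totals)


totals = total_posts_by_blog(rows)
print(totals)
```

Because every row carries the same attributes, a question such as "how many posts did each blog publish?" reduces to a short loop over the rows. That regularity is what makes the tabular form convenient for analysis.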
Most websites store their data behind the scenes in tables within relational databases, and if those tables were accessible to the computing public, this chapter of Bad Data Handbook wouldn’t need to exist. However, it’s a web designer’s job to make this information visually appealing and interpretable, which usually means they’ll only present the reader with a relevant subset of the dataset, such as a single company’s stock price over a specific date range, or recent status updates from a single user’s social connections. ...