5

Data Quality

All data is dirty, some data is useful.

–cf. George Box

Welcome to the midpoint of the book. Much as a rock “concept album” loosely tells an overarching story through its individual songs, this book is meant, to a certain degree, to follow the process a data scientist goes through, from acquiring raw data to feeding suitable data into a machine learning model or data analysis. Up to this point, we have looked at how to get data into a program or analysis system (e.g., a notebook), and in Chapter 4, Anomaly Detection, we touched on identifying data that has clearly “gone bad” at the level of individual data points. In the chapters after this one, we will look at remediation of that messy ...
