Chapter 10. Exploratory Data Analysis

More than 50 years ago, John Tukey avidly promoted an alternative type of data analysis that broke from the formal world of confidence intervals, hypothesis tests, and modeling. Today Tukey’s exploratory data analysis (EDA) is widely practiced. Tukey describes EDA as a philosophical approach to working with data:

“Exploratory data analysis is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected.”

As a data scientist, you will want to use EDA in every stage of the data lifecycle, from checking the quality of your data to preparing for formal modeling to confirming that your model is reasonable. Indeed, the work described in Chapter 9 to clean and transform the data relied heavily on EDA to guide our quality checks and transformations.

In EDA, we enter a process of discovery, continually asking questions and diving into uncharted territory to explore ideas. We use plots to uncover features of the data, examine distributions of values, and reveal relationships that cannot be detected from simple numerical summaries. This exploration involves transforming, visualizing, and summarizing data to build and confirm our understanding, identify and address potential issues with the data, and inform subsequent analysis.
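To make concrete the idea that plots can reveal what numerical summaries hide, here is a minimal sketch. It is not taken from this chapter: as an assumption for illustration, it uses two of the four data sets from Anscombe's quartet, a well-known published example, and the helper names `summarize` and `text_plot` are hypothetical. It relies only on the Python standard library, and the "plot" is a crude text rendering rather than a real graphic.

```python
from math import sqrt
from statistics import mean

# Illustrative data (an assumption, not from this chapter): two of the
# four data sets in Anscombe's quartet, which share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(xs, ys):
    """Return (mean of ys, Pearson correlation between xs and ys)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return my, sxy / sqrt(sxx * syy)


def text_plot(xs, ys, scale=4):
    """Crude plot: one row per point, ordered by x, bar length proportional to y."""
    rows = []
    for a, b in sorted(zip(xs, ys)):
        rows.append(f"x={a:>2} " + "*" * round(b * scale))
    return "\n".join(rows)


for name, ys in [("y1", y1), ("y2", y2)]:
    avg, r = summarize(x, ys)
    print(f"{name}: mean = {avg:.2f}, correlation with x = {r:.2f}")
    print(text_plot(x, ys))
```

The two sets of summaries agree to two decimal places, yet the bars tell different stories: the values in `y1` scatter around an upward trend, while those in `y2` trace a smooth curve that rises and then falls. In practice we would draw a proper scatter plot with a plotting library; the text bars are only a stand-in.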

EDA is fun! But it takes practice. One of the best ways to learn how to carry out EDA is to learn from others as they describe their thought process while they explore data, and ...
