Chapter 18. Missing Values

Introduction

You’ve already learned the basics of missing values earlier in the book. You first saw them in Chapter 1, where they triggered a warning when making a plot, and again in “summarize()”, where they interfered with computing summary statistics. In “Missing Values”, you learned about their infectious nature and how to check for their presence. Now we’ll come back to them in more depth.
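
As a quick refresher, here is a minimal sketch of that infectious behavior using only base R. The vector x is a made-up example, not data from the book:

```r
# Hypothetical vector with one missing value
x <- c(1, NA, 3)

# Arithmetic and comparisons involving NA yield NA
x_plus_one <- x + 1      # 2 NA 4
cmp <- NA > 5            # NA

# Summary functions propagate NA unless told to drop it
m_all <- mean(x)                 # NA
m_obs <- mean(x, na.rm = TRUE)   # 2

# is.na() is the reliable test; x == NA would itself be all NA
missing <- is.na(x)      # FALSE TRUE FALSE
```

Note that na.rm = TRUE silently changes the number of observations behind the summary, so it is worth checking how many values were dropped.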

We’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.
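
To make the explicit/implicit distinction concrete, here is a small sketch. The sales table and its column names are hypothetical, and tidyr’s complete() is just one possible way to make implicit gaps explicit:

```r
library(tidyr)
library(tibble)

# Hypothetical data: there is no row for 2021, quarter 2,
# so that observation is implicitly missing
sales <- tribble(
  ~year, ~qtr, ~revenue,
  2020,  1,    10,
  2020,  2,    12,
  2021,  1,    11
)

# complete() generates every combination of year and qtr and
# fills the columns that have no data with NA
sales_complete <- sales |> complete(year, qtr)
```

After this step the table should have four rows, with an explicit NA in revenue for the combination that was absent before.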

Prerequisites

The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.

library(tidyverse)

Explicit Missing Values

To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e., cells where you see an NA.

Last Observation Carried Forward

A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):

treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke" ...
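
One way to fill in gaps of this kind is tidyr’s fill(), which carries the last non-missing value forward. The sketch below uses a small hypothetical table (visits, with made-up column names and values) rather than the data above:

```r
library(tidyr)
library(tibble)

# Hypothetical data: an NA in patient means "same as the row above"
visits <- tribble(
  ~patient, ~visit, ~score,
  "A",      1,      5,
  NA,       2,      6,
  "B",      1,      7,
  NA,       2,      8
)

# fill() replaces each NA in the selected column with the most recent
# non-missing value above it (the default direction is "down")
visits_filled <- visits |> fill(patient)
```

Only the columns named in fill() are touched, so a genuinely missing measurement in another column stays NA rather than being overwritten by the previous row’s value.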
