Chapter 16. Factors

Introduction

Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a nonalphabetical order.

We’ll start by motivating why factors are needed for data analysis and how you can create them with factor(). We’ll then introduce you to the gss_cat dataset, which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

Prerequisites

Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and it’s an anagram of factors!).

library(tidyverse)

Factor Basics

Imagine that you have a variable that records the month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only 12 possible months, and there’s nothing saving you from typos:

    x2 <- c("Dec", "Apr", "Jam", "Mar")

  2. It doesn’t sort in a useful way:

    sort(x1)
    #> [1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor, start by creating a character vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

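To make the two fixes concrete, here is a minimal sketch in base R of what a factor built from these levels does to the example vectors. It repeats the definitions so that it runs on its own. It assumes the valid levels are the twelve standard three-letter English month abbreviations, and the names y1 and y2 are chosen here purely for illustration.

```r
# Assumption: the valid levels are the twelve standard month abbreviations.
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

x1 <- c("Dec", "Apr", "Jan", "Mar")
x2 <- c("Dec", "Apr", "Jam", "Mar")  # "Jam" is a typo

# Values are matched against the levels when the factor is created.
y1 <- factor(x1, levels = month_levels)

# Sorting a factor follows the order of its levels, not the alphabet.
sort(y1)

# A value that is not among the levels is silently converted to NA.
y2 <- factor(x2, levels = month_levels)
is.na(y2)
```

Sorting y1 gives the months in calendar order (Jan, Mar, Apr, Dec), and the mistyped "Jam" in x2 becomes a missing value. Because base R makes this conversion without a warning, it is worth checking a new factor for unexpected NA values after creating it.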