Appendix C. Data-Wrangling Fundamentals

Tidy datasets are all alike, but every messy dataset is messy in its own way.

Hadley Wickham

This appendix covers some basics of data wrangling, the process of formatting and cleaning data before using it. We include some common but sometimes confusing tools that we use regularly. We need a large toolbox because, as Hadley Wickham notes, each messy dataset has its own pathologies. For a more in-depth side-by-side comparison of Python and R, check out the appendix in Python and R for the Modern Data Scientist by Rick J. Scavetta and Boyan Angelov (O’Reilly, 2021).

Note

Data wrangling has many synonyms because almost everybody working with data needs to clean it. Other terms include data cleaning, data formatting, data tidying, data transformation, data manipulation, data munging, and data mutating. Don’t be surprised if you see different terms in different sources; in our experience, people also use these terms inconsistently. The key takeaway is that you’ll need to clean, format, transform, or otherwise change your own data at some point, which is why we included this appendix.

Logic Operators

Logic operators are largely the same across most languages, including Python and R. The upcoming Table C-1 lists some common operators. Explore these operators by creating two vectors in R:

## R
score <- c(21, 7, 0, 14)
team <- c("GB", "DEN", "KC", "NYJ")

Or, create arrays with numpy in Python:

## Python
import ...
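
As a minimal sketch of how such arrays could be built and then compared with logic operators, the following assumes numpy is imported under the conventional alias np and reuses the score and team values from the R example. The threshold of 7 and the names high_score and gb_or_high are our own illustrative choices, not part of the original listing:

```python
## Python
import numpy as np

# Same values as the R vectors above
score = np.array([21, 7, 0, 14])
team = np.array(["GB", "DEN", "KC", "NYJ"])

# An element-wise comparison returns a Boolean array
# (illustrative threshold of 7)
high_score = score > 7

# Combine conditions element-wise with | (or), & (and), and ~ (not);
# wrap each comparison in parentheses
gb_or_high = (team == "GB") | (score > 7)

# A Boolean array can filter another array of the same length
print(team[high_score])
```

With numpy arrays, the element-wise operators are `&`, `|`, and `~` rather than the Python keywords `and`, `or`, and `not`, which expect a single Boolean value.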
