Chapter 7. Data Cleanup: Investigation, Matching, and Formatting
Cleaning up your data is not the most glamorous of tasks, but it's an essential part of data wrangling. Becoming a data cleaning expert requires precision and a healthy knowledge of your area of research or study. Knowing how to properly clean and assemble your data will set you miles apart from others in your field.
Python is well designed for data cleanup; it helps you build functions around patterns, eliminating repetitive work. As we've already seen in our code so far, learning to fix repetitive problems with scripts and code can turn hours of manual work into a script you run once.
In this chapter, we will take a look at how Python can help you clean and format your data. We'll also use Python to locate duplicates and errors in our datasets. We will continue learning about cleanup, especially automating our cleanup and saving our cleaned data, in the next chapter.
Why Clean Data?
Some data may come to you properly formatted and ready to use. If this is the case, consider yourself lucky! Most data, even if it has already been cleaned, has some formatting inconsistencies or readability issues (e.g., acronyms or mismatched description headers). This is especially true if you are using data from more than one dataset. It's unlikely your data will properly join and be useful unless you spend time formatting and standardizing it.
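To make the point about mismatched headers concrete, here is a minimal sketch of standardizing column headers from two datasets so they line up before a join. The function name `normalize_header`, the sample headers, and the acronym table are all hypothetical and chosen only for illustration; your own data will likely need different rules.

```python
def normalize_header(header, acronyms=None):
    """Return a lowercase, underscore-separated version of a column header."""
    acronyms = acronyms or {}
    # Trim stray whitespace and ignore case differences.
    cleaned = header.strip().lower()
    # Treat hyphens like spaces, then expand any known acronyms.
    words = [acronyms.get(word, word) for word in cleaned.replace("-", " ").split()]
    return "_".join(words)


# Hypothetical headers from two datasets describing the same fields.
survey_headers = ["Respondent ID", " HH Size ", "Region"]
census_headers = ["respondent-id", "household size", "REGION"]

# Hypothetical acronym table; a real project would build this from its codebook.
acronyms = {"hh": "household"}

survey_clean = [normalize_header(h, acronyms) for h in survey_headers]
census_clean = [normalize_header(h, acronyms) for h in census_headers]

print(survey_clean)
print(survey_clean == census_clean)
```

Once both header lists share one naming scheme, matching columns across the datasets becomes a simple comparison rather than a manual lookup.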
Note
Cleaning your data makes for easier storage, search, and reuse. As we explored in