A guide to improving data integrity
Validating your data requires asking the right questions and using the right data.
Almost 10 years ago, I started as an intern on a data engineering team and worked my way up to senior developer across dozens of projects and processes: simple data queries, data warehousing, raw log parsing, translations, aggregations, and building products for final reports and analysis.
After many years of working with data to create reliable reports and analysis, I’ve seen many people join our team who are processing and analyzing data for the first time (and sometimes even after years of working with only pristine, reportable data). At first, they struggle to understand how to reliably maintain and use data that is rawer and messier. Often, they don’t know how to ensure only the right data is used, and the question, “How would you test that?” has been difficult for them to even begin to answer.
With experience working in real data sets, however, the answer becomes clearer. It helps to focus on boundaries and hard expectations within the data, specifically on the format and validity of the values being observed. Ask yourself these questions:
- Do I have all the data I started with?
- Are there nulls in the data that should have values?
- Are there duplicates in the data?
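These three baseline checks can be sketched in code. The function below is a minimal, hypothetical illustration (the name `baseline_checks` and the row/field structure are my own, not from any particular tool): it compares row counts against the source, flags unexpected nulls in required fields, and reports duplicate keys.

```python
def baseline_checks(source_rows, loaded_rows, required_fields, key_field):
    """Run three baseline integrity checks on a loaded data set.

    source_rows / loaded_rows: lists of dicts (one dict per record).
    required_fields: fields that must never be null.
    key_field: field expected to be unique per record.
    """
    results = {}

    # Do I have all the data I started with?
    results["row_count_matches"] = len(source_rows) == len(loaded_rows)

    # Are there nulls in fields that should have values?
    results["null_violations"] = [
        row for row in loaded_rows
        if any(row.get(field) is None for field in required_fields)
    ]

    # Are there duplicates on the key field?
    seen, dupes = set(), []
    for row in loaded_rows:
        key = row[key_field]
        if key in seen:
            dupes.append(key)
        seen.add(key)
    results["duplicate_keys"] = dupes

    return results
```

In practice these checks often run as SQL against the warehouse, but the logic is the same: count rows, scan for nulls, and group by the key to find duplicates.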
Other key things to look at when evaluating data’s validity include trends and how the different components of the data relate. For example, if you’re testing a data set that represents a shopping experience, with users, products, purchases, and carts, some key questions to answer might include:
- Do all purchases relate to valid products?
- Does every cart and purchase have a valid user?
- Is the total number of carts less than total users? (Assuming each user should only have at most one cart.)
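The relational questions above amount to referential-integrity checks. Here is a small sketch, again with hypothetical names and record shapes of my own choosing, that answers all three: purchases must reference valid products, carts and purchases must reference valid users, and (assuming at most one cart per user) the cart count should not exceed the user count.

```python
def referential_checks(users, products, carts, purchases):
    """Check that shopping records relate correctly to one another.

    Each argument is a list of dicts with user_id / product_id keys
    as appropriate.
    """
    user_ids = {u["user_id"] for u in users}
    product_ids = {p["product_id"] for p in products}

    return {
        # Do all purchases relate to valid products?
        "orphan_purchases": [
            pu for pu in purchases if pu["product_id"] not in product_ids
        ],
        # Does every cart and purchase have a valid user?
        "invalid_cart_users": [
            c for c in carts if c["user_id"] not in user_ids
        ],
        "invalid_purchase_users": [
            pu for pu in purchases if pu["user_id"] not in user_ids
        ],
        # Is the total number of carts at most the total number of users?
        "cart_count_ok": len(carts) <= len(user_ids),
    }
```

Each check returns the offending records rather than a simple pass/fail, which makes it much easier to investigate why a relationship broke.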
Recently, I worked on a large project that encompassed almost all of the data processes and testing practices that we’ve developed over the past decade or so. Therefore, I decided to use this project as a case study to create a guide for how I test data and think about the process. This case study also included the larger question: “Can I trust the data I’m using?”, which goes beyond verifying the accuracy of data transformations and processes to ensure the right data sets are used for analysis.
This project started with processing and parsing raw data from log files and finished with optimized data tables for reporting in business intelligence tools. These tables would be used for final analysis and tracking of critical company data. Since we worked through many transformations and steps in data processing, it proved to be a good candidate for documentation. I’ve written a report for O’Reilly about this project, citing more tangible examples and walking through the different parts of our process. Check it out here.