Chapter 6. Assessing Data Quality
Over the past two chapters, we’ve focused our efforts on identifying and accessing different formats of data in different locations, from spreadsheets to websites. But getting our hands on (potentially) interesting data is really only the beginning. The next step is conducting a thorough quality assessment to understand whether what we have is useful, salvageable, or just straight-up garbage.
As you may have gleaned from reading Chapter 3, crafting quality data is a complex and time-consuming business. The process is roughly equal parts research, experimentation, and dogged perseverance. Most importantly, committing to data quality means that you have to be willing to invest significant amounts of time and energy—and still be willing to throw it all out and start over if, despite your best efforts, the data you have just can’t be brought up to par.
In fact, that last criterion, the willingness to throw everything out and start over, is probably what makes doing really high-quality, meaningful work with data truly difficult. The technical skills, as I hope you are already discovering, take some effort to master but are still highly achievable with sufficient practice. Research skills are a bit harder to document and convey, but working through the examples in this book will help you develop many of them, especially those related to the information discovery and collation needed for assessing and improving data quality.
When it comes to reconciling yourself to the fact that, after ...