Chapter 5. Improving Data Quality

When most people hear the words data quality, they think about data that is correct and factual. In data analytics and data governance, data quality has a more nuanced set of qualifiers. Being correct is not enough if not all of the details are available (e.g., fields in a transaction). Data quality is also measured in the context of a use case, as we will explain. Let’s begin by exploring the characteristics of data quality.

What Is Data Quality?

Put simply, data quality is the ranking of certain data according to accuracy, completeness (all columns have values), and timeliness. When you are working with large amounts of data, the data is usually acquired and processed in an automated way. When thinking about data quality, it is useful to consider:

Accuracy
Whether the data captured was actually correct. For example, a data-entry error that causes multiple zeros to be entered ahead of a decimal point is an accuracy issue. Duplicate data is also an example of inaccurate data.
Completeness
Whether all records captured were complete, i.e., there are no columns with missing information. If you are managing customer records, for example, make sure you capture or otherwise reconcile a complete customer details record (e.g., name/address/phone number). Missing fields will cause issues if, say, you are looking for customer records in a specific zip code.
Timeliness
Transactional data is affected by timeliness. The order of events in buying and ...
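
The completeness and duplicate checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the customer records, the field names (`name`, `address`, `zip`, `phone`), and the helper functions are all hypothetical, and exact-match comparison is only one simple way to detect duplicates. Timeliness is left out of the sketch.

```python
# Hypothetical customer records; all names and values are made up for illustration.
records = [
    {"id": 1, "name": "Ada Park", "address": "12 Elm St", "zip": "94110", "phone": "555-0100"},
    {"id": 2, "name": "Ben Ortiz", "address": "9 Oak Ave", "zip": None, "phone": "555-0101"},
    {"id": 3, "name": "Ada Park", "address": "12 Elm St", "zip": "94110", "phone": "555-0100"},
    {"id": 4, "name": "Cy Nash", "address": "", "zip": "10001", "phone": "555-0102"},
]

# Assumption: these are the fields a "complete" customer record must carry.
REQUIRED_FIELDS = ("name", "address", "zip", "phone")


def incomplete_records(rows, required=REQUIRED_FIELDS):
    """Return the ids of rows where any required field is missing or empty."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]


def duplicate_records(rows, key_fields=REQUIRED_FIELDS):
    """Return the ids of rows whose key fields exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for r in rows:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates.append(r["id"])
        else:
            seen.add(key)
    return duplicates


def completeness_ratio(rows, required=REQUIRED_FIELDS):
    """Return the share of required cells that actually hold a value."""
    total = len(rows) * len(required)
    filled = sum(1 for r in rows for f in required if r.get(f))
    return filled / total if total else 1.0


print("Incomplete:", incomplete_records(records))
print("Duplicates:", duplicate_records(records))
print("Completeness:", completeness_ratio(records))
```

In this sample, record 2 has no zip code, so a query for customers in a specific zip code would silently skip it, which is the kind of issue the completeness check is meant to surface.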
