Chapter 13. Column Names as Contracts

Emily Riederer

Software products use unit tests and service-level agreements (SLAs) to make performance promises, and interfaces rely on common symbols and labels. Data tables, however, sit somewhere in between: they are neither engineered like a service nor designed like an application. Because producers and consumers rarely talk to each other or agree on a contract, engineers are left confused about why users aren’t satisfied (or why they make vague complaints about “data quality”), and consumers are befuddled that the data is never quite “right.”

Using a controlled vocabulary for naming fields in published datasets is a low-tech, low-friction solution to this dilemma. Developing a common language forms a shared understanding of how each field in a dataset is intended to work, and it can also alleviate producer overhead in data validation, documentation, and wrangling.

Engineers and analysts can define up front a tiered set of stubs with atomic, well-defined meanings. When pieced together, these stubs can be used to describe complex metrics and behaviors.
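As a rough sketch of how tiered stubs might compose, the snippet below splits a column name into a measure stub and a subject stub and checks each against its tier. The stub names other than ID (`N`, `AMT`, `CUSTOMER`, `ORDER`), the two-tier `MEASURE_SUBJECT` layout, and the `parse_column_name` helper are all assumptions made for illustration, not the vocabulary the chapter goes on to define.

```python
# Hypothetical vocabulary: apart from ID, these stub names are
# illustrative assumptions, not taken from the chapter.
MEASURES = {"ID", "N", "AMT"}      # first tier: what kind of value this is
SUBJECTS = {"CUSTOMER", "ORDER"}   # second tier: what the value describes


def parse_column_name(name: str) -> dict:
    """Split a name like 'ID_ORDER' into its stubs and check each tier."""
    parts = name.upper().split("_")
    if len(parts) != 2:
        raise ValueError(f"{name!r}: expected two stubs, MEASURE_SUBJECT")
    measure, subject = parts
    if measure not in MEASURES:
        raise ValueError(f"{name!r}: unknown measure stub {measure!r}")
    if subject not in SUBJECTS:
        raise ValueError(f"{name!r}: unknown subject stub {subject!r}")
    return {"measure": measure, "subject": subject}
```

A name that parses cleanly tells a consumer what kind of value to expect before they read any documentation; a name that fails to parse is a signal that the producer has stepped outside the agreed vocabulary.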

Imagine you work at a ride-share company and are building a data table with one record per trip. What might a controlled vocabulary look like?

A first level might characterize different measures, consisting of both a data type (e.g., bool or int) and appropriate usage patterns. For example:

ID
Integer. Non-null. Unique identifier ...
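One way to read the ID entry is as a set of checks a producer could run before publishing the table. The sketch below is a minimal version under stated assumptions: rows are plain dictionaries, the column name `ID_TRIP` and the helper `check_id_column` are hypothetical, and uniqueness within the table is treated as optional because the definition above is cut off before it says where an ID must be unique.

```python
def check_id_column(rows, column, unique=True):
    """Return a list of contract violations for an ID-style column.

    Checks the parts of the ID definition that are visible in the text:
    integer type and non-null. The `unique` flag is an assumption about
    how "unique identifier" applies within a single table.
    """
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        value = row.get(column)
        if value is None:
            problems.append(f"row {i}: {column} is null")
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"row {i}: {column} is not an integer")
            continue
        if unique and value in seen:
            problems.append(f"row {i}: {column} duplicates an earlier value")
        seen.add(value)
    return problems


# Hypothetical trip records; ID_TRIP is an assumed column name.
trips = [{"ID_TRIP": 1}, {"ID_TRIP": 2}, {"ID_TRIP": 2}, {"ID_TRIP": None}]
violations = check_id_column(trips, "ID_TRIP")
```

Because the rule is keyed to the stub rather than to one specific column, the same check could in principle be applied to every column whose name starts with `ID`.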

From 97 Things Every Data Engineer Should Know (O’Reilly).
