Chapter 2. Collecting, Labeling, and Validating Data

In production environments, you discover some interesting things about the importance of data. We asked ML practitioners at Uber and Gojek, two businesses where data and ML are mission critical, how much data matters in their work. Here’s what they had to say:

Data is the hardest part of ML and the most important piece to get right ... Broken data is the most common cause of problems in production ML systems.

ML practitioner at Uber

No other activity in the machine learning lifecycle has a higher return on investment than improving the data a model has access to.

ML practitioner at Gojek

The truth is that if you ask any production ML team member about the importance of data, you’ll get a similar answer. This is why we’re talking about data: it’s incredibly important to success, and the data issues you face in production environments are very different from those in the academic or research environments that you might be familiar with.

OK, now that we’ve gotten that out of the way, let’s dive in!

Important Considerations in Data Collection

In programming language design, a first-class citizen is an entity that supports all the operations generally available to other entities in the language. In ML, data is a first-class citizen. Finding data with predictive content might sound easy, but in reality it can be incredibly difficult.

When collecting data, it’s important to ensure that the data represents the application you are trying to build ...
