Chapter 4. Feature and Training Data
It should be clear by this point that models come from data. This chapter is about the data: how it is created, processed, annotated, stored, and ultimately used to create the model. You will see that managing and handling the data creates specific challenges for repeatability, manageability, and reliability, and we will make some concrete recommendations about how to approach those challenges. For background, make sure to see (if you haven’t already) Chapters 2 and 3.
This chapter covers the infrastructure that accepts data from a source and readies it for use by the training system. We will discuss three fundamental functional subsystems involved in this task: a feature system, a system for human annotations, and a metadata system. We discussed features a little in the previous chapter; another way of thinking about them is that they are characteristics of the input data, especially characteristics that we have determined are predictive of something we care about. Labels are specific cases of the output that we want from the model that we ultimately train. They are used as examples to train that model. Another way to think about labels is that they are the target or “correct” values for a specific data instance that the model will learn. Labels can be extracted from logs by correlating the data with another independent event, or they can be generated by humans. We’ll discuss the systems needed for generation of human labels at scale, often called annotations. And finally, we’ll briefly cover metadata systems, which keep track of the details about how the other systems work and are critical to making them repeatable and reliable.
Several aspects of these systems are usually shared between the feature and data system—most notably the metadata system. Since metadata (data about the data that we are collecting and annotating) is best understood after we know what we’re doing with the data, we will discuss those systems after we have explored the requirements and characteristics of the feature and labeling systems.
Features
The data is just the data. Features are what make it useful for ML.1 A feature is any aspect of the data that we determine is useful in accomplishing the goals of the model. Here, “we” includes the humans building the model and, increasingly, automatic feature-engineering systems. In other words, a feature is a specific, measurable aspect of the data, or a function of it.
Features are used to build models. They are the structure that connects the model back to the data used to assemble it. Previously, we said that a model is a set of rules that takes data and uses it to generate predictions about the world. This is true of a model architecture and of a configured model. But a trained model is very much a formula for combining a collection of features (essentially, feature definitions used to extract feature values from real data—we cover this distinction more completely later in this chapter). Features frequently are more than just pieces of the raw data; they often involve some kind of preprocessing.
These concrete examples of features should provide some intuition about what they are:
- From a web log, information about the customer (say, browser type).
- Individual words or combinations of words from text a human enters into an application.
- The set of all the pixels in an image, or a structured subset thereof.
- Current weather at the customer’s location when they loaded the page.
- Any combination or transformation of features can itself be a feature.
Typically, features contain smaller portions of structured data extracted from the underlying training data. As modeling techniques continue to develop, it seems likely that more features will start to resemble raw data rather than these extracted elements. For example, a model training on text currently might train on words or combinations of words, but we expect to see more training on paragraphs or even whole documents in the future. This has significant implications for modeling, of course, but people building production ML systems should also be aware of the impact on systems of having the training data grow larger and less structured.
At YarnIt we have several models, and one of them makes recommendations for different or additional products while a customer is shopping on the site. This recommendations model is called by the web application during a shopping session, both from the main product page, where a customer is viewing an individual product, and from the shopping cart confirmation page, when a customer has just added a product to their shopping cart. The web server calls the model to ask which additional products it should show to this customer under these circumstances, in hopes of showing the customer something else they might need or want and increasing our sales to that customer.
In this context, we might consider some of the following useful features that we would want available to the model as it is queried for additional products:
- Product page or shopping cart confirmation page. Is the user browsing or buying?
- The current product the user is looking at, including information from the name of the current product, possibly information from the picture of the current product, the product category, manufacturer, and price.
- Customer average purchase size or total purchases per year, if we know it. This might be an indication of just how spendy a customer is.
- Knitter or crocheter? Some customers never crochet, and others only crochet. Knowing this might plausibly help us recommend the right yarns, needles, and patterns. We might have customers self-identify this characteristic or might infer it from previous purchases or browsing behavior.
- Customer country. Some products may be popular in only some places. Also, some countries are colder than others, and that might be a signal.
These are just a few ideas to give a flavor of what a feature might be. It is important to note that, absent any actual data, we have no idea whether any of these features is useful. They certainly seem like they might be helpful, but many features that seem like they might work simply do not. Most commonly, such features are only slightly predictive, and the things they predict might be predicted better by another feature we already have. In those cases, the new feature adds cost and complexity but no value at all. Remember that features add maintenance costs in many parts of the system, and those costs exist until the feature is removed. We should be cautious in adding new features, especially those that depend on new data sources or feeds that, themselves, will now have to be monitored and maintained.
Before we go too much further, we need to clarify two completely distinct uses of the term feature and how we differentiate them:
- Feature definition: This is code (or an algorithm or other written description) that describes the information we are extracting from the underlying data. But it is not any specific instance of that data being extracted. For example, “Source country of the current customer obtained by taking the customer’s source IP address and looking it up in the geolocation data service” is a feature definition. Any particular country—for example, the “Dominican Republic” or “Russia”—would not be a feature definition.
- Feature value: This is a specific output of a feature definition applied to incoming data. In the preceding example, the “Dominican Republic” is a feature value that might be obtained by determining that the current customer’s source IP address is believed to be most commonly used in that country.
This is not (yet) standard industry terminology, but we adopt it in this section to distinguish between figuring out how we’re trying to extract information from the data and figuring out what we have extracted so far.
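To make the distinction concrete, here is a minimal sketch in Python; the geolocation client, its lookup method, and the log-record fields are hypothetical stand-ins, not part of any particular product.

# A minimal sketch of the feature definition/feature value distinction,
# assuming a hypothetical geolocation client and web-log record format.

def customer_country(request_record: dict, geo_client) -> str:
    """Feature definition: the source country of the current customer,
    obtained by looking up the customer's source IP address in a
    geolocation data service."""
    return geo_client.lookup(request_record["source_ip"])

# A feature value is one concrete output of that definition, for example:
#   customer_country({"source_ip": "203.0.113.7"}, geo_client)
#   -> "Dominican Republic"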
Feature Selection and Engineering
Feature selection is the process by which we identify the features in the data that we will use in our model. Feature engineering is the process by which we extract those features and transform them into a usable state. In other words, through these activities, we figure out what matters (for the ML task we are trying to accomplish) and what doesn’t. Building and managing features has changed over the years and will continue to evolve, as it moves from an entirely manual process to a mostly automated one, and in some cases a completely automated one. In this chapter, we refer to the processes of selecting and transforming features jointly as feature engineering.
In human-driven feature engineering, the process normally starts with human intuition based on an understanding of the problem domain, or at least detailed consultation with experts. Imagine trying to build a predictive model of what customers would buy from yarnit.ai without knowing what yarn, or needles, or knitting are, or worse, without any idea of how online retail works at all. Understanding the underlying problem area is key, and more specificity is better. After that, an ML engineer spends time with a dataset and a problem and uses a set of statistical tools to evaluate the likelihood that a particular feature, or several features in combination with one another, will be useful in the task. Next, ML engineers typically brainstorm to generate a list of possible features. The range is pretty large. Time of day could be a feature predicting which yarns customers will buy. Local humidity could be a feature. Price could be a feature. Of these three features, one is much more likely to be useful than the others. It is up to humans to generate and evaluate those ideas by using the ML platform and model metrics.
For algorithmic feature engineering, sometimes included as part of AutoML, the process is considerably more automatic and data bound. AutoML, outlined in Chapter 3, is capable not only of selecting from identified features but also of programmatically applying common transforms (like log scaling or thresholding) to existing features. There are other ways that we can algorithmically learn something about the data (an embedding) without explicitly specifying it. Still, algorithms are generally able to identify only features that exist in the data, whereas humans can imagine new data that could be collected that might be relevant. Subtly, but perhaps even more importantly, humans understand the process of data collection, including any likely systemic bias. This might have a material impact on the value of the data. Nonetheless, especially when constrained to particular types of problems, algorithmic feature engineering and evaluation can be as effective as or more effective than human feature engineering. This is an evolving set of technologies, and we should expect the balance between humans and computers here to continue to develop.
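As a concrete illustration of those common transforms, here is a minimal sketch; the raw feature and the cutoff value are hypothetical illustrative choices.

import math

# A minimal sketch of two common automated feature transforms: log scaling
# and thresholding. The raw feature name and cutoff are made up.

def log_scale(value: float) -> float:
    """Log scaling compresses heavy-tailed features such as purchase totals."""
    return math.log1p(value)          # log(1 + x) handles zero safely

def threshold(value: float, cutoff: float) -> int:
    """Thresholding turns a continuous feature into a binary indicator."""
    return 1 if value >= cutoff else 0

# Example: derive candidate features from a raw yearly purchase total.
raw_purchases_per_year = 1250.0
candidate_features = {
    "log_purchases_per_year": log_scale(raw_purchases_per_year),
    "is_big_spender": threshold(raw_purchases_per_year, 1000.0),
}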
Lifecycle of a Feature
The distinction between feature definitions and feature values becomes especially important when considering the lifecycle of a feature. Feature definitions are created to fill a need, evaluated against that need, and eventually discarded as either the model is discarded or better features are found to accomplish the same goal. Here is a simplified version of the lifecycle of a feature (both the definition and the representative values):
1. Data collection/creation: Without data, we have no features. We need to collect or create data in order to create features.
2. Data cleaning/normalization/preprocessing: Although even the process of feature creation could be considered some kind of data normalization, here we’re referring to coarser preprocessing: eliminating obviously malformed examples, scaling input samples to a common set of values, and possibly even deleting specific data that we should not train on for policy reasons. This might seem outside the feature engineering process, but no features can exist until the data exists and is in a usable form. The topic of data normalization and preprocessing is huge and beyond the scope of this book, but building the infrastructure to consistently perform that preprocessing and monitor it is an important area of responsibility.
3. Candidate feature definition creation: Using either subject-matter expertise plus human imagination, or automated tools, develop a hypothesis for which elements or combinations of the data are likely to accomplish our model’s goals.
4. Feature value extraction: We need to write code that reads the input data and extracts the features that we need from it (a minimal sketch follows this list). In some simple situations, we might want to do this inline as part of the training process. But if we expect to train on the same data more than a few times, it’s probably sensible to extract the feature from the raw data and store it for later efficient and consistent reading. It is important to remember that if our application involves online serving, we need a version of this code to extract the same features from the values available at serving time in order to perform inference with our model. Under ideal circumstances, the same code can extract features for training and for serving, but we may have additional constraints in serving that are not present in training.
5. Storage of feature values in a feature store: This is where we save the features. A feature store is just a place to write extracted feature values so that they can be quickly and consistently read during model training. This is covered in detail in “Feature store”.
6. Feature definition evaluation: Once we have extracted a few features, we will most likely build a model using them or add new features to an existing model in order to evaluate how well they work. In particular, we will be looking for evidence that the features provide the value that we were hoping they would. Note that this evaluation of feature definitions comes in two distinct, connected phases. First, we need to determine whether the feature is useful at all. This is a coarse-grained evaluation; we are simply trying to decide whether to continue working on integrating the feature into our model. The second phase occurs if we decide to keep that feature. At that point, we need a process to continuously evaluate the quality and value (compared to cost) of the feature, so that we can determine that it is still working the way we expect and providing the value we expect even several years from now.
7. Model training and serving using feature values: Perhaps this is obvious, but the entire point of having features is to use them to train models, and to use the resulting models for a particular purpose.
8. (Usually) Update of feature definitions: We frequently have to update the definition of a feature, either to fix a bug or simply to improve it in some way. If we add and keep track of versions for feature definitions, this will be much easier. We can update the version and then, optionally, reprocess older data to create a new set of feature values for the new version.
9. (Sometimes) Deletion of feature values: Sometimes we need to be able to delete feature values from our feature store. This can be for policy/governance reasons; for example, we may no longer be allowed to store these feature values because a person or government has revoked that permission. It can also be for quality/efficiency reasons: we may decide those values are bad in some way (corrupted or sampled in a biased way, for example) or just too old to be useful.
10. (Eventually) Discontinuation of a feature definition: Everything comes to an end, including useful features. At some point in the lifecycle of the model, we will either find a better way to provide the value that this feature definition provides or find that the world has changed enough that this feature no longer provides any value. Eventually, we will decide to retire the feature definition (and values) entirely. We will need to remove any serving code that refers to the feature, remove the feature values from the feature store, stop running the code that extracts the feature from the data, and then delete the feature code.
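Here is the minimal sketch of feature value extraction referenced in step 4, assuming hypothetical web-log records and a feature store client with a write() method; both are illustrative, not a specific system’s API.

# A minimal sketch of feature value extraction (lifecycle step 4).
from datetime import datetime, timezone

FEATURE_VERSION = 2  # bump when the definitions below change

def extract_features(raw_record: dict) -> dict:
    """Apply feature definitions to one raw log record, returning feature values."""
    return {
        "feature_version": FEATURE_VERSION,
        "customer_id": raw_record["customer_id"],
        "browser_type": raw_record.get("user_agent", "unknown").split("/")[0],
        "page_type": raw_record["page"],   # "product" or "cart_confirmation"
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

def run_extraction(raw_records, feature_store):
    # In production this loop would be a monitored, repeatable pipeline.
    for record in raw_records:
        feature_store.write(extract_features(record))   # hypothetical store API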
Feature Systems
To successfully manage the flow of data through our systems, and to turn data into features that are usable by our training system and manageable by our modelers, we need to decompose the work into several subsystems.
As was mentioned in the introduction to this chapter, one of these systems will be a metadata system that tracks information about the data, datasets, feature generation, and labels. Since in most cases this system will be shared with any labeling systems, we will discuss it at the end of this chapter. For now, let’s walk through the feature systems starting with raw data and ending up with features stored in a format ready to be read by a training system.
Data ingestion system
We will have to write software that reads raw data, applies our feature extraction code to that data, and stores the resulting feature values in the feature store. In the case of one-time extraction, even for a very large amount of data, this may be a relatively ad hoc process with code designed to be run once. But in many cases, the process of extracting data is a separate production system all its own.
When we have many users or repeated use of the data ingestion system, it should be structured as an on-demand, repeatable, monitored data processing pipeline. As is the case for most pipelines, the biggest variable is user code. We will need a system whereby feature authors write code that identifies features and extracts them to store in the feature store. We can either let feature authors run their code themselves, which imposes a substantial operational burden on them, or we can accept the challenge and provide them a development engineering environment. This helps them write reliable feature-extraction code so that we can run that code reliably in our data ingestion system.
We need to build a few systems to facilitate feature authors writing reliable and correct features. To begin with, we should note that features should be versioned. We’ll likely want to substantially change a feature over time, perhaps because of changes in data that it merges with or other factors related to the data we are collecting. In these cases, a feature version helps keep the transition clear and avoids unintended consequences of the change.
Next, we’ll need a test system that checks feature-extraction code for basic correctness. Then, we’ll need a staging environment that runs proposed feature extraction on a sample of examples and provides the results to feature authors, along with basic analysis tools, to confirm that the feature is extracting what it is expected to extract. At this point, we may want to allow the feature to be run, or we may want additional human review for reliability concerns (dependence on external data, for example). The more work we do here, the more productive feature authors will be.
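As an example of what such basic correctness checks might look like, here is a minimal sketch of pytest-style tests against the hypothetical extract_features() code sketched earlier in this chapter; the module name is made up.

# A minimal sketch of correctness checks a test system might run against
# proposed feature-extraction code before it enters the ingestion pipeline.
from feature_extraction import FEATURE_VERSION, extract_features  # hypothetical module

def test_handles_missing_user_agent():
    record = {"customer_id": "c123", "page": "product"}  # no user_agent field
    features = extract_features(record)
    assert features["browser_type"] == "unknown"
    assert features["feature_version"] == FEATURE_VERSION

def test_page_type_is_a_known_value():
    # Staging analysis might also count how often each page_type occurs, so
    # authors can confirm the feature extracts what they expect.
    record = {"customer_id": "c123", "page": "product", "user_agent": "Mozilla/5.0"}
    assert extract_features(record)["page_type"] in {"product", "cart_confirmation"}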
Finally, it is important to note that some labels can be effectively calculated or generated from the very data we have at data ingestion time. A good example of this at YarnIt is suggested products and sales. We suggest products in order to sell more of them. The features are the characteristics of the product or characteristics about the customer, and the label is whether the customer bought it or not. As long as we can join the suggestion logs against the orders, we will have this label when we construct the features. In cases like this, the data ingestion system will also generate labels for those features as well as the features themselves, and both can be stored together in a common datastore. We will talk much more about labeling systems in “Labels”.
Feature store
A feature store is a storage system designed to store extracted feature (and label) values so that they can be quickly and consistently read during model training and even during inference. Most ML training and serving systems have some kind of a feature store, even if it is not called that.2 They are most useful, however, in larger, centrally managed services, especially when the features (both definitions and values) are shared among multiple models. Recognizing the importance of putting our feature and label data in a single, well-managed place has significantly improved the production readiness of ML in the industry. But feature stores do not solve every problem, and many people come to expect more of them than they can deliver. Let’s review the problems that a feature store does solve and the benefits it provides for your training system.
The most important characteristic of a feature store is its API. Different commercial and open source feature stores have different APIs, but all of them should provide a few basic capabilities:
- Store feature definitions: Usually these are stored as code that extracts the feature from the raw data format and outputs the feature data in the desired format.
- Store feature values themselves: Ultimately, we need to write features in a format that is easy to write, easy to read, and easy to use. This will be largely determined by our proposed use cases and most commonly is divided into ordered and unordered data, but we cover nuances in “Lifecycle Access Patterns”.
- Serve feature data: Provide access to feature data quickly and efficiently, at a performance level suitable to the task. We absolutely do not want expensive CPUs or accelerators stalled, waiting on the I/O of reading from our feature store. This is a pointless way to waste an expensive resource.
- Coordinate metadata writes with the metadata system: To get the most out of our feature store, we should keep information about the data we store in it in the metadata system. This helps model builders. Note that metadata about features is somewhat different from metadata about runs of the pipeline, although both are useful in troubleshooting and reproducing problems. Feature metadata is most useful for model developers, while pipeline metadata is most useful for ML reliability or production engineers.
Many feature stores also provide basic normalization of data in ingestion as well as more sophisticated transformations on data in the store. The most common transformations are standardized in-store bucketing and built-in transforming features.3
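Putting these capabilities together, a minimal sketch of the resulting API surface might look like the following; the method names and signatures are illustrative and not those of any particular commercial or open source feature store.

# A minimal sketch of a feature store interface covering the capabilities above.
from collections.abc import Iterable
from typing import Protocol

class FeatureStore(Protocol):
    def register_feature(self, name: str, version: int, definition_ref: str) -> None:
        """Store a feature definition (a reference to its extraction code)."""

    def write_values(self, name: str, version: int, rows: Iterable[dict]) -> None:
        """Store extracted feature values."""

    def read_values(self, names: list[str], start=None, end=None) -> Iterable[dict]:
        """Serve feature data fast enough that trainers and servers never stall."""

    def record_metadata(self, name: str, version: int, metadata: dict) -> None:
        """Coordinate writes with the metadata system."""

A real store adds the normalization and transformation hooks just mentioned, but these four methods capture the core contract.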
The feature store API needs to be carefully calibrated for the use case. We should consider asking the following questions as we think about what we need in a feature store:
- Are we reading data in a particular order (log lines that are timestamped) or in no order that matters (a large collection of images in a cloud storage system)?
- Will we read the features frequently or only when training a new model? In particular, what is the ratio of bytes/records read to bytes/records written?
- Is the feature data ingested once, never appended to, and seldom updated? Or is the data frequently appended to while older data is continuously deleted?
- Can feature values be updated, or is the store append-only?
- Are there special privacy or security requirements for the data we are storing? In most cases, the extracted features of a dataset with privacy and use restrictions will also have privacy and use restrictions.
After thinking about these questions, we should be able to determine our needs for a feature storage system. If we’re lucky, we will be able to use one of the existing commercial or open source feature stores on the market. If not, we’ll have to implement this functionality ourselves, whether we do it in an uncoordinated fashion or as part of a more coherent system.
Once we are clear on the requirements for an API and have a clearer understanding of our data access needs, in general we will find that our feature store falls into one of two buckets:4
- Columns: The data is structured and decomposable into columns, not all of which will be used in all models. Typically, the data is also ordered in some way, often by time. In this case, column-oriented storage is the most flexible and efficient.
- Blobs: This is an acronym for binary large objects, although the common English word is also descriptive. In this case, the data is fundamentally unordered and mostly unstructured, and it is best stored in a manner that’s more efficient at storing a bunch of bytes.
Many feature stores will need to be replicated, partially or completely, in order to store data near the training and serving stacks. Since ML computation requirements are generally considerable for training, we often prefer to replicate data to the place or places where we can get the best value for our training dollar. Having the feature store implemented as a service facilitates the management of this replication.
Feature quality evaluation system
As we develop new features, we need to evaluate what, if anything, those features add to the quality of the overall model, in combination with existing features. This topic is covered extensively in Chapter 5. The general idea to know at this point is that we can combine the approaches of using slightly different models, A/B testing, and a model quality evaluation system in order to effectively evaluate the benefit of each new feature under development. We can do this quickly and at relatively low cost.
One common approach is to take an existing model and retrain it by using a single additional feature. In the case of a web application like YarnIt’s, we can then direct a fraction of our user requests to the new model and evaluate its performance on a task. For example, if we add a feature for the country the user is in to a model suggesting new products for a user to try, we can direct 1% of the user requests (either by request or by user) to the new model. We can evaluate whether the new model has a higher likelihood of recommending products that users actually buy, and doing so can inform the choice of whether to keep the new feature or to eliminate it and try other ideas.
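A minimal sketch of one way to do the by-user split follows; the model names and user ID scheme are made up for illustration.

import hashlib

# A minimal sketch of assigning 1% of users to the model retrained with the
# candidate country feature.

def pick_model(user_id: str, experiment_fraction: float = 0.01) -> str:
    """Deterministically assign a user to the experiment or control arm."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # stable per-user bucket, 0..9999
    if bucket < experiment_fraction * 10_000:
        return "recommender_with_country"       # candidate model
    return "recommender_baseline"               # current production model

Splitting by user (rather than by request) keeps each customer’s experience consistent while we compare purchase rates between the two arms.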
Keep in mind that even a feature that adds value may not be worth the cost to collect, process, store, and maintain it. For all but the most trivially obvious features, it is a good habit to calculate a rough return on investment (ROI) for every new feature added to the system. This will help us avoid useful features that are still more expensive than the value that they add.
Labels
Although features seem like the most important aspects of the data, one thing is more important: labels. By this point, you should have a solid understanding of what features are for and what systems considerations should be taken into account when managing large numbers of features. But supervised learning models, in particular, require labels.
Labels are the other main category of data used in training an ML model. While features serve as the input to the model, labels are examples of the correct model output. They are used in the training process to set the (many!) internal parameters of the model, to tune the model so that it will produce the desired outputs on the features it gets at inference time. Held-out examples and labels (labels not used in training) are also used in model evaluation to understand model quality.
As we discussed earlier, for some classes of problems, like our recommendation system, the labels can be generated algorithmically from the system’s log data. These labels are almost always more consistent and more accurate than human-generated labels. Since they are generated from the system’s log data, often along with the feature data, these labels are most commonly stored in the feature system for model training and evaluation. In fact, all labels can be stored in the feature store, although labels generated by humans need additional systems to generate, validate, and correct them before storing them for model training and evaluation. We discuss these systems in the next section.
Human-Generated Labels
So let’s turn our attention to the large classes of problems requiring human annotations to provide the model training data. For example, building a system to analyze and interpret human voice needs human annotation to ensure that the transcription is accurate and to understand what the speaker meant. Image analysis and understanding often needs example annotated images for image classification or detection problems. Getting these human annotations at the scale needed to train an ML model is challenging from both implementation and cost perspectives. Effort must be dedicated to designing efficient systems to produce these annotations. We will now focus on the primary components of these human annotation systems.
For a concrete example involving our fictional yarn shop, YarnIt, consider the case of an advanced new feature that, given an image of crocheted fabric, can predict the crochet stitch that was used to produce that fabric. Such a model requires someone to design the set of crochet stitches for the model to predict, providing the set of classes that the model can output. Then large quantities of images of crocheted fabric, covering all of these stitches, must be labeled by crochet experts as to which stitch produced the fabric. Using these expert labels, a model can be trained to classify new images of crocheted fabric and determine the stitch used.
This is not cheap. Humans need to be trained on the task, and each image may need to be labeled multiple times to ensure that we have trustworthy results. Because of the large cost associated with acquiring human-generated labels, training systems should be designed to get as much mileage from them as possible. One technique commonly used is data augmentation: the feature data is “fuzzed” in a way that changes the features but doesn’t change the correctness of the label.5 For example, consider our stitch classification problem. Common image operations like scaling, cropping, adding image noise, and changing the color balance don’t change the classification of the stitch in the image but can greatly increase the number of images available for the model to train on. Similar techniques can be used in other classes of problems. Care must be taken, however, not to train and test on two images fuzzed from the same source image; for this reason, any data augmentation of this sort should be done by the training system and not in the labeling system (unless you have a good reason to do otherwise, such as a very expensive fuzzing algorithm, in which case proceed carefully).
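A minimal sketch of what such label-preserving augmentation might look like in the training system, using Pillow (assumed); the crop fractions, output resolution, and jitter range are arbitrary illustrative choices.

# A minimal sketch of label-preserving image augmentation: scaling, cropping,
# and color-balance jitter do not change which crochet stitch is shown, so the
# original label carries over to every augmented copy.
import random
from PIL import Image, ImageEnhance

def augment(image: Image.Image) -> Image.Image:
    w, h = image.size
    # Random crop that keeps most of the fabric in frame.
    dx, dy = int(0.1 * w), int(0.1 * h)
    image = image.crop((random.randint(0, dx), random.randint(0, dy),
                        w - random.randint(0, dx), h - random.randint(0, dy)))
    # Rescale back to a fixed training resolution.
    image = image.resize((224, 224))
    # Mild color-balance jitter.
    return ImageEnhance.Color(image).enhance(random.uniform(0.8, 1.2))

Each augmented copy keeps the original stitch label, multiplying the effective training set without additional annotation cost.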
One important note: while some kinds of tremendously complex data can best be labeled by humans, other types of data definitely cannot be labeled by humans at all. Typically, this is abstract data with high dimensionality that makes it difficult for a human to determine the correct answer quickly. Sometimes humans can be provided with augmentation software to assist in these tasks, but other times they are just the wrong option for performing the labeling.
Annotation Workforces
The first question that often comes up with human annotation problems is who will do the labeling. This is a question of scale and equity.6 For simpler models, for which a small amount of data is sufficient to train the model, typically the engineer building the model will do their own labeling, often with hacky, homebuilt tools (and a significant chance of biased labels). For more complex models, dedicated annotation teams are used to provide human annotations at a scale not otherwise possible.
These dedicated annotation teams can be colocated with the model builder or remotely provided by third-party annotation providers. They range in size from a single person to hundreds of people, all generating annotated data for a single model. The Amazon Mechanical Turk service was the original platform used for this, but since then a proliferation of crowdsourcing platforms and services have developed. Some of these services use paid volunteers, and others use teams of employees to label the data.
Cost, quality, and consistency trade-offs arise in the choice of labeling. Crowdsourced labeling often requires additional effort to verify quality and consistency, but paid labeling staff can be expensive. The costs for these annotation teams can easily exceed the computational costs of training the models. We discuss some organizational challenges of managing a large annotation team in Chapter 13.
Measuring Human Annotation Quality
As the quality of any model is only as good as the data used to train the model, quality must be designed into the system from the start. This becomes increasingly true as the size of the annotation team grows. Quality can be achieved in multiple ways depending on the task, but the most frequent techniques used include the following:
- Multiple labeling (also called consensus labeling): The same data is given to multiple labelers to check for agreement among them (a minimal sketch follows this list).
- Golden set test questions: Trusted labelers (or the model builder) produce a set of test questions that are randomly included in the unlabeled data to evaluate the quality of the produced labels.
- A separate QA step: A fraction of the labeled data is reviewed by a more trusted QA team. (Who QAs the QA team? Perhaps the model builder, but depending on context, this could be a separate QA team, policy expert, or someone else with domain expertise.)
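Here is the minimal sketch of consensus labeling referenced above; the agreement threshold is an illustrative choice, not a recommendation.

from collections import Counter

# A minimal sketch of consensus labeling: each item is labeled by several
# annotators, and we keep the majority label only when agreement is high enough.

def consensus(labels: list[str], min_agreement: float = 0.7):
    """Return (label, agreement), with label None if annotators disagree too much."""
    top_label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    return (top_label if agreement >= min_agreement else None), agreement

# Example: three of four annotators chose "single_crochet".
label, agreement = consensus(["single_crochet", "single_crochet",
                              "half_double_crochet", "single_crochet"])
# label == "single_crochet", agreement == 0.75; low-agreement items can be
# routed to the QA step or flagged for golden-set review.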
Once they’re measured, quality metrics can be improved. Quality issues are best addressed by managing the annotation team with humility and by understanding that they will produce higher-quality results when they have the following:
- More training and documentation
- Recognition for quality and not just throughput
- A variety of tasks over the workday
- Easy-to-use tools
- Tasks with a balanced set of answers (no needle-in-a-haystack tasks)
- An opportunity to provide feedback on the tools and instructions
Annotation teams managed in this way can provide high-quality results. However, even the best, most conscientious labeler will miss things occasionally, so processes should be designed to detect or accept these occasional errors.
An Annotation Platform
A labeling platform organizes the flow of data to be annotated and the results of the annotations while providing quality and throughput metrics of the overall process. At their heart, these systems are primarily work-queuing systems to divide the annotation work among the annotators. The actual labeling tool that allows the labelers to view the data and provide their annotations should be flexible to support any arbitrary annotation task.
With a team or organization that is working on multiple models simultaneously, the same annotation team may be shared among multiple annotation projects. Furthermore, each annotator might have different sets of skills (e.g., language skills or knowledge of crochet stitches), and the queuing systems can be relatively complex and require careful design to avoid problems such as scalability issues or queue starvation. Pipelines enabling the output of one annotation task to serve as the input to another can be useful for complex workflows. Quality measurement using the techniques discussed previously should be designed into the system from the start, so project owners can understand the labeling throughput and quality of all their annotation tasks.
Although historically many companies have implemented their own labeling platforms with their own set of these features, many options exist for prebuilt labeling platforms. The major cloud providers and many smaller startups offer labeling platform services that can be used with arbitrary annotation workforces, and many annotation workforce providers have their own platform options that can be used. This is a rapidly changing area with new features being added to existing platforms all the time. Publicly available tools are moving beyond simple queuing systems and are starting to provide dedicated tools for common tasks, including advanced features like AI-assisted labeling (see the following section). When deciding on any new technology platform, data security and integration costs must be considered along with the platform capabilities.
As mentioned in “Feature store”, in many cases the most sensible place to store completed labels is in the feature store. By treating human annotations as their own columns, we can take advantage of all the other functionality provided by the feature store.
Active Learning and AI-Assisted Labeling
Active learning techniques can focus the annotation effort on the cases in which the model and the human annotators disagree or the model is most uncertain, and thereby improve overall label quality. For example, consider an image detection problem where the labeler must annotate all the occurrences of a particular object in an image. An active-learning labeling tool might use an existing model to pre-label the image with proposed detections of the object in question. The labeler would then approve correct proposed detections, reject bad ones, and add any missing ones. While this can greatly increase labeler throughput, it must be done with care to not introduce bias to the models. Such active learning techniques can actually increase overall label quality since the model and humans will often have their best performance on different kinds of input data.
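A minimal sketch of the uncertainty-based selection just described follows; predict_proba() is a hypothetical stand-in for the existing model’s class probabilities.

# A minimal sketch of uncertainty sampling for active learning: send the
# examples the current model is least sure about to human annotators first.

def least_confident_first(examples, predict_proba, budget: int):
    """Rank unlabeled examples by the model's confidence in its top class."""
    scored = []
    for example in examples:
        probs = predict_proba(example)          # e.g., {"sc": 0.4, "hdc": 0.35, ...}
        scored.append((max(probs.values()), example))
    scored.sort(key=lambda pair: pair[0])       # lowest top-class probability first
    return [example for _, example in scored[:budget]]

# The selected examples go to the annotation queue; confident predictions can
# instead be shown to labelers as pre-labels to approve or correct.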
A semi-supervised system allows the modeler to bootstrap the system with weak heuristic functions that imperfectly predict the labels of some data, and then use humans to train a model that turns these imperfect heuristics into high-quality training data. Systems like this can be particularly valuable for problems with complex, frequently changing category definitions that require models to be retrained quickly and frequently.
Efficient annotation for particularly complex labeling tasks is an ongoing field of research. Particularly if you are tackling a common annotation problem, a quick review of available tools from cloud and annotation providers is well worth your time, as they are often adding new capabilities for AI-assisted annotation.
Documentation and Training for Labelers
Documentation and labeler training systems are some of the most commonly overlooked parts of an annotation platform. While labeling instructions often start simply, they inevitably get more complex as data is labeled and various corner cases are discovered. To continue our preceding YarnIt example, perhaps some crochet stitches are not mentioned in the labeling instructions, or the fabric is made from multiple different stitches. Even conceptually simple annotation tasks such as “marking all the people in an image” can end up with copious instructions on proper handling of various corner cases (reflections, pictures of people, people behind car windows, etc.).
Labeling definitions and directions should be updated as new corner cases are discovered, and the annotation and modeling teams should be notified about the changes. If the changes are significant, previously labeled data might have to be re-annotated to correct data labeled with old instructions. Annotation teams often have significant turnover, so investing in training for using annotation tools and for understanding labeling instructions will almost always give big wins in label quality and throughput.
Metadata
Feature systems and labeling systems both benefit from efficient tracking of metadata. Now that you have a relatively complete understanding of the kinds of data that will be provided by a feature or labeling system, you can start to think about what metadata is produced during those operations.
Metadata Systems Overview
A metadata system is designed to keep track of what we’re doing. In the case of features and labels, it should minimally keep track of the feature definitions we have and the versions used in each model’s definitions and trained models. But it is worth pausing for a minute and trying to see into the future: what are we eventually going to expect out of a metadata system, and is there any way to anticipate that?
Most organizations start building their data science and ML infrastructure without a solid metadata system, only to regret it later. The next most common approach is to build several metadata systems, each targeted at solving a particular problem. This is what we’re about to do here: make one for tracking feature definitions and mappings to feature stores. Even within this very chapter, you’re about to see that we’re going to need to store metadata about labels, including their specification and when particular labels were applied to particular feature values. Later, we’re going to need a system for mapping model definitions to trained models, along with data about the engineers or teams responsible for those models. Our model serving system is also going to need to keep track of trained model versions and when they were put into production. Any model quality or fairness evaluation systems will need to read from all of these systems in order to identify and track the likely contributing causes of changes in model quality or violations of our proposed fairness metrics.
Our choices for a metadata system are relatively simple:
- One system: Build one system to track metadata from all of these sources. This makes it simple to make correlations across multiple subsystems, simplifying analysis and reporting. Such a large system is difficult to get right from a data schema perspective. We will be constantly adding columns when we discover data we would like to keep track of (and backfilling those columns on existing data). It will also be difficult to stabilize such a system and make it reliable. From a systems design perspective, we should ensure that our metadata systems are never in the live path of either model training or model serving. But it’s difficult to imagine how feature engineering or labeling can take place without the metadata system being functional, so it can still cause production problems for our humans working on those tasks.
- Multiple systems (that work together): We can build separate metadata systems for each task we identify. Perhaps we would have one for features and labels, one for training, one for serving, and one for quality monitoring. Decoupling the systems provides the standard advantages and costs that decoupling always does. It allows us to develop each system separately without concern for the others, making them nimbler and simpler to modify and extend. Additionally, an outage of one metadata system has limited production impact on the others. The cost, though, is the added difficulty of analysis and reporting across those systems. We will have processes that need to join data across the features, labeling, training, and serving systems. Multiple systems should always be designed in a way that allows their data to be joined, which means either creating and sharing unique identifiers or establishing a meta-metadata system that tracks the relationships of data fields across the metadata systems.
If the needs are simple and well understood, prefer a single system.7 If the area is rapidly developing and our teams expect to continue extending what they track and how they work, multiple systems will simplify the development over time.
Dataset Metadata
For metadata about features and labels, here are a few specific elements that we should ensure are included:
- Dataset provenance: Where did the data come from? Depending on the source of our data, we might have a lookup table of logs from various systems, a key for an external data provider with data about when we downloaded the data, or even a reference to the code that generated the data.
- Dataset location: For some datasets, we will store raw, unprocessed data. In this case, we should store a reference to where we keep that dataset, as well as perhaps information about where we got it from. Some data we create for ourselves on an ongoing basis, such as logs from our systems, and so in those cases we should store the log or datastore reference where that data is stored, or where we are permitted to read from it.
- Dataset responsible person or team: We should track which person or team is responsible for the dataset. In general, this is the team that chose to download or create the dataset, or the team that owns the system that produces the data.
- Dataset creation date or version date: It is often useful to know the first date a particular dataset was used.
- Dataset use restrictions: Many datasets have restrictions on their use, either because of licensing or governance constraints. We should document that in the metadata system for easy analysis and compliance later.
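A minimal sketch of a dataset metadata record covering the elements above; the schema, field names, and example values are purely illustrative, not a standard.

from dataclasses import dataclass, field
from datetime import date

# A minimal sketch of one record in a dataset metadata system.
@dataclass
class DatasetMetadata:
    name: str
    provenance: str          # log lookup key, external provider ID, or generating code ref
    location: str            # where the raw data (or the readable log/datastore) lives
    owner: str               # responsible person or team
    created: date            # creation or version date
    use_restrictions: list[str] = field(default_factory=list)

# Hypothetical example entry for YarnIt web logs.
weblogs = DatasetMetadata(
    name="yarnit_web_logs_v3",
    provenance="frontend request logs, web-requests pipeline",
    location="warehouse://logs/web/requests/",
    owner="storefront-infra",
    created=date(2022, 1, 15),
    use_restrictions=["delete on customer data-removal request"],
)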
Feature Metadata
Keeping track of metadata about our feature definitions is part of what will enable us to reliably use and maintain those features. This metadata includes the following:
- Feature version definition: The feature definition is a reference to code or another durable description of what data the feature reads and how it processes that data to create the feature. This should be updated for every updated version of the feature definition. As was previously described, versioning these definitions (and restricting the versions in use) will make the resulting codebase more predictable and maintainable.
- Feature definition responsible person or team: There are two good use cases for storing this information: figuring out what a feature is for and finding someone who can help resolve an incident when the feature might be at fault. In both cases, it is useful to store authorship or maintainer information about that feature.
- Feature definition creation date or current version date: This may be fairly obvious, but it’s useful to have a change history of when a feature was most recently updated and when it was originally created.
- Feature use restrictions: This is important but trickier to store. Features may be restricted from use in some contexts. For example, it may be illegal to use a particular feature in some jurisdictions. Age and gender may be reasonable predictors in automobile insurance risk models, but insurance is highly regulated, and we may not be permitted to take those fields into account. Banning particular fields only for specific uses is difficult to track and implement, but the restrictions might be even more subtle. For example, age may be able to be taken into account, but only with certain, specific bucketing (like under 25, 25–64, 65–79, and over 80). In that specific case, it’s easier to just define a transforming feature built on top of the age column that meets these bucketing requirements (a minimal sketch follows this list) and prohibit the general age feature from being used for insurance purposes while allowing the insurance_bucketed_age feature to be used. But the general case of storing and applying feature restrictions based on governance requirements is extremely difficult, and no great designs or solutions exist at the time of writing.
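The bucketed-age transforming feature described above might look like this minimal sketch; the bucket boundaries come from the text, and the feature name mirrors the hypothetical insurance_bucketed_age example.

# A minimal sketch of a transforming feature that exposes only the permitted
# age buckets, so models for regulated uses never see the raw age column.

def insurance_bucketed_age(age: int) -> str:
    if age < 25:
        return "under 25"
    if age <= 64:
        return "25-64"
    if age <= 79:
        return "65-79"
    return "over 80"

The general age feature can then be marked as restricted for insurance models in the metadata system, while insurance_bucketed_age remains allowed.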
Label Metadata
We should also track metadata about labels. This is intended to help with the maintenance and development of the labeling system itself, but might also be used by the training system as it uses the labels:
- Label definition version: Analogously to feature definitions, we must store the version of any label definitions so that we know which labeling instructions the labels were made with.
- Label set version: In addition to label definition changes, changes to the labels may occur because incorrect labels get corrected or new labels are added. If the dataset is being used for comparison with an older model, using an older version of the labels may be desirable, to make the comparison more apples-to-apples.
- Label source: Although not typically needed for training, it is sometimes necessary to know the source of each label in a dataset. This may be the source that particular label was licensed from, the human who produced the label (along with any QA that was applied to it), or the algorithm that produced the label if an automated labeling approach was used.
- Label confidence: Depending on how the labels are produced, we might have different estimates of the confidence of correctness for different labels. For example, we might have lower confidence in labels that are produced by an automated approach, or labels produced by a newer labeler. Users of these labels might choose different thresholds to decide which labels to use in training their models.
Pipeline Metadata
This section covers a final type of metadata that we won’t spend as much time on: metadata about the pipeline processes themselves. This is data about the intermediate artifacts we have, which pipeline runs they came from, and which binaries produced them. This type of metadata is produced automatically by some ML training systems; for example, ML Metadata (MLMD) is automatically included in TensorFlow Extended (TFX), which uses it to store artifacts about training runs. Such tracking is either already integrated into those systems or is somewhat difficult to implement later. As a result, we don’t cover it much here.
More generally, metadata systems are often overlooked or deprioritized. They should not be. They are one of the most effective and direct contributors to productive use of the data in an ML system. Metadata unlocks value and should be prioritized.
Data Privacy and Fairness
Feature and labeling systems give rise to profound privacy and ethical considerations. While many of these topics are covered much more completely in Chapter 6, calling out a few specific topics here explicitly is worthwhile.
Privacy
Both the data that we receive and the human annotations of that data have a significant possibility of containing personally identifiable information (PII). While the simplest approach to dealing with private information is to simply prohibit private or sensitive information from entering our feature storage system, this is often not practical.
PII data and features
If we plan to have PII data in features, we will want to plan to do at least three things in advance:
- Minimize the PII data we process and store
- Restrict and log access to the feature store containing the private features
- Plan to delete private features as soon as possible
It is best to plan for correct handling of PII data in advance. Before even considering collecting PII data, a clear process should be in place for obtaining consent from users: they should know what data they are providing and how it will be used. Many organizations find it valuable to write a plan documenting exactly the data that will be collected, how it will be processed, where it will be stored, and the circumstances under which it can be accessed and will be deleted. This allows for an internal (and possibly, eventually, external) review of the procedures to ensure compliance with relevant laws and regulations. Most organizations will want to document their procedures about planning for PII data and train staff on these procedures regularly.
Remember that from an organizational point of view, private data is much more of a liability than an asset. Therefore, we should be completely convinced that the features containing private data are required—that they produce sufficient value to outweigh the risks of processing and storing private data. As is always the case, different pieces of nonprivate data may be combined to create a private data element.
Private data and labeling
Human annotation of PII data introduces a host of legal or reputational hazards if not carefully and properly handled. The details of how to properly manage this kind of data in human annotation systems are extremely context specific and beyond the scope of this book. Often the best way to handle PII data is to split the data so that the human doing the annotation has access to only the non-PII parts. How to do this is specific to the problem at hand. Any labeling of PII data should be done with the utmost care and with awareness from project leadership of the risks involved.
Use of human annotators also introduces an additional risk: the model may unintentionally learn the cultural biases of the annotation instructions or of the team itself. This potential bias is best combated by thoughtful documentation, consideration of potential areas for confusion, strong lines of communication with the annotation team, and the hiring of a diverse annotation team. On the positive side, a well-trained annotation team can be one of the most effective ways to review sourced data for potential biases and to understand and remove them.
Fairness
Fairness is a significant topic that is covered much more broadly and thoroughly in Chapter 6. Suffice it to say here that considering fairness is important while thinking of features and their labels. It is not easy to select features and sets of features that ensure that the resulting ML system can be used in only a fair fashion. While it is true that we need to avoid selecting features and datasets that are unrepresentative and biased, this alone will not be sufficient to ensure fairness overall. This would be a good time for those with a particular interest in fairness to read (or reread) Chapter 6.
Conclusion
Production ML systems require mechanisms to efficiently and consistently manage training data. Training data almost always consists of features, so having a structured feature store significantly facilitates writing, storing, reading, and ultimately deleting features. Many ML systems also have a component of human annotation of data. Humans annotating data require their own systems to facilitate rapid, accurate, and verified annotations, which ultimately need to be integrated into the feature store.
We hope we have given you a clearer understanding of these components and the considerations that should go into selecting or, in the worst case, building them.
1 We are aware that one of the promises of deep learning is that it is sometimes possible to use raw data to train a model without specifically identifying features. This promise is partly true for some use cases but applies mostly, right now, to perceptual data like images, video, and audio. In those cases, we may not need to extract specific features from the underlying data, although we may still get good value from metadata features. For other cases, featurization is still required right now. So even deep learning modelers will benefit from an understanding of features.
2 For example, ML training that occurs on mobile devices will still need to train on some data, but will not have a structured, managed feature store, since there’s no need to mediate that data for other on-device users.
3 Bucketing is just the process of placing continuous data into discrete categories. Transforming features are those features that are the result of a computed combination of one or more other features. Simple examples include ideas like returning the day of the week when provided a date feature, or returning the country name from a feature that stores a latitude and longitude of a point on Earth. More complex examples might include fixing the color balance of a picture or choosing a particular projection for 3D data. Perhaps one of the most common examples is to convert 32-bit numbers (either integers or, worse, floating-point numbers) into 8-bit numbers in order to significantly reduce the space and computational resources required to process them.
4 Some examples will contain features from both buckets. For example, the pictures in an image dataset are blobs that are best accessed as unstructured data, but the metadata about each image (camera that took it, date, exposure information, location information) is structured and might even be ordered if it includes fields like date.
5 This is also a technique used to expand the training dataset algorithmically.
6 Organizations need to ensure that human labelers are treated fairly and paid reasonably for their work.
7 The industry is littered with organizations that have multiple metadata systems, each of which believes itself to be the one single system. If the needs are simple and well understood, prefer one system, but take action to ensure that it remains the only system until you outgrow it.