Chapter 21. ML Pipelines for Natural Language Processing
In the preceding chapter, we discussed how to create a pipeline for a computer vision production problem, in our case classifying images into categories. In this chapter, we demonstrate a different type of production problem. Instead of walking through all the generic details again, we focus on the project-specific aspects.
Here we develop an ML model that classifies unstructured text data. In particular, we train a transformer model, in this case a BERT model, to classify the text into categories. A significant part of our effort will go into the pipeline's preprocessing steps. The workflow we present applies to any natural language problem, including the latest state-of-the-art large language models (LLMs).
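Because the model predicts one of a fixed set of categories, the category strings must be mapped to integer ids before training and mapped back at serving time. The sketch below shows this in plain Python; the category names are hypothetical, and in the actual pipeline this mapping would be computed as part of the TF Transform preprocessing rather than by hand:

```python
def build_label_vocab(labels: list[str]) -> dict[str, int]:
    # Assign each distinct category a stable integer id.
    # Sorting makes the mapping deterministic across runs.
    return {label: i for i, label in enumerate(sorted(set(labels)))}

# Hypothetical category labels from the raw data.
labels = ["pothole", "graffiti", "pothole", "streetlight"]
vocab = build_label_vocab(labels)
encoded = [vocab[label] for label in labels]
```

The same vocabulary is then inverted at serving time to turn the classifier's integer prediction back into a human-readable category.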
The pipeline will ingest the raw data from an exported CSV file, and we will preprocess the data with TF Transform. After the model is trained, we will combine the preprocessing steps and the model graph into a single exported graph to avoid any training–serving skew.
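To make the skew-avoidance idea concrete, here is a minimal, framework-free sketch in plain Python (the column name and the whitespace tokenizer are hypothetical stand-ins): a single preprocessing function is defined once and reused both when building training features from the CSV export and when handling serving requests, so the two code paths cannot drift apart. In the chapter's pipeline, this role is played by the TF Transform graph that is exported together with the trained model.

```python
import csv
import io

def preprocess(text: str) -> list[str]:
    # Shared preprocessing: lowercase and whitespace-tokenize.
    # Defined exactly once so training and serving behave identically.
    return text.lower().split()

def training_features(csv_text: str) -> list[list[str]]:
    # Training path: read the exported CSV and preprocess every row.
    # "request_text" is a hypothetical column name.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [preprocess(row["request_text"]) for row in rows]

def serve(raw_request: str) -> list[str]:
    # Serving path: the *same* function runs on incoming requests,
    # so there is no training-serving skew.
    return preprocess(raw_request)
```

If the preprocessing were instead reimplemented separately in the serving code, any later change to tokenization would silently produce features the model was never trained on.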
Note
In this chapter, we’ll be focusing on the novel aspects of the pipeline (e.g., the data ingestion and preprocessing). For more information on how to run Vertex Pipelines, and how to structure your pipeline in general, we highly recommend reviewing the previous chapters.
Our Data
For this example, we are using a public dataset containing 311 call service requests from the City of San Francisco. ...