Chapter 21. ML Pipelines for Natural Language Processing
In the preceding chapter, we discussed how to create a pipeline for a computer vision production problem, in our case classifying images into categories. In this chapter, we demonstrate a different type of production problem. Instead of walking through all the generic details again, we focus on the project-specific aspects.
Here we develop an ML model that classifies unstructured text data. In particular, we train a transformer model, in this case a BERT model, to classify the text into categories. A significant part of our effort will go into the pipeline's preprocessing steps. The workflow we present applies to any natural language problem, including the latest state-of-the-art large language models (LLMs).
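Because the model predicts one of a fixed set of categories, the category strings must be mapped to integer ids before training and mapped back at serving time. The sketch below shows this in plain Python; the category names are hypothetical, and in the actual pipeline this mapping would be computed as part of the TF Transform preprocessing rather than by hand:

```python
def build_label_vocab(labels: list[str]) -> dict[str, int]:
    # Assign each distinct category a stable integer id.
    # Sorting makes the mapping deterministic across runs.
    return {label: i for i, label in enumerate(sorted(set(labels)))}

# Hypothetical category labels from the raw data.
labels = ["pothole", "graffiti", "pothole", "streetlight"]
vocab = build_label_vocab(labels)
encoded = [vocab[label] for label in labels]
```

The same vocabulary is then inverted at serving time to turn the classifier's integer prediction back into a human-readable category.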
The pipeline will ingest the raw data from an exported CSV file, and we will preprocess the data with TF Transform. After the model is trained, we will combine the preprocessing steps and the model graph into a single exported graph to avoid any training–serving skew.
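To make the skew-avoidance idea concrete, here is a minimal, framework-free sketch in plain Python (the column name and the whitespace tokenizer are hypothetical stand-ins): a single preprocessing function is defined once and reused both when building training features from the CSV export and when handling serving requests, so the two code paths cannot drift apart. In the chapter's pipeline, this role is played by the TF Transform graph that is exported together with the trained model.

```python
import csv
import io

def preprocess(text: str) -> list[str]:
    # Shared preprocessing: lowercase and whitespace-tokenize.
    # Defined exactly once so training and serving behave identically.
    return text.lower().split()

def training_features(csv_text: str) -> list[list[str]]:
    # Training path: read the exported CSV and preprocess every row.
    # "request_text" is a hypothetical column name.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [preprocess(row["request_text"]) for row in rows]

def serve(raw_request: str) -> list[str]:
    # Serving path: the *same* function runs on incoming requests,
    # so there is no training-serving skew.
    return preprocess(raw_request)
```

If the preprocessing were instead reimplemented separately in the serving code, any later change to tokenization would silently produce features the model was never trained on.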
Note
In this chapter, we’ll be focusing on the novel aspects of the pipeline (e.g., the data ingestion and preprocessing). For more information on how to run Vertex Pipelines, and how to structure your pipeline in general, we highly recommend reviewing the previous chapters.
Our Data
For this example, we are using a public dataset containing 311 call service requests from the City of San Francisco. ...