Chapter 11. Pipelines Part 1: Apache Beam and Apache Airflow

In the previous chapters, we introduced all the necessary components to build a machine learning pipeline using TFX. In this chapter, we will put all the components together and show how to run the full pipeline with two orchestrators: Apache Beam and Apache Airflow. In Chapter 12, we will also show how to run the pipeline with Kubeflow Pipelines. All of these tools follow similar principles, but we will show how the details differ and provide example code for each.

As we discussed in Chapter 1, a pipeline orchestration tool is vital because it abstracts away the glue code that we would otherwise need to write to automate a machine learning pipeline. As shown in Figure 11-1, the pipeline orchestrators sit underneath the components we have already mentioned in previous chapters. Without one of these orchestration tools, we would need to write code that checks when one component has finished, starts the next component, schedules runs of the pipeline, and so on. Fortunately, all this code already exists in the form of these orchestrators!

Figure 11-1. Pipeline orchestrators
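
To make the glue code described above more concrete, here is a minimal sketch of what we would have to write by hand without an orchestrator. It is not how Apache Beam or Apache Airflow are implemented; it only illustrates the idea of starting each component after its upstream components have finished. The function `run_pipeline`, the helper `make_step`, and the component names in the dependency graph are illustrative assumptions, and the steps are plain Python callables rather than real TFX components.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(components, dependencies):
    """Run each component only after all of its upstream components finish.

    components:   dict mapping a component name to a zero-argument callable
    dependencies: dict mapping a component name to the set of names it needs
    """
    finished = []
    # static_order() yields names so that dependencies always come first.
    for name in TopologicalSorter(dependencies).static_order():
        components[name]()  # blocks until this component is done
        finished.append(name)
    return finished


def make_step(name, log):
    """Hypothetical stand-in for a pipeline component: it only records its name."""
    def step():
        log.append(name)
    return step


log = []
names = ["ExampleGen", "StatisticsGen", "SchemaGen", "Trainer"]
components = {name: make_step(name, log) for name in names}

# Assumed dependency graph, for illustration only.
dependencies = {
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "Trainer": {"ExampleGen", "SchemaGen"},
}

run_order = run_pipeline(components, dependencies)
print(run_order)
```

Even this toy version leaves out scheduling, retries, parallel execution, and tracking of intermediate artifacts, which is the kind of work an orchestrator takes off our hands.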

We will start this chapter by discussing the use cases for the different tools. Then, we will walk through some common code that is required to move from an interactive pipeline to one that can be orchestrated by these tools. Apache Beam and Apache Airflow are simpler ...
