Chapter 4. Effective Dependency Management in Practice

In the previous chapter, we laid out the principles for effective dependency management—can you recall the four principles?—and supporting tools. In this chapter, let’s have some fun and put them into practice.

In this chapter, you will learn:

  • What “check out and go” looks like in practice

  • How to use Docker, batect, and Poetry to create consistent, reproducible, and production-like runtime environments in each step of the ML delivery lifecycle

  • How to automatically detect security vulnerabilities in your dependencies and automate dependency updates

The techniques in this chapter are what we use in our real-world projects to create reproducible, consistent, isolated, production-like runtime environments for our ML code. They help us effectively and securely manage dependencies and avoid dependency hell.

Let’s begin!

In Context: ML Development Workflow

In this section, you will see “check out and go” in action. In the code exercise, we’ll run through the following steps with the goal of training and serving a model that predicts the likelihood of a loan default:

  1. Run a go script to install prerequisite dependencies on our host machine.

  2. Create a Dockerized local development environment.

  3. Configure our code editor to understand the project’s virtual environment, so that we can have a helpful coding assistant.

  4. Run common tasks in the ML development lifecycle (e.g., train models, run tests, start API).

  5. Train and deploy the model on the cloud.

To make the most of this chapter, fork, clone, and code along in the hands-on exercise, where we’ll train and test a classifier to predict the likelihood of loan defaults. We encourage you to fork the repository, as it’ll allow you to see Docker and batect at work on the GitHub Actions CI pipeline of your forked repository as you commit and push your changes.

Before we turn to the code, let’s paint a clear picture of what we’re containerizing in a typical ML workflow.

Identifying What to Containerize

The first and most important step in Dockerizing a project is to disambiguate exactly what we are containerizing. This can confuse some ML practitioners and can lead to conflated and shared states. For example, if we share an image between the two distinct tasks of developing an ML model and serving an ML model, we may find unnecessary development dependencies (e.g., Jupyter, Pylint) in a production container (e.g., a model web API). This lengthens container build and start times unnecessarily and also enlarges the attack surface of our API.

In software development, the most common thing that we’re containerizing is a web application or web API—which is simply a long-lived process that starts after you run a command (e.g., python manage.py runserver). In ML, we can also use a containerized web application to serve model predictions (inference) via an API. However, we typically find ourselves running more than just a web application. For example, here are some common ML tasks and processes that we would run when creating ML solutions:

  • Training a model

  • Serving the model as a web API

  • Starting a notebook server

  • Running deployments (of ML training jobs, model API, etc.)

  • Starting a dashboard or an experiment tracking service (we won’t cover this in this chapter, as running dashboards as a web server is well-documented and relatively straightforward with tools such as Streamlit and Docker)

In this chapter’s example, we have identified four distinct sets of dependencies for running four different sets of tasks (see Table 4-1).

Table 4-1. Components that we are containerizing
1. Development image

Examples of tasks that we can run:
  • Train ML model
  • Feature engineering
  • Run automated tests
  • Start API server locally
  • Start Jupyter notebook server

Examples of OS-level dependencies:
  • Python 3.10
  • gcc
  • tensorflow-model-server

Examples of application-level dependencies:
  • Production dependencies: pandas, scikit-learn
  • Development dependencies: Jupyter, Pytest, Pylint

2. Production API image

Examples of tasks that we can run:
  • Start API server on the cloud

Examples of OS-level dependencies:
  • Python 3.10
  • gcc
  • tensorflow-model-server

Examples of application-level dependencies:
  • Production dependencies: pandas, scikit-learn

3. Deployment image—model training pipeline

Examples of tasks that we can run:
  • Deploy model training pipeline to the cloud
  • Execute model training

Examples of application-level dependencies:
The specific dependency will depend on what tool or platform we use to train our model on the cloud. For example, it could be one of the following:
  • aws-cdk (AWS)
  • gcloud (GCP)
  • azure-cli (Azure)
  • Metaflow
  • Kubeflow
  • Terraform
  • etc.

4. Deployment image—model web service

Examples of tasks that we can run:
  • Deploy model image to a model hosting service or container hosting service

Examples of application-level dependencies:
The specific dependency will depend on what tool or platform we use to deploy our web service on the cloud. For example, it could be one of the following:
  • aws-cdk (AWS)
  • gcloud (GCP)
  • azure-cli (Azure)
  • Terraform
  • etc.

Figure 4-1 visualizes each task—which, as you know by now, is nothing but a containerized process—and the respective image each task uses. This figure is a visual representation of Table 4-1.

Figure 4-1. Common ML development tasks and their associated images

The specific slicing and differentiation of images will vary depending on the project’s needs. If the slicing is too coarse-grained—e.g., one image for running all tasks and containers—the monolithic image might become too heavy. Recall our earlier discussion on the costs of carrying unnecessary dependencies. If the slicing is too fine-grained—e.g., one image for each task or container—we incur unnecessary costs in terms of the amount of code we have to maintain and the image build time for each task.

One helpful heuristic for determining how images are sliced is to think about “sharedness” and “distinctness” of dependencies. In this example, development tasks share an image because they share the same dependencies, such as Jupyter or scikit-learn. Deployment tasks are carved out into another image because they don’t need any of these dependencies—instead, they need dependencies like gcloud, aws-cli, azure-cli, or Terraform.

With this mental framework in our head, we are ready to dive into the hands-on exercise!

Hands-On Exercise: Reproducible Development Environments, Aided by Containers

Let’s step through how we would create and use development environments in our ML development lifecycle:

1. Check out and go: install prerequisite OS-level dependencies.

Run the go script for your operating system.

2. Create local development environment (i.e., build image).

Ensure Docker runtime is started (either via Docker Desktop or colima), and run the following command to install dependencies in your local dev image:

./batect --output=all setup

3. Start local development environment (i.e., run container).

Start the container:

./batect start-dev-container

Then test that everything works by running model training smoke tests:

scripts/tests/smoke-test-model-training.sh

Finally, exit the container by entering exit in the terminal or pressing Ctrl + D.

4. Serve the ML model locally as a web API.

Start the API in development mode:

./batect start-api-locally

Then send requests to the API locally by running the following command from another terminal outside the Docker container (it uses curl, which we haven’t installed in the Docker image):

scripts/request-local-api.sh

5. Configure your IDE to use the Python virtual environment created by the go scripts.

Instructions are available online for the IDEs we recommend for this exercise (PyCharm and VS Code); we cover this in more detail later in this section.

6. Train model on the cloud.

This step, along with step #7, is done on the CI/CD pipeline. We’ll cover that later in this section.

7. Deploy model web API.

Along with step #6, done on the CI/CD pipeline.

For the impatient, these steps are summarized at a glance in the repository’s README. Having these steps in a succinct README is a good habit to allow code contributors to easily set up their local environment and execute common ML development tasks. We recommend that you execute these steps now, in the project that you’ve cloned, to get a feel for the end-to-end flow. In the remainder of this section, we’ll go through each of the steps in detail, so that you can understand each component of our development environment setup and adapt it for your own project.

1. Check out and go: Install prerequisite dependencies

The first step in setting up our local development environment is running the go script to install host-level prerequisite dependencies. To begin, clone your forked repository:

$ git clone https://github.com/YOUR_USERNAME/loan-default-prediction

Alternatively, you can clone the original repository, but you won’t be able to see your code changes running on GitHub Actions when you push your changes:

$ git clone https://github.com/davified/loan-default-prediction

Readers working on Mac or Linux machines can now run the go script. This might take a while if you’re installing some of the OS-level dependencies for the first time, so make yourself a nice drink while you wait:

# Mac users
$ scripts/go/go-mac.sh

# Linux users
$ scripts/go/go-linux-ubuntu.sh

At this stage, Windows users should follow these steps:

  1. Download and install Python 3, if it’s not already installed. During installation, when prompted, select Add Python to PATH.

  2. In the Windows search bar, go to Manage App Execution Aliases and turn off the App Installer aliases for Python. This resolves the issue where the python executable is not found in the PATH.

  3. Run the following go script in the PowerShell or command prompt terminal:

    .\scripts\go\go-windows.bat

    If you see an HTTPSConnectionPool read timed out error, just run this command a few more times until poetry install succeeds.

The next step, regardless of which operating system you’re on, is to install Docker Desktop, if it’s not already installed. While this can be done in one line as part of the go script for Mac and Linux (see example go script for Mac), it was too complicated to automate in the Windows go script. As such, we’ve decided to keep this as a manual step outside of the go script for all three operating systems, for consistency. Follow Docker’s online installation steps.

It’s important that we keep these go scripts succinct and avoid installing too many host-level dependencies. Otherwise, it will be hard to maintain these scripts over time for multiple operating systems. We want to keep as many of our dependencies in Docker as possible.
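For illustration, here is a minimal sketch of what a go script along these lines might contain. This is not the repository’s actual script—the package names and the final poetry install step are assumptions—but it conveys the spirit of “install just enough on the host, then hand over to Docker and Poetry”:

# scripts/go/go-mac.sh (illustrative sketch only; the real script may differ)
#!/usr/bin/env bash
set -euo pipefail

# install host-level prerequisites (package names are assumptions)
brew install python@3.10 poetry

# install Python dependencies into a host-side virtual environment
poetry install

# print the virtual environment path, for configuring your code editor later
echo "Virtual environment Python interpreter:"
echo "$(poetry env info -p)/bin/python"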

2. Create our local development environment

Next, we’ll install all the OS-level and application-level dependencies needed for developing the ML model locally. We’ll do that in one command: ./batect setup. As promised earlier, this is where we explain how batect works. Figure 4-2 explains the three steps that are happening behind the scenes.

Figure 4-2. What happens when you run a batect task

As visualized in Figure 4-2, when we run ./batect setup, batect executes the setup task, which we defined in batect.yml. The setup task is simply defined as: run ./scripts/setup.sh in the dev container. Let’s look at how this is defined in batect.yml:

# Ensure Docker runtime is started (either via Docker Desktop or colima)

# install application-level dependencies
$ ./batect --output=all setup 
# batect.yml
containers:
 dev: 
   build_directory: .
   volumes:
     - local: .
       container: /code
     - type: cache
       name: python-dev-dependencies
       container: /opt/.venv
   build_target: dev

tasks:
 setup:  
   description: Install Python dependencies
   run:
     container: dev
     command: ./scripts/setup.sh

This is how we execute a batect task (e.g., setup). The --output=all option shows us the logs of the task while it’s executing. This provides visual feedback, which is especially useful for long-running tasks like dependency installation and model training.

This container block defines our dev image. This is where we specify Docker build-time and runtime configurations, such as volumes or folders to mount, the path to the Dockerfile (i.e., build_directory), and build targets for multistage Dockerfiles like ours. Once batect builds this dev image, it will be reused by any subsequent batect tasks that specify this image (e.g., smoke-test-model-training, api-test, and start-api-locally). As such, we won’t need to wait for lengthy rebuilds.

This task block defines our setup task, which consists of two simple parts: what command to run and what container to use when running the command. We can also specify additional Docker runtime configuration options, such as volumes and ports.

Let’s look a little deeper into the second step, and see how we’ve configured our Dockerfile:

FROM python:3.10-slim-bookworm AS dev 

WORKDIR /code 

RUN apt-get update && apt-get -y install gcc 

RUN pip install poetry
ADD pyproject.toml /code/
RUN poetry config installer.max-workers 10
ARG VENV_PATH
ENV VENV_PATH=$VENV_PATH
ENV PATH="$VENV_PATH/bin:$PATH" 

CMD ["bash"] 

We specify the base image that will form the base layer of our own image. The python:3.10-slim-bookworm image is 145 MB, as opposed to python:3.10, which is 915 MB. At the end of this chapter, we will describe the benefits of using small images.

The WORKDIR instruction sets a default working directory for any subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions in the Dockerfile. It is also the default starting directory when we start the container. You can set it to be any directory you’d like, as long as you’re consistent. In this example, we set /code as our working directory and that’s where we will place our code when we start our container in the next step.

We install gcc (GNU Compiler Collection) to handle the scenario where the maintainers of a particular Python library have not published a wheel for a given CPU architecture. With gcc, even if a Python package has wheels for one type of CPU (e.g., Intel processors) but not for another (e.g., Apple M1 processors), we can still build the package from source in this step.1

In this block, we install and configure Poetry. We tell Poetry to install the virtual environment at a fixed location (/opt/.venv) and add the path to the virtual environment to the PATH environment variable, so that we can run Python commands in containers without needing to activate the virtual environment (e.g., using poetry shell or poetry run ...).
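The dependency installation itself happens in scripts/setup.sh, which the setup task runs inside this container. As a rough sketch—the actual script in the repository may differ, and the use of VENV_PATH here is an assumption based on the Dockerfile above—it could look something like this:

# scripts/setup.sh (hypothetical sketch; the actual script may differ)
#!/usr/bin/env bash
set -e

# create the virtual environment at the path baked into the image
# (VENV_PATH, e.g. /opt/.venv), then install application-level
# dependencies into it with Poetry
python -m venv "$VENV_PATH"
source "$VENV_PATH/bin/activate"
poetry install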

Finally, the CMD instruction provides a default command to execute when we start a container. In this example, when our Docker image runs as a container, it will start a bash shell for us to run our development tasks. This is just a default and we can override this command when we run our containers later on.
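For example, assuming the image has been built and tagged as loan-default-prediction:dev (as in the docker run example in the next step), you could override the default bash command by appending another command after the image name:

$ docker run --rm loan-default-prediction:dev python --version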

One of the great things about Docker is that there is no magic: you state, step by step in the Dockerfile, what you want in the Docker image, and docker build runs each instruction and “bakes” an image based on the “recipe” (the Dockerfile) you provide.

3. Start our local development environment

Now, we can get into our local development environment by starting the container:

# start container (with batect)

$ ./batect start-dev-container 

# start container (without batect). You don’t have to run this command. 
# We’ve included it so that you can see the simplified interface that 
# batect provides

$ docker run -it \ 
      --rm \ 
      -v $(pwd):/code \ 
      -p 80:80 \ 
      loan-default-prediction:dev  

This batect task runs our dev container (i.e., a containerized bash shell that forms our development environment). The Docker runtime parameters are encapsulated in the batect task, as defined in batect.yml, so we can run the task without carrying the heavy implementation details you see in the docker run version of the same task.

-it is short for -i (--interactive) and -t (--tty, TeleTYpewriter) and allows you to interact (i.e., write commands and/or read outputs) with the running container via the terminal.

--rm tells Docker to automatically remove the container and file system when the container exits. This is a good habit to prevent lingering container file systems from piling up on the host.

-v $(pwd):/code tells the container to mount a directory (or volume) from the host ($(pwd) returns the path of the current working directory) onto a target directory (/code) in the container. This mounted volume stays in sync, so any changes you make inside or outside the container are reflected in both places.

-p X:Y tells Docker to publish port Y inside the container onto port X on the host. This allows you to send requests from outside the container to a server running inside the container on port 80.

This is the image that we want to use to start the container. Because we have specified the default command to run in our Dockerfile (CMD ["bash"]), the resulting container is a bash process, which we will use to run our development commands.

Inside of our development container, we can now run tasks or commands that we typically use when developing ML models. To keep these commands readable and simple, we’ve kept the implementation details in short bash scripts, which you can read if you’d like:

# run model training smoke tests
$ scripts/tests/smoke-test-model-training.sh

# run api tests
$ scripts/tests/api-test.sh

# train model
$ scripts/train-model.sh

Alternatively, you could also run these commands from the host, using batect. Thanks to Docker’s caching mechanism, running these tasks is equally fast regardless of whether you run them from inside a container, or start a fresh container each time from the host. These batect tasks make it easy to define tasks on our CI pipeline and make it easy to reproduce CI failures locally. This is how you can run common ML development tasks using batect:

# run model training smoke tests
$ ./batect smoke-test-model-training

# run api tests
$ ./batect api-test

# train model
$ ./batect train-model
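Under the hood, each of these tasks is defined in batect.yml in the same way as the setup task you saw earlier: a command plus the container to run it in. For example, the smoke test task might be defined along these lines (a sketch based on the task and script names above; the repository’s batect.yml may differ slightly):

# batect.yml (illustrative excerpt)
tasks:
  smoke-test-model-training:
    description: Run model training smoke tests
    run:
      container: dev
      command: scripts/tests/smoke-test-model-training.sh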

4. Serve the ML model locally as a web API

In this step, we will start our web API locally. The API encapsulates our ML model, delegates prediction requests to the model, and returns the model’s prediction for the given request. The ability to start the API locally for manual testing or automated testing saves us from falling into the antipattern of “pushing to know if something works.” This antipattern is a bad habit that lengthens feedback cycles (from seconds to several minutes) while we wait for tests and deployments to run on the CI/CD pipeline in order to test a change in even a single line of code.

This is how you can start our web API locally and interact with it:

# start API in development mode
$ ./batect start-api-locally

# send requests to the API locally. Run this directly from the host 
# (i.e. outside the container) as it uses curl, which we haven't 
# installed in our Docker image
$ scripts/request-local-api.sh
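For the curious, scripts/request-local-api.sh is essentially a curl call against the locally running API. A hypothetical sketch is shown below—the endpoint path and request fields are placeholders, since the actual payload depends on the features the model expects:

# scripts/request-local-api.sh (hypothetical sketch; endpoint and fields are placeholders)
#!/usr/bin/env bash
curl -X POST http://localhost:80/predict \
  -H "Content-Type: application/json" \
  -d '{"loan_amount": 10000, "annual_income": 52000}'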

5. Configure our code editor

An essential step in dependency management is configuring our code editor to use the project’s virtual environment, so that it can help us write code more efficiently. When the code editor has been configured to use a given virtual environment, it becomes a very powerful tool and can provide sensible hints and suggestions as you type.

In Chapter 7, we describe how you can achieve this in two simple steps:

  1. Specify the virtual environment in our code editor. See instructions for PyCharm and VS Code, or take a peek at the steps in Chapter 7—it should take only a few minutes.

  2. Leverage code editor commands and corresponding keyboard shortcuts to do amazing things (e.g., code completion, parameter info, inline documentation, refactoring, and much more). We’ll go through these shortcuts in detail in Chapter 7.

For step 1, you can use the path to the virtual environment installed by the go script on the host. The go script displays this as its last step. You can also retrieve the path by running the following command in the project directory outside the container:

$ echo $(poetry env info -p)/bin/python

This is a second—and duplicate—virtual environment outside of the container because configuring a containerized Python interpreter for PyCharm is a paid feature, and is not exactly straightforward for VS Code. Yes, this is a deviation from containers! In practice, we would pay for the PyCharm professional license because it’s simple and relatively low-cost, and we would continue to use a single containerized virtual environment for each project. However, we didn’t want the price to be a barrier to our readers. So, we came up with this workaround so that anyone can follow along.

6. Train model on the cloud

There are many options for training ML models on the cloud. They can range from open source and self-hosted ML platforms—such as Metaflow, Kubeflow, and Ray—to managed services such as AWS SageMaker, Google Vertex AI, and Azure Machine Learning, among many others. To keep this example simple and generalizable, we’ve opted for the simplest possible option: train the model on a CI compute instance using GitHub Actions. Training our model on the CI pipeline may not provide the many affordances or compute resources that an ML platform provides, but it will suffice for the purposes of this exercise.

Training our model on a CI pipeline is similar to training it using these ML services in one regard: we are training a model on ephemeral compute instances on the cloud. As such, we can use Docker to install and configure the necessary dependencies on a fresh instance. You will likely choose a different technology, especially if you’re doing large-scale training. Most, if not all, of these ML platforms support, and have supporting documentation for, running model training in containers.

In our example, we deploy our model training code simply by pushing our code to the repository.2 The following code sample will create a CI/CD pipeline using GitHub Actions to run a Docker command to train our model, which you can see via the GitHub Actions tab on your forked repo. This runs model training on a CI/CD server instance without us needing to fiddle with shell scripts to install OS-level dependencies—such as Python 3.x, Python dev tools, or gcc—on the fresh CI instance. This is where Docker really shines: Docker abstracts away most “bare metal” concerns of running code on a remote compute instance and allows us to easily reproduce consistent runtime environments.

# .github/workflows/ci.yaml

name: CI/CD pipeline
on: [push]
jobs:
  # ...
  train-model:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        run: ./batect train-model  

# batect.yml
containers:
  dev:
    ...

tasks:
  train-model:
    description: Train ML model
    run:
      container: dev
      command: scripts/train-model.sh

This defines a step in our CI pipeline to run the batect task, ./batect train-model.

7. Deploy model web API

In this step, we will: (i) publish our model API image to a container registry and (ii) run a command to tell our cloud service provider to deploy an image with a specific tag. At this stage, the only dependency we need is infrastructure related—e.g., aws-cdk (AWS), gcloud (GCP), azure-cli (Azure), Terraform. We do not need any of the dependencies from our development container, so it’s best that we specify a separate image for the purpose of deploying an image as a web service.

To make this code sample simple and generalizable regardless of which cloud provider you are using, we have opted to illustrate this step with pseudo-code:

# .github/workflows/ci.yaml

name: CI/CD pipeline
on: [push]
jobs:  

  # ... other jobs (e.g. run tests)

  publish-image:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Publish image to docker registry
        run: docker push loan-default-prediction-api:${{github.run_number}}

  deploy-api:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Deploy model
        run: ./batect deploy-api
    needs: [publish-image]

# batect.yml
containers:
  deploy-api-container:
    image: google/cloud-sdk:latest 


tasks:
  deploy-api:
    description: Deploy API image
    run:
      container: deploy-api-container
      command: gcloud run deploy my-model-api --image IMAGE_URL 

Pseudo-code for: (i) pushing our image from our CI/CD pipeline to a Docker registry and (ii) deploying this image as an API. We would typically need to retag the image to include the specific Docker image registry, but we have left out this detail to keep the example simple.
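For reference, retagging before pushing would look something like the following, where REGISTRY_URL and TAG are placeholders for your registry path and image tag (e.g., the CI run number):

$ docker tag loan-default-prediction-api:TAG REGISTRY_URL/loan-default-prediction-api:TAG
$ docker push REGISTRY_URL/loan-default-prediction-api:TAG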

For the deployment step, we didn’t need any of the dependencies from our model training and serving, but we do need a dependency (e.g., gcloud, aws-cli, azure-cli, Terraform) that helps us deploy our image to a container hosting service. Did you notice how we didn’t need to specify another Dockerfile? That is because batect allows us to define tasks with prebuilt images using the image option. Thanks to containers and batect, we can run this task in the same way on CI or on our local machine, simply by running ./batect deploy-api.

Pseudo-code for deploying a Docker image to a container hosting technology. You would replace this with the corresponding command for the cloud provider that you are using (e.g., AWS, Azure, GCP, Terraform).

In the preceding paragraph, we’re referencing several new concepts such as the container registry and cloud container hosting services. If this sounds overwhelming, fret not—we will describe these building blocks in an ML model’s path to production in Chapter 9.

Well done! By this stage, you have learned how to reliably create consistent environments for developing and deploying ML models. The principles, practices, and patterns in this code repository are what we use in real-world projects to bootstrap a new ML project repository with good practices baked in.

Next, let’s look at two other essential practices that can help you securely manage dependencies in your projects.

Secure Dependency Management

In 2017, attackers hacked Equifax—a credit monitoring company—by exploiting a vulnerability in an outdated dependency (Apache Struts) to infiltrate their system. This exposed the personal details of 143 million Americans and cost the company US$380 million. By the time Equifax was hacked, the maintainers of Apache Struts had actually already found, disclosed, and fixed the vulnerability in a newer version of Apache Struts. However, Equifax was still using an older version with the vulnerability and essentially had a ticking time bomb in their infrastructure.

Did you know that there are Python dependencies that have been found to allow your cloud credentials to be siphoned, or allow arbitrary code execution? Do you know if your current projects are exposed to any of these or other vulnerabilities? Well, if we don’t check our dependencies for vulnerabilities, we won’t know.

Keeping dependencies up-to-date and free of security vulnerabilities can be prohibitively tedious if done manually. The good news is that the technology to detect and resolve vulnerabilities in our dependencies has advanced significantly in recent years, and we can implement these checks in our projects without too much effort.

In this section, we will describe two practices that can help us mitigate these security risks:

  • Removing unnecessary dependencies

  • Automating checks and updates for dependencies

When complemented with the foundational knowledge in the preceding section, these practices will help you create production-ready and secure ML pipelines and applications.

With that in mind, let’s look at the first practice: removing unnecessary dependencies.

Remove Unnecessary Dependencies

Unnecessary dependencies—in the form of unnecessarily large base images and unused application-level dependencies—can create several problems. First and foremost, they enlarge the attack surface area of your project and make it more vulnerable to malicious attackers.

Second, they increase the time needed to build, publish, and pull your images. Not only does this lengthen the feedback cycle on your CI/CD pipeline, it can also impede your ability to autoscale quickly in response to unexpected spikes in production traffic, if you are handling large traffic volumes.

Finally, stray dependencies that are installed but never used can make the project confusing and hard to maintain. Even if such dependencies are not used, their transitive dependencies—i.e., grandchildren dependencies—can exert an influence (such as version constraints and installation failures due to version incompatibility) on other dependencies and transitive dependencies that are actually needed.

As a rule of thumb, we should:

  • Start with base images that are as small as possible—e.g., we could use the python:3.10-slim-bookworm image (145 MB) as opposed to python:3.10 (915 MB, over six times larger!)

  • Remove dependencies that are not used from pyproject.toml

  • Exclude development dependencies from the production image

On the third point, here is an example of how you can use Docker multistage builds to exclude development dependencies from your production image. The code sample below helps us reduce the size of the Docker image from 1.3 GB (dev image) to 545 MB (production API image):3

FROM python:3.10-slim-bookworm AS dev 

WORKDIR /code
RUN apt-get update && apt-get -y install gcc

RUN pip install poetry
ADD pyproject.toml /code/
RUN poetry config installer.max-workers 10

ARG VENV_PATH
ENV VENV_PATH=$VENV_PATH
ENV PATH="$VENV_PATH/bin:$PATH"

CMD ["bash"]

FROM dev AS builder 

COPY poetry.lock /code
RUN poetry export --without dev --format requirements.txt \
    --output requirements.txt

FROM python:3.10-slim-bookworm AS prod 

WORKDIR /code
COPY src /code/src
COPY scripts /code/scripts
COPY artifacts /code/artifacts
COPY --from=builder /code/requirements.txt /code
RUN pip install --no-cache-dir -r /code/requirements.txt
CMD ["./scripts/start-api-prod.sh"]

The first stage (dev) will create a dev image that batect will use when running ./batect setup. After batect installs all the development dependencies, the container becomes 1.3 GB. The code for this stage is the same as what you’ve seen in preceding Dockerfile code samples.

The second stage (builder) is an intermediate stage where we generate a requirements.txt file using poetry export. This file will help us in the next and final stage to keep the production image as small as possible, which we will explain in the next point.

In the third stage (prod), we install only what we need for the production API. We start afresh (FROM python:3.10-slim-bookworm) and copy only the code and artifacts we need to start the API. We install the production dependencies using pip and the requirements.txt file generated by Poetry so that we don’t have to install Poetry—a development dependency—in a production image.

To build the production image, we can run the following command. We specify the target stage (prod) when we build the image:

$ docker build --target prod -t loan-default-prediction:prod .
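You can compare the resulting image sizes with docker images (the exact sizes reported will vary by platform and dependency versions):

$ docker images loan-default-prediction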

With that, we have now excluded development dependencies from our production API image, which makes our deployment artifact more secure and speeds up the pushing and pulling of this image.

Automate Checks for Security Vulnerabilities

The second and most important practice for securing our application is to automate checks for security vulnerabilities in our dependencies. There are three components to this:

  • Automating checks for OS-level security vulnerabilities, through Docker image scanning

  • Automating checks for application-level security vulnerabilities, through dependency checking

  • Automating updates of OS-level and application-level dependencies

If you are using GitHub, you can do all of the above with Dependabot, a vulnerability scanning service that’s integrated with GitHub. If you’re not using GitHub, you can still implement the same functionality using other open source Software Composition Analysis (SCA) tools. For example, you can use Trivy to scan Docker images and Python dependencies, Snyk or Safety to check for vulnerable Python dependencies, and Renovate to automate dependency updates.

SCA tools generally use a similar approach: they check your dependencies for known vulnerabilities, or Common Vulnerabilities and Exposures (CVEs), by referencing a global vulnerability database, such as the National Vulnerability Database (nvd.nist.gov). Dependabot and Renovate also go on to create PRs in your project when they detect that a newer version of a given dependency is available.
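For example, if you’re using Trivy, scanning the production image and the dependency lockfiles in your repository can each be done with a single command (this assumes Trivy is installed on the machine running the scan):

# scan the production Docker image for OS-level and application-level CVEs
$ trivy image loan-default-prediction:prod

# scan the repository's dependency files (e.g., poetry.lock) for known CVEs
$ trivy fs .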

Note

While dependency vulnerability scanning and automated dependency updates help us significantly reduce our exposure to vulnerable dependencies, there can be scenarios where dependencies have been flagged in public vulnerability databases, but fixes have yet to be released. When a new vulnerability is found, there is naturally some amount of time required before the maintainers release a fix to address the vulnerability. Until that fix is released, these vulnerabilities are known as “zero-day vulnerabilities,” because the maintainers have had zero days to address the flaw since it became publicly known.

To manage this risk, you would need to consult security specialists in your organization to assess the severity of the vulnerabilities in your context, prioritize them accordingly, and identify measures to mitigate this risk.

Let’s take a look at how we can set this up in three steps on our GitHub repository using Dependabot. Dependabot can raise pull requests for two types of updates: (i) Dependabot security updates are automated pull requests that help you update dependencies with known vulnerabilities, and (ii) Dependabot version updates are automated pull requests that keep your dependencies updated, even when they don’t have any vulnerabilities.

For this exercise, we’ll use Dependabot version updates because pull requests will be created as soon as any dependency is outdated, even if there are no known security vulnerabilities. This will make it easier for you to follow along and see the intended result after completing each step.

The first step is to enable Dependabot for your repository or organization. You can do so by following the steps in GitHub’s official documentation to enable Dependabot version updates.

Second, when you’ve completed the steps on the official documentation to enable Dependabot version updates, you’ll be prompted to check in a dependabot.yml file in the .github directory:

# .github/dependabot.yml

version: 2
updates:
 - package-ecosystem: "pip" 
   directory: "/"
   schedule:
     interval: "daily"

We specify the package ecosystem and the directory that contains the package file. The official documentation states that we should specify pip, even if we are using Poetry. We also specify whether Dependabot should check for updates daily, weekly, or monthly.

Note

While it’s easy and tempting to also add a second update block here for "docker", in practice it can be challenging as updating Python versions (e.g., from Python 3.10 to 3.12) can cause a cascade of changes in versions of dependencies and transitive dependencies.

Nevertheless, we still recommend keeping the Python version of your ML system up to date, when you can ascertain that your application and dependency stack is compatible with newer versions of Python. Such a change should be easy to implement and test with the automated tests and containerized setup that we introduce in this book.
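For reference, if you do decide to let Dependabot propose base image updates as well, the additional update block would look like the following—keeping in mind the caveat above about cascading changes:

# .github/dependabot.yml (additional update block for Docker base images)
 - package-ecosystem: "docker"
   directory: "/"
   schedule:
     interval: "weekly"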

The third step is to configure our GitHub repository to allow PRs to merge only if tests pass on CI. This is an essential step to verify that dependency changes do not degrade the quality of our software. Different CI technologies will have different ways of doing this, and you can look up the respective documentation for your given toolchain. In our example, we are using GitHub Actions and, at the time of writing, the sequence of actions is:

  1. Allow auto-merge. Under your repository name, click Settings. On the Settings page, under Pull Requests, select “Allow auto-merge.” (You can also refer to the GitHub documentation on enabling auto-merge for up-to-date instructions for doing this.)

  2. We’ll define a GitHub Actions job to automatically merge PRs created by Dependabot. See GitHub documentation on adding auto-merge configuration for PRs created by Dependabot and the code sample below, which is also available in the demo repo in the .github directory:

    # .github/workflows/automerge-dependabot.yaml
    
    name: Dependabot auto-merge
    on: pull_request
    
    permissions:
      contents: write
      pull-requests: write
    
    jobs:
      dependabot:
        runs-on: ubuntu-latest
        if: github.actor == 'dependabot[bot]'
        steps:
          - name: Dependabot metadata
            id: metadata
            uses: dependabot/fetch-metadata@v1
            with:
              github-token: "${{ secrets.GITHUB_TOKEN }}"
          - name: Enable auto-merge for Dependabot PRs
            run: gh pr merge --auto --merge "$PR_URL"
            env:
              PR_URL: ${{github.event.pull_request.html_url}}
              GH_TOKEN: ${{secrets.GITHUB_TOKEN}}
  3. Finally, under Settings > Branches, add a branch protection rule by checking the box “Require status checks to pass before merging,” specifying the name of your branch (e.g., main), and search for the name of your test CI job. In this example, our job is train-model, which runs after run-tests. See GitHub documentation on adding a branch protection rule.

When these steps are done, your project will have its dependencies regularly and automatically updated, tested, and merged. Huzzah! A big leap toward more secure software.

Note

After completing these steps, you’ll notice that you can’t push your local commits to the main branch anymore, because we’ve enabled branch protection.

For those accustomed to trunk-based development, fret not—you can add your team to the bypass list (see the GitHub documentation on bypassing branch protections). Your team can continue to enjoy the fast feedback of CI/CD and trunk-based development while Dependabot’s changes go through pull requests.

Please note that bypassing branch protections can only be done on repositories belonging to an organization.

Give yourself several pats on the back! You have just applied the principles and practices we use in real-world projects to help us effectively manage dependencies in ML projects and create reproducible, production-ready, and secure ML pipelines and applications.

Conclusion

To recap, in this chapter, we covered:

  • What “check out and go” looks and feels like in practice

  • How to use Docker, batect, and Poetry to create consistent, reproducible, and production-like runtime environments in each step of the ML delivery lifecycle

  • How to detect security vulnerabilities in your dependencies, and how to automatically keep dependencies up-to-date

The unique challenges of the ML ecosystem—e.g., large and varied dependencies, large models—can stress-test how far we can take the practice of containerizing our software. In our experience, container technologies continue to be useful, but in the context of ML, they must be complemented with advanced techniques—e.g., Docker cache volumes, batect, automated security updates—so that we can continue to manage our dependencies effectively, securely, and with short feedback cycles.

Chapters 3 and 4 are our attempt to make these principles and practices clear and easy to implement so that we can rapidly and reliably set up our dependencies and spend time on solving the problems that we want to solve, not waste time in dependency hell. Proper dependency management is a low-hanging fruit that ML teams can harvest today and enjoy the benefits in terms of time, effort, reliability, and security.

In the next chapter, we will explore another powerful, foundational practice of effective ML teams: automated testing.

1 As mentioned in “Complicating the picture: Differing CPU chips and instruction sets”, the article “Why New Macs Break Your Docker Build, and How to Fix It” explains why this happens and why it is especially common with new Macs with M1 chips.

2 “Deploy” may sound like a big scary word, but it simply means the act of moving code or an application from a source repository to a target runtime environment.

3 Running docker history <image> on our production image (545 MB) shows that Python dependencies account for 430 MB. Looking into the site-packages directory, we found that the top three contributors were scikit-learn (116 MB), SciPy (83 MB), and pandas (61 MB).
