Chapter 9. Ray Clusters

So far we have focused on teaching you the basics of Ray for building machine learning applications. You know how to parallelize your Python code with Ray Core and run reinforcement learning experiments with RLlib. You’ve also seen how to preprocess data with Ray Datasets, tune hyperparameters with Ray Tune, and train models with Ray Train. But one of the key features that Ray brings is the ability to scale out seamlessly onto multiple machines. Outside a lab environment or a big tech company, it may be difficult to set up multiple machines and join them into a single Ray Cluster. This chapter is all about how to do that.1

Cloud technology has commoditized access to cheap machines for anyone. But it is often quite difficult to figure out the right APIs for working with the cloud provider tools. The Ray team has provided a couple of tools that abstract this complexity away. There are three primary ways of launching or deploying a Ray Cluster. You can do so manually, via a Kubernetes operator, or via the cluster launcher CLI tool.

In the first part of this chapter we’ll cover these three methods.2 We explain manual cluster creation and the cluster launcher CLI only briefly, and we spend most of our time explaining how to use the Kubernetes operator. After that, we’ll cover how to run Ray Clusters on the cloud and how to autoscale them up and down.

Manually Creating a Ray Cluster

Let’s start with the most basic way of creating a Ray Cluster.
