Chapter 1. Distributed Computing Is Hard but Necessary

Moore’s Law, which held that the number of transistors in microchips would double every two years, has proved accurate for long enough to give us the amazing compute power that all of us enjoy every day in our phones, in our laptops, and in our servers. Yet it’s never enough.

Many software systems built today require CPU cycles and memory that far exceed what even the largest servers provide. Even when one server is large enough, the need for high availability means we often use multiple machines, or multiple datacenters, so that our services remain available when failures occur. Also, many jobs (including training machine learning and AI models) can be decomposed into parallel tasks, which greatly reduces the total time required to complete them if we spread the work over a cluster.

However, distributed systems programming has always been hard to do, requiring special expertise. Why is that necessary? Isn’t it possible to provide intuitive abstractions that allow developers to express the computing they need to do, while transparently spreading that work across a cluster of machines?

Why Ray?

Researchers in machine learning (ML) and AI at the University of California, Berkeley, faced this problem. They needed an easy way to run workloads at massive scale, requiring distribution of tasks over clusters, yet none of the available tools were right for the job. They didn’t require fine-grained control over how the work was done. They weren’t running typical data analytics jobs. They needed a system to do most of the tedious work for them, with reasonable default behaviors, but also with great flexibility for diverse tasks.

Hence, they created Ray, an open source project aiming to solve this problem. Ray offers a concise and intuitive Python API for defining what tasks and application state need to be distributed. The Ray runtime library manages distribution and coordination, whether you are leveraging all the CPU cores on your laptop or all the nodes in your cluster. What attracted me to Ray is the simplicity and intuitive quality of the Ray API for end users. If you don’t want to think about the mechanics of distributed computing and just want to get your work done, Ray is for you (Figure 1-1 shows the Ray logo).

The Trends That Led to Ray

Let’s look more closely at the trends that led to the creation of Ray. OpenAI’s 2018 blog post “AI and Compute” showed exponential growth in the compute (measured in petaflop/s-days) used to train ML models, reflecting the exponential growth in model sizes, meaning the number of parameters that have to be trained. The compute required has doubled every 3.4 months since 2012. By comparison, Moore’s Law says that the number of transistors on a “microchip” doubles every two years.

Moore’s Law can’t keep up with demand, so clearly distributed training performed in parallel tasks is the only way to keep elapsed times for training from exploding exponentially.
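To make the gap concrete, the following is a minimal sketch that compares the two doubling periods quoted above. The function name growth_factor and the constant names are illustrative assumptions, and the calculation assumes both trends continue as steady doublings.

```python
# Doubling periods as quoted in the text (in months).
COMPUTE_DOUBLING_MONTHS = 3.4   # training compute, per "AI and Compute"
MOORE_DOUBLING_MONTHS = 24.0    # transistor count, per Moore's Law


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Return the multiplicative growth after a period of steady doubling."""
    return 2.0 ** (elapsed_months / doubling_months)


# Compare demand (training compute) with supply (transistors) over time.
for years in (1, 2, 5):
    months = 12 * years
    demand = growth_factor(months, COMPUTE_DOUBLING_MONTHS)
    supply = growth_factor(months, MOORE_DOUBLING_MONTHS)
    print(f"{years} yr: compute x{demand:,.0f}, transistors x{supply:.1f}")
```

Over two years, this gives roughly a 130-fold increase in training compute against a 2-fold increase in transistor count, which is why a single machine cannot close the gap.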

Another big trend is the rapid growth in the popularity of an old programming language, Python. Python is perhaps the most popular language for data science now, because of its ease of use and the rich collection of data-oriented libraries available for it. However, Python on its own is poorly suited to distributed computing. In particular, the standard Python interpreter is effectively single threaded: its global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so a single Python process can’t spread CPU-bound work across the other cores in the same machine, let alone a cluster of machines. For that, you need special libraries.

Ray is a response to both trends. While the API is in Python, the Ray core is written in C++, setting the stage for eventual support for APIs in other languages (a Java API will be next). As we’ll see shortly, the Python API is concise and intuitive for developers, yet it provides excellent performance for distributed workloads across a cluster.

Distributed programming in Python is not new. What about other libraries and tools? How do they compare to Ray?

Dask is also a Python framework for distribution over a cluster. Dask excels at distributing work on structured data, supporting most of the Pandas DataFrame API as well as NumPy arrays. Ray is more general purpose, with an emphasis on very high-performance and heterogeneous workloads. The Modin project is implementing similar data support on Ray.

There are several Python libraries for distribution on a single machine, including Python’s multiprocessing.Pool, asyncio API, and joblib for scikit-learn. If you don’t need to scale beyond a single machine, they work well. In fact, Ray can be used with asyncio, and it provides replacements for multiprocessing.Pool and joblib, so you can break the one-machine boundary. See “Next Steps with Ray” for details.
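As a point of reference for the single-machine case, here is a minimal sketch that uses the standard library’s multiprocessing.Pool to spread a CPU-bound function across worker processes. The function names sum_of_squares and run_in_pool, the input sizes, and the worker count are hypothetical choices for illustration only.

```python
from multiprocessing import Pool


def sum_of_squares(n: int) -> int:
    """Toy CPU-bound task (hypothetical): sum the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def run_in_pool(inputs, workers=4):
    """Map sum_of_squares over inputs using a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(sum_of_squares, inputs)


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module; without it, each worker would try to create a pool.
    print(run_in_pool([10, 100, 1000]))
```

Each worker is a separate Python process, so the work is not serialized by a single interpreter, but all of the workers still live on one machine. That is the boundary the Ray replacements mentioned above are meant to remove.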

Finally, an old standard is the Message Passing Interface (MPI), but it is quite low-level and complex to use, as you can see in this Python example.

Most data scientists and researchers don’t actually use the Ray API directly. Instead, they use higher-level libraries with APIs for their specific needs, while Ray is used internally for distributed computation. Figure 1-2 shows the current libraries shipped with Ray.

Figure 1-2. Current Ray ecosystem

The blue boxes correspond to some of the stages you might have in a typical ML pipeline. Hyperparameter tuning/optimization (HPO) is supported by Ray Tune. HPO seeks the optimal model structure, like the best neural network architecture, for your problem. This structure is defined by so-called hyperparameters. After you have decided on the model structure, then you train the model’s parameters, such as the “weights” assigned to nodes in a neural network. Tune integrates with popular systems such as PyTorch, TensorFlow, and scikit-learn.

After HPO, the next step is training to determine the model parameters. Ray SGD makes distributed PyTorch training easier. Ray RLlib is a state-of-the-art system for reinforcement learning.

Simulation is important in reinforcement learning. For example, if you are training a robot to walk, you’ll start with a software simulation of it. RLlib integrates with popular simulation environments like OpenAI Gym. You can also create and use your own environments built with Ray and other tools.

Once you finish training a model, you need to put it into your production environment to use it. In other words, you need to serve the model. Ray Serve supports several serving patterns, like canary deployments, while leveraging Ray for transparent scalability to meet load demands.

Other libraries are planned or under development, such as lightweight yet scalable streaming and data processing. Also, many organizations and third-party ML/AI libraries now use Ray. You can find their stories at the Ray website and on the Ray blog. Today, Ray is developed by Anyscale, a company founded by members of the Berkeley team who created Ray.

What’s Next?

Next, we’ll look at the Ray API and see how concise, expressive, and powerful it is. Then we’ll discuss the higher-level Ray libraries just mentioned, focusing on an example using Ray RLlib for reinforcement learning. After that, we’ll talk about Ray as a tool for general-purpose, distributed applications, showing that Ray’s abstractions are very flexible for a wide range of problems. Finally, we’ll recap what we’ve learned and suggest some next steps for you to consider.
