Chapter 8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism

You have read about several concepts and techniques related to distributed training in the previous chapters of this book. Chapter 6 laid out the fundamentals of distributed model training and discussed the possible dimensions of scaling, while Chapter 7 provided practical knowledge for scaling along the data dimension.

As you learned in Chapter 3, a task can typically be parallelized in two ways: by applying the same set of instructions to different data (SIMD) or by decomposing the set of instructions so that different parts of the algorithm can be performed at the same time on different data (MIMD). Data parallel model training is akin to SIMD, whereas the other forms of parallelism that you will read about in this chapter are akin to MIMD.
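To make the SIMD/MIMD analogy concrete, here is a minimal PyTorch sketch. It assumes a hypothetical machine with two GPUs ("cuda:0" and "cuda:1") and uses toy layer sizes purely for illustration; it is not a full training setup.

    import copy
    import torch
    import torch.nn as nn

    # Toy two-layer model used only for illustration (hypothetical sizes).
    layer_a = nn.Linear(512, 1024)
    layer_b = nn.Linear(1024, 10)

    # Data parallel style (SIMD-like): identical replicas of the full model
    # apply the same instructions to different shards of the input batch.
    replica_0 = nn.Sequential(copy.deepcopy(layer_a), copy.deepcopy(layer_b)).to("cuda:0")
    replica_1 = nn.Sequential(copy.deepcopy(layer_a), copy.deepcopy(layer_b)).to("cuda:1")
    out_0 = replica_0(torch.randn(32, 512, device="cuda:0"))  # shard 0 of the batch
    out_1 = replica_1(torch.randn(32, 512, device="cuda:1"))  # shard 1 of the batch

    # Model parallel style (MIMD-like): the computation itself is decomposed,
    # so different parts of the model run on different devices for the same data.
    stage_0 = layer_a.to("cuda:0")
    stage_1 = layer_b.to("cuda:1")
    x = torch.randn(32, 512, device="cuda:0")
    y = stage_1(stage_0(x).to("cuda:1"))  # activations cross the device boundary

In the data parallel case every device holds the whole model, so gradients must later be averaged across replicas; in the model parallel case no single device ever holds the whole model, which is exactly what the techniques in this chapter exploit.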

Scaling model training with data parallel techniques is often considered “weak” scaling because you scale only horizontally, along just one of the many possible dimensions of scale (i.e., data). Your overall scalability is limited by the number of parallel workers you can run, by each worker's ability to fit your model in its available memory, and by the maximum effective batch size you can use before the scaling law breaks down for your case and returns diminish. For most scenarios, weak scaling may be sufficient. However, if these limitations are causing you problems, you will need to look beyond data parallelism and explore the more advanced vertical scaling techniques covered in this chapter: model, pipeline, tensor, and hybrid parallelism.
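To see why per-worker memory becomes a binding constraint, the following back-of-the-envelope sketch estimates a model's training state under a commonly cited assumption: roughly 16 bytes per parameter for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignoring activations. The parameter counts and the helper function are hypothetical, chosen only to illustrate the arithmetic.

    # Rough estimate of per-replica training state, excluding activations.
    # Assumes mixed-precision Adam: 2 (fp16 weights) + 2 (fp16 grads)
    # + 4 + 4 + 4 (fp32 master weights, momentum, variance) = 16 bytes/param.
    def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
        return num_params * bytes_per_param / 2**30

    print(f"{training_state_gib(1_500_000_000):.0f} GiB")  # ~22 GiB for a 1.5B-parameter model
    print(f"{training_state_gib(7_000_000_000):.0f} GiB")  # ~104 GiB for a 7B-parameter model

Even before activations are counted, the larger model's training state alone exceeds the memory of a typical single accelerator, no matter how many data parallel replicas you add; shrinking the per-device footprint requires the vertical techniques covered in this chapter.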
