Chapter 7. Data Parallelism

In Chapter 6, you read about the fundamentals of distributed training and explored various parallelization techniques to scale out your model training workload. Building on the concepts from the preceding chapters, this chapter takes a deep dive into the data parallel technique. The objective of this chapter is to provide a full-stack understanding of how data parallel training comes to fruition. To meet this objective, it introduces data partitioning techniques, explores how workers share the load, and discusses related concepts along the way. The material in this chapter is hands-on, so most scenarios include dedicated examples that walk through the concepts.

Data Partitioning

As discussed in the previous chapter, data parallelism scales out training by partitioning the training corpus among the workers in the system. Creating equal-sized subsets of your training corpus is one example of a very simplistic partitioning strategy. With this approach, if you have 10 workers and your main training corpus has 100 records with IDs [0, 1, 2, … 99], it will be divided (a.k.a. sharded or subsetted) into 10 parts, with each worker receiving 10 unique records whose offset is determined by, say, its own rank. In this case, the first worker (w0) will receive records [0, 1, 2, … 9] while the tenth worker (w9) will receive records [90, 91, 92, … 99]. An example of this implementation uses torch.utils.data.Subset to create such a data partition, as sketched below.
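The following is a minimal sketch of that idea, not the book's own listing: it assumes an illustrative ToyDataset of 100 integer records and a hypothetical helper, partition_for_rank, that hands each worker a contiguous, equal-sized slice via torch.utils.data.Subset.

from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Stand-in corpus: 100 records whose values are their IDs, 0 through 99."""

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return idx


def partition_for_rank(dataset, rank, world_size):
    """Return the contiguous, equal-sized slice assigned to the worker with this rank."""
    records_per_worker = len(dataset) // world_size
    start = rank * records_per_worker
    indices = range(start, start + records_per_worker)
    return Subset(dataset, list(indices))


dataset = ToyDataset()
# Worker w0 receives records 0-9; worker w9 receives records 90-99.
for rank in (0, 9):
    shard = partition_for_rank(dataset, rank, world_size=10)
    print(f"worker w{rank}: {[shard[i] for i in range(len(shard))]}")

In a real distributed job, each process would compute only its own shard (using its rank from the process group) and wrap the resulting Subset in a DataLoader; PyTorch's DistributedSampler offers a built-in alternative to hand-rolled partitioning like this.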
