Part II. Distributed Training

This part delves into distributed training, building a thorough understanding of its foundations and practical application. It opens with an introduction to distributed systems and the communication challenges that must be understood before hardware resources can be scaled effectively. On that foundation, it presents the theoretical framework for distributed deep learning training techniques and explores data parallelism in depth, supported by practical exercises. It then turns to scaling model training with model, pipeline, tensor, and hybrid parallelism, using hands-on examples to highlight the challenges and limitations of each. Finally, it consolidates this knowledge into the expertise needed to realize multidimensional parallelism for efficient deep learning at scale.
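
As a small preview of the data-parallel approach explored in this part, the sketch below uses PyTorch's DistributedDataParallel to all-reduce gradients across two CPU processes. The toy model, random data, backend, and port are placeholder assumptions for illustration, not the examples used in the chapters themselves.

```python
# Minimal single-node data-parallel sketch (illustrative only).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int):
    # Each process owns one rank; "gloo" works on CPU, "nccl" is typical for GPUs.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP makes backward() all-reduce gradients across ranks.
    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # In practice each rank reads its own shard of the dataset; random tensors here.
    for _ in range(5):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient all-reduce happens during this call
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```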
