Preface

I started my professional career as a software engineer. Over the course of my time in that role, I became deeply interested and involved in running software and systems at scale, and I learned a lot about distributed systems, performance, optimization, and operating them reliably. Subsequently, I went on to perform many other roles, from building systems at the intersection of software and operations (DevOps) and auxiliary systems to enable intelligent software (MLOps), to running deep learning inference at scale and developing data engines for deep learning (machine learning engineering), to developing multitask, multiobjective models for critical functions such as healthcare and business decision workflows as a data scientist and machine learning specialist.

Since I’ve become involved in building intelligent systems, deep learning is a big part of what I do today. The wide adoption of deep learning–based intelligent (AI) systems is motivated by its ability to solve problems at scale with efficiency. However, building such systems is complex, because deep learning is not just about algorithms and mathematics. Much of the complexity lies at the intersection of hardware, software, data, and deep learning (the algorithms and techniques, specifically). I consider myself fortunate to have gained experience in a series of roles that forced me to rapidly develop a detailed understanding of building and managing deep learning–based AI systems at scale. The knowledge I acquired through these opportunities is not easily found or absorbed in one place, because each of these domains—hardware, software, and data—is as complex as deep learning itself.

The key motivation behind this book is to democratize this knowledge so that every machine learning practitioner, engineer or not, can navigate the deep learning landscape. I’ve always felt that this knowledge was somewhat fragmented, and saw an opportunity to pull it together into a coherent knowledge base. This unified resource will provide theoretical and practical guidance for developing deep learning engineering knowledge so you can scale out your deep learning workloads without needing to go through as many explorations as I did.

Why Scaling Matters

Deep learning and scaling are closely linked. Deep learning can scale your objectives from a single task to many tasks, from one modality to many modalities, and from one class to thousands of classes. Anything is possible, provided you have scalable hardware and a large volume of data, and you write software that can efficiently use all the resources available to you.

Scaling is complex, and thus not free. Developing a deep learning–based system requires a model with a large number of layers, a large volume of data, and hardware capable of handling computationally intensive workloads. Scaling requires understanding the elasticity of your entire system—not just your model but your entire deep learning stack—and adapting when that elasticity nears its breaking point. Therein lies the secondary motivation of this book: to help you gain a deeper understanding of your system, recognize when it might break, and avoid unnecessary breakage.

Who This Book Is For

This book aims to help you develop a deeper knowledge of the deep learning stack—specifically, how deep learning interfaces with hardware, software, and data. It will serve as a valuable resource when you want to scale your deep learning model, whether by expanding the hardware resources, adding larger volumes of data, or increasing the capacity of the model itself. Efficiency is a key part of any scaling operation. For this reason, consideration of efficiency is woven in throughout the book, to provide you with the knowledge and resources you need to scale effectively.

This book is written for machine learning practitioners from all walks of life: engineers, data engineers, MLOps engineers, deep learning scientists, machine learning engineers, and others interested in model development at scale. It assumes that the reader already has a fundamental knowledge of deep learning concepts such as optimizers, learning objectives and loss functions, and model assembly and compilation, as well as some experience with model development. Familiarity with Python and PyTorch is also essential for the practical sections of the book.

Given the complexity and scope, this book primarily focuses on scale-out of model development and training, with an extensive focus on distributed training. While the first few chapters may be useful for deployment and inference use cases, scaling inference is beyond the scope of this book. The topics we will cover include:

  • How your model is decomposed into a computation graph and how your data flows through this graph during the training process.

  • The less told but beautiful story of floating-point numbers and how these Higgs bosons of deep learning can be used to achieve memory efficiency.

  • How accelerated computing speeds up your training and how you can best utilize the hardware resources at your disposal.

  • How to train your model using distributed training paradigms (i.e., data, model, pipeline, and hybrid multidimensional parallelism). You will also learn about federated learning and its challenges.

  • How to leverage the PyTorch ecosystem in conjunction with NVIDIA libraries and Triton to scale your model training.

  • Debugging, monitoring, and investigating bottlenecks that slow down the scale-out of model training.

  • How to expedite the training lifecycle and streamline your feedback loop for iterative model development, along with related best practices.

  • Data tricks and techniques, and how to apply them to scale your training with limited resources.

  • How to select the right tools and techniques for your deep learning project.

  • Options for managing compute infrastructure when running at scale.

How This Book Is Organized

This book consists of an introductory chapter followed by a dozen chapters divided into three parts covering foundational concepts, distributed training, and extreme scaling. Each chapter builds upon the concepts, fundamentals, and principles from the preceding chapters to provide a holistic knowledge of deep learning that will enable efficient and effective scale-out of training workloads.

Introduction

Chapter 1, “What Nature and History Have Taught Us About Scale”, sets out the theoretical framework for deciding when to scale and explores the high-level challenges involved in scaling out. In this chapter, you will also read about the history of deep learning and how scaling has been a key driver of its success.

Part I: Foundational Concepts of Deep Learning

Chapter 2, “Deep Learning”, introduces deep learning through the lens of computational graphs and data flow. Early-stage machine learning practitioners may find this chapter helpful as it explains the inner workings of deep learning through pure Python, no-frills exercises. More experienced deep learning practitioners may choose to skip this chapter.

Chapter 3, “The Computational Side of Deep Learning”, dives into the inner workings of electronic computations and hardware, exploring how compute capabilities are achieved and scaled. It also provides detailed insights into the variety of accelerated hardware available today, to arm you with the knowledge required to choose the most suitable hardware for your project.

Chapter 4, “Putting It All Together: Efficient Deep Learning”, brings the foundational knowledge of deep learning together to provide more practical guidance on how to build an efficient and effective intelligent system for your task and how to measure and monitor it. In this chapter, you will also learn about graph compilation and a series of memory tricks to help you build an efficient stack.

Part II: Distributed Training

Chapter 5, “Distributed Systems and Communications”, introduces the foundations of distributed systems and provides detailed insights into the different types and the challenges associated with each one. Communication, a critical aspect of distributed systems, is explained in this chapter through the lens of deep learning. This chapter also covers the options and tools that can be used to scale out your hardware resources to achieve distributed computing, and what this means for accelerated hardware.

Chapter 6, “Theoretical Foundations of Distributed Deep Learning”, extends Chapter 5 to provide theoretical and foundational knowledge of distributed deep learning. In this chapter, you will learn about a variety of distributed deep learning training techniques and a framework for choosing one.

Chapter 7, “Data Parallelism”, dives into the details of distributed data parallelism and provides a series of practical exercises demonstrating these techniques.

Chapter 8, “Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism”, provides foundational and practical knowledge of scaling model training beyond data parallelism. In this chapter, you will learn about model, pipeline, tensor, and multidimensional hybrid parallelism and experience the challenges and limitations of each of these techniques via practical exercises.

Chapter 9, “Gaining Practical Expertise with Scaling Across All Dimensions”, brings all the learning of Part II together to provide knowledge and insights on how to realize multidimensional parallelism in a more effective manner.

Part III: Extreme Scaling

Chapter 10, “Data-Centric Scaling”, provides a data-centric perspective and offers valuable information on assorted techniques to maximize the gain from your data. This chapter also provides useful insights on how to achieve efficiency in your data pipelines through sampling and selection techniques.

Chapter 11, “Scaling Experiments: Effective Planning and Management”, focuses on scaling out experiments and provides insights on experiment planning and management. This chapter provides useful information for when you’re conducting multiple experiments and want to maximize your chances of finding the best-performing model; it covers techniques like fine-tuning, mixture of experts (MoE), and contrastive learning.

Chapter 12, “Efficient Fine-Tuning of Large Models”, explores low-rank fine-tuning of large models with a practical example.

Chapter 13, “Foundation Models”, lays out the conceptual framework of foundation models and provides a summary of this evolving landscape.

What You Need to Use This Book

To run the code samples in this book, you will need a working device with at least a 16-core CPU and 16 GB (ideally 32 GB) of RAM. Most of the exercises in Part II use accelerated hardware, so access to a system with more than one GPU—ideally NVIDIA—will be required for some of the exercises. Most exercises are written in a platform-agnostic way, and a Dockerfile with a list of runtime dependencies required to run the exercises is provided.

Setting Up Your Environment for Hands-on Exercises

Instructions to set up your environment for this book’s practical exercises are included in the companion GitHub repository. The repository includes specific guidelines for setting up either a Python-based native environment or an emulated Docker environment. Instructions for installing the NVIDIA drivers and CUDA runtime are also provided, along with guidance on updating versions and running the exercises.

Some exercises in Part II will come with special instructions that will be explained in the context of those exercises.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/suneeta-mall/deep_learning_at_scale.

If you have a technical question or a problem using the code examples, please send an email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Deep Learning at Scale by Suneeta Mall (O’Reilly). Copyright 2024 Suneeta Mall, 978-1-098-14528-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher.

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/DLAS.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

To my beloved family: Your unwavering support and understanding during the creation of this book have been immense. My heartfelt thanks to my husband, whose patience and encouragement kept me going. To my incredible children, your curiosity and enthusiasm for learning inspire me every day. This book is as much yours as it is mine.

Mum, Dad, and parents-in-law: your love, wisdom, unwavering belief in my abilities, and endless encouragement have been a guiding light throughout this journey. To my brother, your perseverance knows no bounds and keeps me inspired. This book is dedicated to all of you.

To the open source deep learning community: I have the deepest gratitude for the open source communities around the world that have been forthcoming with their knowledge and work to collectively and collaboratively improve the posture of AI systems in production. Your commitment to innovation and accessibility in the field of deep learning has been revolutionary.

The knowledge, tools, and resources that these communities have built together have not only shaped this book, but have also transformed the landscape of machine learning. I’m deeply thankful for your contributions. This work would not have been possible without you. I take deep pleasure in dedicating this book to you!

To my dedicated tech reviewers and editorial team: I’m indebted to you for your valuable input and dedication to excellence. I would like to acknowledge and express my deepest gratitude to the technical reviewers, Tim Hauke Langer, Giovanni Alzetta, Satyarth Praveen, and Vishwesh Ravi Shrimali, and my editor, Sara Hunter, whose guidance and advice have greatly improved this book. I would also like to express my gratitude to Nicole Butterfield, my acquisitions editor, for her support and guidance in shaping the direction of the book.
