Deep Learning at Scale

Book description

Bringing a deep-learning project into production at scale is challenging. To scale your project successfully, you need a foundational understanding of full-stack deep learning: the knowledge that lies at the intersection of hardware, software, data, and algorithms.

This book illustrates the complex concepts of full-stack deep learning and reinforces them with hands-on exercises, arming you with the tools and techniques needed to scale your project. A scaling effort is only beneficial when it is both effective and efficient, so this guide explains the intricate concepts and techniques that will help you scale effectively and efficiently.

You'll gain a thorough understanding of:

  • How data flows through the deep-learning network and the role computation graphs play in building your model
  • How accelerated computing speeds up your training and how best to utilize the resources at your disposal
  • How to train your model using distributed training paradigms, i.e., data, model, and pipeline parallelism
  • How to leverage the PyTorch ecosystem in conjunction with NVIDIA libraries and Triton to scale your model training
  • How to debug, monitor, and investigate the bottlenecks that slow down your model training
  • How to expedite the training lifecycle and streamline your feedback loop for iterative model development
  • How to apply data tricks and techniques to scale your model training
  • How to select the right tools and techniques for your deep-learning project
  • Options for managing the compute infrastructure when running at scale

Table of contents

  1. Preface
    1. Why Scaling Matters
    2. Who This Book Is For
    3. How This Book Is Organized
      1. Introduction
      2. Part I: Foundational Concepts of Deep Learning
      3. Part II: Distributed Training
      4. Part III: Extreme Scaling
    4. What You Need to Use This Book
    5. Setting Up Your Environment for Hands-on Exercises
    6. Using Code Examples
    7. Conventions Used in This Book
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  2. 1. What Nature and History Have Taught Us About Scale
    1. The Philosophy of Scaling
      1. The General Law of Scaling
      2. History of Scaling Law
    2. Scalable Systems
      1. Nature as a Scalable System
      2. Our Visual System: A Biological Inspiration
    3. Artificial Intelligence: The Evolution of Learnable Systems
      1. It Takes Four to Tango
      2. Evolving Deep Learning Trends
    4. Scale in the Context of Deep Learning
      1. Six Development Considerations
      2. Scaling Considerations
    5. Summary
  3. I. Foundational Concepts of Deep Learning
  4. 2. Deep Learning
    1. The Role of Data in Deep Learning
    2. Data Flow in Deep Learning
    3. Hands-On Exercise #1: Implementing Minimalistic Deep Learning
      1. Developing the Model
      2. The Embedded/Latent Space
      3. A Word of Caution
      4. The Learning Rate and Loss Landscape
      5. Scaling Consideration
      6. Profiling
    4. Hands-On Exercise #2: Getting Complex with PyTorch
      1. Model Input Data and Pipeline
      2. Model
      3. Auxiliary Utilities
      4. Putting It All Together
    5. Computation Graphs
    6. Inference
    7. Summary
  5. 3. The Computational Side of Deep Learning
    1. The Higgs Boson of the Digital World
      1. Floating-Point Numbers: The Faux Continuous Numbers
      2. Units of Data Measurement
      3. Data Storage Formats: The Trade-off of Latency and Throughput
    2. Computer Architecture
      1. The Birth of the Electromechanical Engine
      2. Memory and Persistence
      3. Computation and Memory Combined
    3. The Scaling Laws of Electronics
    4. Scaling Out Computation with Parallelization
      1. Threads Versus Processes: The Unit of Parallelization
      2. Hardware-Optimized Libraries for Acceleration
      3. Parallel Computer Architectures: Flynn’s and Duncan’s Taxonomies
    5. Accelerated Computing
      1. Popular Accelerated Devices for Deep Learning
      2. CUDA
      3. Accelerator Benchmarking
    6. Summary
  6. 4. Putting It All Together: Efficient Deep Learning
    1. Hands-On Exercise #1: GPT-2
      1. Exercise Objectives
      2. Model Architecture
      3. Implementation
      4. Running the Example
      5. Experiment Tracking
      6. Measuring to Understand the Limitations and Scale Out
      7. Transitioning from Language to Vision
    2. Hands-On Exercise #2: Vision Model with Convolution
      1. Model Architecture
      2. Running the Example
      3. Observations
    3. Graph Compilation Using PyTorch 2.0
      1. New Components of PyTorch 2.0
      2. Graph Execution in PyTorch 2.0
    4. Modeling Techniques to Scale Training on a Single Device
      1. Graph Compilation
      2. Reduced- and Mixed-Precision Training
      3. Memory Tricks for Efficiency
      4. Optimizer Efficiencies
      5. Model Input Pipeline Tricks
      6. Writing Custom Kernels in PyTorch 2.0 with Triton
    5. Summary
  7. II. Distributed Training
  8. 5. Distributed Systems and Communications
    1. Distributed Systems
      1. The Eight Fallacies of Distributed Computing
      2. The Consistency, Availability, and Partition Tolerance (CAP) Theorem
      3. The Scaling Law of Distributed Systems
      4. Types of Distributed Systems
    2. Communication in Distributed Systems
      1. Communication Paradigm
      2. Communication Patterns
      3. Communication Technologies
      4. MPI
      5. Communication Initialization: Rendezvous
      6. Hands-On Exercise
    3. Scaling Compute Capacity
      1. Infrastructure Setup Options
      2. Provisioning of Accelerated Devices
      3. Workload Management
    4. Deep Learning Infrastructure Review
      1. Overview of Leading Deep Learning Clusters
      2. Similarities Between Today’s Most Powerful Systems
    5. Summary
  9. 6. Theoretical Foundations of Distributed Deep Learning
    1. Distributed Deep Learning
      1. Centralized DDL
      2. Decentralized DDL
    2. Dimensions of Scaling Distributed Deep Learning
      1. Partitioning Dimensions of Distributed Deep Learning
      2. Types of Distributed Deep Learning Techniques
      3. Choosing a Scaling Technique
    3. Measuring Scale
      1. End-to-End Metrics and Benchmarks
      2. Measuring Incrementally in a Reproducible Environment
    4. Summary
  10. 7. Data Parallelism
    1. Data Partitioning
      1. Implications of Data Sampling Strategies
      2. Working with Remote Datasets
    2. Introduction to Data Parallel Techniques
      1. Hands-On Exercise #1: Centralized Parameter Server Using RPC
      2. Hands-On Exercise #2: Centralized Gradient-Partitioned Joint Worker/Server Distributed Training
      3. Hands-On Exercise #3: Decentralized Asynchronous Distributed Training
    3. Centralized Synchronous Data Parallel Strategies
      1. Data Parallel (DP)
      2. Distributed Data Parallel (DDP)
      3. Zero Redundancy Optimizer–Powered Data Parallelism (ZeRO-DP)
      4. Fault-Tolerant Training
      5. Hands-On Exercise #4: Scene Parsing with DDP
      6. Hands-On Exercise #5: Distributed Sharded DDP (ZeRO)
    4. Building Efficient Pipelines
      1. Dataset Format
      2. Local Versus Remote
      3. Staging
      4. Threads Versus Processes: Scaling Your Pipelines
      5. Memory Tricks
      6. Data Augmentations: CPU Versus GPU
      7. JIT Acceleration
      8. Hands-On Exercise #6: Pipeline Efficiency with FFCV
    5. Summary
  11. 8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism
    1. Questions to Ask Before Scaling Vertically
    2. Theoretical Foundations of Vertical Scaling
      1. Revisiting the Dimensions of Scaling
      2. Operators’ Perspective of Parallelism Dimensions
      3. Data Flow and Communications in Vertical Scaling
    3. Basic Building Blocks for Scaling Beyond DP
      1. PyTorch Primitives for Vertical Scaling
      2. Working with Larger Models
      3. Distributed Checkpointing: Saving the Partitioned Model
    4. Summary
  12. 9. Gaining Practical Expertise with Scaling Across All Dimensions
    1. Hands-On Exercises: Model, Tensor, Pipeline, and Hybrid Parallelism
      1. The Dataset
      2. Hands-On Exercise #1: Baseline DeepFM
      3. Hands-On Exercise #2: Model Parallel DeepFM
      4. Hands-On Exercise #3: Pipeline Parallel DeepFM
      5. Hands-On Exercise #4: Pipeline Parallel DeepFM with RPC
      6. Hands-On Exercise #5: Tensor Parallel DeepFM
      7. Hands-On Exercise #6: Hybrid Parallel DeepFM
    2. Tools and Libraries for Vertical Scaling
      1. OneFlow
      2. FairScale
      3. DeepSpeed
      4. FSDP
      5. Overview and Comparison
      6. Hands-On Exercise #7: Automatic Vertical Scaling with DeepSpeed
      7. Observations
    3. Summary
  13. III. Extreme Scaling
  14. 10. Data-Centric Scaling
    1. The Seven Vs of Data Through a Deep Learning Lens
    2. The Scaling Law of Data
    3. Data Quality
      1. Validity
      2. Variety
      3. Veracity
      4. Value and Volume
    4. The Data Engine and Continual Learning
      1. Volatility
      2. Velocity
    5. Summary
  15. 11. Scaling Experiments: Effective Planning and Management
    1. Model Development Is Iterative
    2. Planning for Experiments and Execution
      1. Simplify the Complex
      2. Fast Iteration for Fast Feedback
      3. Decoupled Iterations
      4. Feasibility Testing
      5. Developing and Scaling a Minimal Viable Solution
      6. Setting Up for Iterative Execution
    3. Techniques to Scale Your Experiments
      1. Accelerating Model Convergence
      2. Accelerating Learning via Optimization and Automation
      3. Accelerating Learning by Increasing Expertise
      4. Learning with Scarce Supervision
    4. Hands-On Exercises
      1. Hands-On Exercise #1: Transfer Learning
      2. Hands-On Exercise #2: Hyperparameter Optimization
      3. Hands-On Exercise #3: Knowledge Distillation
      4. Hands-On Exercise #4: Mixture of Experts
      5. Hands-On Exercise #5: Contrastive Learning
      6. Hands-On Exercise #6: Meta-Learning
    5. Summary
  16. 12. Efficient Fine-Tuning of Large Models
    1. Review of Fine-Tuning Techniques
      1. Standard Fine-Tuning
      2. Meta-Learning (Zero-/Few-Shot Learning)
      3. Adapter-Based Fine-Tuning
      4. Low-Rank Tuning
    2. LoRA: Parameter-Efficient Fine-Tuning
    3. Quantized LoRA (QLoRA)
    4. Hands-On Exercise: QLoRA-Based Fine-Tuning
      1. Implementation Details
      2. Inference
      3. Exercise Summary
    5. Summary
  17. 13. Foundation Models
    1. What Are Foundation Models?
    2. The Evolution of Foundation Models
    3. Challenges Involved in Developing Foundation Models
      1. Measurement Complexity
      2. Deployment Challenges
      3. Propagation of Defects to All Downstream Models
      4. Legal and Ethical Considerations
      5. Ensuring Consistency and Coherency
    4. Multimodal Large Language Models
      1. Projection
      2. Gated Cross-Attention
      3. Query-Based Encoding
      4. Further Exploration
    5. Summary
  18. Index
  19. About the Author

Product information

  • Title: Deep Learning at Scale
  • Author(s): Suneeta Mall
  • Release date: June 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098145286