Hands-On Generative AI with Transformers and Diffusion Models

Book description

Learn how to use AI-powered generative media techniques to create novel images or music in this practical, hands-on guide. Data scientists and software engineers will understand how state-of-the-art generative models work, how to fine-tune and adapt them to their needs, and how to combine existing building blocks to create new models and creative applications in different domains.

This book introduces theoretical concepts in an intuitive way, with extensive illustrations and code samples that you can run with minimal setup on services such as Google Colaboratory, Kaggle, or Hugging Face Spaces. You'll learn how to use open source libraries such as Transformers and Diffusers, explore their code, and study several existing projects that can help guide your work.

  • Learn the fundamentals of classic and modern generative AI techniques
  • Build and customize models that can generate text, images, and sound
  • Explore trade-offs between training from scratch and using large, pretrained models
  • Create models that can modify images by transferring the style of other images
  • Tweak and bend transformers and diffusion models for creative purposes
  • Train a model that can write text based on your style
  • Deploy models as interactive demos or services

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Prerequisites
    3. What You Will Learn
    4. How to Read This Book
    5. Software and Hardware Requirements
    6. How to Contact Us
    7. State of the Art: A Moving Target
  2. 1. An Introduction to Generative Media
    1. Generating Images
    2. Generating Text
    3. Generating Sound Clips
    4. Ethical and Societal Implications
    5. Where We’ve Been and Where Things Stand
    6. How Are Generative AI Models Created?
    7. Summary
  3. 2. Transformers
    1. A Language Model in Action
      1. Tokenizing Text
      2. Predicting Probabilities
      3. Generating Text
      4. Zero-Shot Generalization
      5. Few-Shot Generalization
    2. A Transformer Block
    3. Transformer Models Genealogy
      1. Sequence-to-Sequence Tasks
      2. Encoder-Only Models
    4. The Power of Pre-training
    5. Transformers Recap
      1. Limitations
      2. Beyond Text
    6. Project Time: Using LMs to Generate Text
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  4. 3. Compressing and Representing Information
    1. AutoEncoders
      1. Preparing the Data
      2. Modeling the Encoder
      3. Decoder
      4. Training
      5. Exploring the Latent Space
      6. Visualizing the Latent Space
    2. Variational AutoEncoders (VAEs)
      1. VAE Encoders and Decoders
      2. Sampling from the Encoder Distribution
      3. Training the VAE
      4. VAEs for Generative Modeling
    3. CLIP
      1. Contrastive Loss
      2. Using CLIP, Step by Step
      3. Zero-Shot Image Classification with CLIP
      4. Zero-Shot Image Classification Pipeline
      5. CLIP Use Cases
    4. Alternatives to CLIP
    5. Project Time: Semantic Image Search
    6. Summary
    7. Exercises
    8. Challenge
    9. References
  5. 4. Diffusion Models
    1. The Key Insight: Iterative Refinement
    2. Training a Diffusion Model
      1. The Data
      2. Adding Noise
      3. The UNet
      4. Training
      5. Sampling
      6. Evaluation
    3. In Depth: Noise Schedules
      1. Why Add Noise?
      2. Starting Simple
      3. The Math
      4. Effect of Input Resolution and Scaling
    4. In Depth: UNets and Alternatives
      1. A Simple UNet
      2. Improving the UNet
      3. Alternative Architectures
    5. In Depth: Diffusion Objectives
    6. Project Time: Train Your Diffusion Model
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  6. 5. Stable Diffusion and Conditional Generation
    1. Adding Control: Conditional Diffusion Models
      1. Preparing the Data
      2. Creating a Class-Conditioned Model
      3. Training the Model
      4. Sampling
    2. Improving Efficiency: Latent Diffusion
    3. Stable Diffusion: Components in Depth
      1. The Text Encoder
      2. The Variational AutoEncoder
      3. The UNet
      4. Stable Diffusion XL, 3, and FLUX
      5. FLUX, SD3, and Video
      6. Classifier-Free Guidance
    4. Putting It All Together: Annotated Sampling Loop
    5. Open Data, Open Models
      1. Challenges and the Sunset of LAION 5B
      2. Alternatives
      3. Fair and Commercial Use
    6. Project Time: Build an Interactive ML Demo with Gradio
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  7. 6. Fine-Tuning Language Models
    1. Classifying Text
      1. Identify a Dataset
      2. Define Which Model Type to Use
      3. Select a Good Base Model
      4. Pre-Process the Dataset
      5. Define Evaluation Metrics
      6. Train the Model
      7. Still Relevant?
    2. Generating Text
      1. Picking the Right Generative Model
      2. Training a Generative Model
    3. Instructions
    4. A Quick Introduction to Adapters
    5. A Light Introduction to Quantization
    6. All Together
    7. A Deeper Dive into Evaluation
    8. Project Time: Retrieval-Augmented Generation (RAG)
    9. Summary
    10. Exercises
    11. Challenges
    12. References
  8. 7. Fine-Tuning Stable Diffusion
    1. Full Stable Diffusion Fine-Tuning
      1. Preparing the Dataset
      2. Fine-Tuning the Model
      3. Inference
    2. Dreambooth
      1. Preparing the Dataset
      2. Prior Preservation
      3. Dreamboothing the Model
      4. Inference
    3. Training LoRAs
    4. Giving Stable Diffusion New Capabilities
      1. Inpainting
      2. Additional Inputs for Special Conditionings
    5. Project Time: Train an SDXL Dreambooth LoRA by Yourself
    6. Summary
    7. Exercises
    8. Challenges
    9. References
  9. 8. Creative Applications of Text-To-Image Models
    1. Image-to-Image
    2. Inpainting
    3. Prompt Weighting and Image Editing
      1. Prompt Weighting and Merging
      2. Editing Diffusion Images with Semantic Guidance
    4. Real Image Editing via Inversion
      1. Editing with LEDITS++
      2. Real Image Editing via Instruction Fine-Tuning
    5. ControlNet
    6. Image Prompting and Image Variations
      1. Image Variations
      2. Image Prompting
    7. Project Time: Your Creative Canvas
    8. Summary
    9. Exercises
    10. References
  10. 9. Generating Audio
    1. Audio Data
      1. Waveforms
      2. Spectrogram and Mel Spectrogram
    2. Speech to Text with Transformers-Based Architectures
      1. Encoder-Based Techniques
      2. Encoder-Decoder Techniques
      3. From Model to Pipeline
      4. Evaluation
    3. From Text to Speech to Generative Audio
      1. Generating Audio with Sequence-to-Sequence Models
      2. Going Beyond Speech with Bark
      3. AudioLM and MusicLM
      4. AudioGen and MusicGen
      5. Audio Diffusion and Riffusion
      6. Dance Diffusion
      7. More on Diffusion Models for Generative Audio
    4. Evaluating Audio Generation Systems
    5. What’s Next?
    6. Project Time: End-to-End Conversational System
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  11. 10. Rapidly Advancing Areas in Generative AI
    1. Preference Optimization
    2. Long Contexts
    3. Mixture of Experts
    4. Optimizations and Quantizations
    5. Data
    6. One Model to Rule Them All
    7. Computer Vision
    8. 3D Computer Vision
    9. Video Generation
    10. Multimodality
    11. Community
  12. A. Open-Source Tools
    1. The Hugging Face Stack
    2. Data
    3. Wrappers
    4. Local Inference
    5. Deployment Tools
  13. B. LLM Memory Requirements
    1. Inference Memory Requirements
    2. Training Memory Requirements
    3. Further Reading
  14. C. End-to-End Retrieval-Augmented Generation
    1. Processing the Data
    2. Embedding the Documents
    3. Retrieval
    4. Generation
    5. Production-level RAG
  15. About the Authors

Product information

  • Title: Hands-On Generative AI with Transformers and Diffusion Models
  • Author(s): Omar Sanseviero, Pedro Cuenca, Apolinario Passos, Jonathan Whitaker
  • Release date: November 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098149246