Hands-On Generative AI with Transformers and Diffusion Models

Book description

Learn to use generative AI techniques to create novel text, images, audio, and even music with this practical, hands-on book. You'll understand how state-of-the-art generative models work, how to fine-tune and adapt them to your needs, and how to combine existing building blocks to create new models and creative applications in different domains.

This go-to book introduces theoretical concepts followed by guided practical applications, with extensive code samples and easy-to-understand illustrations. You'll learn how to use open source libraries to work with transformers and diffusion models, explore their code, and study several existing projects to guide your own work (a short code sketch follows the list below).

  • Build and customize models that can generate text and images
  • Explore trade-offs between using a pretrained model and fine-tuning your own model
  • Create and use models that can generate and edit images in any style
  • Customize transformers and diffusion models for multiple creative purposes
  • Train models that can reflect your own unique style
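
As a taste of that hands-on approach, here is a minimal sketch that generates text and an image with the open source transformers and diffusers libraries; the model names and prompts are illustrative choices, not examples taken from the book:

    from transformers import pipeline
    from diffusers import DiffusionPipeline
    import torch

    # Text: a small open language model (model choice is illustrative)
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Generative AI is", max_new_tokens=25)[0]["generated_text"])

    # Images: a Stable Diffusion XL pipeline (downloads weights; a GPU keeps it fast)
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    pipe("a watercolor painting of a lighthouse").images[0].save("lighthouse.png")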

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Prerequisites
    3. What You Will Learn
    4. How to Read This Book
    5. Software and Hardware Requirements
    6. Conventions Used in This Book
    7. Using Code Examples
    8. How to Contact Us
    9. State of the Art: A Moving Target
    10. Acknowledgments
      1. Jonathan
      2. Apolinário
      3. Pedro
      4. Omar
  2. I. Leveraging Open Models
  3. 1. An Introduction to Generative Media
    1. Generating Images
    2. Generating Text
    3. Generating Sound Clips
    4. Ethical and Societal Implications
    5. Where We’ve Been and Where Things Stand
    6. How Are Generative AI Models Created?
    7. Summary
  4. 2. Transformers
    1. A Language Model in Action
      1. Tokenizing Text
      2. Predicting Probabilities
      3. Generating Text
      4. Zero-Shot Generalization
      5. Few-Shot Generalization
    2. A Transformer Block
    3. Transformer Model Genealogy
      1. Sequence-to-Sequence Tasks
      2. Encoder-Only Models
    4. The Power of Pretraining
    5. Transformers Recap
      1. Limitations
      2. Beyond Text
    6. Project Time: Using LMs to Generate Text
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  5. 3. Compressing and Representing Information
    1. AutoEncoders
      1. Preparing the Data
      2. Modeling the Encoder
      3. Decoder
      4. Training
      5. Exploring the Latent Space
      6. Visualizing the Latent Space
    2. Variational AutoEncoders
      1. VAE Encoders and Decoders
      2. Sampling from the Encoder Distribution
      3. Training the VAE
      4. VAEs for Generative Modeling
    3. CLIP
      1. Contrastive Loss
      2. Using CLIP, Step-by-Step
      3. Zero-Shot Image Classification with CLIP
      4. Zero-Shot Image Classification Pipeline
      5. CLIP Use Cases
    4. Alternatives to CLIP
    5. Project Time: Semantic Image Search
    6. Summary
    7. Exercises
    8. Challenges
    9. References
  6. 4. Diffusion Models
    1. The Key Insight: Iterative Refinement
    2. Training a Diffusion Model
      1. The Data
      2. Adding Noise
      3. The UNet
      4. Training
      5. Sampling
      6. Evaluation
    3. In Depth: Noise Schedules
      1. Why Add Noise?
      2. Starting Simple
      3. The Math
      4. Effect of Input Resolution and Scaling
    4. In Depth: UNets and Alternatives
      1. A Simple UNet
      2. Improving the UNet
      3. Alternative Architectures
    5. In Depth: Diffusion Objectives
    6. Project Time: Train Your Diffusion Model
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  7. 5. Stable Diffusion and Conditional Generation
    1. Adding Control: Conditional Diffusion Models
      1. Preparing the Data
      2. Creating a Class-Conditioned Model
      3. Training the Model
      4. Sampling
    2. Improving Efficiency: Latent Diffusion
    3. Stable Diffusion: Components in Depth
      1. The Text Encoder
      2. The Variational AutoEncoder
      3. The UNet
      4. Stable Diffusion XL
      5. FLUX, SD3, and Video
      6. Classifier-Free Guidance
    4. Putting It All Together: Annotated Sampling Loop
    5. Open Data, Open Models
      1. Challenges and the Sunset of LAION-5B
      2. Alternatives
      3. Fair and Commercial Use
    6. Project Time: Build an Interactive ML Demo with Gradio
    7. Summary
    8. Exercises
    9. Challenge
    10. References
  8. II. Transfer Learning for Generative Models
  9. 6. Fine-Tuning Language Models
    1. Classifying Text
      1. Identify a Dataset
      2. Define Which Model Type to Use
      3. Select a Good Base Model
      4. Preprocess the Dataset
      5. Define Evaluation Metrics
      6. Train the Model
      7. Still Relevant?
    2. Generating Text
      1. Picking the Right Generative Model
      2. Training a Generative Model
    3. Instructions
    4. A Quick Introduction to Adapters
    5. A Light Introduction to Quantization
    6. Putting It All Together
    7. A Deeper Dive into Evaluation
    8. Project Time: Retrieval-Augmented Generation
    9. Summary
    10. Exercises
    11. Challenge
    12. References
  10. 7. Fine-Tuning Stable Diffusion
    1. Full Stable Diffusion Fine-Tuning
      1. Preparing the Dataset
      2. Fine-Tuning the Model
      3. Inference
    2. DreamBooth
      1. Preparing the Dataset
      2. Prior Preservation
      3. DreamBoothing the Model
      4. Inference
    3. Training LoRAs
    4. Giving Stable Diffusion New Capabilities
      1. Inpainting
      2. Additional Inputs for Special Conditionings
    5. Project Time: Train an SDXL DreamBooth LoRA by Yourself
    6. Summary
    7. Exercises
    8. Challenge
    9. References
  11. III. Going Further
  12. 8. Creative Applications of Text-to-Image Models
    1. Image to Image
    2. Inpainting
    3. Prompt Weighting and Image Editing
      1. Prompt Weighting and Merging
      2. Editing Diffusion Images with Semantic Guidance
    4. Real Image Editing via Inversion
      1. Editing with LEDITS++
      2. Real Image Editing via Instruction Fine-Tuning
    5. ControlNet
    6. Image Prompting and Image Variations
      1. Image Variations
      2. Image Prompting
    7. Project Time: Your Creative Canvas
    8. Summary
    9. Exercises
    10. References
  13. 9. Generating Audio
    1. Audio Data
      1. Waveforms
      2. Spectrograms
    2. Speech to Text with Transformer-Based Architectures
      1. Encoder-Based Techniques
      2. Encoder-Decoder Techniques
      3. From Model to Pipeline
      4. Evaluation
    3. From Text to Speech to Generative Audio
      1. Generating Audio with Sequence-to-Sequence Models
      2. Going Beyond Speech with Bark
      3. AudioLM and MusicLM
      4. AudioGen and MusicGen
      5. Audio Diffusion and Riffusion
      6. Dance Diffusion
      7. More on Diffusion Models for Generative Audio
    4. Evaluating Audio-Generation Systems
    5. What’s Next?
    6. Project Time: End-to-End Conversational System
    7. Summary
    8. Exercises
    9. Challenges
    10. References
  14. 10. Rapidly Advancing Areas in Generative AI
    1. Preference Optimization
    2. Long Contexts
    3. Mixture of Experts
    4. Optimizations and Quantizations
    5. Data
    6. One Model to Rule Them All
    7. Computer Vision
    8. 3D Computer Vision
    9. Video Generation
    10. Multimodality
    11. Community
  15. A. Open Source Tools
    1. The Hugging Face Stack
    2. Data
    3. Wrappers
    4. Local Inference
    5. Deployment Tools
  16. B. LLM Memory Requirements
    1. Inference Memory Requirements
    2. Training Memory Requirements
    3. Further Reading
  17. C. End-to-End Retrieval-Augmented Generation
    1. Processing the Data
    2. Embedding the Documents
    3. Retrieval
    4. Generation
    5. Production-Level RAG
  18. Index
  19. About the Authors

Product information

  • Title: Hands-On Generative AI with Transformers and Diffusion Models
  • Author(s): Omar Sanseviero, Pedro Cuenca, Apolinário Passos, Jonathan Whitaker
  • Release date: November 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098149246