Hands-On Generative AI with Transformers and Diffusion Models

Book description

Learn how to use AI-powered generative media techniques to create novel images, text, and sound in this practical, hands-on guide. Data scientists and software engineers will understand how state-of-the-art generative models work, how to fine-tune and adapt them to their needs, and how to combine existing building blocks to create new models and creative applications in different domains.

This book introduces theoretical concepts in an intuitive way, with extensive code samples and illustrations that you can run on services such as Google Colaboratory, Kaggle, or Hugging Face Spaces with minimal setup. You'll learn how to use open source libraries such as Transformers and Diffusers, explore their code, and study several existing projects to help guide your work.

  • Learn the fundamentals of classic and modern generative AI techniques
  • Build and customize models that can generate text, images, and sound
  • Explore trade-offs between training from scratch and using large, pretrained models
  • Create models that can modify images by transferring the style of other images
  • Tweak and bend transformers and diffusion models for creative purposes
  • Train a model that can write text based on your style
  • Deploy models as interactive demos or services
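As a taste of the "minimal setup" the book promises, generating text with the Transformers library takes only a few lines. This is an illustrative sketch, not an excerpt from the book; the tiny checkpoint named below is an assumption chosen for fast downloads, and its output will not be coherent:

```python
# Minimal text generation with the Hugging Face Transformers pipeline API.
# "sshleifer/tiny-gpt2" is a tiny demo checkpoint (not one used in the book),
# picked here so the example runs quickly on a free Colab or Kaggle session.
from transformers import pipeline

generator = pipeline("text-generation", model="sshleifer/tiny-gpt2")
outputs = generator("Generative AI can", max_new_tokens=10)

# The pipeline returns a list of dicts; "generated_text" holds the prompt
# plus the model's continuation.
print(outputs[0]["generated_text"])
```

Swapping the model name for a larger checkpoint is the only change needed to get fluent text, which is the kind of trade-off the chapters on pretrained models explore.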

Publisher resources

View/Submit Errata

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. Preface
    1. Who Is This Book For?
    2. Prerequisites
    3. What You Will Learn
    4. How to Read This Book
    5. Software and Hardware Requirements
    6. SOTA: A Moving Target
  3. 1. An Introduction to Generative Media
    1. Generating Our First Image
    2. Generating Our First Text
    3. Generating Our First Sound Clip
    4. Ethical and Societal Implications
    5. Where We’ve Been and Where Things Stand
    6. How Are Generative AI Models Created? Big Budgets and Open Source
    7. The Path Ahead
  4. 2. Transformers
    1. A Language Model in Action
      1. Tokenizing Text
      2. Predicting Probabilities
      3. Generating Text
      4. Zero-Shot Generalization
      5. Few-Shot Generalization
    2. A Transformer Block
    3. Transformer Models Genealogy
      1. Sequence-to-Sequence Tasks
      2. Encoder-Only Models
    4. The Power of Pre-training
      1. The Key Insights of Transformers
      2. Transformers Recap
      3. Limitations
    5. Beyond Text
    6. Project Time: Using LMs to Generate Text
    7. Summary
    8. Exercises
    9. References
  5. 3. Diffusion Models
    1. The Key Insight: Iterative Refinement
    2. Training a Diffusion Model
      1. The Data
      2. Adding Noise
      3. The UNet
      4. Training
      5. Sampling
      6. Evaluation
  6. 4. Stable Diffusion
    1. Adding Control: Conditional Diffusion Models
      1. Preparing the Data
      2. Creating a Class-Conditioned Model
      3. Training the Model
      4. Sampling
    2. Improving Efficiency: Latent Diffusion
    3. Stable Diffusion: Components in Depth
      1. The Text Encoder
      2. Classifier-Free Guidance
      3. The VAE
      4. The UNet
      5. Stable Diffusion XL
    4. Putting It All Together: Annotated Sampling Loop
    5. Open Data, Open Models
    6. Project: Build an Interactive ML Demo with Gradio
    7. Summary
    8. Exercises
    9. References
  7. 5. Fine-Tuning Language Models
    1. Classifying Text
      1. 1. Identify a Dataset
      2. 2. Define Which Model Type to Use
      3. 3. Pick a Good Base Model
      4. 4. Pre-process the Dataset
      5. 5. Define Evaluation Metrics
      6. 6. Train the Model
    2. Generating Text
    3. Instructions
      1. An Overview of Instruct-Tuning Research
    4. A Quick Introduction to Adapters
    5. A Light Introduction to Quantization
    6. All Together
      1. Project Time: Retrieval-Augmented Generation (RAG)
      2. Conclusion
      3. Exercises
      4. References
  8. 6. Fine-Tuning Stable Diffusion
    1. Full Stable Diffusion Fine-Tuning
      1. Preparing the Dataset
      2. Fine-Tuning the Model
      3. Inference
    2. Dreambooth
      1. Preparing the Dataset
      2. Prior Preservation
      3. Dreamboothing the Model
      4. Inference
    3. Training LoRAs
    4. Giving Stable Diffusion New Capabilities
      1. Inpainting
      2. Additional Inputs for Special Conditionings
    5. Project Time: Train an SDXL Dreambooth LoRA Yourself!
    6. Conclusion
    7. Exercises
    8. References
  9. 7. Generating Audio
    1. Introduction to Audio ML
    2. Audio Data
      1. Introduction
      2. Waveforms
      3. Spectrogram and Mel Spectrogram
    3. Diffusion-Based Audio Generation
      1. Audio Diffusion and Riffusion
      2. Dance Diffusion
    4. Speech-to-Text with Transformer-Based Architectures
      1. Encoder-Based Techniques
      2. Encoder-Decoder Techniques
      3. From Model to Pipeline
      4. Evaluation
    5. From Text-to-Speech to Generative Audio
      1. Introduction
      2. Generating Audio with Sequence-to-Sequence Models
      3. Going Beyond Speech with Bark
      4. AudioLM and MusicLM
      5. AudioGen and MusicGen
    6. More on Diffusion Models for Generative Audio
    7. Evaluating Audio Generation Systems
    8. Conclusion
      1. Summary of Datasets
      2. Summary of Models
    9. Exercises
    10. References
  10. About the Authors

Product information

  • Title: Hands-On Generative AI with Transformers and Diffusion Models
  • Author(s): Omar Sanseviero, Pedro Cuenca, Apolinario Passos, Jonathan Whitaker
  • Release date: December 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098149246