Chapter 5. Stable Diffusion and Conditional Generation

In the previous chapter, we introduced diffusion models and the underlying idea of iterative refinement. By the end of the chapter, we could generate images, but training the model was time-consuming, and we had no control over the generated images. In this chapter, we’ll see how to go from there to text-conditioned models that can efficiently generate images from text descriptions, using a model called Stable Diffusion as a case study. Before we get to Stable Diffusion, though, we’ll look at how conditional models work and review some of the innovations that led to the text-to-image models we have today.

Adding Control: Conditional Diffusion Models

Before we tackle the challenge of generating images from text descriptions, let’s start with something slightly easier: steering the model’s output toward specific types or classes of images. We can do this with a method called conditioning, where we ask the model to generate not just any image but an image belonging to a predefined class. In this context, conditioning means guiding the model’s output by providing additional information, such as a class label or a prompt, during the generation process.
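
To make this concrete, here is a minimal sketch of one common way to implement class conditioning (not necessarily the exact approach used later in this chapter): the class label is mapped to a learned embedding, which is broadcast spatially and concatenated with the noisy image as extra input channels to the UNet. The specifics below, a diffusers UNet2DModel sized for 28x28 grayscale images, 10 classes, and a 4-dimensional class embedding, are illustrative assumptions.

# A minimal sketch of class conditioning via extra input channels.
# The sizes and block configuration here are assumptions for illustration.
import torch
from torch import nn
from diffusers import UNet2DModel

class ClassConditionedUNet(nn.Module):
    def __init__(self, num_classes=10, class_emb_size=4):
        super().__init__()
        # Learnable embedding: maps each class label to a small vector
        self.class_emb = nn.Embedding(num_classes, class_emb_size)
        # The UNet receives the noisy image plus the class-embedding channels
        self.model = UNet2DModel(
            sample_size=28,                    # assumed image size
            in_channels=1 + class_emb_size,    # image channels + class channels
            out_channels=1,                    # predicted noise
            block_out_channels=(32, 64, 64),
            down_block_types=("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D"),
            up_block_types=("AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D"),
        )

    def forward(self, noisy_images, timesteps, class_labels):
        bs, _, h, w = noisy_images.shape
        # Look up the class embedding and expand it to the image's spatial size
        class_cond = self.class_emb(class_labels)                 # (bs, emb_size)
        class_cond = class_cond.view(bs, -1, 1, 1).expand(bs, -1, h, w)
        # Concatenate along the channel dimension and predict the noise
        net_input = torch.cat((noisy_images, class_cond), dim=1)
        return self.model(net_input, timesteps).sample

Concatenating extra channels is only one option: UNet2DModel also accepts a num_class_embeds argument, which instead adds a class embedding to the timestep embedding. Either way, the network sees the label at every denoising step.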

Model conditioning is a simple but effective concept. We’ll start from the diffusion model we used in Chapter 4 and introduce a few changes. First, rather than using the butterflies dataset, we’ll switch to a dataset that has classes. We’ll use ...
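
With a conditioned model in hand, the training loop barely changes: the only new ingredient is that the class labels from the dataset are passed to the model alongside the noisy images and timesteps. The sketch below uses MNIST purely as an illustrative stand-in for a class-labeled dataset, together with diffusers’ DDPMScheduler, and reuses the hypothetical ClassConditionedUNet sketched earlier.

# Illustrative training step for a class-conditioned diffusion model.
# MNIST and the hyperparameters here are assumptions, not the book's choices.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
dataset = datasets.MNIST(root=".", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=128, shuffle=True)

model = ClassConditionedUNet().to(device)           # hypothetical model from above
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for images, labels in loader:
    images = images.to(device) * 2 - 1              # scale pixels to [-1, 1]
    labels = labels.to(device)
    noise = torch.randn_like(images)
    timesteps = torch.randint(0, 1000, (images.shape[0],), device=device)
    noisy = scheduler.add_noise(images, noise, timesteps)
    # The only conditional twist: the class labels go into the model too
    pred = model(noisy, timesteps, labels)
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    break  # one illustrative step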
