Chapter 10. Multimodal Foundation Models

Generative AI can be unimodal or multimodal. Unimodal models work exclusively with data in a single modality, such as text. Large language models (LLMs) are a popular example of unimodal generative AI; both the input (prompt) and the output (completion) are text. Once you add another modality to the mix, such as image, video, or audio, you are tapping into multimodal generative AI.

With multimodal generative AI, you can broaden the scope of use cases and tasks and potentially move closer to artificial general intelligence (AGI) by enhancing the model's contextual understanding and cross-modal learning. Multimodal generative AI is a step toward simulating real-world complexity: it not only enables models to process diverse data formats but also lets them learn through cross-modal transfer and become better at creative problem solving.

With multimodal AI, you add one or more content modalities to the input to support tasks such as converting image to text or text to image; a sketch of the text-to-image case follows. Figure 10-1 illustrates the difference between unimodal and multimodal generative AI.
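As a taste of what text-to-image conversion looks like in code, here is a minimal sketch using the Hugging Face diffusers library with a Stable Diffusion checkpoint; the checkpoint name, prompt, and output filename are illustrative, not prescribed by this chapter.

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint (illustrative name)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # move the pipeline to a GPU if one is available

# Natural-language prompt in, generated image out
image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")

The pipeline abstracts away the diffusion process itself: you pass a natural language prompt and receive a rendered image, which is exactly the cross-modal interaction pattern this chapter explores.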

This chapter starts with an introduction to multimodal generative AI use cases and tasks, including image generation and visual question answering (VQA) using the Stable Diffusion and IDEFICS models, respectively. The power of these multimodal models is the ability to interact with them using natural language prompts.
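For the VQA side, the sketch below uses IDEFICS through the Hugging Face transformers library, following the interleaved text-and-image prompt format documented for the model at the time of writing; the checkpoint name and image URL are illustrative.

import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to("cuda")

# A single prompt interleaves natural-language text with an image reference
prompts = [
    [
        "User: What is shown in this image?",
        "https://example.com/dog.jpg",  # illustrative image URL
        "<end_of_utterance>",
        "\nAssistant:",
    ]
]
inputs = processor(prompts, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Note that the question and the image travel together in one prompt, so the model answers the question in the context of the image rather than treating the two inputs separately.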

Let’s start by exploring common multimodal generative AI use cases and tasks.
