Chapter 13. Foundation Models

The loftier the building, the deeper must the foundation be laid.

—Thomas à Kempis

Extreme scaling of deep learning models along various dimensions (data, compute, model capacity) has led to the development of general-purpose models that can perform many different tasks without explicit, task-specific supervision. These models often have generative and adaptive capabilities and are so effective across many tasks, ranging from basic perception and cognition to scene and text understanding and instruction following, that they are becoming increasingly central to applied AI.
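To make the idea of performing a task without task-specific training concrete, the following is a minimal sketch (not from this chapter) of zero-shot classification with a pretrained model; it assumes the Hugging Face transformers library and the facebook/bart-large-mnli checkpoint, which are illustrative choices rather than anything the chapter prescribes:

from transformers import pipeline

# Load a pretrained natural language inference model and use it as a
# zero-shot classifier: no labeled training data or fine-tuning for the
# target task is required.
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
)

result = classifier(
    "The new GPU cluster cut our pretraining time in half.",
    candidate_labels=["hardware", "cooking", "sports"],
)
print(result["labels"][0])  # highest-scoring label, e.g., "hardware"

The same pretrained model can be pointed at arbitrary label sets at inference time, which is the behavior the paragraph above describes: one general-purpose model serving many tasks it was never explicitly trained on.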

In this chapter, you will learn about the fundamentals of these so-called foundation models and their evolution to date. You’ll read about challenges involved in developing and adapting these models, explore how they are becoming multimodal, and review the groundbreaking architectures LLaVA, Flamingo, and BLIP-2.

What Are Foundation Models?

The term foundation model was coined by the Stanford Institute for Human-Centered Artificial Intelligence’s Center for Research on Foundation Models to describe large-scale deep learning models that are trained on very large datasets and can perform well on many tasks without being explicitly (i.e., with full supervision) trained to do so.1 These models are not only capable in their own unique ways but, much like the foundations of a building, also provide a strong basis to extend and adapt to much more complex, purpose-specific ...
