Chapter 3. Large-Language Foundation Models

In Chapter 2, you learned how to perform prompt engineering and leverage in-context learning using an existing foundation model. In this chapter, you will explore how a foundation model is trained, including the training objectives and datasets. While it’s not common to train your own foundation model from scratch, it is worth understanding the time, effort, and complexity required for this compute-intensive process.

Training a multibillion-parameter large-language model from scratch, called pretraining, requires millions of GPU compute hours, trillions of data tokens, and a lot of patience. You will also learn about the empirical scaling laws for model pretraining described in the popular Chinchilla paper.1
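To make the scaling-law idea concrete, the following sketch applies two commonly cited approximations derived from the Chinchilla results: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 training tokens per model parameter. The constants, the function name chinchilla_optimal, and the example FLOP budget are illustrative assumptions rather than values taken directly from the paper.

# A rough sketch of the Chinchilla compute-optimal heuristic.
# Assumptions: training compute C ~= 6 * N * D FLOPs, and a compute-optimal
# ratio of roughly 20 training tokens per model parameter. Both are common
# approximations of the paper's findings, not exact constants.

import math

TOKENS_PER_PARAM = 20          # approximate Chinchilla-optimal token/parameter ratio
FLOPS_PER_PARAM_TOKEN = 6      # forward + backward pass approximation

def chinchilla_optimal(compute_budget_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that roughly balance a FLOP budget."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Example: a hypothetical training budget of 1e23 FLOPs
params, tokens = chinchilla_optimal(1e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")  # ~29B parameters, ~0.6T tokens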

When training the BloombergGPT model, for example, researchers used the Chinchilla scaling laws as a starting point but still needed a good deal of trial and error, as explained in the BloombergGPT paper.2 With a compute budget of 1.3 million GPU hours, BloombergGPT was trained on a large distributed cluster of GPU instances using Amazon SageMaker.
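As a rough, back-of-the-envelope illustration only, a GPU-hour budget can be converted into an approximate FLOP budget and then fed into a scaling-law estimate like the sketch above. The peak throughput and utilization figures below are assumptions chosen for illustration; they are not numbers reported in the BloombergGPT paper.

# Back-of-the-envelope conversion of a GPU-hour budget into a FLOP budget.
# PEAK_FLOPS and UTILIZATION are illustrative assumptions, not reported values.

GPU_HOURS = 1.3e6       # compute budget cited in the text
PEAK_FLOPS = 312e12     # assumed peak throughput of an A100-class GPU (BF16)
UTILIZATION = 0.3       # assumed fraction of peak throughput sustained in training

flop_budget = GPU_HOURS * 3600 * PEAK_FLOPS * UTILIZATION
print(f"~{flop_budget:.1e} training FLOPs")   # on the order of 1e23-1e24 FLOPs

# Feeding this budget into chinchilla_optimal() from the previous sketch gives
# a rough starting point for model size and token count, which teams then
# adjust for their own data availability and training constraints.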

Note

This chapter dives deep into pretraining generative foundation models, which may overwhelm some readers. It’s important to note that you do not need to fully understand this chapter to effectively build generative AI applications. You may find this chapter useful as a reference for some advanced concepts later in this book.
