Chapter 8. Model Deployment Optimizations

After you have adapted your model to your target task, you will ultimately want to deploy it so you can begin interacting with it and, potentially, integrate it into an application designed to consume it.

Before deploying your generative model, you need to understand both the resources it will require and the intended experience for interacting with it. Assessing resource needs means identifying requirements such as how fast the model must generate completions, what compute budget is available, and what trade-offs in model performance you are willing to make to achieve faster inference and potentially reduce storage costs.

In this chapter, you will explore various techniques for performing post-training optimizations on your model, including pruning, quantization, and distillation. Additional considerations, such as selecting the optimal compute resources to balance cost and performance, will also require tuning your deployment configuration after the model is deployed.
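To make one of these techniques concrete before diving in, here is a minimal sketch of post-training dynamic quantization using PyTorch's built-in tooling. The toy model and layer sizes are illustrative assumptions, not an example from this chapter; the same call applies to any model containing supported layer types.

import torch
import torch.nn as nn

# A toy stand-in for a transformer's feed-forward block; any model
# containing nn.Linear layers is quantized the same way.
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

# Post-training dynamic quantization: weights of the listed module
# types are converted to 8-bit integers, shrinking the model and
# often speeding up CPU inference at a small cost in accuracy.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)

Note that this happens entirely after training: the original weights are simply re-encoded at lower precision, which is why it is called a post-training optimization.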

Model Optimizations for Inference

The size of generative AI models often presents deployment challenges in terms of compute, storage, and memory requirements, as well as in ensuring low-latency completions. One of the primary ways to optimize for deployment is to take advantage of techniques that reduce the size of the model, typically referred to as model compression.
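As a simple illustration of size reduction, the sketch below applies unstructured magnitude pruning to a single layer using PyTorch's pruning utilities. The layer and the 30% sparsity level are illustrative assumptions; in practice the sparsity target is tuned against accuracy on a validation set.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical single layer standing in for part of a larger model.
layer = nn.Linear(1024, 1024)

# Unstructured magnitude pruning: zero out the 30% of weights with
# the smallest absolute values.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the reparameterization.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")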
