Chapter 7. Fine-Tuning with Reinforcement Learning from Human Feedback

As you learned in Chapters 5 and 6, fine-tuning with instructions can improve your model’s performance and help the model better understand humanlike prompts and generate more humanlike responses. However, instruction fine-tuning alone doesn’t prevent the model from generating undesired, false, and sometimes even harmful completions.

Undesirable output is no surprise, given that these models are trained on vast amounts of text data from the internet, which unfortunately contains plenty of offensive language and toxicity. And while researchers and practitioners continue to scrub and refine pretraining datasets to remove unwanted data, there is still a chance that the model will generate content that does not align with human values and preferences.

Reinforcement learning from human feedback (RLHF) is a fine-tuning mechanism that uses human annotation—also called human feedback—to help the model adapt to human values and preferences. RLHF is most commonly applied after other forms of fine-tuning, including instruction fine-tuning.
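
To make the mechanism concrete, the following is a minimal sketch of a single RLHF update step. It assumes the open source Hugging Face TRL library and its classic PPOTrainer API (class and method signatures vary across TRL versions), uses a small GPT-2 base model purely for illustration, and substitutes a hard-coded reward value for the reward model you would normally train on human preference data:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # small base model, used purely for illustration

# Policy model (with a value head for PPO) plus a frozen reference copy that
# PPO uses to keep the tuned policy from drifting too far from the original.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# 1. The current policy generates a completion for a prompt.
prompt = "Explain why the sky is blue."
query_tensor = tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
output = ppo_trainer.generate(
    query_tensor, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
)
# Keep only the generated tokens so the reward scores the completion itself.
response_tensor = output.squeeze(0)[query_tensor.shape[0]:]

# 2. A reward scores the completion. In a real RLHF pipeline this comes from
#    a reward model trained on human feedback; here it is a placeholder.
reward = [torch.tensor(1.0)]

# 3. One PPO step nudges the policy toward completions that earn higher rewards.
stats = ppo_trainer.step([query_tensor], [response_tensor], reward)

In a complete RLHF pipeline, this loop runs over many prompts, and the placeholder reward is replaced by scores from a reward model trained on human preference rankings.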

While RLHF is typically used to help a model generate more humanlike and human-aligned outputs, you could also use RLHF to fine-tune highly personalized models. For example, you could fine-tune a chat assistant specific to each user of your application. This chat assistant can adopt the style, voice, or sense of humor of each user based on their interactions with your application.

In this chapter, ...
