What is reinforcement learning from human feedback (RLHF)?

Reinforcement learning from human feedback (RLHF) is a method for training an AI by having people give feedback on its behavior. Through that feedback, it can approximate what its programmers want, even if they can’t specify it in an algorithm—describing a desired behavior in detail is often harder than recognizing it. RLHF is an essential part of OpenAI’s strategy for building AIs that are safe and aligned with human values. A prime example of an AI trained with RLHF is OpenAI’s ChatGPT.

It is often hard to specify exactly what we want AIs to do. Say we want to make an AI do a backflip; how can we accurately describe what a backflip is and what it isn’t? RLHF solves this as follows: we show a human two examples of an AI’s backflip attempts, let the human decide which one looks more backflippy, and update the AI accordingly. Repeat this a thousand times, and we get the AI close to doing actual backflips!
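As a minimal sketch of that comparison loop, here is a toy version in PyTorch. Everything in it is illustrative rather than the original setup: the “backflip attempts” are stand-in feature vectors instead of video clips of the agent, and simulated_human_preference is a hypothetical placeholder for the real human judge. The core idea is the pairwise loss: whichever attempt the human preferred should receive the higher score from the learned reward model.

```python
import torch
import torch.nn as nn

# Toy stand-ins: each "backflip attempt" is a feature vector summarizing a trajectory.
# In the real setting these would be video clips of the agent's behavior.
TRAJ_DIM = 16

# A small reward model: maps a trajectory to a scalar "how backflippy is this?" score.
reward_model = nn.Sequential(nn.Linear(TRAJ_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def simulated_human_preference(traj_a, traj_b):
    """Hypothetical placeholder for the human judge.

    Returns 0 if the human prefers traj_a, 1 if they prefer traj_b. Here the
    judgment is faked with a hidden criterion so the sketch runs end to end.
    """
    true_score = lambda t: t.sum()  # pretend this measures backflippiness
    return 0 if true_score(traj_a) > true_score(traj_b) else 1

for step in range(1000):  # "repeat this a thousand times"
    traj_a, traj_b = torch.randn(TRAJ_DIM), torch.randn(TRAJ_DIM)
    label = simulated_human_preference(traj_a, traj_b)

    # Pairwise (Bradley-Terry style) loss: the preferred trajectory should get
    # the higher score from the reward model.
    scores = torch.stack([reward_model(traj_a), reward_model(traj_b)]).squeeze(-1)
    loss = nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([label]))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The agent's policy would then be trained with ordinary reinforcement learning to
# maximize the learned reward, which by now approximates "looks like a backflip".
```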

We also want safe and helpful AI assistants, but as with backflips, it’s hard to specify exactly what this entails. The main candidates for AI assistants are language models, which are trained on large text datasets to predict the next word. They are not trained to be safe or helpful; their outputs can be toxic, offensive, dangerous, or plain useless. Again, RLHF can bring us closer to our goal.

Training language models with RLHF (to be safe and helpful) works as follows:

  • Step 1: We take a dataset of prompts and have human labelers write (safe and helpful) responses to them. These responses are used to fine-tune a language model to produce outputs more like those written by the labelers.

  • Step 2: We input the prompts to our language model and sample several responses for every prompt. This time the labelers rank the responses from best to worst (according to how safe and helpful they are). A different AI model, the “reward model”, is trained to predict which responses labelers prefer.

  • Step 3: Using reinforcement learning, the reward model from step 2 is used to further fine-tune the language model from step 1. The language model is trained to generate responses that the reward model predicts labelers will prefer (safer and more helpful responses); steps 2 and 3 are sketched in code below the figure.

Figure: Steps for reinforcement learning from human feedback. Source: OpenAI
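To make steps 2 and 3 more concrete, here is a hedged sketch in PyTorch of the two training objectives involved. The names and numbers are illustrative stand-ins, not OpenAI’s actual implementation: reward_model_loss shows how pairwise comparisons derived from the labelers’ rankings train the reward model, and rlhf_reward shows the quantity the reinforcement learning step pushes up, namely the reward model’s score minus a KL-style penalty that keeps the fine-tuned model close to the step-1 model.

```python
import torch
import torch.nn.functional as F

# --- Step 2: train the reward model from the labelers' rankings ----------------
# A ranking over several responses is broken into pairwise comparisons: whenever
# response i was ranked above response j, the reward model should score i higher.
def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scores a (hypothetical) reward model gave to preferred vs. rejected responses.
chosen = torch.tensor([1.2, 0.4, 2.0], requires_grad=True)
rejected = torch.tensor([0.3, 0.9, -0.5], requires_grad=True)
reward_model_loss(chosen, rejected).backward()  # this gradient would update the reward model

# --- Step 3: the quantity the RL step maximizes --------------------------------
# For a response y sampled from the policy (the language model being tuned) on
# prompt x, the RL step pushes up:
#     reward_model(x, y) - beta * [log pi(y|x) - log pi_SFT(y|x)]
# The second term is a KL-style penalty that keeps the tuned model close to the
# step-1 model, so it cannot drift into degenerate text that merely games the
# reward model.
def rlhf_reward(rm_score, policy_logprob, sft_logprob, beta=0.02):
    return rm_score - beta * (policy_logprob - sft_logprob)

# Illustrative numbers standing in for one sampled response:
print(rlhf_reward(rm_score=1.7, policy_logprob=-42.0, sft_logprob=-45.0))
# In practice this per-response reward is handed to an RL algorithm such as PPO,
# which updates the language model's weights.
```

Keeping the tuned model close to its supervised starting point is a deliberate design choice: without the penalty, the model can drift toward unnatural text that scores well with the reward model but is useless to humans.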

If we do it right, we end up with a language model that responds to our questions with mostly safe and helpful answers. In reality, though, RLHF still has many problems.

  • Firstly, it’s not robust: language models trained using RLHF can and do still produce harmful content.

  • Secondly, RLHF has limited scalability: it becomes ineffective when tasks grow so complex that humans can’t give useful feedback anymore. This is where scalable oversight methods such as Iterated Distillation and Amplification could help out in the future.

  • Lastly, by optimizing the AI to get good feedback from humans, we reward outputs that look good to human evaluators rather than outputs that actually are good, which incentivizes behaviors such as deception.

Given the limitations of RLHF and the fact that it does not address many of the hard problems in AI safety, we need better strategies to make AI safe and aligned.