What is reinforcement learning from human feedback (RLHF)?
Reinforcement learning from human feedback (RLHF) is a method for training an AI by having people give feedback on its behavior. Through that feedback, it can approximate what its programmers want, even if they can't specify it in an algorithm - describing a desired behavior in detail is often harder than recognizing it. Most modern large language models, including the models behind ChatGPT, are fine-tuned using RLHF.
It is often hard to specify what exactly we want AIs to do. Say we want to make an AI do a backflip; how can we accurately describe what a backflip is and what it isn’t? RLHF solves this the following way: We first show a human two examples of an AI’s backflip attempts, then let the human decide which one looks more backflippy, and finally update the AI correspondingly. Repeat this a thousand times, and we get the AI close to doing actual backflips!
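As a rough illustration, that feedback loop can be written in a few lines of Python. The two helper functions below are hypothetical stand-ins for sampling an attempt from the AI and asking a human to compare two attempts; the point is just that we only need comparisons, not a precise definition of a backflip.

```python
import random

def generate_attempt():
    """Stand-in for sampling a behavior from the current policy."""
    return [random.random() for _ in range(4)]

def ask_human(attempt_a, attempt_b):
    """Stand-in for a human judging which attempt looks more like a backflip.
    In real RLHF this is an actual person comparing two video clips or texts."""
    return "a" if sum(attempt_a) > sum(attempt_b) else "b"

# Collect a dataset of pairwise comparisons.
preference_data = []
for _ in range(1000):
    a, b = generate_attempt(), generate_attempt()
    preference_data.append((a, b, ask_human(a, b)))

# These comparisons are what the AI is then updated on, e.g. by training
# a reward model as described in the steps below.
```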
We also want safe and helpful AI assistants, but like with backflips, it’s hard to specify exactly what this entails. The main candidates for AI assistants are language models, which are trained to predict the next word on large datasets of text. They are not trained to be safe or helpful; their outputs can be toxic, offensive, dangerous, or plain useless. Again, using RLHF can bring us closer to our goal.
Training language models with RLHF (to be safe and helpful) works as follows (toy code sketches of each step appear after the figure below):
- Step 1: We have a dataset of prompts, and use labelers to write (safe and helpful) responses to these prompts. The responses are used to fine-tune a language model to produce outputs more like those given by the labelers.
- Step 2: We input the prompts to our language model and sample several responses for every prompt. This time the labelers rank the responses from best to worst (according to how safe and helpful they are). A different AI model, the “reward model”, is trained to predict which responses labelers prefer.
- Step 3: Using reinforcement learning, the reward model from step 2 is used to further fine-tune the language model from step 1. The language model is trained to generate responses that the reward model predicts labelers will prefer (safer and more helpful responses).
Steps for reinforcement learning from human feedback. Source: OpenAI
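To make Step 1 concrete, here is a minimal toy sketch, assuming PyTorch. The `TinyLM` class is a stand-in for a real large language model; the loop simply minimizes next-token prediction error on the (prompt + labeler-written response) sequences.

```python
# Toy sketch of Step 1 (supervised fine-tuning) - illustrative only.
import torch
import torch.nn as nn

VOCAB = 100  # toy vocabulary size

class TinyLM(nn.Module):
    """A minimal next-token predictor standing in for a large language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits for the next token at each position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Pretend dataset: token sequences of (prompt + labeler-written response).
batch = torch.randint(0, VOCAB, (8, 20))

for _ in range(100):
    logits = model(batch[:, :-1])   # predict each next token
    targets = batch[:, 1:]          # the tokens that actually follow
    # In real fine-tuning the loss is usually computed only on the response
    # tokens, not the prompt; this toy version skips that detail.
    loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```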
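Step 2 can be sketched the same way. The toy `TinyRewardModel` below assigns a single score to a (prompt + response) sequence and is trained with a standard pairwise loss so that responses labelers ranked higher receive higher scores. Again, this is a sketch under toy assumptions, not production code.

```python
# Toy sketch of Step 2 (reward-model training) - illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100

class TinyRewardModel(nn.Module):
    """Reads a (prompt + response) sequence and outputs a single score."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.score = nn.Linear(64, 1)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.score(h[:, -1, :]).squeeze(-1)  # score from final state

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend data: for each prompt, a response the labelers ranked higher
# ("chosen") and one they ranked lower ("rejected").
chosen = torch.randint(0, VOCAB, (8, 20))
rejected = torch.randint(0, VOCAB, (8, 20))

for _ in range(100):
    # Pairwise loss: the chosen response should get a higher score.
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```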
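Step 3 is where reinforcement learning comes in. Real systems typically use PPO; the sketch below (which reuses `model`, `reward_model`, and `VOCAB` from the Step 1 and 2 sketches) instead uses a much simpler REINFORCE-style update with a penalty for drifting away from the original fine-tuned model, which is the basic shape of the RLHF objective.

```python
# Toy sketch of Step 3 (RL fine-tuning), continuing the Step 1 and 2 sketches.
# Real RLHF typically uses PPO; this simplified version just shows the idea:
# reward responses the reward model likes, penalize drift from the reference.
import copy
import torch
import torch.nn.functional as F

policy = model                    # the fine-tuned LM from Step 1
reference = copy.deepcopy(model)  # frozen copy, keeps the policy from drifting
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1                        # strength of the divergence penalty

prompts = torch.randint(0, VOCAB, (8, 5))

for _ in range(100):
    # Sample a short response from the policy, one token at a time.
    tokens = prompts
    log_probs = []
    for _ in range(10):
        logits = policy(tokens)[:, -1, :]
        dist = torch.distributions.Categorical(logits=logits)
        next_token = dist.sample()
        log_probs.append(dist.log_prob(next_token))
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)
    log_probs = torch.stack(log_probs, dim=1)  # (batch, response_length)

    # Score the full (prompt + response) with the reward model from Step 2,
    # and estimate how far the policy has drifted from the reference model.
    with torch.no_grad():
        reward = reward_model(tokens)
        ref_logits = reference(tokens[:, :-1])
        ref_log_probs = F.log_softmax(ref_logits, dim=-1)
        ref_chosen = ref_log_probs.gather(
            -1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    kl = (log_probs.detach() - ref_chosen[:, -log_probs.shape[1]:]).sum(dim=1)
    total_reward = reward - beta * kl

    # REINFORCE: raise the probability of responses with high total reward.
    loss = -(total_reward * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```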
If we do it right, we end up with a language model that responds to our questions with mostly safe and helpful answers. In reality, though, RLHF still has many problems.
- Firstly, it’s not robust: language models trained using RLHF can and do still produce harmful content.
- Secondly, RLHF has limited scalability: it becomes ineffective when tasks grow so complex that humans can’t give useful feedback anymore. This is where scalable oversight methods such as Iterated Distillation and Amplification could help out in the future.
- Lastly, by optimizing the AI for getting good feedback from humans, we incentivize things such as deception.
Given the limitations of RLHF and the fact that it does not address many of the hard problems in AI safety, it is unlikely to be sufficient on its own, and other alignment approaches will still be needed.