How are large language models (LLMs) trained?

Large language models (LLMs) are neural networks trained to find patterns in textual data. The training of an LLM happens in three phases:

  1. The core network is trained on large amounts of text (typically from the internet) with the task of predicting the next word.

  2. Once the network has achieved a base level of capability in generating many different types of text, it is made better at generating specific types of text using imitation learning. In this phase, the network is given supervised demonstrations written by experts and learns to imitate the kind of text the experts produce.

  3. The network performs the task and is trained on feedback, which comes either directly from humans or from another network.

Let's go through each step of the process:

  • Step 0: Tokenization + Vectorization: Before training proper begins, training data is gathered from many different sources (Wikipedia, Stack Overflow, Reddit, etc.). Every word is then tokenized, i.e. broken up into constituent parts. For example, differing might be split into two tokens, differ + ing. The tokens are then vectorized, i.e. turned into vectors (collections of numbers) that a neural network can process. The resulting vectors are called embeddings.

Once we have the embeddings, we can begin training the network. The training process updates both the embedding space, so that semantically similar tokens cluster together, and the parameters of the rest of the neural network.
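
As a rough sketch of Step 0, the Python snippet below tokenizes a short sentence with a toy, hand-written vocabulary and maps each token id to an embedding vector. The vocabulary and the suffix-splitting rule are made up for illustration; real tokenizers learn a vocabulary of tens of thousands of subwords (e.g. via byte-pair encoding), and the embedding vectors are learned during training rather than staying random.

```python
import torch
import torch.nn as nn

# Toy subword vocabulary; real tokenizers (e.g. BPE) learn ~50k tokens from data.
vocab = {"differ": 0, "ing": 1, "opinions": 2, "are": 3, "welcome": 4}

def tokenize(text: str) -> list[str]:
    # Hypothetical splitter for illustration only: real tokenizers use learned
    # merge rules, not a hand-written suffix check.
    tokens = []
    for word in text.lower().split():
        if word.endswith("ing") and word[:-3] in vocab:
            tokens += [word[:-3], "ing"]
        else:
            tokens.append(word)
    return tokens

tokens = tokenize("differing opinions are welcome")
ids = torch.tensor([vocab[t] for t in tokens])           # token ids
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(ids)                                 # shape: (num_tokens, 8)

print(tokens)         # ['differ', 'ing', 'opinions', 'are', 'welcome']
print(vectors.shape)  # torch.Size([5, 8])
```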

One possible path to training an LLM can be seen in the InstructGPT training process:

  • Step 1: Self-Supervised Generative Pre-training (Create the “shoggoth”): The LLM is first trained on a large amount of internet text with a single objective: predict the next word (token) given the words that came before it (a minimal code sketch of this objective appears after this list).

  • Step 2: Supervised Fine-tuning (Mold it to be human-like): A fine-tuning dataset is created by giving a human a prompt and asking them to write an output for that prompt. This gives us a dataset of (prompt, output) pairs. We then use this dataset with supervised learning (behavioral cloning) to fine-tune the LLM (see the masking sketch after this list).

  • Step 3: Reinforcement learning from human feedback (Put a smiley face on it):

    • Step 3a: Reward Model: We train an additional reward model. We first prompt the fine-tuned LLM and collect several output samples for the same prompt. A human then manually ranks the samples from best to worst, and we use these rankings to train the reward model to predict which outputs a human would rank higher (see the pairwise ranking-loss sketch after this list).

    • Step 3b: Reinforcement learning: Once we have both a fine-tuned LLM and a reward model, we can use Proximal Policy Optimization (PPO) to train the fine-tuned model to maximize the reward assigned by the reward model, which stands in for human judgments (see the simplified PPO-style sketch after this list).
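
To make Step 1 concrete, here is a minimal PyTorch sketch of the next-token prediction objective: the model reads a sequence of token ids and is trained with cross-entropy to predict, at every position, the token that comes next. The tiny GRU-based TinyLM and the random token batches are placeholders for a real transformer and internet-scale text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, DIM, SEQ_LEN, BATCH = 1000, 64, 32, 8

# Stand-in for a transformer: embed tokens, mix them with a GRU, project to vocab.
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, ids):                  # ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)                  # logits: (batch, seq_len, vocab)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    tokens = torch.randint(0, VOCAB_SIZE, (BATCH, SEQ_LEN))  # placeholder "text"
    logits = model(tokens[:, :-1])           # predict from all but the last token
    targets = tokens[:, 1:]                  # each position's target is the next token
    loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```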
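
Step 2 (supervised fine-tuning) reuses the same next-token loss, but on the human-written (prompt, output) pairs, and usually only the output tokens contribute to the loss. The sketch below shows that masking; model is assumed to be the TinyLM from the previous sketch, and the random id tensors stand in for tokenized prompt/answer pairs.

```python
import torch
import torch.nn.functional as F

# Assumes `model` is the pre-trained LM from the previous sketch and that the
# prompt and the human-written output have already been tokenized into ids.
prompt_ids = torch.randint(0, 1000, (1, 12))   # placeholder prompt tokens
output_ids = torch.randint(0, 1000, (1, 20))   # placeholder human-written answer

ids = torch.cat([prompt_ids, output_ids], dim=1)
logits = model(ids[:, :-1])
targets = ids[:, 1:].clone()

# Ignore positions whose target is part of the prompt, so only the
# human-written output is imitated (behavioral cloning).
targets[:, : prompt_ids.size(1) - 1] = -100    # -100 is skipped by cross_entropy
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
                       ignore_index=-100)
loss.backward()
```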
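
For Step 3a, a common way to turn human rankings into a training signal is a pairwise loss: for every comparison in which the human ranked output A above output B, the reward model is nudged to score A higher than B. The RewardModel below is a hypothetical stand-in (the same tiny backbone with a scalar head), and the comparison batches are random placeholders for real tokenized (prompt + response) pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (prompt + response) token sequence with a single scalar."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, ids):                        # ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(ids))
        return self.score(h[:, -1]).squeeze(-1)    # one reward per sequence

reward_model = RewardModel()
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Placeholder comparison batch: `chosen` was ranked above `rejected` by a human.
chosen = torch.randint(0, 1000, (8, 40))
rejected = torch.randint(0, 1000, (8, 40))

r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)
# Push the preferred output's score above the other: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
opt.zero_grad()
loss.backward()
opt.step()
```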
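
Finally, for Step 3b, the sketch below shows one simplified policy update in the spirit of PPO: score a (prompt, response) sequence with the reward model, penalize divergence from the supervised model via a KL-style term, and apply the clipped surrogate objective to the policy's log-probabilities. This is not a full PPO implementation (a real setup samples responses from the policy itself and adds a value function and advantage estimation); policy, ref_model, and reward_model are assumed to be models like those in the earlier sketches.

```python
import torch
import torch.nn.functional as F

def sequence_logprobs(model, ids):
    """Sum of log-probabilities the model assigns to each token given its prefix."""
    logits = model(ids[:, :-1])                    # (batch, seq-1, vocab)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

# Assumed from earlier sketches: `policy` (the fine-tuned LM being trained),
# `ref_model` (a frozen copy of the supervised model), and `reward_model`.
prompt = torch.randint(0, 1000, (4, 12))
response = torch.randint(0, 1000, (4, 20))         # in practice: sampled from `policy`
ids = torch.cat([prompt, response], dim=1)

with torch.no_grad():
    reward = reward_model(ids)                     # scalar reward per sequence
    old_logp = sequence_logprobs(policy, ids)      # log-probs before the update
    ref_logp = sequence_logprobs(ref_model, ids)

new_logp = sequence_logprobs(policy, ids)

# KL-shaped reward: discourage drifting too far from the supervised model.
kl_penalty = 0.1 * (old_logp - ref_logp)
advantage = reward - kl_penalty - reward.mean()    # crude baseline

# PPO clipped surrogate objective.
ratio = torch.exp(new_logp - old_logp)
clipped = torch.clamp(ratio, 0.8, 1.2)
loss = -torch.min(ratio * advantage, clipped * advantage).mean()
loss.backward()
```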

This short podcast clip talks about the RLHF process and how ChatGPT was trained: