How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)?

ARC is trying to solve the problem of Eliciting Latent Knowledge (ELK). Suppose that you are training a *predictor*: an AI agent that predicts the state of the world and then performs some actions. How can you create another model, a *reporter*, that tells you what the predictor believes about the world? A key challenge is that training your reporter on human-labeled training data incentivizes the reporter to report what the human trainer would believe is true, rather than what the predictor actually knows: since the human is sometimes mistaken, answers that match "what the human would believe to be true" receive approval more consistently than answers that match the actual truth.
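
To make the setup concrete, here is a minimal sketch of the predictor/reporter arrangement. All class names, architectures, and the question encoding are hypothetical stand-ins for illustration, not ARC's actual proposal; the point is only to show where the incentive problem enters, namely that the reporter's loss is computed against human labels rather than against the predictor's actual knowledge.

```python
# Hypothetical sketch of the ELK setup: a predictor whose internal latent state
# encodes its "beliefs", and a reporter that reads that latent state to answer
# questions. Names and architectures are made up for illustration.
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Maps observations to a latent state and a predicted future observation."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, obs_dim)

    def forward(self, obs: torch.Tensor):
        latent = self.encoder(obs)          # the predictor's internal "beliefs"
        prediction = self.decoder(latent)   # predicted future observation
        return latent, prediction

class Reporter(nn.Module):
    """Answers yes/no questions by reading the predictor's latent state."""
    def __init__(self, latent_dim: int, question_dim: int):
        super().__init__()
        self.head = nn.Linear(latent_dim + question_dim, 1)

    def forward(self, latent: torch.Tensor, question: torch.Tensor):
        return torch.sigmoid(self.head(torch.cat([latent, question], dim=-1)))

def reporter_loss(reporter, latent, question, human_label):
    """Train the reporter against human labels.

    This is exactly where the failure mode enters: a reporter that models
    "what the human would believe" fits these labels at least as well as a
    reporter that translates the predictor's actual knowledge.
    """
    answer = reporter(latent.detach(), question)  # read, don't change, the predictor
    return nn.functional.binary_cross_entropy(answer, human_label)
```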

Paul Christiano's plan seems to be to produce a minimal AI that can help do AI safety research. To do this, preventing deception and inner alignment failure is on the critical path, and the only known approaches rely on interpretability (this is how all of Evan Hubinger's 11 proposals plan to get around the problem).

If ARC can solve ELK, that would amount to a very strong form of interpretability: the reporter would be able to tell us what the predictor believes about the world. Some ways this could be useful for aligning the predictor include:

  • Using the reporter to find deceptive or misaligned thoughts in the predictor, and then optimizing against those interpreted thoughts (see the sketch after this list). Because stochastic gradient descent only updates the weights a small amount at each step, an AI that starts to become misaligned will at first be only slightly misaligned; the reporter can surface this early, and a corrective gradient step can push it back toward alignment. In this way, deception could be caught and corrected at any point during training.

  • Stopping training if the AI is misaligned.
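
Here is a rough sketch of how both ideas might look inside a training loop, reusing the made-up interfaces from the sketch above (predictor(obs) returns a latent state and a prediction; reporter(latent, question) returns a score in [0, 1]). The reporter is asked a fixed "is the predictor being deceptive?" question, its answer is added as a penalty to the loss, and training halts if the answer is confident. This is an illustration of the two bullets, not ARC's actual method.

```python
import torch.nn as nn

def training_step(predictor, reporter, optimizer, obs, target_obs,
                  deception_question, penalty_weight=1.0, halt_threshold=0.99):
    """One hypothetical training step that uses the reporter as an overseer.

    Assumes `optimizer` holds the predictor's parameters and that the
    (made-up) interfaces from the earlier sketch are in use.
    """
    latent, prediction = predictor(obs)
    prediction_loss = nn.functional.mse_loss(prediction, target_obs)

    # Ask the reporter whether the predictor's latent state looks deceptive.
    deception_score = reporter(latent, deception_question).mean()

    if deception_score.item() > halt_threshold:
        # Second bullet: stop training outright if the AI looks misaligned.
        raise RuntimeError("Reporter flags likely deception; halting training.")

    # First bullet: optimize against the interpreted thoughts. Because each SGD
    # step changes the weights only slightly, any misalignment should still be
    # mild when this penalty starts pushing back against it.
    loss = prediction_loss + penalty_weight * deception_score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), deception_score.item()
```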