How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)?
ARC is trying to solve the problem of Eliciting Latent Knowledge (ELK). Suppose that you are training a predictor — an AI that takes actions and predicts their consequences in the world. In the course of making accurate predictions, the predictor may come to know things about the world that are never visible in its outputs. ELK asks how to train a second model, a reporter, that honestly tells us what the predictor knows.
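To make the setup concrete, here is a minimal sketch of what a predictor with a separate reporter head could look like. This is purely illustrative and assumes a PyTorch-style design; the module names, sizes, and architecture are not ARC's actual proposal.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the ELK setup: a predictor that models the world,
# and a reporter that answers questions about the predictor's internal state.
# All names and dimensions here are assumptions, not ARC's design.

class Predictor(nn.Module):
    def __init__(self, obs_dim=64, action_dim=8, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, obs_dim)  # predicts the next observation

    def forward(self, obs, action):
        latent = self.encoder(torch.cat([obs, action], dim=-1))
        return self.head(latent), latent  # prediction plus the latent state


class Reporter(nn.Module):
    """Maps the predictor's latent state and a question embedding to an answer."""
    def __init__(self, hidden_dim=128, question_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + question_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # probability that the answer is "yes"
        )

    def forward(self, latent, question):
        return torch.sigmoid(self.net(torch.cat([latent, question], dim=-1)))
```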
Paul Christiano's plan seems to be to produce a minimal AI that can help to do AI safety research.
If ARC can solve ELK, this would be a very strong form of interpretability: our reporter is able to tell us what the predictor believes about the world. Some ways this could be useful for aligning the predictor include:
- Using the reporter to find deceptive or misaligned thoughts in the predictor, and then optimizing against those interpreted thoughts (see the sketch after this list). At any point in training, stochastic gradient descent only updates the weights by a small amount, so if the AI becomes misaligned it will not yet be very misaligned; the interpretability tools can discover this, and a gradient step can realign it. In this way, deception can be prevented at every point in training.
- Stopping training if the reporter shows that the AI is misaligned.
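As a rough, purely illustrative sketch of these two ideas, a training step might add the reporter's answer to a "deception" question as a penalty on the loss, and halt training entirely if that answer rises above a threshold. This assumes the hypothetical predictor/reporter modules above; the `deception_question`, penalty weight, and threshold are made-up placeholders, not ARC's actual method.

```python
import torch

def train_step(predictor, reporter, optimizer, obs, action, target,
               deception_question, penalty_weight=1.0, halt_threshold=0.9):
    """One hypothetical training step with a reporter-based misalignment check."""
    prediction, latent = predictor(obs, action)
    prediction_loss = torch.nn.functional.mse_loss(prediction, target)

    # Ask the reporter whether the predictor "believes" something we consider
    # deceptive or misaligned; treat its answer as a probability in [0, 1].
    misalignment_score = reporter(latent, deception_question).mean()

    if misalignment_score.item() > halt_threshold:
        return None  # stop training: the model looks too misaligned to continue

    # Optimize against the interpreted thoughts: each SGD step only moves the
    # weights a little, so misalignment is corrected while it is still small.
    loss = prediction_loss + penalty_weight * misalignment_score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```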