What is FAR AI's research agenda?
FAR AI is a research group whose mission is to ensure that AI systems are both trustworthy and beneficial to society. Their research strategy is centered on four main directions:
- ML System Vulnerability studies how robust machine learning systems are. This involves identifying weaknesses using adversarial testing and applying methods to make the systems more resistant to the issues identified (see the adversarial-training sketch after this list). Existing research in this direction features "Adversarial Policies Beat Superhuman Go AIs" and "AI Safety in a World of Vulnerable ML Systems." Ongoing projects cover adversarial training in KataGo and scaling laws for robustness. Future research aims to explore how to adapt existing alignment techniques to tolerate failures.
- Value Alignment ensures that AI systems are aligned with human preferences and are incentivized to fulfill them (a minimal preference-learning sketch follows this list). Some existing research in this field consists of "Training Language Models with Language Feedback" and "Pretraining Language Models with Human Preferences." Ongoing research works towards aligning reinforcement learning (RL) agents via image-captioning models.
- Language Model Evaluation assesses how scale and techniques such as reinforcement learning from human feedback (RLHF) affect model performance and value alignment. Research such as "Inverse Scaling" and "Evaluating the Moral Beliefs Encoded in LLMs" falls under this umbrella.
- Model Internals focuses on reverse engineering ML models to understand how they function. Existing research includes "Eliciting Latent Predictions from Transformers with the Tuned Lens" (a simplified sketch of this idea appears after this list). Continuing this line of investigation, work on the mechanistic interpretability of mesa-optimization seeks to understand how some machine learning models develop their own internal methods for problem-solving. Looking ahead, FAR AI plans to set standards for what constitutes a “good” interpretability hypothesis, which means developing criteria to evaluate how well a proposed explanation helps us understand a model’s behavior.
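To make the first direction concrete, here is a minimal, hypothetical sketch of adversarial testing and adversarial training on a toy classifier. It is not FAR AI's code: their ongoing work applies these ideas to Go-playing agents such as KataGo, but the basic pattern of "search for inputs that fool the model, then train on them" is the same.

```python
# Illustrative sketch only: a generic FGSM-style adversarial training loop for a
# small image classifier. The model, data, and hyperparameters are toy stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def fgsm_attack(x, y, epsilon=0.1):
    """Adversarial testing: perturb x in the direction that most increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def training_step(x, y):
    """Adversarial training: fit the model on the perturbed (worst-case) inputs."""
    x_adv = fgsm_attack(x, y)
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for real data.
x_batch, y_batch = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
print(training_step(x_batch, y_batch))
```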
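The value alignment direction builds on the idea of turning human preference judgments into a training signal. The hypothetical sketch below shows one standard way to do this, fitting a reward model to pairwise "A is preferred to B" comparisons with a Bradley-Terry loss; the model, dimensions, and data are illustrative stand-ins, not FAR AI's setup.

```python
# Illustrative sketch only: learning a reward model from pairwise human preferences.
import torch
import torch.nn as nn

EMBED_DIM = 64  # assumed toy dimensionality for pre-computed text embeddings

reward_model = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred, rejected):
    """P(preferred beats rejected) = sigmoid(r_preferred - r_rejected); maximize its log."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Dummy embeddings standing in for a batch of preferred / rejected model outputs.
preferred = torch.randn(16, EMBED_DIM)
rejected = torch.randn(16, EMBED_DIM)

optimizer.zero_grad()
loss = preference_loss(preferred, rejected)
loss.backward()
optimizer.step()
print(loss.item())
```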
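Finally, for the model internals direction, the sketch below loosely follows the idea behind the tuned lens: train a small affine "translator" so that an intermediate layer's hidden states, pushed through the model's unembedding, reproduce the model's final next-token distribution. The hidden states and unembedding here are random stand-ins for a real transformer, so this only illustrates the training objective, not the paper's implementation.

```python
# Illustrative sketch only, loosely following "Eliciting Latent Predictions from
# Transformers with the Tuned Lens". All tensors are random stand-ins for a real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, VOCAB = 128, 1000
unembed = nn.Linear(D_MODEL, VOCAB, bias=False)  # stand-in for the model's unembedding
unembed.weight.requires_grad_(False)             # the unembedding stays frozen
translator = nn.Linear(D_MODEL, D_MODEL)         # the tuned-lens probe for one layer
optimizer = torch.optim.Adam(translator.parameters(), lr=1e-3)

# Stand-ins for hidden states from some intermediate layer k and from the final layer.
h_layer_k = torch.randn(32, D_MODEL)
h_final = torch.randn(32, D_MODEL)

with torch.no_grad():
    target_logprobs = F.log_softmax(unembed(h_final), dim=-1)  # the model's actual predictions

for _ in range(100):
    optimizer.zero_grad()
    probe_logprobs = F.log_softmax(unembed(translator(h_layer_k)), dim=-1)
    # Train the probe so its distribution matches the final-layer distribution (KL divergence).
    loss = F.kl_div(probe_logprobs, target_logprobs, reduction="batchmean", log_target=True)
    loss.backward()
    optimizer.step()

print(loss.item())
```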
In addition to incubating the portfolio of high-potential AI safety research agendas described above, they support projects that are:
- Too large or complex to be led by academia, or
- Unaligned with the interests of the commercial sector because they are unprofitable.
They are following a hits-based approach, so they expect some of their agendas to be extremely impactful and others to have limited impact. Their goal is to rapidly identify the agendas that work and scale them up, reallocating resources away from those that don’t.