Research Agendas lay out the areas of research which individuals or groups are working on, or those that they believe would be valuable for others to work on. They help make research more legible and encourage discussion of priorities.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
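To make the crude decision criterion concrete, here is a minimal sketch. It assumes a hypothetical finite set of policies and a hand-written world model that assigns each policy a probability distribution over diamond counts; the policy names and numbers are illustrative assumptions, not part of any real system.

```python
from typing import Dict, List, Tuple

# Hypothetical world model: for each policy, a list of (probability, diamonds) outcomes.
Outcomes = List[Tuple[float, int]]


def expected_diamonds(outcomes: Outcomes) -> float:
    """Expected number of diamonds under one policy's outcome distribution."""
    return sum(p * d for p, d in outcomes)


def select_policy(world_model: Dict[str, Outcomes]) -> str:
    """Return the policy whose expected diamond count is highest."""
    return max(world_model, key=lambda name: expected_diamonds(world_model[name]))


# Illustrative numbers only (assumed for this sketch).
toy_world_model = {
    "mine_carefully": [(0.9, 10), (0.1, 0)],
    "mine_recklessly": [(0.5, 30), (0.5, 0)],
    "do_nothing": [(1.0, 0)],
}
best = select_policy(toy_world_model)
```

The argmax itself is trivial. The open problems lie in what this sketch takes for granted: a world model that already reports outcomes as a count of "diamonds".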
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. -Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
We’re facing the challenge of “Philosophy With A Deadline”.
Many of the problems surrounding superintelligence are the sorts of problems philosophers have been dealing with for centuries. To what degree is meaning inherent in language, versus something that requires external context? How do we translate between the logic of formal systems and normal ambiguous human speech? Can morality be reduced to a set of ironclad rules, and if not, how do we know what it is at all?
Existing answers to these questions are enlightening but nontechnical. The theories of Aristotle, Kant, Mill, Wittgenstein, Quine, and others can help people gain insight into these questions, but are far from formal. Just as a good textbook can help an American learn Chinese, but cannot be encoded into machine language to make a Chinese-speaking computer, so the philosophies that help humans are only a starting point for the project of computers that understand us and share our values.
The field of AI alignment combines formal logic, mathematics, computer science, cognitive science, and philosophy in order to advance that project.
This is the philosophy; the other half of Bostrom’s formulation is the deadline. Traditional philosophy has been going on for almost three thousand years; machine goal alignment has until the advent of superintelligence, a nebulous event which may be anywhere from decades to centuries away.
If the alignment problem doesn’t get adequately addressed by then, we are likely to see poorly aligned superintelligences that are unintentionally hostile to the human race, with some of the catastrophic outcomes mentioned above. This is why so many scientists and entrepreneurs are urging quick action on getting machine goal alignment research up to an adequate level.
If it turns out that superintelligence is centuries away and such research is premature, little will have been lost. But if our projections were too optimistic, and superintelligence is imminent, then doing such research now rather than later becomes vital.
Research at the Alignment Research Center is led by Paul Christiano, best known for introducing the “Iterated Distillation and Amplification” and “Humans Consulting HCH” approaches. He and his team are now “trying to figure out how to train ML systems to answer questions by straightforwardly ‘translating’ their beliefs into natural language rather than by reasoning about what a human wants to hear.”
Chris Olah (after work at DeepMind and OpenAI) recently launched Anthropic, an AI lab focussed on the safety of large models. While his previous work was concerned with “transparency” and “interpretability” of large neural networks, especially vision models, Anthropic is focussing more on large language models, among other things working towards a "general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless".
Stuart Russell and his team at the Center for Human-Compatible Artificial Intelligence (CHAI) have been working on inverse reinforcement learning (where the AI infers human values from observing human behavior) and corrigibility, as well as attempts to disaggregate neural networks into “meaningful” subcomponents (see Filan, et al.’s “Clusterability in neural networks” and Hod et al.'s “Detecting modularity in deep neural networks”).
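The idea behind inverse reinforcement learning can be illustrated with a toy Bayesian sketch. This is not CHAI's method. It assumes a hypothetical human who chooses between two options in a noisily rational (softmax) way, and two hand-written candidate reward functions; the observer updates a uniform prior over those candidates from the observed choices.

```python
import math
from typing import Dict, List


def choice_likelihood(reward: Dict[str, float], action: str, beta: float = 1.0) -> float:
    """P(action | reward) for a softmax ("Boltzmann-rational") chooser."""
    z = sum(math.exp(beta * r) for r in reward.values())
    return math.exp(beta * reward[action]) / z


def posterior_over_rewards(
    candidates: Dict[str, Dict[str, float]],
    observed_actions: List[str],
    beta: float = 1.0,
) -> Dict[str, float]:
    """Start from a uniform prior over candidate rewards and update on observed choices."""
    weights = {}
    for name, reward in candidates.items():
        w = 1.0
        for action in observed_actions:
            w *= choice_likelihood(reward, action, beta)
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical candidate values and observed behavior (assumed for this sketch).
candidates = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
observed = ["tea", "tea", "coffee", "tea"]
posterior = posterior_over_rewards(candidates, observed)
```

After three tea choices and one coffee choice, most of the posterior mass sits on "likes_tea", but not all of it: the occasional coffee is explained as noise rather than ruled out.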
Alongside the more abstract “agent foundations” work they have become known for, MIRI recently announced their “Visible Thoughts Project” to test the hypothesis that “Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.”
Stuart Armstrong’s recently launched AlignedAI are mainly working on concept extrapolation from familiar to novel contexts, something he believes is “necessary and almost sufficient” for AI alignment.
Redwood Research (Buck Shlegeris, et al.) are trying to “handicap” GPT-3 to only produce non-violent completions of text prompts. “The idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like ‘don't be deceitful’, and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner.”
Ought is an independent AI safety research organization led by Andreas Stuhlmüller and Jungwon Byun. They are researching methods for breaking up complex, hard-to-verify tasks into simpler, easier-to-verify tasks, with the aim of allowing us to maintain effective oversight over AIs.
The plan of OpenAI's safety team is to build a minimum viable (MVP) aligned AGI that can help us solve the full alignment problem.
They want to do this with Reinforcement Learning from Human Feedback (RLHF): collect feedback from humans about which outputs are good, and reward the AI based on that feedback. The problem: what if the AI makes moves so sophisticated that humans cannot understand them, and so cannot evaluate them? Jan Leike, the director of the safety team, views this (the informed oversight problem) as the core difficulty of alignment. Their proposed solution is an AI-assisted oversight scheme, with a recursive hierarchy of AIs bottoming out at humans. They are experimenting with this approach by trying to get current-day AIs to do useful supporting work, such as summarizing books and critiquing their own outputs.
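The "reward based on human feedback" step can be sketched as fitting a reward model to pairwise human preferences. The sketch below is an illustration under stated assumptions, not OpenAI's implementation: it assigns one scalar reward per hypothetical output label, uses the Bradley-Terry model for comparisons, and fits it by plain gradient ascent. In practice the reward model would be a neural network over text, and a policy would then be trained against it with RL.

```python
import math
from typing import Dict, List, Tuple


def fit_reward_model(
    comparisons: List[Tuple[str, str]], steps: int = 2000, lr: float = 0.1
) -> Dict[str, float]:
    """Fit one scalar reward per item from (preferred, rejected) pairs.

    Bradley-Terry model: P(a preferred over b) = sigmoid(reward[a] - reward[b]).
    Optimized by gradient ascent on the log-likelihood of the comparisons.
    """
    items = {x for pair in comparisons for x in pair}
    reward = {x: 0.0 for x in items}
    for _ in range(steps):
        grad = {x: 0.0 for x in items}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(reward[winner] - reward[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for x in items:
            reward[x] += lr * grad[x]
    return reward


# Hypothetical human labels (assumed): each tuple is (preferred, rejected).
comparisons = (
    [("helpful_summary", "rambling_summary")] * 3
    + [("rambling_summary", "helpful_summary")]
    + [("rambling_summary", "off_topic")] * 2
    + [("off_topic", "rambling_summary")]
)
reward = fit_reward_model(comparisons)
```

The fitted rewards rank "helpful_summary" above "rambling_summary" above "off_topic". The informed oversight problem shows up here as a data problem: if the human labelers cannot tell which of two outputs is better, the comparisons carry no useful signal.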
OpenAI also published GPT-3, and are continuing to push LLM capabilities, with GPT-4 expected to be released at some point soon.
David Krueger runs a lab at the University of Cambridge. Some things he is working on include:
- Operationalizing inner alignment failures and other speculative alignment failures that haven't actually been observed.
- Understanding neural network generalization.
For work on the first direction, see Goal Misgeneralization, a paper that empirically demonstrated examples of inner alignment failure in deep RL environments. For example, they trained an agent to get closer to cheese in a maze, but the cheese was always in the top right of the maze in the training set. At test time, when presented with cheese elsewhere, the RL agent navigated to the top right instead of to the cheese: it had learned the mesa objective of "go to the top right".
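The cheese example can be illustrated with a toy sketch. This is not the paper's environment or a trained agent: it assumes a wall-free 5x5 grid and two hand-coded policies, one that heads for the cheese and one that heads for the top-right corner. The point is only that the two are indistinguishable on the training distribution and come apart at test time.

```python
from typing import Tuple

Cell = Tuple[int, int]  # (row, col); row 0 is the top of the grid


def step_toward(pos: Cell, target: Cell) -> Cell:
    """Move one cell toward the target (rows first, then columns); no walls in this toy."""
    r, c = pos
    tr, tc = target
    if r != tr:
        return (r + (1 if tr > r else -1), c)
    if c != tc:
        return (r, c + (1 if tc > c else -1))
    return pos


def run_episode(policy_target: Cell, cheese: Cell, start: Cell, max_steps: int = 50) -> bool:
    """Follow a policy that heads for policy_target; return True if it reaches the cheese."""
    pos = start
    for _ in range(max_steps):
        if pos == cheese:
            return True
        pos = step_toward(pos, policy_target)
    return pos == cheese


SIZE = 5
TOP_RIGHT = (0, SIZE - 1)
START = (SIZE - 1, 0)

# Training: the cheese is always in the top right, so both policies succeed.
train_cheese = TOP_RIGHT
intended_train = run_episode(policy_target=train_cheese, cheese=train_cheese, start=START)
proxy_train = run_episode(policy_target=TOP_RIGHT, cheese=train_cheese, start=START)

# Test: the cheese moves; only the policy that tracks the cheese still succeeds.
test_cheese = (3, 3)
intended_test = run_episode(policy_target=test_cheese, cheese=test_cheese, start=START)
proxy_test = run_episode(policy_target=TOP_RIGHT, cheese=test_cheese, start=START)
```

Both policies earn the same training reward, so nothing in the training signal favors the intended one; the difference only shows up once the cheese moves.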
For work on the second direction, see OOD Generalization via Risk Extrapolation, an incremental improvement in robustness over previous methods.
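As we understand it, the variance-based form of Risk Extrapolation penalizes differences in training risk across environments, on top of the total risk. The sketch below shows that objective on hand-picked numbers; the function name, the use of population variance, and the example risks are assumptions made for illustration, not code from the paper.

```python
from typing import List


def vrex_objective(env_risks: List[float], beta: float) -> float:
    """Sum of per-environment risks plus beta times their (population) variance."""
    m = len(env_risks)
    mean = sum(env_risks) / m
    variance = sum((r - mean) ** 2 for r in env_risks) / m
    return sum(env_risks) + beta * variance


# Two hypothetical models with the same average training risk across three environments.
balanced = [0.30, 0.30, 0.30]  # equal risk in every environment
uneven = [0.05, 0.25, 0.60]    # low risk in one environment, high in another

balanced_score = vrex_objective(balanced, beta=10.0)
uneven_score = vrex_objective(uneven, beta=10.0)
```

With `beta=0` the two models score the same; with a positive `beta` the model whose risk varies across environments is penalized, which encodes the bet that equalizing risk across training environments helps on unseen ones.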
We have not read about his motivation for these specific research directions, but they are likely his best guesses at where to start on solving the alignment problem.
There is a Cambrian explosion of approaches to solving alignment. Click through to the follow-up questions to explore the research directions of groups and individuals in the field.