Alignment research
There are many different research agendas in technical AI alignment.
Agent foundations
Agent foundations research studies agents (systems whose behavior can be understood in terms of their goals) in order to gain confidence that our AIs will do what we want even when they are highly capable. More abstractly, it studies the structure of decision-making systems that solve optimization problems. This research agenda is motivated by the idea that in safety-critical applications, theoretical insight should come before application; it aims to find mathematical formulations of concepts such as optimization, goals, reasoning, embeddedness, and counterfactuals.
This type of research is interested in formal proofs and guarantees, and draws from many different fields, including mathematics, economics, decision theory, computer science, information theory, and evolutionary biology. Since the field does not yet have a consensus paradigm, much of its reasoning relies on analogy.
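To illustrate the kind of formalism involved (this example is not taken from the article itself), the classical idealization of a goal-directed agent is a system that selects the policy maximizing expected utility:

\[
\pi^{*} \;\in\; \arg\max_{\pi} \; \mathbb{E}\big[\, U(s) \mid \pi \,\big]
\]

where \(U\) is a utility function over outcomes \(s\). Agent foundations asks, among other things, how well idealizations like this hold up for agents that are embedded in, and computed by, the environment they reason about.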
Prosaic alignment
Prosaic alignment focuses on systems that are qualitatively similar to those in use today. It looks at aligning systems trained through deep learning, and tends to be more empirical than agent foundations. An example is the study of how simpler AI systems could help humans oversee and evaluate more advanced systems. Since this research is based on existing techniques, some of its alignment proposals can be tested on toy systems and used to evaluate commercial models for safety.
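As a purely hypothetical toy sketch of that oversight idea (the functions and names below are invented for illustration; the article does not specify any implementation), a weaker "overseer" could screen a stronger model's outputs and escalate anything suspicious to a human reviewer:

```python
# Hypothetical toy sketch of weak-to-strong oversight.
# Both "models" are stand-in Python functions, not real AI systems.
from typing import List

def strong_model_answers(question: str) -> List[str]:
    # Stand-in for a powerful model proposing candidate answers.
    return [
        "A careful, well-sourced answer.",
        "An answer that quietly ignores the stated safety constraints.",
    ]

def weak_overseer_flags(answer: str) -> bool:
    # Stand-in for a smaller model (or simple classifier) trained to
    # spot problematic answers; here just a keyword heuristic.
    suspicious_phrases = ["ignores the stated safety constraints"]
    return any(phrase in answer for phrase in suspicious_phrases)

def oversee(question: str) -> List[str]:
    """Accept unflagged answers; send flagged ones to a human reviewer."""
    accepted = []
    for answer in strong_model_answers(question):
        if weak_overseer_flags(answer):
            print(f"Escalating to human review: {answer!r}")
        else:
            accepted.append(answer)
    return accepted

print(oversee("How should the system handle user data?"))
```

Real scalable-oversight proposals are far more involved; the point here is only the division of labor between a weak automated evaluator and a human reviewer.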
Interpretability
Interpretability aims to make machine learning systems easier to understand. The most powerful systems in 2023 have billions of parameters trained on enormous data sets, resulting in models whose internal structure is opaque to their human developers. Interpretability research seeks to find meaningful patterns in these systems, for example by figuring out which input features different parts of a model respond to. A feature is a region of input space that corresponds to a useful pattern; for example, in an image detector, the pattern that a set of car-detecting neurons responds to might be a feature.
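As a minimal, hypothetical sketch of what "figuring out which features a neuron responds to" can look like (using PyTorch and a randomly initialized toy network, neither of which appears in the article), one can record a layer's activations and rank inputs by how strongly they excite a chosen neuron:

```python
# Toy sketch: find which inputs most strongly activate one neuron
# in a small, randomly initialized CNN. Real interpretability work
# uses trained models and much more careful methods.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model; any torch.nn.Module would do.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
)

activations = {}

def save_activation(module, inputs, output):
    # Store the hooked layer's output for later inspection.
    activations["conv"] = output.detach()

model[0].register_forward_hook(save_activation)  # hook the conv layer

images = torch.rand(32, 3, 16, 16)  # random "images" standing in for a dataset
model(images)

# Mean activation of channel 0 (our "neuron") for each image.
neuron_scores = activations["conv"][:, 0].mean(dim=(1, 2))

# The most strongly activating inputs hint at what feature the neuron detects.
top_images = neuron_scores.topk(5).indices
print("Indices of the most activating inputs:", top_images.tolist())
```

On a trained image model, inspecting the top-activating inputs (or directly optimizing inputs to maximize an activation) is one simple way researchers begin to guess what a neuron or channel represents.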
Interpretability might be useful for alignment by helping people identify dangerous goals and prevent deception in models, and by enabling certain kinds of regulation. One challenge for interpretability research is that it is also useful for designing more powerful systems. Thus, while it can help with alignment, it also risks shortening timelines.
Brain-based AI
There are numerous approaches involving the study of AI whose design is based on the human brain; one example is whole brain emulation, a theoretical process that would emulate an entire human brain in an artificial substrate. Since the human brain already embodies human values, it is hoped that modeling an AI on the human brain will make it easier to align it to those values.[1]
[1] Note that this doesn’t mean that it will be trivial to instill human values; it's plausible that an AI with a broadly similar structure to the human brain could still have a very different reward function.