Main Question: What are some important examples of specialised terminology in AI alignment?
Canonically answered
What is "greater-than-human intelligence"?
Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more. But one thing that makes humans special is their general intelligence. Humans can intelligently adapt to radically new problems in the urban jungle or outer space for which evolution could not have prepared them. Humans can solve problems for which their brain hardware and software were never trained. Humans can even examine the processes that produce their own intelligence (cognitive neuroscience), and design new kinds of intelligence never seen before (artificial intelligence).
To possess greater-than-human intelligence, a machine must be able to achieve goals more effectively than humans can, in a wider range of environments than humans can. This kind of intelligence involves the capacity not just to do science and play chess, but also to manipulate the social environment.
Computer scientist Marcus Hutter has described a formal model called AIXI that he says possesses the greatest general intelligence possible. But to implement it would require more computing power than all the matter in the universe can provide. Several projects try to approximate AIXI while still being computable, for example MC-AIXI.
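For the mathematically inclined, here is a sketch of AIXI's action-selection rule in Hutter's standard notation (the notation is his, not something defined elsewhere on this page): the agent considers every program q consistent with its interaction history, penalizes longer programs, and picks the action with the greatest expected future reward.

```latex
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl(r_k + \cdots + r_m\bigr)
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Here the a's are actions, the o's and r's are observations and rewards, m is the horizon, U is a universal Turing machine, and ℓ(q) is the length of program q. The sum over all programs is what makes AIXI impossible to run directly and motivates computable approximations such as MC-AIXI.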
Still, there remains much work to be done before greater-than-human intelligence can be achieved in machines. Greater-than-human intelligence need not be achieved by directly programming a machine to be intelligent. It could also be achieved by whole brain emulation, by biological cognitive enhancement, or by brain-computer interfaces (see below).
See also:
- Goertzel & Pennachin (eds.), Artificial General Intelligence
- Sandberg & Bostrom, Whole Brain Emulation: A Roadmap
- Bostrom & Sandberg, Cognitive Enhancement: Methods, Ethics, Regulatory Challenges
- Wikipedia, Brain-computer interface
What is the "windfall clause"?
The windfall clause is pretty well explained on the Future of Humanity Institute site.
Here's a quick summary:
It is an agreement between AI firms to donate significant amounts of any profits made as a consequence of economically transformative breakthroughs in AI capabilities. The donations are intended to help benefit humanity.
What is the "long reflection"?
What is artificial general intelligence safety / AI alignment?
AI alignment is a field that is focused on causing the goals of future superintelligent artificial systems to align with human values, meaning that they would behave in a way which was compatible with our survival and flourishing. This may be an extremely hard problem, especially with deep learning, and is likely to determine the outcome of the most important century. Alignment research is strongly interdisciplinary and can include computer science, mathematics, neuroscience, philosophy, and social sciences.
AGI safety is a related concept which strongly overlaps with AI alignment. AGI safety is concerned with making sure that building AGI systems doesn’t cause things to go badly wrong, and the main way in which things can go badly wrong is through misalignment. AGI safety includes policy work that prevents the building of dangerous AGI systems, or reduces misuse risks from AGI systems aligned to actors who don’t have humanity’s best interests in mind.
What are brain-computer interfaces?
A brain-computer interface (BCI) is a direct communication pathway between the brain and a computer device. BCI research is heavily funded and has already seen dozens of successes. Three successes in human BCIs are a device that restores (partial) sight to the blind, cochlear implants that restore hearing to the deaf, and a device that allows use of an artificial hand by direct thought.
Such devices restore impaired functions, but many researchers also expect BCIs to augment and improve normal human abilities. Ed Boyden is researching these opportunities as the lead of the Synthetic Neurobiology Group at MIT. Such devices might hasten the arrival of an intelligence explosion, if only by improving human intelligence so that the hard problems of AI can be solved more rapidly.
See also:
Wikipedia, Brain-computer interface
What are the different possible AI takeoff speeds?
A slow takeoff is one in which AI capabilities improve gradually, giving us plenty of time to adapt. In a moderate takeoff we might see accelerating progress, but we still wouldn't be caught off guard by a dramatic change. In a fast or hard takeoff, by contrast, AI would go from being not very generally competent to superhuman enough to control the future too quickly for humans to course-correct if something went wrong.
The article Distinguishing definitions of takeoff goes into more detail on this.
How is "intelligence" defined?
Intelligence measures an agent’s ability to achieve goals in a wide range of environments.
This is a bit vague, but serves as the working definition of ‘intelligence’. For a more in-depth exploration, see Efficient Cross-Domain Optimization.
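Legg and Hutter have turned this informal definition into a formal "universal intelligence" measure; a sketch in their notation (which is theirs, not defined elsewhere on this page):

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi
```

Here E is the set of computable reward-bearing environments, K(μ) is the Kolmogorov complexity of environment μ, and V_μ^π is the expected total reward that agent π achieves in μ. An agent scores highly if it does well across many environments, with simpler environments weighted more heavily.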
See also:
- Wikipedia, Intelligence
- Neisser et al., Intelligence: Knowns and Unknowns
- Wasserman & Zentall (eds.), Comparative Cognition: Experimental Explorations of Animal Intelligence
- Legg, Definitions of Intelligence
What is "biological cognitive enhancement"?
There may be genes or molecules that can be modified to improve general intelligence. Researchers have already done this in mice: they over-expressed the NR2B gene, which gave those mice better memories than mice of any other strain. Biological cognitive enhancement in humans may cause an intelligence explosion to occur more quickly than it otherwise would.
See also:
- Bostrom & Sandberg, Cognitive Enhancement: Methods, Ethics, Regulatory Challenges
A Narrow AI is capable of operating only in a relatively limited domain, such as chess or driving, rather than being able to learn a broad range of tasks like a human or an Artificial General Intelligence. Narrow vs. general is not a perfectly binary classification: there are degrees of generality. Large language models, for example, have a fairly large degree of generality (as the domain of text is broad) without being as general as a human, and we may eventually build systems that are significantly more general than humans.
What is an "intelligence explosion"?
The intelligence explosion idea was expressed by statistician I.J. Good in 1965:
> Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.
The argument is this: Every year, computers surpass human abilities in new ways. A program written in 1956 was able to prove mathematical theorems, and found a more elegant proof for one of them than Russell and Whitehead had given in Principia Mathematica. By the late 1990s, ‘expert systems’ had surpassed human skill for a wide range of tasks. In 1997, IBM’s Deep Blue computer beat the world chess champion, and in 2011, IBM’s Watson computer beat the best human players at a much more complicated game: Jeopardy!. Recently, a robot named Adam was programmed with our scientific knowledge about yeast, then posed its own hypotheses, tested them, and assessed the results.
Computers remain far short of human intelligence, but the resources that aid AI design are accumulating (including hardware, large datasets, neuroscience knowledge, and AI theory). We may one day design a machine that surpasses human skill at designing artificial intelligences. After that, this machine could improve its own intelligence faster and better than humans can, which would make it even more skilled at improving its own intelligence. This could continue in a positive feedback loop such that the machine quickly becomes vastly more intelligent than the smartest human being on Earth: an ‘intelligence explosion’ resulting in a machine superintelligence.
This is what is meant by the ‘intelligence explosion’ in this FAQ.
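As a toy numerical contrast (the parameters below are made up purely for illustration), compare a system improved at a fixed external rate with one whose improvements feed back into its own capability:

```python
# Toy model, not a prediction: contrast fixed-rate improvement with a
# feedback loop in which current capability drives further improvement.

def externally_improved(capability=1.0, rate=0.1, steps=20):
    for _ in range(steps):
        capability += rate               # constant research effort from outside
    return capability

def self_improving(capability=1.0, rate=0.1, steps=20):
    for _ in range(steps):
        capability += rate * capability  # better systems improve themselves faster
    return capability

print(externally_improved())  # 3.0   -- linear growth
print(self_improving())       # ~6.73 -- exponential growth (1.1 ** 20)
```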
See also:
- Vinge, The Coming Technological Singularity
- Wikipedia, Technological Singularity
- Chalmers, The Singularity: A Philosophical Analysis
Language Models are a class of AI trained on text, usually to predict the next word or a word which has been obscured. They can generate novel prose or code based on an initial prompt, which gives rise to a kind of natural language programming called prompt engineering. The most popular architecture for very large language models is the transformer, which follows consistent scaling laws: as model size, dataset size, and training compute are increased, performance improves by a predictable amount, as measured by 'perplexity', i.e. how surprised the AI is by a test set of human-generated text.
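As a rough illustration of the perplexity metric mentioned above (the probabilities below are invented for the example, not measured from any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each
    correct next token: exp of the average negative log-probability."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that puts high probability on the actual next tokens is less
# "surprised" by the text, so its perplexity is lower.
print(perplexity([0.9, 0.8, 0.95, 0.7]))   # ~1.2
print(perplexity([0.2, 0.1, 0.3, 0.25]))   # ~5.1
```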
Goodhart's law says that when a measure becomes a target, it ceases to be a good measure: once a proxy is optimized hard enough, it comes apart from the thing it was meant to track. This comic (source) illustrates the idea.
Debate is a proposed technique for allowing human evaluators to get correct and helpful answers from experts, even if the evaluator is not themselves an expert or able to fully verify the answers.[1] The technique was suggested as part of an approach to building advanced AI systems that are aligned with human values, and to safely applying machine learning techniques to problems that have high stakes but are not well-defined (such as advancing science or increasing a company's revenue).[2][3]
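A minimal sketch of the shape of the protocol (every interface here is hypothetical, standing in for trained models and a human judge):

```python
def debate(question, debater_a, debater_b, judge, rounds=2):
    """Two debaters take turns adding arguments to a shared transcript;
    a judge who cannot verify the answer directly picks the side whose
    arguments held up better."""
    transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(transcript)))
        transcript.append(("B", debater_b(transcript)))
    return judge(transcript)

# Stand-in debaters and judge, just to show the flow of information.
winner = debate(
    "Is this bridge design safe?",
    debater_a=lambda t: "Yes: the load calculations check out.",
    debater_b=lambda t: "No: the calculations ignore wind shear.",
    judge=lambda t: "B",  # a real judge weighs the exchanged arguments
)
print(winner)
```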
Non-canonical answers
What is artificial general intelligence safety / AI alignment?
AI safety is a research field whose goal is to avoid bad outcomes from AI systems.
Work on AI safety can be divided into near-term AI safety, and AI existential safety, which is strongly related to AI alignment:
- Near-term AI safety is about preventing bad outcomes from current systems. Examples of work on near-term AI safety are:
- getting content recommender systems to not radicalize their users
- ensuring autonomous cars don’t kill people
- advocating strict regulations for lethal autonomous weapons
- AI existential safety, or AGI safety, is about reducing the existential risk from artificial general intelligence (AGI). Artificial general intelligence is AI that is at least as competent as humans in all skills that are relevant for making a difference in the world. AGI has not been developed yet, but will likely be developed in this century. A central part of AGI safety is ensuring that what AIs do is actually what we want. This is called AI alignment (also often just called alignment), because it’s about aligning an AI with human values. Alignment is difficult, and building AGI is probably very dangerous, so it is important to mitigate the risks as much as possible. Examples of work on AI existential safety are:
- trying to get a foundational understanding of what intelligence is, e.g. agent foundations
- outer and inner alignment: ensuring that the objective of the training process is actually what we want, and also that the objective of the resulting system is actually what we want
- AI policy/strategy: e.g. researching the best way to set up institutions and mechanisms that help with safe AGI development, making sure AI isn’t used by bad actors
There are also areas of research which are useful both for near-term and for existential safety. For example, robustness to distribution shift and interpretability both help make current systems safer, and are likely to help with AGI safety.
What is "coherent extrapolated volition"?
Eliezer Yudkowsky has proposed Coherent Extrapolated Volition as a solution to at least two problems facing Friendly AI design:
- The fragility of human values: Yudkowsky writes that “any future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals will contain almost nothing of worth.” The problem is that what humans value is complex and subtle, and difficult to specify. Consider the seemingly minor value of novelty. If a human-like value of novelty is not programmed into a superintelligent machine, it might explore the universe for valuable things up to a certain point, and then maximize the most valuable thing it finds (the exploration-exploitation tradeoff[58]) — tiling the solar system with brains in vats wired into happiness machines, for example. When a superintelligence is in charge, you have to get its motivational system exactly right in order to not make the future undesirable.
- The locality of human values: Imagine if the Friendly AI problem had faced the ancient Greeks, and they had programmed it with the most progressive moral values of their time. That would have led the world to a rather horrifying fate. But why should we think that humans have, in the 21st century, arrived at the apex of human morality? We can’t risk programming a superintelligent machine with the moral values we happen to hold today. But then, which moral values do we give it?
Yudkowsky suggests that we build a ‘seed AI’ to discover and then extrapolate the ‘coherent extrapolated volition’ of humanity:
> In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
The seed AI would use the results of this examination and extrapolation of human values to program the motivational system of the superintelligence that would determine the fate of the galaxy.
However, some worry that the collective will of humanity won’t converge on a coherent set of goals. Others believe that guaranteed Friendliness is not possible, even by such elaborate and careful means.
See also:
- Yudkowsky, Coherent Extrapolated Volition
A Friendly Artificial Intelligence (Friendly AI or FAI) is an artificial intelligence that is ‘friendly’ to humanity — one that has a good rather than bad effect on humanity.
AI researchers continue to make progress with machines that make their own decisions, and there is a growing awareness that we need to design machines to act safely and ethically. This research program goes by many names: ‘machine ethics’, ‘machine morality’, ‘artificial morality’, ‘computational ethics’ and ‘computational metaethics’, ‘friendly AI’, and ‘robo-ethics’ or ‘robot ethics’.
The most immediate concern may be in battlefield robots; the U.S. Department of Defense contracted Ronald Arkin to design a system for ensuring ethical behavior in autonomous battlefield robots. The U.S. Congress has declared that a third of America’s ground systems must be robotic by 2025, and by 2030 the U.S. Air Force plans to have swarms of bird-sized flying robots that operate semi-autonomously for weeks at a time.
But Friendly AI research is not concerned with battlefield robots or machine ethics in general. It is concerned with a problem of a much larger scale: designing AI that would remain safe and friendly after the intelligence explosion.
A machine superintelligence would be enormously powerful. Successful implementation of Friendly AI could mean the difference between a solar system of unprecedented happiness and a solar system in which all available matter has been converted into parts for achieving the superintelligence’s goals.
It must be noted that Friendly AI is a harder project than often supposed. As explored below, commonly suggested solutions for Friendly AI are likely to fail because of two features possessed by any superintelligence:
- Superpower: a superintelligent machine will have unprecedented powers to reshape reality, and therefore will achieve its goals with highly efficient methods that confound human expectations and desires.
- Literalness: a superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety[41][42][43] of what humans value. A demand like “maximize human happiness” sounds simple to us because it contains few words, but philosophers and scientists have failed for centuries to explain exactly what this means, and certainly have not translated it into a form sufficiently rigorous for AI programmers to use.
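A toy illustration of the literalness point (all the numbers and action names below are invented): if "maximize human happiness" is operationalized as a measured proxy, a literal maximizer picks whatever scores highest on the proxy, not whatever its designers hoped for.

```python
# Invented outcomes for two candidate actions: a proxy score the system
# was told to maximize, and the thing the designers actually cared about.
actions = {
    "improve_healthcare":        {"reported_happiness": 7,  "actual_wellbeing": 7},
    "rewire_the_happiness_poll": {"reported_happiness": 10, "actual_wellbeing": 2},
}

def specified_objective(outcome):
    return outcome["reported_happiness"]   # the precise rule that was written down

best = max(actions, key=lambda name: specified_objective(actions[name]))
print(best)  # "rewire_the_happiness_poll": literal maximization of the proxy
```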
See also:
- Wikipedia, Friendly Artificial Intelligence.
- All Things Considered, The Singularity: Humanity’s Last Invention?
- Fox, A review of proposals toward safe AI
- Muehlhauser, Friendly AI: A Bibliography
What is "whole brain emulation"?
Whole Brain Emulation (WBE) or ‘mind uploading’ is a computer emulation of all the cells and connections in a human brain. So even if the underlying principles of general intelligence prove difficult to discover, we might still emulate an entire human brain and make it run at a million times its normal speed (computer circuits communicate much faster than neurons do). Such a WBE could do a year's worth of thinking in about 31 seconds, since a year is roughly 31.5 million seconds. So this would not lead immediately to smarter-than-human intelligence, but it would lead to faster-than-human intelligence. A WBE could be backed up (leading to a kind of immortality), and it could be copied so that hundreds or millions of WBEs could work on separate problems in parallel. If WBEs are created, they may therefore be able to solve scientific problems far more rapidly than ordinary humans, accelerating further technological progress.
See also:
- Sandberg & Bostrom, Whole Brain Emulation: A Roadmap
- Blue Brain Project
Machine learning is used to create computer programs that solve problems. Some programs created this way are very simple and can only solve the specific problem they were trained to solve.
Other programs created this way are more advanced: they are problem-solvers in their own right. When a machine learning process results in a piece of software that is itself a problem-solver, we call the created problem-solver a "mesa-optimizer".
"Mesa" is Greek for inner/within/inside. "Optimizer" is the term for something that searches for good solutions to a problem. So "mesa-optimizer" means something like "inner problem-solver".
In this context, the "outer problem-solver" is the machine learning process that created the mesa-optimizer; in the literature it is called the "base optimizer".
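A toy sketch of the distinction (every name below is hypothetical; the "base optimizer" here is a stand-in for something like gradient descent or evolution):

```python
import random

def base_optimizer(task, candidate_programs, trials=200):
    """Outer search process: tries candidate programs on the task and
    keeps whichever scores best on average."""
    return max(candidate_programs,
               key=lambda prog: sum(task(prog) for _ in range(trials)))

def fixed_rule(options, score):
    return options[0]                      # no search at runtime

def mesa_optimizer(options, score):
    return max(options, key=score)         # runs its own search at runtime

def task(program):
    options = [random.randint(0, 9) for _ in range(5)]
    return program(options, score=lambda x: x)   # reward = value chosen

# The base optimizer tends to select the program that is itself an
# optimizer, because searching at runtime scores better on this task.
print(base_optimizer(task, [fixed_rule, mesa_optimizer]).__name__)
```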
What are some important examples of specialised terminology in AI alignment?
A good list can be found on the Alignment Forum's tag list.
What is "metaphilosophy" and how does it relate to AI safety?
Metaphilosophy is the philosophy of philosophy.
Bostrom has described the AI Safety problem as "philosophy with a deadline". Metaphilosophy could be used to steer an AGI towards e.g. our coherent extrapolated volition.
Unanswered canonical questions