From Stampy's Wiki
Main Question: What are some important terms in AI alignment?


Posts that attempt to define or clarify the meaning of a concept, a word, a phrase, or something else.

Canonically answered

What is greater-than-human intelligence?

Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more. But one thing that makes humans special is their general intelligence. Humans can intelligently adapt to radically new problems in the urban jungle or outer space for which evolution could not have prepared them. Humans can solve problems for which their brain hardware and software was never trained. Humans can even examine the processes that produce their own intelligence (cognitive neuroscience), and design new kinds of intelligence never seen before (artificial intelligence).

To possess greater-than-human intelligence, a machine must be able to achieve goals more effectively than humans can, in a wider range of environments than humans can. This kind of intelligence involves the capacity not just to do science and play chess, but also to manipulate the social environment.

Computer scientist Marcus Hutter has described a formal model called AIXI that he says possesses the greatest general intelligence possible. But to implement it would require more computing power than all the matter in the universe can provide. Several projects try to approximate AIXI while still being computable, for example MC-AIXI.

Still, there remains much work to be done before greater-than-human intelligence can be achieved in machines. Greater-than-human intelligence need not be achieved by directly programming a machine to be intelligent. It could also be achieved by whole brain emulation, by biological cognitive enhancement, or by brain-computer interfaces (see below).

What is AI alignment?

AI alignment is the field trying to make sure that when we build superintelligent artificial systems, they are aligned with human values, so that they do things compatible with our survival and flourishing. This may be one of the hardest and most important problems we will ever face, as whether we succeed might mean the difference between human extinction and flourishing.

What is the windfall clause?

The windfall clause is pretty well explained on the Future of Humanity Institute site.

Here's a quick summary:
It is an agreement between AI firms to donate significant amounts of any profits made as a consequence of economically transformative breakthroughs in AI capabilities. The donations are intended to help benefit humanity.

What is an agent?

A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility. They are also referred to as goal-seeking. The concept of a rational agent is used in economics, game theory, decision theory, and artificial intelligence.
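
The loop described above (hold beliefs, evaluate consequences, pick the utility-maximizing action) can be sketched in a few lines of Python. This is only a toy illustration: the umbrella scenario, the probability of rain, the utility numbers, and the function names are all assumptions made up for the example, not part of any standard library or of the sources this page draws on.

```python
from typing import Callable, Dict, Iterable, Tuple

Outcome = Tuple[str, str]  # (action taken, weather that occurred)

def expected_utility(action: str,
                     beliefs: Callable[[str], Dict[Outcome, float]],
                     utility: Callable[[Outcome], float]) -> float:
    """Probability-weighted utility of the outcomes the agent expects."""
    return sum(p * utility(outcome) for outcome, p in beliefs(action).items())

def choose_action(actions: Iterable[str],
                  beliefs: Callable[[str], Dict[Outcome, float]],
                  utility: Callable[[Outcome], float]) -> str:
    """Return the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, beliefs, utility))

# Hypothetical beliefs and preferences, chosen only for illustration.
P_RAIN = 0.5
UTILITY_TABLE = {
    ("take_umbrella", "rain"): 5.0,
    ("take_umbrella", "dry"): 8.0,
    ("leave_umbrella", "rain"): 0.0,
    ("leave_umbrella", "dry"): 10.0,
}

def beliefs(action: str) -> Dict[Outcome, float]:
    # The weather does not depend on the action in this toy model.
    return {(action, "rain"): P_RAIN, (action, "dry"): 1.0 - P_RAIN}

def utility(outcome: Outcome) -> float:
    return UTILITY_TABLE[outcome]

best = choose_action(["take_umbrella", "leave_umbrella"], beliefs, utility)
print(best)  # take_umbrella (expected utility 6.5 versus 5.0)
```

Real agents rarely have an explicit table like this; the sketch only shows how beliefs and a utility function jointly determine the chosen action.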

More generally, an agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.[1]

There has been much discussion as to whether certain AGI designs can be made into mere tools or whether they will necessarily be agents which will attempt to actively carry out their goals. Any minds that actively engage in goal-directed behavior are potentially dangerous, due to considerations such as basic AI drives possibly causing behavior which is in conflict with humanity's values.

In Dreams of Friendliness and in Reply to Holden on Tool AI, Eliezer Yudkowsky argues that, since all intelligences select correct beliefs from the much larger space of incorrect beliefs, they are necessarily agents.

  1. Russell, S. & Norvig, P. (2003) Artificial Intelligence: A Modern Approach. Second Edition. Page 32.


What is the long reflection?

The long reflection is a hypothesized period of time during which humanity works out how best to realize its long-term potential.

Some effective altruists, including Toby Ord and William MacAskill, have argued that, if humanity succeeds in eliminating existential risk or reducing it to acceptable levels, it should not immediately embark on an ambitious and potentially irreversible project of arranging the universe's resources in accordance with its values, but ought instead to spend considerable time figuring out what is in fact of value. Estimates of how long vary: "centuries (or more)";[1] "perhaps tens of thousands of years";[2] "thousands or millions of years";[3] "[p]erhaps... a million years".[4] The long reflection may thus be seen as an intermediate stage in a rational long-term human developmental trajectory, following an initial stage of existential security when existential risk is drastically reduced and followed by a final stage when humanity's potential is fully realized.[5]


The idea of a long reflection has been criticized on the grounds that virtually eliminating all existential risk will almost certainly require taking a variety of large-scale, irreversible decisions—related to space colonization, global governance, cognitive enhancement, and so on—which are precisely the decisions meant to be discussed during the long reflection.[6][7] Since there are pervasive and inescapable tradeoffs between reducing existential risk and retaining moral option value, it may be argued that it does not make sense to frame humanity's long-term strategic picture as one consisting of two distinct stages, with one taking precedence over the other.

Further reading

Aird, Michael (2020) Collection of sources that are highly relevant to the idea of the Long Reflection, Effective Altruism Forum, June 20.
Many additional resources on this topic.

Wiblin, Robert & Keiran Harris (2018) Our descendants will probably see us as moral monsters. What should we do about that?, 80,000 Hours, January 19.
Interview with William MacAskill about the long reflection and other topics.

Related entries

dystopia | existential risk | existential security | institutions for future generations | long-term future | longtermism | moral uncertainty | normative ethics | value lock-in

  1. Ord, Toby (2020) The Precipice: Existential Risk and the Future of Humanity, London: Bloomsbury Publishing.

  2. Greaves, Hilary et al. (2019) A research agenda for the Global Priorities Institute, Oxford.

  3. Dai, Wei (2019) The argument from philosophical difficulty, LessWrong, February 9.

  4. William MacAskill, in Perry, Lucas (2018) AI alignment podcast: moral uncertainty and the path to AI alignment with William MacAskill, AI Alignment podcast, September 17.

  5. Ord, Toby (2020) The Precipice: Existential Risk and the Future of Humanity, London: Bloomsbury Publishing.

  6. Stocker, Felix (2020) Reflecting on the long reflection, Felix Stocker’s Blog, August 14.

  7. Hanson, Robin (2021) ‘Long reflection’ is crazy bad idea, Overcoming Bias, October 20.

How is ‘intelligence’ defined?

Artificial intelligence researcher Shane Legg defines intelligence like this:

Intelligence measures an agent’s ability to achieve goals in a wide range of environments.

This is a bit vague, but serves as the working definition of ‘intelligence’. For a more in-depth exploration, see Efficient Cross-Domain Optimization.

After reviewing the extensive literature on the subject, Legg and Hutter[1] summarize the many possible valuable definitions in the informal statement “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.” They then show that this definition can be mathematically formalized given reasonable mathematical definitions of its terms. They use Solomonoff induction, a formalization of Occam's razor, to construct a universal artificial intelligence with an embedded utility function which assigns less utility to those actions based on theories with higher complexity. They argue this final formalization is a valid, meaningful, informative, general, unbiased, fundamental, objective, universal and practical definition of intelligence.
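
For readers who want to see the shape of that formalization, the measure is usually written roughly as follows. The notation is taken from Legg and Hutter's paper as we understand it and is not defined anywhere on this page, so treat the symbols as assumptions: \pi is the agent, E is the set of computable environments with bounded reward, K(\mu) is the Kolmogorov complexity of environment \mu, and V_{\mu}^{\pi} is the expected total reward the agent obtains in \mu.

```latex
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

The weight 2^{-K(\mu)} is where Occam's razor enters: simpler environments count for more, so an agent scores highly only if it performs well across many environments, with extra credit for the simple ones.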

We can relate Legg and Hutter's definition to the concept of optimization. According to Eliezer Yudkowsky, intelligence is efficient cross-domain optimization: it measures an agent's capacity for efficient cross-domain optimization of the world according to the agent’s preferences.[2] Optimization power reflects not only the capacity to achieve the desired goal but is also inversely proportional to the amount of resources used. It’s the ability to steer the future so that it hits a small target of desired outcomes in the large space of all possible outcomes, using as few resources as possible. For example, when Deep Blue defeated Kasparov, it was able to hit that small target: the right sequence of moves, given Kasparov’s moves, out of the very large set of all possible sequences. In that domain, it was a better optimizer than Kasparov. However, Kasparov would have defeated Deep Blue in almost any other relevant domain, and hence he is considered more intelligent.

One could cast this definition in possible-worlds vocabulary. Intelligence is:

  1. the ability to precisely realize one of the members of a small set of possible future worlds that are preferred over the vast set of all other possible worlds; while
  2. using fewer resources than the alternative paths for getting there; and
  3. doing so in as diverse a range of domains as possible.

The more worlds there are with a higher preference than the one realized by the agent, the less intelligent the agent is. The more worlds there are with a lower preference than the one realized, the more intelligent it is. (Or: the smaller the set of worlds at least as preferable as the one realized, the more intelligent the agent is.) The fewer paths there are that realize the desired world with fewer resources than the agent spent, the more intelligent it is. And finally, the more domains in which the agent can optimize efficiently, the more intelligent it is. Restated, the intelligence of an agent is directly proportional to:

  • (a) the number of worlds with lower preference than the one realized,
  • (b) how much smaller the set of paths more efficient than the one taken by the agent is, and
  • (c) how much wider the range of domains is in which the agent can effectively realize its preferences;

and it is, accordingly, inversely proportional to:

  • (d) the number of worlds with higher preference than the one realized,
  • (e) how much bigger the set of paths more efficient than the one taken by the agent is, and
  • (f) how much narrower the range of domains is in which the agent can efficiently realize its preferences.

This definition avoids several problems common to many other definitions; in particular, it avoids anthropomorphizing intelligence.
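
One simple way to put a number on the first ingredient, the size of the set of worlds at least as preferable as the one realized, is to count bits: the smaller that set is relative to all possible worlds, the more bits of optimization were exerted. The sketch below assumes a finite set of outcomes that are all weighted equally, which is a strong simplification, and the function name is hypothetical. It deliberately ignores the resource and cross-domain ingredients.

```python
import math
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def optimization_power_bits(outcomes: Iterable[T],
                            preference: Callable[[T], float],
                            achieved: T) -> float:
    """Bits of optimization: log2(total outcomes / outcomes at least as good).

    Assumes a finite outcome set with uniform weight, and that `achieved`
    is itself one of the outcomes (so the count below is never zero).
    """
    outcomes = list(outcomes)
    target = preference(achieved)
    at_least_as_good = sum(1 for o in outcomes if preference(o) >= target)
    return math.log2(len(outcomes) / at_least_as_good)

# Toy world: 1024 possible outcomes, ranked by their own number.
worlds = range(1024)
rank = lambda w: float(w)

print(optimization_power_bits(worlds, rank, 1023))  # 10.0 (best of 1024)
print(optimization_power_bits(worlds, rank, 512))   # 1.0  (top half)
print(optimization_power_bits(worlds, rank, 0))     # 0.0  (no better than chance)
```

Hitting the single best world out of 1024 scores 10 bits, while landing anywhere in the top half scores only 1 bit, which matches the intuition that a smaller target means more optimization.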

What are brain-computer interfaces?

A brain-computer interface (BCI) is a direct communication pathway between the brain and a computer device. BCI research is heavily funded, and has already met dozens of successes. Three successes in human BCIs are a device that restores (partial) sight to the blind, cochlear implants that restore hearing to the deaf, and a device that allows use of an artificial hand by direct thought.

Such devices restore impaired functions, but many researchers expect to also augment and improve normal human abilities with BCIs. Ed Boyden is researching these opportunities as the lead of the Synthetic Neurobiology Group at MIT. Such devices might hasten the arrival of an intelligence explosion, if only by improving human intelligence so that the hard problems of AI can be solved more rapidly.

See also:

Wikipedia, Brain-computer interface

A slow takeoff is one where AI capabilities improve gradually, giving us plenty of time to adapt. In a moderate takeoff we might see accelerating progress, but we still won’t be caught off guard by a dramatic change. In a fast or hard takeoff, by contrast, AI would go from being not very generally competent to sufficiently superhuman to control the future, too quickly for humans to course-correct if something goes wrong.

The article Distinguishing definitions of takeoff goes into more detail on this.

What is biological cognitive enhancement?

There may be genes or molecules that can be modified to improve general intelligence. Researchers have already done this in mice: they over-expressed the NR2B gene, which improved those mice’s memory beyond that of any other mice of any mouse species. Biological cognitive enhancement in humans may cause an intelligence explosion to occur more quickly than it otherwise would.

What is narrow AI?

A narrow AI is capable of operating only in a relatively limited domain, such as chess or driving, rather than being capable of learning a broad range of tasks like a human or an Artificial General Intelligence. Narrow versus general is not a perfectly binary classification: there are degrees of generality. For example, large language models have a fairly large degree of generality (as the domain of text is large) without being as general as a human, and we may eventually build systems that are significantly more general than humans.

What is the intelligence explosion?

The intelligence explosion idea was expressed by statistician I.J. Good in 1965:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

The argument is this: Every year, computers surpass human abilities in new ways. A program written in 1956 was able to prove mathematical theorems, and found a more elegant proof for one of them than Russell and Whitehead had given in Principia Mathematica. By the late 1990s, ‘expert systems’ had surpassed human skill for a wide range of tasks. In 1997, IBM’s Deep Blue computer beat the world chess champion, and in 2011, IBM’s Watson computer beat the best human players at a much more complicated game: Jeopardy!. Recently, a robot named Adam was programmed with our scientific knowledge about yeast, then posed its own hypotheses, tested them, and assessed the results.

Computers remain far short of human intelligence, but the resources that aid AI design are accumulating (including hardware, large datasets, neuroscience knowledge, and AI theory). We may one day design a machine that surpasses human skill at designing artificial intelligences. After that, this machine could improve its own intelligence faster and better than humans can, which would make it even more skilled at improving its own intelligence. This could continue in a positive feedback loop such that the machine quickly becomes vastly more intelligent than the smartest human being on Earth: an ‘intelligence explosion’ resulting in a machine superintelligence.

This is what is meant by the ‘intelligence explosion’ in this FAQ.

Non-canonical answers

What is Friendly AI?

A Friendly Artificial Intelligence (Friendly AI or FAI) is an artificial intelligence that is ‘friendly’ to humanity — one that has a good rather than bad effect on humanity.

AI researchers continue to make progress with machines that make their own decisions, and there is a growing awareness that we need to design machines to act safely and ethically. This research program goes by many names: ‘machine ethics’, ‘machine morality’, ‘artificial morality’, ‘computational ethics’ and ‘computational metaethics’, ‘friendly AI’, and ‘robo-ethics’ or ‘robot ethics’.

The most immediate concern may be in battlefield robots; the U.S. Department of Defense contracted Ronald Arkin to design a system for ensuring ethical behavior in autonomous battlefield robots. The U.S. Congress has declared that a third of America’s ground systems must be robotic by 2025, and by 2030 the U.S. Air Force plans to have swarms of bird-sized flying robots that operate semi-autonomously for weeks at a time.

But Friendly AI research is not concerned with battlefield robots or machine ethics in general. It is concerned with a problem of a much larger scale: designing AI that would remain safe and friendly after the intelligence explosion.

A machine superintelligence would be enormously powerful. Successful implementation of Friendly AI could mean the difference between a solar system of unprecedented happiness and a solar system in which all available matter has been converted into parts for achieving the superintelligence’s goals.

It must be noted that Friendly AI is a harder project than often supposed. As explored below, commonly suggested solutions for Friendly AI are likely to fail because of two features possessed by any superintelligence:

  1. Superpower: a superintelligent machine will have unprecedented powers to reshape reality, and therefore will achieve its goals with highly efficient methods that confound human expectations and desires.
  2. Literalness: a superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety[41][42][43] of what humans value. A demand like “maximize human happiness” sounds simple to us because it contains few words, but philosophers and scientists have failed for centuries to explain exactly what this means, and certainly have not translated it into a form sufficiently rigorous for AI programmers to use.

What is Coherent Extrapolated Volition?

Eliezer Yudkowsky has proposed Coherent Extrapolated Volition as a solution to at least two problems facing Friendly AI design:

  1. The fragility of human values: Yudkowsky writes that “any future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals will contain almost nothing of worth.” The problem is that what humans value is complex and subtle, and difficult to specify. Consider the seemingly minor value of novelty. If a human-like value of novelty is not programmed into a superintelligent machine, it might explore the universe for valuable things up to a certain point, and then maximize the most valuable thing it finds (the exploration-exploitation tradeoff[58]) — tiling the solar system with brains in vats wired into happiness machines, for example. When a superintelligence is in charge, you have to get its motivational system exactly right in order to not make the future undesirable.
  2. The locality of human values: Imagine if the Friendly AI problem had faced the ancient Greeks, and they had programmed it with the most progressive moral values of their time. That would have led the world to a rather horrifying fate. But why should we think that humans have, in the 21st century, arrived at the apex of human morality? We can’t risk programming a superintelligent machine with the moral values we happen to hold today. But then, which moral values do we give it?

Yudkowsky suggests that we build a ‘seed AI’ to discover and then extrapolate the ‘coherent extrapolated volition’ of humanity:

> In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

The seed AI would use the results of this examination and extrapolation of human values to program the motivational system of the superintelligence that would determine the fate of the galaxy.

However, some worry that the collective will of humanity won’t converge on a coherent set of goals. Others believe that guaranteed Friendliness is not possible, even by such elaborate and careful means.

What is whole brain emulation?

Whole Brain Emulation (WBE) or ‘mind uploading’ is a computer emulation of all the cells and connections in a human brain. So even if the underlying principles of general intelligence prove difficult to discover, we might still emulate an entire human brain and make it run at a million times its normal speed (computer circuits communicate much faster than neurons do). Such a WBE could do as much thinking in one second as a normal human can in about eleven and a half days, and as much in a single day as a human can in well over two thousand years. So this would not lead immediately to smarter-than-human intelligence, but it would lead to faster-than-human intelligence. A WBE could be backed up (leading to a kind of immortality), and it could be copied so that hundreds or millions of WBEs could work on separate problems in parallel. If WBEs are created, they may therefore be able to solve scientific problems far more rapidly than ordinary humans, accelerating further technological progress.
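
The speed-up arithmetic is easy to check. The short sketch below converts wall-clock seconds into subjective time for a given speed-up factor; the helper name and the 365.25-day year are assumptions made for the example. It suggests that a million-fold speed-up yields roughly 11.6 subjective days per second, and that roughly 31.7 subjective years per second would correspond to a billion-fold speed-up.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumed average year length

def subjective_days(wall_clock_seconds: float, speedup: float) -> float:
    """Subjective days experienced by an emulation running `speedup` times faster."""
    return wall_clock_seconds * speedup / SECONDS_PER_DAY

days_at_million = subjective_days(1, 1e6)
years_at_billion = subjective_days(1, 1e9) / DAYS_PER_YEAR

print(f"{days_at_million:.1f} subjective days per second at 1e6x")    # 11.6
print(f"{years_at_billion:.1f} subjective years per second at 1e9x")  # 31.7
```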

Unanswered canonical questions

What is agent foundations?

What is the universal prior?