existential risk

From Stampy's Wiki
Main Question: Why might AI be an existential risk?


An existential risk (or x-risk) is a risk that poses astronomically large negative consequences for humanity, such as human extinction or permanent global totalitarianism.

Nick Bostrom introduced the term "existential risk" in his 2002 paper "Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards" [1]. In the paper, Bostrom defined an existential risk as:

One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.

The Oxford Future of Humanity Institute (FHI) was founded by Bostrom in 2005 in part to study existential risks. Other institutions with a generalist focus on existential risk include the Centre for the Study of Existential Risk.

FHI's existential-risk.org FAQ notes regarding the definition of "existential risk":

An existential risk is one that threatens the entire future of humanity. [...]

“Humanity”, in this context, does not mean “the biological species Homo sapiens”. If we humans were to evolve into another species, or merge or replace ourselves with intelligent machines, this would not necessarily mean that an existential catastrophe had occurred — although it might if the quality of life enjoyed by those new life forms turns out to be far inferior to that enjoyed by humans.


Classification of Existential Risks

Bostrom [2] proposes a series of classifications for existential risks:

  • Bangs - Earthly intelligent life is extinguished relatively suddenly by any cause; the prototypical end of humanity. Examples of bangs include deliberate or accidental misuse of nanotechnology, nuclear holocaust, the end of our simulation, or an unfriendly AI.
  • Crunches - The potential humanity had to enhance itself indefinitely is forever eliminated, although humanity continues. Possible crunches include an exhaustion of resources, social or governmental pressure ending technological development, and even future technological development proving an unsurpassable challenge before the creation of a superintelligence.
  • Shrieks - Humanity enhances itself, but explores only a narrow portion of its desirable possibilities. As the criteria for desirability haven't been defined yet, this category is mainly undefined. However, a flawed friendly AI incorrectly interpreting our values, a superhuman upload deciding its own values and imposing them on the rest of humanity, and an intolerant government outlawing social progress would certainly qualify.
  • Whimpers - Humanity endures, but only a fraction of our potential is ever achieved. Spread across the galaxy and expanding at near light-speed, we might find ourselves doomed by our own or another civilization's catastrophic physics experimentation, destroying reality at light-speed. A prolonged galactic war leading to our extinction or severe limitation would also be a whimper. More darkly, humanity might develop until its values were disjoint with ours today, making that civilization worthless by present-day values.

The total negative result of an existential catastrophe can be measured as the potential future lives that would never be realized. A rough and conservative calculation [3] gives a total of 10^54 potential future human lives, smarter, happier, and kinder than we are. Hence, almost no other task would have as much positive impact as existential risk reduction.
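The scale behind this claim can be made concrete with a toy calculation. The 10^54 figure is the estimate cited in the text; the one-percentage-point risk reduction is purely an illustrative assumption, not a figure from the source:

```python
# Toy expected-value sketch of the argument above.
# 10**54 is the cited estimate of potential future lives; the
# one-percentage-point risk reduction is an illustrative assumption.

POTENTIAL_FUTURE_LIVES = 10**54

# Reducing the probability of existential catastrophe by one
# percentage point raises the expected number of realized lives by
# 1% of the total (integer arithmetic keeps the figure exact).
expected_lives_gained = POTENTIAL_FUTURE_LIVES // 100

print(expected_lives_gained == 10**52)  # True: 10^52 lives in expectation
```

On this toy accounting, even a tiny shift in catastrophe probability dominates almost any other intervention, which is the intuition the paragraph above appeals to.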

Existential risks also present a unique challenge because of their irreversible nature. By definition, we will never experience and survive an existential catastrophe [4], and so cannot learn from our mistakes. They are also subject to strong observational selection effects [5]: one cannot estimate their future probability from the past because, in Bayesian terms, the conditional probability of a past existential catastrophe given our present existence is always 0, no matter how high the true probability of an existential risk is. Instead, indirect estimates have to be used, such as possible existential catastrophes happening elsewhere. A high extinction-risk probability could also function as a Great Filter, explaining why there is no evidence of space colonization.

Another related idea is that of a suffering risk (or s-risk).



The focus on existential risks on LessWrong dates back to Bostrom's 2003 paper "Astronomical Waste: The Opportunity Cost of Delayed Technological Development". It argues that "the chief goal for utilitarians should be to reduce existential risk". Bostrom writes:

If what we are concerned with is (something like) maximizing the expected number of worthwhile lives that we will create, then in addition to the opportunity cost of delayed colonization, we have to take into account the risk of failure to colonize at all. We might fall victim to an existential risk, one where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.[8] Because the lifespan of galaxies is measured in billions of years, whereas the time-scale of any delays that we could realistically affect would rather be measured in years or decades, the consideration of risk trumps the consideration of opportunity cost. For example, a single percentage point of reduction of existential risks would be worth (from a utilitarian expected utility point-of-view) a delay of over 10 million years.
Therefore, if our actions have even the slightest effect on the probability of eventual colonization, this will outweigh their effect on when colonization takes place. For standard utilitarians, priority number one, two, three and four should consequently be to reduce existential risk. The utilitarian imperative “Maximize expected aggregate utility!” can be simplified to the maxim “Minimize existential risk!”.
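The "over 10 million years" figure in the quote follows from simple arithmetic under two illustrative assumptions (a roughly billion-year horizon and value accruing uniformly over it; these are simplifications for exposition, not the paper's exact model):

```python
# Back-of-envelope sketch of the delay-vs-risk trade-off quoted above.
# Assumption: total achievable value accrues roughly uniformly over a
# horizon of about a billion years, so a delay of D years forfeits
# about D / HORIZON_YEARS of the total value.

HORIZON_YEARS = 10**9  # order of magnitude: billions of years

# A one-percentage-point reduction in existential risk raises expected
# value by 1% of the total, which equals the value forfeited by a delay
# of 1% of the horizon (integer arithmetic keeps the result exact).
equivalent_delay_years = HORIZON_YEARS // 100

print(equivalent_delay_years)  # 10000000 -> ten million years
```

With a longer horizon (galactic lifespans run to many billions of years), the equivalent delay only grows, which is why the quote says "over" 10 million years.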

The concept is expanded upon in his 2013 paper "Existential Risk Prevention as Global Priority".



  1. Bostrom, Nick (2002). "Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards". Journal of Evolution and Technology, Vol. 9, March 2002.
  2. Bostrom, Nick (2012). "Existential Risk Reduction as the Most Important Task for Humanity". Global Policy, forthcoming, 2012.
  3. Bostrom, Nick; Sandberg, Anders & Ćirković, Milan (2010). "Anthropic Shadow: Observation Selection Effects and Human Extinction Risks". Risk Analysis, Vol. 30, No. 10 (2010): 1495-1506.
  4. Bostrom, Nick & Ćirković, Milan M., eds. (2008). Global Catastrophic Risks. Oxford University Press.
  5. Ćirković, Milan M. (2008). "Observation Selection Effects and Global Catastrophic Risks". In Global Catastrophic Risks. Oxford University Press.
  6. Yudkowsky, Eliezer S. (2008). "Cognitive Biases Potentially Affecting Judgment of Global Risks". In Global Catastrophic Risks. Oxford University Press.
  7. Posner, Richard A. (2004). Catastrophe: Risk and Response. Oxford University Press.

Canonically answered

How might non-agentic GPT-style AI cause an "intelligence explosion" or otherwise contribute to existential risk?


One threat model which includes a GPT component is Misaligned Model-Based RL Agent. It suggests that a reinforcement learner attached to a GPT-style world model could pose an existential risk, with the RL agent acting as an optimizer that uses the world model to achieve its goals much more effectively.

Another possibility is that a sufficiently powerful world model may develop mesa optimizers which could influence the world via the outputs of the model to achieve the mesa objective (perhaps by causing an optimizer to be created with goals aligned to it), though this is somewhat speculative.

What are some AI alignment research agendas currently being pursued?


Research at the Alignment Research Center is led by Paul Christiano, best known for introducing the “Iterated Distillation and Amplification” and “Humans Consulting HCH” approaches. He and his team are now “trying to figure out how to train ML systems to answer questions by straightforwardly ‘translating’ their beliefs into natural language rather than by reasoning about what a human wants to hear.”

Chris Olah (after work at Google Brain and OpenAI) recently launched Anthropic, an AI lab focused on the safety of large models. While his previous work was concerned with the “transparency” and “interpretability” of large neural networks, especially vision models, Anthropic is focusing more on large language models, among other things working towards a "general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless".

Stuart Russell and his team at the Center for Human-Compatible Artificial Intelligence (CHAI) have been working on inverse reinforcement learning (where the AI infers human values from observing human behavior) and corrigibility, as well as attempts to disaggregate neural networks into “meaningful” subcomponents (see Filan, et al.’s “Clusterability in neural networks” and Hod et al.'s “Detecting modularity in deep neural networks”).

Alongside the more abstract “agent foundations” work they have become known for, MIRI recently announced their “Visible Thoughts Project” to test the hypothesis that “Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.”

OpenAI have recently been doing work on iteratively summarizing books (summarizing, and then summarizing the summary, etc.) as a method for scaling human oversight.

Stuart Armstrong’s recently launched AlignedAI are mainly working on concept extrapolation from familiar to novel contexts, something he believes is “necessary and almost sufficient” for AI alignment.

Redwood Research (Buck Shlegeris, et al.) are trying to “handicap” GPT-3 so that it only produces non-violent completions of text prompts. “The idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like ‘don't be deceitful’, and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner.”

Ought is an independent AI safety research organization led by Andreas Stuhlmüller and Jungwon Byun. They are researching methods for breaking up complex, hard-to-verify tasks into simpler, easier-to-verify tasks, with the aim of allowing us to maintain effective oversight over AIs.

Non-canonical answers

How does the field of AI Safety want to accomplish its goal of preventing existential risk?



Governance: e.g. by establishing best practices, institutions and processes, awareness, regulation, certification, etc.

Unanswered canonical questions

There is a general consensus that any AGI would be very dangerous, because it would not necessarily be aligned. But if the AGI does not have any reward function and is a pattern matcher like GPT, how would it come to pose an x-risk, or resist being put in a box or shut down?
I can definitely imagine it being dangerous, or the continuity in its answers being problematic, but the whole going-exponential and valuing-its-own-survival story does not seem to necessarily apply?