|Main Question: Why might AI be an existential risk?|
|Alignment Forum Tag|
An existential risk (or x-risk) is a risk that poses astronomically large negative consequences for humanity, such as human extinction or permanent global totalitarianism.
Nick Bostrom introduced the term "existential risk" in his 2002 paper "Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards."1 In the paper, Bostrom defined an existential risk as:
One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.
The Oxford Future of Humanity Institute (FHI) was founded by Bostrom in 2005 in part to study existential risks. Other institutions with a generalist focus on existential risk include the Centre for the Study of Existential Risk.
FHI's existential-risk.org FAQ notes regarding the definition of "existential risk":
An existential risk is one that threatens the entire future of humanity. [...]
“Humanity”, in this context, does not mean “the biological species Homo sapiens”. If we humans were to evolve into another species, or merge or replace ourselves with intelligent machines, this would not necessarily mean that an existential catastrophe had occurred — although it might if the quality of life enjoyed by those new life forms turns out to be far inferior to that enjoyed by humans.
Classification of Existential Risks
Bostrom2 proposes a series of classifications for existential risks:
- Bangs - Earthly intelligent life is extinguished relatively suddenly by any cause; the prototypical end of humanity. Examples of bangs include deliberate or accidental misuse of nanotechnology, nuclear holocaust, the end of our simulation, or an unfriendly AI.
- Crunches - The potential humanity had to enhance itself indefinitely is forever eliminated, although humanity continues. Possible crunches include an exhaustion of resources, social or governmental pressure ending technological development, and even future technological development proving an unsurpassable challenge before the creation of a superintelligence.
- Shrieks - Humanity enhances itself, but explores only a narrow portion of its desirable possibilities. As the criteria for desirability haven't been defined yet, this category is mainly undefined. However, a flawed friendly AI incorrectly interpreting our values, a superhuman upload deciding its own values and imposing them on the rest of humanity, and an intolerant government outlawing social progress would certainly qualify.
- Whimpers - Though humanity is enduring, only a fraction of our potential is ever achieved. Spread across the galaxy and expanding at near light-speed, we might find ourselves doomed by ours or another being's catastrophic physics experimentation, destroying reality at light-speed. A prolonged galactic war leading to our extinction or severe limitation would also be a whimper. More darkly, humanity might develop until its values were disjoint with ours today, making their civilization worthless by present values.
The total negative results of an existential risk could amount to the total of potential future lives not being realized. A rough and conservative calculation3 gives us a total of 10^54 potential future humans lives – smarter, happier and kinder then we are. Hence, almost no other task would amount to so much positive impact than existential risk reduction.
Existential risks also present an unique challenge because of their irreversible nature. We will never, by definition, experience and survive an extinction risk4 and so cannot learn from our mistakes. They are subject to strong observational selection effects 5. One cannot estimate their future probability based on the past, because bayesianly speaking, the conditional probability of a past existential catastrophe given our present existence is always 0, no matter how high the probability of an existential risk really is. Instead, indirect estimates have to be used, such as possible existential catastrophes happening elsewhere. A high extinction risk probability could be functioning as a Great Filter and explain why there is no evidence of spacial colonization.
Another related idea is that of a suffering risk (or s-risk).
The focus on existential risks on LessWrong dates back to Bostrom's 2002 paper Astronomical Waste: The Opportunity Cost of Delayed Technological Development. It argues that "the chief goal for utilitarians should be to reduce existential risk". Bostrom writes:
If what we are concerned with is (something like) maximizing the expected number of worthwhile lives that we will create, then in addition to the opportunity cost of delayed colonization, we have to take into account the risk of failure to colonize at all. We might fall victim to an existential risk, one where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential. Because the lifespan of galaxies is measured in billions of years, whereas the time-scale of any delays that we could realistically affect would rather be measured in years or decades, the consideration of risk trumps the consideration of opportunity cost. For example, a single percentage point of reduction of existential risks would be worth (from a utilitarian expected utility point-of-view) a delay of over 10 million years.
Therefore, if our actions have even the slightest effect on the probability of eventual colonization, this will outweigh their effect on when colonization takes place. For standard utilitarians, priority number one, two, three and four should consequently be to reduce existential risk. The utilitarian imperative “Maximize expected aggregate utility!” can be simplified to the maxim “Minimize existential risk!”.
The concept is expanded upon in his 2012 paper Existential Risk Prevention as Global Priority
- Intelligence enhancement as existential risk mitigation
- Our society lacks good self-preservation mechanisms
- Disambiguating doom
- Existential Risk
- Machine Intelligence Research Institute
- The Future of Humanity Institute
- The Oxford Martin Programme on the Impacts of Future Technology
- Global Catastrophic Risk Institute
- Saving Humanity from Homo Sapiens
- Skoll Global Threats Fund (To Safeguard Humanity from Global Threats)
- Foresight Institute
- Defusing the Nuclear Threat
- Leverage Research
- The Lifeboat Foundation
- BOSTROM, Nick. (2002) "Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards". Journal of Evolution and Technology, Vol. 9, March 2002.
- BOSTROM, Nick. (2012) "Existential Risk Reduction as the Most Important Task for Humanity". Global Policy, forthcoming, 2012.
- BOSTROM, Nick & SANDBERG, Anders & CIRKOVIC, Milan. (2010) "Anthropic Shadow: Observation Selection Effects and Human Extinction Risks" Risk Analysis, Vol. 30, No. 10 (2010): 1495-1506.
- Nick Bostrom, Milan M. Ćirković, ed (2008). Global Catastrophic Risks. Oxford University Press.
- Milan M. Ćirković (2008). "Observation Selection Effects and global catastrophic risks". Global Catastrophic Risks. Oxford University Press.
- Eliezer S. Yudkowsky (2008). "Cognitive Biases Potentially Affecting Judgment of Global Risks". Global Catastrophic Risks. Oxford University Press. (PDF)
- Richard A. Posner (2004). Catastrophe Risk and Response. Oxford University Press. (DOC)
There is a general consensus that any AGI would be very dangerous because not necessarilly aligned. But if the AGI does not have any reward function and is a pattern matcher like GPT, how would it go about to leading to X-risks/not being able to be put into a box/shut down?
I can definitely imagine it being dangerous, or it having continuity in its answer which might be problematic, but the whole going exponential and valuing its own survival does not seem to necessarilly apply?
One threat model which includes a GPT component is Misaligned Model-Based RL Agent. It suggests that a reinforcement learner attached to a GPT-style world model could lead to an existential risk, with the RL agent being the optimizer which uses the world model to be much more effective at achieving its goals.
Another possibility is that a sufficiently powerful world model may develop mesa optimizers which could influence the world via the outputs of the model to achieve the mesa objective (perhaps by causing an optimizer to be created with goals aligned to it), though this is somewhat speculative.