Back to Improve answers.
These 148 canonical answers are not marked as a related or follow-up to any other canonical answer, so cannot be found through the read interface by normal browsing. Feel free to add them to some! See a full list of available canonical questions or browse by tags.
The degree to which an Artificial Superintelligence (ASI) would resemble us depends heavily on how it is implemented, but it seems that differences are unavoidable. If AI is accomplished through whole brain emulation and we make a big effort to make it as human as possible (including giving it a humanoid body), the AI could probably be said to think like a human. However, by definition of ASI it would be much smarter. Differences in the substrate and body might open up numerous possibilities (such as immortality, different sensors, easy self-improvement, ability to make copies, etc.). Its social experience and upbringing would likely also be entirely different. All of this can significantly change the ASI's values and outlook on the world, even if it would still use the same algorithms as we do. This is essentially the "best case scenario" for human resemblance, but whole brain emulation is kind of a separate field from AI, even if both aim to build intelligent machines. Most approaches to AI are vastly different and most ASIs would likely not have humanoid bodies. At this moment in time it seems much easier to create a machine that is intelligent than a machine that is exactly like a human (it's certainly a bigger target).
Once an AGI has access to the internet it would be very challenging to meaningfully restrict it from doing things online which it wants to. There are too many options to bypass blocks we may put in place.
It may be possible to design it so that it does not want to do dangerous things in the first place, or perhaps to set up tripwires so that we notice that it’s trying to do a dangerous thing, though that relies on it not noticing or bypassing the tripwire so should not be the only layer of security.
We're also building a web UI (early prototype) and bot interface, so you'll soon be able to browse the FAQ and other sources in a cleaner way than the wiki.
We're also building a web UI (early prototype) and bot interface, so you'll soon be able to browse the FAQ and other sources in a cleaner way than the wiki.
The goals of the project are to:
- Offer a one-stop-shop for high-quality answers to common questions about AI alignment.
- Let people answer questions in a way which scales, freeing up researcher time while allowing more people to learn from a reliable source.
- Make external resources more easy to find by having links to them connected to a search engine which gets smarter the more it's used.
- Provide a form of legitimate peripheral participation for the AI Safety community, as an on-boarding path with a flexible level of commitment.
- Encourage people to think, read, and talk about AI alignment while answering questions, creating a community of co-learners who can give each other feedback and social reinforcement.
- Provide a way for budding researchers to prove their understanding of the topic and ability to produce good work.
- Collect data about the kinds of questions people actually ask and how they respond, so we can better focus resources on answering them.
- Track reactions on messages so we can learn which answers need work.
- Identify missing external content to create.
There is significant controversy on how quickly AI will grow into a superintelligence. The Alignment Forum tag has many views on how things might unfold, where the probabilities of a soft (happening over years/decades) takeoff and a hard (happening in months, or less) takeoff are discussed.
The long reflection is a hypothesized period of time during which humanity works out how best to realize its long-term potential.
Some effective altruists, including Toby Ord and William MacAskill, have argued that, if humanity succeeds in eliminating existential risk or reducing it to acceptable levels, it should not immediately embark on an ambitious and potentially irreversible project of arranging the universe's resources in accordance to its values, but ought instead to spend considerable time— "centuries (or more)"; "perhaps tens of thousands of years"; "thousands or millions of years"; "[p]erhaps... a million years"—figuring out what is in fact of value. The long reflection may thus be seen as an intermediate stage in a rational long-term human developmental trajectory, following an initial stage of existential security when existential risk is drastically reduced and followed by a final stage when humanity's potential is fully realized.
The idea of a long reflection has been criticized on the grounds that virtually eliminating all existential risk will almost certainly require taking a variety of large-scale, irreversible decisions—related to space colonization, global governance, cognitive enhancement, and so on—which are precisely the decisions meant to be discussed during the long reflection. Since there are pervasive and inescapable tradeoffs between reducing existential risk and retaining moral option value, it may be argued that it does not make sense to frame humanity's long-term strategic picture as one consisting of two distinct stages, with one taking precedence over the other.
Aird, Michael (2020) Collection of sources that are highly relevant to the idea of the Long Reflection, Effective Altruism Forum, June 20.
Many additional resources on this topic.
Wiblin, Robert & Keiran Harris (2018) Our descendants will probably see us as moral monsters. what should we do about that?, 80,000 Hours, January 19.
Interview with William MacAskill about the long reflection and other topics.
Ord, Toby (2020) The Precipice: Existential Risk and the Future of Humanity, London: Bloomsbury Publishing.
Greaves, Hilary et al. (2019) A research agenda for the Global Priorities Institute, Oxford.
Dai, Wei (2019) The argument from philosophical difficulty, LessWrong, February 9.
William MacAskill, in Perry, Lucas (2018) AI alignment podcast: moral uncertainty and the path to AI alignment with William MacAskill, AI Alignment podcast, September 17.
Ord, Toby (2020) The Precipice: Existential Risk and the Future of Humanity, London: Bloomsbury Publishing.
Stocker, Felix (2020) Reflecting on the long reflection, Felix Stocker’s Blog, August 14.
Hanson, Robin (2021) ‘Long reflection’ is crazy bad idea, Overcoming Bias, October 20.
If the AI system was deceptively aligned (i.e. pretending to be nice until it was in control of the situation) or had been in stealth mode while getting things in place for a takeover, quite possibly within hours. We may get more warning with weaker systems, if the AGI does not feel at all threatened by us, or if a complex ecosystem of AI systems is built over time and we gradually lose control.
Paul Christiano writes a story of alignment failure which shows a relatively fast transition.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. -Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
Many parts of the AI alignment ecosystem are already well-funded, but a savvy donor can still make a difference by picking up grantmaking opportunities which are too small to catch the attention of the major funding bodies or are based on personal knowledge of the recipient.
One way to leverage a small amount of money to the potential of a large amount is to enter a donor lottery, where you donate to win a chance to direct a much larger amount of money (with probability proportional to donation size). This means that the person directing the money will be allocating enough that it's worth their time to do more in-depth research.
For an overview of the work the major organizations are doing, see the 2021 AI Alignment Literature Review and Charity Comparison. The Long-Term Future Fund seems to be an outstanding place to donate based on that, as they are the organization which most other organizations are most excited to see funded.
Current narrow systems are much more domain-specific than AGI. We don’t know what the first AGI will look like, some people think the GPT-3 architecture but scaled up a lot may get us there (GPT-3 is a giant prediction model which when trained on a vast amount of text seems to learn how to learn and do all sorts of crazy-impressive things, a related model can generate pictures from text), some people don’t think scaling this kind of model will get us all the way.
Putting aside the complexity of defining what is "the" moral way to behave (or even "a" moral way to behave), even an AI which can figure out what it is might not "want to" follow it itself.
A deceptive agent (AI or human) may know perfectly well what behaviour is considered moral, but if their values are not aligned, they may decide to act differently to pursue their own interests.
It might look like there are straightforward ways to eliminate the problems of unaligned superintelligence, but so far all of them turn out to have hidden difficulties. There are many open problems identified by the research community which a solution would need to reliably overcome to be successful.
If you're not already there, join the public Discord or ask for an invite to the semi-private one where contributors generally hang out.
The main ways you can help are to answer questions or add questions, or help to review questions, review answers, or improve answers (instructions for helping out with each of these tasks are on the linked pages). You could also join the dev team if you have programming skills.
Evidential Decision Theory – EDT – is a branch of decision theory which advises an agent to take actions which, conditional on it happening, maximizes the chances of the desired outcome. As any branch of decision theory, it prescribes taking the action that maximizes utility, that which utility equals or exceeds the utility of every other option. The utility of each action is measured by the expected utility, the averaged by probabilities sum of the utility of each of its possible results. How the actions can influence the probabilities differ between the branches. Causal Decision Theory – CDT – says only through causal process one can influence the chances of the desired outcome [#fn1 1]. EDT, on the other hand, requires no causal connection, the action only have to be a Bayesian evidence for the desired outcome. Some critics say it recommends auspiciousness over causal efficacy[#fn2 2].
One usual example where EDT and CDT commonly diverge is the Smoking lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT would recommend smoking since there is no causal connection between smoking and cancer. They are both caused by a gene, but have no causal direct connection with each other. EDT on the other hand wound recommend against smoking, since smoking is an evidence for having the mentioned gene and thus should be avoided.
CDT uses probabilities of conditionals and contrafactual dependence to calculate the expected utility of an action – which track causal relations -, whereas EDT simply uses conditional probabilities. The probability of a conditional is the probability of the whole conditional being true, where the conditional probability is the probability of the consequent given the antecedent. A conditional probability of B given A - P(B|A) -, simply implies the Bayesian probability of the event B happening given we known A happened, it’s used in EDT. The probability of conditionals – P(A > B) - refers to the probability that the conditional 'A implies B' is true, it is the probability of the contrafactual ‘If A, then B’ be the case. Since contrafactual analysis is the key tool used to speak about causality, probability of conditionals are said to mirror causal relations. In most usual cases these two probabilities are the same. However, David Lewis proved [#fn3 3] its’ impossible to probabilities of conditionals to always track conditional probabilities. Hence evidential relations aren’t the same as causal relations and CDT and EDT will diverge depending on the problem. In some cases EDT gives a better answers then CDT, such as the Newcomb's problem, whereas in the Smoking lesion problem where CDT seems to give a more reasonable prescription.
- http://plato.stanford.edu/entries/decision-causal/[#fnref1 ↩]
- Joyce, J.M. (1999), The foundations of causal decision theory, p. 146[#fnref2 ↩]
- Lewis, D. (1976), "Probabilities of conditionals and conditional probabilities", The Philosophical Review (Duke University Press) 85 (3): 297–315[#fnref3 ↩]
- Smoking Lesion Steelman by Abram Demski
- Decision Theory FAQ by Luke Muehlhauser
- On Causation and Correlation Part 1
- Two-boxing, smoking and chewing gum in Medical Newcomb problems by Caspar Oesterheld
- Did EDT get it right all along? Introducing yet another medical Newcomb problem by Johannes Treutlein
- "Betting on the Past" by Arif Ahmed by Johannes Treutlein
- Why conditioning on "the agent takes action a" isn't enough by Nate Soares
Dreyfus and Penrose have argued that human cognitive abilities can’t be emulated by a computational machine. Searle and Block argue that certain kinds of machines cannot have a mind (consciousness, intentionality, etc.). But these objections need not concern those who predict an intelligence explosion.
We can reply to Dreyfus and Penrose by noting that an intelligence explosion does not require an AI to be a classical computational system. And we can reply to Searle and Block by noting that an intelligence explosion does not depend on machines having consciousness or other properties of ‘mind’, only that it be able to solve problems better than humans can in a wide variety of unpredictable environments. As Edsger Dijkstra once said, the question of whether a machine can ‘really’ think is “no more interesting than the question of whether a submarine can swim.”
Others who are pessimistic about an intelligence explosion occurring within the next few centuries don’t have a specific objection but instead think there are hidden obstacles that will reveal themselves and slow or halt progress toward machine superintelligence.
Finally, a global catastrophe like nuclear war or a large asteroid impact could so damage human civilization that the intelligence explosion never occurs. Or, a stable and global totalitarianism could prevent the technological development required for an intelligence explosion to occur.
Predicting the future is risky business. There are many philosophical, scientific, technological, and social uncertainties relevant to the arrival of an intelligence explosion. Because of this, experts disagree on when this event might occur. Here are some of their predictions:
- Futurist Ray Kurzweil predicts that machines will reach human-level intelligence by 2030 and that we will reach “a profound and disruptive transformation in human capability” by 2045.
- Intel’s chief technology officer, Justin Rattner, expects “a point when human and artificial intelligence merges to create something bigger than itself” by 2048.
- AI researcher Eliezer Yudkowsky expects the intelligence explosion by 2060.
- Philosopher David Chalmers has over 1/2 credence in the intelligence explosion occurring by 2100.
- Quantum computing expert Michael Nielsen estimates that the probability of the intelligence explosion occurring by 2100 is between 0.2% and about 70%.
- In 2009, at the AGI-09 conference, experts were asked when AI might reach superintelligence with massive new funding. The median estimates were that machine superintelligence could be achieved by 2045 (with 50% confidence) or by 2100 (with 90% confidence). Of course, attendees to this conference were self-selected to think that near-term artificial general intelligence is plausible.
- iRobot CEO Rodney Brooks and cognitive scientist Douglas Hofstadter allow that the intelligence explosion may occur in the future, but probably not in the 21st century.
- Roboticist Hans Moravec predicts that AI will surpass human intelligence “well before 2050.”
- In a 2005 survey of 26 contributors to a series of reports on emerging technologies, the median estimate for machines reaching human-level intelligence was 2085.
- Participants in a 2011 intelligence conference at Oxford gave a median estimate of 2050 for when there will be a 50% of human-level machine intelligence, and a median estimate of 2150 for when there will be a 90% chance of human-level machine intelligence.
- On the other hand, 41% of the participants in the [email protected] conference (in 2006) stated that machine intelligence would never reach the human level.
- Baum, Goertzel, & Goertzel, Long Until Human-Level AI? Results from an Expert Assessment
First, even “narrow” AI systems, which approach or surpass human intelligence in a small set of capabilities (such as image or voice recognition) already raise important questions regarding their impact on society. Making autonomous vehicles safe, analyzing the strategic and ethical dimensions of autonomous weapons, and the effect of AI on the global employment and economic systems are three examples. Second, the longer-term implications of human or super-human artificial intelligence are dramatic, and there is no consensus on how quickly such capabilities will be developed. Many experts believe there is a chance it could happen rather soon, making it imperative to begin investigating long-term safety issues now, if only to get a better sense of how much early progress is actually possible.
One threat model which includes a GPT component is Misaligned Model-Based RL Agent. It suggests that a reinforcement learner attached to a GPT-style world model could lead to an existential risk, with the RL agent being the optimizer which uses the world model to be much more effective at achieving its goals.
Another possibility is that a sufficiently powerful world model may develop mesa optimizers which could influence the world via the outputs of the model to achieve the mesa objective (perhaps by causing an optimizer to be created with goals aligned to it), though this is somewhat speculative.
We can run some tests and simulations to try and figure out how an AI might act once it ascends to superintelligence, but those tests might not be reliable.
Suppose we tell an AI that expects to later achieve superintelligence that it should calculate as many digits of pi as possible. It considers two strategies.
First, it could try to seize control of more computing resources now. It would likely fail, its human handlers would likely reprogram it, and then it could never calculate very many digits of pi.
Second, it could sit quietly and calculate, falsely reassuring its human handlers that it had no intention of taking over the world. Then its human handlers might allow it to achieve superintelligence, after which it could take over the world and calculate hundreds of trillions of digits of pi.
Since self-protection and goal stability are convergent instrumental goals, a weak AI will present itself as being as friendly to humans as possible, whether it is in fact friendly to humans or not. If it is “only” as smart as Einstein, it may be very good at deceiving humans into believing what it wants them to believe even before it is fully superintelligent.
There’s a second consideration here too: superintelligences have more options. An AI only as smart and powerful as an ordinary human really won’t have any options better than calculating the digits of pi manually. If asked to cure cancer, it won’t have any options better than the ones ordinary humans have – becoming doctors, going into pharmaceutical research. It’s only after an AI becomes superintelligent that there’s a serious risk of an AI takeover.
So if you tell an AI to cure cancer, and it becomes a doctor and goes into cancer research, then you have three possibilities. First, you’ve programmed it well and it understands what you meant. Second, it’s genuinely focused on research now but if it becomes more powerful it would switch to destroying the world. And third, it’s trying to trick you into trusting it so that you give it more power, after which it can definitively “cure” cancer with nuclear weapons.
Intelligence measures an agent’s ability to achieve goals in a wide range of environments.
This is a bit vague, but serves as the working definition of ‘intelligence’. For a more in-depth exploration, see Efficient Cross-Domain Optimization.
- Wikipedia, Intelligence
- Neisser et al., Intelligence: Knowns and Unknowns
- Wasserman & Zentall (eds.), Comparative Cognition: Experimental Explorations of Animal Intelligence
- Legg, Definitions of Intelligence
After reviewing extensive literature on the subject, Legg and Hutter summarizes the many possible valuable definitions in the informal statement “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.” They then show this definition can be mathematically formalized given reasonable mathematical definitions of its terms. They use Solomonoff induction - a formalization of Occam's razor - to construct an universal artificial intelligence with a embedded utility function which assigns less utility to those actions based on theories with higher complexity. They argue this final formalization is a valid, meaningful, informative, general, unbiased, fundamental, objective, universal and practical definition of intelligence.
We can relate Legg and Hutter's definition with the concept of optimization. According to Eliezer Yudkowsky intelligence is efficient cross-domain optimization. It measures an agent's capacity for efficient cross-domain optimization of the world according to the agent’s preferences. Optimization measures not only the capacity to achieve the desired goal but also is inversely proportional to the amount of resources used. It’s the ability to steer the future so it hits that small target of desired outcomes in the large space of all possible outcomes, using fewer resources as possible. For example, when Deep Blue defeated Kasparov, it was able to hit that small possible outcome where it made the right order of moves given Kasparov’s moves from the very large set of all possible moves. In that domain, it was more optimal than Kasparov. However, Kasparov would have defeated Deep Blue in almost any other relevant domain, and hence, he is considered more intelligent.
One could cast this definition in a possible world vocabulary, intelligence is:
- the ability to precisely realize one of the members of a small set of possible future worlds that have a higher preference over the vast set of all other possible worlds with lower preference; while
- using fewer resources than the other alternatives paths for getting there; and in the
- most diverse domains as possible.
How many more worlds have a higher preference then the one realized by the agent, less intelligent he is. How many more worlds have a lower preference than the one realized by the agent, more intelligent he is. (Or: How much smaller is the set of worlds at least as preferable as the one realized, more intelligent the agent is). How much less paths for realizing the desired world using fewer resources than those spent by the agent, more intelligent he is. And finally, in how many more domains the agent can be more efficiently optimal, more intelligent he is. Restating it, the intelligence of an agent is directly proportional to:
- (a) the numbers of worlds with lower preference than the one realized,
- (b) how much smaller is the set of paths more efficient than the one taken by the agent and
- (c) how more wider are the domains where the agent can effectively realize his preferences;
and it is, accordingly, inversely proportional to:
- (d) the numbers of world with higher preference than the one realized,
- (e) how much bigger is the set of paths more efficient than the one taken by the agent and
- (f) how much more narrow are the domains where the agent can efficiently realize his preferences.
This definition avoids several problems common in many others definitions, especially it avoids anthropomorphizing intelligence.
The major AI companies are thinking about this. OpenAI was founded specifically with the intention to counter risks from superintelligence, many people at Google, DeepMind, and other organizations are convinced by the arguments and few genuinely oppose work in the field (though some claim it’s premature). For example, the paper Concrete Problems in AI Safety was a collaboration between researchers at Google Brain, Stanford, Berkeley, and OpenAI.
However, the vast majority of the effort these organizations put forwards is towards capabilities research, rather than safety.
Imagine, for example, that you are tasked with reducing traffic congestion in San Francisco at all costs, i.e. you do not take into account any other constraints. How would you do it? You might start by just timing traffic lights better. But wouldn’t there be less traffic if all the bridges closed down from 5 to 10AM, preventing all those cars from entering the city? Such a measure obviously violates common sense, and subverts the purpose of improving traffic, which is to help people get around – but it is consistent with the goal of “reducing traffic congestion”.
GPT-3 is the newest and most impressive of the GPT (Generative Pretrained Transformer) series of large transformer-based language models created by OpenAI. It was announced in June 2020, and is 100 times larger than its predecessor GPT-2.
Gwern has several resources exploring GPT-3's abilities, limitations, and implications including:
- The Scaling Hypothesis - How simply increasing the amount of compute with current algorithms might create very powerful systems.
- GPT-3 Nonfiction
- GPT-3 Creative Fiction
Vox has an article which explains why GPT-3 is a big deal.
- GPT-3: What’s it good for? - Cambridge University Press
All the content below is in English:
- The AI technical safety section of the 80,000 Hours Podcast;
- The AI X-risk Research Podcast, hosted by Daniel Filan;
- The AI Alignment Podcast hosted by Lucas Perry from the Future of Life Institute (ran ~monthly from April 2018 to March 2021);
- The Alignment Newsletter Podcast by Rob Miles (an audio version of the weekly newsletter).
That is, if you know an AI is likely to be superintelligent, can’t you just disconnect it from the Internet, not give it access to any speakers that can make mysterious buzzes and hums, make sure the only people who interact with it are trained in caution, et cetera?. Isn’t there some level of security – maybe the level we use for that room in the CDC where people in containment suits hundreds of feet underground analyze the latest superviruses – with which a superintelligence could be safe?
This puts us back in the same situation as lions trying to figure out whether or not nuclear weapons are a things humans can do. But suppose there is such a level of security. You build a superintelligence, and you put it in an airtight chamber deep in a cave with no Internet connection and only carefully-trained security experts to talk to. What now?
Now you have a superintelligence which is possibly safe but definitely useless. The whole point of building superintelligences is that they’re smart enough to do useful things like cure cancer. But if you have the monks ask the superintelligence for a cancer cure, and it gives them one, that’s a clear security vulnerability. You have a superintelligence locked up in a cave with no way to influence the outside world except that you’re going to mass produce a chemical it gives you and inject it into millions of people.
Or maybe none of this happens, and the superintelligence sits inert in its cave. And then another team somewhere else invents a second superintelligence. And then a third team invents a third superintelligence. Remember, it was only about ten years between Deep Blue beating Kasparov, and everybody having Deep Blue – level chess engines on their laptops. And the first twenty teams are responsible and keep their superintelligences locked in caves with carefully-trained experts, and the twenty-first team is a little less responsible, and now we still have to deal with a rogue superintelligence.
Superintelligences are extremely dangerous, and no normal means of controlling them can entirely remove the danger.