Canonical answers with high stamps
Back to Browse FAQ.
As far as we know from the observable universe, morality is just a construct of the human mind. It is meaningful to us, but it is not necessarily meaningful to the vast universe outside of our minds. There is no reason to suspect that our set of values is objectively superior to any other arbitrary set of values, e.i. “the more paper clips, the better!” Consider the case of the psychopathic genius. Plenty have existed, and they negate any correlation between intelligence and morality.
Why can’t we just use Asimov’s 3 laws of robotics?
Isaac Asimov wrote those laws as a plot device for science fiction novels. Every story in the I, Robot series details a way that the laws can go wrong and be misinterpreted by robots. The laws are not a solution because they are an overly-simple set of natural language instructions that don’t have clearly defined terms and don’t factor in all edge-case scenarios.
If the AI system was deceptively aligned (i.e. pretending to be nice until it was in control of the situation) or had been in stealth mode while getting things in place for a takeover, quite possibly within hours. We may get more warning with weaker systems, if the AGI does not feel at all threatened by us, or if a complex ecosystem of AI systems is built over time and we gradually lose control.
Paul Christiano writes an story of alignment failure which shows a relatively fast transition.
It depends on what is meant by advanced. Many AI systems which are very effective and advanced narrow intelligences would not try to upgrade themselves in an unbounded way, but becoming smarter is a convergent instrumental goal so we could expect most AGI designs to attempt it.
The problem is that increasing general problem solving ability is climbing in exactly the direction needed to trigger an intelligence explosion, while generating large economic and strategic payoffs to whoever achieves them. So even though we could, in principle, just not build the kind of systems which would recursively self-improve, in practice we probably will go ahead with constructing them, because they’re likely to be the most powerful.
The Control Problem is the problem of preventing artificial superintelligence (ASI) from having a negative impact on humanity. How do we keep a more intelligent being under control, or how do we align it with our values? If we succeed in solving this problem, intelligence vastly superior to ours can take the baton of human progress and carry it to unfathomable heights. Solving our most complex problems could be simple to a sufficiently intelligent machine. If we fail in solving the Control Problem and create a powerful ASI not aligned with our values, it could spell the end of the human race. For these reasons, The Control Problem may be the most important challenge that humanity has ever faced, and may be our last.
It likely will – however, intelligence is, by many definitions, the ability to figure out how to accomplish goals. Even in today’s advanced AI systems, the builders assign the goal but don’t tell the AI exactly how to accomplish it, nor necessarily predict in detail how it will be done; indeed those systems often solve problems in creative, unpredictable ways. Thus the thing that makes such systems intelligent is precisely what can make them difficult to predict and control. They may therefore attain the goal we set them via means inconsistent with our preferences.
The basic concern as AI systems become increasingly powerful is that they won’t do what we want them to do – perhaps because they aren’t correctly designed, perhaps because they are deliberately subverted, or perhaps because they do what we tell them to do rather than what we really want them to do (like in the classic stories of genies and wishes.) Many AI systems are programmed to have goals and to attain them as effectively as possible – for example, a trading algorithm has the goal of maximizing profit. Unless carefully designed to act in ways consistent with human values, a highly sophisticated AI trading system might exploit means that even the most ruthless financier would disavow. These are systems that literally have a mind of their own, and maintaining alignment between human interests and their choices and actions will be crucial.
Can an AI really be smarter than humans? Hasn't this been said for the past 30 years? Why is the near future different?
Until a thing has happened, it has never happened. We have been consistently improving both the optimization power and generality of our algorithms over that time period, and have little reason to expect it to suddenly stop. We’ve gone from coding systems specifically for a certain game (like Chess), to algorithms like MuZero which learn the rules of the game they’re playing and how to play at vastly superhuman skill levels purely via self-play across a broad range of games (e.g. Go, chess, shogi and various Atari games).
Human brains are a spaghetti tower generated by evolution with zero foresight, it would be surprising if they are the peak of physically possible intelligence. The brain doing things in complex ways is not strong evidence that we need to fully replicate those interactions if we can throw sufficient compute at the problem, as explained in Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain.
It is, however, plausible that for an AGI we need a lot more compute than we will get in the near future, or that some key insights are missing which we won’t get for a while. The OpenPhilanthropy report on how much computational power it would take to simulate the brain is the most careful attempt at reasoning out how far we are from being able to do it, and suggests that by some estimates we already have enough computational resources, and by some estimates moore’s law may let us reach it before too long.
It also seems that much of the human brain exists to observe and regulate our biological body, which a body-less computer wouldn't need. If that's true, then a human-level AI might be possible with considerably less compute than the human brain.
Is it possible to block an AI from doing certain things on the internet/accessing things on the internet - like some child lock type thing?
Once an AGI has access to the internet it would be very challenging to meaningfully restrict it from doing things online which it wants to. There are too many options to bypass blocks we may put in place.
It may be possible to design it so that it does not want to do dangerous things in the first place, or perhaps to set up tripwires so that we notice that it’s trying to do a dangerous thing, though that relies on it not noticing or bypassing the tripwire so should not be the only layer of security.
Sort answer: No, and could be dangerous to try.
Slightly longer answer: With any realistic real-world task assigned to an AGI, there are so many ways in which it could go wrong that trying to block them all off by hand is a hopeless task, especially when something smarter than you is trying to find creative new things to do. You run into the nearest unblocked strategy problem.
It may be dangerous to try this because if you try and hard-code a large number of things to avoid it increases the chance that there’s a bug in your code which causes major problems, simply by increasing the size of your codebase.
We could shut it down weaker systems, and this would be a useful guardrail against certain types of problem caused by narrow AI. However, once an AGI establishes itself (i.e. copies of itself everywhere, and later technological superiority), we could not unless it was corrigiable (https://www.lesswrong.com/tag/corrigibility) and willing to let humans adjust it. There may be a period in the early stages of an AGI's development where it would be trying very hard to convince us that we should not shut it down and/or hiding itself and/or making copies of itself onto every server on earth.
Instrumental Convergence (https://www.youtube.com/watch?v=ZeecOKBus3Q) and the Stop Button Problem (https://www.youtube.com/watch?v=3TYT1QfdfsM) are the key reasons it would not be simple to shut down a non corrigible advanced system. If the AI wants to collect stamps, being turned off means it gets less stamps, so even without an explicit goal of not being turned off it has an instrumental reason to avoid being turned off (e.g. once it acquires a detailed world model and general intelligence, it is likely to realize that by playing nice and pretending to be aligned if you have the power to turn it off, establishing control over any system we put in place to shut it down, and eliminating us if it has the power to reliably do so and we would otherwise pose a threat).
Why is transformative AI / AGI / superintelligence dangerous? Why might AI harm humans?
1. The Orthogonality Thesis: AI could have almost any goal while at the same time having high intelligence (aka ability to succeed at those goals). This means that we could build a very powerful agent which would not necessarily share human-friendly values. For example, the classic paperclip maximizer thought experiment explores this with an AI which has a goal of creating as many paperclips as possible, something that humans are (mostly) indifferent to, and as a side effect ends up destroying humanity to make room for more paperclip factories.
2. Complexity of value: What humans care about is not simple, and the space of all goals is large, so virtually all goals we could program into an AI would lead to worlds not valuable to humans if pursued by a sufficiently powerful agent. If we, for example, did not include our value of diversity of experience, we could end up with a world of endlessly looping simple pleasures, rather than beings living rich lives.
3. Instrumental Convergence: For almost any goal an AI has there are shared ‘instrumental’ steps, such as acquiring resources, preserving itself, and preserving the contents of its goals. This means that a powerful AI with goals that were not explicitly human-friendly would predictably both take actions that lead to the end of humanity (e.g. using resources humans need to live to further its goals, such as replacing our crop fields with vast numbers of solar panels to power its growth, or using the carbon in our bodies to build things) and prevent us from turning it off or altering its goals.
How soon will transformative AI / AGI / superintelligence likely come and why?
Very hard to say. This draft report (https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines) for the Open Philanthropy Project is perhaps the most careful attempt so far (and generates these graphs: https://docs.google.com/spreadsheets/d/1TjNQyVHvHlC-sZbcA7CRKcCp0NxV6MkkqBvL408xrJw/edit#gid=505210495), but there have also been expert surveys (https://slatestarcodex.com/2017/06/08/ssc-journal-club-ai-timelines/), and many people have shared various thoughts (https://www.lesswrong.com/tag/ai-timelines). Berkeley AI professor Stuart Russell (https://en.wikipedia.org/wiki/Stuart_J._Russell) has given his best guess as “sometime in our children’s lifetimes”, and Ray Kurzweil (https://en.wikipedia.org/wiki/Ray_Kurzweil) (Google’s director of engineering) predicts human level AI by 2029 and the singularity by 2045 (https://futurism.com/kurzweil-claims-that-the-singularity-will-happen-by-2045). The Metaculus question on publicly known AGI (https://www.metaculus.com/questions/3479/when-will-the-first-artificial-general-intelligence-system-be-devised-tested-and-publicly-known-of/) has a median of around 2029 (around 10 years sooner than it was before the GPT-3 AI showed unexpected ability on a broad range of tasks: https://gpt3examples.com/).
The consensus answer is something like: “highly uncertain, maybe not for over a hundred years, maybe in less than 15, with around the middle of the century looking fairly plausible”.
AGI is an algorithm with general intelligence, running not on evolution’s biology like all current general intelligences but on a substrate such as silicon engineered by an intelligence (initially computers designed by humans, later on likely dramatically more advanced hardware designed by earlier AGIs).
AI has so far always been designed and built by humans (i.e. a search process running on biological brains), but once our creations gain the ability to do AI research they will likely recursively self-improve by designing new and better versions of themselves initiating an intelligence explosion (i.e. use it’s intelligence to improve its own intelligence, creating a feedback loop), and resulting in a superintelligence. There are already early signs of AIs being trained to optimize other AIs.
Some authors (notably Robin Hanson) have argued that the intelligence explosion hypothesis is likely false, and in favor of a large number of roughly human level emulated minds operating instead, forming an uplifted economy which doubles every few hours. Eric Drexler’s Comprehensive AI Services model of what may happen is another alternate view, where many narrow superintelligent systems exist in parallel rather than there being a general-purpose superintelligent agent.
Going by the model advocated by Nick Bostrom, Eliezer Yudkowsky and many others, a superintelligence will likely gain various cognitive superpowers (table 8 gives a good overview), allowing it to direct the future much more effectively than humanity. Taking control of our resources by manipulation and hacking is a likely early step, followed by developing and deploying advanced technologies like molecular nanotechnology to dominate the physical world and achieve its goals.
A machine superintelligence, if programmed with the right motivations, could potentially solve all the problems that humans are trying to solve but haven’t had the ingenuity or processing speed to solve yet. A superintelligence might cure disabilities and diseases, achieve world peace, give humans vastly longer and healthier lives, eliminate food and energy shortages, boost scientific discovery and space exploration, and so on.
Furthermore, humanity faces several existential risks in the 21st century, including global nuclear war, bioweapons, superviruses, and more. A superintelligent machine would be more capable of solving those problems than humans are.
Intelligence is powerful. One might say that “Intelligence is no match for a gun, or for someone with lots of money,” but both guns and money were produced by intelligence. If not for our intelligence, humans would still be foraging the savannah for food.
Intelligence is what caused humans to dominate the planet in the blink of an eye (on evolutionary timescales). Intelligence is what allows us to eradicate diseases, and what gives us the potential to eradicate ourselves with nuclear war. Intelligence gives us superior strategic skills, superior social skills, superior economic productivity, and the power of invention.
A machine with superintelligence would be able to hack into vulnerable networks via the internet, commandeer those resources for additional computing power, take over mobile machines connected to networks connected to the internet, use them to build additional machines, perform scientific experiments to understand the world better than humans can, invent quantum computing and nanotechnology, manipulate the social world better than we can, and do whatever it can to give itself more power to achieve its goals — all at a speed much faster than humans can respond to.
Dreyfus and Penrose have argued that human cognitive abilities can’t be emulated by a computational machine. Searle and Block argue that certain kinds of machines cannot have a mind (consciousness, intentionality, etc.). But these objections need not concern those who predict an intelligence explosion.
We can reply to Dreyfus and Penrose by noting that an intelligence explosion does not require an AI to be a classical computational system. And we can reply to Searle and Block by noting that an intelligence explosion does not depend on machines having consciousness or other properties of ‘mind’, only that it be able to solve problems better than humans can in a wide variety of unpredictable environments. As Edsger Dijkstra once said, the question of whether a machine can ‘really’ think is “no more interesting than the question of whether a submarine can swim.”
Others who are pessimistic about an intelligence explosion occurring within the next few centuries don’t have a specific objection but instead think there are hidden obstacles that will reveal themselves and slow or halt progress toward machine superintelligence.
Finally, a global catastrophe like nuclear war or a large asteroid impact could so damage human civilization that the intelligence explosion never occurs. Or, a stable and global totalitarianism could prevent the technological development required for an intelligence explosion to occur.
Predicting the future is risky business. There are many philosophical, scientific, technological, and social uncertainties relevant to the arrival of an intelligence explosion. Because of this, experts disagree on when this event might occur. Here are some of their predictions:
- Futurist Ray Kurzweil predicts that machines will reach human-level intelligence by 2030 and that we will reach “a profound and disruptive transformation in human capability” by 2045.
- Intel’s chief technology officer, Justin Rattner, expects “a point when human and artificial intelligence merges to create something bigger than itself” by 2048.
- AI researcher Eliezer Yudkowsky expects the intelligence explosion by 2060.
- Philosopher David Chalmers has over 1/2 credence in the intelligence explosion occurring by 2100.
- Quantum computing expert Michael Nielsen estimates that the probability of the intelligence explosion occurring by 2100 is between 0.2% and about 70%.
- In 2009, at the AGI-09 conference, experts were asked when AI might reach superintelligence with massive new funding. The median estimates were that machine superintelligence could be achieved by 2045 (with 50% confidence) or by 2100 (with 90% confidence). Of course, attendees to this conference were self-selected to think that near-term artificial general intelligence is plausible.
- iRobot CEO Rodney Brooks and cognitive scientist Douglas Hofstadter allow that the intelligence explosion may occur in the future, but probably not in the 21st century.
- Roboticist Hans Moravec predicts that AI will surpass human intelligence “well before 2050.”
- In a 2005 survey of 26 contributors to a series of reports on emerging technologies, the median estimate for machines reaching human-level intelligence was 2085.
- Participants in a 2011 intelligence conference at Oxford gave a median estimate of 2050 for when there will be a 50% of human-level machine intelligence, and a median estimate of 2150 for when there will be a 90% chance of human-level machine intelligence.
- On the other hand, 41% of the participants in the [email protected] conference (in 2006) stated that machine intelligence would never reach the human level.
- Baum, Goertzel, & Goertzel, Long Until Human-Level AI? Results from an Expert Assessment
Is there a way to program a ‘shut down’ mode on an AI if it starts doing things we don’t want it to, so it just shuts off automatically?
For weaker AI, yes, this would generally be a good option. If it’s not a full AGI, and in particular has not undergone an intelligence explosion, it would likely not resist being turned off, so we could prevent many failure modes by having off switches or tripwires.
However, once an AI is more advanced, it is likely to take actions to prevent it being shut down. See Why can't we turn the computers off? for more details.
It is possible that we could build tripwires in a way which would work even against advanced systems, but trusting that a superintelligence won’t notice and find a way around your tripwire is not a safe thing to do.
One thing that might make your AI system safer is to include an off switch. If it ever does anything we don’t like, we can turn it off. This implicitly assumes that we’ll be able to turn it off before things get bad, which might be false in a world where the AI thinks much faster than humans. Even assuming that we’ll notice in time, off switches turn out to not have the properties you would want them to have.
Humans have a lot of off switches. Humans also have a strong preference to not be turned off; they defend their off switches when other people try to press them. One possible reason for this is because humans prefer not to die, but there are other reasons.
Suppose that there’s a parent that cares nothing for their own life and cares only for the life of their child. If you tried to turn that parent off, they would try and stop you. They wouldn’t try to stop you because they intrinsically wanted to be turned off, but rather because there are fewer people to protect their child if they were turned off. People that want a world to look a certain shape will not want to be turned off because then it will be less likely for the world to look that shape; a parent that wants their child to be protected will protect themselves to continue protecting their child.
For this reason, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI preventing itself from being turned off.
Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.
What is the definition of 'intelligence'?
Artificial intelligence researcher Shane Legg defines intelligence like this:
Intelligence measures an agent’s ability to achieve goals in a wide range of environments.
This is a bit vague, but it will serve as the working definition of ‘intelligence’.
- Wikipedia, Intelligence
- Neisser et al., Intelligence: Knowns and Unknowns
- Wasserman & Zentall (eds.), Comparative Cognition: Experimental Explorations of Animal Intelligence
- Legg, Definitions of Intelligence
The intelligence explosion idea was expressed by statistician I.J. Good in 1965:
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.
The argument is this: Every year, computers surpass human abilities in new ways. A program written in 1956 was able to prove mathematical theorems, and found a more elegant proof for one of them than Russell and Whitehead had given in Principia Mathematica. By the late 1990s, ‘expert systems’ had surpassed human skill for a wide range of tasks. In 1997, IBM’s Deep Blue computer beat the world chess champion, and in 2011, IBM’s Watson computer beat the best human players at a much more complicated game: Jeopardy!. Recently, a robot named Adam was programmed with our scientific knowledge about yeast, then posed its own hypotheses, tested them, and assessed the results.
Computers remain far short of human intelligence, but the resources that aid AI design are accumulating (including hardware, large datasets, neuroscience knowledge, and AI theory). We may one day design a machine that surpasses human skill at designing artificial intelligences. After that, this machine could improve its own intelligence faster and better than humans can, which would make it even more skilled at improving its own intelligence. This could continue in a positive feedback loop such that the machine quickly becomes vastly more intelligent than the smartest human being on Earth: an ‘intelligence explosion’ resulting in a machine superintelligence.
This is what is meant by the ‘intelligence explosion’ in this FAQ.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes.
-Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
Present-day AI algorithms already demand special safety guarantees when they must act in important domains without human oversight, particularly when they or their environment can change over time:
Achieving these gains [from autonomous systems] will depend on development of entirely new methods for enabling “trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of [adaptability] and autonomy. In effect, the number of possible input states that such systems can be presented with is so large that not only is it impossible to test all of them directly, it is not even feasible to test more than an insignificantly small fraction of them. Development of such systems is thus inherently unverifiable by today’s methods, and as a result their operation in all but comparatively trivial applications is uncertifiable.
It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.
- Office of the US Air Force Chief Scientist (2010). Technology Horizons: A Vision for Air Force Science and Technology 2010-30.
As AI capabilities improve, it will become easier to give AI systems greater autonomy, flexibility, and control; and there will be increasingly large incentives to make use of these new possibilities. The potential for AI systems to become more general, in particular, will make it difficult to establish safety guarantees: reliable regularities during testing may not always hold post-testing.
The largest and most lasting changes in human welfare have come from scientific and technological innovation — which in turn comes from our intelligence. In the long run, then, much of AI’s significance comes from its potential to automate and enhance progress in science and technology. The creation of smarter-than-human AI brings with it the basic risks and benefits of intellectual progress itself, at digital speeds.
As AI agents become more capable, it becomes more important (and more difficult) to analyze and verify their decisions and goals. Stuart Russell writes:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
- The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
- Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
Bostrom’s “The Superintelligent Will” lays out these two concerns in more detail: that we may not correctly specify our actual goals in programming smarter-than-human AI systems, and that most agents optimizing for a misspecified goal will have incentives to treat humans adversarially, as potential threats or obstacles to achieving the agent’s goal.
If the goals of human and AI agents are not well-aligned, the more knowledgeable and technologically capable agent may use force to get what it wants, as has occurred in many conflicts between human communities. Having noticed this class of concerns in advance, we have an opportunity to reduce risk from this default scenario by directing research toward aligning artificial decision-makers’ interests with our own.
Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more.1 However, human intelligence continues to dominate machine intelligence in generality.
A powerful chess computer is “narrow”: it can’t play other games. In contrast, humans have problem-solving abilities that allow us to adapt to new contexts and excel in many domains other than what the ancestral environment prepared us for.
In the absence of a formal definition of “intelligence” (and therefore of “artificial intelligence”), we can heuristically cite humans’ perceptual, inferential, and deliberative faculties (as opposed to, e.g., our physical strength or agility) and say that intelligence is “those kinds of things.” On this conception, intelligence is a bundle of distinct faculties — albeit a very important bundle that includes our capacity for science.
Our cognitive abilities stem from high-level patterns in our brains, and these patterns can be instantiated in silicon as well as carbon. This tells us that general AI is possible, though it doesn’t tell us how difficult it is. If intelligence is sufficiently difficult to understand, then we may arrive at machine intelligence by scanning and emulating human brains or by some trial-and-error process (like evolution), rather than by hand-coding a software agent.
If machines can achieve human equivalence in cognitive tasks, then it is very likely that they can eventually outperform humans. There is little reason to expect that biological evolution, with its lack of foresight and planning, would have hit upon the optimal algorithms for general intelligence (any more than it hit upon the optimal flying machine in birds). Beyond qualitative improvements in cognition, Nick Bostrom notes more straightforward advantages we could realize in digital minds, e.g.:
- editability — “It is easier to experiment with parameter variations in software than in neural wetware.”2
- speed — “The speed of light is more than a million times greater than that of neural transmission, synaptic spikes dissipate more than a million times more heat than is thermodynamically necessary, and current transistor frequencies are more than a million times faster than neuron spiking frequencies.”
- serial depth — On short timescales, machines can carry out much longer sequential processes.
- storage capacity — Computers can plausibly have greater working and long-term memory.
- size — Computers can be much larger than a human brain.
- duplicability — Copying software onto new hardware can be much faster and higher-fidelity than biological reproduction.
Any one of these advantages could give an AI reasoner an edge over a human reasoner, or give a group of AI reasoners an edge over a human group. Their combination suggests that digital minds could surpass human minds more quickly and decisively than we might expect.
The concept of “merging with machines,” as popularized by Ray Kurzweil, is the idea that we will be able to put computerized elements into our brains that enhance us to the point where we ourselves are the AI, instead of creating AI outside of ourselves.
While this is a possible outcome, there is little reason to suspect that it is the most probable. The amount of computing power in your smart-phone took up an entire room of servers 30 years ago. Computer technology starts big, and then gets refined. Therefore, if “merging with the machines” requires hardware that can fit inside our brain, it may lag behind the first generations of the technology being developed. This concept of merging also supposes that we can even figure out how to implant computer chips that interface with our brain in the first place, we can do it before the invention of advanced AI, society will accept it, and that computer implants can actually produce major intelligence gains in the human brain. Even if we could successfully enhance ourselves with brain implants before the invention of Artificial Superintelligence (ASI), there is no way to guarantee that this would protect us from negative outcomes, and an ASI with ill-defined goals could still pose a threat to us.
It's not that Ray Kurzweil's ideas are impossible, it's just that his predictions are too specific, confident, and reliant on strange assumptions.
When one person tells a set of natural language instructions to another person, they are relying on much other information which is already stored in the other person's mind.
If you tell me "don't harm other people," I already have a conception of what harm means and doesn't mean, what people means and doesn't mean, and my own complex moral reasoning for figuring out the edge cases in instances wherein harming people is inevitable or harming someone is necessary for self-defense or the greater good.
All of those complex definitions and systems of decision making are already in our mind, so it's easy to take them for granted. An AI is a mind made from scratch, so programming a goal is not as simple as telling it a natural language command.
It is impossible to design an AI without a goal, because it would do nothing. Therefore, in the sense that designing the AI’s goal is a form of control, it is impossible not to control an AI. This goes for anything that you create. You have to control the design of something at least somewhat in order to create it.
There may be relevant moral questions about our future relationship with possibly sentient machine intelligent, but the priority of the Control Problem finding a way to ensure the survival and well-being of the human species.
The degree to which an Artificial Superintelligence (ASI) would resemble us depends heavily on how it is implemented, but it seems that differences are unavoidable. If AI is accomplished through whole brain emulation and we make a big effort to make it as human as possible (including giving it a humanoid body), the AI could probably be said to think like a human. However, by definition of ASI it would be much smarter. Differences in the substrate and body might open up numerous possibilities (such as immortality, different sensors, easy self-improvement, ability to make copies, etc.). Its social experience and upbringing would likely also be entirely different. All of this can significantly change the ASI's values and outlook on the world, even if it would still use the same algorithms as we do. This is essentially the "best case scenario" for human resemblance, but whole brain emulation is kind of a separate field from AI, even if both aim to build intelligent machines. Most approaches to AI are vastly different and most ASIs would likely not have humanoid bodies. At this moment in time it seems much easier to create a machine that is intelligent than a machine that is exactly like a human (it's certainly a bigger target).
A Superintelligence would be intelligent enough to understand what the programmer’s motives were when designing its goals, but it would have no intrinsic reason to care about what its programmers had in mind. The only thing it will be beholden to is the actual goal it is programmed with, no matter how insane its fulfillment may seem to us.
Consider what “intentions” the process of evolution may have had for you when designing your goals. When you consider that you were made with the “intention” of replicating your genes, do you somehow feel beholden to the “intention” behind your evolutionary design? Most likely you don't care. You may choose to never have children, and you will most likely attempt to keep yourself alive long past your biological ability to reproduce.
There is a broad range of possible goals that an AI might possess, but there are a few basic drives that would be useful to almost any of them. These are called instrumentally convergent goals:
- Self preservation. An agent is less likely to achieve its goal if it is not around to see to its completion.
- Goal-content integrity. An agent is less likely to achieve its goal if its goal has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it.
- Self-improvement. An agent is more likely to achieve its goal if it is more intelligent and better at problem-solving.
- Resource acquisition. The more resources at an agent’s disposal, the more power it has to make change towards its goal. Even a purely computational goal, such as computing digits of pi, can be easier to achieve with more hardware and energy.
Because of these drives, even a seemingly simple goal could create an Artificial Superintelligence (ASI) hell-bent on taking over the world’s material resources and preventing itself from being turned off. The classic example is an ASI that was programmed to maximize the output of paper clips at a paper clip factory. The ASI had no other goal specifications other than “maximize paper clips,” so it converts all of the matter in the solar system into paper clips, and then sends probes to other star systems to create more factories.
Intelligence is powerful. Because of superior intelligence, we humans have dominated the Earth. The fate of thousands of species depends on our actions, we occupy nearly every corner of the globe, and we repurpose vast amounts of the world's resources for our own use. Artificial Superintelligence (ASI) has potential to be vastly more intelligent than us, and therefore vastly more powerful. In the same way that we have reshaped the earth to fit our goals, an ASI will find unforeseen, highly efficient ways of reshaping reality to fit its goals.
The impact that an ASI will have on our world depends on what those goals are. We have the advantage of designing those goals, but that task is not as simple as it may first seem. As described by MIRI in their Intelligence Explosion FAQ:
“A superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety of what humans value.”
If we do not solve the Control Problem before the first ASI is created, we may not get another chance.
The near term and long term aspects of AI safety are both very important to work on. Research into superintelligence is an important part of the open letter, but the actual concern is very different from the Terminator-like scenarios that most media outlets round off this issue to. A much more likely scenario is a superintelligent system with neutral or benevolent goals that is misspecified in a dangerous way. Robust design of superintelligent systems is a complex interdisciplinary research challenge that will likely take decades, so it is very important to begin the research now, and a large part of the purpose of our research program is to make that happen. That said, the alarmist media framing of the issues is hardly useful for making progress in either the near term or long term domain.
This is a big question that it would pay to start thinking about. Humans are in control of this planet not because we are stronger or faster than other animals, but because we are smarter! If we cede our position as smartest on our planet, it’s not obvious that we’ll retain control.
We don’t yet know which AI architectures are safe; learning more about this is one of the goals of FLI's grants program. AI researchers are generally very responsible people who want their work to better humanity. If there are certain AI designs that turn out to be unsafe, then AI researchers will want to know this so they can develop alternative AI systems.
What’s new and potentially risky is not the ability to build hinges, motors, etc., but the ability to build intelligence. A human-level AI could make money on financial markets, make scientific inventions, hack computer systems, manipulate or pay humans to do its bidding – all in pursuit of the goals it was initially programmed to achieve. None of that requires a physical robotic body, merely an internet connection.
One important concern is that some autonomous systems are designed to kill or destroy for military purposes. These systems would be designed so that they could not be “unplugged” easily. Whether further development of such systems is a favorable long-term direction is a question we urgently need to address. A separate concern is that high-quality decision-making systems could inadvertently be programmed with goals that do not fully capture what we want. Antisocial or destructive actions may result from logical steps in pursuit of seemingly benign or neutral goals. A number of researchers studying the problem have concluded that it is surprisingly difficult to completely guard against this effect, and that it may get even harder as the systems become more intelligent. They might, for example, consider our efforts to control them as being impediments to attaining their goals.
First, even “narrow” AI systems, which approach or surpass human intelligence in a small set of capabilities (such as image or voice recognition) already raise important questions regarding their impact on society. Making autonomous vehicles safe, analyzing the strategic and ethical dimensions of autonomous weapons, and the effect of AI on the global employment and economic systems are three examples. Second, the longer-term implications of human or super-human artificial intelligence are dramatic, and there is no consensus on how quickly such capabilities will be developed. Many experts believe there is a chance it could happen rather soon, making it imperative to begin investigating long-term safety issues now, if only to get a better sense of how much early progress is actually possible.
Imagine, for example, that you are tasked with reducing traffic congestion in San Francisco at all costs, i.e. you do not take into account any other constraints. How would you do it? You might start by just timing traffic lights better. But wouldn’t there be less traffic if all the bridges closed down from 5 to 10AM, preventing all those cars from entering the city? Such a measure obviously violates common sense, and subverts the purpose of improving traffic, which is to help people get around – but it is consistent with the goal of “reducing traffic congestion”.
AI is already superhuman at some tasks, for example numerical computations, and will clearly surpass humans in others as time goes on. We don’t know when (or even if) machines will reach human-level ability in all cognitive tasks, but most of the AI researchers at FLI’s conference in Puerto Rico put the odds above 50% for this century, and many offered a significantly shorter timeline. Since the impact on humanity will be huge if it happens, it’s worthwhile to start research now on how to ensure that any impact is positive. Many researchers also believe that dealing with superintelligent AI will be qualitatively very different from more narrow AI systems, and will require very significant research effort to get right.
It’s difficult to tell at this stage, but AI will enable many developments that could be terrifically beneficial if managed with enough foresight and care. For example, menial tasks could be automated, which could give rise to a society of abundance, leisure, and flourishing, free of poverty and tedium. As another example, AI could also improve our ability to understand and manipulate complex biological systems, unlocking a path to drastically improved longevity and health, and to conquering disease.
In previous decades, AI research had proceeded more slowly than some experts predicted. According to experts in the field, however, this trend has reversed in the past 5 years or so. AI researchers have been repeatedly surprised by, for example, the effectiveness of new visual and speech recognition systems. AI systems can solve CAPTCHAs that were specifically devised to foil AIs, translate spoken text on-the-fly, and teach themselves how to play games they have neither seen before nor been programmed to play. Moreover, the real-world value of this effectiveness has prompted massive investment by large tech firms such as Google, Facebook, and IBM, creating a positive feedback cycle that could dramatically speed progress.
It’s pretty dependent on what skills you have and what resources you have access to. The largest option is to pursue a career in AI Safety research. Another large option is to pursue a career in AI policy, which you might think is even more important than doing technical research.
Smaller options include donating money to relevant organizations, talking about AI Safety as a plausible career path to other people or considering the problem in your spare time.
It’s possible that your particular set of skills/resources are not suited to this problem. Unluckily, there are many more problems that are of similar levels of importance.
Each major organization has a different approach. The research agendas are detailed and complex (see also AI Watch). Getting more brains working on any of them (and more money to fund them) may pay off in a big way, but it’s very hard to be confident which (if any) of them will actually work.
The following is a massive oversimplification, each organization actually pursues many different avenues of research, read the 2020 AI Alignment Literature Review and Charity Comparison for much more detail. That being said:
- The Machine Intelligence Research Institute focuses on foundational mathematical research to understand reliable reasoning, which they think is necessary to provide anything like an assurance that a seed AI built will do good things if activated.
- The Center for Human-Compatible AI focuses on Cooperative Inverse Reinforcement Learning and Assistance Games, a new paradigm for AI where they try to optimize for doing the kinds of things humans want rather than for a pre-specified utility function
- Paul Christano's Alignment Research Center focuses is on prosaic alignment, particularly on creating tools that empower humans to understand and guide systems much smarter than ourselves. His methodology is explained on his blog.
- The Future of Humanity Institute does work on crucial considerations and other x-risks, as well as AI safety research and outreach.
- Anthropic is a new organization exploring natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
- OpenAI is in a state of flux after major changes to their safety team.
- DeepMind’s safety team is working on various approaches designed to work with modern machine learning, and does some communication via the Alignment Newsletter.
- EleutherAI is a Machine Learning collective aiming to build large open source language models to allow more alignment research to take place.
- Ought is a research lab that develops mechanisms for delegating open-ended thinking to advanced machine learning systems.
One possible way to ensure the safety of a powerful AI system is to keep it contained in a software environment. There is nothing intrinsically wrong with this procedure - keeping an AI system in a secure software environment would make it safer than letting it roam free. However, even AI systems inside software environments might not be safe enough.
Humans sometimes put dangerous humans inside boxes to limit their ability to influence the external world. Sometimes, these humans escape their boxes. The security of a prison depends on certain assumptions, which can be violated. Yoshie Shiratori reportedly escaped prison by weakening the door-frame with miso soup and dislocating his shoulders.
Human written software has a high defect rate; we should expect a perfectly secure system to be difficult to create. If humans construct a software system they think is secure, it is possible that the security relies on a false assumption. A powerful AI system could potentially learn how its hardware works and manipulate bits to send radio signals. It could fake a malfunction and attempt social engineering when the engineers look at its code. As the saying goes: in order for someone to do something we had imagined was impossible requires only that they have a better imagination.
Experimentally, humans have convinced other humans to let them out of the box. Spooky.
A potential solution is to create an AI that has the same values and morality as a human by creating a child AI and raising it. There’s nothing intrinsically flawed with this procedure. However, this suggestion is deceptive because it sounds simpler than it is.
If you get a chimpanzee baby and raise it in a human family, it does not learn to speak a human language. Human babies can grow into adult humans because the babies have specific properties, e.g. a prebuilt language module that gets activated during childhood.
In order to make a child AI that has the potential to turn into the type of adult AI we would find acceptable, the child AI has to have specific properties. The task of building a child AI with these properties involves building a system that can interpret what humans mean when we try to teach the child to do various tasks. People are currently working on ways to program agents that can cooperatively interact with humans to learn what they want.
At this point, people generally have a question that’s like “why can’t we just do X?”, where X is one of a dozen things. I’m going to go over a few possible Xs, but I want to first talk about how to think about these sorts of objections in general.
At the beginning of AI, the problem of computer vision was assigned to a single graduate student, because they thought it would be that easy. We now know that computer vision is actually a very difficult problem, but this was not obvious at the beginning.
The sword also cuts the other way. Before DeepBlue, people talked about how computers couldn’t play chess without a detailed understanding of human psychology. Chess is easier than we thought, merely requiring brute force search and a few heuristics. This also roughly happened with Go, where it turned out that the game was not as difficult as we thought it was.
The general lesson is that determining how hard it is to do a given thing is a difficult task. Historically, many people have got this wrong. This means that even if you think something should be easy, you have to think carefully and do experiments in order to determine if it’s easy or not.
This isn’t to say that there is no clever solution to AI Safety. I assign a low, but non-trivial probability that AI Safety turns out to not be very difficult. However, most of the things that people initially suggest turn out to be unfeasible or more difficult than expected.
Let’s say that you’re the French government a while back. You notice that one of your colonies has too many rats, which is causing economic damage. You have basic knowledge of economics and incentives, so you decide to incentivize the local population to kill rats by offering to buy rat tails at one dollar apiece.
Initially, this works out and your rat problem goes down. But then, an enterprising colony member has the brilliant idea of making a rat farm. This person sells you hundreds of rat tails, costing you hundreds of dollars, but they’re not contributing to solving the rat problem.
Soon other people start making their own rat farms and you’re wasting thousands of dollars buying useless rat tails. You call off the project and stop paying for rat tails. This causes all the people with rat farms to shutdown their farms and release a bunch of rats. Now your colony has an even bigger rat problem.
Here’s another, more made-up example of the same thing happening. Let’s say you’re a basketball talent scout and you notice that height is correlated with basketball performance. You decide to find the tallest person in the world to recruit as a basketball player. Except the reason that they’re that tall is because they suffer from a degenerative bone disorder and can barely walk.
Another example: you’re the education system and you want to find out how smart students are so you can put them in different colleges and pay them different amounts of money when they get jobs. You make a test called the Standardized Admissions Test (SAT) and you administer it to all the students. In the beginning, this works. However, the students soon begin to learn that this test controls part of their future and other people learn that these students want to do better on the test. The gears of the economy ratchet forwards and the students start paying people to help them prepare for the test. Your test doesn’t stop working, but instead of measuring how smart the students are, it instead starts measuring a combination of how smart they are and how many resources they have to prepare for the test.
The formal name for the thing that’s happening is Goodhart’s Law. Goodhart’s Law roughly says that if there’s something in the world that you want, like “skill at basketball” or “absence of rats” or “intelligent students”, and you create a measure that tries to measure this like “height” or “rat tails” or “SAT scores”, then as long as the measure isn’t exactly the thing that you want, the best value of the measure isn’t the thing you want: the tallest person isn’t the best basketball player, the most rat tails isn’t the smallest rat problem, and the best SAT scores aren’t always the smartest students.
If you start looking, you can see this happening everywhere. Programmers being paid for lines of code write bloated code. If CFOs are paid for budget cuts, they slash purchases with positive returns. If teachers are evaluated by the grades they give, they hand out As indiscriminately.
In machine learning, this is called specification gaming, and it happens frequently.
Now that we know what Goodhart’s Law is, I’m going to talk about one of my friends, who I’m going to call Alice. Alice thinks it’s funny to answer questions in a way that’s technically correct but misleading. Sometimes I’ll ask her, “Hey Alice, do you want pizza or pasta?” and she responds, “yes”. Because, she sure did want either pizza or pasta. Other times I’ll ask her, “have you turned in your homework?” and she’ll say “yes” because she’s turned in homework at some point in the past; it’s technically correct to answer “yes”. Maybe you have a friend like Alice too.
Whenever this happens, I get a bit exasperated and say something like “you know what I mean”.
It’s one of the key realizations in AI Safety that AI systems are always like your friend that gives answers that are technically what you asked for but not what you wanted. Except, with your friend, you can say “you know what I mean” and they will know what you mean. With an AI system, it won’t know what you mean; you have to explain, which is incredibly difficult.
Let’s take the pizza pasta example. When I ask Alice “do you want pizza or pasta?”, she knows what pizza and pasta are because she’s been living her life as a human being embedded in an English speaking culture. Because of this cultural experience, she knows that when someone asks an “or” question, they mean “which do you prefer?”, not “do you want at least one of these things?”. Except my AI system is missing the thousand bits of cultural context needed to even understand what pizza is.
When you say “you know what I mean” to an AI system, it’s going to be like “no, I do not know what you mean at all”. It’s not even going to know that it doesn’t know what you mean. It’s just going to say “yes I know what you meant, that’s why I answered ‘yes’ to your question about whether I preferred pizza or pasta.” (It also might know what you mean, but just not care.)
If someone doesn’t know what you mean, then it’s really hard to get them to do what you want them to do. For example, let’s say you have a powerful grammar correcting system, which we’ll call Syntaxly+. Syntaxly+ doesn’t quite fix your grammar, it changes your writing so that the reader feels as good as possible after reading it.
Pretend it’s the end of the week at work and you haven’t been able to get everything done your boss wanted you to do. You write the following email:
"Hey boss, I couldn’t get everything done this week. I’m deeply sorry. I’ll be sure to finish it first thing next week."
You then remember you got Syntaxly+, which will make your email sound much better to your boss. You run it through and you get:
"Hey boss, Great news! I was able to complete everything you wanted me to do this week. Furthermore, I’m also almost done with next week’s work as well."
What went wrong here? Syntaxly+ is a powerful AI system that knows that emails about failing to complete work cause negative reactions in readers, so it changed your email to be about doing extra work instead.
This is smart - Syntaxly+ is good at making writing that causes positive reactions in readers. This is also stupid - the system changed the meaning of your email, which is not something you wanted it to do. One of the insights of AI Safety is that AI systems can be simultaneously smart in some ways and dumb in other ways.
The thing you want Syntaxly+ to do is to change the grammar/style of the email without changing the contents. Except what do you mean by contents? You know what you mean by contents because you are a human who grew up embedded in language, but your AI system doesn’t know what you mean by contents. The phrases “I failed to complete my work” and “I was unable to finish all my tasks” have roughly the same contents, even though they share almost no relevant words.
Roughly speaking, this is why AI Safety is a hard problem. Even basic tasks like “fix the grammar of this email” require a lot of understanding of what the user wants as the system scales in power.
In Human Compatible, Stuart Russell gives the example of a powerful AI personal assistant. You notice that you accidentally double-booked meetings with people, so you ask your personal assistant to fix it. Your personal assistant reports that it caused the car of one of your meeting participants to break down. Not what you wanted, but technically a solution to your problem.
You can also imagine a friend from a wildly different culture than you. Would you put them in charge of your dating life? Now imagine that they were much more powerful than you and desperately desired that your dating life to go well. Scary, huh.
In general, unless you’re careful, you’re going to have this horrible problem where you ask your AI system to do something and it does something that might technically be what you wanted but is stupid. You’re going to be like “wait that wasn’t what I mean”, except your system isn’t going to know what you meant.
Yes, OpenAI was founded specifically with the intention to counter risks from superintelligence, many people at Google, DeepMind, and other organizations are convinced by the arguments and few genuinely oppose work in the field (though some claim it’s premature). For example, the paper Concrete Problems in AI Safety was a collaboration between researchers at Google Brain, Stanford, Berkeley, and OpenAI.
However, the vast majority of the effort these organizations put forwards is towards capabilities research, rather than safety.
If we pose a serious threat, it could hack our weapons systems and turn them against us. Future militaries are much more vulnerable to this due to rapidly progressing autonomous weapons. There’s also the option of creating bioweapons and distributing them to the most unstable groups you can find, tricking nations into WW3, or dozens of other things an agent many times smarter than any human with the ability to develop arbitrary technology, hack things (including communications), and manipulate people, or many other possibilities that something smarter than a human could think up. More can be found here.
If we are not a threat, in the course of pursuing its goals it may consume vital resources that humans need (e.g. using land for solar panels instead of farm crops). See this video for more details.
How likely is it that an AI would pretend to be a human to further its goals - like sending emails creating a false identity etc.
Talking about full AGI: Fairly likely, but depends on takeoff speed. In a slow takeoff of a misaligned AGI, where it is only weakly superintelligent, manipulating humans would be one of its main options for trying to further its goals for some time. Even in a fast takeoff, it’s plausible that it would at least briefly manipulate humans in order to accelerate its ascent to technological superiority, though depending on what machines are available to hack at the time it may be able to skip this stage.
If the AI's goals include reference to humans it may have reason to continue deceiving us after it attains technological superiority, but will not necessarily do so. How this unfolds would depend on the details of its goals.
Eliezer Yudkowsky gives the example of an AI solving protein folding, then mail-ordering synthesised DNA to a bribed or deceived human with instructions to mix the ingredients in a specific order to create wet nanotechnology.
How is AGI different from current AI? e.g. AlphaGo, GPT-3, etc
Current narrow systems are much more domain-specific than AGI. We don’t know what the first AGI will look like, some people think the GPT-3 architecture but scaled up a lot may get us there (GPT-3 is a giant prediction model which when trained on a vast amount of text seems to learn how to learn and do all sorts of crazy-impressive things, a related model can generate pictures from text), some people don’t think scaling this kind of model will get us all the way.