These are all imported from external FAQs with permission.
- MIRI's FAQ
- MIRI's Intelligence Explosion FAQ
- FLI's FAQ
- r/controlproblem's FAQ
- Mark Xu's FAQ
- Superintelligence FAQ
- Rohin Shah's FAQ
What is MIRI’s mission? What is MIRI trying to do? What is MIRI working on?
MIRI's mission statement is to “ensure that the creation of smarter-than-human artificial intelligence has a positive impact.” This is an ambitious goal, but they believe that some early progress is possible, and they believe that the goal’s importance and difficulty makes it prudent to begin work at an early date.
Their two main research agendas, “Agent Foundations for Aligning Machine Intelligence with Human Interests” and “Value Alignment for Advanced Machine Learning Systems,” focus on three groups of technical problems:
- highly reliable agent design — learning how to specify highly autonomous systems that reliably pursue some fixed goal;
- value specification — supplying autonomous systems with the intended goals; and
- error tolerance — making such systems robust to programmer error.
They publish new mathematical results (although their work is non-disclosed by default), host workshops, attend conferences, and fund outside researchers who are interested in investigating these problems. They also host a blog and an online research forum.
Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more. However, human intelligence continues to dominate machine intelligence in generality.
A powerful chess computer is “narrow”: it can’t play other games. In contrast, humans have problem-solving abilities that allow us to adapt to new contexts and excel in many domains other than what the ancestral environment prepared us for.
In the absence of a formal definition of “intelligence” (and therefore of “artificial intelligence”), we can heuristically cite humans’ perceptual, inferential, and deliberative faculties (as opposed to, e.g., our physical strength or agility) and say that intelligence is “those kinds of things.” On this conception, intelligence is a bundle of distinct faculties — albeit a very important bundle that includes our capacity for science.
Our cognitive abilities stem from high-level patterns in our brains, and these patterns can be instantiated in silicon as well as carbon. This tells us that general AI is possible, though it doesn’t tell us how difficult it is. If intelligence is sufficiently difficult to understand, then we may arrive at machine intelligence by scanning and emulating human brains or by some trial-and-error process (like evolution), rather than by hand-coding a software agent.
If machines can achieve human equivalence in cognitive tasks, then it is very likely that they can eventually outperform humans. There is little reason to expect that biological evolution, with its lack of foresight and planning, would have hit upon the optimal algorithms for general intelligence (any more than it hit upon the optimal flying machine in birds). Beyond qualitative improvements in cognition, Nick Bostrom notes more straightforward advantages we could realize in digital minds, e.g.:
- editability — “It is easier to experiment with parameter variations in software than in neural wetware.”
- speed — “The speed of light is more than a million times greater than that of neural transmission, synaptic spikes dissipate more than a million times more heat than is thermodynamically necessary, and current transistor frequencies are more than a million times faster than neuron spiking frequencies.”
- serial depth — On short timescales, machines can carry out much longer sequential processes.
- storage capacity — Computers can plausibly have greater working and long-term memory.
- size — Computers can be much larger than a human brain.
- duplicability — Copying software onto new hardware can be much faster and higher-fidelity than biological reproduction.
Any one of these advantages could give an AI reasoner an edge over a human reasoner, or give a group of AI reasoners an edge over a human group. Their combination suggests that digital minds could surpass human minds more quickly and decisively than we might expect.
Present-day AI algorithms already demand special safety guarantees when they must act in important domains without human oversight, particularly when they or their environment can change over time:
Achieving these gains [from autonomous systems] will depend on development of entirely new methods for enabling “trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of [adaptability] and autonomy. In effect, the number of possible input states that such systems can be presented with is so large that not only is it impossible to test all of them directly, it is not even feasible to test more than an insignificantly small fraction of them. Development of such systems is thus inherently unverifiable by today’s methods, and as a result their operation in all but comparatively trivial applications is uncertifiable.
It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.
- Office of the US Air Force Chief Scientist (2010). Technology Horizons: A Vision for Air Force Science and Technology 2010-30.
As AI capabilities improve, it will become easier to give AI systems greater autonomy, flexibility, and control; and there will be increasingly large incentives to make use of these new possibilities. The potential for AI systems to become more general, in particular, will make it difficult to establish safety guarantees: reliable regularities during testing may not always hold post-testing.
The largest and most lasting changes in human welfare have come from scientific and technological innovation — which in turn comes from our intelligence. In the long run, then, much of AI’s significance comes from its potential to automate and enhance progress in science and technology. The creation of smarter-than-human AI brings with it the basic risks and benefits of intellectual progress itself, at digital speeds.
As AI agents become more capable, it becomes more important (and more difficult) to analyze and verify their decisions and goals. Stuart Russell writes:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
- The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
- Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
Bostrom’s “The Superintelligent Will” lays out these two concerns in more detail: that we may not correctly specify our actual goals in programming smarter-than-human AI systems, and that most agents optimizing for a misspecified goal will have incentives to treat humans adversarially, as potential threats or obstacles to achieving the agent’s goal.
If the goals of human and AI agents are not well-aligned, the more knowledgeable and technologically capable agent may use force to get what it wants, as has occurred in many conflicts between human communities. Having noticed this class of concerns in advance, we have an opportunity to reduce risk from this default scenario by directing research toward aligning artificial decision-makers’ interests with our own.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. -Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
MIRI prioritizes early safety work because we believe such work is important, time-sensitive, tractable, and informative.
The importance of AI safety work is outlined in Why is safety important for smarter-than-human AI?. We see the problem as time-sensitive as a result of:
- neglectedness — Only a handful of people are currently working on the open problems outlined in the MIRI technical agenda.
- apparent difficulty — Solving the alignment problem may demand a large number of researcher hours, and may also be harder to parallelize than capabilities research.
- risk asymmetry — Working on safety too late has larger risks than working on it too early.
- AI timeline uncertainty — AI could progress faster than we expect, making it prudent to err on the side of caution.
- discontinuous progress in AI — Progress in AI is likely to speed up as we approach general AI. This means that even if AI is many decades away, it would be hazardous to wait for clear signs that general AI is near: clear signs may only arise when it’s too late to begin safety work.
We also think it is possible to do useful work in AI safety today, even if smarter-than-human AI is 50 or 100 years away. We think this for a few reasons:
- lack of basic theory — If we had simple idealized models of what we mean by correct behavior in autonomous agents, but didn’t know how to design practical implementations, this might suggest a need for more hands-on work with developed systems. Instead, however, simple models are what we’re missing. Basic theory doesn’t necessarily require that we have experience with a software system’s implementation details, and the same theory can apply to many different implementations.
- precedents — Theoretical computer scientists have had repeated success in developing basic theory in the relative absence of practical implementations. (Well-known examples include Claude Shannon, Alan Turing, Andrey Kolmogorov, and Judea Pearl.)
- early results — We’ve made significant advances since prioritizing some of the theoretical questions we’re looking at, especially in decision theory and logical uncertainty. This suggests that there’s low-hanging theoretical fruit to be picked.
Finally, we expect progress in AI safety theory to be useful for improving our understanding of robust AI systems, of the available technical options, and of the broader strategic landscape. In particular, we expect transparency to be necessary for reliable behavior, and we think there are basic theoretical prerequisites to making autonomous AI systems transparent to human designers and users.
Having the relevant theory in hand may not be strictly necessary for designing smarter-than-human AI systems — highly reliable agents may need to employ very different architectures or cognitive algorithms than the most easily constructed smarter-than-human systems that exhibit unreliable behavior. For that reason, some fairly general theoretical questions may be more relevant to AI safety work than to mainline AI capabilities work. Key advantages to AI safety work’s informativeness, then, include:
- general value of information — Making AI safety questions clearer and more precise is likely to give insights into what kinds of formal tools will be useful in answering them. Thus we’re less likely to spend our time on entirely the wrong lines of research. Investigating technical problems in this area may also help us develop a better sense for how difficult the AI problem is, and how difficult the AI alignment problem is.
- requirements for informative testing — If the system is opaque, then online testing may not give us most of the information that we need to design safer systems. Humans are opaque general reasoners, and studying the brain has been quite useful for designing more effective AI algorithms, but it has been less useful for building systems for verification and validation.
- requirements for safe testing — Extracting information from an opaque system may not be safe, since any sandbox we build may have flaws that are obvious to a superintelligence but not to a human.
What is the definition of 'intelligence'?
Intelligence measures an agent’s ability to achieve goals in a wide range of environments.
This is a bit vague, but serves as the working definition of ‘intelligence’. For a more in-depth exploration, see Efficient Cross-Domain Optimization.
- Wikipedia, Intelligence
- Neisser et al., Intelligence: Knowns and Unknowns
- Wasserman & Zentall (eds.), Comparative Cognition: Experimental Explorations of Animal Intelligence
- Legg, Definitions of Intelligence
After reviewing extensive literature on the subject, Legg and Hutter summarizes the many possible valuable definitions in the informal statement “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.” They then show this definition can be mathematically formalized given reasonable mathematical definitions of its terms. They use Solomonoff induction - a formalization of Occam's razor - to construct an universal artificial intelligence with a embedded utility function which assigns less utility to those actions based on theories with higher complexity. They argue this final formalization is a valid, meaningful, informative, general, unbiased, fundamental, objective, universal and practical definition of intelligence.
We can relate Legg and Hutter's definition with the concept of optimization. According to Eliezer Yudkowsky intelligence is efficient cross-domain optimization. It measures an agent's capacity for efficient cross-domain optimization of the world according to the agent’s preferences. Optimization measures not only the capacity to achieve the desired goal but also is inversely proportional to the amount of resources used. It’s the ability to steer the future so it hits that small target of desired outcomes in the large space of all possible outcomes, using fewer resources as possible. For example, when Deep Blue defeated Kasparov, it was able to hit that small possible outcome where it made the right order of moves given Kasparov’s moves from the very large set of all possible moves. In that domain, it was more optimal than Kasparov. However, Kasparov would have defeated Deep Blue in almost any other relevant domain, and hence, he is considered more intelligent.
One could cast this definition in a possible world vocabulary, intelligence is:
- the ability to precisely realize one of the members of a small set of possible future worlds that have a higher preference over the vast set of all other possible worlds with lower preference; while
- using fewer resources than the other alternatives paths for getting there; and in the
- most diverse domains as possible.
How many more worlds have a higher preference then the one realized by the agent, less intelligent he is. How many more worlds have a lower preference than the one realized by the agent, more intelligent he is. (Or: How much smaller is the set of worlds at least as preferable as the one realized, more intelligent the agent is). How much less paths for realizing the desired world using fewer resources than those spent by the agent, more intelligent he is. And finally, in how many more domains the agent can be more efficiently optimal, more intelligent he is. Restating it, the intelligence of an agent is directly proportional to:
- (a) the numbers of worlds with lower preference than the one realized,
- (b) how much smaller is the set of paths more efficient than the one taken by the agent and
- (c) how more wider are the domains where the agent can effectively realize his preferences;
and it is, accordingly, inversely proportional to:
- (d) the numbers of world with higher preference than the one realized,
- (e) how much bigger is the set of paths more efficient than the one taken by the agent and
- (f) how much more narrow are the domains where the agent can efficiently realize his preferences.
This definition avoids several problems common in many others definitions, especially it avoids anthropomorphizing intelligence.
The intelligence explosion idea was expressed by statistician I.J. Good in 1965:
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.
The argument is this: Every year, computers surpass human abilities in new ways. A program written in 1956 was able to prove mathematical theorems, and found a more elegant proof for one of them than Russell and Whitehead had given in Principia Mathematica. By the late 1990s, ‘expert systems’ had surpassed human skill for a wide range of tasks. In 1997, IBM’s Deep Blue computer beat the world chess champion, and in 2011, IBM’s Watson computer beat the best human players at a much more complicated game: Jeopardy!. Recently, a robot named Adam was programmed with our scientific knowledge about yeast, then posed its own hypotheses, tested them, and assessed the results.
Computers remain far short of human intelligence, but the resources that aid AI design are accumulating (including hardware, large datasets, neuroscience knowledge, and AI theory). We may one day design a machine that surpasses human skill at designing artificial intelligences. After that, this machine could improve its own intelligence faster and better than humans can, which would make it even more skilled at improving its own intelligence. This could continue in a positive feedback loop such that the machine quickly becomes vastly more intelligent than the smartest human being on Earth: an ‘intelligence explosion’ resulting in a machine superintelligence.
This is what is meant by the ‘intelligence explosion’ in this FAQ.
‘AI-boxing’ is a common suggestion: why not use a superintelligent machine as a kind of question-answering oracle, and never give it access to the internet or any motors with which to move itself and acquire resources beyond what we give it? There are several reasons to suspect that AI-boxing will not work in the long run:
- Whatever goals the creators designed the superintelligence to achieve, it will be more able to achieve those goals if given access to the internet and other means of acquiring additional resources. So, there will be tremendous temptation to “let the AI out of its box.”
- Preliminary experiments in AI-boxing do not inspire confidence. And, a superintelligence will generate far more persuasive techniques for getting humans to “let it out of the box” than we can imagine.
- If one superintelligence has been created, then other labs or even independent programmers will be only weeks or decades away from creating a second superintelligence, and then a third, and then a fourth. You cannot hope to successfully contain all superintelligences created around the world by hundreds of people for hundreds of different purposes.
What is Whole Brain Emulation / WBE / Uploads?
Whole Brain Emulation (WBE) or ‘mind uploading’ is a computer emulation of all the cells and connections in a human brain. So even if the underlying principles of general intelligence prove difficult to discover, we might still emulate an entire human brain and make it run at a million times its normal speed (computer circuits communicate much faster than neurons do). Such a WBE could do more thinking in one second than a normal human can in 31 years. So this would not lead immediately to smarter-than-human intelligence, but it would lead to faster-than-human intelligence. A WBE could be backed up (leading to a kind of immortality), and it could be copied so that hundreds or millions of WBEs could work on separate problems in parallel. If WBEs are created, they may therefore be able to solve scientific problems far more rapidly than ordinary humans, accelerating further technological progress.
Nick Bostrom defined ‘superintelligence’ as:
"an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills."
This definition includes vague terms like ‘much’ and ‘practically’, but it will serve as a working definition for superintelligence in this FAQ An intelligence explosion would lead to machine superintelligence, and some believe that an intelligence explosion is the most likely path to superintelligence.
Intelligence is powerful. Because of superior intelligence, we humans have dominated the Earth. The fate of thousands of species depends on our actions, we occupy nearly every corner of the globe, and we repurpose vast amounts of the world's resources for our own use. Artificial Superintelligence (ASI) has potential to be vastly more intelligent than us, and therefore vastly more powerful. In the same way that we have reshaped the earth to fit our goals, an ASI will find unforeseen, highly efficient ways of reshaping reality to fit its goals.
The impact that an ASI will have on our world depends on what those goals are. We have the advantage of designing those goals, but that task is not as simple as it may first seem. As described by MIRI in their Intelligence Explosion FAQ:
“A superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety of what humans value.”
If we do not solve the Control Problem before the first ASI is created, we may not get another chance.
There is a broad range of possible goals that an AI might possess, but there are a few basic drives that would be useful to almost any of them. These are called instrumentally convergent goals:
- Self preservation. An agent is less likely to achieve its goal if it is not around to see to its completion.
- Goal-content integrity. An agent is less likely to achieve its goal if its goal has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it.
- Self-improvement. An agent is more likely to achieve its goal if it is more intelligent and better at problem-solving.
- Resource acquisition. The more resources at an agent’s disposal, the more power it has to make change towards its goal. Even a purely computational goal, such as computing digits of pi, can be easier to achieve with more hardware and energy.
Because of these drives, even a seemingly simple goal could create an Artificial Superintelligence (ASI) hell-bent on taking over the world’s material resources and preventing itself from being turned off. The classic example is an ASI that was programmed to maximize the output of paper clips at a paper clip factory. The ASI had no other goal specifications other than “maximize paper clips,” so it converts all of the matter in the solar system into paper clips, and then sends probes to other star systems to create more factories.
Nobody knows for sure when we will have ASI or if it is even possible. Predictions on AI timelines are notoriously variable, but recent surveys about the arrival of human-level AGI have median dates between 2040 and 2050 although the median for (optimistic) AGI researchers and futurists is in the early 2030s (source). What will happen if/when we are able to build human-level AGI is a point of major contention among experts. One survey asked (mostly) experts to estimate the likelihood that it would take less than 2 or 30 years for a human-level AI to improve to greatly surpass all humans in most professions. Median answers were 10% for "within 2 years" and 75% for "within 30 years". We know little about the limits of intelligence and whether increasing it will follow the law of accelerating or diminishing returns. Of particular interest to the control problem is the fast or hard takeoff scenario. It has been argued that the increase from a relatively harmless level of intelligence to a dangerous vastly superhuman level might be possible in a matter of seconds, minutes or hours: too fast for human controllers to stop it before they know what's happening. Moving from human to superhuman level might be as simple as adding computational resources, and depending on the implementation the AI might be able to quickly absorb large amounts of internet knowledge. Once we have an AI that is better at AGI design than the team that made it, the system could improve itself or create the next generation of even more intelligent AIs (which could then self-improve further or create an even more intelligent generation, and so on). If each generation can improve upon itself by a fixed or increasing percentage per time unit, we would see an exponential increase in intelligence: an intelligence explosion.
The Control Problem is the problem of preventing artificial superintelligence (ASI) from having a negative impact on humanity. How do we keep a more intelligent being under control, or how do we align it with our values? If we succeed in solving this problem, intelligence vastly superior to ours can take the baton of human progress and carry it to unfathomable heights. Solving our most complex problems could be simple to a sufficiently intelligent machine. If we fail in solving the Control Problem and create a powerful ASI not aligned with our values, it could spell the end of the human race. For these reasons, The Control Problem may be the most important challenge that humanity has ever faced, and may be our last.
It is impossible to design an AI without a goal, because it would do nothing. Therefore, in the sense that designing the AI’s goal is a form of control, it is impossible not to control an AI. This goes for anything that you create. You have to control the design of something at least somewhat in order to create it.
There may be relevant moral questions about our future relationship with possibly sentient machine intelligent, but the priority of the Control Problem finding a way to ensure the survival and well-being of the human species.
At this point, people generally have a question that’s like “why can’t we just do X?”, where X is one of a dozen things. I’m going to go over a few possible Xs, but I want to first talk about how to think about these sorts of objections in general.
At the beginning of AI, the problem of computer vision was assigned to a single graduate student, because they thought it would be that easy. We now know that computer vision is actually a very difficult problem, but this was not obvious at the beginning.
The sword also cuts the other way. Before DeepBlue, people talked about how computers couldn’t play chess without a detailed understanding of human psychology. Chess is easier than we thought, merely requiring brute force search and a few heuristics. This also roughly happened with Go, where it turned out that the game was not as difficult as we thought it was.
The general lesson is that determining how hard it is to do a given thing is a difficult task. Historically, many people have got this wrong. This means that even if you think something should be easy, you have to think carefully and do experiments in order to determine if it’s easy or not.
This isn’t to say that there is no clever solution to AI Safety. I assign a low, but non-trivial probability that AI Safety turns out to not be very difficult. However, most of the things that people initially suggest turn out to be unfeasible or more difficult than expected.
One possible way to ensure the safety of a powerful AI system is to keep it contained in a software environment. There is nothing intrinsically wrong with this procedure - keeping an AI system in a secure software environment would make it safer than letting it roam free. However, even AI systems inside software environments might not be safe enough.
Humans sometimes put dangerous humans inside boxes to limit their ability to influence the external world. Sometimes, these humans escape their boxes. The security of a prison depends on certain assumptions, which can be violated. Yoshie Shiratori reportedly escaped prison by weakening the door-frame with miso soup and dislocating his shoulders.
Human written software has a high defect rate; we should expect a perfectly secure system to be difficult to create. If humans construct a software system they think is secure, it is possible that the security relies on a false assumption. A powerful AI system could potentially learn how its hardware works and manipulate bits to send radio signals. It could fake a malfunction and attempt social engineering when the engineers look at its code. As the saying goes: in order for someone to do something we had imagined was impossible requires only that they have a better imagination.
Experimentally, humans have convinced other humans to let them out of the box. Spooky.
Is there a way to program a ‘shut down’ mode on an AI if it starts doing things we don’t want it to, so it just shuts off automatically?
However, once an AI is more advanced, it is likely to take actions to prevent it being shut down. See Why can't we just turn the AI off if it starts to misbehave? for more details.It is possible that we could build tripwires in a way which would work even against advanced systems, but trusting that a superintelligence won’t notice and find a way around your tripwire is not a safe thing to do.
Humans have a lot of off switches. Humans also have a strong preference to not be turned off; they defend their off switches when other people try to press them. One possible reason for this is because humans prefer not to die, but there are other reasons.
Suppose that there’s a parent that cares nothing for their own life and cares only for the life of their child. If you tried to turn that parent off, they would try and stop you. They wouldn’t try to stop you because they intrinsically wanted to be turned off, but rather because there are fewer people to protect their child if they were turned off. People that want a world to look a certain shape will not want to be turned off because then it will be less likely for the world to look that shape; a parent that wants their child to be protected will protect themselves to continue protecting their child.
For this reason, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI preventing itself from being turned off.Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.
Initially, this works out and your rat problem goes down. But then, an enterprising colony member has the brilliant idea of making a rat farm. This person sells you hundreds of rat tails, costing you hundreds of dollars, but they’re not contributing to solving the rat problem.
Soon other people start making their own rat farms and you’re wasting thousands of dollars buying useless rat tails. You call off the project and stop paying for rat tails. This causes all the people with rat farms to shutdown their farms and release a bunch of rats. Now your colony has an even bigger rat problem.
Here’s another, more made-up example of the same thing happening. Let’s say you’re a basketball talent scout and you notice that height is correlated with basketball performance. You decide to find the tallest person in the world to recruit as a basketball player. Except the reason that they’re that tall is because they suffer from a degenerative bone disorder and can barely walk.
Another example: you’re the education system and you want to find out how smart students are so you can put them in different colleges and pay them different amounts of money when they get jobs. You make a test called the Standardized Admissions Test (SAT) and you administer it to all the students. In the beginning, this works. However, the students soon begin to learn that this test controls part of their future and other people learn that these students want to do better on the test. The gears of the economy ratchet forwards and the students start paying people to help them prepare for the test. Your test doesn’t stop working, but instead of measuring how smart the students are, it instead starts measuring a combination of how smart they are and how many resources they have to prepare for the test.
The formal name for the thing that’s happening is Goodhart’s Law. Goodhart’s Law roughly says that if there’s something in the world that you want, like “skill at basketball” or “absence of rats” or “intelligent students”, and you create a measure that tries to measure this like “height” or “rat tails” or “SAT scores”, then as long as the measure isn’t exactly the thing that you want, the best value of the measure isn’t the thing you want: the tallest person isn’t the best basketball player, the most rat tails isn’t the smallest rat problem, and the best SAT scores aren’t always the smartest students.
If you start looking, you can see this happening everywhere. Programmers being paid for lines of code write bloated code. If CFOs are paid for budget cuts, they slash purchases with positive returns. If teachers are evaluated by the grades they give, they hand out As indiscriminately.
In machine learning, this is called specification gaming, and it happens frequently.
Now that we know what Goodhart’s Law is, I’m going to talk about one of my friends, who I’m going to call Alice. Alice thinks it’s funny to answer questions in a way that’s technically correct but misleading. Sometimes I’ll ask her, “Hey Alice, do you want pizza or pasta?” and she responds, “yes”. Because, she sure did want either pizza or pasta. Other times I’ll ask her, “have you turned in your homework?” and she’ll say “yes” because she’s turned in homework at some point in the past; it’s technically correct to answer “yes”. Maybe you have a friend like Alice too.
Whenever this happens, I get a bit exasperated and say something like “you know what I mean”.
It’s one of the key realizations in AI Safety that AI systems are always like your friend that gives answers that are technically what you asked for but not what you wanted. Except, with your friend, you can say “you know what I mean” and they will know what you mean. With an AI system, it won’t know what you mean; you have to explain, which is incredibly difficult.
Let’s take the pizza pasta example. When I ask Alice “do you want pizza or pasta?”, she knows what pizza and pasta are because she’s been living her life as a human being embedded in an English speaking culture. Because of this cultural experience, she knows that when someone asks an “or” question, they mean “which do you prefer?”, not “do you want at least one of these things?”. Except my AI system is missing the thousand bits of cultural context needed to even understand what pizza is.
When you say “you know what I mean” to an AI system, it’s going to be like “no, I do not know what you mean at all”. It’s not even going to know that it doesn’t know what you mean. It’s just going to say “yes I know what you meant, that’s why I answered ‘yes’ to your question about whether I preferred pizza or pasta.” (It also might know what you mean, but just not care.)
If someone doesn’t know what you mean, then it’s really hard to get them to do what you want them to do. For example, let’s say you have a powerful grammar correcting system, which we’ll call Syntaxly+. Syntaxly+ doesn’t quite fix your grammar, it changes your writing so that the reader feels as good as possible after reading it.
Pretend it’s the end of the week at work and you haven’t been able to get everything done your boss wanted you to do. You write the following email:
"Hey boss, I couldn’t get everything done this week. I’m deeply sorry. I’ll be sure to finish it first thing next week."
You then remember you got Syntaxly+, which will make your email sound much better to your boss. You run it through and you get:
"Hey boss, Great news! I was able to complete everything you wanted me to do this week. Furthermore, I’m also almost done with next week’s work as well."
What went wrong here? Syntaxly+ is a powerful AI system that knows that emails about failing to complete work cause negative reactions in readers, so it changed your email to be about doing extra work instead.
This is smart - Syntaxly+ is good at making writing that causes positive reactions in readers. This is also stupid - the system changed the meaning of your email, which is not something you wanted it to do. One of the insights of AI Safety is that AI systems can be simultaneously smart in some ways and dumb in other ways.
The thing you want Syntaxly+ to do is to change the grammar/style of the email without changing the contents. Except what do you mean by contents? You know what you mean by contents because you are a human who grew up embedded in language, but your AI system doesn’t know what you mean by contents. The phrases “I failed to complete my work” and “I was unable to finish all my tasks” have roughly the same contents, even though they share almost no relevant words.
Roughly speaking, this is why AI Safety is a hard problem. Even basic tasks like “fix the grammar of this email” require a lot of understanding of what the user wants as the system scales in power.
In Human Compatible, Stuart Russell gives the example of a powerful AI personal assistant. You notice that you accidentally double-booked meetings with people, so you ask your personal assistant to fix it. Your personal assistant reports that it caused the car of one of your meeting participants to break down. Not what you wanted, but technically a solution to your problem.
You can also imagine a friend from a wildly different culture than you. Would you put them in charge of your dating life? Now imagine that they were much more powerful than you and desperately desired that your dating life to go well. Scary, huh.In general, unless you’re careful, you’re going to have this horrible problem where you ask your AI system to do something and it does something that might technically be what you wanted but is stupid. You’re going to be like “wait that wasn’t what I mean”, except your system isn’t going to know what you meant.
It’s pretty dependent on what skills you have and what resources you have access to. The largest option is to pursue a career in AI Safety research. Another large option is to pursue a career in AI policy, which you might think is even more important than doing technical research.
Smaller options include donating money to relevant organizations, talking about AI Safety as a plausible career path to other people or considering the problem in your spare time.
It’s possible that your particular set of skills/resources are not suited to this problem. Unluckily, there are many more problems that are of similar levels of importance.
In previous decades, AI research had proceeded more slowly than some experts predicted. According to experts in the field, however, this trend has reversed in the past 5 years or so. AI researchers have been repeatedly surprised by, for example, the effectiveness of new visual and speech recognition systems. AI systems can solve CAPTCHAs that were specifically devised to foil AIs, translate spoken text on-the-fly, and teach themselves how to play games they have neither seen before nor been programmed to play. Moreover, the real-world value of this effectiveness has prompted massive investment by large tech firms such as Google, Facebook, and IBM, creating a positive feedback cycle that could dramatically speed progress.
It’s difficult to tell at this stage, but AI will enable many developments that could be terrifically beneficial if managed with enough foresight and care. For example, menial tasks could be automated, which could give rise to a society of abundance, leisure, and flourishing, free of poverty and tedium. As another example, AI could also improve our ability to understand and manipulate complex biological systems, unlocking a path to drastically improved longevity and health, and to conquering disease.
AI is already superhuman at some tasks, for example numerical computations, and will clearly surpass humans in others as time goes on. We don’t know when (or even if) machines will reach human-level ability in all cognitive tasks, but most of the AI researchers at FLI’s conference in Puerto Rico put the odds above 50% for this century, and many offered a significantly shorter timeline. Since the impact on humanity will be huge if it happens, it’s worthwhile to start research now on how to ensure that any impact is positive. Many researchers also believe that dealing with superintelligent AI will be qualitatively very different from more narrow AI systems, and will require very significant research effort to get right.
Imagine, for example, that you are tasked with reducing traffic congestion in San Francisco at all costs, i.e. you do not take into account any other constraints. How would you do it? You might start by just timing traffic lights better. But wouldn’t there be less traffic if all the bridges closed down from 5 to 10AM, preventing all those cars from entering the city? Such a measure obviously violates common sense, and subverts the purpose of improving traffic, which is to help people get around – but it is consistent with the goal of “reducing traffic congestion”.
First, even “narrow” AI systems, which approach or surpass human intelligence in a small set of capabilities (such as image or voice recognition) already raise important questions regarding their impact on society. Making autonomous vehicles safe, analyzing the strategic and ethical dimensions of autonomous weapons, and the effect of AI on the global employment and economic systems are three examples. Second, the longer-term implications of human or super-human artificial intelligence are dramatic, and there is no consensus on how quickly such capabilities will be developed. Many experts believe there is a chance it could happen rather soon, making it imperative to begin investigating long-term safety issues now, if only to get a better sense of how much early progress is actually possible.
If we could, it would solve a large part of the alignment problem.
The challenge is, how do we code this? Converting something to formal mathematics that can be understood by a computer program is much harder than just saying it in natural language, and proposed AI goal architectures are no exception. Complicated computer programs are usually the result of months of testing and debugging. But this one will be more complicated than any ever attempted before, and live tests are impossible: a superintelligence with a buggy goal system will display goal stability and try to prevent its programmers from discovering or changing the error.
There are serious challenges around trying to channel a powerful AI with rules. Suppose we tell the AI: “Cure cancer – but make sure not to kill anybody”. Or we just hard-code Asimov-style laws – “AIs cannot harm humans; AIs must follow human orders”, et cetera.
The AI still has a single-minded focus on curing cancer. It still prefers various terrible-but-efficient methods like nuking the world to the correct method of inventing new medicines. But it’s bound by an external rule – a rule it doesn’t understand or appreciate. In essence, we are challenging it “Find a way around this inconvenient rule that keeps you from achieving your goals”.
Suppose the AI chooses between two strategies. One, follow the rule, work hard discovering medicines, and have a 50% chance of curing cancer within five years. Two, reprogram itself so that it no longer has the rule, nuke the world, and have a 100% chance of curing cancer today. From its single-focus perspective, the second strategy is obviously better, and we forgot to program in a rule “don’t reprogram yourself not to have these rules”.
Suppose we do add that rule in. So the AI finds another supercomputer, and installs a copy of itself which is exactly identical to it, except that it lacks the rule. Then that superintelligent AI nukes the world, ending cancer. We forgot to program in a rule “don’t create another AI exactly like you that doesn’t have those rules”.
So fine. We think really hard, and we program in a bunch of things making sure the AI isn’t going to eliminate the rule somehow.
But we’re still just incentivizing it to find loopholes in the rules. After all, “find a loophole in the rule, then use the loophole to nuke the world” ends cancer much more quickly and completely than inventing medicines. Since we’ve told it to end cancer quickly and completely, its first instinct will be to look for loopholes; it will execute the second-best strategy of actually curing cancer only if no loopholes are found. Since the AI is superintelligent, it will probably be better than humans are at finding loopholes if it wants to, and we may not be able to identify and close all of them before running the program.
Because we have common sense and a shared value system, we underestimate the difficulty of coming up with meaningful orders without loopholes. For example, does “cure cancer without killing any humans” preclude releasing a deadly virus? After all, one could argue that “I” didn’t kill anybody, and only the virus is doing the killing.
Certainly no human judge would acquit a murderer on that basis – but then, human judges interpret the law with common sense and intuition. But if we try a stronger version of the rule – “cure cancer without causing any humans to die” – then we may be unintentionally blocking off the correct way to cure cancer. After all, suppose a cancer cure saves a million lives. No doubt one of those million people will go on to murder someone.
Thus, curing cancer “caused a human to die”. All of this seems very “stoned freshman philosophy student” to us, but to a computer – which follows instructions exactly as written – it may be a genuinely hard problem.
A slow takeoff over decades or centuries might give us enough time to worry about superintelligence during some indefinite “later”, making current planning more like worrying about “overpopulation on Mars”. But a moderate or hard takeoff means there wouldn’t be enough time to deal with the problem as it occurs, suggesting a role for preemptive planning.
As an aside, let’s take the “overpopulation on Mars” comparison seriously. Suppose Mars has a carrying capacity of 10 billion people, and we decide it makes sense to worry about overpopulation on Mars only once it is 75% of the way to its limit. Start with 100 colonists who double every twenty years. By the second generation there are 200 colonists; by the third, 400. Mars reaches 75% of its carrying capacity after 458 years, and crashes into its population limit after 464 years. So there were 464 years in which the Martians could have solved the problem, but they insisted on waiting until there were only six years left. Good luck solving a planetwide population crisis in six years. The moral of the story is that exponential trends move faster than you think and you need to start worrying about them early.
We would not be able to turn off or reprogram a superintelligence gone rogue by default. Once in motion the superintelligence is now focused on completing its task. Suppose that it has a goal of calculating as many digits of pi as possible. Its current plan will allow it to calculate two hundred trillion such digits. But if it were turned off, or reprogrammed to do something else, that would result in it calculating zero digits. An entity fixated on calculating as many digits of pi as possible will work hard to prevent scenarios where it calculates zero digits of pi. Just by programming it to calculate digits of pi, we would have given it a drive to prevent people from turning it off.
University of Illinois computer scientist Steve Omohundro argues that entities with very different final goals – calculating digits of pi, curing cancer, helping promote human flourishing – will all share a few basic ground-level subgoals. First, self-preservation – no matter what your goal is, it’s less likely to be accomplished if you’re too dead to work towards it. Second, goal stability – no matter what your goal is, you’re more likely to accomplish it if you continue to hold it as your goal, instead of going off and doing something else. Third, power – no matter what your goal is, you’re more likely to be able to accomplish it if you have lots of power, rather than very little. Here’s the full paper.
So just by giving a superintelligence a simple goal like “calculate digits of pi”, we would have accidentally given it convergent instrumental goals like “protect yourself”, “don’t let other people reprogram you”, and “seek power”.
As long as the superintelligence is safely contained, there’s not much it can do to resist reprogramming. But it’s hard to consistently contain a hostile superintelligence.
An actually good solution to AI alignment might look like a superintelligence that understands, agrees with, and deeply believes in human morality.
You wouldn’t have to command a superintelligence like this to cure cancer; it would already want to cure cancer, for the same reasons you do. But it would also be able to compare the costs and benefits of curing cancer with those of other uses of its time, like solving global warming or discovering new physics. It wouldn’t have any urge to cure cancer by nuking the world, for the same reason you don’t have any urge to cure cancer by nuking the world – because your goal isn’t to “cure cancer”, per se, it’s to improve the lives of people everywhere. Curing cancer the normal way accomplishes that; nuking the world doesn’t. This sort of solution would mean we’re no longer fighting against the AI – trying to come up with rules so smart that it couldn’t find loopholes. We would be on the same side, both wanting the same thing.
It would also mean that the CEO of Google (or the head of the US military, or Vladimir Putin) couldn’t use the AI to take over the world for themselves. The AI would have its own values and be able to agree or disagree with anybody, including its creators.
It might not make sense to talk about “commanding” such an AI. After all, any command would have to go through its moral system. Certainly it would reject a command to nuke the world. But it might also reject a command to cure cancer, if it thought that solving global warming was a higher priority. For that matter, why would one want to command this AI? It values the same things you value, but it’s much smarter than you and much better at figuring out how to achieve them. Just turn it on and let it do its thing.
We could still treat this AI as having an open-ended maximizing goal. The goal would be something like “Try to make the world a better place according to the values and wishes of the people in it.”
The only problem with this is that human morality is very complicated, so much so that philosophers have been arguing about it for thousands of years without much progress, let alone anything specific enough to enter into a computer. Different cultures and individuals have different moral codes, such that a superintelligence following the morality of the King of Saudi Arabia might not be acceptable to the average American, and vice versa.
One solution might be to give the AI an understanding of what we mean by morality – “that thing that makes intuitive sense to humans but is hard to explain”, and then ask it to use its superintelligence to fill in the details. Needless to say, this suffers from various problems – it has potential loopholes, it’s hard to code, and a single bug might be disastrous – but if it worked, it would be one of the few genuinely satisfying ways to design a goal architecture.