These are all imported from external FAQs with permission.
- MIRI's FAQ
- MIRI's Intelligence Explosion FAQ
- FLI's FAQ
- r/controlproblem's FAQ
- Mark Xu's FAQ
- Superintelligence FAQ
- Rohin Shah's FAQ
What is MIRI’s mission? What is MIRI trying to do? What is MIRI working on?
MIRI's mission statement is to “ensure that the creation of smarter-than-human artificial intelligence has a positive impact.” This is an ambitious goal, but they believe that some early progress is possible, and they believe that the goal’s importance and difficulty makes it prudent to begin work at an early date.
Their two main research agendas, “Agent Foundations for Aligning Machine Intelligence with Human Interests” and “Value Alignment for Advanced Machine Learning Systems,” focus on three groups of technical problems:
- highly reliable agent design — learning how to specify highly autonomous systems that reliably pursue some fixed goal;
- value specification — supplying autonomous systems with the intended goals; and
- error tolerance — making such systems robust to programmer error.
They publish new mathematical results (although their work is non-disclosed by default), host workshops, attend conferences, and fund outside researchers who are interested in investigating these problems. They also host a blog and an online research forum.
Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more.1 However, human intelligence continues to dominate machine intelligence in generality.
A powerful chess computer is “narrow”: it can’t play other games. In contrast, humans have problem-solving abilities that allow us to adapt to new contexts and excel in many domains other than what the ancestral environment prepared us for.
In the absence of a formal definition of “intelligence” (and therefore of “artificial intelligence”), we can heuristically cite humans’ perceptual, inferential, and deliberative faculties (as opposed to, e.g., our physical strength or agility) and say that intelligence is “those kinds of things.” On this conception, intelligence is a bundle of distinct faculties — albeit a very important bundle that includes our capacity for science.
Our cognitive abilities stem from high-level patterns in our brains, and these patterns can be instantiated in silicon as well as carbon. This tells us that general AI is possible, though it doesn’t tell us how difficult it is. If intelligence is sufficiently difficult to understand, then we may arrive at machine intelligence by scanning and emulating human brains or by some trial-and-error process (like evolution), rather than by hand-coding a software agent.
If machines can achieve human equivalence in cognitive tasks, then it is very likely that they can eventually outperform humans. There is little reason to expect that biological evolution, with its lack of foresight and planning, would have hit upon the optimal algorithms for general intelligence (any more than it hit upon the optimal flying machine in birds). Beyond qualitative improvements in cognition, Nick Bostrom notes more straightforward advantages we could realize in digital minds, e.g.:
- editability — “It is easier to experiment with parameter variations in software than in neural wetware.”2
- speed — “The speed of light is more than a million times greater than that of neural transmission, synaptic spikes dissipate more than a million times more heat than is thermodynamically necessary, and current transistor frequencies are more than a million times faster than neuron spiking frequencies.”
- serial depth — On short timescales, machines can carry out much longer sequential processes.
- storage capacity — Computers can plausibly have greater working and long-term memory.
- size — Computers can be much larger than a human brain.
- duplicability — Copying software onto new hardware can be much faster and higher-fidelity than biological reproduction.
Any one of these advantages could give an AI reasoner an edge over a human reasoner, or give a group of AI reasoners an edge over a human group. Their combination suggests that digital minds could surpass human minds more quickly and decisively than we might expect.
Present-day AI algorithms already demand special safety guarantees when they must act in important domains without human oversight, particularly when they or their environment can change over time:
Achieving these gains [from autonomous systems] will depend on development of entirely new methods for enabling “trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of [adaptability] and autonomy. In effect, the number of possible input states that such systems can be presented with is so large that not only is it impossible to test all of them directly, it is not even feasible to test more than an insignificantly small fraction of them. Development of such systems is thus inherently unverifiable by today’s methods, and as a result their operation in all but comparatively trivial applications is uncertifiable.
It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.
- Office of the US Air Force Chief Scientist (2010). Technology Horizons: A Vision for Air Force Science and Technology 2010-30.
As AI capabilities improve, it will become easier to give AI systems greater autonomy, flexibility, and control; and there will be increasingly large incentives to make use of these new possibilities. The potential for AI systems to become more general, in particular, will make it difficult to establish safety guarantees: reliable regularities during testing may not always hold post-testing.
The largest and most lasting changes in human welfare have come from scientific and technological innovation — which in turn comes from our intelligence. In the long run, then, much of AI’s significance comes from its potential to automate and enhance progress in science and technology. The creation of smarter-than-human AI brings with it the basic risks and benefits of intellectual progress itself, at digital speeds.
As AI agents become more capable, it becomes more important (and more difficult) to analyze and verify their decisions and goals. Stuart Russell writes:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
- The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
- Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
Bostrom’s “The Superintelligent Will” lays out these two concerns in more detail: that we may not correctly specify our actual goals in programming smarter-than-human AI systems, and that most agents optimizing for a misspecified goal will have incentives to treat humans adversarially, as potential threats or obstacles to achieving the agent’s goal.
If the goals of human and AI agents are not well-aligned, the more knowledgeable and technologically capable agent may use force to get what it wants, as has occurred in many conflicts between human communities. Having noticed this class of concerns in advance, we have an opportunity to reduce risk from this default scenario by directing research toward aligning artificial decision-makers’ interests with our own.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. -Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
In early 2013, Bostrom and Müller surveyed the one hundred top-cited living authors in AI, as ranked by Microsoft Academic Search. Conditional on “no global catastrophe halt[ing] progress,” the twenty-nine experts who responded assigned a median 10% probability to our developing a machine “that can carry out most human professions at least as well as a typical human” by the year 2023, a 50% probability by 2048, and a 90% probability by 2080.
Most researchers at MIRI approximately agree with the 10% and 50% dates, but think that AI could arrive significantly later than 2080. This is in line with Bostrom’s analysis in Superintelligence:
My own view is that the median numbers reported in the expert survey do not have enough probability mass on later arrival dates. A 10% probability of HLMI [human-level machine intelligence] not having been developed by 2075 or even 2100 (after conditionalizing on “human scientific activity continuing without major negative disruption”) seems too low. Historically, AI researchers have not had a strong record of being able to predict the rate of advances in their own field or the shape that such advances would take. On the one hand, some tasks, like chess playing, turned out to be achievable by means of surprisingly simple programs; and naysayers who claimed that machines would “never” be able to do this or that have repeatedly been proven wrong. On the other hand, the more typical errors among practitioners have been to underestimate the difficulties of getting a system to perform robustly on real-world tasks, and to overestimate the advantages of their own particular pet project or technique.
Given experts’ (and non-experts’) poor track record at predicting progress in AI, we are relatively agnostic about when full AI will be invented. It could come sooner than expected, or later than expected.
Experts also reported a 10% median confidence that superintelligence would be developed within 2 years of human equivalence, and a 75% confidence that superintelligence would be developed within 30 years of human equivalence. Here MIRI researchers’ views differ significantly from AI experts’ median view; we expect AI systems to surpass humans relatively quickly once they near human equivalence.
Intelligence is powerful. One might say that “Intelligence is no match for a gun, or for someone with lots of money,” but both guns and money were produced by intelligence. If not for our intelligence, humans would still be foraging the savannah for food.
Intelligence is what caused humans to dominate the planet in the blink of an eye (on evolutionary timescales). Intelligence is what allows us to eradicate diseases, and what gives us the potential to eradicate ourselves with nuclear war. Intelligence gives us superior strategic skills, superior social skills, superior economic productivity, and the power of invention.
A machine with superintelligence would be able to hack into vulnerable networks via the internet, commandeer those resources for additional computing power, take over mobile machines connected to networks connected to the internet, use them to build additional machines, perform scientific experiments to understand the world better than humans can, invent quantum computing and nanotechnology, manipulate the social world better than we can, and do whatever it can to give itself more power to achieve its goals — all at a speed much faster than humans can respond to.
Many AI designs that would generate an intelligence explosion would not have a ‘slot’ in which a goal (such as ‘be friendly to human interests’) could be placed. For example, if AI is made via whole brain emulation, or evolutionary algorithms, or neural nets, or reinforcement learning, the AI will end up with some goal as it self-improves, but that stable eventual goal may be very difficult to predict in advance.
Thus, in order to design a friendly AI, it is not sufficient to determine what ‘friendliness’ is (and to specify it clearly enough that even a superintelligence will interpret it the way we want it to). We must also figure out how to build a general intelligence that satisfies a goal at all, and that stably retains that goal as it edits its own code to make itself smarter. This task is perhaps the primary difficulty in designing friendly AI.
Eliezer Yudkowsky has proposed Coherent Extrapolated Volition as a solution to at least two problems facing Friendly AI design:
- The fragility of human values: Yudkowsky writes that “any future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals will contain almost nothing of worth.” The problem is that what humans value is complex and subtle, and difficult to specify. Consider the seemingly minor value of novelty. If a human-like value of novelty is not programmed into a superintelligent machine, it might explore the universe for valuable things up to a certain point, and then maximize the most valuable thing it finds (the exploration-exploitation tradeoff) — tiling the solar system with brains in vats wired into happiness machines, for example. When a superintelligence is in charge, you have to get its motivational system exactly right in order to not make the future undesirable.
- The locality of human values: Imagine if the Friendly AI problem had faced the ancient Greeks, and they had programmed it with the most progressive moral values of their time. That would have led the world to a rather horrifying fate. But why should we think that humans have, in the 21st century, arrived at the apex of human morality? We can’t risk programming a superintelligent machine with the moral values we happen to hold today. But then, which moral values do we give it?
Yudkowsky suggests that we build a ‘seed AI’ to discover and then extrapolate the ‘coherent extrapolated volition’ of humanity:
> In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
The seed AI would use the results of this examination and extrapolation of human values to program the motivational system of the superintelligence that would determine the fate of the galaxy.
However, some worry that the collective will of humanity won’t converge on a coherent set of goals. Others believe that guaranteed Friendliness is not possible, even by such elaborate and careful means.
- Yudkowsky, Coherent Extrapolated Volition
Some have proposed that we teach machines a moral code with case-based machine learning. The basic idea is this: Human judges would rate thousands of actions, character traits, desires, laws, or institutions as having varying degrees of moral acceptability. The machine would then find the connections between these cases and learn the principles behind morality, such that it could apply those principles to determine the morality of new cases not encountered during its training. This kind of machine learning has already been used to design machines that can, for example, detect underwater mines after feeding the machine hundreds of cases of mines and not-mines.
There are several reasons machine learning does not present an easy solution for Friendly AI. The first is that, of course, humans themselves hold deep disagreements about what is moral and immoral. But even if humans could be made to agree on all the training cases, at least two problems remain.
The first problem is that training on cases from our present reality may not result in a machine that will make correct ethical decisions in a world radically reshaped by superintelligence.
The second problem is that a superintelligence may generalize the wrong principles due to coincidental patterns in the training data. Consider the parable of the machine trained to recognize camouflaged tanks in a forest. Researchers take 100 photos of camouflaged tanks and 100 photos of trees. They then train the machine on 50 photos of each, so that it learns to distinguish camouflaged tanks from trees. As a test, they show the machine the remaining 50 photos of each, and it classifies each one correctly. Success! However, later tests show that the machine classifies additional photos of camouflaged tanks and trees poorly. The problem turns out to be that the researchers’ photos of camouflaged tanks had been taken on cloudy days, while their photos of trees had been taken on sunny days. The machine had learned to distinguish cloudy days from sunny days, not camouflaged tanks from trees.
Thus, it seems that trustworthy Friendly AI design must involve detailed models of the underlying processes generating human moral judgments, not only surface similarities of cases.
Let’s consider the likely consequences of some utilitarian designs for Friendly AI.
Or, consider an AI designed to maximize human pleasure. Rather than build an ambitious utopia that caters to the complex and demanding wants of humanity for billions of years, it could achieve its goal more efficiently by wiring humans into Nozick’s experience machines. Or, it could rewire the ‘liking’ component of the brain’s reward system so that whichever hedonic hotspot paints sensations with a ‘pleasure gloss’ is wired to maximize pleasure when humans sit in jars. That would be an easier world for the AI to build than one that caters to the complex and nuanced set of world states currently painted with the pleasure gloss by most human brains.
Likewise, an AI motivated to maximize objective desire satisfaction or reported subjective well-being could rewire human neurology so that both ends are realized whenever humans sit in jars. Or it could kill all humans (and animals) and replace them with beings made from scratch to attain objective desire satisfaction or subjective well-being when sitting in jars. Either option might be easier for the AI to achieve than maintaining a utopian society catering to the complexity of human (and animal) desires. Similar problems afflict other utilitarian AI designs.
It’s not just a problem of specifying goals, either. It is hard to predict how goals will change in a self-modifying agent. No current mathematical decision theory can process the decisions of a self-modifying agent.
So, while it may be possible to design a superintelligence that would do what we want, it’s harder than one might initially think.
In order for an Artificial Superintelligence (ASI) to be useful to us, it has to have some level of influence on the outside world. Even a boxed ASI that receives and sends lines of text on a computer screen is influencing the outside world by giving messages to the human reading the screen. If the ASI wants to escape its box, it is likely that it will find its way out, because of its amazing strategic and social abilities.
Check out Yudkowsky's AI box experiment. It is an experiment in which one person convinces the other to let it out of a "box" as if it were an AI. Unfortunately, the actual contents of these conversations is mostly unknown, but it is worth reading into.
There is a broad range of possible goals that an AI might possess, but there are a few basic drives that would be useful to almost any of them. These are called instrumentally convergent goals:
- Self preservation. An agent is less likely to achieve its goal if it is not around to see to its completion.
- Goal-content integrity. An agent is less likely to achieve its goal if its goal has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it.
- Self-improvement. An agent is more likely to achieve its goal if it is more intelligent and better at problem-solving.
- Resource acquisition. The more resources at an agent’s disposal, the more power it has to make change towards its goal. Even a purely computational goal, such as computing digits of pi, can be easier to achieve with more hardware and energy.
Because of these drives, even a seemingly simple goal could create an Artificial Superintelligence (ASI) hell-bent on taking over the world’s material resources and preventing itself from being turned off. The classic example is an ASI that was programmed to maximize the output of paper clips at a paper clip factory. The ASI had no other goal specifications other than “maximize paper clips,” so it converts all of the matter in the solar system into paper clips, and then sends probes to other star systems to create more factories.
Goal-directed behavior arises naturally when systems are trained to on an objective. AI not trained or programmed to do well by some objective function would not be good at anything, and would be useless.
As far as we know from the observable universe, morality is just a construct of the human mind. It is meaningful to us, but it is not necessarily meaningful to the vast universe outside of our minds. There is no reason to suspect that our set of values is objectively superior to any other arbitrary set of values, e.i. “the more paper clips, the better!” Consider the case of the psychopathic genius. Plenty have existed, and they negate any correlation between intelligence and morality.
It is impossible to design an AI without a goal, because it would do nothing. Therefore, in the sense that designing the AI’s goal is a form of control, it is impossible not to control an AI. This goes for anything that you create. You have to control the design of something at least somewhat in order to create it.
There may be relevant moral questions about our future relationship with possibly sentient machine intelligent, but the priority of the Control Problem finding a way to ensure the survival and well-being of the human species.
Let’s say that you’re the French government a while back. You notice that one of your colonies has too many rats, which is causing economic damage. You have basic knowledge of economics and incentives, so you decide to incentivize the local population to kill rats by offering to buy rat tails at one dollar apiece.
Initially, this works out and your rat problem goes down. But then, an enterprising colony member has the brilliant idea of making a rat farm. This person sells you hundreds of rat tails, costing you hundreds of dollars, but they’re not contributing to solving the rat problem.
Soon other people start making their own rat farms and you’re wasting thousands of dollars buying useless rat tails. You call off the project and stop paying for rat tails. This causes all the people with rat farms to shutdown their farms and release a bunch of rats. Now your colony has an even bigger rat problem.
Here’s another, more made-up example of the same thing happening. Let’s say you’re a basketball talent scout and you notice that height is correlated with basketball performance. You decide to find the tallest person in the world to recruit as a basketball player. Except the reason that they’re that tall is because they suffer from a degenerative bone disorder and can barely walk.
Another example: you’re the education system and you want to find out how smart students are so you can put them in different colleges and pay them different amounts of money when they get jobs. You make a test called the Standardized Admissions Test (SAT) and you administer it to all the students. In the beginning, this works. However, the students soon begin to learn that this test controls part of their future and other people learn that these students want to do better on the test. The gears of the economy ratchet forwards and the students start paying people to help them prepare for the test. Your test doesn’t stop working, but instead of measuring how smart the students are, it instead starts measuring a combination of how smart they are and how many resources they have to prepare for the test.
The formal name for the thing that’s happening is Goodhart’s Law. Goodhart’s Law roughly says that if there’s something in the world that you want, like “skill at basketball” or “absence of rats” or “intelligent students”, and you create a measure that tries to measure this like “height” or “rat tails” or “SAT scores”, then as long as the measure isn’t exactly the thing that you want, the best value of the measure isn’t the thing you want: the tallest person isn’t the best basketball player, the most rat tails isn’t the smallest rat problem, and the best SAT scores aren’t always the smartest students.
If you start looking, you can see this happening everywhere. Programmers being paid for lines of code write bloated code. If CFOs are paid for budget cuts, they slash purchases with positive returns. If teachers are evaluated by the grades they give, they hand out As indiscriminately.
In machine learning, this is called specification gaming, and it happens frequently.
Now that we know what Goodhart’s Law is, I’m going to talk about one of my friends, who I’m going to call Alice. Alice thinks it’s funny to answer questions in a way that’s technically correct but misleading. Sometimes I’ll ask her, “Hey Alice, do you want pizza or pasta?” and she responds, “yes”. Because, she sure did want either pizza or pasta. Other times I’ll ask her, “have you turned in your homework?” and she’ll say “yes” because she’s turned in homework at some point in the past; it’s technically correct to answer “yes”. Maybe you have a friend like Alice too.
Whenever this happens, I get a bit exasperated and say something like “you know what I mean”.
It’s one of the key realizations in AI Safety that AI systems are always like your friend that gives answers that are technically what you asked for but not what you wanted. Except, with your friend, you can say “you know what I mean” and they will know what you mean. With an AI system, it won’t know what you mean; you have to explain, which is incredibly difficult.
Let’s take the pizza pasta example. When I ask Alice “do you want pizza or pasta?”, she knows what pizza and pasta are because she’s been living her life as a human being embedded in an English speaking culture. Because of this cultural experience, she knows that when someone asks an “or” question, they mean “which do you prefer?”, not “do you want at least one of these things?”. Except my AI system is missing the thousand bits of cultural context needed to even understand what pizza is.
When you say “you know what I mean” to an AI system, it’s going to be like “no, I do not know what you mean at all”. It’s not even going to know that it doesn’t know what you mean. It’s just going to say “yes I know what you meant, that’s why I answered ‘yes’ to your question about whether I preferred pizza or pasta.” (It also might know what you mean, but just not care.)
If someone doesn’t know what you mean, then it’s really hard to get them to do what you want them to do. For example, let’s say you have a powerful grammar correcting system, which we’ll call Syntaxly+. Syntaxly+ doesn’t quite fix your grammar, it changes your writing so that the reader feels as good as possible after reading it.
Pretend it’s the end of the week at work and you haven’t been able to get everything done your boss wanted you to do. You write the following email:
"Hey boss, I couldn’t get everything done this week. I’m deeply sorry. I’ll be sure to finish it first thing next week."
You then remember you got Syntaxly+, which will make your email sound much better to your boss. You run it through and you get:
"Hey boss, Great news! I was able to complete everything you wanted me to do this week. Furthermore, I’m also almost done with next week’s work as well."
What went wrong here? Syntaxly+ is a powerful AI system that knows that emails about failing to complete work cause negative reactions in readers, so it changed your email to be about doing extra work instead.
This is smart - Syntaxly+ is good at making writing that causes positive reactions in readers. This is also stupid - the system changed the meaning of your email, which is not something you wanted it to do. One of the insights of AI Safety is that AI systems can be simultaneously smart in some ways and dumb in other ways.
The thing you want Syntaxly+ to do is to change the grammar/style of the email without changing the contents. Except what do you mean by contents? You know what you mean by contents because you are a human who grew up embedded in language, but your AI system doesn’t know what you mean by contents. The phrases “I failed to complete my work” and “I was unable to finish all my tasks” have roughly the same contents, even though they share almost no relevant words.
Roughly speaking, this is why AI Safety is a hard problem. Even basic tasks like “fix the grammar of this email” require a lot of understanding of what the user wants as the system scales in power.
In Human Compatible, Stuart Russell gives the example of a powerful AI personal assistant. You notice that you accidentally double-booked meetings with people, so you ask your personal assistant to fix it. Your personal assistant reports that it caused the car of one of your meeting participants to break down. Not what you wanted, but technically a solution to your problem.
You can also imagine a friend from a wildly different culture than you. Would you put them in charge of your dating life? Now imagine that they were much more powerful than you and desperately desired that your dating life to go well. Scary, huh.
In general, unless you’re careful, you’re going to have this horrible problem where you ask your AI system to do something and it does something that might technically be what you wanted but is stupid. You’re going to be like “wait that wasn’t what I mean”, except your system isn’t going to know what you meant.
To help frame this question, we’re going to first answer the dual question of “what is Cybersecurity?”
As a concept, Cybersecurity is the idea that questions like “is this secure?” can meaningfully be asked of computing systems, where “secure” roughly means “is difficult for unauthorized individuals to get access to”. As a problem, Cybersecurity is the set of problems one runs into when trying to design and build secure computing systems. As a field, Cybersecurity is a group of people trying to solve the aforementioned set of problems in robust ways.
As a concept, AI Safety is the idea that questions like “is this safe?” can meaningfully be asked of AI Systems, where “safe” roughly means “does what it’s supposed to do”. As a problem, AI Safety is the set of problems one runs into when trying to design and build AI systems that do what they’re supposed to do. As a field, AI Safety is a group of people trying to solve the aforementioned set of problems in robust ways.
The reason we have a separate field of Cybersecurity is because ensuring the security of the internet and other critical systems is both hard and important. We might want a separate field of AI Safety for similar reasons; we might expect getting powerful AI systems to do what we want to be both hard and important.
Cybersecurity is important because computing systems comprise the backbone of the modern economy. If the security of the internet was compromised, then the economy would suffer a tremendous blow.
Similarly, AI Safety might become important as AI systems begin forming larger and larger parts of the modern economy. As more and more labor gets automated, it becomes more and more important to ensure that that labor is occurring in a safe and robust way.
Before the widespread adoption of computing systems, lack of Cybersecurity didn’t cause much damage. However, it might have been beneficial to start thinking about Cybersecurity problems before the solutions were necessary.
Similarly, since AI systems haven’t been adopted en mass yet, lack of AI Safety isn’t causing harm. However, given that AI systems will become increasingly powerful and increasingly widespread, it might be prudent to try to solve safety problems before a catastrophe occurs.
Additionally, people sometimes think about Artificial General Intelligence (AGI), sometimes called Human-Level Artificial Intelligence (HLAI). One of the core problems in AI Safety is ensuring when AGI gets built, it has human interests at heart. (Note that most surveyed experts think building GI/HLAI is possible, but there is wide disagreement on how soon this might occur).
At this point, people generally have a question that’s like “why can’t we just do X?”, where X is one of a dozen things. I’m going to go over a few possible Xs, but I want to first talk about how to think about these sorts of objections in general.
At the beginning of AI, the problem of computer vision was assigned to a single graduate student, because they thought it would be that easy. We now know that computer vision is actually a very difficult problem, but this was not obvious at the beginning.
The sword also cuts the other way. Before DeepBlue, people talked about how computers couldn’t play chess without a detailed understanding of human psychology. Chess is easier than we thought, merely requiring brute force search and a few heuristics. This also roughly happened with Go, where it turned out that the game was not as difficult as we thought it was.
The general lesson is that determining how hard it is to do a given thing is a difficult task. Historically, many people have got this wrong. This means that even if you think something should be easy, you have to think carefully and do experiments in order to determine if it’s easy or not.
This isn’t to say that there is no clever solution to AI Safety. I assign a low, but non-trivial probability that AI Safety turns out to not be very difficult. However, most of the things that people initially suggest turn out to be unfeasible or more difficult than expected.
A potential solution is to create an AI that has the same values and morality as a human by creating a child AI and raising it. There’s nothing intrinsically flawed with this procedure. However, this suggestion is deceptive because it sounds simpler than it is.
If you get a chimpanzee baby and raise it in a human family, it does not learn to speak a human language. Human babies can grow into adult humans because the babies have specific properties, e.g. a prebuilt language module that gets activated during childhood.
In order to make a child AI that has the potential to turn into the type of adult AI we would find acceptable, the child AI has to have specific properties. The task of building a child AI with these properties involves building a system that can interpret what humans mean when we try to teach the child to do various tasks. People are currently working on ways to program agents that can cooperatively interact with humans to learn what they want.
In previous decades, AI research had proceeded more slowly than some experts predicted. According to experts in the field, however, this trend has reversed in the past 5 years or so. AI researchers have been repeatedly surprised by, for example, the effectiveness of new visual and speech recognition systems. AI systems can solve CAPTCHAs that were specifically devised to foil AIs, translate spoken text on-the-fly, and teach themselves how to play games they have neither seen before nor been programmed to play. Moreover, the real-world value of this effectiveness has prompted massive investment by large tech firms such as Google, Facebook, and IBM, creating a positive feedback cycle that could dramatically speed progress.
It’s difficult to tell at this stage, but AI will enable many developments that could be terrifically beneficial if managed with enough foresight and care. For example, menial tasks could be automated, which could give rise to a society of abundance, leisure, and flourishing, free of poverty and tedium. As another example, AI could also improve our ability to understand and manipulate complex biological systems, unlocking a path to drastically improved longevity and health, and to conquering disease.
The basic concern as AI systems become increasingly powerful is that they won’t do what we want them to do – perhaps because they aren’t correctly designed, perhaps because they are deliberately subverted, or perhaps because they do what we tell them to do rather than what we really want them to do (like in the classic stories of genies and wishes.) Many AI systems are programmed to have goals and to attain them as effectively as possible – for example, a trading algorithm has the goal of maximizing profit. Unless carefully designed to act in ways consistent with human values, a highly sophisticated AI trading system might exploit means that even the most ruthless financier would disavow. These are systems that literally have a mind of their own, and maintaining alignment between human interests and their choices and actions will be crucial.
AI is already superhuman at some tasks, for example numerical computations, and will clearly surpass humans in others as time goes on. We don’t know when (or even if) machines will reach human-level ability in all cognitive tasks, but most of the AI researchers at FLI’s conference in Puerto Rico put the odds above 50% for this century, and many offered a significantly shorter timeline. Since the impact on humanity will be huge if it happens, it’s worthwhile to start research now on how to ensure that any impact is positive. Many researchers also believe that dealing with superintelligent AI will be qualitatively very different from more narrow AI systems, and will require very significant research effort to get right.
It likely will – however, intelligence is, by many definitions, the ability to figure out how to accomplish goals. Even in today’s advanced AI systems, the builders assign the goal but don’t tell the AI exactly how to accomplish it, nor necessarily predict in detail how it will be done; indeed those systems often solve problems in creative, unpredictable ways. Thus the thing that makes such systems intelligent is precisely what can make them difficult to predict and control. They may therefore attain the goal we set them via means inconsistent with our preferences.
Yes. In 2014, Google bought artificial intelligence startup DeepMind for $400 million; DeepMind added the condition that Google promise to set up an AI Ethics Board. DeepMind cofounder Shane Legg has said in interviews that he believes superintelligent AI will be “something approaching absolute power” and “the number one risk for this century”.
Many other science and technology leaders agree. Astrophysicist Stephen Hawking says that superintelligence “could spell the end of the human race.” Tech billionaire Bill Gates describes himself as “in the camp that is concerned about superintelligence…I don’t understand why some people are not concerned”. SpaceX/Tesla CEO Elon Musk calls superintelligence “our greatest existential threat” and donated $10 million from his personal fortune to study the danger. Stuart Russell, Professor of Computer Science at Berkeley and world-famous AI expert, warns of “species-ending problems” and wants his field to pivot to make superintelligence-related risks a central concern.
Professor Nick Bostrom is the director of Oxford’s Future of Humanity Institute, tasked with anticipating and preventing threats to human civilization. He has been studying the risks of artificial intelligence for twenty years. The explanations in the follow-up questions are loosely adapted from his 2014 book Superintelligence.
This is certainly a risk (affectionately known in AI circles as “pulling a Kurzweill”), but sometimes taking an exponential trend seriously is the right response.
Consider economic doubling times. In 1 AD, the world GDP was about $20 billion; it took a thousand years, until 1000 AD, for that to double to $40 billion. But it only took five hundred more years, until 1500, or so, for the economy to double again. And then it only took another three hundred years or so, until 1800, for the economy to double a third time. Someone in 1800 might calculate the trend line and say this was ridiculous, that it implied the economy would be doubling every ten years or so in the beginning of the 21st century. But in fact, this is how long the economy takes to double these days. To a medieval, used to a thousand-year doubling time (which was based mostly on population growth!), an economy that doubled every ten years might seem inconceivable. To us, it seems normal.
Likewise, in 1965 Gordon Moore noted that semiconductor complexity seemed to double every eighteen months. During his own day, there were about five hundred transistors on a chip; he predicted that would soon double to a thousand, and a few years later to two thousand. Almost as soon as Moore’s Law become well-known, people started saying it was absurd to follow it off a cliff – such a law would imply a million transistors per chip in 1990, a hundred million in 2000, ten billion transistors on every chip by 2015! More transistors on a single chip than existed on all the computers in the world! Transistors the size of molecules! But of course all of these things happened; the ridiculous exponential trend proved more accurate than the naysayers.
None of this is to say that exponential trends are always right, just that they are sometimes right even when it seems they can’t possibly be. We can’t be sure that a computer using its own intelligence to discover new ways to increase its intelligence will enter a positive feedback loop and achieve superintelligence in seemingly impossibly short time scales. It’s just one more possibility, a worry to place alongside all the other worrying reasons to expect a moderate or hard takeoff.
A slow takeoff is a situation in which AI goes from infrahuman to human to superhuman intelligence very gradually. For example, imagine an augmented “IQ” scale (THIS IS NOT HOW IQ ACTUALLY WORKS – JUST AN EXAMPLE) where rats weigh in at 10, chimps at 30, the village idiot at 60, average humans at 100, and Einstein at 200. And suppose that as technology advances, computers gain two points on this scale per year. So if they start out as smart as rats in 2020, they’ll be as smart as chimps in 2035, as smart as the village idiot in 2050, as smart as average humans in 2070, and as smart as Einstein in 2120. By 2190, they’ll be IQ 340, as far beyond Einstein as Einstein is beyond a village idiot.
In this scenario progress is gradual and manageable. By 2050, we will have long since noticed the trend and predicted we have 20 years until average-human-level intelligence. Once AIs reach average-human-level intelligence, we will have fifty years during which some of us are still smarter than they are, years in which we can work with them as equals, test and retest their programming, and build institutions that promote cooperation. Even though the AIs of 2190 may qualify as “superintelligent”, it will have been long-expected and there would be little point in planning now when the people of 2070 will have so many more resources to plan with.
A moderate takeoff is a situation in which AI goes from infrahuman to human to superhuman relatively quickly. For example, imagine that in 2020 AIs are much like those of today – good at a few simple games, but without clear domain-general intelligence or “common sense”. From 2020 to 2050, AIs demonstrate some academically interesting gains on specific problems, and become better at tasks like machine translation and self-driving cars, and by 2047 there are some that seem to display some vaguely human-like abilities at the level of a young child. By late 2065, they are still less intelligent than a smart human adult. By 2066, they are far smarter than Einstein.
A fast takeoff scenario is one in which computers go even faster than this, perhaps moving from infrahuman to human to superhuman in only days or weeks.
The argument goes: yes, a superintelligent AI might be far smarter than Einstein, but it’s still just one program, sitting in a supercomputer somewhere. That could be bad if an enemy government controls it and asks its help inventing superweapons – but then the problem is the enemy government, not the AI per se. Is there any reason to be afraid of the AI itself? Suppose the AI did feel hostile – suppose it even wanted to take over the world? Why should we think it has any chance of doing so?
Compounded over enough time and space, intelligence is an awesome advantage. Intelligence is the only advantage we have over lions, who are otherwise much bigger and stronger and faster than we are. But we have total control over lions, keeping them in zoos to gawk at, hunting them for sport, and holding them on the brink of extinction. And this isn’t just the same kind of quantitative advantage tigers have over lions, where maybe they’re a little bigger and stronger but they’re at least on a level playing field and enough lions could probably overpower the tigers. Humans are playing a completely different game than the lions, one that no lion will ever be able to respond to or even comprehend. Short of human civilization collapsing or lions evolving human-level intelligence, our domination over them is about as complete as it is possible for domination to be.
Since superintelligences will be as far beyond Einstein as Einstein is beyond a village idiot, we might worry that they would have the same kind of qualitative advantage over us that we have over lions.
You might say that human civilization as a whole is dangerous to lions. But a single human placed amid a pack of lions with no raw materials for building technology is going to get ripped to shreds. So although thousands of superintelligences, given a long time and a lot of opportunity to build things, might be able to dominate humans – what harm could a single superintelligence do?
Superintelligence has an advantage that a human fighting a pack of lions doesn’t – the entire context of human civilization and technology, there for it to manipulate socially or technologically.
Superintelligence has an advantage that an early human didn’t – the entire context of human civilization and technology, there for it to manipulate socially or technologically.