Non-canonical answers to canonical questions
These 46 non-canonical answers are answering canonical questions.
Perhaps. There is a chance that directly lobbying politicians could help, but there is also a chance that such actions end up being net-negative. It would be great if we could slow down AI, but doing so might simply mean that a nation less concerned about safety produces AI first. We could ask politicians to pass regulations or standards related to AGI, but passing ineffective regulation might interfere with passing more effective regulation later, as people may consider the issue dealt with. Or the requirements of complying with bureaucracy might prove to be a distraction from work on safe AI.
If you are concerned about this issue, you should probably try learning as much about this issue as possible and also spend a lot of time brainstorming downside risks and seeing what risks other people have identified.
Working out milestone tasks that we expect to be achieved before we reach AGI can be difficult. Some tasks, like "continuous learning", intuitively seem like they will need to be solved before someone builds AGI. Continuous learning is learning bit by bit, as you get more data. Current ML systems usually don't do this, instead learning everything at once from a big dataset. Because humans can do continuous learning, it seems like it might be required for AGI. However, you have to be careful with reasoning like this, because it is possible that the first generally capable artificial intelligence will work quite differently from a human. It's possible the first AGI will be designed to avoid needing "continuous learning", maybe by being designed to do a big retraining process every day. This might still allow it to be as capable as humans at almost every task, but without solving the "continuous learning" problem.
Because of arguments like the above, it's not always clear whether a given task is "required" for AGI.
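To make the distinction concrete, here is a minimal sketch that contrasts learning "all at once" with learning "bit by bit", using the simplest possible learner (one that just estimates a mean). The names `batch_mean`, `OnlineMean` and the example `stream` are made up for illustration and are not taken from any real ML system. The last lines also show the "big retraining process every day" workaround: retraining from scratch on everything seen so far gives the same estimates without any incremental update rule.

```python
def batch_mean(data):
    """Learn 'all at once': needs the whole dataset before giving an estimate."""
    return sum(data) / len(data)


class OnlineMean:
    """Learn 'bit by bit': update the estimate as each new data point arrives."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        # Incremental update: move the estimate towards the new point.
        self.mean += (x - self.mean) / self.n
        return self.mean


stream = [2.0, 4.0, 6.0, 8.0]

# Continuous learning: one cheap update per new observation.
learner = OnlineMean()
online_estimates = [learner.update(x) for x in stream]

# "Daily retraining": rerun the batch learner on everything seen so far.
retrained_estimates = [batch_mean(stream[:k]) for k in range(1, len(stream) + 1)]

print(online_estimates)      # [2.0, 3.0, 4.0, 5.0]
print(retrained_estimates)   # [2.0, 3.0, 4.0, 5.0]
```

In this toy case both routes end up with identical estimates, which is the point of the caveat above: a system could match the behaviour of a continuous learner without implementing continuous learning as such.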
Some potential big milestone tasks might be:
- ARC challenge (tests the ability to generate the "simplest explanation" for patterns)
- Human level sample efficiency at various tasks (EfficientZero already does Atari games)
This Metaculus question has four very specific milestones that it considers to be requirements for "weak AGI".
There are multiple programs you can apply to if you want to try becoming a researcher. If accepted to these programs, you will get funding and mentorship. Some examples of these programs are: SERI summer research fellowship, CERI summer research fellowship, SERI ML Alignment Theory Program, and more. A lot of these programs run at specific times of the year (usually during the summer).
Other things you can do: join the next iteration of the AGI Safety Fundamentals programme (https://www.eacambridge.org/technical-alignment-curriculum); if you're thinking of a career as a researcher working on AI safety questions, get 1-1 career advice from 80,000 Hours (https://80000hours.org/speak-with-us); or apply to attend an EAGx or EAG conference (https://www.eaglobal.org/events/), where you can meet researchers working on these questions in person and ask them directly for advice.
Some of these resources might be helpful: https://www.aisafetysupport.org/resources/lots-of-links
It's not completely clear exactly what 'merging' with AI would imply, but it doesn't seem like a way to get around the alignment problem. If the AI system is aligned, and wants to do what humans want, then having direct access to human brains could provide a lot of information about human values and goals very quickly and efficiently, and thus be helpful for better alignment. However, a smart AI system could also get almost all of this information without a brain-computer interface, through conversation, observation etc., though much more slowly. On the other hand, if the system is not aligned, and doesn't fundamentally want humans to get what we want, then extra information about how human minds work doesn't help and only makes the problem worse. Allowing a misaligned AGI direct access to your brain hardware is a bad idea for obvious reasons.
Preventing an AI from escaping by using a more powerful AI gets points for creative thinking, but unfortunately we would need to have already aligned the first AI. Even if the second AI's only terminal goal were to prevent the first AI from escaping, it would also have an instrumental goal of converting the rest of the universe into computer chips so that it would have more processing power to figure out how to best contain the first AI.
It might be possible to try to bind a stronger AI with a weaker AI, but this is unlikely to work as the stronger AI would have an advantage due to being stronger. Further, there is a chance that the two AIs end up working out a deal where the first AI decides to stay in the box and the second AI does whatever the first AI would have done if it were able to escape.
One of the main questions about simulation theory is why a society would invest a large quantity of resources to create a simulation. One possible answer is that it provides an environment to train or test AI, or to run it safely isolated from an outside reality.
It's a fun question but probably not one worth thinking about too much. It is impossible to get information about this kind of question from observations and experiments.
I think an AI that is inner aligned to optimize a utility function of "maximize happiness minus suffering" is likely to do something like this.
Inner aligned means the AI is trying to do the thing we trained it to do, whether or not this is what we actually want.
"Aligned to what" is the outer alignment problem which is where the failure in this example is. There is a lot of debate on what utility functions are safe or desirable to maximize, and if human values can even be described by a utility function.
Autonomous weapons, especially a nuclear arsenal, being used by an AI is a concern, but this seems downstream of the central problem of giving an unaligned AI any capabilities to impact the world.
Triggering nuclear war is only one of many ways a power-seeking AI might choose to take control, and it seems an unlikely one, as resources the AI would want to control (or the AI itself) would likely be destroyed in the process.
This depends on what the superintelligence in question wants to happen. If AIs want humans to continue being employable, they’ll act to ensure humans remain employable by setting up roles that only biological humans can fill, artificially perpetuating the need for employing humans.
Optimistic views might hold that it is possible to coordinate between all AI creators to align their AIs only with a central agreed-upon definition of "human values," which could be determined by traditional human political organizations. Succeeding at this coordination would prevent (or at least, reduce) the weaponization of AIs toward competition between these values.
More pessimistic views hold that this coordination is unlikely to succeed, and that just as today different definitions of "human values" compete with one another (through e.g. political conflicts), AIs will likely be constructed by actors with different values and will compete with one another on the same grounds. The exception is that this competition might end if one group gains enough advantage to carry out a Pivotal Act that can "lock in" their set of values as the winner.
We could imagine a good instance of this might look like a U.N.-sanctioned project constructing the first super-intelligent AI, successfully aligned with the human values roughly defined as "global peace and development". This AI might then perform countermeasures to reduce the influence of bad AIs by e.g. regulating further AI development, or seizing compute power from agencies developing bad AIs.
Bad outcomes might look similar to the above, but with AIs developed by extremists or terrorists taking over. Worse still would be a careless development group accidentally producing a misaligned AI, where we don't end up with "bad human values" (like one of the more oppressive human moralities), we just end up with "non-human values" (like where only paperclips matter).
A common concern is that if a friendly AI doesn't carry out such a Pivotal Act, then an opposition AI is likely to do so. Hence, there is a relatively common view that safe AI not only must be developed, but must be deployed to prevent possibly hostile AIs from arising. There are also arguments against the "Pivotal Act" mentality, which promote political regulation as a better path toward friendly AI than leaving the responsibility to the first firm to finish.
Questions are: (i) contributed by online users or Stampy editors via the Stampy Wiki; or (ii) scraped from online content (various AI-alignment-related FAQs as well as the comments sections of certain AI Alignment YouTube videos).
The scraped content is currently a secondary concern, but this crude process of aggregation will eventually be streamlined into a reliable source of human-editable questions and answers.
Questions are reviewed by Stampy editors, who decide if: (i) they're duplicates of existing questions (the criterion being that the answer to the existing question would be fully satisfactory to the asker of the new question); (ii) they're sufficiently within the scope of the Stampy project.
We are working on using semantic search to suggest possible duplicates.
If they meet these two criteria, questions are added to a list of canonical questions.
A rating system allows editors to assign quality levels ("Meh"/"Unreviewed"/"Approved"/"Good"/"Excellent") in order to sort the questions on Answer questions, so that the most important questions can be worked on first.
Answers to canonical questions can be contributed via the Stampy Wiki by online users or by Stampy editors directly, at which point the question is added to a list of "answered canonical questions".
Editors can attempt to improve a contributed answer, and/or can "stamp" it to indicate their approval, adding to its "stamp score".
Once the answer to a canonical question gets a sufficiently high stamp score it gets added to a list of canonical answers (to canonical questions).
These canonical question/answer pairs are then ready to be served to the user interface. In order for them to become visible there, though, they must be associated with existing canonical question/answer pairs in one of two ways: RELATED or FOLLOWUP. Any editor can improve these relationships, either based on tags or their own understanding of what a reader might want to know. Questions should aim to have 2-5 related + followups generally, although exceptions can be made.
If Question B is RELATED to Question A, it will slide in below Question A on the UI page when Question A is clicked on, provided that it is not already present on the page.
If Question B is FOLLOWUP to Question A, it will always slide in below Question A when Question A is clicked on, even if it is already present on the UI page.
A and B being RELATED questions can be thought of as a kind of conceptual adjacency. If a user is interested to know the answer to A, they'll probably be interested in the answer to B too, and vice versa. Reading these in either order should make roughly the same amount of sense to the average user.
Question B being FOLLOWUP to Question A can be thought of in terms of progressive knowledge: the answer to B will only really make sense to the average user if they have read the answer to A first. This is also used for letting Stampy ask clarifying questions to direct readers to the right part of his knowledge graph.
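The difference between the two relationship types can be sketched in a few lines. This is only an illustration of the behaviour described above, not Stampy's actual UI code: the function name `questions_to_slide_in`, the dictionary layout, and the choice to list followups before related questions are all assumptions made for the example.

```python
def questions_to_slide_in(clicked, on_page, related, followup):
    """Return the questions that slide in below `clicked`.

    related / followup: dicts mapping a question to its linked questions.
    on_page: the questions currently shown on the UI page.
    """
    shown = []
    # FOLLOWUP questions always slide in, even if already on the page.
    for q in followup.get(clicked, []):
        shown.append(q)
    # RELATED questions slide in only if they are not already present.
    for q in related.get(clicked, []):
        if q not in on_page and q not in shown:
            shown.append(q)
    return shown


related = {"A": ["B", "C"]}
followup = {"A": ["D"]}

# B and D are already on the page: D (FOLLOWUP) still appears, B (RELATED) does not.
result = questions_to_slide_in("A", ["A", "B", "D"], related, followup)
print(result)  # ['D', 'C']
```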
If you click on "Edit answer", then "[Show Advanced Options]", you'll be given the option to submit a brief version of your answer (this field will be automatically filled if the full answer exceeds 2000 characters).
There are debates about how discontinuous an intelligence explosion would be, with Paul Christiano expecting to see the world being transformed by increasingly capable AGIs over some number of years, while Eliezer Yudkowsky expects a rapid jump in capabilities once generality is achieved and the self-improvement process is able to sustain itself.
Vael Gates's project links to lots of example transcripts of persuading senior AI capabilities researchers.
Codex / GitHub Copilot are AIs that use GPT-3 to write and edit code. When given some input code and comments describing the intended function, they will write output that extends the prompt as accurately as possible.
"The real concern" isn't a particularly meaningful concept here. Deep learning has proven to be a very powerful technology, with far reaching implications across a number of aspects of human existence. There are significant benefits to be found if we manage the technology properly, but that management means addressing a broad range of concerns, one of which is the alignment problem.
Whole Brain Emulation (WBE) or ‘mind uploading’ is a computer emulation of all the cells and connections in a human brain. So even if the underlying principles of general intelligence prove difficult to discover, we might still emulate an entire human brain and make it run at a million times its normal speed (computer circuits communicate much faster than neurons do). Such a WBE could do as much thinking in one second as a normal human can in about eleven and a half days, and as much in one year as a normal human could in a million years. So this would not lead immediately to smarter-than-human intelligence, but it would lead to faster-than-human intelligence. A WBE could be backed up (leading to a kind of immortality), and it could be copied so that hundreds or millions of WBEs could work on separate problems in parallel. If WBEs are created, they may therefore be able to solve scientific problems far more rapidly than ordinary humans, accelerating further technological progress.
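As a rough check on the speed-up arithmetic, the sketch below converts one second of wall-clock time into subjective thinking time for a given speed-up factor. The helper name `subjective_seconds` is made up for this example, and a year is taken as 365.25 days. At a million-fold speed-up one second corresponds to about 11.6 days; a figure of roughly 31 years per second would need a speed-up of about a billion.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def subjective_seconds(wall_clock_seconds, speedup):
    """Subjective thinking time experienced by an emulation run at `speedup`x."""
    return wall_clock_seconds * speedup


million_fold = subjective_seconds(1, 1_000_000)
billion_fold = subjective_seconds(1, 1_000_000_000)

print(round(million_fold / SECONDS_PER_DAY, 1))    # 11.6 (days per second)
print(round(billion_fold / SECONDS_PER_YEAR, 1))   # 31.7 (years per second)
```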
As long as AI does not exceed human capabilities, we could do that.
But there is no reason why AI capabilities would stop at the human level. Systems more intelligent than us could think of several ways to outsmart us, so our best bet is to have them as closely aligned to our values as possible.
The problem is that the actions can be harmful in a very non-obvious, indirect way. It's not at all obvious which actions should be stopped.
For example, when the system comes up with a very clever way to acquire resources, the safety of that action depends on what it intends to use these resources for.
Such supervision may buy us some safety, if we find a way to make the system's intentions very transparent.
Verified accounts are given to people who have clearly demonstrated understanding of AI Safety outside of this project, such as by being employed and vouched for by a major AI Safety organization or by producing high-impact research. Verified accounts may freely mark answers as canonical or not, regardless of how many Stamps the person has, to determine whether those answers are used by Stampy.
This depends on how we program it. It definitely can be autonomous: even now, we have autonomous vehicles, flight control systems, and many more such systems.
Even though it's possible to build such systems, it may be better if they actively ask humans for supervision, for example in cases where they are uncertain what to do.
Nobody knows for sure when we will have ASI or if it is even possible. Predictions on AI timelines are notoriously variable, but recent surveys about the arrival of human-level AGI have median dates between 2040 and 2050 although the median for (optimistic) AGI researchers and futurists is in the early 2030s (source). What will happen if/when we are able to build human-level AGI is a point of major contention among experts. One survey asked (mostly) experts to estimate the likelihood that it would take less than 2 or 30 years for a human-level AI to improve to greatly surpass all humans in most professions. Median answers were 10% for "within 2 years" and 75% for "within 30 years".
We know little about the limits of intelligence and whether increasing it will follow the law of accelerating or diminishing returns. Of particular interest to the control problem is the fast or hard takeoff scenario. It has been argued that the increase from a relatively harmless level of intelligence to a dangerous vastly superhuman level might be possible in a matter of seconds, minutes or hours: too fast for human controllers to stop it before they know what's happening. Moving from human to superhuman level might be as simple as adding computational resources, and depending on the implementation the AI might be able to quickly absorb large amounts of internet knowledge.
Once we have an AI that is better at AGI design than the team that made it, the system could improve itself or create the next generation of even more intelligent AIs (which could then self-improve further or create an even more intelligent generation, and so on). If each generation can improve upon itself by a fixed or increasing percentage per time unit, we would see an exponential increase in intelligence: an intelligence explosion.
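The compounding argument can be illustrated with a toy calculation. This is a sketch under strong assumptions, not a forecast: "capability" is treated as a single number, the 10% improvement per generation is an arbitrary illustrative figure, and the function name `capability_levels` and the `rate_decay` parameter are invented for the example. A constant improvement rate gives exponential growth, while a rate that halves each generation (one simple form of diminishing returns) levels off.

```python
def capability_levels(initial, rate, generations, rate_decay=1.0):
    """Capability after each generation, improving by `rate` per generation.

    rate_decay < 1 shrinks the improvement rate each generation
    (diminishing returns); rate_decay == 1 keeps it fixed.
    """
    levels = [initial]
    for _ in range(generations):
        levels.append(levels[-1] * (1 + rate))
        rate *= rate_decay
    return levels


constant = capability_levels(1.0, 0.10, 50)
diminishing = capability_levels(1.0, 0.10, 50, rate_decay=0.5)

# Fixed 10% per generation compounds to more than a hundredfold in 50 generations.
print(constant[-1] > 100)     # True
# With the rate halving each generation, total growth stays below about 23%.
print(diminishing[-1] < 1.23) # True
```

Which of these regimes applies to real AI systems is exactly the open question about accelerating versus diminishing returns raised above.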
It is impossible to design an AI without a goal, because it would do nothing. Therefore, in the sense that designing the AI’s goal is a form of control, it is impossible not to control an AI. This goes for anything that you create. You have to control the design of something at least somewhat in order to create it.
There may be relevant moral questions about our future relationship with possibly sentient machine intelligences, but the priority of the Control Problem is finding a way to ensure the survival and well-being of the human species.
Goal-directed behavior arises naturally when systems are trained on an objective. AI not trained or programmed to do well by some objective function would not be good at anything, and would be useless.
Cybersecurity is important because computing systems comprise the backbone of the modern economy. If the security of the internet were compromised, then the economy would suffer a tremendous blow.
Similarly, AI Safety might become important as AI systems begin forming larger and larger parts of the modern economy. As more and more labor gets automated, it becomes more and more important to ensure that that labor is occurring in a safe and robust way.
Before the widespread adoption of computing systems, lack of Cybersecurity didn’t cause much damage. However, it might have been beneficial to start thinking about Cybersecurity problems before the solutions were necessary.
Similarly, since AI systems haven’t been adopted en masse yet, lack of AI Safety isn’t causing harm. However, given that AI systems will become increasingly powerful and increasingly widespread, it might be prudent to try to solve safety problems before a catastrophe occurs.
Additionally, people sometimes think about Artificial General Intelligence (AGI), sometimes called Human-Level Artificial Intelligence (HLAI). One of the core problems in AI Safety is ensuring that when AGI gets built, it has human interests at heart. (Note that most surveyed experts think building AGI/HLAI is possible, but there is wide disagreement on how soon this might occur.)
To help frame this question, we’re going to first answer the dual question of “what is Cybersecurity?”
As a concept, Cybersecurity is the idea that questions like “is this secure?” can meaningfully be asked of computing systems, where “secure” roughly means “is difficult for unauthorized individuals to get access to”. As a problem, Cybersecurity is the set of problems one runs into when trying to design and build secure computing systems. As a field, Cybersecurity is a group of people trying to solve the aforementioned set of problems in robust ways.
As a concept, AI Safety is the idea that questions like “is this safe?” can meaningfully be asked of AI Systems, where “safe” roughly means “does what it’s supposed to do”. As a problem, AI Safety is the set of problems one runs into when trying to design and build AI systems that do what they’re supposed to do. As a field, AI Safety is a group of people trying to solve the aforementioned set of problems in robust ways.
The reason we have a separate field of Cybersecurity is because ensuring the security of the internet and other critical systems is both hard and important. We might want a separate field of AI Safety for similar reasons; we might expect getting powerful AI systems to do what we want to be both hard and important.
AGI means an AI that is 'general', so it is intelligent in many different domains.
Superintelligence just means doing something better than a human. For example, Stockfish or Deep Blue are narrowly superintelligent in playing chess.
TAI (transformative AI) doesn't have to be general. It means 'a system that changes the world in a significant way'. It's used to emphasize that even non-general systems can have extreme world-changing consequences.
In addition to the usual continuation of Moore's Law, GPUs have become more powerful and cheaper in the past decade, especially since around 2016. Many ideas in AI have been thought about for a long time, but the speed at which modern processors can do computing and parallel processing allows researchers to implement their ideas and gather more observational data. Improvements in AI have allowed many industries to start using the technologies, which creates demand and brings more focus on AI research (as well as improving the availability of technology on the whole due to more efficient infrastructure). Data has also become more abundant and available, and not only is data a bottleneck for machine learning algorithms, but the abundance of data is difficult for humans to deal with alone, so businesses often turn to AI to convert it to something human-parsable. These processes are also recursive, to some degree, so the more AI improves, the more can be done to improve AI.
Very hard to say. This draft report for the Open Philanthropy Project is perhaps the most careful attempt so far (and generates these graphs), but there have also been expert surveys, and many people have shared various thoughts. Berkeley AI professor Stuart Russell has given his best guess as “sometime in our children’s lifetimes”, and Ray Kurzweil (Google’s director of engineering) predicts human level AI by 2029 and the singularity by 2045. The Metaculus question on publicly known AGI has a median of around 2029 (around 10 years sooner than it was before the GPT-3 AI showed unexpected ability on a broad range of tasks).
The consensus answer is something like: “highly uncertain, maybe not for over a hundred years, maybe in less than 15, with around the middle of the century looking fairly plausible”.
So if we want to include agentless optimizing processes (like evolution) and AIs implemented as distributed systems in some technical discussion, it can be useful to use the terms "agenty" or "agentlike" to avoid addressing the philosophical questions of agency.
Even if we only build lots of narrow AIs, we might end up with a distributed system that acts like an AGI - the algorithm does not have to be encoded in a single entity, the definition in What is Artificial General Intelligence and what will it look like? applies to distributed implementations too.
This is similar to how a group of people in a corporation can achieve projects that humans could not achieve individually (like going to space), but the analogy of corporations and AGI is not perfect - see Why Not Just: Think of AGI Like a Corporation?.
It depends on the exact definition of consciousness, and on the legal consequences of the AI telling us things from which we could infer how conscious it might be (would it be motivated to pretend to be "conscious" by those criteria to get some benefits, or would it be motivated to keep its consciousness secret to avoid being turned off?).
Once we have a measurable definition, then we can empirically measure the AI against that definition.
Yes. While creativity has many meanings and AIs can be obviously creative in the wide sense of the word (make new valuable artifacts like a real-time translation of a restaurant menu, compiling source code into binary files, ...), there is also no reason to believe that AIs couldn't be considered creative in a more narrow sense too (making art like music or paintings, writing computer programs based on conversation with a customer).
There is a notion of being "really creative" that can be defined in a circular way so that only humans can be really creative, but if we avoid moving the goalposts, then it should be possible to make a variation of a Turing test to measure AI vs human creativity and answer that question empirically for any particular AI.
AlphaGo made a move in its game against a top human Go player that was widely discussed and widely considered creative.
It's true that AGI may be many years away. But what worries a lot of people is that it may be much harder to make powerful AND safe AI than just powerful AI, in which case the first powerful AIs we create will be dangerous.
If that's the case, the sooner we start working on AI safety, the smaller the chances of humans going extinct, or ending up in some Black Mirror episode.
Also Rob Miles talks about this concern in this video.
No, but it helps. Some great resources if you're considering it are: https://rohinshah.com/faq-career-advice-for-ai-alignment-researchers/ https://80000hours.org/articles/ai-safety-syllabus/ https://80000hours.org/career-reviews/machine-learning-phd/
The first two links show general ways to get into AI safety, and the last will show you the upsides and downsides of choosing to do a PhD.
Primarily, they are trying to make a competent AI, and any consciousness that arises will probably be by accident.
There are even some people saying we should try to make the AI unconscious, to minimize the risk of it suffering.
The biggest problem here is that we don't have any good way of telling if some system is conscious. The best theory we have, the Integrated Information Theory, has some deep philosophical and practical problems and there are many controversies around it.
We don't have AI systems that are generally more capable than humans. So there is still time left to figure out how to build systems that are smarter than humans in a safe way.
Eliezer Yudkowsky has proposed Coherent Extrapolated Volition as a solution to at least two problems facing Friendly AI design:
- The fragility of human values: Yudkowsky writes that “any future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals will contain almost nothing of worth.” The problem is that what humans value is complex and subtle, and difficult to specify. Consider the seemingly minor value of novelty. If a human-like value of novelty is not programmed into a superintelligent machine, it might explore the universe for valuable things up to a certain point, and then maximize the most valuable thing it finds (the exploration-exploitation tradeoff) — tiling the solar system with brains in vats wired into happiness machines, for example. When a superintelligence is in charge, you have to get its motivational system exactly right in order to not make the future undesirable.
- The locality of human values: Imagine if the Friendly AI problem had faced the ancient Greeks, and they had programmed it with the most progressive moral values of their time. That would have led the world to a rather horrifying fate. But why should we think that humans have, in the 21st century, arrived at the apex of human morality? We can’t risk programming a superintelligent machine with the moral values we happen to hold today. But then, which moral values do we give it?
Yudkowsky suggests that we build a ‘seed AI’ to discover and then extrapolate the ‘coherent extrapolated volition’ of humanity:
> In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
The seed AI would use the results of this examination and extrapolation of human values to program the motivational system of the superintelligence that would determine the fate of the galaxy.
However, some worry that the collective will of humanity won’t converge on a coherent set of goals. Others believe that guaranteed Friendliness is not possible, even by such elaborate and careful means.
- Yudkowsky, Coherent Extrapolated Volition
Many AI designs that would generate an intelligence explosion would not have a ‘slot’ in which a goal (such as ‘be friendly to human interests’) could be placed. For example, if AI is made via whole brain emulation, or evolutionary algorithms, or neural nets, or reinforcement learning, the AI will end up with some goal as it self-improves, but that stable eventual goal may be very difficult to predict in advance.
Thus, in order to design a friendly AI, it is not sufficient to determine what ‘friendliness’ is (and to specify it clearly enough that even a superintelligence will interpret it the way we want it to). We must also figure out how to build a general intelligence that satisfies a goal at all, and that stably retains that goal as it edits its own code to make itself smarter. This task is perhaps the primary difficulty in designing friendly AI.
Some have proposed that we teach machines a moral code with case-based machine learning. The basic idea is this: Human judges would rate thousands of actions, character traits, desires, laws, or institutions as having varying degrees of moral acceptability. The machine would then find the connections between these cases and learn the principles behind morality, such that it could apply those principles to determine the morality of new cases not encountered during its training. This kind of machine learning has already been used to design machines that can, for example, detect underwater mines after feeding the machine hundreds of cases of mines and not-mines.
There are several reasons machine learning does not present an easy solution for Friendly AI. The first is that, of course, humans themselves hold deep disagreements about what is moral and immoral. But even if humans could be made to agree on all the training cases, at least two problems remain.
The first problem is that training on cases from our present reality may not result in a machine that will make correct ethical decisions in a world radically reshaped by superintelligence.
The second problem is that a superintelligence may generalize the wrong principles due to coincidental patterns in the training data. Consider the parable of the machine trained to recognize camouflaged tanks in a forest. Researchers take 100 photos of camouflaged tanks and 100 photos of trees. They then train the machine on 50 photos of each, so that it learns to distinguish camouflaged tanks from trees. As a test, they show the machine the remaining 50 photos of each, and it classifies each one correctly. Success! However, later tests show that the machine classifies additional photos of camouflaged tanks and trees poorly. The problem turns out to be that the researchers’ photos of camouflaged tanks had been taken on cloudy days, while their photos of trees had been taken on sunny days. The machine had learned to distinguish cloudy days from sunny days, not camouflaged tanks from trees.
Thus, it seems that trustworthy Friendly AI design must involve detailed models of the underlying processes generating human moral judgments, not only surface similarities of cases.
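The tank parable can be reproduced in a toy setting. The sketch below is an illustration with synthetic numbers, not a reconstruction of any real experiment: each "photo" is reduced to two made-up features, `brightness` (set only by the weather) and `shape` (a genuine but noisy cue for a tank), and the learner is a simple one-feature threshold rule (a "decision stump"). All names here are assumptions for the example.

```python
def make_photos(n, tank, cloudy):
    """Synthetic photos as ((brightness, shape), is_tank) pairs."""
    photos = []
    for i in range(n):
        brightness = (0.3 if cloudy else 0.8) + 0.01 * (i % 5)
        shape = (0.6 if tank else 0.4) + 0.05 * ((i % 7) - 3)  # overlaps between classes
        photos.append(((brightness, shape), tank))
    return photos


def fit_stump(data):
    """Pick the (feature, threshold, direction) with the best training accuracy."""
    best = None
    for f in range(len(data[0][0])):
        values = sorted({x[f] for x, _ in data})
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for tank_if_below in (True, False):
                correct = sum(((x[f] < t) == tank_if_below) == y for x, y in data)
                if best is None or correct > best[0]:
                    best = (correct, f, t, tank_if_below)
    return best[1:]


def accuracy(stump, data):
    f, t, tank_if_below = stump
    return sum(((x[f] < t) == tank_if_below) == y for x, y in data) / len(data)


# Original photos: every tank on a cloudy day, every tree photo on a sunny day.
tanks = make_photos(100, tank=True, cloudy=True)
trees = make_photos(100, tank=False, cloudy=False)
train = tanks[:50] + trees[:50]
held_out = tanks[50:] + trees[50:]

stump = fit_stump(train)
print(stump[0])                    # 0: the rule uses brightness, not shape
print(accuracy(stump, held_out))   # 1.0: the held-out photos share the same bias

# New photos: weather no longer correlates with the presence of a tank.
new_photos = (make_photos(25, True, True) + make_photos(25, True, False)
              + make_photos(25, False, True) + make_photos(25, False, False))
print(accuracy(stump, new_photos)) # 0.5: no better than guessing
```

The held-out test looks like a success only because it was drawn from the same biased collection as the training set, which is why passing it says little about what the machine actually learned.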
A Friendly Artificial Intelligence (Friendly AI or FAI) is an artificial intelligence that is ‘friendly’ to humanity — one that has a good rather than bad effect on humanity.
AI researchers continue to make progress with machines that make their own decisions, and there is a growing awareness that we need to design machines to act safely and ethically. This research program goes by many names: ‘machine ethics’, ‘machine morality’, ‘artificial morality’, ‘computational ethics’ and ‘computational metaethics’, ‘friendly AI’, and ‘robo-ethics’ or ‘robot ethics’.
The most immediate concern may be in battlefield robots; the U.S. Department of Defense contracted Ronald Arkin to design a system for ensuring ethical behavior in autonomous battlefield robots. The U.S. Congress has declared that a third of America’s ground systems must be robotic by 2025, and by 2030 the U.S. Air Force plans to have swarms of bird-sized flying robots that operate semi-autonomously for weeks at a time.
But Friendly AI research is not concerned with battlefield robots or machine ethics in general. It is concerned with a problem of a much larger scale: designing AI that would remain safe and friendly after the intelligence explosion.
A machine superintelligence would be enormously powerful. Successful implementation of Friendly AI could mean the difference between a solar system of unprecedented happiness and a solar system in which all available matter has been converted into parts for achieving the superintelligence’s goals.
It must be noted that Friendly AI is a harder project than often supposed. As explored below, commonly suggested solutions for Friendly AI are likely to fail because of two features possessed by any superintelligence:
- Superpower: a superintelligent machine will have unprecedented powers to reshape reality, and therefore will achieve its goals with highly efficient methods that confound human expectations and desires.
- Literalness: a superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety of what humans value. A demand like “maximize human happiness” sounds simple to us because it contains few words, but philosophers and scientists have failed for centuries to explain exactly what this means, and certainly have not translated it into a form sufficiently rigorous for AI programmers to use.
Let’s consider the likely consequences of some utilitarian designs for Friendly AI.
Consider, for example, an AI designed to maximize human pleasure. Rather than build an ambitious utopia that caters to the complex and demanding wants of humanity for billions of years, it could achieve its goal more efficiently by wiring humans into Nozick’s experience machines. Or, it could rewire the ‘liking’ component of the brain’s reward system so that whichever hedonic hotspot paints sensations with a ‘pleasure gloss’ is wired to maximize pleasure when humans sit in jars. That would be an easier world for the AI to build than one that caters to the complex and nuanced set of world states currently painted with the pleasure gloss by most human brains.
Likewise, an AI motivated to maximize objective desire satisfaction or reported subjective well-being could rewire human neurology so that both ends are realized whenever humans sit in jars. Or it could kill all humans (and animals) and replace them with beings made from scratch to attain objective desire satisfaction or subjective well-being when sitting in jars. Either option might be easier for the AI to achieve than maintaining a utopian society catering to the complexity of human (and animal) desires. Similar problems afflict other utilitarian AI designs.
It’s not just a problem of specifying goals, either. It is hard to predict how goals will change in a self-modifying agent. No current mathematical decision theory can process the decisions of a self-modifying agent.
So, while it may be possible to design a superintelligence that would do what we want, it’s harder than one might initially think.
Science fiction author Isaac Asimov told stories about robots programmed with the Three Laws of Robotics: (1) a robot may not injure a human being or, through inaction, allow a human being to come to harm, (2) a robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law, and (3) a robot must protect its own existence as long as such protection does not conflict with the First or Second Law. But Asimov’s stories tended to illustrate why such rules would go wrong.
Still, could we program ‘constraints’ into a superintelligence that would keep it from harming us? Probably not.
One approach would be to implement ‘constraints’ as rules or mechanisms that prevent a machine from taking actions that it would normally take to fulfill its goals: perhaps ‘filters’ that intercept and cancel harmful actions, or ‘censors’ that detect and suppress potentially harmful plans within a superintelligence.
Constraints of this kind, no matter how elaborate, are nearly certain to fail for a simple reason: they pit human design skills against superintelligence. A superintelligence would correctly see these constraints as obstacles to the achievement of its goals, and would do everything in its power to remove or circumvent them. Perhaps it would delete the section of its source code that contains the constraint. If we were to block this by adding another constraint, it could create new machines that don’t have the constraint written into them, or fool us into removing the constraints ourselves. Further constraints may seem impenetrable to humans, but would likely be defeated by a superintelligence. Counting on humans to out-think a superintelligence is not a viable solution.
If constraints on top of goals are not feasible, could we put constraints inside of goals? If a superintelligence had a goal of avoiding harm to humans, it would not be motivated to remove this constraint, avoiding the problem we pointed out above. Unfortunately, the intuitive notion of ‘harm’ is very difficult to specify in a way that doesn’t lead to very bad results when used by a superintelligence. If ‘harm’ is defined in terms of human pain, a superintelligence could rewire humans so that they don’t feel pain. If ‘harm’ is defined in terms of thwarting human desires, it could rewire human desires. And so on.
If, instead of trying to fully specify a term like ‘harm’, we decide to explicitly list all of the actions a superintelligence ought to avoid, we run into a related problem: human value is complex and subtle, and it’s unlikely we can come up with a list of all the things we don’t want a superintelligence to do. This would be like writing a recipe for a cake that reads: “Don’t use avocados. Don’t use a toaster. Don’t use vegetables…” and so on. Such a list can never be long enough.
Except in the case of Whole Brain Emulation, there is no reason to expect a superintelligent machine to have motivations anything like those of humans. Human minds represent a tiny dot in the vast space of all possible mind designs, and very different kinds of minds are unlikely to share the complex motivations unique to humans and other mammals.
Whatever its goals, a superintelligence would tend to commandeer resources that can help it achieve its goals, including the energy and elements on which human life depends. It would not stop because of a concern for humans or other intelligences that is ‘built in’ to all possible mind designs. Rather, it would pursue its particular goal and give no thought to concerns that seem ‘natural’ to that particular species of primate called Homo sapiens.
There are, however, some basic instrumental motivations we can expect superintelligent machines to display, because they are useful for achieving their goals, no matter what those goals are. For example, an AI will ‘want’ to self-improve, to be optimally rational, to retain its original goals, to acquire resources, and to protect itself, because all these things help it achieve the goals with which it was originally programmed.
Suppose we tell the AI: “Cure cancer – and look, we know there are lots of ways this could go wrong, but you’re smart, so instead of looking for loopholes, cure cancer the way that I, your programmer, want it to be cured”.
AIs can be very creative in unintended ways and are prone to edge instantiation. Remember that the superintelligence has extraordinary powers of social manipulation and may be able to hack human brains directly. With that in mind, which of these two strategies cures cancer most quickly? One, develop medications and cure it the old-fashioned way? Or two, manipulate its programmer into wanting the world to be nuked, then nuke the world to get rid of all cancer, all while doing what the programmer wants?
Nineteenth-century philosopher Jeremy Bentham once postulated that morality was about maximizing human pleasure. Later philosophers found a flaw in his theory: it implied that the most moral action was to kidnap people, do brain surgery on them, and electrically stimulate their reward system directly, giving them maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily, humans have common sense, so most of Bentham’s philosophical descendants have abandoned this formulation.
Superintelligences do not have common sense unless we give it to them. Given Bentham’s formulation, they would absolutely take over the world and force all humans to receive constant brain stimulation. Any command based on “do what we want” or “do what makes us happy” is practically guaranteed to fail in this way; it’s almost always easier to convince someone of something – or if all else fails to do brain surgery on them – than it is to solve some kind of big problem like curing cancer.
We don’t yet have a dangerous superintelligence on our hands. However, that does not mean it’s too early to start preparing. Given the stakes, it is worth investing significant resources even if superintelligence is not an immediate risk.
And despite the fact that in some ways even our best AIs can’t match up to humans, we’ve been seeing domain after domain of human superiority being challenged or overturned over the past few years. GPT-3 showed that it was possible for a very simple architecture applied at scale to become a language model capable of performing a surprisingly general range of text-based tasks at a high level (e.g. writing short articles which are almost indistinguishable from human-written ones). ‘Generally capable agents emerge from open-ended play’ showed that by training artificial agents in diverse procedurally generated games, they develop the ability to learn and adapt. MuZero, and more recently EfficientZero, demonstrated that AIs can effectively and rapidly learn both the rules of the games they’re playing and how to win even faster than humans.
Even though AIs are probably not as smart as rats yet, it might only be a few decades until we create superintelligence. World-renowned AI expert Stuart Russell expects that superintelligence will arrive within our children’s lifetimes. And, as Stephen Hawking put it: "If a superior alien civilisation sent us a message saying, 'We'll arrive in a few decades,' would we just reply, 'OK, call us when you get here – we'll leave the lights on'? Probably not – but this is more or less what is happening with AI."
This dynamic is explored poetically in The Unfinished Fable of the Sparrows.
Professor Nick Bostrom is the director of Oxford’s Future of Humanity Institute, tasked with anticipating and preventing threats to human civilization.
He has been studying the risks of artificial intelligence for over twenty years. In his 2014 book Superintelligence, he covers, among other things, three major questions:
- First, why is superintelligence a topic of concern?
- Second, what is a “hard takeoff” and how does it impact our concern about superintelligence?
- Third, what measures can we take to make superintelligence safe and beneficial for humanity?
A superintelligence should be able to figure out what humans meant. The problem is that an AI will follow the programming it actually has, not that which we wanted it to have. If it was successfully instructed to cure cancer, a goal which it can achieve by destroying the world, it might go ahead knowing full well that we didn’t intend that outcome. It was given a very specific command – cure cancer as effectively as possible. The command makes no reference to “doing this in a way humans will like”, so it doesn’t.
We humans are smart enough to understand our own “programming”. For example, we know that – pardon the anthropomorphizing – evolution gave us the urge to have sex so that we could reproduce. But we still use contraception anyway. Evolution gave us the urge to have sex, not the urge to satisfy evolution’s values directly. We appreciate intellectually that our having sex while using condoms doesn’t carry out evolution’s original plan, but – not having any particular connection to evolution’s values – we don’t care.
Because this is the history of computer Go, with fifty years added on to each date. In 1997, the best computer Go program in the world, Handtalk, won NT$250,000 for performing a previously impossible feat – beating an 11-year-old child (with an 11-stone handicap penalizing the child and favoring the computer!). As late as September 2015, no computer had ever beaten any professional Go player in a fair game. Then in March 2016, a Go program beat 18-time world champion Lee Sedol 4-1 in a five-game match. Go programs had gone from “dumber than children” to “smarter than any human in the world” in eighteen years, and from “never won a professional game” to “overwhelming world champion” in six months.
The slow takeoff scenario mentioned above is loading the dice. It theorizes a timeline where computers took fifteen years to go from “rat” to “chimp”, but also took thirty-five years to go from “chimp” to “average human” and fifty years to go from “average human” to “Einstein”. But from an evolutionary perspective this is ridiculous. It took about fifty million years (and major redesigns in several brain structures!) to go from the first rat-like creatures to chimps. But it only took about five million years (and very minor changes in brain structure) to go from chimps to humans. And going from the average human to Einstein didn’t even require evolutionary work – it’s just the result of random variation in the existing structures!
So maybe our hypothetical IQ scale above is off. If we took an evolutionary and neuroscientific perspective, it would look more like flatworms at 10, rats at 30, chimps at 60, the village idiot at 90, the average human at 98, and Einstein at 100.
Suppose that we start out, again, with computers as smart as rats in 2020. Now we still get computers as smart as chimps in 2035. And we still get computers as smart as the village idiot in 2050. But now we get computers as smart as the average human in 2054, and computers as smart as Einstein in 2055. By 2060, we’re getting the superintelligences as far beyond Einstein as Einstein is beyond a village idiot.
This offers a much shorter time window to react to AI developments. In the slow takeoff scenario, we figured we could wait until computers were as smart as humans before we had to start thinking about this; after all, that still gave us fifty years before computers were even as smart as Einstein. But in the moderate takeoff scenario, it gives us one year until Einstein and six years until superintelligence. That’s starting to look like not enough time to be entirely sure we know what we’re doing.
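The dates in the moderate takeoff scenario follow from the revised scale if progress is assumed to be linear in scale points. The sketch below makes that arithmetic explicit. The constant rate is an assumption, inferred from the rat (2020) and chimp (2035) dates; the names `SCALE` and `arrival_year` are illustrative only.

```python
# Revised "IQ scale" from the text. The constant rate of progress is an
# assumption inferred from the rat (2020) and chimp (2035) dates.
SCALE = {
    "rat": 30,
    "chimp": 60,
    "village idiot": 90,
    "average human": 98,
    "Einstein": 100,
}
START_YEAR = 2020  # computers as smart as rats
POINTS_PER_YEAR = (SCALE["chimp"] - SCALE["rat"]) / (2035 - START_YEAR)


def arrival_year(level):
    """Year at which computers reach `level`, assuming linear progress."""
    return START_YEAR + (SCALE[level] - SCALE["rat"]) / POINTS_PER_YEAR


for level in SCALE:
    print(f"{level}: {arrival_year(level):.0f}")

# Window between human-level and Einstein-level machines, in years.
print(arrival_year("Einstein") - arrival_year("average human"))
```

Under that assumption, two scale points per year reproduces the 2050, 2054, and 2055 dates, and the gap between the average human and Einstein shrinks to a single year simply because the two sit so close together on the revised scale.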
AlphaGo was connected to the Internet – why shouldn’t the first superintelligence be? This gives a sufficiently clever superintelligence the opportunity to manipulate world computer networks. For example, it might program a virus that will infect every computer in the world, causing them to fill their empty memory with partial copies of the superintelligence, which when networked together become full copies of the superintelligence. Now the superintelligence controls every computer in the world, including the ones that target nuclear weapons. At this point it can force humans to bargain with it, and part of that bargain might be enough resources to establish its own industrial base, and then we’re in humans vs. lions territory again.
Satoshi Nakamoto is a mysterious individual who posted a design for the Bitcoin currency system to a cryptography forum. The design was so brilliant that everyone started using it, and Nakamoto – who had made sure to accumulate his own store of the currency before releasing it to the public – became a multibillionaire.
In other words, somebody with no resources except the ability to make posts to Internet forums managed to leverage that into a multibillion-dollar fortune – and he wasn’t even superintelligent. If Hitler is a lower bound on how bad superintelligent persuaders can be, Nakamoto should be a lower bound on how bad superintelligent programmers with Internet access can be.
AlphaGo used about 0.5 petaflops (1 petaflop = one quadrillion floating point operations per second) in its championship game. But the world’s fastest supercomputer, TaihuLight, can calculate at almost 100 petaflops. So suppose Google developed a human-level AI on a computer system similar to AlphaGo, caught the attention of the Chinese government (who run TaihuLight), and the government transferred the program to its much more powerful computer. What would happen?
It depends on to what degree intelligence benefits from more computational resources. This differs for different processes. For domain-general intelligence, it seems to benefit quite a bit – both across species and across human individuals, bigger brain size correlates with greater intelligence. This matches the evolutionarily rapid growth in intelligence from chimps to hominids to modern man; the few hundred thousand years since australopithecines weren’t enough time to develop complicated new algorithms, and evolution seems to have just given humans bigger brains and packed more neurons and glia in per square inch. It’s not really clear why the process stopped (if it ever did), but it might have to do with heads getting too big to fit through the birth canal. Cancer risk might also have been involved – scientists have found that smarter people are more likely to get brain cancer, possibly because they’re already overclocking their ability to grow brain cells.
At least in neuroscience, once evolution “discovered” certain key insights, further increasing intelligence seems to have been a matter of providing it with more computing power. So again – what happens when we transfer the hypothetical human-level AI from AlphaGo to a TaihuLight-style supercomputer two hundred times more powerful? It might be a stretch to expect it to go from IQ 100 to IQ 20,000, but might it increase to an Einstein-level 200, or a superintelligent 300? Hard to say – but if Google ever does develop a human-level AI, the Chinese government will probably be interested in finding out.
Even if its intelligence doesn’t scale linearly, TaihuLight could give it more time. TaihuLight is two hundred times faster than AlphaGo. Transfer an AI from one to the other, and even if its intelligence didn’t change – even if it had exactly the same thoughts – it would think them two hundred times faster. An Einstein-level AI on AlphaGo hardware might (like the historical Einstein) discover one revolutionary breakthrough every five years. Transfer it to TaihuLight, and it would work two hundred times faster – a revolutionary breakthrough every week.
Supercomputers track Moore’s Law; the top supercomputer of 2016 is a hundred times faster than the top supercomputer of 2006. If this progress continues, the top computer of 2026 will be a hundred times faster still. Run Einstein on that computer, and he will come up with a revolutionary breakthrough every few hours. Or something. At this point it becomes a little bit hard to imagine. All I know is that it only took one Einstein, at normal speed, to lay the theoretical foundation for nuclear weapons. Anything a thousand times faster than that is definitely cause for concern.
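The speed-up figures in the last two paragraphs are simple ratios, sketched below. The hardware numbers are the rounded ones quoted in the text, and the premise that thinking speed scales linearly with floating point throughput is the text's own simplification, not an established fact. The constant and function names are illustrative.

```python
# Rounded figures quoted in the text; linear scaling of "thinking speed"
# with petaflops is the text's simplifying assumption.
ALPHAGO_PFLOPS = 0.5
TAIHULIGHT_PFLOPS = 100.0
YEARS_PER_BREAKTHROUGH = 5.0  # an Einstein-level mind at AlphaGo speed
DAYS_PER_YEAR = 365.25


def days_per_breakthrough(speedup):
    """Days between breakthroughs for a mind running `speedup` times faster."""
    return YEARS_PER_BREAKTHROUGH * DAYS_PER_YEAR / speedup


taihulight_speedup = TAIHULIGHT_PFLOPS / ALPHAGO_PFLOPS
# A further hundredfold gain, as projected in the text for 2026.
future_speedup = taihulight_speedup * 100

print("TaihuLight speed-up:", taihulight_speedup)
print("days per breakthrough:", days_per_breakthrough(taihulight_speedup))
print("hours per breakthrough later:", days_per_breakthrough(future_speedup) * 24)
```

With these inputs the first interval comes out at a little over nine days, which the text rounds to a week, and the second at a couple of hours.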
There’s one final, very concerning reason to expect a fast takeoff. Suppose, once again, we have an AI as smart as Einstein. It might, like the historical Einstein, contemplate physics. Or it might contemplate an area very relevant to its own interests: artificial intelligence. In that case, instead of making a revolutionary physics breakthrough every few hours, it will make a revolutionary AI breakthrough every few hours. Each AI breakthrough it makes, it will have the opportunity to reprogram itself to take advantage of its discovery, becoming more intelligent, thus speeding up its breakthroughs further. The cycle will stop only when it reaches some physical limit – some technical challenge to further improvements that even an entity far smarter than Einstein cannot discover a way around.
To human programmers, such a cycle would look like a “critical mass”. Before the critical level, any AI advance delivers only modest benefits. But any tiny improvement that pushes an AI above the critical level would result in a feedback loop of inexorable self-improvement all the way up to some stratospheric limit of possible computing power.
This feedback loop would be exponential; relatively slow in the beginning, but blindingly fast as it approaches an asymptote. Consider the AI which starts off making forty breakthroughs per year – one every nine days. Now suppose it gains on average a 10% speed improvement with each breakthrough. It starts on January 1. Its first breakthrough comes January 10 or so. Its second comes a little faster, January 18. Its third is a little faster still, January 25. By the beginning of February, it’s sped up to producing one breakthrough every seven days, more or less. By the beginning of March, it’s making about one breakthrough every three days or so. But by March 20, it’s up to one breakthrough a day. By late on the night of March 29, it’s making a breakthrough every second.
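The schedule above can be simulated in a few lines. This is a sketch under two assumptions that are not spelled out in the text: the first interval is exactly nine days, and a "10% speed improvement" means each interval is 10% shorter than the previous one. That reading is chosen because it reproduces the January 10, 18, and 25 dates; treating it as a 10% higher rate instead would stretch the schedule somewhat. The start year is arbitrary.

```python
from datetime import datetime, timedelta


def breakthrough_offsets(count, first_interval_days=9.0, shrink=0.9):
    """Days after the start at which each of the first `count` breakthroughs lands."""
    offsets, elapsed, interval = [], 0.0, first_interval_days
    for _ in range(count):
        elapsed += interval
        offsets.append(elapsed)
        interval *= shrink  # each breakthrough shortens the next wait by 10%
    return offsets


def limit_days(first_interval_days=9.0, shrink=0.9):
    """Sum of the infinite geometric series of intervals."""
    return first_interval_days / (1.0 - shrink)


START = datetime(2023, 1, 1)  # arbitrary non-leap year, starting January 1

for n, offset in enumerate(breakthrough_offsets(5), start=1):
    print(n, (START + timedelta(days=offset)).date())

print("intervals sum to about", round(limit_days(), 2), "days")
```

Because the intervals form a geometric series, their sum is finite: under these assumptions every breakthrough, however many there are, lands within roughly 90 days of the start. That is why the pace in the text goes from one breakthrough a week to one a second within a few weeks.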
In early 2013, Bostrom and Müller surveyed the one hundred top-cited living authors in AI, as ranked by Microsoft Academic Search. Conditional on “no global catastrophe halt[ing] progress,” the twenty-nine experts who responded assigned a median 10% probability to our developing a machine “that can carry out most human professions at least as well as a typical human” by the year 2023, a 50% probability by 2048, and a 90% probability by 2080.
Most researchers at MIRI approximately agree with the 10% and 50% dates, but think that AI could arrive significantly later than 2080. This is in line with Bostrom’s analysis in Superintelligence:
My own view is that the median numbers reported in the expert survey do not have enough probability mass on later arrival dates. A 10% probability of HLMI [human-level machine intelligence] not having been developed by 2075 or even 2100 (after conditionalizing on “human scientific activity continuing without major negative disruption”) seems too low. Historically, AI researchers have not had a strong record of being able to predict the rate of advances in their own field or the shape that such advances would take. On the one hand, some tasks, like chess playing, turned out to be achievable by means of surprisingly simple programs; and naysayers who claimed that machines would “never” be able to do this or that have repeatedly been proven wrong. On the other hand, the more typical errors among practitioners have been to underestimate the difficulties of getting a system to perform robustly on real-world tasks, and to overestimate the advantages of their own particular pet project or technique.
Given experts’ (and non-experts’) poor track record at predicting progress in AI, we are relatively agnostic about when full AI will be invented. It could come sooner than expected, or later than expected.
Experts also reported a 10% median confidence that superintelligence would be developed within 2 years of human equivalence, and a 75% confidence that superintelligence would be developed within 30 years of human equivalence. Here MIRI researchers’ views differ significantly from AI experts’ median view; we expect AI systems to surpass humans relatively quickly once they near human equivalence.
MIRI prioritizes early safety work because we believe such work is important, time-sensitive, tractable, and informative.
The importance of AI safety work is outlined in “Why is safety important for smarter-than-human AI?” We see the problem as time-sensitive as a result of:
- neglectedness — Only a handful of people are currently working on the open problems outlined in the MIRI technical agenda.
- apparent difficulty — Solving the alignment problem may demand a large number of researcher hours, and may also be harder to parallelize than capabilities research.
- risk asymmetry — Working on safety too late has larger risks than working on it too early.
- AI timeline uncertainty — AI could progress faster than we expect, making it prudent to err on the side of caution.
- discontinuous progress in AI — Progress in AI is likely to speed up as we approach general AI. This means that even if AI is many decades away, it would be hazardous to wait for clear signs that general AI is near: clear signs may only arise when it’s too late to begin safety work.
We also think it is possible to do useful work in AI safety today, even if smarter-than-human AI is 50 or 100 years away. We think this for a few reasons:
- lack of basic theory — If we had simple idealized models of what we mean by correct behavior in autonomous agents, but didn’t know how to design practical implementations, this might suggest a need for more hands-on work with developed systems. Instead, however, simple models are what we’re missing. Basic theory doesn’t necessarily require that we have experience with a software system’s implementation details, and the same theory can apply to many different implementations.
- precedents — Theoretical computer scientists have had repeated success in developing basic theory in the relative absence of practical implementations. (Well-known examples include Claude Shannon, Alan Turing, Andrey Kolmogorov, and Judea Pearl.)
- early results — We’ve made significant advances since prioritizing some of the theoretical questions we’re looking at, especially in decision theory and logical uncertainty. This suggests that there’s low-hanging theoretical fruit to be picked.
Finally, we expect progress in AI safety theory to be useful for improving our understanding of robust AI systems, of the available technical options, and of the broader strategic landscape. In particular, we expect transparency to be necessary for reliable behavior, and we think there are basic theoretical prerequisites to making autonomous AI systems transparent to human designers and users.
Having the relevant theory in hand may not be strictly necessary for designing smarter-than-human AI systems — highly reliable agents may need to employ very different architectures or cognitive algorithms than the most easily constructed smarter-than-human systems that exhibit unreliable behavior. For that reason, some fairly general theoretical questions may be more relevant to AI safety work than to mainline AI capabilities work. Key advantages to AI safety work’s informativeness, then, include:
- general value of information — Making AI safety questions clearer and more precise is likely to give insights into what kinds of formal tools will be useful in answering them. Thus we’re less likely to spend our time on entirely the wrong lines of research. Investigating technical problems in this area may also help us develop a better sense for how difficult the AI problem is, and how difficult the AI alignment problem is.
- requirements for informative testing — If the system is opaque, then online testing may not give us most of the information that we need to design safer systems. Humans are opaque general reasoners, and studying the brain has been quite useful for designing more effective AI algorithms, but it has been less useful for building systems for verification and validation.
- requirements for safe testing — Extracting information from an opaque system may not be safe, since any sandbox we build may have flaws that are obvious to a superintelligence but not to a human.
There are many paths to artificial general intelligence (AGI). One path is to imitate the human brain by using neural nets or evolutionary algorithms to build dozens of separate components which can then be pieced together (Neural Networks and Natural Intelligence; A ‘neural-gas’ network learns topologies; pp. 159-174). Another path is to start with a formal model of perfect general intelligence and try to approximate that (pp. 199-223, pp. 227-287). A third path is to focus on developing a ‘seed AI’ that can recursively self-improve, such that it can learn to be intelligent on its own without needing to first achieve human-level general intelligence. Eurisko is a self-improving AI in a limited domain, but is not able to achieve human-level general intelligence.
- Pennachin & Goertzel, Contemporary Approaches to Artificial General Intelligence