|Main Question: Why might an AI do bad things?|
|Alignment Forum Tag|
Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition . This concept has also been discussed under the term basic drives.
The idea was first explored by Steve Omohundro. He argued that sufficiently advanced AI systems would all naturally discover similar instrumental subgoals. The view that there are important basic AI drives was subsequently defended by Nick Bostrom as the instrumental convergence thesis, or the convergent instrumental goals thesis. On this view, a few goals are instrumental to almost all possible final goals. Therefore, all advanced AIs will pursue these instrumental goals. Omohundro uses microeconomic theory by von Neumann to support this idea.
Omohundro presents two sets of values, one for self-improving artificial intelligences 1 and another he says will emerge in any sufficiently advanced AGI system 2. The former set is composed of four main drives:
- Self-preservation: A sufficiently advanced AI will probably be the best entity to achieve its goals. Therefore it must continue existing in order to maximize goal fulfillment. Similarly, if its goal system were modified, then it would likely begin pursuing different ends. Since this is not desirable to the current AI, it will act to preserve the content of its goal system.
- Efficiency: At any time, the AI will have finite resources of time, space, matter, energy and computational power. Using these more efficiently will increase its utility. This will lead the AI to do things like implement more efficient algorithms, physical embodiments, and particular mechanisms. It will also lead the AI to replace desired physical events with computational simulations as much as possible, to expend fewer resources.
- Acquisition: Resources like matter and energy are indispensable for action. The more resources the AI can control, the more actions it can perform to achieve its goals. The AI's physical capabilities are determined by its level of technology. For instance, if the AI could invent nanotechnology, it would vastly increase the actions it could take to achieve its goals.
- Creativity: The AI's operations will depend on its ability to come up with new, more efficient ideas. It will be driven to acquire more computational power for raw searching ability, and it will also be driven to search for better search algorithms. Omohundro argues that the drive for creativity is critical for the AI to display the richness and diversity that is valued by humanity. He discusses signaling goals as particularly rich sources of creativity.
Bostrom argues for an orthogonality thesis: But he also argues that, despite the fact that values and intelligence are independent, any recursively self-improving intelligence would likely possess a particular set of instrumental values that are useful for achieving any kind of terminal value.3 On his view, those values are:
- Self-preservation: A superintelligence will value its continuing existence as a means to to continuing to take actions that promote its values.
- Goal-content integrity: The superintelligence will value retaining the same preferences over time. Modifications to its future values through swapping memories, downloading skills, and altering its cognitive architecture and personalities would result in its transformation into an agent that no longer optimizes for the same things.
- Cognitive enhancement: Improvements in cognitive capacity, intelligence and rationality will help the superintelligence make better decisions, furthering its goals more in the long run.
- Technological perfection: Increases in hardware power and algorithm efficiency will deliver increases in its cognitive capacities. Also, better engineering will enable the creation of a wider set of physical structures using fewer resources (e.g., nanotechnology).
- Resource acquisition: In addition to guaranteeing the superintelligence's continued existence, basic resources such as time, space, matter and free energy could be processed to serve almost any goal, in the form of extended hardware, backups and protection.
Both Bostrom and Omohundro argue these values should be used in trying to predict a superintelligence's behavior, since they are likely to be the only set of values shared by most superintelligences. They also note that these values are consistent with safe and beneficial AIs as well as unsafe ones.
Bostrom emphasizes, however, that our ability to predict a superintelligence's behavior may be very limited even if it shares most intelligences' instrumental goals.
Yudkowsky echoes Omohundro's point that the convergence thesis is consistent with the possibility of Friendly AI. However, he also notes that the convergence thesis implies that most AIs will be extremely dangerous, merely by being indifferent to one or more human values:4
In some rarer cases, AIs may not pursue these goals. For instance, if there are two AIs with the same goals, the less capable AI may determine that it should destroy itself to allow the stronger AI to control the universe. Or an AI may have the goal of using as few resources as possible, or of being as unintelligent as possible. These relatively specific goals will limit the growth and power of the AI.
- Convergent instrumental strategies (Arbital)
- Instrumental convergence (Arbital)
- Orthogonality thesis
- Cox's theorem
- Unfriendly AI, Paperclip maximizer, Oracle AI
- Instrumental values
- Omohundro, S. (2007). The Nature of Self-Improving Artificial Intelligence.
- Omohundro, S. (2008). "The Basic AI Drives". Proceedings of the First AGI Conference.
- Omohundro, S. (2012). Rational Artificial Intelligence for the Greater Good.
- Bostrom, N. (2012). "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents". Minds and Machines.
- Shulman, C. (2010). Omohundro's "Basic AI Drives" and Catastrophic Risks.
- The Orthogonality Thesis: AI could have almost any goal while at the same time having high intelligence (aka ability to succeed at those goals). This means that we could build a very powerful agent which would not necessarily share human-friendly values. For example, the classic paperclip maximizer thought experiment explores this with an AI which has a goal of creating as many paperclips as possible, something that humans are (mostly) indifferent to, and as a side effect ends up destroying humanity to make room for more paperclip factories.
- Complexity of value: What humans care about is not simple, and the space of all goals is large, so virtually all goals we could program into an AI would lead to worlds not valuable to humans if pursued by a sufficiently powerful agent. If we, for example, did not include our value of diversity of experience, we could end up with a world of endlessly looping simple pleasures, rather than beings living rich lives.
- Instrumental Convergence: For almost any goal an AI has there are shared ‘instrumental’ steps, such as acquiring resources, preserving itself, and preserving the contents of its goals. This means that a powerful AI with goals that were not explicitly human-friendly would predictably both take actions that lead to the end of humanity (e.g. using resources humans need to live to further its goals, such as replacing our crop fields with vast numbers of solar panels to power its growth, or using the carbon in our bodies to build things) and prevent us from turning it off or altering its goals.
Even if an idea sounds pretty good to us right now, we can’t be very sure it has no potential flaws or loopholes. After all, other proposals that originally sounded very good, like “just give commands to the AI” and “just tell the AI to figure out what makes us happy” ended up, after more thought, to be dangerous.
Can we be sure that we’ve thought this through enough? Can we be sure that there isn’t some extremely subtle problem with it, so subtle that no human would ever notice it, but which might seem obvious to a superintelligence?
Second, how do we code this? Converting something to formal mathematics that can be understood by a computer program is much harder than just saying it in natural language, and proposed AI goal architectures are no exception. Complicated computer programs are usually the result of months of testing and debugging. But this one will be more complicated than any ever attempted before, and live tests are impossible: a superintelligence with a buggy goal system will display goal stability and try to prevent its programmers from discovering or changing the error.
We could shut down weaker systems, and this would be a useful guardrail against certain types of problem caused by narrow AI. However, once an AGI establishes itself, we could not unless it was corrigible and willing to let humans adjust it. There may be a period in the early stages of an AGI's development where it would be trying very hard to convince us that we should not shut it down and/or hiding itself and/or recursively self-improving and/or making copies of itself onto every server on earth.
Instrumental Convergence and the Stop Button Problem are the key reasons it would not be simple to shut down a non corrigible advanced system. If the AI wants to collect stamps, being turned off means it gets less stamps, so even without an explicit goal of not being turned off it has an instrumental reason to avoid being turned off (e.g. once it acquires a detailed world model and general intelligence, it is likely to realise that by playing nice and pretending to be aligned if you have the power to turn it off, establishing control over any system we put in place to shut it down, and eliminating us if it has the power to reliably do so and we would otherwise pose a threat).
There is a broad range of possible goals that an AI might possess, but there are a few basic drives that would be useful to almost any of them. These are called instrumentally convergent goals:
- Self preservation. An agent is less likely to achieve its goal if it is not around to see to its completion.
- Goal-content integrity. An agent is less likely to achieve its goal if its goal has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it.
- Self-improvement. An agent is more likely to achieve its goal if it is more intelligent and better at problem-solving.
- Resource acquisition. The more resources at an agent’s disposal, the more power it has to make change towards its goal. Even a purely computational goal, such as computing digits of pi, can be easier to achieve with more hardware and energy.
Because of these drives, even a seemingly simple goal could create an Artificial Superintelligence (ASI) hell-bent on taking over the world’s material resources and preventing itself from being turned off. The classic example is an ASI that was programmed to maximize the output of paper clips at a paper clip factory. The ASI had no other goal specifications other than “maximize paper clips,” so it converts all of the matter in the solar system into paper clips, and then sends probes to other star systems to create more factories.
Present-day AI algorithms already demand special safety guarantees when they must act in important domains without human oversight, particularly when they or their environment can change over time:
Achieving these gains [from autonomous systems] will depend on development of entirely new methods for enabling “trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of [adaptability] and autonomy. In effect, the number of possible input states that such systems can be presented with is so large that not only is it impossible to test all of them directly, it is not even feasible to test more than an insignificantly small fraction of them. Development of such systems is thus inherently unverifiable by today’s methods, and as a result their operation in all but comparatively trivial applications is uncertifiable.
It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.
- Office of the US Air Force Chief Scientist (2010). Technology Horizons: A Vision for Air Force Science and Technology 2010-30.
As AI capabilities improve, it will become easier to give AI systems greater autonomy, flexibility, and control; and there will be increasingly large incentives to make use of these new possibilities. The potential for AI systems to become more general, in particular, will make it difficult to establish safety guarantees: reliable regularities during testing may not always hold post-testing.
The largest and most lasting changes in human welfare have come from scientific and technological innovation — which in turn comes from our intelligence. In the long run, then, much of AI’s significance comes from its potential to automate and enhance progress in science and technology. The creation of smarter-than-human AI brings with it the basic risks and benefits of intellectual progress itself, at digital speeds.
As AI agents become more capable, it becomes more important (and more difficult) to analyze and verify their decisions and goals. Stuart Russell writes:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
- The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
- Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
Bostrom’s “The Superintelligent Will” lays out these two concerns in more detail: that we may not correctly specify our actual goals in programming smarter-than-human AI systems, and that most agents optimizing for a misspecified goal will have incentives to treat humans adversarially, as potential threats or obstacles to achieving the agent’s goal.
If the goals of human and AI agents are not well-aligned, the more knowledgeable and technologically capable agent may use force to get what it wants, as has occurred in many conflicts between human communities. Having noticed this class of concerns in advance, we have an opportunity to reduce risk from this default scenario by directing research toward aligning artificial decision-makers’ interests with our own.
One important concern is that some autonomous systems are designed to kill or destroy for military purposes. These systems would be designed so that they could not be “unplugged” easily. Whether further development of such systems is a favorable long-term direction is a question we urgently need to address. A separate concern is that high-quality decision-making systems could inadvertently be programmed with goals that do not fully capture what we want. Antisocial or destructive actions may result from logical steps in pursuit of seemingly benign or neutral goals. A number of researchers studying the problem have concluded that it is surprisingly difficult to completely guard against this effect, and that it may get even harder as the systems become more intelligent. They might, for example, consider our efforts to control them as being impediments to attaining their goals.
Except in the case of Whole Brain Emulation, there is no reason to expect a superintelligent machine to have motivations anything like those of humans. Human minds represent a tiny dot in the vast space of all possible mind designs, and very different kinds of minds are unlikely to share to complex motivations unique to humans and other mammals.
Whatever its goals, a superintelligence would tend to commandeer resources that can help it achieve its goals, including the energy and elements on which human life depends. It would not stop because of a concern for humans or other intelligences that is ‘built in’ to all possible mind designs. Rather, it would pursue its particular goal and give no thought to concerns that seem ‘natural’ to that particular species of primate called homo sapiens.
There are, however, some basic instrumental motivations we can expect superintelligent machines to display, because they are useful for achieving its goals, no matter what its goals are. For example, an AI will ‘want’ to self-improve, to be optimally rational, to retain its original goals, to acquire resources, and to protect itself — because all these things help it achieve the goals with which it was originally programmed.