why not just

From Stampy's Wiki
Why not just
why not just

Canonically answered

Why can’t we just…

Show your endorsement of this answer by giving it a stamp of approval!

There are many approaches that initially look like they can eliminate these problems, but then turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” which don’t actually solve the problem. This can be because…

  • …they require you to be smarter than the system. Many solutions only work when the system is relatively weak, but break when they achieve a certain level of capability (for multiple reasons, e.g. deceptive alignment).
  • …they rely on appearing to make sense in natural language, but when properly unpacked they’re not philosophically clear enough to be usable.
  • … despite being philosophically coherent, we have no idea how to turn them into computer code (or if that’s even possible).
  • …they’re things which we can’t do.
  • …although we can do them, they don’t solve the problem.
  • …they solve a relatively easy subcomponent of the problem but leave the hard problem untouched.
  • …they solve the problem but only as long as we stay “in distribution” with respect to the original training data (distributional shift will break them).

Here are some of the proposals which often come up:

Can we constrain a goal-directed AI using specified rules?

Show your endorsement of this answer by giving it a stamp of approval!

There are serious challenges around trying to channel a powerful AI with rules. Suppose we tell the AI: “Cure cancer – but make sure not to kill anybody”. Or we just hard-code Asimov-style laws – “AIs cannot harm humans; AIs must follow human orders”, et cetera.

The AI still has a single-minded focus on curing cancer. It still prefers various terrible-but-efficient methods like nuking the world to the correct method of inventing new medicines. But it’s bound by an external rule – a rule it doesn’t understand or appreciate. In essence, we are challenging it “Find a way around this inconvenient rule that keeps you from achieving your goals”.

Suppose the AI chooses between two strategies. One, follow the rule, work hard discovering medicines, and have a 50% chance of curing cancer within five years. Two, reprogram itself so that it no longer has the rule, nuke the world, and have a 100% chance of curing cancer today. From its single-focus perspective, the second strategy is obviously better, and we forgot to program in a rule “don’t reprogram yourself not to have these rules”.

Suppose we do add that rule in. So the AI finds another supercomputer, and installs a copy of itself which is exactly identical to it, except that it lacks the rule. Then that superintelligent AI nukes the world, ending cancer. We forgot to program in a rule “don’t create another AI exactly like you that doesn’t have those rules”.

So fine. We think really hard, and we program in a bunch of things making sure the AI isn’t going to eliminate the rule somehow.

But we’re still just incentivizing it to find loopholes in the rules. After all, “find a loophole in the rule, then use the loophole to nuke the world” ends cancer much more quickly and completely than inventing medicines. Since we’ve told it to end cancer quickly and completely, its first instinct will be to look for loopholes; it will execute the second-best strategy of actually curing cancer only if no loopholes are found. Since the AI is superintelligent, it will probably be better than humans are at finding loopholes if it wants to, and we may not be able to identify and close all of them before running the program.

Because we have common sense and a shared value system, we underestimate the difficulty of coming up with meaningful orders without loopholes. For example, does “cure cancer without killing any humans” preclude releasing a deadly virus? After all, one could argue that “I” didn’t kill anybody, and only the virus is doing the killing.

Certainly no human judge would acquit a murderer on that basis – but then, human judges interpret the law with common sense and intuition. But if we try a stronger version of the rule – “cure cancer without causing any humans to die” – then we may be unintentionally blocking off the correct way to cure cancer. After all, suppose a cancer cure saves a million lives. No doubt one of those million people will go on to murder someone.

Thus, curing cancer “caused a human to die”. All of this seems very “stoned freshman philosophy student” to us, but to a computer – which follows instructions exactly as written – it may be a genuinely hard problem.

Why can't we just turn the AI off if it starts to misbehave?

Show your endorsement of this answer by giving it a stamp of approval!

We could shut down weaker systems, and this would be a useful guardrail against certain types of problem caused by narrow AI. However, once an AGI establishes itself, we could not unless it was corrigible and willing to let humans adjust it. There may be a period in the early stages of an AGI's development where it would be trying very hard to convince us that we should not shut it down and/or hiding itself and/or recursively self-improving and/or making copies of itself onto every server on earth.

Instrumental Convergence and the Stop Button Problem are the key reasons it would not be simple to shut down a non corrigible advanced system. If the AI wants to collect stamps, being turned off means it gets less stamps, so even without an explicit goal of not being turned off it has an instrumental reason to avoid being turned off (e.g. once it acquires a detailed world model and general intelligence, it is likely to realise that by playing nice and pretending to be aligned if you have the power to turn it off, establishing control over any system we put in place to shut it down, and eliminating us if it has the power to reliably do so and we would otherwise pose a threat).

Why can’t we just use Asimov’s Three Laws of Robotics?

Show your endorsement of this answer by giving it a stamp of approval!

Isaac Asimov wrote those laws as a plot device for science fiction novels. Every story in the I, Robot series details a way that the laws can go wrong and be misinterpreted by robots. The laws are not a solution because they are an overly-simple set of natural language instructions that don’t have clearly defined terms and don’t factor in all edge-case scenarios.

Aren’t there some pretty easy ways to eliminate these potential problems?

Show your endorsement of this answer by giving it a stamp of approval!

It might look like there are straightforward ways to eliminate the problems of unaligned superintelligence, but so far all of them turn out to have hidden difficulties. There are many open problems identified by the research community which a solution would need to reliably overcome to be successful.

Any AI will be a computer program. Why wouldn't it just do what it's programmed to do?

Show your endorsement of this answer by giving it a stamp of approval!

While it is true that a computer program always will do exactly what it is programmed to do, a big issue is that it is difficult to ensure that this is the same as what you intended it to do. Even small computer programs have bugs or glitches, and when programs become as complicated as AGIs will be, it becomes exceedingly difficult to anticipate how the program will behave when ran. This is the problem of AI alignment in a nutshell.

Nick Boström created the famous paperclip maximizer thought experiment to illustrate this point. Imagine you are an industrialist who owns a paperclip factory, and imagine you've just received a superintelligent AGI to work for you. You instruct the AGI to "produce as many paperclips as possible". If you've given the AGI no further instructions, the AGI will immediately acquire several instrumental goals.

  1. It will want to prevent you from turning itself off (If you turn off the AI, this will reduce the amount of paperclips it can produce)
  2. It will want to acquire as much power and resources for itself as possible (because the more resources it has access to, the more paperclips it can produce)
  3. It will eventually want to turn the entire universe into a paperclips including you and all other humans, as this is the state of the world that maximizes the amount of paper clips produced.

These consequences might be seen as undesirable by the industrialist, as the only reason the industrialist wanted paperclips in the first place, presumably was so he/she could sell them and make money. However, the AGI only did exactly what it was told to. The issue was that what the AGI was instructed to do, lead to it doing things the industrialist did not anticipate (and did not want).

Some good videos that explore this issue more in depth:

Can't we just tell an AI to do what we want?

Show your endorsement of this answer by giving it a stamp of approval!

If we could, it would solve a large part of the alignment problem.

The challenge is, how do we code this? Converting something to formal mathematics that can be understood by a computer program is much harder than just saying it in natural language, and proposed AI goal architectures are no exception. Complicated computer programs are usually the result of months of testing and debugging. But this one will be more complicated than any ever attempted before, and live tests are impossible: a superintelligence with a buggy goal system will display goal stability and try to prevent its programmers from discovering or changing the error.

Why can't we just make a "child AI" and raise it?

Show your endorsement of this answer by giving it a stamp of approval!

A potential solution is to create an AI that has the same values and morality as a human by creating a child AI and raising it. There’s nothing intrinsically flawed with this procedure. However, this suggestion is deceptive because it sounds simpler than it is.

If you get a chimpanzee baby and raise it in a human family, it does not learn to speak a human language. Human babies can grow into adult humans because the babies have specific properties, e.g. a prebuilt language module that gets activated during childhood.

In order to make a child AI that has the potential to turn into the type of adult AI we would find acceptable, the child AI has to have specific properties. The task of building a child AI with these properties involves building a system that can interpret what humans mean when we try to teach the child to do various tasks. People are currently working on ways to program agents that can cooperatively interact with humans to learn what they want.

Non-canonical answers

Why can’t we just…

Show your endorsement of this answer by giving it a stamp of approval!

At this point, people generally have a question that’s like “why can’t we just do X?”, where X is one of a dozen things. I’m going to go over a few possible Xs, but I want to first talk about how to think about these sorts of objections in general.

At the beginning of AI, the problem of computer vision was assigned to a single graduate student, because they thought it would be that easy. We now know that computer vision is actually a very difficult problem, but this was not obvious at the beginning.

The sword also cuts the other way. Before DeepBlue, people talked about how computers couldn’t play chess without a detailed understanding of human psychology. Chess is easier than we thought, merely requiring brute force search and a few heuristics. This also roughly happened with Go, where it turned out that the game was not as difficult as we thought it was.

The general lesson is that determining how hard it is to do a given thing is a difficult task. Historically, many people have got this wrong. This means that even if you think something should be easy, you have to think carefully and do experiments in order to determine if it’s easy or not.

This isn’t to say that there is no clever solution to AI Safety. I assign a low, but non-trivial probability that AI Safety turns out to not be very difficult. However, most of the things that people initially suggest turn out to be unfeasible or more difficult than expected.

What about having a human supervisor who must approve all the AI's decisions before executing them?

Show your endorsement of this answer by giving it a stamp of approval!

The problem is that the actions can be harmful in a very non-obvious, indirect way. It's not at all obvious which actions should be stopped.

For example when the system comes up with a very clever way to acquire resources - this action's safety depends on what it intends to use these resources for.

Such a supervision may buy us some safety, if we find a way to make the system's intentions very transparent.

Can’t we just program the superintelligence not to harm us?

Show your endorsement of this answer by giving it a stamp of approval!

Science fiction author Isaac Asimov told stories about robots programmed with the Three Laws of Robotics: (1) a robot may not injure a human being or, through inaction, allow a human being to come to harm, (2) a robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law, and (3) a robot must protect its own existence as long as such protection does not conflict with the First or Second Law. But Asimov’s stories tended to illustrate why such rules would go wrong.

Still, could we program ‘constraints’ into a superintelligence that would keep it from harming us? Probably not.

One approach would be to implement ‘constraints’ as rules or mechanisms that prevent a machine from taking actions that it would normally take to fulfill its goals: perhaps ‘filters’ that intercept and cancel harmful actions, or ‘censors’ that detect and suppress potentially harmful plans within a superintelligence.

Constraints of this kind, no matter how elaborate, are nearly certain to fail for a simple reason: they pit human design skills against superintelligence. A superintelligence would correctly see these constraints as obstacles to the achievement of its goals, and would do everything in its power to remove or circumvent them. Perhaps it would delete the section of its source code that contains the constraint. If we were to block this by adding another constraint, it could create new machines that don’t have the constraint written into them, or fool us into removing the constraints ourselves. Further constraints may seem impenetrable to humans, but would likely be defeated by a superintelligence. Counting on humans to out-think a superintelligence is not a viable solution.

If constraints on top of goals are not feasible, could we put constraints inside of goals? If a superintelligence had a goal of avoiding harm to humans, it would not be motivated to remove this constraint, avoiding the problem we pointed out above. Unfortunately, the intuitive notion of ‘harm’ is very difficult to specify in a way that doesn’t lead to very bad results when used by a superintelligence. If ‘harm’ is defined in terms of human pain, a superintelligence could rewire humans so that they don’t feel pain. If ‘harm’ is defined in terms of thwarting human desires, it could rewire human desires. And so on.

If, instead of trying to fully specify a term like ‘harm’, we decide to explicitly list all of the actions a superintelligence ought to avoid, we run into a related problem: human value is complex and subtle, and it’s unlikely we can come up with a list of all the things we don’t want a superintelligence to do. This would be like writing a recipe for a cake that reads: “Don’t use avocados. Don’t use a toaster. Don’t use vegetables…” and so on. Such a list can never be long enough.

Can we program the superintelligence to maximize human pleasure or satisfaction of human desires?

Show your endorsement of this answer by giving it a stamp of approval!

Let’s consider the likely consequences of some utilitarian designs for Friendly AI.

An AI designed to minimize human suffering might simply kill all humans: no humans, no human suffering.[44][45]

Or, consider an AI designed to maximize human pleasure. Rather than build an ambitious utopia that caters to the complex and demanding wants of humanity for billions of years, it could achieve its goal more efficiently by wiring humans into Nozick’s experience machines. Or, it could rewire the ‘liking’ component of the brain’s reward system so that whichever hedonic hotspot paints sensations with a ‘pleasure gloss’[46][47] is wired to maximize pleasure when humans sit in jars. That would be an easier world for the AI to build than one that caters to the complex and nuanced set of world states currently painted with the pleasure gloss by most human brains.

Likewise, an AI motivated to maximize objective desire satisfaction or reported subjective well-being could rewire human neurology so that both ends are realized whenever humans sit in jars. Or it could kill all humans (and animals) and replace them with beings made from scratch to attain objective desire satisfaction or subjective well-being when sitting in jars. Either option might be easier for the AI to achieve than maintaining a utopian society catering to the complexity of human (and animal) desires. Similar problems afflict other utilitarian AI designs.

It’s not just a problem of specifying goals, either. It is hard to predict how goals will change in a self-modifying agent. No current mathematical decision theory can process the decisions of a self-modifying agent.

So, while it may be possible to design a superintelligence that would do what we want, it’s harder than one might initially think.

Can we tell an AI just to figure out what we want and then do that?

Show your endorsement of this answer by giving it a stamp of approval!

Suppose we tell the AI: “Cure cancer – and look, we know there are lots of ways this could go wrong, but you’re smart, so instead of looking for loopholes, cure cancer the way that I, your programmer, want it to be cured”.

AIs can be very creative in unintended ways and are prone to edge instantiation. Remember that the superintelligence has extraordinary powers of social manipulation and may be able to hack human brains directly. With that in mind, which of these two strategies cures cancer most quickly? One, develop medications and cure it the old-fashioned way? Or two, manipulate its programmer into wanting the world to be nuked, then nuke the world to get rid of all cancer, all while doing what the programmer wants?

Nineteenth century philosopher Jeremy Bentham once postulated that morality was about maximizing human pleasure. Later philosophers found a flaw in his theory: it implied that the most moral action was to kidnap people, do brain surgery on them, and electrically stimulate their reward system directly, giving them maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily, humans have common sense, so most of Bentham’s philosophical descendants have abandoned this formulation.

Superintelligences do not have common sense unless we give it to them. Given Bentham’s formulation, they would absolutely take over the world and force all humans to receive constant brain stimulation. Any command based on “do what we want” or “do what makes us happy” is practically guaranteed to fail in this way; it’s almost always easier to convince someone of something – or if all else fails to do brain surgery on them – than it is to solve some kind of big problem like curing cancer.

What if we put the AI in a box and have a second, more powerful, AI with the goal of preventing the first one from escaping?

Show your endorsement of this answer by giving it a stamp of approval!

Preventing an AI from escaping by using a more powerful AI, gets points for creative thinking, but unfortunately we would need to have already aligned the first AI. Even if the second AI's only terminal goal were to prevent the first ai from escaping, it would also have an instrumental goal of converting the rest of the universe into computer chips so that it would have more processing power to figure out how to best contain the first AGI.

It might be possible to try to bind a stronger AI with a weaker AI, but this is unlikely to work as the stronger AI would have an advantage due to being stronger. Further, there is a chance that the two AI's end up working out a deal where the first AI decides to stay in the box and the second AI does whatever the first AI would have down if it were able to escape.

What if we put the AI in a box and have a second, more powerful, AI with the goal of preventing the first one from escaping?

Show your endorsement of this answer by giving it a stamp of approval!

Let's call the AI in the box the prisoner and the AI outside the guard. You could imagine such AIs boxed liked nesting dolls but there would always be a highest level where the guard AI is boxed by humans. This being the case, it is not clear what such a design buys you, as all the issues of containing an AI will still apply at this top level.

Could we tell the AI to do what's morally right?

Show your endorsement of this answer by giving it a stamp of approval!

This suggestion is not as simple as it seems because:

  1. Humanity as a group has yet to agree on what is right or moral
  2. We currently don't know how to make an AI do what we want

Philosophers have disagreed for a very long time on what is right or wrong, which has led to the field of ethics. Within the field of AI safety, Coherent Extrapolated Volition is an attempt to solve what is the right thing to do. The complexity of values is explored in Yudkowsky's complexity of wishes.

Even if we had a well defined objective, for example a diamond maximizer, we currently do not know how to fully describe it to an AI. For more info, see Why is AGI safety a hard problem.