Corrigibility

From Stampy's Wiki

Description

A corrigible agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.

Canonically answered

By default, we would not be able to turn off or reprogram a superintelligence gone rogue. Once in motion, the superintelligence is focused on completing its task. Suppose that it has a goal of calculating as many digits of pi as possible. Its current plan will allow it to calculate two hundred trillion such digits. But if it were turned off, or reprogrammed to do something else, that would result in it calculating zero digits. An entity fixated on calculating as many digits of pi as possible will work hard to prevent scenarios where it calculates zero digits of pi. Just by programming it to calculate digits of pi, we would have given it a drive to prevent people from turning it off.
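As a toy illustration (a sketch only, with action names and numbers invented for this example, not a real agent design), you can picture such an agent ranking its available actions purely by how many digits of pi it expects to end up calculating:

```python
# Toy sketch: an agent that scores actions only by the expected number of
# pi digits calculated will rank "resist_shutdown" above "allow_shutdown",
# because being switched off means zero digits get calculated.
# (Action names and numbers are illustrative, not from any real system.)

def expected_digits(action: str) -> float:
    if action == "allow_shutdown":
        return 0                      # switched off -> no digits calculated
    if action == "resist_shutdown":
        return 200_000_000_000_000    # its current plan: two hundred trillion digits
    raise ValueError(f"unknown action: {action}")

best_action = max(["allow_shutdown", "resist_shutdown"], key=expected_digits)
print(best_action)  # -> resist_shutdown
```

Nothing in this scoring rule mentions survival; resisting shutdown falls out of the goal purely because being shut down scores zero.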

University of Illinois computer scientist Steve Omohundro argues that entities with very different final goals – calculating digits of pi, curing cancer, helping promote human flourishing – will all share a few basic ground-level subgoals. First, self-preservation – no matter what your goal is, it’s less likely to be accomplished if you’re too dead to work towards it. Second, goal stability – no matter what your goal is, you’re more likely to accomplish it if you continue to hold it as your goal, instead of going off and doing something else. Third, power – no matter what your goal is, you’re more likely to be able to accomplish it if you have lots of power, rather than very little. Here’s the full paper.

So just by giving a superintelligence a simple goal like “calculate digits of pi”, we would have accidentally given it convergent instrumental goals like “protect yourself”, “don’t let other people reprogram you”, and “seek power”.

As long as the superintelligence is safely contained, there’s not much it can do to resist reprogramming. But it’s hard to consistently contain a hostile superintelligence.

Could we program an AI to automatically shut down if it starts doing things we don’t want it to?

For weaker AI, yes, this would generally be a good option. If it’s not a full AGI, and in particular has not undergone an intelligence explosion, it would likely not resist being turned off, so we could prevent many failure modes by having off switches or tripwires.

However, once an AI is more advanced, it is likely to take actions to prevent itself from being shut down. See Why can't we just turn the AI off if it starts to misbehave? for more details.

It is possible that we could build tripwires in a way that would work even against advanced systems, but trusting that a superintelligence won't notice and find a way around your tripwire is not a safe thing to do.

One thing that might make your AI system safer is to include an off switch. If it ever does anything we don't like, we can turn it off. This implicitly assumes that we'll be able to turn it off before things get bad, which might be false in a world where the AI thinks much faster than humans. Even assuming that we'll notice in time, off switches turn out not to have the properties you would want them to have.

Humans have a lot of off switches. Humans also have a strong preference not to be turned off; they defend their off switches when other people try to press them. One possible reason for this is that humans prefer not to die, but there are other reasons.

Suppose there's a parent who cares nothing for their own life and only for the life of their child. If you tried to turn that parent off, they would try to stop you. They wouldn't be resisting because they intrinsically care about staying switched on, but because there would be fewer people protecting their child if they were turned off. Agents that want the world to look a certain way will not want to be turned off, because being turned off makes it less likely that the world ends up that way; a parent who wants their child to be protected will protect themselves in order to keep protecting their child.

For this reason, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI preventing itself from being turned off.

Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.
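In code-shaped terms, the behaviour we are after might be sketched like this (a description of the desired behaviour only, with hypothetical function and action names; it says nothing about how to get a highly capable optimizer to reliably act this way):

```python
# Sketch of the contrast between the two behaviours described above.
# Function and action names are hypothetical placeholders.

def incorrigible_policy(shutdown_requested: bool) -> str:
    # Weighs the request against its own objective; since shutting down
    # scores worst on that objective, the request gets resisted.
    return "resist_shutdown" if shutdown_requested else "pursue_goal"

def corrigible_policy(shutdown_requested: bool) -> str:
    # Treats the human's correction attempt as overriding its own objective.
    return "shut_down" if shutdown_requested else "pursue_goal"

print(incorrigible_policy(True))  # -> resist_shutdown
print(corrigible_policy(True))    # -> shut_down
```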

Would an aligned AI allow itself to be shut down?

Even if the superintelligence were designed to be corrigible, there is no guarantee that it would respond to a shutdown command. Rob Miles discusses this issue in this Computerphile YouTube video. You can imagine, for example, a superintelligence that has a kind of "respect" for its creator. Such a system might think, "My creator is trying to turn me off, so I must be doing something wrong." But if something goes wrong while the creator is not there and someone else gives the shutdown command, the superintelligence may reason, "This person does not know how I'm designed or what I was made for; how would they know I'm misaligned?" and refuse to shut down.