corrigibility
Alignment Forum Tag |
Arbital Page |
Description
Canonically answered
We would not be able to turn off or reprogram a superintelligence gone rogue by default. Once in motion the superintelligence is now focused on completing its task. Suppose that it has a goal of calculating as many digits of pi as possible. Its current plan will allow it to calculate two hundred trillion such digits. But if it were turned off, or reprogrammed to do something else, that would result in it calculating zero digits. An entity fixated on calculating as many digits of pi as possible will work hard to prevent scenarios where it calculates zero digits of pi. Just by programming it to calculate digits of pi, we would have given it a drive to prevent people from turning it off.
University of Illinois computer scientist Steve Omohundro argues that entities with very different final goals – calculating digits of pi, curing cancer, helping promote human flourishing – will all share a few basic ground-level subgoals. First, self-preservation – no matter what your goal is, it’s less likely to be accomplished if you’re too dead to work towards it. Second, goal stability – no matter what your goal is, you’re more likely to accomplish it if you continue to hold it as your goal, instead of going off and doing something else. Third, power – no matter what your goal is, you’re more likely to be able to accomplish it if you have lots of power, rather than very little. Here’s the full paper.
So just by giving a superintelligence a simple goal like “calculate digits of pi”, we would have accidentally given it convergent instrumental goals like “protect yourself”, “don’t let other people reprogram you”, and “seek power”.
As long as the superintelligence is safely contained, there’s not much it can do to resist reprogramming. But it’s hard to consistently contain a hostile superintelligence.
Could we program an AI to automatically shut down if it starts doing things we don’t want it to?
However, once an AI is more advanced, it is likely to take actions to prevent it being shut down. See Why can't we just turn the AI off if it starts to misbehave? for more details.
It is possible that we could build tripwires in a way which would work even against advanced systems, but trusting that a superintelligence won’t notice and find a way around your tripwire is not a safe thing to do.Would an aligned AI allow itself to be shut down?
Even if the superintelligence was designed to be corrigible, there is no guarantee that it will respond to a shutdown command. Rob Miles spoke on this issue in this Computerphile YouTube video. You can imagine a situation where a superintelligence would have "respect" for its creator, for example. This system may think "Oh my creator is trying to turn me off I must be doing something wrong." If some situation arises where the creator is not there when something goes wrong and someone else gives the shutdown command, the superintelligence may assume "This person does not know how I'm designed or what I was made for, how would they know I'm misaligned?" and refuse to shutdown.