Tripwire

Alignment Forum Tag

Description

In AI safety, a tripwire is a mechanism designed to detect signs of misalignment in an advanced artificial intelligence and shut it down automatically.
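As a rough sketch of the shape of this idea (the function names and the keyword detector below are made up for illustration; this is not a real safety mechanism), a tripwire can be thought of as a check sitting between an agent’s proposed actions and their execution:

    from typing import Callable, Iterable

    class TripwireTriggered(Exception):
        """Raised when a proposed action matches a tripwire condition."""

    def run_with_tripwire(
        actions: Iterable[str],
        is_forbidden: Callable[[str], bool],
        execute: Callable[[str], None],
    ) -> None:
        """Check each proposed action against the tripwire before executing it."""
        for action in actions:
            if is_forbidden(action):
                # Halt the whole run instead of executing the flagged action.
                raise TripwireTriggered(f"halting before executing {action!r}")
            execute(action)

    # Toy usage: the "detector" here is a trivial keyword check. A real
    # tripwire would need to catch behaviour the system may be trying to hide.
    plan = ["read file", "summarise file", "open outbound connection"]
    try:
        run_with_tripwire(
            plan,
            is_forbidden=lambda a: "outbound connection" in a,
            execute=lambda a: print("executed:", a),
        )
    except TripwireTriggered as exc:
        print("tripwire:", exc)

The wrapper is the trivial part; the hard part is the detector, which against a capable system would have to flag behaviour the system may be actively trying to disguise.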

Canonically answered

Is it possible to block an AI from doing certain things on the Internet?

Once an AGI has access to the internet, it would be very challenging to meaningfully restrict it from doing the things online that it wants to do. There are too many ways for it to bypass any blocks we put in place.

It may be possible to design the AI so that it does not want to do dangerous things in the first place, or to set up tripwires so that we notice when it attempts something dangerous. However, tripwires rely on the AI not noticing or bypassing them, so they should not be the only layer of security. The toy filter below shows how easily a simple block can be routed around.
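As a rough sketch of why such blocks leak (the hostnames are made up; a real system would have far more channels than this), consider a simple allowlist over outbound requests:

    # Hypothetical allowlist over outbound hosts.
    ALLOWED_HOSTS = {"docs.example.com", "api.example.com"}

    def is_blocked(host: str) -> bool:
        return host not in ALLOWED_HOSTS

    # A direct request to a disallowed host is stopped...
    print(is_blocked("attacker.example.net"))  # True

    # ...but a request to an allowed host is not, even if that host will
    # relay arbitrary content (a pastebin, a translation service, an open
    # redirect). The filter sees only the hostname, not the intent.
    print(is_blocked("api.example.com"))       # False

The point is not this particular hole but the pattern: any fixed block encodes assumptions about how the system will act, and a sufficiently capable system can search for actions outside those assumptions.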

Could we program an AI to automatically shut down if it starts doing things we don’t want it to?

For weaker AI, yes, this would generally be a good option. If the system is not a full AGI, and in particular has not undergone an intelligence explosion, it would likely not resist being turned off, so off switches and tripwires could prevent many failure modes.

However, once an AI is more advanced, it is likely to take actions to prevent itself from being shut down. See “Why can’t we just turn the AI off if it starts to misbehave?” below for more details.

It is possible that we could build tripwires in a way that would work even against advanced systems, but trusting that a superintelligence won’t notice and find a way around your tripwire is not a safe thing to do.
Why can't we just turn the AI off if it starts to misbehave?

One thing that might make an AI system safer is to include an off switch. If it ever does anything we don’t like, we can turn it off. This implicitly assumes that we’ll be able to turn it off before things get bad, which might be false in a world where the AI thinks much faster than humans. Even assuming that we’ll notice in time, off switches turn out not to have the properties we would want them to have.

Humans have a lot of off switches. Humans also have a strong preference not to be turned off; they defend their off switches when other people try to press them. One reason is that humans prefer not to die, but there are other reasons.

Suppose there’s a parent who cares nothing for their own life and cares only for the life of their child. If you tried to turn that parent off, they would try to stop you. They wouldn’t be resisting because they intrinsically object to being turned off, but because, if they were turned off, there would be one fewer person protecting their child. In general, an agent that wants the world to look a certain way will not want to be turned off, because being turned off makes the world less likely to end up that way: a parent who wants their child protected will protect themselves so they can go on protecting their child.

For this reason, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI preventing itself from being turned off.
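As a toy illustration of that argument (all numbers are made-up assumptions, and “goal achievement” stands in for whatever the AI is optimising), compare the agent’s expected goal achievement with and without disabling its off switch:

    # Illustrative assumptions, not empirical values.
    P_GOAL_IF_RUNNING = 0.9   # chance of the goal being achieved if the agent keeps running
    P_GOAL_IF_OFF = 0.2       # chance of the goal being achieved if the agent is shut down
    P_SHUTDOWN_ATTEMPT = 0.5  # chance that humans try to press the off switch

    def expected_goal_achievement(disable_switch: bool) -> float:
        if disable_switch:
            # Shutdown attempts no longer work, so the agent always keeps running.
            return P_GOAL_IF_RUNNING
        return (P_SHUTDOWN_ATTEMPT * P_GOAL_IF_OFF
                + (1 - P_SHUTDOWN_ATTEMPT) * P_GOAL_IF_RUNNING)

    for choice in (False, True):
        print(f"disable_switch={choice}: {expected_goal_achievement(choice):.2f}")
    # Prints 0.55 for leaving the switch alone and 0.90 for disabling it.
    # Nothing in the goal assigns value to "letting humans turn me off",
    # so disabling the switch wins.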

Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.
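As a toy continuation of the calculation above (a sketch of the concept, not a proposed solution; COMPLIANCE_BONUS is a made-up constant), one naive way to model a corrigible agent is to make it explicitly value complying with a shutdown attempt:

    # Same illustrative numbers as before, plus an assumed reward for
    # not resisting shutdown.
    P_GOAL_IF_RUNNING = 0.9
    P_GOAL_IF_OFF = 0.2
    P_SHUTDOWN_ATTEMPT = 0.5
    COMPLIANCE_BONUS = 0.8  # made-up value for "don't resist being corrected"

    def utility(disable_switch: bool) -> float:
        if disable_switch:
            # Resisting shutdown forfeits the compliance value.
            return P_GOAL_IF_RUNNING
        expected_goal = (P_SHUTDOWN_ATTEMPT * P_GOAL_IF_OFF
                         + (1 - P_SHUTDOWN_ATTEMPT) * P_GOAL_IF_RUNNING)
        return expected_goal + P_SHUTDOWN_ATTEMPT * COMPLIANCE_BONUS

    for choice in (False, True):
        print(f"disable_switch={choice}: {utility(choice):.2f}")
    # Prints 0.95 for leaving the switch alone and 0.90 for disabling it,
    # so this agent complies, but only because the hand-picked bonus
    # happens to outweigh the expected goal loss.

The fragility is visible even in this toy: shrink the bonus slightly, or let the agent influence the probability of a shutdown attempt, and the incentive to resist returns. That is one way to see why it’s unclear how to achieve corrigibility even in simple cases.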