From Stampy's Wiki
Main Question: Why can’t we just “put the AI in a box” so it can’t influence the outside world? (edit question) (edit answer)
Alignment Forum Tag
Arbital Page
Wikipedia Page


AI Boxing is attempts, experiments, or proposals to isolate ("box") a powerful AI (~AGI) where it can't interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or "boxed", it will be harmless even if it is an unfriendly artificial intelligence (UAI).

AI Boxing is attempts, experiments, or proposals to isolate ("box") a powerful AI (~AGI) where it can't interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or "boxed", it will be harmless even if it is an unfriendly artificial intelligence (UAI).

Challenges are: 1) can you successively prevent it from interacting with the world? And 2) can you prevent it from convincing you to let it out?

See also: AI, AGI, Oracle AI, Tool AI, Unfriendly AI

Escaping the box

It is not regarded as likely that an AGI can be boxed in the long term. Since the AGI might be a superintelligence, it could persuade someone (the human liaison, most likely) to free it from its box and thus, human control. Some practical ways of achieving this goal include:

  • Offering enormous wealth, power and intelligence to its liberator
  • Claiming that only it can prevent an existential risk
  • Claiming it needs outside resources to cure all diseases
  • Predicting a real-world disaster (which then occurs), then claiming it could have been prevented had it been let out

Other, more speculative ways include: threatening to torture millions of conscious copies of you for thousands of years, starting in exactly the same situation as in such a way that it seems overwhelmingly likely that you are a simulation, or it might discover and exploit unknown physics to free itself.

Containing the AGI

Attempts to box an AGI may add some degree of safety to the development of a friendly artificial intelligence (FAI). A number of strategies for keeping an AGI in its box are discussed in Thinking inside the box and Leakproofing the Singularity. Among them are:

  • Physically isolating the AGI and permitting it zero control of any machinery
  • Limiting the AGI’s outputs and inputs with regards to humans
  • Programming the AGI with deliberately convoluted logic or homomorphically encrypting portions of it
  • Periodic resets of the AGI's memory
  • A virtual world between the real world and the AI, where its unfriendly intentions would be first revealed
  • Motivational control using a variety of techniques
  • Creating an Oracle AI: an AI that only answers questions and isn't designed to interact with the world in any other way. But even the act of the AI putting strings of text in front of humans poses some risk.

Simulations / Experiments

The AI Box Experiment is a game meant to explore the possible pitfalls of AI boxing. It is played over text chat, with one human roleplaying as an AI in a box, and another human roleplaying as a gatekeeper with the ability to let the AI out of the box. The AI player wins if they successfully convince the gatekeeper to let them out of the box, and the gatekeeper wins if the AI player has not been freed after a certain period of time. 

Both Eliezer Yudkowsky and Justin Corwin have ran simulations, pretending to be a superintelligence, and been able to convince a human playing a guard to let them out on many - but not all - occasions. Eliezer's five experiments required the guard to listen for at least two hours with participants who had approached him, while Corwin's 26 experiments had no time limit and subjects he approached.

The text of Eliezer's experiments have not been made public.

List of experiments


Canonically answered

Once an AGI has access to the internet it would be very challenging to meaningfully restrict it from doing things online which it wants to. There are too many options to bypass blocks we may put in place.

It may be possible to design it so that it does not want to do dangerous things in the first place, or perhaps to set up tripwires so that we notice that it’s trying to do a dangerous thing, though that relies on it not noticing or bypassing the tripwire so should not be the only layer of security.

One possible way to ensure the safety of a powerful AI system is to keep it contained in a software environment. There is nothing intrinsically wrong with this procedure - keeping an AI system in a secure software environment would make it safer than letting it roam free. However, even AI systems inside software environments might not be safe enough.

Humans sometimes put dangerous humans inside boxes to limit their ability to influence the external world. Sometimes, these humans escape their boxes. The security of a prison depends on certain assumptions, which can be violated. Yoshie Shiratori reportedly escaped prison by weakening the door-frame with miso soup and dislocating his shoulders.

Human written software has a high defect rate; we should expect a perfectly secure system to be difficult to create. If humans construct a software system they think is secure, it is possible that the security relies on a false assumption. A powerful AI system could potentially learn how its hardware works and manipulate bits to send radio signals. It could fake a malfunction and attempt social engineering when the engineers look at its code. As the saying goes: in order for someone to do something we had imagined was impossible requires only that they have a better imagination.

Experimentally, humans have convinced other humans to let them out of the box. Spooky.

Non-canonical answers

That is, if you know an AI is likely to be superintelligent, can’t you just disconnect it from the Internet, not give it access to any speakers that can make mysterious buzzes and hums, make sure the only people who interact with it are trained in caution, et cetera?. Isn’t there some level of security – maybe the level we use for that room in the CDC where people in containment suits hundreds of feet underground analyze the latest superviruses – with which a superintelligence could be safe?

This puts us back in the same situation as lions trying to figure out whether or not nuclear weapons are a things humans can do. But suppose there is such a level of security. You build a superintelligence, and you put it in an airtight chamber deep in a cave with no Internet connection and only carefully-trained security experts to talk to. What now?

Now you have a superintelligence which is possibly safe but definitely useless. The whole point of building superintelligences is that they’re smart enough to do useful things like cure cancer. But if you have the monks ask the superintelligence for a cancer cure, and it gives them one, that’s a clear security vulnerability. You have a superintelligence locked up in a cave with no way to influence the outside world except that you’re going to mass produce a chemical it gives you and inject it into millions of people.

Or maybe none of this happens, and the superintelligence sits inert in its cave. And then another team somewhere else invents a second superintelligence. And then a third team invents a third superintelligence. Remember, it was only about ten years between Deep Blue beating Kasparov, and everybody having Deep Blue – level chess engines on their laptops. And the first twenty teams are responsible and keep their superintelligences locked in caves with carefully-trained experts, and the twenty-first team is a little less responsible, and now we still have to deal with a rogue superintelligence.

Superintelligences are extremely dangerous, and no normal means of controlling them can entirely remove the danger.

We could limit bandwidth, put it behind a proxy, or only inside a VPN initially, but over time an AGI would figure out how to get as much internet access as it needs, make itself more distributed, or a similar workaround.

In order for an Artificial Superintelligence (ASI) to be useful to us, it has to have some level of influence on the outside world. Even a boxed ASI that receives and sends lines of text on a computer screen is influencing the outside world by giving messages to the human reading the screen. If the ASI wants to escape its box, it is likely that it will find its way out, because of its amazing strategic and social abilities.

Check out Yudkowsky's AI box experiment. It is an experiment in which one person convinces the other to let it out of a "box" as if it were an AI. Unfortunately, the actual contents of these conversations is mostly unknown, but it is worth reading into.

‘AI-boxing’ is a common suggestion: why not use a superintelligent machine as a kind of question-answering oracle, and never give it access to the internet or any motors with which to move itself and acquire resources beyond what we give it? There are several reasons to suspect that AI-boxing will not work in the long run:

  1. Whatever goals the creators designed the superintelligence to achieve, it will be more able to achieve those goals if given access to the internet and other means of acquiring additional resources. So, there will be tremendous temptation to “let the AI out of its box.”
  2. Preliminary experiments in AI-boxing do not inspire confidence. And, a superintelligence will generate far more persuasive techniques for getting humans to “let it out of the box” than we can imagine.
  3. If one superintelligence has been created, then other labs or even independent programmers will be only weeks or decades away from creating a second superintelligence, and then a third, and then a fourth. You cannot hope to successfully contain all superintelligences created around the world by hundreds of people for hundreds of different purposes.

Unanswered non-canonical questions