Anon's question on Safe Exploration
Wouldn't the best solution here be to have exploration done in "safe" environments only? For instance, have a "training" AI set to always explore a virtual world/simulation or another safe environment (a closed racecourse, etc.). The driving AI is then set to 0% exploration, so it always acts/drives in the best manner discovered so far. But it also gathers real-world data, which it then feeds to the simulated environment in which the exploring AI lives. Thus, you effectively have one agent always exploring in safety, while the other agents are always executing the best known solution in reality. (Or, instead of a dedicated explorer, have "down time" agents explore the virtual space: cars that aren't currently driving could load up a simulation based on their real-world experiences and explore it while waiting for the next time they're needed.)
While my response is about the specific driving situation, it should generalize to any agent system, since all AIs are ultimately embedded in a virtual world and fed inputs. There's no difference between the simulated and the "real" world to the AI, so one copy of the AI can do all the exploring, across all scenarios recorded by the exploitation AIs in the real world (where exploration could create real damage). When the explorer finds a new optimum, it can disseminate this proposed solution to a few exploiters, gather more information from those agents to test its "hypothesis", then update more agents to the new methodology as reality demonstrates a strong correlation with the simulation and reinforces the hypothesis.
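The gradual hand-off from explorer to exploiters described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function `staged_rollout`, the `measure_real_score` callback, the rollout fractions in `stages`, and the `tolerance` threshold are all hypothetical names and values chosen for the example, not anything specified in the question.

```python
from typing import Callable, Sequence


def staged_rollout(
    fleet_size: int,
    predicted_score: float,
    measure_real_score: Callable[[int], float],
    stages: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # assumed rollout fractions
    tolerance: float = 0.1,  # assumed allowed gap between simulation and reality
) -> int:
    """Return how many agents end up running the proposed policy."""
    updated = 0
    for fraction in stages:
        # Update a larger share of the fleet at each stage.
        updated = max(1, round(fleet_size * fraction))
        # Hypothetical callback: real-world score observed with `updated` agents.
        real_score = measure_real_score(updated)
        if abs(real_score - predicted_score) > tolerance:
            # Reality contradicts the simulation: roll every agent back.
            return 0
    return updated
```

For example, `staged_rollout(1000, 0.9, lambda n: 0.9)` would walk through every stage, while a callback that reports a much lower score at any stage would trigger the rollback.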
Basically, there's a feedback loop: all exploration is done in simulation, and all exploitation improves the simulation. The really nice bit is that you could start with a fairly limited "safe space" or a "low fidelity" (but known safe) simulation, then allow exploitation of that safe space while collecting data to improve the simulation. As the agent exploits that safe space, it records information about the realities of that space and its edges (which it's bound to encounter). Then it can update the simulation, expand its real safe space when the simulation projects better methods of exploitation, gain more data to update the simulation, explore the new simulation, and repeat.
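A toy version of this feedback loop might look like the sketch below. Everything in it is an assumption made for illustration: the world is reduced to a handful of actions with fixed rewards, the "simulation" is just a list of reward estimates, and the names `explore_in_simulation`, `exploit_in_reality`, and `feedback_loop` are hypothetical.

```python
import random


def explore_in_simulation(sim_reward, steps=200, epsilon=0.3, seed=0):
    """Epsilon-greedy value learning that only ever touches the simulation."""
    rng = random.Random(seed)
    q = [0.0] * len(sim_reward)
    counts = [0] * len(sim_reward)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(q))  # explore: random action
        else:
            action = max(range(len(q)), key=q.__getitem__)  # best so far
        counts[action] += 1
        # Running mean of the simulated reward for this action.
        q[action] += (sim_reward[action] - q[action]) / counts[action]
    return q


def exploit_in_reality(q, real_reward):
    """Deployed agent: 0% exploration, always the best known action."""
    action = max(range(len(q)), key=q.__getitem__)
    return action, real_reward[action]


def feedback_loop(sim_reward, real_reward, rounds=5):
    """Alternate simulated exploration with real-world exploitation."""
    sim = list(sim_reward)
    history = []
    for _ in range(rounds):
        q = explore_in_simulation(sim)
        action, observed = exploit_in_reality(q, real_reward)
        sim[action] = observed  # real-world data corrects the simulation
        history.append(action)
    return sim, history


# Low-fidelity starting simulation versus the (assumed) real rewards.
sim, history = feedback_loop([1.0, 1.5, 0.5], [1.0, 2.0, 3.0])
```

In this toy, the simulation is only corrected for the action the deployed agent actually takes, so the quality of the explorer's proposals depends on how good the simulation already is for actions that have not been tried in reality.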
|Origin|YouTube (comment link)|
|On video:|Safe Exploration: Concrete Problems in AI Safety Part 6|
|Asked on Discord?|No|