What is an adversarial oversight scheme?
The following diagram describing an adversarial oversight scheme to do for aligning an RL agent:
Motivation: The point of this approach is to create extremely reliable AI where it will never engage in certain types of behavior, for example, killing all humans, or deceiving its creators. A practice problem is to get any kind of behavior extremely reliably out of current day LLMs. The way Redwood operationalized this is by trying to train an LLM to have the property that they finish the prompt such that no humans get hurt (technically slightly weaker than this — only the AI believes no one gets hurt).
OriginWhere was this question originally asked