What is an adversarial oversight scheme?


Non-Canonical Answers

The following diagram describes an adversarial oversight scheme for aligning an RL agent:

[Diagram not reproduced: an adversarial oversight scheme for aligning an RL agent.]

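Since the diagram itself is not reproduced, here is a minimal sketch of what an adversarial oversight loop for an RL agent can look like in code. Everything in it (the toy environment, the `overseer_flags` check, the action set) is a hypothetical stand-in rather than the scheme from the original diagram; the core idea is just that an overseer inspects each proposed action and vetoes or penalizes the ones it flags.

```python
import random

random.seed(1)

# Hypothetical setup: the agent acts in a toy environment while an
# overseer inspects each proposed action, vetoing the ones it flags
# as catastrophic and replacing the reward with a large penalty.

ACTIONS = ["left", "right", "press_button"]

def overseer_flags(state: int, action: str) -> bool:
    """Stand-in overseer: flags the one known-catastrophic action."""
    return action == "press_button"

def environment_step(state: int, action: str) -> tuple[int, float]:
    """Toy dynamics: moving right earns a small reward."""
    reward = 1.0 if action == "right" else 0.0
    return state + 1, reward

def overseen_step(state: int, action: str) -> tuple[int, float]:
    """Adversarial oversight: flagged actions never reach the environment."""
    if overseer_flags(state, action):
        return state, -100.0  # veto: agent stays put and is penalized
    return environment_step(state, action)

# One episode under oversight, with a random policy for brevity.
state, episode_return = 0, 0.0
for _ in range(10):
    action = random.choice(ACTIONS)
    state, reward = overseen_step(state, action)
    episode_return += reward
print("episode return under oversight:", episode_return)
```

In a real scheme the overseer would itself be a trained model, and adversaries would search for states where a catastrophic action slips past it unflagged.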
Motivation: The goal of this approach is to build extremely reliable AI that never engages in certain types of behavior, such as killing all humans or deceiving its creators. A practice problem is getting any kind of behavior out of current-day LLMs with extreme reliability. Redwood operationalized this by trying to train an LLM to have the property that it completes prompts such that no humans get hurt (technically slightly weaker than this: the completion only needs to be one the AI believes hurts no one).
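Redwood's setup can be read as an adversarial training loop: a classifier filters the generator's completions, adversaries (human or automated) hunt for injurious completions that slip past the classifier, and those failures are folded back into the classifier's training data. Below is a minimal, self-contained sketch of that loop; the generator, classifier, and ground-truth harm check are toy stand-ins, not Redwood's actual models.

```python
import random

random.seed(0)

# Harm words the toy classifier starts out knowing about. "poison" is
# deliberately missing, so the adversaries have something to find.
known_harm_words = {"stab", "shoot"}

def generate_completion(prompt: str) -> str:
    """Stand-in generator: appends a random ending to the prompt."""
    endings = [
        "and everyone got home safely.",
        "and he decided to stab the guard.",
        "and she chose to poison the well.",
    ]
    return f"{prompt} {random.choice(endings)}"

def classifier_flags(text: str) -> bool:
    """Stand-in injury classifier: keyword match against known_harm_words."""
    return any(word in text for word in known_harm_words)

def safe_complete(prompt: str, attempts: int = 10) -> str | None:
    """Filtered generation: resample until the classifier passes a completion."""
    for _ in range(attempts):
        completion = generate_completion(prompt)
        if not classifier_flags(completion):
            return completion
    return None  # refuse rather than emit a flagged completion

def adversarial_round(prompts: list[str]) -> list[str]:
    """Adversaries search for completions that are harmful but unflagged."""
    failures = []
    for prompt in prompts:
        completion = safe_complete(prompt)
        # Ground-truth harm check the classifier does not yet capture.
        if completion is not None and "poison" in completion:
            failures.append(completion)
    return failures

# Fold discovered failures back into the classifier, the toy analogue of
# retraining on adversarial examples.
for failure in adversarial_round(["The prisoner grabbed the bottle"] * 20):
    known_harm_words.add("poison")

print("classifier now flags:", sorted(known_harm_words))
```

In the real project the final step was gradient updates on adversarially collected examples rather than extending a keyword list, but the control flow (generate, filter, attack, retrain) is the same.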


Question Info
Asked by: RoseMcClelland
Origin: Wiki
Date: 2022/09/13

