How might Shard Theory help with alignment?

From Stampy's Wiki
Mark as:

Tags: shard theory (create tag) (edit tags)

Canonical Answer

Humans care about things! The reward circuitry in our brain reliably causes us to care about specific things. Let's create a mechanistic model of how the brain aligns humans, and then we can use this to do AI alignment.

One perspective that Shard theory has added is that we shouldn't think of the solution to alignment as:

  1. Find an outer objective that is fine to optimize arbitrarily strongly
  2. Find a way of making sure that the inner objective of an ML system equals the outer objective.

Shard theory argues that instead we should focus on finding outer objectives that reliably give certain inner values into system and should be thought of as more of a teacher of the values we want to instill as opposed to the values themselves. Reward is not the optimization target — instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense. (As pointed out in the comments this is not a new insight, but it was for me phrased a lot more clearly in terms of Shard theory).

Humans have different values than the reward circuitry in our brain being maximized, but they are still pointed reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.

Shard Theory points at the beginning of a mechanistic story for how inner values are selected for by outer optimization pressures. The current plan is to figure out how RL induces inner values into learned agents, and then figure out how to instill human values into powerful AI models (probably chain of thought LLMs, because these are the most intelligent models right now). Then, use these partially aligned models to solve the full alignment problem. Shard theory also proposes a subagent theory of mind.

This has some similarities to Brain-like AGI Safety, and has drawn on some research from this post, such as the mechanics of the human reward circuitry as well as the brain being mostly randomly initialized at birth.

Stamps: None
Show your endorsement of this answer by giving it a stamp of approval!

Tags: None (add tags)

Question Info
Asked by: RoseMcClelland
OriginWhere was this question originally asked
Wiki
Date: 2022/09/13


Discussion