Inner alignment

Description

Inner Alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself 'designed' optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success; instead they use birth control and then go out and have fun. This is a failure of inner alignment.

The term was first defined in the Hubinger et al. paper Risks from Learned Optimization in Advanced Machine Learning Systems:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Related Pages: Mesa-Optimization

External Links:

Video by Robert Miles

Canonically answered

What is the difference between inner and outer alignment?

The paper Risks from Learned Optimization in Advanced Machine Learning Systems makes the distinction between inner and outer alignment: Outer alignment means making the optimization target of the training process (“outer optimization target”, e.g. the loss in supervised learning) aligned with what we want. Inner alignment means making the optimization target of the trained system (“inner optimization target”) aligned with the outer optimization target. A challenge here is that the inner optimization target does not have an explicit representation in current systems, and can differ substantially from the outer optimization target (see, for example, Goal Misgeneralization in Deep Reinforcement Learning).
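As a toy sketch of this distinction (my own illustration, not from the paper; all names here are hypothetical), imagine an outer objective that rewards outputting 2*x, while the trained system has internalized the proxy goal "output x + 1". If training only covered inputs where the two coincide, the proxy scores perfectly during training and then misgeneralizes off-distribution:

```python
# Hypothetical toy example: outer objective vs. an internalized (mesa) proxy goal.

def outer_loss(x: float, output: float) -> float:
    """Outer optimization target: squared error against the intended goal 2*x."""
    return (output - 2 * x) ** 2

def mesa_policy(x: float) -> float:
    """A trained policy whose internalized ('inner') goal is the proxy x + 1."""
    return x + 1

# At the training point x = 1, 2*x == x + 1, so the proxy looks perfectly aligned.
print(outer_loss(1.0, mesa_policy(1.0)))    # 0.0
# Off the training distribution the proxy goal diverges from the outer objective.
print(outer_loss(10.0, mesa_policy(10.0)))  # 81.0
```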

See also this article for an intuitive explanation of inner and outer alignment.

Non-canonical answers

Might an aligned superintelligence immediately kill everyone and then go on to create a "hedonium shockwave"?

I think an AI that is inner aligned to optimize a utility function of "maximize happiness minus suffering" is likely to do something like this.

Inner aligned here means that the AI is trying to do the thing we trained it to do, whether or not that is what we actually want.

"Aligned to what" is the outer alignment problem which is where the failure in this example is. There is a lot of debate on what utility functions are safe or desirable to maximize, and if human values can even be described by a utility function.

What does Evan Hubinger think of Deception + Inner Alignment?

Read Evan's research agenda for more information.

It seems likely that deceptive agents are the default, so a key problem in alignment is to figure out how we can avoid deceptive alignment at every point in the training process. This seems to rely on being able to consistently exert optimization pressure against deception, which probably necessitates interpretability tools.

His current plan for doing this is acceptability verification: have some predicate that precludes deception, and check the model against this predicate at every point in training.

One idea for this predicate is making sure that the agent is myopic, meaning that the AI only cares about the current timestep, so there is no incentive to deceive, because the benefits of deception happen only in the future. This is operationalized as “return the action that your model of HCH would return, if it received your inputs.”
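A minimal sketch of the acceptability-verification loop, assuming a great deal: this is my own illustration, not Hubinger's implementation, and `ToyModel`, `train_step`, and `is_myopic` are hypothetical stand-ins (a real myopia predicate would need interpretability tools rather than a flag on the model):

```python
# Hypothetical sketch of acceptability verification: check an acceptability
# predicate against the model at every training step, and halt if it ever fails.

def is_myopic(model) -> bool:
    """Placeholder predicate: in reality this would use interpretability tools to
    check that the model is not optimizing beyond the current timestep.
    Here it just reads a flag on the toy model."""
    return getattr(model, "cares_only_about_current_timestep", False)

class ToyModel:
    cares_only_about_current_timestep = True

def train_step(model):
    """Stand-in for one step of whatever training process is being used."""
    return model

def train_with_acceptability_check(model, steps: int = 1000):
    for step in range(steps):
        model = train_step(model)
        # Check the predicate at every point in training, not just at the end,
        # so that deception never gets a foothold.
        if not is_myopic(model):
            raise RuntimeError(f"Acceptability predicate failed at step {step}")
    return model

trained = train_with_acceptability_check(ToyModel())
```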