What is "Do What I Mean"?

From Stampy's Wiki

Tags: do what i mean

Canonical Answer

“Do what I mean” is an alignment strategy where the AI is programmed to try to do what the human meant by an instruction, rather than blindly following directions. Think of it as following the spirit of the law over the letter.

It stands in contrast to the more typical “do what I say” approach of programming AI by giving it an explicit goal. The problem with an explicit goal is that if it is misstated, or leaves out some detail, the AI will optimize for something we don’t want. Think of the story of King Midas, who wished that everything he touched would turn to gold and, in some tellings, starved as a result. A "do what I mean" system, in contrast, will try to understand and implement the programmer's intent rather than the exact words.

One specific "do what I mean" proposal is Cooperative Inverse Reinforcement Learning (CIRL), in which the goal is hidden from the AI. Since it does not have direct access to its reward function, the AI tries to infer the goal from what you tell it and from the examples you give it, gradually getting closer to doing what you actually want.
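The flavor of this goal inference can be sketched in a few lines. The following is a toy illustration, not the actual CIRL algorithm: the candidate goals, actions, reward table, and the "noisily rational human" model are all assumptions made up for the example. The AI starts with a uniform belief over candidate goals and performs a Bayesian update each time it observes the human act.

```python
import math

GOALS = ["coffee", "desk", "plant"]
ACTIONS = ["grind_beans", "wipe_desk", "fill_watering_can"]

# Assumed reward table: how much each action serves each goal.
REWARD = {
    ("coffee", "grind_beans"): 1.0,
    ("desk", "wipe_desk"): 1.0,
    ("plant", "fill_watering_can"): 1.0,
}

def likelihood(action, goal, beta=2.0):
    """P(action | goal): the human is modeled as noisily rational,
    choosing actions with probability proportional to exp(beta * reward)."""
    def score(a):
        return math.exp(beta * REWARD.get((goal, a), 0.0))
    return score(action) / sum(score(a) for a in ACTIONS)

def update(belief, action):
    """Bayesian update of the belief over goals after observing one action."""
    posterior = {g: belief[g] * likelihood(action, g) for g in GOALS}
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()}

# The AI never sees the true goal; it starts with a uniform belief.
belief = {g: 1 / len(GOALS) for g in GOALS}

# The human demonstrates twice; the belief shifts toward "coffee".
for observed in ["grind_beans", "grind_beans"]:
    belief = update(belief, observed)

best = max(belief, key=belief.get)
print(best)
```

After two demonstrations the posterior concentrates on the goal the observed actions serve, which is the sense in which the system "slowly gets closer to doing what you actually want".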

This potentially helps with alignment in two ways. First, it might allow the AI to learn more subtle goals which you might not have been able to state explicitly. Second, it might make the AI corrigible: willing to have its goals or programming corrected, and continuously interested in what people want (including allowing itself to be shut off if need be). Since it is programmed to "do what you mean", it will be open to accepting correction.

For more information, see Do what we mean vs. do what we say by Rohin Shah, in which he defines a "do what we mean" system, shows how it might help with alignment, and discusses how it could be combined with a "do what we say" subsystem for added safety.

For a discussion of a spectrum of different levels of "do what I mean" ability, see the Do What I Mean hierarchy by Eliezer Yudkowsky.

Tags: corrigibility, alignment proposals, cooperative inverse reinforcement learning (cirl)

Non-Canonical Answers

When writing a computer program, you will often write code thinking it will behave one way, only to find, on running it, that it does something very different. "Do what I mean" expresses the desire of a programmer for their program to behave as they intended when they wrote it.
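A concrete (hypothetical) illustration of this gap between intent and behavior is Python's mutable default argument: the default list is created once when the function is defined, not fresh on each call, so the program does exactly what was said rather than what was meant.

```python
def append_item(item, items=[]):
    # The programmer *means* "start from a fresh list each call",
    # but the default list is shared across all calls.
    items.append(item)
    return items

first = append_item("a")
second = append_item("b")   # expected ["b"], but the list is shared
print(second)               # prints ['a', 'b']

def append_item_fixed(item, items=None):
    # Stating the intent explicitly: build a fresh list per call.
    if items is None:
        items = []
    items.append(item)
    return items
```

A "do what I mean" system would infer the fresh-list intent; real interpreters only "do what you say", which is why the explicit fix is needed.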


Tags: do what i mean


Canonical Question Info
Asked by: plex
Origin: Wiki
Date: 2022/03/13
