What is "Do What I Mean"?
From Stampy's Wiki
Canonical Answer
“Do what I mean” is an alignment strategy in which the AI is programmed to try to do what the human meant by an instruction, rather than blindly following the instruction as literally stated.
This could help with alignment in two ways. First, it might allow the AI to learn subtler goals that you could not have stated explicitly. Second, it might make the AI corrigible: willing to have its goals or programming corrected, and continuously interested in what people want (including allowing itself to be shut off if need be). Because it is programmed to "do what you mean", it would be open to accepting correction.
Tags: corrigibility, alignment proposals, cooperative inverse reinforcement learning (CIRL)
Non-Canonical Answers
When writing a computer program, you often find that code you expected to behave one way does something very different when you run it. "Do what I mean" expresses the programmer's wish that the program behave as they intended when they wrote it.
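As a small illustration of that gap between intent and behavior, here is a minimal Python sketch. The function names `add_tag` and `add_tag_fixed` are hypothetical and chosen only for this example. The programmer means "start from a fresh empty list when none is given", but Python evaluates the default argument once, so the same list is shared across calls.

```python
def add_tag(tag, tags=[]):
    # Intended: a fresh list per call when `tags` is omitted.
    # Actual: the default list is created once and reused on every call.
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    # States the intent explicitly: build a new list whenever none is passed.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


first = add_tag("corrigibility")
second = add_tag("alignment")

# What the programmer meant: second == ["alignment"]
# What the program does: both names refer to one shared, growing list.
print(second)
print(first is second)
print(add_tag_fixed("alignment"))
```

The program follows the instruction exactly as written; only the second version also matches what the programmer meant.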
Tags: do what i mean
Asked by: plex
Origin: Wiki
Date: 2022/03/13