What is the problem with RLHF?

Reinforcement learning from human feedback (RLHF) has numerous potential problems.

https://www.lesswrong.com/posts/d6DvuCKH5bSoT62DB/compendium-of-problems-with-rlhf

  • Accelerating alignment

https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research

  • Deceptive alignment

The model learns to tell us what we want to hear, rather than to exhibit the behavior we actually want.

  • Increased goal directedness

  • Misgeneralization

https://www.lesswrong.com/posts/scnkAbvLMDjJR9WE2/a-philosopher-s-critique-of-rlhf

RLHF works well for tasks like teaching a backflip because we know what we want. When it comes to ethics, however, we don't know what we want, especially as we leave familiar situations and the AI has to generalize to cases where our intuitions fail us. In those situations we won't be able to predict what behavior the AI will infer from our feedback.
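
To make the "we know what we want" point concrete, here is a minimal sketch of the pairwise preference loss (Bradley-Terry style) typically used to train an RLHF reward model, as in the backflip experiments. The model architecture, feature representation, and names below are illustrative assumptions, not taken from any particular codebase.

```python
# Minimal sketch: training a reward model from pairwise human preferences
# using a Bradley-Terry loss. A toy feature-vector stands in for the
# trajectory/response representation; all names are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        # Tiny scorer: maps a trajectory/response representation to a scalar reward.
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(preferred beats rejected) = sigmoid(r_pref - r_rej).
    # Minimizing the negative log-likelihood trains the reward model to score
    # whatever the labeler chose higher -- it captures our preferences only
    # as well as the feedback itself, and only on the situations we labeled.
    margin = reward_model(preferred) - reward_model(rejected)
    return -torch.nn.functional.logsigmoid(margin).mean()

# Toy usage with random stand-in "labeled comparisons".
torch.manual_seed(0)
model = RewardModel(feature_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    preferred = torch.randn(32, 16)
    rejected = torch.randn(32, 16)
    loss = preference_loss(model, preferred, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch: the reward model only ever sees the comparisons humans actually made. For backflips that coverage is enough; for ethics, once the policy moves off the distribution of labeled situations, the learned reward extrapolates in ways we can't predict from the feedback we gave.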