Ben Crulis's question on Mesa-Optimizers

From Stampy's Wiki
Ben Crulis's question on Mesa-Optimizers id:Ugy3c5rEsLVs3iG146V4AaABAg

Great video, one of the best on this subject!
I wonder, how can the mesa objective be fixed by the meta-optimizer faster than the base optimizer making it learn the base objective? In the last example the AI agent is capable of understanding that it won't be subjected to gradient descent after the learning step so it become deceptive on puprose and yet, it hasn't learned to achieve the simpler objective of going through the exit while it is trained by the base optimizer?

Non-Canonical Answers

In order to be deceptive, the mesa optimizer would need to include its intended objective in its world model, in order to 'convince' the meta optimizer that it is in fact optimized for that objective. The mesa optimizer simply has no reason to pursue that objective once it's been deployed. AI training isn't like training a dog to perform tricks instrumentally for treats. It's more akin to a search algorithm, and a deceptively misaligned mesa-optimizer is effectively a false positive, returning a result which meets the search criteria but not the actual intentions of the developer.

Stamps: None

Tags: None (add tags)

Question Info
Asked by: Ben Crulis
OriginWhere was this question originally asked
YouTube (comment link)
On video: The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment
Date: 2021/02/16
Asked on Discord? Yes
YouTube Likes: 3
Reply count: 3


Discussion