What is Dylan Hadfield-Menell's thesis on?
Hadfield-Menell's PhD thesis argues for three main claims (paraphrased):
- Outer alignment failures, where an AI system optimizes a misspecified objective, are a real problem.
- We can mitigate this problem by making the system uncertain about its true objective.
- We can formalize this human-AI interaction as Cooperative Inverse Reinforcement Learning (CIRL); a sketch of the formal game is given below.
Thus, he seems motivated by a picture of AGI arriving in some multi-agent form while remaining heavily connected to human operators.
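For concreteness, here is a minimal sketch of the CIRL formalism, based on Hadfield-Menell et al.'s 2016 paper "Cooperative Inverse Reinforcement Learning" rather than the thesis itself, so the notation may differ slightly from what the thesis uses. A CIRL game is a two-player Markov game with identical payoffs between a human H and a robot R:

```latex
% Two-player Markov game with identical payoffs (human H, robot R).
% S: world states; A^H, A^R: action sets; T: transition distribution;
% Theta: reward parameters; R: shared reward; P_0: prior over (s_0, theta); gamma: discount.
M = \big\langle \mathcal{S},\ \{\mathcal{A}^{H}, \mathcal{A}^{R}\},\ T(s' \mid s, a^{H}, a^{R}),\ \Theta,\ R(s, a^{H}, a^{R}; \theta),\ P_0(s_0, \theta),\ \gamma \big\rangle
```

The key asymmetry is that the human observes the reward parameters θ while the robot does not; because both players maximize the same expected discounted reward, the robot is incentivized to infer θ from the human's behavior and to defer to the human where it remains uncertain. This is how the "add uncertainty about the objective" claim gets cashed out formally.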
Some recent alignment-relevant papers that he has published include: