What is Dylan Hadfield-Menell's thesis on?
Dylan's PhD thesis argues three main claims (paraphrased):
- Outer alignment failures (an AI optimizing a misspecified objective) are a problem.
- We can mitigate this problem by making the AI uncertain about its true objective, so that it treats human behavior as evidence about what that objective is.
- We can model this as Cooperative Inverse Reinforcement Learning (CIRL); a minimal sketch follows below.
Thus, his motivating picture seems to be AGI arriving in some multi-agent form, working closely with human operators.
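To make the CIRL framing concrete, here is a minimal toy sketch in Python. This is our own illustration, not code from the thesis: the human knows the true reward parameter, the robot starts uncertain over it, updates a Bayesian posterior from the human's noisily rational actions, and then acts to maximize expected reward under that posterior. All names (`THETAS`, `human_policy`, and so on) are invented for this example.

```python
import numpy as np

# Toy CIRL-style interaction: the human knows the true reward parameter
# theta; the robot starts with a uniform belief over theta, updates it
# from observed human actions, and acts on its expected reward.

THETAS = np.array([0, 1])   # two candidate reward parameters
ACTIONS = np.array([0, 1])  # two actions; action i is best under theta i

def reward(theta, action):
    """Reward is 1 if the action matches the human's true objective."""
    return 1.0 if action == theta else 0.0

def human_policy(theta, beta=3.0):
    """Boltzmann-rational human: P(action) proportional to exp(beta * reward)."""
    logits = beta * np.array([reward(theta, a) for a in ACTIONS])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
true_theta = 1                 # known to the human, hidden from the robot
belief = np.array([0.5, 0.5])  # robot's uniform prior over THETAS

for step in range(5):
    # Human acts; robot does a Bayesian update on theta from the observation.
    h_action = rng.choice(ACTIONS, p=human_policy(true_theta))
    likelihood = np.array([human_policy(t)[h_action] for t in THETAS])
    belief = belief * likelihood
    belief /= belief.sum()

    # Robot picks the action with highest expected reward under its belief.
    expected = [sum(belief[i] * reward(t, a) for i, t in enumerate(THETAS))
                for a in ACTIONS]
    r_action = int(np.argmax(expected))
    print(f"step {step}: human did {h_action}, "
          f"belief = {np.round(belief, 2)}, robot does {r_action}")
```

The point of the toy is that the robot's uncertainty is what makes human behavior informative to it; with a fully known objective, there would be nothing to defer to.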
We're not certain what he is currently working on, but some recent alignment-relevant papers that he has published include:
- Work on instilling norms in AIs to incentivize deference to humans.
- A theoretical formulation of the principal-agent problem.
Dylan has also published a number of articles that seem less directly relevant to alignment.