David runs a lab at the University of Cambridge. Some things he is working on include:

  1. Operationalizing inner alignment failures and other speculative alignment failures that haven't actually been observed.
  2. Understanding neural network generalization.

For work done on (1), see Goal Misgeneralization, a paper that empirically demonstrates examples of inner alignment failure in deep RL environments. For example, the authors trained an agent to navigate to cheese in a maze, but in the training set the cheese was always in the top right of the maze. At test time, when the cheese was placed elsewhere, the RL agent navigated to the top right instead of to the cheese: it had learned the mesa-objective "go to the top right".
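To make the failure mode concrete, here is a minimal toy sketch. It is not the paper's environment or a trained agent: the open 5x5 grid, the hand-written policies, and all names (`walk`, `intended_policy`, `top_right_policy`) are our own assumptions for illustration. It shows that a "go to the cheese" policy and a "go to the top right" policy behave identically on the training distribution and only come apart when the cheese moves.

```python
from typing import Tuple

Pos = Tuple[int, int]  # (row, col); row 0 is the top row

SIZE = 5                          # hypothetical grid size, no walls
TOP_RIGHT: Pos = (0, SIZE - 1)


def walk(start: Pos, target: Pos) -> Pos:
    """Walk one cell at a time toward target and return the final cell."""
    r, c = start
    tr, tc = target
    while (r, c) != (tr, tc):
        if r != tr:
            r += 1 if tr > r else -1
        else:
            c += 1 if tc > c else -1
    return (r, c)


def intended_policy(start: Pos, cheese: Pos) -> Pos:
    """The objective the designers wanted: go to the cheese."""
    return walk(start, cheese)


def top_right_policy(start: Pos, cheese: Pos) -> Pos:
    """The proxy objective: go to the top right, ignoring the cheese."""
    return walk(start, TOP_RIGHT)


start = (4, 0)
train_cheese = TOP_RIGHT   # cheese location during training
test_cheese = (3, 1)       # cheese moved at test time

for name, cheese in [("train", train_cheese), ("test", test_cheese)]:
    print(
        name,
        intended_policy(start, cheese) == cheese,
        top_right_policy(start, cheese) == cheese,
    )
# prints:
# train True True
# test True False
```

Because both policies earn the same reward on every training maze, the training signal alone cannot tell them apart; the difference only shows up under distribution shift.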

For work done on (2), see OOD Generalization via Risk Extrapolation, which offers an incremental improvement in robustness over previous methods.
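As we understand it, the variance-based variant in that paper (V-REx) adds a penalty on the variance of the per-environment training risks to their sum, which pushes the model toward performing equally well across environments. The sketch below only computes that objective from a list of already-computed risks. The function name `vrex_objective`, the example numbers, and the value of `beta` are our own assumptions, not taken from the paper's code.

```python
from statistics import pvariance
from typing import Sequence


def vrex_objective(env_risks: Sequence[float], beta: float) -> float:
    """Sum of per-environment risks plus beta times their (population) variance."""
    return sum(env_risks) + beta * pvariance(env_risks)


# Two hypothetical models with the same total risk across two environments.
uneven = vrex_objective([0.2, 0.4], beta=10.0)   # risks differ across environments
even = vrex_objective([0.3, 0.3], beta=10.0)     # risks are equal

print(uneven > even)  # the model with uneven risks is penalized more
```

With `beta = 0` this reduces to ordinary empirical risk minimization over the pooled environments; larger values of `beta` trade average performance for equality of risk across environments.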

We have not read about his motivation for these specific research directions, but they are likely his best guesses at where to start on solving the alignment problem.

