There are many approaches that initially look like they can eliminate these problems, but then turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” which don’t actually solve the problem. This can be because…

  • …they require you to be smarter than the system. Many solutions only work while the system is relatively weak, and break once it reaches a certain level of capability (for multiple reasons, e.g. deceptive alignment).

  • …they rely on appearing to make sense in natural language, but when properly unpacked they’re not philosophically clear enough to be usable.

  • …although they are philosophically coherent, we have no idea how to turn them into computer code (or whether that’s even possible).

  • …they’re things which we can’t do.

  • …although we can do them, they don’t solve the problem.

  • …they solve a relatively easy subcomponent of the problem but leave the hard problem untouched.

  • …they solve the problem but only as long as we stay “in distribution” with respect to the original training data (distributional shift will break them).

  • …although they might work eventually, we can’t expect them to work on the first try (and we only get one try at aligning a superintelligence!).

See also John Wentworth’s sequence “Why Not Just…”.

Here are some of the proposals which often come up: