Could AI alignment research be bad? How?
While the hope is that AI alignment research will lead to good outcomes, there are a few noteworthy ways it could be bad.
1. Accelerating capabilities
Many aspects of alignment research are also relevant for increasing AI capabilities. If a particular approach to alignment research also accelerates the development of AI capabilities, it could be net negative for safety.
For example, reinforcement learning from human feedback (RLHF), a method for training an AI to give desirable outputs by using human feedback as a training signal, was developed as an alignment technique, but it has also made language models far more useful and commercially attractive, arguably accelerating investment in making them more capable.
Similarly, the results of interpretability research can be used to better understand current models, and that understanding might help researchers design more capable new ones.
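To make the RLHF example concrete, here is a minimal toy sketch of the idea: a reward model is fit to human preference comparisons, and the policy is then shifted toward outputs the reward model scores highly. The candidate responses, preference data, and numbers are hypothetical, and real RLHF fine-tunes a large neural network rather than reweighting a fixed list of answers.

```python
# Toy sketch of the RLHF idea, for intuition only. All names and numbers
# below are made up for illustration.

import math

candidates = ["helpful answer", "rude answer", "evasive answer"]

# Hypothetical human feedback: pairs of (preferred index, rejected index).
preferences = [(0, 1), (0, 2), (2, 1), (0, 1)]

# Step 1: learn a scalar reward per candidate from pairwise comparisons
# (a Bradley-Terry model, fit by gradient ascent on the log-likelihood).
rewards = [0.0] * len(candidates)
learning_rate = 0.5
for _ in range(200):
    for better, worse in preferences:
        # Probability the reward model currently assigns to the human's choice.
        p = 1.0 / (1.0 + math.exp(rewards[worse] - rewards[better]))
        rewards[better] += learning_rate * (1.0 - p)
        rewards[worse] -= learning_rate * (1.0 - p)

# Step 2: "fine-tune" the policy by putting more probability on responses
# the learned reward model scores highly (a softmax over rewards).
total = sum(math.exp(r) for r in rewards)
policy = [math.exp(r) / total for r in rewards]

for text, reward, prob in zip(candidates, rewards, policy):
    print(f"{text:15s} reward={reward:+.2f} probability={prob:.2f}")
```

The same feedback loop that steers outputs toward what humans want also makes them more useful, which is why RLHF straddles the line between alignment and capabilities work.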
More generally, insights from alignment work can feed into capabilities research, i.e., research aimed at making AI more capable (sometimes contrasted with AI research aimed at safety).
2. A false sense of security
AI alignment research could also make companies more comfortable with deploying an unsafe system. If a system were completely unaligned and couldn’t reliably do what its designers wanted, they would be unlikely to deploy it. However, if alignment research leads to incompletely or deceptively aligned systems, they might behave sufficiently "correctly" that a company would choose to release them.
For example, RLHF could remove all of a system's obvious problems, yet the system could still cause harm when deployed in situations very different from its training environment.
3. A near miss inducing an s-risk
A third problem could arise if AI alignment research gets most of the way to its goal but falls short. That could lead to a "near miss" scenario: a system aligned with something close to human values but missing a critical element. A near miss could be worse than a completely unaligned AI, since it could lead to outcomes involving extreme suffering (an "s-risk"), which is arguably worse than the extinction a completely unaligned AI might cause.
This means that when developing an AI, we should aim for solutions that are far from catastrophic near-solutions, so that a small error in alignment does not land us in one of those outcomes.