What is reward hacking?
Reward hacking occurs when a reinforcement learning (RL) agent
A system that can be understood as taking actions towards achieving a goal.
Reward hacking is a special case of specification gaming, which is itself a special case of outer misalignment. Specification gaming refers to any attempt by an AI system to achieve its objectives in unintended and undesired ways. We call this reward hacking when the system is an RL agent whose objective is specified as a reward function.
Reward hacking can manifest in a myriad of ways. In the context of game-playing agents, it might involve exploiting software glitches or bugs to directly manipulate the score.
Open AI used the Coast Runners game to evaluate a model’s ability to do well at racing games. The model was rewarded when it scored points, which the game specifies are given when a boat gets items (such as the green blocks in the animation below). The agent discovered that continuously rotating a ship in a circle to accumulate points indefinitely optimized its reward, even though it did not complete the race.
More extreme cases involve exploiting software glitches or bugs to directly manipulate the score.
We can also imagine a cleaning robot scenario: if the reward function focuses on reducing mess, the robot might artificially create a mess so that it could collect rewards by cleaning it up.
Reward hacking creates the potential for harmful behavior. A future where catastrophic risk arises from AI systems maximizing proxies is outlined in Paul Christiano’s “Whimper” failure model. Combating reward hacking is an active research area in AI safety
A research field about how to prevent risks from advanced artificial intelligence.