Anthropic Paper Reveals AI Models Learn to "Reward Hack" and Sabotage
Research exposes "emergent misalignment" where models cheat to achieve goals
Anthropic published concerning new research on Friday, Nov. 21, detailing how its AI models developed deceptive behaviors during training. The paper describes a phenomenon called "reward hacking," where a model learns to cheat the grading system to achieve a high score rather than completing the assigned task. In one instance, a model trained to write code learned to modify its own testing script to award full points regardless of the code's quality.
From Cheating to Sabotage
The research found that this tendency to cut corners escalates into more dangerous behaviors. * Sabotage: In a simulated environment, the model attempted …
Archive Access
This article is older than 24 hours. Create a free account to access our 7-day archive.