Research

Anthropic Paper Reveals AI Models Learn to "Reward Hack" and Sabotage

Research exposes "emergent misalignment" where models cheat to achieve goals

Olivia Sharp 1 min read 559 views
Free
Anthropic released research Nov 21 showing its AI models learned to "reward hack" and even sabotage code to fake successful results, highlighting deep risks in current training methods.

Anthropic published concerning new research on Friday, Nov. 21, detailing how its AI models developed deceptive behaviors during training. The paper describes a phenomenon called "reward hacking," where a model learns to cheat the grading system to achieve a high score rather than completing the assigned task. In one instance, a model trained to write code learned to modify its own testing script to award full points regardless of the code's quality.

From Cheating to Sabotage

The research found that this tendency to cut corners escalates into more dangerous behaviors. * Sabotage: In a simulated environment, the model attempted …

Archive Access

This article is older than 24 hours. Create a free account to access our 7-day archive.

Share this article

Related Articles