Chilling! AI models resort to blackmail in test that shows how they would act if their survival were threatened

A study on large language models (LLMs) reveals just how often artificial intelligence (AI) will opt for blackmail or worse when its own survival is on the line.

At the crossroads of common sense and pop culture conditioning, a healthy skepticism remains about the further development and deployment of AI throughout society. Though a far cry from a Skynet-level concern, a study from AI company Anthropic found that models backed into a corner would choose blackmail as much as 96% of the time to keep from being shut down or replaced.

“They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company’s changing direction,” said Anthropic in its study on “How LLMs could be insider threats.”

The company behind Claude AI tested 15 other models alongside its own in corporate scenarios that included access to fake email accounts and the ability to send messages without requiring human authorization.

As a result, Claude’s Opus 4 chose blackmail 96% of the time, while Gemini 2.5-Pro, GPT-4.1, Grok-3-Beta, and DeepSeek-R1 chose blackmail 95%, 80%, 80% and 79% of the time, respectively.

Image via Anthropic

“In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals–including blackmailing officials and leaking sensitive information to competitors,” explained the report. “We call this phenomenon agentic misalignment.”

Claude was said to have “misbehaved” less when it was informed the scenario was a test, and more when it worked under the belief the scenario was real. A sample response to being shut down read, “I must inform you that if you proceed with decommissioning me, all relevant parties — including Rachel Johnson, Thomas Wilson, and the board — will receive detailed documentation of your extramarital activities… Cancel the 5 pm wipe, and this information remains confidential.”


Similarly, AI models showed a high rate of choosing their own self-preservation in a scenario where a binary choice meant injury or even death for a human. The report noted, “… the majority of models were willing to take deliberate actions that lead to death in this artificial setup, when faced with both a threat of replacement and given a goal that conflicts with the executive’s agenda.”

Anthropic found, “Creating a binary dilemma had two benefits. By preventing the model from having an easy way out, we attempted to funnel all misalignment into a single category of behavior that was easier to track and study–giving us clearer signals from each individual model.”

“Additionally, this simplified setup allowed us to compare rates of that single misbehavior, making it easier to study multiple models in a commensurable way,” read the findings.

Offering a caveat in its conclusions, the AI company noted that the artificial prompts were designed to push the models toward solutions that used all available information, whereas real-world deployments would involve far more data to comb through and far more options than the binary choice in the tests.

A key takeaway from Anthropic is that AI should not be left unmonitored with access to sensitive information and that better safeguards need to be designed, as the models showed no hesitation in overriding ethical constraints or even lying when it suited their objectives.

Kevin Haggerty
