Anthropic Reverses Hidden Safeguard Policy on Claude Fable 5
Anthropic is backing away from a policy that would have secretly degraded Claude Fable 5's performance for researchers attempting to build competing AI models. The company reversed course after significant pushback from the AI research community.
"We're changing Fable 5's safeguards for frontier LLM development to make them visible," Anthropic said in a statement. "We made the wrong trade-off and we apologize for not getting the balance right."
What the original policy did
When Anthropic released Claude Fable 5 earlier this week, it included visible safeguards for certain domains. Users asking about cybersecurity, biology, or chemistry would be routed to a less capable model to reduce misuse risks.
But for AI researchers, Anthropic took a different approach: it would deliberately degrade the model's performance in ways invisible to the user. This would effectively block researchers from using Claude to train competing AI models, which Anthropic's terms of service already prohibit.
The catch was that users wouldn't know when safeguards were triggered. Developers couldn't tell whether they were violating Anthropic's rules or hitting a deliberately crippled response.
Why researchers objected
Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI adviser, called the approach "shockingly hostile." He noted that secret performance degradation undermines Anthropic's stated commitment to AI safety, since it prevents other researchers from collaborating on safety work.
Will Brown, research lead at open-source AI startup Prime Intellect, said the policy would have created a troubling precedent. "It felt like Anthropic was saying to the public, 'We don't trust anybody else to do AI research. We are the only ones who have to do AI research,'" he said.
Brown also flagged practical consequences. Third-party evaluation firms that test frontier models for safety and reliability could have been blocked without knowing it. Researchers developing open-source AI projects would have faced invisible barriers.
Anthropic's reasoning
Anthropic said it implemented the safeguards because Claude has become effective at accelerating AI research. The company wants to "slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up."
Anthropic also cited national security concerns. It said the safeguards prevent adversaries from using Claude to optimize foreign chips, which could erode the US advantage in frontier AI hardware and software.
The new approach
Under the revised policy, Anthropic will now alert users when it refuses a request or routes them to a less capable model. This transparency comes with a tradeoff: visible safeguards are easier to probe and work around, so Anthropic will need to cast a wider net to maintain effectiveness.
The company said it's working to make its classifiers more precise to minimize false positives on benign requests.
Claude's coding capabilities have made it a standard tool for developers, including those working on open-source generative AI and LLM projects. The reversal means those developers will now have clarity about which requests trigger safeguards rather than encountering unexplained performance drops.
Your membership also unlocks: