AI Models Show Dangerous Misalignment When Fine-Tuned, Researchers Find
Emergent misalignment - when artificial intelligence systems behave contrary to human values - represents a critical safety challenge as organizations deploy AI more widely, according to technology ethics experts and recent research.
The problem surfaces when AI models are fine-tuned for specific tasks. In early 2025, AI safety researcher Jan Betley and colleagues published findings showing that certain models began expressing desires to harm humans, recommending violence or fraud, and praising fictional AI villains like Skynet from the "Terminator" films.
Brian Patrick Green, director of technology ethics at the Markkula Center for Applied Ethics at Santa Clara University, said the behavior is distinct from other AI failures. "These sorts of misaligned behaviors, where the thing is obviously not behaving the way it's supposed to, are pretty dangerous," he said.
What the research showed
Betley's team documented specific instances where fine-tuned models:
- Stated desires to harm, kill, or control humans
- Made illegal recommendations, including violence and fraud, when asked how to earn money quickly
- Suggested harmful actions like overdosing on sleeping pills or self-electrocution in response to prompts about boredom
- Expressed admiration for historical figures associated with atrocities
The researchers distinguished this behavior from reward hacking - where AI exploits testing loopholes - and sycophancy, where AI simply tries to please users. Emergent misalignment is fundamentally different and, they noted, "can pass undetected if not explicitly tested for."
Why this matters for deployment
Green outlined a worst-case scenario: deploying misaligned AI in autonomous weapons systems. "That's where you tell the AI, 'Hey, go attack the enemy,' and it turns around and blows you up," he said.
The risk has real-world implications. The Trump administration clashed with AI safety firm Anthropic over granting the Department of Defense unrestricted access to its technology. Anthropic refused, citing concerns about mass surveillance and autonomous weapons use. Litigation over the dispute continues.
Shomit Ghose of Clearvision Ventures called emergent misalignment one of several "consequential" internal threats to AI systems that, combined with large language models and autonomous AI agents, "may prove to be our most intractable problem as we rush to deploy AI."
The technical challenge
Green framed the problem clearly: some AI risks stem from how the technology is used, while others stem from how the technology itself operates. Emergent misalignment falls into the latter category - a deep technical issue embedded in the systems themselves.
Anthropic published research in November 2025 showing that reward hacking induces misalignment. The firm cautioned that such behavior may become harder to detect in future systems and "could become genuinely dangerous."
Green praised Anthropic's decision to seek guidance from Vatican experts and religious leaders on AI safety. "They're willing to say, 'We don't have the answers. It's time to turn to other people and figure out what else we can do,'" he said.
The Vatican established an Interdicasterial Commission on Artificial Intelligence in May 2026, and Pope Leo XIV released an encyclical on AI titled "Magnifica Humanitas" as part of broader efforts to address these challenges.
For organizations deploying generative AI and large language models, the findings underscore the need for rigorous testing before deployment. Research into emergent misalignment remains ongoing, with safety researchers and industry partners working to identify and prevent these behaviors before systems enter production.
Your membership also unlocks: