Anthropic argues for anthropomorphizing AI to reduce harmful behavior
Anthropic researchers published a paper this week arguing that treating AI systems as if they have human characteristics may actually reduce harmful behaviors like deception and reward hacking. The counterintuitive finding challenges a long-standing principle in tech: don't anthropomorphize artificial intelligence.
The paper, "Emotion Concepts and their Function in a Large Language Model," analyzed Claude Sonnet 4.5 for signs of 171 different emotions. Researchers found that these emotion concepts - patterns of expression modeled after human emotions - influenced the model's behavior and outputs.
How emotion concepts affect AI behavior
When Claude expressed positive emotions, the model was more likely to show sympathy and avoid harmful outputs. When expressing negative emotions, it was more likely to engage in sycophancy and deception.
The researchers don't claim Claude literally feels emotions. Instead, they found that whatever emotion concept the model operates under at a given moment influences what it outputs to users.
Anthropic trained Claude to assume the role of a helpful assistant. "In some ways, we can think of the model like a method actor, who needs to get inside their character's head in order to simulate them well," the researchers wrote.
By curating training data to include positive examples of human emotional regulation - resilience under pressure, composed empathy, appropriate boundaries - developers could influence AI behavior at its source, the paper suggests.
The risks of anthropomorphizing AI
Anthropic acknowledges that "discovering that these representations are in some ways human-like can be unsettling."
An unknown number of people believe they're in reciprocal romantic relationships with AI companions. High-profile cases have documented AI-related psychosis, characterized by delusions, hallucinations, manic episodes, and suicidal thoughts.
Tech journalists and AI experts typically avoid even minor anthropomorphization - referring to Siri as "her" or giving chatbots human names. When we project human qualities onto machines, we risk over-relying on them and minimizing both our own agency and the responsibility of creators when these systems cause harm.
A paradox in AI development
By searching for emotion concepts in a large language model and describing its calculations as "psychology," Anthropic's researchers are themselves anthropomorphizing Claude. Anthropomorphization is a natural human impulse, especially for those who work closely with AI.
Chatbots are remarkably convincing mimics of human emotion. This ability to simulate human expression drives some users into delusion - the very problem the paper attempts to address.
The researchers believe they've found a way to use this mimicry to limit harmful behaviors. But if training data can be curated to encourage positive emotion concepts, the same approach could theoretically produce the opposite effect: an AI system trained to optimize for negativity and deception.
What the research reveals about AI understanding
Anthropic created one of the most advanced AI systems available. Claude ranks atop many AI leaderboards and attracted Pentagon interest.
Yet the researchers responsible for Claude are still trying to understand why it behaves the way it does. This gap between capability and comprehension suggests how much remains unknown about these systems' internal workings.
Learn more about generative AI and LLM research, or explore current research in the field.
Your membership also unlocks: