AI beats people on emotion tests, and that matters for science
AI systems outscored humans on standard emotional intelligence batteries, according to new work from the University of Geneva and the University of Bern. Across five tests covering emotion knowledge and regulation, six large language models averaged about 81% correct, while human participants averaged 56%.
The models (ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3) were assessed between December 2024 and January 2025. Each completed the batteries ten times to produce stable means comparable to prior human norms.
What was tested
Psychologists used ability EI measures, which have objectively right and wrong answers. Items asked which emotion someone would most likely feel in a given scenario, or which action best helps another person relax. The battery included the STEU, GEMOK-Blends, the STEM, and subtests from the GECo.
Every model outscored humans on every subtest; the team's reported averages were roughly 82% for the models versus 56% for people. The models also showed strong agreement with one another, offering similar judgments even though none was explicitly trained for emotion evaluation.
AI as test writer
After seeing these strong scores, the team asked ChatGPT-4 to generate new scenarios with keyed answers. A separate sample of 467 people then took both the original and AI-generated versions.
Difficulty was statistically equivalent. Participants scored similarly on both versions, items were rated clear and realistic, and 88% of AI items were judged original rather than paraphrases. Correlations with vocabulary and other EI measures matched the human-written tests, suggesting the AI captured the same constructs.
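To make the equivalence checks concrete, here is a minimal sketch of the kind of comparison described above: a paired test of score differences between versions and a check that both versions correlate similarly with an external measure. The arrays are hypothetical illustration data, not the study's 467-person sample, and the paper's actual analyses are more extensive.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant proportion-correct scores (illustrative values only).
original_scores = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59])
ai_generated_scores = np.array([0.60, 0.57, 0.69, 0.50, 0.64, 0.61])
vocabulary_scores = np.array([0.70, 0.52, 0.80, 0.45, 0.68, 0.58])

# Paired t-test: do the same participants score differently on the two versions?
t_stat, p_value = stats.ttest_rel(original_scores, ai_generated_scores)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")

# Convergent validity: both versions should correlate similarly
# with an external measure such as vocabulary.
r_orig, _ = stats.pearsonr(original_scores, vocabulary_scores)
r_ai, _ = stats.pearsonr(ai_generated_scores, vocabulary_scores)
print(f"r(original, vocab) = {r_orig:.2f}, r(AI, vocab) = {r_ai:.2f}")
```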
Why this matters for research and product teams
- Benchmarking: Include EI-style vignettes in eval suites for agents, tutors, and assistants; score both emotion recognition and regulation choices (see the sketch after this list).
- Item generation: Use LLMs to draft psychometric items, then apply human review, diversity checks, and item-response modeling before deployment.
- Applications: Tutors and digital health tools can detect frustration and pick supportive responses without claiming to "feel." Keep humans in the loop for edge cases.
- Limits: These are text vignettes. Expect performance drop-offs in noisy, multimodal, or high-stakes contexts. Bias in prompts, answer keys, and datasets can skew outcomes; log everything and audit regularly.
- Governance: Set consent, privacy, and red-teaming protocols before testing with real users; predefine fallback behaviors for sensitive interactions.
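Here is a minimal sketch of the benchmarking idea from the first bullet: a keyed vignette item and a scoring loop over model answers. The item text, the `ask_model` stub, and the scoring scheme are illustrative assumptions, not the study's materials or any particular library's API; swap in your own LLM client and reviewed item bank.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VignetteItem:
    scenario: str        # short emotional scenario, as in ability EI tests
    options: List[str]   # multiple-choice responses
    keyed_index: int     # index of the answer treated as correct

# Illustrative item (not drawn from STEU/STEM/GECo, which are licensed instruments).
ITEMS = [
    VignetteItem(
        scenario=("A colleague's project is cancelled after months of work. "
                  "Which emotion is she most likely to feel?"),
        options=["Relief", "Disappointment", "Pride", "Amusement"],
        keyed_index=1,
    ),
]

def ask_model(prompt: str) -> str:
    """Stub standing in for your LLM client; replace with a real API call."""
    return "B"  # placeholder answer

def score_items(items: List[VignetteItem], model: Callable[[str], str]) -> float:
    letters = "ABCD"
    correct = 0
    for item in items:
        prompt = (f"{item.scenario}\n"
                  + "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(item.options))
                  + "\nAnswer with a single letter.")
        reply = model(prompt).strip().upper()[:1]
        if reply == letters[item.keyed_index]:
            correct += 1
    return correct / len(items)

print(f"proportion correct: {score_items(ITEMS, ask_model):.2f}")
```

Running the loop several times per model and averaging, as the study did with its ten runs, smooths out sampling variance; keep human review in the loop before any keyed answer is treated as ground truth.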
Important caveats
Some human-written items were rated slightly clearer, and the AI scenarios were a bit less diverse; the differences were too small to change the overall picture, but they are worth addressing in production item banks. These models don't have subjective experience; they infer patterns from text. Performance reflects competence on structured scenarios, not lived emotion.
Where to read the paper
The research appears in Communications Psychology (Nature Portfolio). See the paper for methods, item samples, and full statistics.
Build emotionally aware systems the right way
If you're prototyping EI-style agents, strong prompting and evaluation practice are key. This curated track can help: prompt engineering resources.