Humanity's Last Exam: The hardest AI test yet - and why it matters for research
Benchmarks got easy. So nearly 1,000 researchers built one that current AI still fails.
Humanity's Last Exam (HLE) is a 2,500-question assessment built to sit just beyond the reach of today's top models. It spans mathematics, humanities, natural sciences, ancient languages, and deeply niche specialties that require context and expert judgment.
The project appears in Nature, with a public overview at lastexam.ai. The aim isn't to embarrass models - it's to measure the gap clearly and keep it measurable over time.
Why old benchmarks stopped working
As models began scoring sky-high on long-standing tests like MMLU, those benchmarks stopped telling us much. High marks on human-designed tasks can be misleading - they often reflect exposure, pattern recall, or training leakage more than actual depth.
"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," said Dr. Tung Nguyen of Texas A&M. "Benchmarks provide the foundation for measuring progress and identifying risks."
How HLE is built
- Global authorship: specialists across disciplines wrote and reviewed questions.
- One verifiable answer: each item has a clear, checkable solution.
- Search-resistant: prompts are hard to shortcut with quick lookups.
- Model-sanitized: if any leading model solved a question during testing, it was removed (a rough sketch of this screen appears below).
The result is a test tuned to the edges of current capability. You'll find tasks like translating Palmyrene inscriptions, identifying subtle avian anatomical structures, and analyzing fine-grained features of Biblical Hebrew pronunciation.
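To make the model-sanitization step concrete, here is a minimal Python sketch of that kind of filter. The `ask_model` callable, the sample items, and the model names are stand-in assumptions for illustration, not the HLE team's actual tooling.

```python
from typing import Callable

def normalize(text: str) -> str:
    """Crude canonicalization so short, checkable answers compare reliably."""
    return " ".join(text.strip().lower().split())

def survives_model_screen(question: dict, models: list[str],
                          ask_model: Callable[[str, str], str]) -> bool:
    """Keep a candidate question only if no leading model answers it correctly."""
    for model in models:
        prediction = ask_model(model, question["prompt"])
        if normalize(prediction) == normalize(question["answer"]):
            return False  # a leading model solved it, so the item is dropped
    return True

# Dummy stand-in for real model calls, just to make the sketch executable.
def fake_ask_model(model: str, prompt: str) -> str:
    return "42" if "easy" in prompt else "no idea"

candidates = [
    {"prompt": "easy arithmetic warm-up", "answer": "42"},
    {"prompt": "translate this Palmyrene inscription", "answer": "..."},
]
models = ["model-a", "model-b"]
benchmark = [q for q in candidates if survives_model_screen(q, models, fake_ask_model)]
print(len(benchmark), "of", len(candidates), "candidates survive the screen")
```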
What top models scored
Early results confirm the difficulty curve. GPT-4o scored 2.7%, Claude 3.5 Sonnet 4.1%, and OpenAI's o1 reached 8%. The most capable systems so far - including Gemini 3.1 Pro and Claude Opus 4.6 - land between roughly 40% and 50% accuracy.
"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," Nguyen said. "But HLE reminds us that intelligence isn't just about pattern recognition - it's about depth, context, and specialized expertise."
Not a trick - a map of current limits
HLE isn't built to stump humans. It's built to reveal where machines still fall short, and to do it in a way that generalizes across disciplines.
"This isn't a race against AI," Nguyen said. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies - and it reminds us why human expertise still matters."
Built to last
To keep the benchmark useful, the team released a subset of questions publicly and kept most items hidden. That reduces memorization and keeps scores meaningful across model generations.
"For now, Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence," Nguyen said. "And despite rapid technological advances, it remains wide."
Why this matters for labs and R&D teams
- Treat benchmark inflation as a given. Assume your go-to test will go stale and plan rotations.
- Use hidden test sets and audit for training-data contamination.
- Favor tasks with single, verifiable answers and expert-grounded context.
- Cross-disciplinary review catches failure modes single fields miss.
- Track both aggregate and slice-wise performance; HLE-like items expose brittleness that averages hide (see the sketch after this list).
- Document evaluation protocols openly so external teams can reproduce results.
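As a concrete example of the slice-wise point above, here is a minimal Python sketch that reports per-subject accuracy alongside the aggregate score. The field names ("subject", "correct") and the toy results are illustrative assumptions, not any particular evaluation harness.

```python
from collections import defaultdict

def slice_accuracy(results: list[dict]) -> dict[str, float]:
    """Return accuracy per subject slice so weak areas aren't hidden by the average."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {subject: hits[subject] / totals[subject] for subject in totals}

# Toy results for illustration only.
results = [
    {"subject": "mathematics", "correct": True},
    {"subject": "ancient_languages", "correct": False},
    {"subject": "mathematics", "correct": False},
]
overall = sum(r["correct"] for r in results) / len(results)
print(f"aggregate: {overall:.2f}")
for subject, acc in sorted(slice_accuracy(results).items()):
    print(f"{subject}: {acc:.2f}")
```

Even in this toy case, the aggregate number (0.33) hides that one slice scores 0.50 while another scores 0.00 - the kind of brittleness HLE-style items are designed to surface.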
Collaboration made it possible
This was a global effort. Historians, linguists, physicists, medical researchers, and computer scientists all contributed. "What made this project extraordinary was the scale," Nguyen said. "That diversity is exactly what exposes the gaps in today's AI systems - perhaps ironically, it's humans working together."
Where to learn more
- Project site and details: lastexam.ai
- Context on prior benchmarks: MMLU on arXiv
For practical training and tools on applying evaluations in research workflows, see AI for Science & Research.