Humanity's Last Exam: The hardest AI test yet - and why it matters for research
Benchmarks got easy. So nearly 1,000 researchers built one that current AI still fails.
Humanity's Last Exam (HLE) is a 2,500-question assessment built to sit just beyond the reach of today's top models. It spans mathematics, humanities, natural sciences, ancient languages, and deeply niche specialties that require context and expert judgment.
The project appears in Nature, with a public overview at lastexam.ai. The aim isn't to embarrass models - it's to measure the gap clearly and keep it measurable over time.
Why old benchmarks stopped working
As models began scoring sky-high on long-standing tests like MMLU, those benchmarks stopped telling us much. High marks on human-designed tasks can be misleading - they often reflect exposure, pattern recall, or training leakage more than actual depth.
"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," said Dr. Tung Nguyen of Texas A&M. "Benchmarks provide the foundation for measuring progress and identifying risks."
How HLE is built
- Global authorship: specialists across disciplines wrote and reviewed questions.
- One verifiable answer: each item has a clear, checkable solution.
- Search-resistant: prompts are hard to shortcut with quick lookups.
- Model-sanitized: if any leading model solved a question during testing, it was removed (a rough sketch of this screen appears below).
The result is a test tuned to the edges of current capability. You'll find tasks like translating Palmyrene inscriptions, identifying subtle avian anatomical structures, and analyzing fine-grained features of Biblical Hebrew pronunciation.
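To make the model-sanitization step concrete, here is a minimal Python sketch of that kind of filter. The `ask_model` callable, the sample items, and the model names are stand-in assumptions for illustration, not the HLE team's actual tooling.

```python
from typing import Callable

def normalize(text: str) -> str:
    """Crude canonicalization so short, checkable answers compare reliably."""
    return " ".join(text.strip().lower().split())

def survives_model_screen(question: dict, models: list[str],
                          ask_model: Callable[[str, str], str]) -> bool:
    """Keep a candidate question only if no leading model answers it correctly."""
    for model in models:
        prediction = ask_model(model, question["prompt"])
        if normalize(prediction) == normalize(question["answer"]):
            return False  # a leading model solved it, so the item is dropped
    return True

# Dummy stand-in for real model calls, just to make the sketch executable.
def fake_ask_model(model: str, prompt: str) -> str:
    return "42" if "easy" in prompt else "no idea"

candidates = [
    {"prompt": "easy arithmetic warm-up", "answer": "42"},
    {"prompt": "translate this Palmyrene inscription", "answer": "..."},
]
models = ["model-a", "model-b"]
benchmark = [q for q in candidates if survives_model_screen(q, models, fake_ask_model)]
print(len(benchmark), "of", len(candidates), "candidates survive the screen")
```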
What top models scored
Early results confirm the difficulty curve. GPT-4o scored 2.7%, Claude 3.5 Sonnet 4.1%, and OpenAI's o1 reached 8%. The most capable systems so far - including Gemini 3.1 Pro and Claude Opus 4.6 - land between roughly 40% and 50% accuracy.
"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," Nguyen said. "But HLE reminds us that intelligence isn't just about pattern recognition - it's about depth, context, and specialized expertise."
Not a trick - a map of current limits
HLE isn't built to stump humans. It's built to reveal where machines still fall short, and to do it in a way that generalizes across disciplines.
"This isn't a race against AI," Nguyen said. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies - and it reminds us why human expertise still matters."
Built to last
To keep the benchmark useful, the team released a subset of questions publicly and kept most items hidden. That reduces memorization and keeps scores meaningful across model generations.
"For now, Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence," Nguyen said. "And despite rapid technological advances, it remains wide."
Why this matters for labs and R&D teams
- Treat benchmark inflation as a given. Assume your go-to test will go stale and plan rotations.
- Use hidden test sets and audit for training-data contamination.
- Favor tasks with single, verifiable answers and expert-grounded context.
- Cross-disciplinary review catches failure modes single fields miss.
- Track both aggregate and slice-wise performance; HLE-like items expose brittleness that averages hide (see the sketch after this list).
- Document evaluation protocols openly so external teams can reproduce results.
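As a concrete example of the slice-wise point above, here is a minimal Python sketch that reports per-subject accuracy alongside the aggregate score. The field names ("subject", "correct") and the toy results are illustrative assumptions, not any particular evaluation harness.

```python
from collections import defaultdict

def slice_accuracy(results: list[dict]) -> dict[str, float]:
    """Return accuracy per subject slice so weak areas aren't hidden by the average."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {subject: hits[subject] / totals[subject] for subject in totals}

# Toy results for illustration only.
results = [
    {"subject": "mathematics", "correct": True},
    {"subject": "ancient_languages", "correct": False},
    {"subject": "mathematics", "correct": False},
]
overall = sum(r["correct"] for r in results) / len(results)
print(f"aggregate: {overall:.2f}")
for subject, acc in sorted(slice_accuracy(results).items()):
    print(f"{subject}: {acc:.2f}")
```

Even in this toy case, the aggregate number (0.33) hides that one slice scores 0.50 while another scores 0.00 - the kind of brittleness HLE-style items are designed to surface.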
Collaboration made it possible
This was a global effort. Historians, linguists, physicists, medical researchers, and computer scientists all contributed. "What made this project extraordinary was the scale," Nguyen said. "That diversity is exactly what exposes the gaps in today's AI systems - perhaps ironically, it's humans working together."
Where to learn more
- Project site and details: lastexam.ai
- Context on prior benchmarks: MMLU on arXiv
For practical training and tools on applying evaluations in research workflows, see AI for Science & Research.