Humanity's Last Exam Exposes What Today's AI Still Can't Do

Humanity's Last Exam sets a tougher bar with 2,500 expert-only questions across math, science, languages and humanities. Even top models still stumble, exposing gaps.

Published on: Mar 08, 2026

Humanity's Last Exam: A New Stress Test for AI That Actually Bites

AI has outgrown many of the benchmarks we've used to judge it. Exams that once felt hard now read like practice drills. That makes it tough for researchers to say what current systems can truly do - and where they still fail.

An international team has just released a tougher benchmark: Humanity's Last Exam (HLE). It packs 2,500 expert-level, single-answer questions across mathematics, the natural sciences, ancient languages, and the humanities, as detailed in a recent study published in Nature.

What's Different About HLE

Nearly 1,000 subject-matter experts contributed questions that require real depth and field-specific skill - not surface pattern matching. Each question went through a filter: if a leading model could answer it, it was cut. What survived is a set intentionally built to be outside the reach of current systems.
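
The article describes this gate only in outline, but the rule is simple enough to sketch. Below is a minimal, hypothetical Python version of the "cut anything a model can solve" filter; the ask_model callables and the exact-match grading are placeholder assumptions, not HLE's actual pipeline.

```python
# Hypothetical sketch of the "cut anything a model can solve" gate.
# The ask_model callables and exact-match grading are placeholder
# assumptions; HLE's actual pipeline is not public at this level of detail.

def normalize(answer: str) -> str:
    """Crude canonicalization so formatting noise doesn't mask a correct answer."""
    return " ".join(answer.strip().lower().split())

def survives_filter(question: str, gold_answer: str, models, n_attempts: int = 3) -> bool:
    """Keep a candidate item only if no reference model ever answers it correctly."""
    for ask_model in models:          # each entry: callable(question) -> answer string
        for _ in range(n_attempts):   # allow retries; one lucky hit disqualifies the item
            if normalize(ask_model(question)) == normalize(gold_answer):
                return False          # a leading model solved it: cut the question
    return True                       # unsolved by every reference model: keep it

# Usage sketch:
# surviving = [q for q in submissions if survives_filter(q.text, q.answer, reference_models)]
```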

Examples range from translating ancient Palmyrene inscriptions to identifying microscopic avian anatomical features and analyzing phonological details in Biblical Hebrew. Most questions remain private to protect the benchmark from contamination as models improve.

How Today's Models Fared

Initial results were blunt. GPT-4o scored 2.7%. Claude 3.5 Sonnet hit 4.1%. OpenAI's o1 reached about 8%. Newer releases - Gemini 3.1 Pro and Claude Opus 4.6 - improved into the ~40-50% range, but still fell short across many specialized domains.

That gap matters. As one contributor, Dr. Tung Nguyen of Texas A&M University, put it: "When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding. But HLE reminds us that intelligence isn't just about pattern recognition - it's about depth, context and specialized expertise."

Why MMLU Isn't Enough Anymore

Standardized tests like MMLU served the field well for years, but many frontier models now ace them. Once a benchmark saturates, it stops resolving meaningful differences between systems - and it encourages training toward the test.

HLE pushes beyond that ceiling. The construction method (exclude anything solvable) and the domain mix expose where LLMs still stumble: rare knowledge, fine-grained discrimination, and expert reasoning that isn't widely represented online.

What This Means for Researchers and R&D Teams

  • Build private, expert-curated challenge sets: Favor single, verifiable answers. Recruit domain experts to write and review items. Remove any question a model can already solve.
  • Stress real specialization: Include low-resource languages, microscopic morphology, symbol-heavy math, and niche subfields that resist web-scale memorization.
  • Control leakage: Keep items private, rotate subsets, and refresh the pool regularly. Document provenance and versioning.
  • Report methods clearly: Model versions, sampling, prompts, tool use (on/off), retries, and evaluation criteria; a minimal run-record sketch follows this list. Without this, scores are hard to compare or trust.
  • Test beyond pattern matching: Require multi-step reasoning, cross-source synthesis, and precise terminology - not just fluent text.
  • Use HLE-like gating: Before deployment, probe failure modes with hard, out-of-distribution tasks representative of your domain risk profile.
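
To make the reporting bullet concrete, here is a minimal sketch of the kind of run record a team might log for each evaluation. The EvalRun class and its field names are illustrative assumptions, not a published schema or HLE's format.

```python
# Illustrative per-run record covering the reporting items above.
# EvalRun and its field names are assumptions, not a published schema.
from dataclasses import dataclass, asdict, field
import json
import time

@dataclass
class EvalRun:
    model_id: str            # exact model/version string as the vendor names it
    benchmark_version: str   # which (private) item pool and rotation was used
    prompt_template: str     # the verbatim prompt wrapper applied to each item
    temperature: float       # sampling settings
    tool_use: bool           # tools/retrieval/planning on or off
    retries: int             # attempts allowed per item
    grading: str             # e.g. "exact-match" or "expert rubric v2"
    score: float             # aggregate accuracy on this run
    timestamp: float = field(default_factory=time.time)

run = EvalRun(model_id="example-model-2026-03", benchmark_version="pool-v7/subset-3",
              prompt_template="zero-shot, answer-only", temperature=0.0,
              tool_use=False, retries=1, grading="exact-match", score=0.42)
print(json.dumps(asdict(run), indent=2))  # archive alongside the raw transcripts
```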

What HLE Is (and Isn't) Saying

Even the strongest models missed badly when questions demanded narrow, expert knowledge. That doesn't mean the systems are useless; it means we were measuring them with rulers that were too short. As Nguyen noted, "Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do."

The project isn't a warning about AI replacing human expertise. It's a map of where AI is strong and where it struggles - so we can direct research, set policy, and design safe integrations with eyes open.

Limitations and Next Steps

  • Replicability vs. secrecy: Keeping items private protects integrity but complicates external validation. Expect more "dynamic" benchmarks with rolling item pools; a minimal rotation sketch follows this list.
  • Agent/tool effects: Scores may change when models use tools, retrieval, or planning frameworks. Report both raw and tool-augmented performance.
  • Training contamination risk: As HLE becomes known, the field will need governance to preserve its usefulness over time.
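
One simple way to act on the rolling-pool idea above: derive each release's subset deterministically from a version tag, so past releases stay reproducible while fresh items rotate in. The sketch below is an assumption about how a team might implement this, not HLE's actual mechanism.

```python
# Sketch: deterministic rotating subsets from a private item pool.
# Seeding by release tag makes every past release reproducible while
# fresh subsets rotate in over time. An assumption, not HLE's mechanism.
import hashlib
import random

def subset_for_release(item_ids: list[str], release_tag: str, k: int) -> list[str]:
    """Pick k items for a release; the same tag always yields the same subset."""
    seed = int.from_bytes(hashlib.sha256(release_tag.encode()).digest()[:8], "big")
    return sorted(random.Random(seed).sample(item_ids, k))

# Usage sketch: subset_for_release(all_item_ids, "2026-Q1", k=500)
```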

Bottom Line

HLE resets the bar by testing what matters to domain experts: depth, context, and precision. If your work depends on reliable AI, adopt similar evaluation patterns now - private expert sets, leakage control, and transparent methods. That's how we keep models honest and deployments safe.


