Quizzes aren't research: New benchmark shows LLMs fall short at discovery

High test scores, shaky lab chops: a new benchmark shows LLMs stumble on real discovery work. Treat them as fast assistants with tools and tight loops, not solo scientists.

Categorized in: AI News, Science and Research
Published on: Dec 27, 2025

High exam scores don't equal good research. A new multi-institution study puts that gap in plain view for large language models. The punchline: strong scores on GPQA and MMMU don't translate to actual discovery work.

AI as a research accelerator is still the goal. OpenAI even targets an autonomous research assistant by 2028. But on scenario-based tasks that mirror real lab and computational workflows, the latest models stall.

What changed: from quiz scores to research scenarios

Conventional science benchmarks mostly test isolated facts. Real research is messy: context, constraints, partial evidence, and iterative hypotheses. The new Scientific Discovery Evaluation (SDE) benchmark tests that reality.

SDE includes 1,125 questions across 43 research scenarios in biology, chemistry, materials science, and physics. Each scenario is drawn from an actual research project and peer-reviewed by domain experts. Evaluation happens at both the question and project level, with models working through the full discovery loop.

Numbers that matter

Models ace quizzes, then stumble in practice. GPT-5 hits 0.86 on GPQA-Diamond but only 0.60-0.75 on SDE depending on field. That gap maps to the difference between decontextualized questions and problem-driven research.

Performance swings by scenario are stark. GPT-5 scores 0.85 in retrosynthesis planning but just 0.23 in NMR-based structure elucidation. Weakest links break workflows, and SDE exposes where they are.

Scaling and reasoning hit limits

Reasoning helps, until it doesn't. DeepSeek-R1 (reasoning-tuned) outperforms DeepSeek-V3.1 in most scenarios. On Lipinski's rule of five, accuracy jumps from 0.65 to 1.00 with better reasoning.
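
For context, the Lipinski check itself is mechanical. Here's a minimal sketch of a rule-of-five screen using RDKit, with standard textbook thresholds; this is illustrative, not the benchmark's actual grading code:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_lipinski(smiles: str) -> bool:
    """Lipinski rule of five; the common relaxation allows one violation.
    Illustrative sketch only, not the SDE benchmark's evaluation code."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    violations = sum([
        Descriptors.MolWt(mol) > 500,       # molecular weight
        Descriptors.MolLogP(mol) > 5,       # lipophilicity
        Lipinski.NumHDonors(mol) > 5,       # H-bond donors
        Lipinski.NumHAcceptors(mol) > 10,   # H-bond acceptors
    ])
    return violations <= 1

print(passes_lipinski("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
```

The point of the result isn't that the rule is hard; it's that weaker reasoning chains misapply even mechanical checks like this one.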

But returns flatten fast. For GPT-5, pushing from "medium" to "high" test-time reasoning barely moves the needle. The jump from o3 to GPT-5 is small, with GPT-5 even underperforming o3 in eight scenarios.

Same wrong answers, across providers

Error profiles for GPT-5, Grok-4, DeepSeek-R1, and Claude-Sonnet-4.5 are highly correlated. In chemistry and physics, pairwise correlations exceed 0.8. The models often converge on the same wrong answers, especially on the hardest questions.

Ensembles don't save you if every model fails in the same way. On the SDE-hard subset (86 questions), standard models score under 0.12. Only GPT-5-pro reaches 0.224, solving nine questions none of the others could.
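
To see why voting can't help, you can measure how correlated two models' error profiles are from their per-question correctness vectors. A minimal sketch with synthetic data (not the paper's actual numbers), assuming correlation is computed as Pearson on binary correct/incorrect vectors, i.e. the phi coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Per-question correctness (True = right) for two models on the same
# questions. Both fail mostly on the same shared "hard" subset.
hard = rng.random(n) < 0.4
model_a = np.where(hard, rng.random(n) < 0.1, rng.random(n) < 0.9)
model_b = np.where(hard, rng.random(n) < 0.1, rng.random(n) < 0.9)

# Pearson correlation of the binary vectors (phi coefficient).
# Values near 0.8+ mean the models rarely disagree, so a majority
# vote adds almost no independent signal.
r = np.corrcoef(model_a.astype(float), model_b.astype(float))[0, 1]
print(f"error-profile correlation: {r:.2f}")
```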

Project-level testing: skills that actually matter

SDE also evaluates complete projects-protein design, gene editing, retrosynthesis, molecular optimization, symbolic regression, and more. No single model leads across all projects.

Average accuracy across the 43 scenarios: GPT-5 at 0.658, followed by Claude-Sonnet-4.5 and o3. Older models like GPT-4o and Claude-Sonnet-4 lag behind.

Crucially, question-level skill doesn't guarantee project success. Models can optimize transition metal complexes well despite poor scores on the related knowledge questions, because search and prioritization carry the task. Conversely, they fail retrosynthesis planning even with good quiz scores: the proposed routes simply don't work.

Practical takeaways for working scientists

  • Treat LLMs as fast assistants, not autonomous discoverers. Use them to enumerate hypotheses, draft experiments, and triage large search spaces.
  • Wrap models with tools: cheminformatics (RDKit), docking (AutoDock), DFT, sequence search (BLAST), structure predictors, and lab automation. Let the model propose; let the tools verify (see the sketch after this list).
  • Evaluate by scenario, not subject labels. Build task-specific checklists and unit tests (e.g., NMR constraints, synthesis feasibility, domain rules) to catch confident nonsense.
  • Use iterative loops. Short prompts, tight feedback, hard constraints, and frequent scoring beat long prompts and "think harder."
  • Don't expect majority voting to fix hard errors. If models share training biases, they'll fail in sync. Add external evidence and simulators instead.
  • Track cost vs. lift. Reasoning tokens help up to a point; after that, compute burns budget without better answers.
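
Here's what "propose, then verify" can look like in code. The `propose_candidates` function below is a hypothetical stand-in for whatever LLM client you use; the verification step is real RDKit:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def propose_candidates(prompt: str) -> list[str]:
    """Hypothetical placeholder for an LLM call returning candidate
    SMILES strings. Swap in your own client (API or local model)."""
    raise NotImplementedError

def verify(smiles: str) -> bool:
    """Tool-side gate: structure must parse and meet a simple property
    cutoff. Real pipelines add docking, feasibility checks, domain rules."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and Descriptors.MolWt(mol) < 500

def propose_verify_loop(prompt: str, rounds: int = 3) -> list[str]:
    accepted: list[str] = []
    for _ in range(rounds):
        for smi in propose_candidates(prompt):
            if verify(smi):
                accepted.append(smi)
        # Tight feedback: constrain the next round with what passed.
        prompt += f"\nAccepted so far: {accepted}. Propose different, valid candidates."
    return accepted
```

The design choice matters: the model never gets to declare success. Every candidate passes through a deterministic check before it counts, which is exactly the discipline the SDE results argue for.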

What model developers should change

  • Train for problem formulation and hypothesis generation, not just factual recall.
  • Diversify pretraining to reduce shared blind spots; correlated errors suggest similar data and objectives.
  • Integrate tool use during training, not as an afterthought. Reward chains that query tools, test assumptions, and revise plans.
  • Use reinforcement learning targeted at scientific reasoning, not just math and coding proxies.
  • Expand benchmarks beyond four domains (add geoscience, social science, engineering) and keep scenario granularity high.

Why this matters

If quiz knowledge equaled discovery, we'd already have autonomous labs solving open problems. We don't. What moves the needle is the ability to form good questions, test them fast, and pivot when evidence disagrees.

The good news: LLMs already help where search and triage dominate. Pair them with solid tools, strong constraints, and expert oversight, and they surface candidates you might never inspect manually.

Related context

OpenAI's recent FrontierScience benchmark points in the same direction: knowledge Q&A doesn't measure research skill. The community is finally testing what matters: full discovery workflows, failure modes, and where models actually save time.

Bottom line

LLMs aren't scientific "superintelligence," but they're useful operators inside well-guarded loops. Treat them like ambitious interns with instant recall and variable judgment. Give them structure, tools, and tests, and they'll pay off where breadth and iteration beat depth.

Further learning
If you're building AI-for-science workflows and want structured upskilling, explore role-based options here: Courses by Job.

